Running Active Learning Glide from the Command Line
Active Learning Glide (AL-Glide) combines traditional Glide SP docking with state-of-the-art deep learning to efficiently screen ultra-large compound libraries. In the screening process, a DeepAutoQSAR ML model is iteratively trained on docking scores for ligands that are sampled from the input library according to a selection rule.
How to run AL-Glide from the command line
Run AL-Glide Command
Make sure $SCHRODINGER is configured in your shell’s environment and pointing to the appropriate installation of the Schrödinger Core Suite.
$SCHRODINGER/run -FROM glide glide_active_learning.py [task] [args]
Command Help
For a help message summarizing all available command options, see glide_active_learning.py Command Help, or enter the following command:
$SCHRODINGER/run -FROM glide glide_active_learning.py screen -h
Linux Host
To run the job, you must specify a properly configured Linux host with the -HOST option.
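Before launching, it can help to confirm that the environment described above is in place. The following is a minimal sketch, not part of the Schrödinger tooling: the function name check_schrodinger_env is hypothetical, and the hosts file location is the $SCHRODINGER/schrodinger.hosts path named in this document (your site may keep it elsewhere).

```shell
# Hypothetical pre-flight check; none of this is provided by Schrödinger.
check_schrodinger_env() {
    # Fail early if the SCHRODINGER variable is unset or empty.
    if [ -z "${SCHRODINGER:-}" ]; then
        echo "SCHRODINGER is not set"
        return 1
    fi
    # The launcher used throughout this page is $SCHRODINGER/run.
    if [ ! -x "$SCHRODINGER/run" ]; then
        echo "no executable run script in $SCHRODINGER"
        return 2
    fi
    # Assumption: the hosts file lives directly under $SCHRODINGER.
    if [ ! -f "$SCHRODINGER/schrodinger.hosts" ]; then
        echo "no schrodinger.hosts in $SCHRODINGER"
        return 3
    fi
    echo "ok"
}

# Report the first problem found, without aborting the calling script.
check_schrodinger_env || echo "fix the environment before launching AL-Glide"
```

The check only inspects the file system. It does not verify that the host named with -HOST is actually reachable or correctly configured.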
AL-Glide Commands
AL-Glide has both required and optional arguments.
Required Commands
| Command Line Only 1 | Command Line OR Input File 2 (preferred) | Keyword | Description | |
| √ | HOST | Properly configured host that should be listed in your $SCHRODINGER/schrodinger.hosts file | ||
| √ | task | pilot | Scaled down version of the full Active Learning Glide workflow with just a single iteration (docking, training, evaluation). Additionally, the pilot library (sampled from the screening database) is docked in full using Glide SP to gauge how well the ML models are learning the Glide scores. It is recommended to run a pilot job before running the full workflow. | |
| √ | evaluate | Evaluate a pretrained AL-Glide model against a library of ligands. | ||
| √ | screen | Run a production screen using the full Active Learning Glide workflow. | ||
Task Commands
| Keyword | Description | pilot | evaluate | screen |
| GRID filename | Path to Glide gridfile to be used for screen | √ | √ | √ |
| INFILE ligand_input_file | Path to a single ligand file in SMILES or SMILESCSV format. Multiple input files can be specified by supplying additional -infile arguments (e.g. -infile input1.smi -infile input2.smi) | √ | √ | √ |
| JOBNAME | The human-readable name to be associated with this particular job record. | √ | √ | √ |
| num_rescore_ligand | After performing the training of the ML model, you can specify how many of the top scoring ligands (according to the ML model) should be re-docked with Glide. Default: 1,000,000 | | √ | √ |
1. Command-Line Only: These arguments must be specified on the command-line and cannot be specified in an Active Learning Glide input file.
2. These arguments can be specified on the command line or through an optional input file (preferred). Specify them either all in the input file or all on the command line; the two cannot be mixed.
Optional Commands
| Command Line Only 1 | Command Line OR Input File 2 (preferred) | Keyword | Description | pilot | evaluate | screen | |
| √ | √ | input_file | A text file containing parameters for an Active Learning Glide calculation. All Active Learning Glide parameters can be provided either on the command line or in an input file. Example (one keyword per line in the file): TASK screen; GRID gridgen.zip; INFILE D4.csv; TRAIN_TIME 8; KEEP 100000; TRAIN_SIZE 10000; NUM_TRAIN_CORE 1; NUM_ITER 2; JOBNAME D4_screen; NUM_GLIDE_SUBJOBS 2; [EXTRA_DOCKING_INPUTS]; LIG_VSCALE 0.8; GLIDE_TORCONSFILE dock.torcons | NA | NA | NA | |
| √ | selection_rule | Specify algorithm for selecting each batch of ligands for training the ML model. Note: Each batch of ligands selected will be unique to all other batches. | √ | ||||
| √ | (Default) diverse | A chemically diverse batch (by 2D fingerprints) is chosen from the top 10% scoring ligands according to the machine learning (ML) model. | √ | √ | |||
| √ | most_uncertain | The ligands with the greatest uncertainty among the top 10% scoring ligands according to ML are chosen. | √ | ||||
| √ | random | The batch is chosen by random selection among the top 10% scoring ligands according to ML. | √ | √ | |||
| √ | distinct_scaffolds | Maximizes the number of distinct Bemis-Murcko (BM) scaffolds in the training set in all training rounds. This increases the diversity of training sets and hence also the diversity of the top predictions coming out of the active learning workflow. | √ | ||||
| √ | max_glide_cpu | Maximum number of Glide docking jobs that can be running concurrently. Useful for limiting license use. If -HOST:N is provided, the minimum of N and -max_glide_cpu will be used to limit concurrent Glide jobs. | √ | √ | √ | ||
| √ | max_ml_eval_cpu | Maximum number of DeepAutoQSAR (ML) evaluation jobs that can be running concurrently. Useful for limiting license use. If -HOST:N is provided, the minimum of N and -max_ml_eval_cpu will be used to limit concurrent ML evaluation jobs. | √ | √ | √ | ||
| √ | num_iter | Number of active learning iterations. Default is 3. | √ | ||||
| √ | block_size | This determines the number of structures in each ML evaluation subjob and total number of ML evaluation jobs. | √ | √ | √ | ||
| √ | train_size | Number of training ligands for each active learning round. A minimum of 15 is required. Default is 5,000. A larger train_size gives a higher recovery ratio of top ligands but requires more computational resources, especially CPU RAM on the -train_host. In practice, a reasonable -train_size is between 10,000 and 100,000. | √ | √ | |||
| √ | train_host | Machine learning model training host. Default is the same as -HOST. Using a training host with a high-performance GPU can substantially speed up the training process. | √ | √ | |||
| √ | num_train_core | Number of core(s) for ML training. Default is 1. It is recommended to keep the default value. | √ | √ | |||
| √ | ligprep_args | By default, ligands are prepared prior to docking with LigPrep on-the-fly. This argument allows you to specify what LigPrep options are applied. Default: (“-pht 1.0 -epik -s16”). This should be a quoted string with all arguments separated by spaces. | √ | √ | √ | ||
| √ | glide_subjob_size | Number of ligands in each Glide docking subjob. This is equivalent to the -NSTRUCTS argument in Glide. By default, Glide’s smart distribution mechanism will determine the docking subjob size automatically. | √ | √ | √ | ||
| √ | result_prefix | Defines the name of the output result files. By default the prefix is set to the jobname. | √ | √ | √ | ||
| √ | keep | The number of top-scoring ligands according to the ML model that are returned. Default: 10,000,000 | √ | √ | √ | ||
| √ | no_pose | Whether Maestro poses of the rescored ligands are returned. Default: Keep the pose of rescored ligands. | √ | √ | |||
| √ | extra_docking_inputs | A Glide-like (sif format) input file that contains extra inputs for running the Glide dock jobs in Active Learning Glide. This is useful for specifying user defined docking options such as Grid-based constraints. | √ | √ | √ | ||
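The concurrency rule stated for max_glide_cpu (and likewise max_ml_eval_cpu) amounts to taking the smaller of two numbers. The sketch below only illustrates that arithmetic; the function name effective_glide_jobs is hypothetical and is not part of glide_active_learning.py.

```shell
# Illustration of the stated rule: concurrent Glide jobs are limited to
# min(N, max_glide_cpu), where N comes from "-HOST name:N".
effective_glide_jobs() {
    host_n=$1    # N from -HOST name:N
    max_cpu=$2   # value given to -max_glide_cpu
    if [ "$host_n" -lt "$max_cpu" ]; then
        echo "$host_n"
    else
        echo "$max_cpu"
    fi
}

effective_glide_jobs 200 50   # prints 50: the -max_glide_cpu cap applies
effective_glide_jobs 8 50     # prints 8: the host's N is the tighter limit
```

So a low -max_glide_cpu is what limits license use on a large host, while a small N on the host makes the -max_glide_cpu setting irrelevant.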
Preparing the input structures and receptor
A single AL-Glide calculation screens a ligand library by docking into a single receptor. The receptor file must be prepared prior to running an AL-Glide calculation.
- The receptor should be prepared using the Protein Preparation Workflow prior to generating a Glide grid (see Protein Preparation Workflow Panel or Protein Preparation Command Help for more information). Grid-based constraints (positional, hydrogen-bond, metal, metal coordination) are not supported by AL-Glide.
- Ligands are prepared on-the-fly using LigPrep (see the LigPrep User Manual for more information on the process). 3D preparation is performed only for those ligands that are docked for the purpose of training the underlying machine learning model or for top scoring ligands that are rescored in the final step of the workflow. The ligands should be supplied using a 2D representation of the ligand. Both SMILES (.smi) and SMILESCSV (.csv) formats are supported. The ligand library can consist of one or more ligand files in these formats. To specify multiple ligand files, use the -infile option multiple times, with one file specified for each instance.
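When a library is split across several files, the repeated -infile options can be assembled in a loop rather than typed by hand. This is a small sketch using the placeholder file names input1.smi and input2.smi; it only builds and prints the option string and does not launch a job.

```shell
# Build one "-infile <file>" pair per ligand file.
# Note: this simple string approach assumes file names without spaces.
infile_args=""
for f in input1.smi input2.smi; do
    infile_args="$infile_args -infile $f"
done
infile_args=${infile_args# }   # drop the leading space

echo "$infile_args"   # prints: -infile input1.smi -infile input2.smi
```

The resulting string can then be appended to the rest of the glide_active_learning.py command line.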
For more detailed information, please refer to Preparing Inputs for Active Learning Glide and the Best Practices for Protein Preparation.
Specifying input options
The input to glide_active_learning.py can be specified by using the command options, or with an input file, which is specified by -input_file. The command options are listed in glide_active_learning.py Command Help. The input file keywords are upper case versions of the command options with no initial dash. An example input file is given below.
TASK screen
GRID gridgen.zip
INFILE D4.csv
TRAIN_TIME 8
KEEP 100000
TRAIN_SIZE 10000
NUM_TRAIN_CORE 1
NUM_ITER 2
JOBNAME D4_screen
NUM_GLIDE_SUBJOBS 2
WRITE_POSE False
WITH_HEADER True
[EXTRA_DOCKING_INPUTS]
LIG_VSCALE 0.8
GLIDE_TORCONSFILE dock.torcons
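One way to keep a screen reproducible is to generate the input file from a script and print the launch command for review. The sketch below reuses a subset of the keywords and values from the example above; the file name d4_screen.in and the host name cpu are hypothetical, and the command is printed rather than run.

```shell
input_file="d4_screen.in"   # hypothetical file name

# Write one keyword per line, with values taken from the example input file.
cat > "$input_file" <<'EOF'
TASK screen
GRID gridgen.zip
INFILE D4.csv
JOBNAME D4_screen
TRAIN_SIZE 10000
NUM_ITER 2
KEEP 100000
EOF

# Print the launch command instead of executing it ("cpu" is a placeholder host).
launch_cmd="\$SCHRODINGER/run -FROM glide glide_active_learning.py screen -input_file $input_file -HOST cpu"
echo "$launch_cmd"
```

Because arguments cannot be split between the input file and the command line, only the task, -input_file, and the job-control option -HOST appear on the printed command.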
Restarting a failed job
If the Active Learning Glide workflow fails for any reason, intermediate files will be transferred to the launching directory as long as the driver node is alive. The user can restart the workflow with the following commands:
Command line options:
$SCHRODINGER/run -FROM glide glide_active_learning.py screen -restart_file {jobname}_restart.pkl -HOST cpu:X -DRIVERHOST driver
Input file options:
$SCHRODINGER/run -FROM glide glide_active_learning.py screen -input_file {input_file} -RESTART -HOST cpu:X -DRIVERHOST driver
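Since a restart depends on the intermediate files having reached the launching directory, a wrapper can check for the restart file first. This is a sketch under that assumption: the function name restart_cmd is hypothetical, cpu:X and driver are the placeholders used in the commands above and must be replaced with real values, and the command is printed rather than executed.

```shell
# Print the command-line restart command if {jobname}_restart.pkl exists
# in the current (launching) directory; otherwise report the problem.
restart_cmd() {
    jobname=$1
    restart_file="${jobname}_restart.pkl"
    if [ -f "$restart_file" ]; then
        echo "\$SCHRODINGER/run -FROM glide glide_active_learning.py screen -restart_file $restart_file -HOST cpu:X -DRIVERHOST driver"
    else
        echo "restart file $restart_file not found" >&2
        return 1
    fi
}
```

For example, `restart_cmd D4_screen` prints the restart command only when D4_screen_restart.pkl is present, and returns a non-zero status otherwise.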