Running Active Learning Glide from the Command Line

Active Learning Glide (AL-Glide) combines traditional Glide SP docking with deep learning to screen ultra-large compound libraries efficiently. During screening, a DeepAutoQSAR ML model is trained iteratively on the Glide docking scores of ligand batches that a selection rule picks from the input library.

How to run AL-Glide from the command line

Run AL-Glide Command

Make sure $SCHRODINGER is set in your shell environment and points to the appropriate installation of the Schrödinger Core Suite.

$SCHRODINGER/run -FROM glide glide_active_learning.py [task] [args]
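
Before launching, it can help to confirm that the variable actually resolves to a usable installation. The following is a minimal pre-flight sketch in POSIX shell; the function name check_schrodinger is hypothetical and is not part of the Schrödinger distribution. It only checks that an executable run launcher exists, not that the installation is licensed or complete.

```shell
# Hypothetical helper: verify that SCHRODINGER is set and that the
# installation it points to contains an executable `run` launcher.
check_schrodinger() {
    if [ -z "${SCHRODINGER:-}" ]; then
        echo "SCHRODINGER is not set" >&2
        return 1
    fi
    if [ ! -x "$SCHRODINGER/run" ]; then
        echo "No executable 'run' found in $SCHRODINGER" >&2
        return 1
    fi
    return 0
}
```

A typical use would be to chain it in front of the real command, for example: check_schrodinger && "$SCHRODINGER/run" -FROM glide glide_active_learning.py screen -h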

 

Command Help

For a help message summarizing all available command options, see glide_active_learning.py Command Help, or enter the following command:

$SCHRODINGER/run -FROM glide glide_active_learning.py screen -h

 

Linux Host

To run the job, you must specify a properly configured Linux host with the -HOST option.

 

NOTE: Given the cost of a large-scale screen with Active Learning Glide, it is strongly recommended to run a pilot screen before the production run to confirm that everything works as expected. To learn more about Pilot Mode vs Glide, see the Introduction to Active Learning Glide.

AL-Glide Commands

AL-Glide has both required and optional arguments.

 

Required Commands

Arguments fall into two groups: Command Line Only (see note 1) and Command Line OR Input File (see note 2, preferred).

  • HOST: Properly configured host, which should be listed in your $SCHRODINGER/schrodinger.hosts file.

  • task (Command Line OR Input File): Takes one of the following keywords.

      • pilot: Scaled-down version of the full Active Learning Glide workflow with just a single iteration (docking, training, evaluation). Additionally, the pilot library (sampled from the screening database) is docked in full using Glide SP to gauge how well the ML models are learning the Glide scores. It is recommended to run a pilot job before running the full workflow.

      • evaluate: Evaluate a pretrained AL-Glide model against a library of ligands.

      • screen: Run a production screen using the full Active Learning Glide workflow.

 

Task Commands

  • GRID filename: Path to the Glide grid file to be used for the screen.

  • INFILE ligand_input_file: Path to a single ligand file in SMILES or SMILESCSV format. Multiple input files can be specified by supplying additional -infile arguments (e.g. -infile input1.smi -infile input2.smi).

  • JOBNAME: The human-readable name to be associated with this particular job record.

  • num_rescore_ligand: Number of top-scoring ligands (according to the ML model) that are re-docked with Glide after the ML model has been trained. Default: 1,000,000.

 

 

1. Command Line Only: These arguments must be specified on the command line and cannot be specified in an Active Learning Glide input file.

2. Command Line OR Input File: These arguments can be specified on the command line or in an optional input file (preferred). Specify them in one place or the other; the two cannot be mixed.

 

 

Optional Commands

Command Line Only (see note 1):

  • input_file: A text file containing parameters for an Active Learning Glide calculation. All Active Learning Glide parameters can be provided either on the command line or in an input file. An example input file is given under Specifying input options.

Command Line OR Input File (see note 2, preferred):

  • selection_rule: Algorithm for selecting each batch of ligands used to train the ML model. Each selected batch is distinct from all other batches. Accepted keywords:

      • diverse (default): A chemically diverse batch (by 2D fingerprints) is chosen from the top 10% scoring ligands according to the ML model.

      • most_uncertain: The ligands with the greatest uncertainty among the top 10% scoring ligands according to the ML model are chosen.

      • random: The batch is chosen by random selection among the top 10% scoring ligands according to the ML model.

      • distinct_scaffolds: Maximizes the number of distinct Bemis-Murcko (BM) scaffolds in the training set in all training rounds. This increases the diversity of the training sets and hence also the diversity of the top predictions from the active learning workflow.

  • max_glide_cpu: Maximum number of Glide docking jobs that can run concurrently. Useful for limiting license use. If -HOST:N is provided, the minimum of N and -max_glide_cpu is used to limit concurrent Glide jobs.

  • max_ml_eval_cpu: Maximum number of DeepAutoQSAR (ML) evaluation jobs that can run concurrently. Useful for limiting license use. If -HOST:N is provided, the minimum of N and -max_ml_eval_cpu is used to limit concurrent ML evaluation jobs.

  • num_iter: Number of active learning iterations. Default: 3.

  • block_size: Number of structures in each ML evaluation subjob, which in turn determines the total number of ML evaluation jobs.

  • train_size: Number of training ligands for each active learning round. A minimum of 15 is required. Default: 5,000. A larger train_size gives a higher recovery ratio of top ligands but requires more computational resources, particularly CPU RAM on the -train_host. In practice, a reasonable -train_size is between 10,000 and 100,000.

  • train_host: Host used for training the machine learning model. Default: the same as -HOST. A training host with a high-performance GPU considerably speeds up training.

  • num_train_core: Number of cores for ML training. Default: 1. It is recommended to keep the default value.

  • ligprep_args: By default, ligands are prepared on-the-fly with LigPrep prior to docking. This argument specifies which LigPrep options are applied. It should be a quoted string with all arguments separated by spaces. Default: "-pht 1.0 -epik -s16".

  • glide_subjob_size: Number of ligands in each Glide docking subjob. This is equivalent to the -NSTRUCTS argument in Glide. By default, Glide's smart distribution mechanism determines the docking subjob size automatically.

  • result_prefix: Prefix for the names of the output result files. Default: the jobname.

  • keep: Number of top-scoring ligands, according to the ML model, that are returned. Default: 10,000,000.

  • no_pose: Controls whether Maestro poses of the rescored ligands are returned. Default: the poses of rescored ligands are kept.

  • extra_docking_inputs: A Glide-like (sif format) input file that contains extra inputs for running the Glide docking jobs in Active Learning Glide. This is useful for specifying user-defined docking options such as Grid-based constraints.
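
The interaction between the processor count N given with the host and -max_glide_cpu, as described for max_glide_cpu above, amounts to taking a minimum. The sketch below only illustrates that arithmetic; the numbers are made up, and the same reasoning applies to -max_ml_eval_cpu.

```shell
# Effective number of concurrent Glide jobs = min(N, max_glide_cpu).
# Both values are illustrative assumptions, not recommendations.
host_n=200          # N, the processor count requested with the host
max_glide_cpu=50    # value passed to -max_glide_cpu

if [ "$host_n" -lt "$max_glide_cpu" ]; then
    effective=$host_n
else
    effective=$max_glide_cpu
fi
echo "Concurrent Glide jobs: $effective"   # prints "Concurrent Glide jobs: 50"
```

With these values the Glide license use is capped at 50 concurrent jobs even though 200 processors were requested.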

 


Preparing the input structures and receptor

A single AL-Glide calculation screens a ligand library by docking into a single receptor. The receptor file must be prepared prior to running an AL-Glide calculation.

  • The receptor should be prepared using the Protein Preparation Workflow prior to generating a Glide grid (see Protein Preparation Workflow Panel or Protein Preparation Command Help for more information). Grid-based constraints (positional, hydrogen-bond, metal, metal coordination) are not supported by AL-Glide.

  • Ligands are prepared on-the-fly using LigPrep (see the LigPrep User Manual for more information on the process). 3D preparation is performed only for ligands that are docked to train the underlying machine learning model and for top-scoring ligands that are rescored in the final step of the workflow. Supply the ligands in a 2D representation; both SMILES (.smi) and SMILESCSV (.csv) formats are supported. The ligand library can consist of one or more ligand files in these formats. To specify multiple ligand files, use the -infile option multiple times, once per file.
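
When a library is split across many files, the repeated -infile options can be assembled in a loop rather than typed by hand. This is a small sketch under the assumption that the file names contain no spaces; the file names themselves are hypothetical. It only builds and prints the argument string and does not launch anything.

```shell
# Build one "-infile <file>" pair per library file.
# File names are hypothetical and assumed to contain no whitespace.
infile_args=""
for f in library_part1.smi library_part2.smi library_part3.csv; do
    infile_args="$infile_args -infile $f"
done
infile_args=${infile_args# }   # drop the leading space

echo "$infile_args"
```

The resulting string can then be appended to the glide_active_learning.py command line in place of a single -infile option.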

For more detailed information, please refer to Preparing Inputs for Active Learning Glide and the Best Practices for Protein Preparation.

 

Specifying input options

The input to glide_active_learning.py can be specified with command options or with an input file, which is passed with -input_file. The command options are listed in glide_active_learning.py Command Help. The input file keywords are upper-case versions of the command options, without the leading dash. An example input file is given below.

TASK screen
GRID gridgen.zip
INFILE D4.csv
TRAIN_TIME 8
KEEP 100000
TRAIN_SIZE 10000
NUM_TRAIN_CORE 1
NUM_ITER 2
JOBNAME D4_screen
NUM_GLIDE_SUBJOBS 2
WRITE_POSE False
WITH_HEADER True

[EXTRA_DOCKING_INPUTS]
  LIG_VSCALE 0.8
  GLIDE_TORCONSFILE dock.torcons
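
For scripted runs, the input file can be written from the shell and given a quick sanity check before submission. The sketch below uses a subset of the keywords from the example input file in this section; the file name al_glide_screen.in is a hypothetical choice, and the awk check is only a rough guard against lines that are missing a value, not a validation of the keywords themselves.

```shell
# Write a minimal AL-Glide input file (keywords taken from the example
# input file in this section; the file name is hypothetical).
cat > al_glide_screen.in <<'EOF'
TASK screen
GRID gridgen.zip
INFILE D4.csv
TRAIN_SIZE 10000
NUM_ITER 2
KEEP 100000
JOBNAME D4_screen
EOF

# Rough check: flag any line that has a keyword but no value.
# Section headers such as [EXTRA_DOCKING_INPUTS] are skipped.
awk 'NF == 1 && $1 !~ /^\[/ { print "line " NR " has no value: " $0; bad = 1 }
     END { exit bad + 0 }' al_glide_screen.in \
    && echo "input file looks well-formed"
```

The file would then be passed with -input_file al_glide_screen.in, together with the -HOST option on the command line.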

 

Restarting a failed job

If the Active Learning Glide workflow fails for any reason, intermediate files are transferred to the launching directory as long as the driver node is alive. You can then restart the workflow with one of the following commands:

 

Command line options:

$SCHRODINGER/run -FROM glide glide_active_learning.py screen -restart_file {jobname}_restart.pkl -HOST cpu:X -DRIVERHOST driver

Input file options:

$SCHRODINGER/run -FROM glide glide_active_learning.py screen -input_file {input_file} -RESTART -HOST cpu:X -DRIVERHOST driver
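
Because the restart file is only copied back while the driver node is alive, it may be worth checking that it exists before resubmitting. The helper below is a hypothetical sketch (find_restart_file is not a Schrödinger command); it assumes the restart file follows the {jobname}_restart.pkl pattern shown above and sits in the current directory.

```shell
# Hypothetical helper: print the restart file name and succeed only if
# {jobname}_restart.pkl is present in the current directory.
find_restart_file() {
    restart_file="${1}_restart.pkl"
    if [ -f "$restart_file" ]; then
        printf '%s\n' "$restart_file"
    else
        echo "No $restart_file found in the current directory" >&2
        return 1
    fi
}
```

For a job named D4_screen, one possible use is: restart_file=$(find_restart_file D4_screen) && "$SCHRODINGER/run" -FROM glide glide_active_learning.py screen -restart_file "$restart_file" -HOST cpu:X -DRIVERHOST driver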