Evaluating Large Ligand Libraries with Active Learning Glide

Tutorial Created with Software Release: 2024-1

Topics: Hit Discovery, Machine Learning, Small Molecule Drug Discovery, Virtual Screening

Products Used: AL-Glide

Tutorial files

0.6 GB

This tutorial is written for use with a 3-button mouse with a scroll wheel.

Words found in the Glossary of Terms are shown like this: Workspace

Abstract:

In this tutorial, you will learn how to screen large ligand libraries efficiently using Active Learning Glide. A recent publication (Ultra-large library docking for discovering new chemotypes. Nature. 2019 Feb;566(7743):224-229. doi: 10.1038/s41586-019-0917-9) indicated that screening larger ligand libraries leads to finding more binders and better binders for protein targets, which has increased interest in screening libraries of 1 million to more than 1 billion ligands. Several commercially available ligand libraries now contain hundreds of millions to more than a billion compounds, and virtual enumeration tools make libraries of that size, and larger, easy for researchers to access. With structure-based screening techniques such as docking, evaluating libraries of this size can be untenable. With Active Learning Glide, large ligand libraries can be evaluated much more rapidly than with traditional Glide docking. An active learning approach iteratively trains a machine learning model on Glide docking results; the model then screens ligands in ~20 ms per ligand, and a smaller subset of the top-scoring ligands is rescreened using Glide SP.

Tutorial Content

Creating Projects and Importing Structures

Setting up Active Learning Glide Docking

Analyzing Active Learning Glide Results

Conclusion and References

Glossary of Terms

1. Creating Projects and Importing Structures

At the start of the session, change the file path to your chosen Working Directory in Maestro to make file navigation easier. Each session in Maestro begins with a default Scratch Project, which is not saved. A Maestro project stores all your data and has a .prj extension. A project may contain numerous entries corresponding to imported structures, as well as the output of modeling-related tasks. Once a project is created, the project is automatically saved each time a change is made.

Structures can be imported from the PDB directly, or from your Working Directory using File > Import Structures, and are added to the Entry List and Project Table. The Entry List is located to the left of the Workspace. The Project Table can be accessed by Ctrl+T (Cmd+T) or Window > Project Table if you would like to see an expanded view of your project data.

Double-click the Maestro icon
- (No icon? See Starting Maestro)

Figure 1-1. Change Working Directory option.

Go to File > Change Working Directory
Find your directory, and click Choose
Pre-generated input and results files are included for running jobs or examining output. Download the zip file here: https://www.schrodinger.com/sites/default/files/s3/release/current/Tutorials/zip/al-glide_d4-enamine.zip
After downloading the zip file, unzip the contents in your Working Directory for ease of access throughout the tutorial

Figure 1-2. Save Project panel.

Go to File > Open Project > AL-Glide.prjzip
In Save scratch project, click OK
Go to File > Save Project As
Change the File name to AL-Glide_D4 and click Save
- The project is now named AL-Glide_D4.prj

Note: Please refer to the Glossary of Terms for the difference between included and selected.

2. Setting up Active Learning Glide Docking

Structure files obtained from the PDB, vendors, and other sources often lack the necessary information for performing modeling-related tasks. Typically, these files are missing hydrogens, partial charges, side chains, and/or whole loop regions. In order to make these structures suitable for modeling tasks, we use the Protein Preparation Workflow to resolve issues. Similarly, ligand files can be sourced from numerous places, such as vendors or databases, often in the form of 1D or 2D structures with unstandardized chemistry. Active Learning Glide requires ligand input to be in a comma-separated or SMILES format. Ligands are converted to 3D structures, with the chemistry properly standardized and extrapolated, using LigPrep as part of the Active Learning Glide workflow. Filtering of large ligand libraries to remove ligands with unfavorable properties before screening is encouraged.
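Before submitting a large SMILES file, it can help to sanity-check its layout. The sketch below is only an illustration: it assumes one ligand per line with the SMILES string in the first whitespace-separated column and a title in the second (the layout described for .smi files in Section 3.2). The function name `read_smi_lines`, the fallback title scheme, and the sample records are hypothetical.

```python
def read_smi_lines(lines):
    """Parse .smi-style lines into (smiles, title) pairs, skipping blank lines."""
    records = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        smiles = parts[0]
        # Assumption: give untitled ligands a placeholder title based on line number
        title = parts[1].strip() if len(parts) > 1 else f"ligand_{line_number}"
        records.append((smiles, title))
    return records


# Hypothetical sample; a real run would iterate over an open .smi file instead
sample = ["CCO ethanol", "", "c1ccccc1 benzene", "CC(=O)O"]
records = read_smi_lines(sample)
titles = [title for _, title in records]
duplicate_titles = len(titles) - len(set(titles))
print(len(records), "ligands,", duplicate_titles, "duplicate titles")
```

Counting duplicate titles up front is useful because the analysis in Section 3 identifies ligands by Title.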

In this tutorial, the protein has already been prepared in order to save time. However, these preparation steps are a necessary part of a virtual screen and must be done before docking. Please see the Introduction to Structure Preparation and Visualization tutorial for instructions on using the Protein Preparation Workflow. For the 5WIU structure, the binding site waters were retained as they were shown to be important for reproducing the known binding pose.

Active Learning Glide will generate a receptor grid from a prepared protein, prepare the ligands, and dock a subset of these ligands using Glide SP. From here, a machine learning model will be developed and used to screen the rest of the ligands. The number of iterative training rounds can be set within the panel, with a recommended default setting of three rounds. After all the ligands have been screened using the last model, a selection of the top ligands will then be docked using Glide SP.
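The workflow described above can be pictured with a toy active learning loop. This is a conceptual sketch only, not the Active Learning Glide implementation: the "docking" function is a made-up score on numbers, the surrogate is a 1-nearest-neighbour lookup, and the choice to dock the model's best-predicted ligands in each later round is an assumption made for illustration. All names are hypothetical.

```python
import random


def expensive_dock(x):
    """Stand-in for a docking calculation (toy score; lower is better)."""
    return (x - 0.7) ** 2


def fit_surrogate(labeled):
    """Toy model: predict the score of the closest already-docked ligand."""
    points = list(labeled.items())

    def predict(x):
        return min(points, key=lambda p: abs(p[0] - x))[1]

    return predict


def active_learning_screen(library, sample_size, rounds, top_n, seed=0):
    rng = random.Random(seed)
    labeled = {}
    # Round 1 docks a random sample; later rounds dock the model's top picks
    batch = rng.sample(library, sample_size)
    for _ in range(rounds):
        for x in batch:
            labeled[x] = expensive_dock(x)
        model = fit_surrogate(labeled)
        unlabeled = [x for x in library if x not in labeled]
        batch = sorted(unlabeled, key=model)[:sample_size]
    # Screen the whole library with the last model, then dock the top_n for real
    ranked = sorted(library, key=model)[:top_n]
    return sorted(ranked, key=expensive_dock), len(labeled)


library = [i / 999 for i in range(1000)]
hits, n_docked_for_training = active_learning_screen(library, 50, 3, 10)
print(hits[0], n_docked_for_training)
```

The point of the sketch is the cost structure: only a fraction of the library is ever scored with the expensive function, while the cheap model ranks everything else.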

Figure 2-1. Open Active Learning Docking.

Go to Tasks > Browse > Receptor-Based Virtual Screening > Active Learning Docking
- The Batch Glide Screening with Active Learning panel opens

Figure 2-2. Load in the receptor grid.

Next to Receptor grid, click Load File
- The Select a Grid File panel opens
Choose glide-grid_5WIU.zip

Figure 2-3. Add the ligand file.

Next to Ligands, click Add Files

Note: Click Preparation Options to adjust the Ligand Preparation settings. We recommend matching the target pH of the ligand preparation to the pH value used during the receptor structure preparation.

Figure 2-4. Choose the SMILES file.

In the bottom of the panel, next to Files of type, choose SMI (*.smi)
- The ligand SMILES file can now be chosen in the panel
Choose rand_1M_enamine_REAL.smi and click Open
- The ligand file is loaded into the panel
- Text in the bottom right corner of the panel shows the number of ligands in the file

Figure 2-5. Change the Outputs, Job name, and Run Settings.

Under Outputs, change the value of Dock and import best ligands to 1%
- This will rescore the top 10,000 compounds from the Active Learning Docking model
- The number of ligands rescored will greatly impact job time
Next to Job name, type D4_5WIU_1M_Enamine
Click Run Settings (cog)

Note: We will stay with the default Training options of using 50,000 ligands per iteration and 3 iterations of training. Depending on the size of your ligand library of interest and compute resources, you may want to alter the Sample size per round. As we will see later, we recommend using 3 iterations of training for optimal results.

Note: The recommended maximum number of samples per iteration is 100,000. Benchmarks show that the enrichment score and the recovery rate are effectively the same for larger sample sizes. While it is common to assume that training sets should grow with library size, we advise against scaling training sets beyond 100,000.
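To see how these settings translate into docking effort, the arithmetic can be sketched as follows. The calculation counts input ligands rather than LigPrep-expanded structures, and it assumes each training iteration docks a fresh sample and that rescored ligands are docked again; both are simplifying assumptions, so treat the result as a rough upper bound. The function name is hypothetical.

```python
def docking_budget(library_size, sample_per_round, rounds, rescore_fraction):
    """Return (ligands docked for training, ligands rescored, docked fraction)."""
    training_docked = sample_per_round * rounds
    rescored = round(library_size * rescore_fraction)
    docked_fraction = (training_docked + rescored) / library_size
    return training_docked, rescored, docked_fraction


# Settings used in this tutorial: 1M ligands, 50,000 per round, 3 rounds, 1% rescored
training, rescored, fraction = docking_budget(1_000_000, 50_000, 3, 0.01)
print(training, rescored, f"{fraction:.0%}")
```

Under these assumptions, at most about one sixth of the 1M library is docked with Glide SP, and the remainder is scored only by the machine learning model.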

Note: The settings used to estimate the time to completion can be adjusted to reflect different amounts of licenses and ligands.

Figure 2-6. Set the hosts and run the job.

Choose your Driver host, GPU subhost, and CPU subhost
Adjust the total number of processors

Note: Choose your total number of processors based on your licenses and computational resources.

Click Run
- This job requires significant CPU and GPU resources to run, so we will look at pre-generated results

Figure 2-7. Write out the input files and launch script.

Note: For a job of this length, we recommend writing out the input files and launching the job from the command line rather than from the GUI.

Figure 2-8. Run a pilot screen.

Optional: Active Learning Glide can also be run in “pilot mode.” This is a way to test the effectiveness of Active Learning Glide on a pilot library (default size is 50K) instead of the whole library. One iteration of training on a 5K subset of ligands is performed and the whole pilot library is docked in full. A performance report is generated to aid analysis. The docking results from the pilot library are reused in the full Active Learning Glide run, so calculations are not duplicated. Hover over the option in the panel to learn more.

3. Analyzing Active Learning Glide Results

As it can take several minutes to load very large ligand files into the Entry List, the results of the Active Learning Glide job set up in the previous section have already been added as part of the project file. For reference, here are descriptions of the groups in the project file Entry List:

Pre-generated files:

glide-dock-SP_5WIU_cognate_pv: Glide docking of the cognate ligand of 5WIU into the prepared 5WIU structure. This was used to validate keeping the crystal structure waters.

D4_5WIU_1M_Enamine_iter_3_rescore: The output of rescoring the top 10K ligands with Glide using three iterations of the Active Learning Glide model. In the Active Learning Glide output, this corresponds to the <jobname>_RescoreNode_iter_3_rescore_lib.maegz file. As LigPrep typically expands the number of ligands in a library, there are more than 40K structures that correspond to ~10K ligands.

D4_5WIU_1M_Enamine_iter_1_rescore: The output of rescoring the top 10K ligands with Glide using one iteration of the Active Learning Glide model. In the Active Learning Glide output, this corresponds to the <jobname>_RescoreNode_iter_1_rescore_lib.maegz file. As LigPrep typically expands the number of ligands in a library, there are more than 35K structures that correspond to ~10K ligands.

D4_5WIU_1M_Enamine_iter_3_rescore-unique: The unique ligands by Title from the 10K Glide rescore of the output from the Active Learning Glide results that trained using 3 iterations.

D4_5WIU_1M_Enamine_iter_1_rescore-unique: The unique ligands by Title from the 10K Glide rescore of the output from the Active Learning Glide results that trained using 1 iteration.

5WIU_1M_lib_top_10K: The top 10K unique ligands from the 1M Enamine library screened using Glide SP.

New files created in this analysis:

3-iter_clusters_rescore_unique: Clusters based on volume overlap of the unique rescored ligands from the 3-iteration Active Learning Glide calculation.

1-iter_clusters_rescore_unique: Clusters based on volume overlap of the unique rescored ligands from the 1-iteration Active Learning Glide calculation.

3-iter_Glide_comparison: The number of ligands found in the top 10K unique 3-iteration Active Learning Glide calculation rescored results that are also in the top 10K unique Glide SP results.

1-iter_Glide_comparison: The number of ligands found in the top 10K unique 1-iteration Active Learning Glide calculation rescored results that are also in the top 10K unique Glide SP results.

Please note that as copying and analyzing ligand files of this size can take time, many of the steps outlined below have been done for you. If time is short, you can skip to section 3.2.

3.1 Create files of unique ligands and cluster ligands by volume

Figure 3-1. Open the Project Table (top) and select the group (bottom).

In Maestro, open the Project Table
Select the group D4_5WIU_1M_Enamine_iter_3_rescore

Figure 3-2. Choose the Deselect Duplicate Titles Project Table Operation.

Go to Tasks > Browse > Project Table and Project Operations > Deselect Duplicate Titles
- In the Project Table, duplicate titles are deselected
- The first instance of a ligand title is selected, corresponding to the lowest docking score for the ligand
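The effect of Deselect Duplicate Titles can be mimicked outside Maestro with a simple "keep the first occurrence" pass. This sketch assumes the entries are already ordered by docking score (lowest first), as the step above implies; the function name, titles, and scores are hypothetical.

```python
def first_per_title(entries):
    """Keep only the first (title, score) pair seen for each title."""
    seen = set()
    unique = []
    for title, score in entries:
        if title not in seen:
            seen.add(title)
            unique.append((title, score))
    return unique


# Hypothetical docked structures, sorted by docking score (lowest first)
entries = [
    ("Z100", -9.1),
    ("Z200", -8.9),
    ("Z100", -8.4),
    ("Z300", -8.2),
    ("Z100", -7.0),
]
unique = first_per_title(entries)
print(unique)
```

Because the input is sorted, the retained structure for each title is also its best-scoring one, which is why LigPrep-expanded structures collapse to one representative per ligand.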

Figure 3-3. Export the unique structures.

Expand the D4_5WIU_1M_Enamine_iter_3_rescore group and right-click on a selected ligand
Choose Export > Structures
- The Export panel opens
Next to File name, type D4_5WIU_1M_Enamine_iter_3_rescore-unique
Click Save

Note: Since it can take several minutes to export this many structures to file and then import them back into Maestro, this step has been done for you.

Repeat steps 2 - 7 for the D4_5WIU_1M_Enamine_iter_1_rescore

Figure 3-4. Open the Clustering of Ligands panel.

In the Entry List, select the group D4_5WIU_1M_Enamine_iter_3_rescore-unique
Go to Tasks > Browse > Discovery Informatics and QSAR > Clustering of Ligands
- The Clustering Based on Volume Overlap panel opens

Figure 3-5. Cluster the unique ligands.

Next to Use structures from, choose Project Table (selected entries)

Note: As these calculations take a few hours for this number of ligands, we will use pre-generated results. We have listed Steps 12 - 20 as reference only.

Click Calculate Volume Overlap Matrix
- A .csv file of volume overlaps is generated
- You can read more about how volume overlaps are calculated and options for clustering here
After the volume overlap calculation is complete, next to Linkage method, choose Average
Click Calculate Clustering
- Clusters will be determined based on volume overlap
- The panel will update when the calculation is complete to highlight the Results and Apply tabs
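For intuition about the Average linkage option, here is a minimal agglomerative clustering sketch on a small, made-up overlap matrix. It assumes that distance is taken as 1 minus the volume overlap, which is an illustrative choice and not necessarily what the panel uses; how the panel picks the best number of clusters is also not reproduced here. All names and values are hypothetical.

```python
def average_linkage(distance, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[i] for i in range(len(distance))]

    def cluster_distance(a, b):
        # Average of all pairwise distances between members of the two clusters
        return sum(distance[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = cluster_distance(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return [sorted(c) for c in clusters]


# Made-up pairwise volume overlaps for five ligands (1.0 = identical volume)
overlap = [
    [1.0, 0.9, 0.8, 0.1, 0.2],
    [0.9, 1.0, 0.85, 0.15, 0.1],
    [0.8, 0.85, 1.0, 0.2, 0.1],
    [0.1, 0.15, 0.2, 1.0, 0.95],
    [0.2, 0.1, 0.1, 0.95, 1.0],
]
distance = [[1.0 - value for value in row] for row in overlap]
clusters = average_linkage(distance, 2)
print(clusters)
```

This brute-force version is fine for a handful of ligands; for roughly 10K ligands the pairwise matrix alone is large, which is consistent with the multi-hour run time noted above.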

Figure 3-6. View the clustering results.

Click Results
- The best number of clusters is shown
- For the 3-iteration Active Learning Glide results, 52 clusters of ligands have been found

Note: If you did perform this calculation, you could view the clustering statistics, a dendrogram, and a distance matrix of the ligands.

Figure 3-7. Apply the best number of clusters to the group of ligands.

Click Apply
Next to Number of clusters, type 52, the best number of clusters from the Results section
Choose to create Files corresponding to each cluster
Click Apply Clustering
- New entries for each group of clusters will be added to the Entry List
- Due to the size of this file, if you performed these steps you would get a Warning alerting that it may take several minutes to copy all the ligands
- In this tutorial, this has already been done for you and the results are in the group 3-iter_clusters_rescore_unique

Figure 3-8. Repeat the clustering process for the other set of ligands.

Repeat steps 9 - 19 for the D4_5WIU_1M_Enamine_iter_1_rescore-unique
- For the 1-iteration Active Learning Glide results, 44 clusters of ligands have been found
- In this tutorial, this has already been done for you and the results are in the group 1-iter_clusters_rescore_unique

Note: The 3-iteration results of Active Learning Glide returned 52 clusters and the 1-iteration results returned only 44 clusters. This indicates that the 3-iteration results contain a more diverse set of ligands.

Figure 3-9. Include the ligand clusters in the Workspace.

In the Entry List, expand the group 3-iter_clusters_rescore_unique
- Each cluster is its own group
Expand and select Cluster 1
Right-click and choose Include
- The ligands in this cluster are visualized in the Workspace

Figure 3-10. Visualize the ligand clusters in the binding site.

Optional: In Quick Select, click L to select the Workspace ligands. Open the Style Toolbox and change the ligand rendering to ball-and-stick with green carbon color. In the Entry List, include 5WIU_prepared. Now you can visualize the Cluster 1 ligands in the binding pocket with the 5WIU cognate ligand in white.

3.2. Compare Active Learning Glide results with Glide SP results

Figure 3-11. Select the unique ligands by title.

In the Project Table, shift-click to select the 5WIU_1M_lib_top_10K and D4_5WIU_1M_Enamine_iter_3_rescore-unique groups
Go to Tasks > Browse > Project Table and Project Operations > Deselect Duplicate Titles
- In the Project Table, duplicate titles are deselected
- The first instance of a ligand title in both groups is selected
- This corresponds to deselecting all the ligands in the D4_5WIU_1M_Enamine_iter_3_rescore-unique group that are found in the 5WIU_1M_lib_top_10K group

Figure 3-12. Invert the selection within the group.

Ctrl-click (Cmd-click) to deselect the 5WIU_1M_lib_top_10K group
Go to Select > Invert Within Groups
- The selection within the D4_5WIU_1M_Enamine_iter_3_rescore-unique group is inverted
- Now, all ligands that were found in the 5WIU_1M_lib_top_10K group are selected
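The reason this deselect-then-invert sequence yields the ligands common to both groups can be checked with a small sketch. It assumes the 5WIU_1M_lib_top_10K group comes first in the Project Table, so its entries count as the first instances, and that titles are unique within each group; the function name and titles are hypothetical.

```python
def common_via_deselect_and_invert(reference_titles, candidate_titles):
    """Mimic Deselect Duplicate Titles followed by Invert Within Groups."""
    seen = set(reference_titles)  # first instances live in the reference group
    still_selected = set()
    for title in candidate_titles:
        if title not in seen:
            # Not a duplicate of anything earlier, so it stays selected
            seen.add(title)
            still_selected.add(title)
    # Inverting within the candidate group selects exactly the deselected entries
    return [t for t in candidate_titles if t not in still_selected]


glide_sp_top = ["A", "B", "C"]   # hypothetical Glide SP top ligands
al_glide_top = ["B", "D", "C"]   # hypothetical Active Learning Glide top ligands
common = common_via_deselect_and_invert(glide_sp_top, al_glide_top)
print(common)
```

The result is the set intersection of the two title lists, kept in the order of the Active Learning Glide group.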

Note: Since it can take several minutes to export this many structures to file and then import them back into Maestro, the following steps have been done for you.

Figure 3-13. Export the common structures.

Expand the D4_5WIU_1M_Enamine_iter_3_rescore-unique group and right-click on a selected ligand
Choose Export > Structures
- The Export panel opens
Next to File name, type 3-iter_Glide_comparison
Click Save
- 5636 structures are saved to the 3-iter_Glide_comparison group

Figure 3-14. Repeat the selection process for the other set of ligands.

Repeat steps 1 - 6 for the D4_5WIU_1M_Enamine_iter_1_rescore
Next to File name, type 1-iter_Glide_comparison
Click Save
- 2866 structures are saved to the 1-iter_Glide_comparison group

Note: The 3-iter_Glide_comparison group contains far more structures found in the 5WIU_1M_lib_top_10K group than the 1-iter_Glide_comparison group. This indicates that the 3-iteration Active Learning Glide results more strongly overlap with the Glide SP results of the same 1M ligand file.
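Expressed as fractions, the counts above can be compared directly. This sketch assumes each comparison is made against a reference set of exactly 10,000 unique Glide SP ligands, which is an approximation of the "top 10K" groups; the function name is hypothetical.

```python
def overlap_fraction(shared, reference_size):
    """Fraction of the reference set that also appears in the compared set."""
    return shared / reference_size


three_iter = overlap_fraction(5636, 10_000)
one_iter = overlap_fraction(2866, 10_000)
print(f"3 iterations: {three_iter:.1%}, 1 iteration: {one_iter:.1%}")
print(f"ratio: {three_iter / one_iter:.2f}")
```

Under that assumption, three iterations of training recover roughly twice as many of the Glide SP top ligands as a single iteration.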

Optional: These steps can also be done from the command line. Use the structconvert utility to convert the _lib.maegz file to a .smi. The SMILES will be in the first column and the title in the second. Then use the following command to compare the title column of the first file to the title column of the second file. If it matches, it will print the line which is then written to the output file.

awk -F" " 'NR==FNR{c[$2]++;next};c[$2] > 0' D4_3-iter_rescore_unique.smi 5WIU_1M_lib_top_10K.smi  > 3-iter_Glide_comparison.smi

This will result in the same output as generated via the Maestro interface in steps 1 - 11. To compare the SMILES strings of each file, use the following command:

awk -F" " 'NR==FNR{c[$1]++;next};c[$1] > 0' D4_3-iter_rescore_unique.smi 5WIU_1M_lib_top_10K.smi  > 3-iter_Glide_comparison.smi

This will result in the 3-iter_Glide_comparison group containing 5472 structures and the 1-iter_Glide_comparison group containing 2780. The results differ between comparing titles and comparing SMILES strings because the same ligand can have two different titles, depending on the ligand library (or libraries) it originally came from. Alternatively, a ligand may have the same title but different SMILES strings depending on LigPrep settings. It can be worth comparing both by title and by SMILES string. In either case, the 3-iter_Glide_comparison group more strongly overlaps with the Glide SP results of the same 1M ligand file.
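For readers who prefer Python to awk, the same column-matching logic can be sketched as below. It assumes whitespace-separated .smi lines with the SMILES in column 0 and the title in column 1; unlike the awk one-liners, it skips lines that lack the requested column. The function name and the sample records are hypothetical.

```python
def lines_matching_on_column(first_lines, second_lines, column):
    """Return lines of second_lines whose given column also occurs in first_lines."""
    keys = set()
    for line in first_lines:
        fields = line.split()
        if len(fields) > column:
            keys.add(fields[column])
    matches = []
    for line in second_lines:
        fields = line.split()
        if len(fields) > column and fields[column] in keys:
            matches.append(line)
    return matches


# Hypothetical records: the same ligand Z1 is written with two different SMILES
al_glide = ["CCO Z1", "CCN Z2"]
glide_sp = ["OCC Z1", "CCN Z9", "CCC Z3"]

by_title = lines_matching_on_column(al_glide, glide_sp, column=1)
by_smiles = lines_matching_on_column(al_glide, glide_sp, column=0)
print(by_title, by_smiles)
```

The toy data shows both failure modes described above: Z1 matches by title but not by SMILES string, while CCN matches by SMILES string but carries two different titles.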

4. Conclusion and References

In this tutorial, we used Active Learning Glide to screen a commercially available library of 1M ligands. We set up the Active Learning Glide job and analyzed the results using two methods. The first method assessed ligand diversity in the results from 1 iteration and from 3 iterations of machine learning training for the docking model. The 3-iteration results contained more diverse ligands in the top 10K results that were rescored with Glide SP docking. The second assessment compared the number of ligands that overlapped between the top 10K ligands from the two Active Learning Glide results and the top 10K ligands from a Glide SP evaluation of the 1M ligand library. Again, the 3-iteration results performed better, with a stronger overlap with the Glide SP results.


5. Glossary of Terms

Entry List - a simplified view of the Project Table that allows you to perform basic operations such as selection and inclusion

included - the entry is represented in the Workspace, the circle in the In column is blue

Project Table - displays the contents of a project and is also an interface for performing operations on selected entries, viewing properties, and organizing structures and data

Scratch Project - a temporary project in which work is not saved, closing a scratch project removes all current work and begins a new scratch project

selected - (1) the atoms are chosen in the Workspace. These atoms are referred to as "the selection" or "the atom selection". Workspace operations are performed on the selected atoms. (2) The entry is chosen in the Entry List (and Project Table) and the row for the entry is highlighted. Project operations are performed on all selected entries

Working Directory - the location that files are saved

Workspace - the 3D display area in the center of the main window, where molecular structures are displayed