Ligand-based Screening for Ultra-Large Libraries with Quick Shape and the Hit Analyzer

Tutorial Created with Software Release: 2024-1
Topics: Hit Discovery, Small Molecule Drug Discovery, Virtual Screening
Products Used: Phase, Shape Screening


This tutorial is written for use with a 3-button mouse with a scroll wheel.
Words found in the Glossary of Terms are shown like this: Workspace

 

Tip: You can hover over a glossary term to display its definition. You can click on an image to expand it in the page.
Abstract:

In this tutorial, you will learn how to use Quick Shape for ligand-based screening in ultra-large library settings. You will choose a reference compound for the screening by cluster analysis of a set of CDK2 small-molecule inhibitors and perform the preparation necessary to set up the Quick Shape screen. You will then screen a library of 28,000 compounds provided by DUD-E for the CDK2 target against this reference compound and analyze the screening results using the Hit Analyzer.

 

Tutorial Content
  1. Introduction

  2. Creating Projects and Importing Structures

  3. Choosing and Preparing Reference Compounds

  4. Screening a Single Database File with Quick Shape

  5. Analyzing the Results with the Hit Analyzer

  6. Conclusion and References

  7. Glossary of Terms

1. Introduction

High-throughput virtual screening has become established as a reliable technique to explore large chemical spaces in search of molecules with specific properties, e.g. which are likely to bind to a target protein. As the number of synthetically readily accessible compounds keeps growing well into the billions, new methods and workflows have been developed to make these ultra-large libraries accessible to ligand-based virtual screening. Methods like Shape GPU are fast enough to handle libraries containing many millions of compounds. However, the computational and storage costs of working with ultra-large libraries require even faster approaches. Quick Shape is a combination of a 1D-fingerprint based similarity search and subsequent Shape CPU screening which allows for efficient screening of ultra-large libraries. For an overview of the relative speeds of Quick Shape, Shape CPU and Shape GPU, see the Shape Screening web page.

The overall workflow is summarized in the following schematic, which gives the order of magnitude of the number of compounds handled in each step:

In this tutorial, we will use Quick Shape to perform a small virtual screen for the CDK2 system. First, we will import a set of known active compounds and choose a reference compound for the screening using cluster analysis. We will then prepare the reference compound and set up the screening against a library of 28,000 compounds containing both known actives and decoys from the DUD-E dataset for CDK2. Finally, we will perform the Quick Shape screening and analyze the resulting hits using the Hit Analyzer.

If you are following this tutorial with your own data and have already selected the reference compounds you want to screen against, you can import your reference compounds into your project and directly skip to Section 3.2: Preparing the reference compounds.

2. Creating Projects and Importing Structures

At the start of the session, change the file path to your chosen Working Directory in Maestro to make file navigation easier. Each session in Maestro begins with a default Scratch Project, which is not saved. A Maestro project stores all your data and has a .prj extension. A project may contain numerous entries corresponding to imported structures, as well as the output of modeling-related tasks. Once a project is created, the project is automatically saved each time a change is made.

Structures can be imported from the PDB directly, or from your Working Directory using File > Import Structures, and are automatically added to the Entry List and Project Table. The Entry List is located to the left of the Workspace. The Project Table can be accessed by Ctrl+T (Cmd+T) or Window > Project Table if you would like to see an expanded view of your project data.

  1. Double-click the Maestro icon.

Figure 2-1. Change Working Directory option.

  1. Go to File > Change Working Directory.
  2. Find your directory, and click Choose.
  3. Pre-generated files are included for running jobs or examining output. Download the zip file here: https://www.schrodinger.com/sites/default/files/s3/release/current/Tutorials/zip/quickshape.zip
  4. After downloading the zip file, unzip the contents in your Working Directory for ease of access throughout the tutorial.

 

Figure 2-2. Saving the project.

  1. Go to File > Save Project As.
  2. Change the File name to quickshape, click Save.
    • The project is now named quickshape.prj.

Figure 2-3. Importing the SMILES file.

  1. Click on File > Import Structures.
  2. Find and select actives_final.smi in your working directory and click Open.

Figure 2-4. Import Structures Confirmation dialog.

  1. Click Import in the confirmation panel.

 

Figure 2-5. Import Smiles dialog.

  1. Click OK.

Note: When using your own screening library, you may want to include additional information in the .smi file and add a row of column headers. In that case, the SMILES and ENTRY TITLE columns should be automatically recognized.

Figure 2-6. Imported structures in the Entry List.

After the import, the Entry List shows a new group named actives_final containing 474 molecules.

 

Note: The entry names for the active compounds contain their CHEMBL IDs. This is also the case in the provided screening library for this tutorial, so that you can easily recognize the known actives in the screening results later.

3. Choosing and Preparing Reference Compounds

After importing the known active molecules for CDK2, you will perform a diversity-based clustering and select a molecule to use as a reference compound in the Quick Shape screen. In production settings, you may only have a few known actives to start from, and if they are diverse enough you may skip the clustering step and use them as reference compounds for Quick Shape directly.

In this example, we will choose a single molecule as a reference in order to speed up the calculation. In a real project, we recommend that you select up to 5-10 reference compounds, depending on how much data is available. The selection is usually performed based on diversity in order to maximize the novelty of the molecules.

Before they can be used in Quick Shape, the reference compounds need to be prepared by ensuring that all hydrogen atoms are present and generating a reasonable 3D conformation for each of them.

3.1 Choosing reference compound(s) for screening

This step demonstrates how to use clustering of known active compounds in order to select representative reference compounds for screening. It is completely independent of the Quick Shape workflow, so if you are following this tutorial with your own data and have already chosen reference compounds, feel free to skip ahead to Section 3.2: Preparing the reference compounds.

Figure 3-1. Fingerprints tab of the Canvas Similarity and Clustering panel.

  1. Go to Tasks > Browse > Discovery Informatics and QSAR > Fingerprint Similarity.
    • The Canvas Similarity and Clustering panel opens.
  2. In the Fingerprints tab, leave everything at the default settings.
    • Fingerprint type: Linear
    • Atom Typing Scheme: 10

Note that the results of the clustering are impacted by your choices of molecular fingerprinting method, similarity metric, cluster linkage method and the number of clusters you decide on. Keep in mind that a “correct” or “optimal” clustering does not exist for real-world data sets. Determining how useful a given clustering is in a particular context is not a trivial task.

The choice of Fingerprint type determines the overall composition of the reduced representation of each molecule; think of the difference between a barcode and an actual fingerprint. The Atom Typing scheme determines how a molecule is broken down into its constituent components. For example, item 1 makes no distinction between atoms, so the molecular representation is a bare connectivity graph of points and the connections between them, without considering whether a point is a carbon or a hydrogen atom. Daylight invariant atom types distinguish atoms and bonds beyond their elements, so that, for example, an aliphatic carbon atom and an aromatic one are treated as different. See the Further Reading section in the Conclusion and References for more details.
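To make the idea of comparing fingerprints concrete, here is a minimal Python sketch that is independent of Canvas. It assumes each fingerprint is represented as the set of its "on" bit positions and compares two molecules with the Tanimoto coefficient. The bit positions are made up for illustration, and the choice of Tanimoto is an assumption; it is only one of the similarity metrics you could select.

```python
def tanimoto(fp_a: set[int], fp_b: set[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    if not fp_a and not fp_b:
        return 1.0  # convention chosen here: two empty fingerprints count as identical
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


# Hypothetical on-bit positions for three molecules (not real Canvas output).
mol_a = {1, 4, 7, 9}
mol_b = {1, 4, 7, 12}
mol_c = {2, 3}

sim_ab = tanimoto(mol_a, mol_b)  # 3 shared bits out of 5 distinct bits
sim_ac = tanimoto(mol_a, mol_c)  # no shared bits
```

A different fingerprint type or atom typing scheme changes which bits are set for each molecule, and therefore changes every pairwise similarity that the clustering is built on.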

The clustering is performed in two steps:

First, we run the algorithm to analyze the molecule set with the specified fingerprint and linkage method settings. This step provides metrics that indicate how well the data decomposes under the chosen clustering settings and suggest the number of clusters that minimizes similarity between clusters while maximizing it within clusters.

Second, we can choose the number of clusters to actually use and which output we are interested in, and then perform the clustering step.

The number of clusters to choose depends on both the results of the cluster analysis and the amount of output you are willing to inspect. If the output molecules from different clusters are very similar to one another, we recommend reducing the number of clusters the algorithm creates. If the contents of individual clusters are too diverse, we recommend increasing the number of clusters. This situation is visible, for example, in the distance matrix visualization when the similarity between molecules of neighboring clusters is comparable to the similarity within a cluster.

Figure 3-2. Cluster tab of the Canvas Similarity and Clustering panel before cluster analysis is performed.

  1. Go to the Cluster tab.
  2. Click the Calculate Clustering button.
    • This should only take a few seconds.

 

Note: Feel free to investigate the results of the clustering using the metrics in the Clustering Statistics plot as well as the Distance Matrix visualization.

Figure 3-3. The Cluster tab of the Canvas Similarity and Clustering panel after cluster analysis is performed.

Next, the most representative molecule for each cluster can be extracted. This is the structure nearest the centroid of its cluster, that is, the member that best represents the cluster as a whole.
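The idea of picking one representative per cluster can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the Canvas implementation: it assumes you already have a pairwise similarity matrix and a cluster label per molecule, and it takes "nearest the centroid" to mean the member with the highest mean similarity to the other members of its cluster. All values are hypothetical.

```python
def cluster_representatives(similarity, labels):
    """Return {cluster_label: index of the member with the highest mean
    similarity to the other members of its cluster}."""
    members = {}
    for index, label in enumerate(labels):
        members.setdefault(label, []).append(index)

    representatives = {}
    for label, indices in members.items():
        def mean_similarity(i):
            others = [j for j in indices if j != i]
            if not others:
                return 1.0  # a singleton cluster represents itself
            return sum(similarity[i][j] for j in others) / len(others)

        # On ties, max() keeps the first member encountered.
        representatives[label] = max(indices, key=mean_similarity)
    return representatives


# Hypothetical pairwise similarities for five molecules in two clusters.
similarity = [
    [1.0, 0.8, 0.7, 0.1, 0.2],
    [0.8, 1.0, 0.9, 0.2, 0.1],
    [0.7, 0.9, 1.0, 0.1, 0.1],
    [0.1, 0.2, 0.1, 1.0, 0.6],
    [0.2, 0.1, 0.1, 0.6, 1.0],
]
labels = [0, 0, 0, 1, 1]
reps = cluster_representatives(similarity, labels)
```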

 

  1. After the Cluster calculation has finished, select A Group Containing the structures nearest the centroid in each cluster in the Apply Clustering section.
  2. Click the Apply Clustering button.
    • A group containing 10 molecules appears in the Entry List.

 

Note: In this case, we will use the recommended number of clusters (10).

Figure 3-4. Representative entries for each cluster shown in the Entry List.

You can look through the list to visually inspect the cluster representatives. In this case, they cover a variety of diverse scaffolds.

 

  1. Close the Canvas Similarity and Clustering panel.

Figure 3-5. The CHEMBL556881 entry selected and included in the Workspace.

We choose CHEMBL556881 as the reference compound for the screening.

 

  1. Include and Select the CHEMBL556881 entry.

As mentioned above, we select only a single reference compound in this tutorial in order to speed up the calculation. If this were a real project, you could for example use the representative molecules for each cluster, which would give a total of 10 reference compounds.

3.2 Preparing the reference compound

The ionization state for the compound at a pH of 7.4 is already assigned in the provided file, but hydrogen atoms are left implicit. However, the 1D pharmacophore fingerprint used in Quick Shape requires explicit hydrogen atoms for all compounds, including the reference compound. You will simply add the missing hydrogen atoms using the 3D builder. Please note that Phase’s pharmacophore recognition is sensitive to protonation and tautomer states, so if you want to screen against several potentially relevant states of your reference compound, you’ll have to explicitly add them as reference compounds. The 1D pharmacophore representation is not sensitive to stereoisomers as only the number of bonds is considered rather than the spatial coordinates of the atoms.
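The statement about stereoisomers can be illustrated with a short Python sketch. It is not the Phase algorithm; it only assumes that a distance "in bonds" between two atoms is the length of the shortest path in the bond graph. Because two stereoisomers share the same bond graph, any quantity derived purely from such path lengths is identical for both. The atom indices and bonds below are hypothetical.

```python
from collections import deque


def bond_distance(bonds, start, goal):
    """Number of bonds on the shortest path between two atoms (breadth-first search)."""
    neighbors = {}
    for a, b in bonds:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        atom, dist = queue.popleft()
        if atom == goal:
            return dist
        for nxt in neighbors.get(atom, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the two atoms are not connected


# Hypothetical heavy-atom skeleton: a five-atom chain with a branch at atom 2.
bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (2, 5)]
d_04 = bond_distance(bonds, 0, 4)
d_05 = bond_distance(bonds, 0, 5)
```

No 3D coordinates enter the calculation, which is why the stereochemistry at atom 2 cannot influence the result.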

A reasonable conformation for the reference ligand is also needed. The 1D pharmacophore pattern used in the initial screening is derived from the connectivity rather than the spatial structure, but the second stage of the Quick Shape workflow performs a 3D alignment to the reference ligand in order to carry out the Shape screen. If a co-crystal structure with the reference compound is not available, a reasonable low-energy conformation for the compound should be used. In this tutorial, you will assume that the energy-minimized conformation is close to the one adopted in the bound state, so you will perform a force-field minimization of the compound. Note that this is a simplification, and more representative conformations can be obtained in other ways, e.g. by using ConfGen or by performing a molecular dynamics simulation followed by clustering to sample the ligand’s conformational space.

Note that only the reference ligand must be prepared in this way. The ligands from the screening library which are not discarded in the initial 1D fingerprint-based screening stage will automatically be prepared using LigPrep before they go through the Shape stage. This greatly reduces the computational and storage footprint of the screening compared to screening the full library with Shape GPU.

Figure 3-6. Adding hydrogens to the reference ligand using the 3D Builder.

  1. Under Quick Select, click All.
  2. Open the 3D Builder and Add Hydrogens to all the selected atoms.

 

Note: For a more detailed introduction to using the 3D Builder, see section 3.2 of the A Chemist’s Guide to Maestro tutorial.

Figure 3-7. Minimizing the reference ligand structure using the 3D Builder.

  1. Under Quick Select, click All again to ensure the newly-added hydrogen atoms are selected.
  2. In the 3D Builder, click the Minimize Selected Atoms button.
    • The reference compound now has 3D coordinates suitable for the Shape step in Quick Shape.
  3. Close the 3D Builder.

4. Screening a Single Database File with Quick Shape

In this section, you will set up a Quick Shape job with the reference compound we previously prepared as well as a database of pre-prepared actives and decoys. In the provided database, the actives are purposefully put at the very end to facilitate analysis later on.

To generate the .1dbin files for your own project, use the $SCHRODINGER/oned_screen create -source <path_to_input_file> command. You can find additional information on the oned_screen command line tool documentation page.
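If you prefer to script this step, the following Python sketch wraps exactly the command shown above and nothing more. The helper names and the library.smi file name are hypothetical, and the sketch makes no claims about other oned_screen options or accepted input formats; check the oned_screen documentation for those. By default it only prints the command so that you can review it before running.

```python
import os
import subprocess


def oned_screen_create_command(source_path, schrodinger=None):
    """Build the argument list for `$SCHRODINGER/oned_screen create -source <file>`."""
    schrodinger = schrodinger or os.environ.get("SCHRODINGER")
    if not schrodinger:
        raise RuntimeError("The SCHRODINGER environment variable is not set")
    return [os.path.join(schrodinger, "oned_screen"), "create", "-source", source_path]


def create_1dbin(source_path, dry_run=True):
    """Print the command (dry run) or execute it and raise if it fails."""
    command = oned_screen_create_command(source_path)
    if dry_run:
        print(" ".join(command))
    else:
        subprocess.run(command, check=True)
    return command


# Hypothetical installation path and input file, used here only to show the result.
example = oned_screen_create_command("library.smi", schrodinger="/opt/schrodinger")
```

Call create_1dbin("library.smi", dry_run=False) from a shell in which SCHRODINGER is set once you are satisfied with the printed command.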

The provided screening database contains ~28,000 ligands (including the 474 actives from which we chose the reference compound). Using Quick Shape, the entire database will be pre-screened using the 1D fingerprint as described in the introduction. Then, LigPrep, ConfGen and Shape CPU will be run on the top 2000 structures from the 1D screening step to give the final list of hits. A detailed introduction to ligand-based screening using Shape can be found in the Rapid Screening of Chemical Libraries with GPU Shape tutorial.
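The two-stage funnel described here follows a general pattern: rank everything with a cheap score, then spend the expensive calculation only on the survivors. The Python sketch below shows that pattern with stand-in scoring functions; it does not reproduce the 1D fingerprint or Shape scoring, and every name and number in it is hypothetical.

```python
import heapq


def two_stage_screen(library, cheap_score, expensive_score, keep_stage1, keep_final):
    """Rank with a cheap score first, then re-rank only the survivors
    with an expensive score. Higher scores are better in both stages."""
    survivors = heapq.nlargest(keep_stage1, library, key=cheap_score)
    rescored = [(expensive_score(item), item) for item in survivors]
    rescored.sort(key=lambda pair: pair[0], reverse=True)
    return rescored[:keep_final]


# Toy library: each "compound" is just a number, and both scores are stand-ins.
library = list(range(100))
expensive_calls = []


def cheap(x):
    return -abs(x - 50)


def expensive(x):
    expensive_calls.append(x)  # record how often the costly step runs
    return -abs(x - 48)


hits = two_stage_screen(library, cheap, expensive, keep_stage1=10, keep_final=3)
```

In this toy run the expensive score is evaluated for only 10 of the 100 items, which mirrors why the tutorial forwards only the top 2000 of ~28,000 compounds to the Shape stage.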

Figure 4-1. Setting up the query and screening library in the Quick Shape panel.

  1. Go to Tasks > Browse > Shape Screening > Quick Shape Screening.
    • The Quick Shape panel opens.
  2. Make sure the prepared CHEMBL556881 entry is included and Use shape query from is set to Workspace.
  3. Click Browse, navigate to the provided cdk2_merged.1dbin file, and click Open to load the screening library.

Figure 4-2. Setting up output options in the Quick Shape panel.

You now need to specify how many compounds are allowed to pass the 1D-similarity filter and proceed to the Shape screening stage, as well as how many of the top-ranked compounds to output as hits at the end of the screening.

  1. Set Max 1D-screen hits supplied to Shape-screen to 2000 and Max Shape-screen hits output to 2000.
  2. Check the box for Generate report and database files.

 

Note: In a realistic scenario for screening an ultra-large library, the number of 1D-Screen hits to forward to the Shape screen would be ~10 million, and the number of hits to keep after the Shape screen step would be 10-100 times lower.

 

  1. Click Run to start the job.
    • The job should take about 5-10 minutes to complete depending on your CPU.
    • The results are not automatically incorporated into the Workspace. They are written to the job folder instead.
    • You can also find the output of this job in the QuickShapeTask_1 folder of the tutorial zip archive.
  2. Close the Quick Shape panel.

Once the screening is complete, a file named QuickShapeTask_1-out.maegz can be found in the job folder, which contains the previously specified number of screening hits ranked by Shape similarity (in our case, the top 2000 compounds from the library). You can use this file to analyze the results in whichever way you prefer, e.g. by using the enrichment calculator to see how well the method recovered the known actives recognizable by the CHEMBL portion of their names.
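As one way to quantify how well the known actives were recovered, the sketch below computes a simple enrichment factor from a ranked list of entry titles. It assumes, as in this tutorial's library, that actives can be recognized by "CHEMBL" in their title. The titles shown are hypothetical, the library size of 28,000 is approximate, and how you extract the titles from the output file is left open here.

```python
def enrichment_factor(ranked_titles, n_top, n_actives_total, n_library):
    """Hit rate among the top n_top entries divided by the hit rate
    expected from the whole library. Returns (actives_found, factor)."""
    top = ranked_titles[:n_top]
    if not top:
        raise ValueError("No ranked entries to evaluate")
    actives_found = sum(1 for title in top if "CHEMBL" in title)
    hit_rate_top = actives_found / len(top)
    hit_rate_library = n_actives_total / n_library
    return actives_found, hit_rate_top / hit_rate_library


# Hypothetical titles ranked by Shape similarity; the decoy names are made up.
ranked_titles = ["CHEMBL0001", "decoy_17", "CHEMBL0002", "decoy_03", "decoy_42"]
found, ef = enrichment_factor(
    ranked_titles, n_top=5, n_actives_total=474, n_library=28000
)
```

A factor above 1 means the top of the ranking contains more actives than a random selection of the same size would be expected to contain.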

 

If you checked “Generate report and database files”, the job folder also contains a PDF report of the screening hits with their respective shape similarities and pharmacophore representation, as well as a file named QuickShapeTask_1_report.vsdb, which can be used to inspect the results using the Hit Analyzer panel.

5. Analyzing the Results with the Hit Analyzer

With ultra-large libraries usually containing billions of compounds, analyzing the obtained hits quickly becomes challenging. The Hit Analyzer panel provides an interface for searching through the results of a screening based on customizable filters.

We will use the Hit Analyzer panel to analyze the output of the Quick Shape screen performed in the previous section. Note that while we’ll walk you through some of the available filters in this section, when working with your own data, you will need to determine which filters and cutoffs are helpful for understanding your screening results and determining next steps.

Figure 5-1. Loading a screening results database into the Hit Analyzer panel.

  1. Go to Tasks > Browse > Ligand-based Virtual Screening > Hit Analyzer.
    • The Hit Analyzer panel opens.
  2. Click the option menu for Load screening database and click Browse.
  3. Navigate to the job directory for QuickShapeTask_1 in your Working Directory, select the QuickShapeTask_1_report.vsdb file and click Open.

Figure 5-2. Importing the reference ligand into the Workspace and enabled filter section of the Hit Analyzer panel.

  1. Click Import in the Reference Ligand section.
    • The structure for the reference ligand is added to the Entry List and shown in the Workspace.
    • The possible filters for hits are automatically populated based on the reference ligand’s properties and pharmacophore features.

Figure 5-3. The reference ligand shown in the Workspace with its pharmacophore representation.

Optional: Feel free to examine the reference ligand and its pharmacophoric representation in the Workspace. The pharmacophore feature labels listed in the Hit Analyzer panel are mapped to the 3D structure.

 

Note: For an introduction to pharmacophore-based screening in the Schrödinger Suite, consult the Ligand-Based Virtual Screening Using Phase tutorial and the Phase documentation.

 

Note: The calculations shown here may have been performed with an earlier version of the software, so the results may not be exactly the same as those you produce in this tutorial.

Figure 5-4. Applying a Shape similarity based filter to the screening results.

First, you will examine only the molecules that are most Shape-similar to the reference.

 

  1. Set the text box in the Shape Similarity section to 0.60.
  2. Uncheck the boxes for Pharmacophore Features and Molecular Properties.
  3. Click Apply Filters.
    • The panel updates to show the 39 most similar entries with a similarity above 0.60.
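The same kind of cutoff can be applied to exported results outside the panel. The Python sketch below assumes each hit is a dictionary with a title and a shape_sim value; these field names and the example values are hypothetical, and treating the cutoff as strictly "above" follows the wording used here rather than a documented panel behavior.

```python
def filter_by_shape_similarity(hits, cutoff=0.60):
    """Keep hits with a Shape similarity above the cutoff, best first."""
    kept = [hit for hit in hits if hit["shape_sim"] > cutoff]
    return sorted(kept, key=lambda hit: hit["shape_sim"], reverse=True)


# Hypothetical hits; real values would come from your screening output.
hits = [
    {"title": "cpd_a", "shape_sim": 0.72},
    {"title": "cpd_b", "shape_sim": 0.60},
    {"title": "cpd_c", "shape_sim": 0.41},
    {"title": "cpd_d", "shape_sim": 0.65},
]
filtered = filter_by_shape_similarity(hits, cutoff=0.60)
```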

Figure 5-5. View the filtered hits in the Hit Analyzer panel and save them to the Workspace.

Now, you will incorporate this subset of hits into the project for further analysis.

  1. Click Save Grid > Import Structures Into Project at the bottom right of the panel.
    • A new group appears at the bottom of the Entry List.
    • The new entries are sorted by shape similarity in descending order.

Figure 5-6. Entry List with the reference ligand fixed in the Workspace and multiple other ligands from the group of filtered hits included in the Workspace.

You can now compare the filtered hits to the reference compound. Feel free to adjust the visualization style to make the visual comparison to the hits easier. This tutorial uses the default styling settings.

 

  1. Double-click the “In” circle to fix the reference molecule in the Workspace.
  2. Include the first screening compound and select the group containing 39 filtered hits.

Figure 5-7. Comparing hits with the reference ligand.

You can now inspect the results visually by stepping between entries from the output group using the left and right arrow keys or including multiple hits at the same time.

Figure 5-8. Applying a diversity filter in the Hit Analyzer panel.

Finally, you will filter the hits to identify the most diverse hits. This can help you understand how large the chemical space covered by your hits is.

 

  1. In the Hit Analyzer panel, uncheck Shape similarity to Reference.
  2. Check the box next to Sample.
  3. Click Apply Filters.
    • The results update to show the 200 most diverse molecules from the 2000 hits in the output (purple box).
  4. Use Save Grid > Import Structures Into Project to load this filtered set of hits into the project.
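To give a feel for what selecting "the most diverse" subset can mean, here is a greedy MaxMin selection in Python. This is a generic diversity-picking technique used purely as an illustration; it is an assumption, not a description of how the Sample filter in the Hit Analyzer works. Fingerprints are again represented as hypothetical sets of on-bit positions.

```python
def maxmin_pick(fingerprints, n_pick, first=0):
    """Greedy MaxMin selection: repeatedly add the item whose highest
    similarity to anything already picked is the lowest."""

    def tanimoto(a, b):
        union = len(a | b)
        return len(a & b) / union if union else 1.0

    picked = [first]
    # closest[i] = highest similarity of item i to any item picked so far
    closest = [tanimoto(fp, fingerprints[first]) for fp in fingerprints]
    while len(picked) < min(n_pick, len(fingerprints)):
        candidates = [i for i in range(len(fingerprints)) if i not in picked]
        best = min(candidates, key=lambda i: closest[i])  # ties: lowest index
        picked.append(best)
        for i, fp in enumerate(fingerprints):
            closest[i] = max(closest[i], tanimoto(fp, fingerprints[best]))
    return picked


# Hypothetical fingerprints: items 0, 1, 3 resemble each other, as do 2 and 4.
fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {1, 2, 3, 4}, {7, 8, 10}]
selection = maxmin_pick(fps, n_pick=3)
```

Starting from item 0, the procedure first jumps to the unrelated item 2 and only then returns to the first family, which is the behavior that lets a diversity filter surface scaffolds that differ from the reference.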

Figure 5-9. Showing the Shape Sim property in the Entry List.

You can now inspect how similar the most diverse hits are to the reference molecule.

 

  1. Click the three dots in the Entry List and choose Show Property.
  2. In the Show Property dialog box, click Choose and search for and click on Shape Sim in the list.
  3. Click OK.
    • The Shape similarity is shown next to each entry in the Entry List.

Optional: You can use the plotting tools in the Project Table to visualize the results for the various output groups.

Comparing the results for the two filtered groups, we can see that the diversity set contains more decoys than the similarity-filtered set. However, some known actives have a low similarity to the reference compound because they are built on quite different scaffolds (e.g. CHEMBL215803), and these are only recovered in the set filtered by diversity. This highlights why using multiple references for the screening is helpful if they are available in your project.

6. Conclusion and References

In this tutorial, we used Quick Shape to screen a library of 28,000 compounds against a reference ligand for CDK2. We chose the reference from a set of known actives by applying cluster analysis, and then prepared the reference for the Quick Shape screen by adding explicit hydrogens and generating an energy-minimized conformation. We then set up the Quick Shape screening against the library containing decoys and known actives, and finally used the Hit Analyzer panel to filter the screening hits based on shape similarity to the reference compound and on diversity.

For further reading:

7. Glossary of Terms

Entry List - a simplified view of the Project Table that allows you to perform basic operations such as selection and inclusion

included - the entry is represented in the Workspace, the circle in the In column is blue

incorporated - once a job is finished, output files from the Working directory are added to the project and shown in the Entry List and Project Table

Project Table - displays the contents of a project and is also an interface for performing operations on selected entries, viewing properties, and organizing structures and data

Scratch Project - a temporary project in which work is not saved, closing a scratch project removes all current work and begins a new scratch project

selected - (1) the atoms are chosen in the Workspace. These atoms are referred to as "the selection" or "the atom selection". Workspace operations are performed on the selected atoms. (2) The entry is chosen in the Entry List (and Project Table) and the row for the entry is highlighted. Project operations are performed on all selected entries

Working Directory - the location where files are saved

Workspace - the 3D display area in the center of the main window, where molecular structures are displayed