Training and Evaluating ADMET Models with DeepAutoQSAR
Tutorial Created with Software Release: 2025-4
Topics: Hit Discovery , Hit-to-Lead & Lead Optimization , Machine Learning , Small Molecule Drug Discovery
Products Used: DeepAutoQSAR
|
3.84 GB |
This tutorial is written for use with a 3-button mouse with a scroll wheel.
Words found in the Glossary of Terms, such as Workspace, are defined at the end of this tutorial.
Abstract:
In this tutorial, you will learn how to use DeepAutoQSAR to train a supervised machine learning model on a publicly available aqueous solubility dataset compiled by Sorkun et al. in their paper “AqSolDB, a curated reference set of aqueous solubility and 2D descriptors for a diverse set of compounds” (Scientific Data 2019, 6:143). This dataset is one of several used in Schrödinger’s study that benchmarks DeepAutoQSAR’s performance on ADMET regression and classification problems from the Therapeutics Data Commons’ single-instance prediction challenge.
You will first train a regression model on the ~6000 labeled training compounds and then examine the structure of the model, looking at the optimized submodels within the ensemble and their associated performance metrics. Next, the model will be used to generate predictions for two additional datasets to explore the generalization strength of the model. Finally, you can access 22 pretrained DeepAutoQSAR models for various ADMET endpoints which can be loaded into Maestro for use.
Note that you can also find the pre-trained models on the TDC ADMET prediction challenge data in the files provided for this tutorial.
Tutorial Content
1. Creating Projects and Importing Structures
At the start of the session, change the file path to your chosen Working Directory in Maestro to make file navigation easier. Each session in Maestro begins with a default Scratch Project, which is not saved. A Maestro project stores all your data and has a .prj extension. A project may contain numerous entries corresponding to imported structures, as well as the output of modeling-related tasks. Once a project is created, the project is automatically saved each time a change is made.
Structures can be imported from the PDB directly, or from your Working Directory using File > Import Structures, and are added to the Entries and Project Table. The Entries table is located to the left of the Workspace. The Project Table can be opened by Ctrl+T (Cmd+T) or Window > Project Table if you would like to see an expanded view of your project data.
- Double-click the Maestro icon.
- (No icon? See Starting Maestro)
- Go to File > Change Working Directory.
- Find your directory, and click Choose.
- Pre-generated input and results files are included for running jobs or examining output. Download the zip file here: https://www.schrodinger.com/sites/default/files/s3/release/current/Tutorials/zip/deepautoqsar.zip
- After downloading the zip file, unzip the contents in your Working Directory for ease of access throughout the tutorial.
- Go to File > Save Project As.
- Change the File name to DeepAutoQSAR.
- Click Save.
- The project is now named DeepAutoQSAR.prj.
Note: Banners appear when files have been imported, jobs are incorporated into the Entries, or there is a prompt for a common next step.
Note: By default the structure corresponding to the imported file is both included in the Workspace and selected in the Entries. Please refer to the Glossary of Terms for the difference between included and selected.
2. Training a DeepAutoQSAR Model for Aqueous Solubility
All supervised machine learning models require a data set of inputs and outputs to use for training. In this example, you will use a collection of curated compound structures as inputs with log solubility labels as outputs. These data are originally sourced from the AqSolDB (Sorkun et al., 2019), a reference set of ligands and associated solubilities compiled from nine different datasets.
Aqueous solubility is an important molecular property that is often modeled in drug discovery campaigns because of its impact on drug concentration in the systemic circulatory system. Machine learning models, and deep learning models in particular, have become very popular tools to predict solubility, and many different ML approaches have been introduced to estimate solubility from molecular structure.
Solubility is one of many ADMET properties (absorption, distribution, metabolism, excretion, and toxicity), a collection of physicochemical, pharmacodynamic, and pharmacokinetic properties that must be balanced in order to de-risk a drug candidate. This solubility set has been chosen as a representative example from a larger collection of ADMET datasets in the Therapeutics Data Commons’ (TDC) Single-Instance Prediction problems. The TDC set is a comprehensive industry challenge for supervised learning on molecules. As a public resource, the TDC benchmark allows for easy comparison between the performance of DeepAutoQSAR and other ADMET prediction methods.
The supplied data set has been downloaded from the AqSolDB data on the TDC website. The molecules have been standardized using utilities from the RDKit and saved in the SDF format. Next, you will train a model with this data and later check its performance.
2.1 Loading Structures and Submitting a Training Job
- Go to File > Import Structures.
- Navigate to where you downloaded the tutorial files and select daq_tutorial_solubility.sdf.
- Click Open.
- A banner appears.
- The structures in the training set are now added to and selected in the Entries.
- The top entry is included in the Workspace.
Figure 2-2. Opening the DeepAutoQSAR panel.
- Find and open DeepAutoQSAR from Tasks.
- The DeepAutoQSAR panel opens.
- Next to Choose Task, pick Build Model.
- Next to Model type, choose Regression (numeric).
- For Use structures from, choose Project Table (selected entries).
- For Prediction property, choose logS.
Next, you will need to specify which parts of the dataset should be used for training, validation, and testing the model.
The Therapeutics Data Commons ADMET single-instance prediction datasets come with assigned training, validation, and testing data splits. Here, the data has been preprocessed so that the training set matches the assigned TDC training set, and the combined validation and testing sets serve as a holdout set for checking model performance. For many applications, it may be advantageous to use separate training, validation, and testing sets, or to construct data splits for cross validation.
In the DeepAutoQSAR panel, the Custom split feature allows the definition of custom test/holdout splits using any pre-existing numeric property in a dataset. For example, the data could be split on molecular weight, solubility, or some other computed property. Alternatively, the data can be split on a binary value, as done here to match the TDC assignment. The data has been preprocessed so that training compounds are labeled 1, and holdout compounds are labeled 0.
- For Training set, choose Custom split.
- For Split on property, choose training_set.
- For Split threshold, set property = 1.0.
- The panel automatically updates to indicate the size of training and holdout sets.
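The effect of the Custom split settings above can be sketched in a few lines of Python. This is only an illustration of the idea, not how DeepAutoQSAR implements it: the record layout, the `custom_split` helper, and the toy logS values are all hypothetical, and the sketch assumes that records whose split property equals the chosen value go to training while everything else is held out.

```python
from typing import Dict, List, Tuple

Record = Dict[str, float]


def custom_split(
    records: List[Record],
    split_property: str = "training_set",
    train_value: float = 1.0,
) -> Tuple[List[Record], List[Record]]:
    """Partition records into (training, holdout) using a binary flag.

    Assumption: a record belongs to the training set when its split
    property equals train_value; all other records form the holdout set.
    """
    train = [r for r in records if r[split_property] == train_value]
    holdout = [r for r in records if r[split_property] != train_value]
    return train, holdout


# Toy records with made-up logS labels (not taken from AqSolDB).
records = [
    {"logS": -2.1, "training_set": 1.0},
    {"logS": -4.7, "training_set": 1.0},
    {"logS": -0.3, "training_set": 0.0},
    {"logS": -3.9, "training_set": 1.0},
    {"logS": -5.2, "training_set": 0.0},
]

train, holdout = custom_split(records)
print(f"training: {len(train)} compounds, holdout: {len(holdout)} compounds")
```

The two counts printed here play the same role as the training and holdout sizes that the panel reports after the split is defined.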
Finally, you need to specify the training time. The default training time for this model, which is also the minimum recommended training time, is 4 hours. Longer training times allow for more model optimization, up to a point of diminishing returns. For datasets containing thousands of ligands, training for longer than 4 hours is recommended.
- Set the training time to 4.0 hours.
- Change the job name to aqsol_with_tdc_splits.
- Click the Job Settings cog to choose the desired host.
Note: This job must be run on a Linux host for GPU acceleration.
Note: As this job takes about 4 hours to run, a pre-generated file has been provided. If you run the job yourself, your results may differ from the pre-generated files due to differences in software version, hardware configuration, or stochastic effects.
- Click Run.
You have now trained a regression model for predicting logS. DeepAutoQSAR also supports training classification models for data with binary labels. Remember to specify the correct model type for the dataset you are using.
3. Analysis of the Trained DeepAutoQSAR Model
After training, the contents of the model can be inspected and used to generate predictions. Load the model into the DeepAutoQSAR panel and inspect the performance metrics. Then, examine the plot showing performance on the holdout set. Finally, evaluate the uncertainties provided by the model.
3.1 Loading and Inspecting the Model
Loading the trained model into the DeepAutoQSAR panel shows a summary of key model metrics from the training and validation, as well as allowing you to make predictions.
- In the DeepAutoQSAR panel, for Choose task, pick Make Prediction.
- For Model file, click Browse.
- Navigate to the 25-4 folder in your Working Directory and choose aqsol_with_tdc_splits_model.qzip.
- Click Open.
- The Model Summary section of the panel populates (this may take a few seconds).
This summary contains several metrics that measure model performance. For a regression model, two key metrics to check are the r2 score (coefficient of determination) and mae (mean absolute error).
Note: If you trained the model yourself, the values for the metrics you see may differ slightly from those in the screenshot due to differences in software version, hardware configuration, or stochastic effects.
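For readers who want to see what the two regression metrics measure, the following is a minimal sketch using their standard textbook definitions. It is an assumption that the values in the Model Summary follow exactly these definitions; the function names and the toy numbers are illustrative only.

```python
from typing import Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between labels and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy logS labels and predictions (made-up values).
y_true = [-1.0, -2.0, -3.0, -4.0]
y_pred = [-1.5, -2.0, -2.5, -4.0]

print(f"mae = {mean_absolute_error(y_true, y_pred):.2f}")
print(f"r2  = {r2_score(y_true, y_pred):.2f}")
```

A lower mae and an r2 closer to 1 both indicate predictions that track the labels more closely; mae is in the same units as the label (here, log solubility), which makes it easier to relate to experimental error.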
You can access the full report generated by DeepAutoQSAR during training. This report contains information about the ensemble members (hyper-parameters, cross validation performance, etc.) and various statistics about the training run in a JSON format.
- Click View Full Report.
- The DeepAutoQSAR Report Viewer opens.
Part of the report from model training is a plot showing the correlation between model predictions and the labels on the holdout set. We’ll return to this image later in this section.
- Click Plot.
- Close the DeepAutoQSAR Report Viewer.
Since this is a trained regression model, the performance plot shows the correlation between the ground truth labels (x-axis) and our model’s prediction (y-axis) across the collection of compounds in the holdout set. For a classification model, the results would be reported as a ROC plot.
In this plot, the y axis shows the mean prediction across our ensemble of 15 separately trained ML models (3 different architectures, each trained on 5 cross-validation splits of the data), with a 95% confidence interval around each prediction based on the ensemble standard deviations. The confidence interval is constructed as:
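$$\bar{y}_i \pm z \, s_i$$

| Notation | Description |
| --- | --- |
| $\bar{y}_i$ | Prediction sample mean for the $i$-th compound |
| $y_{ij}$ | Solubility prediction for the $i$-th compound from the $j$-th model in the ensemble |
| $s_i$ | Prediction sample standard deviation for the $i$-th compound |

For a two-sided 95% confidence interval, we choose $z = 1.96$.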
In the plot, many predictions further from the diagonal (which have higher errors) tend to show larger uncertainties, indicating that the model is less confident in those predictions.
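The interval described above can be sketched numerically as follows. This sketch assumes the interval is the ensemble mean plus or minus z times the ensemble sample standard deviation, with z = 1.96 for a two-sided 95% interval under a normal approximation; the `ensemble_interval` helper and the member predictions are hypothetical and are not output from DeepAutoQSAR.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple

Z_95 = 1.96  # two-sided 95% quantile of the standard normal distribution


def ensemble_interval(
    member_predictions: Sequence[float], z: float = Z_95
) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) for one compound.

    member_predictions holds one prediction per ensemble member.
    stdev() is the sample standard deviation (n - 1 in the denominator).
    """
    center = mean(member_predictions)
    spread = stdev(member_predictions)
    return center, center - z * spread, center + z * spread


# Toy predictions from three ensemble members for a single compound.
# The ensemble in this tutorial would supply 15 values per compound.
center, lower, upper = ensemble_interval([-2.0, -2.5, -3.0])
print(f"logS = {center:.2f}, 95% interval [{lower:.2f}, {upper:.2f}]")
```

When the ensemble members disagree, the standard deviation grows and the interval widens, which is the behavior the plot shows for predictions far from the diagonal.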
4. Generating Predictions with the Trained DeepAutoQSAR Model
To better understand the model’s real-world utility, you will now predict the solubility of new compounds using two different datasets. These datasets have been chosen to demonstrate the variability of model performance when predicting on novel compounds. The data are contained in the files ochem_holdout.sdf and chembl_holdout.sdf, both of which have been generated from publicly available datasets containing experimentally determined solubility labels. These two datasets will be used to evaluate if the model performance is sensitive to the difference in chemistry between the training set and the test set.
These data sets were aggregated in a recent work (Meng et al., 2022) focused on the collection and curation of solubility data sets. Here, the OChem Water Solubility dataset is used as an example of a high-quality collection that is chemically similar to the training set. The ChEMBL set contains more chemical diversity with higher labeling noise. Both sets were curated by Meng et al., and the datasets were then modified by removing any compounds that are in the AQSolDB data set (which the model was trained on). Below, the steps needed to generate predictions and plot the correlation between the estimated and true data labels are shown for the OChem set. The steps needed to examine performance for the ChEMBL set are identical.
4.1 Generate Predictions Using a Trained DeepAutoQSAR Model
- Go to File > Import Structures.
- Choose ochem_holdout.sdf and click Open.
- A banner appears and a new group ochem_holdout is added to the Entries.
- The imported structures are selected in the Entries and the top entry is included in the Workspace.
- In the DeepAutoQSAR panel’s Make Predictions section, for Use structures from choose Project Table (656 selected entries).
- For Output property name, type ochem_predictions.
- Change the Job name to ochem_predictions.
- Click Run.
- This job takes about 1 minute to run and does not need a GPU host.
- A banner appears and a new group ochem_predictions_output is added to the Entries.
Note: If you did not run the job, pre-generated predictions are included in the tutorial files.
- Repeat steps 1-7 using the chembl_holdout dataset.
Note: The chembl_holdout dataset is chemically different from the dataset this model was trained on.
4.2 Plot Generated Predictions
Scatter plots can be used to examine the model’s performance and identify outliers where the model is less accurate. The Project Table scatter plot functionality can plot the true logS values against the predicted values, with pointwise uncertainty illustrated by coloring individual points.
Structures and their predictions from the previous step must be in the Entries to complete this section. If they are not, import them by using File > Import Structures.
- Click Table.
- The Project Table opens.
By default, the Project Table shows a variety of properties. You can adjust the view to focus on the experimental reference for logS and the predictions.
- In the Property Tree, click the All check box twice to hide all the shown properties.
- Click the arrow next to All.
- Click the arrow next to Maestro to see all the properties in this group.
- Check the boxes to show the ochem predictions score, ochem predictions uncertainty, chembl predictions score, and chembl predictions uncertainty.
- Click the arrow next to SD and then click the arrow next to Secondary.
- Check the box for logS.
Note: You might need to scroll down the Project Table to view the values for the properties that have been added.
Note: Only properties shown in the Project Table are available to be used in plots.
You will first plot the reference logS against the prediction for the OChem dataset, and color the data points by the prediction uncertainty.
- For the X-Axis, choose logS.
- For the Y-Axis, choose ochem predictions score.
- For Color by, select ochem predictions uncertainty.
You can adjust the graph settings to make it easier to see how well the prediction matches the reference.
- Check Best fit.
- Check Diagonal.
- In the Plot section, click Axis settings.
- The Change Axis Settings dialog box opens.
- Check the box next to X-Axis.
- Click the chain symbol to link the X and Y axis ranges.
- Click OK.
- Optional: Click Save to persist the plot in the project, allowing further modification, or click Export to save it as an image.
- Close the plot.
For this holdout data, we see good correlation with the experiment. Many of the highest model uncertainties (warm colors) are for compounds with higher error (distance from the y = x line). The best fit line and y = x diagonal show close agreement, and the data dispersion from the diagonal across the dynamic range of the data is small. This is evidence of a well performing model with significant signal.
- Repeat steps 7 - 20 using the chembl_holdout dataset.
It is clear that the model performs well for the OChem solubility data and poorly for the ChEMBL solubility data. Quantitatively, the performance gap is reflected in the R² values of the best fit lines for the two datasets: the OChem set achieves a high R², whereas the ChEMBL set shows a very weak fit.
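If you export the reference and predicted values, the best fit line and its R² can be recomputed outside Maestro. The sketch below uses ordinary least squares and the squared Pearson correlation; it is an assumption that the plot's Best fit option reports the same quantity, and the `best_fit` helper and toy values are illustrative only.

```python
from typing import Sequence, Tuple


def best_fit(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float, float]:
    """Ordinary least-squares line y = slope * x + intercept, plus R^2.

    For a simple linear fit, R^2 equals the squared Pearson correlation
    between x and y.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Toy data: reference logS (x) and predicted logS (y), made-up values.
reference = [-4.0, -3.0, -2.0, -1.0]
predicted = [-3.5, -3.0, -2.5, -2.0]

slope, intercept, r_squared = best_fit(reference, predicted)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, R^2 = {r_squared:.2f}")
```

In this toy case the points lie exactly on a line, so R² is 1 even though the slope is 0.5 rather than 1. This is why it helps to compare the best fit line against the y = x diagonal as well as reading the R² value: a strong correlation alone does not guarantee that predictions match the reference values.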
To gain a better understanding of why the model shows high error on only one of these sets, it is useful to examine the data distributions for the training set and the holdout sets. ML models are known to suffer performance degradation for out-of-distribution predictions. The hypothesis is that the ChEMBL compounds have a different distribution than the AQSolDB compounds. This hypothesis can be checked by plotting the distributions and looking for this chemical shift.
We can visualize the data distributions using common dimensionality reduction techniques applied to bit-fingerprint representations of our molecules. In particular, we will represent the ligands as 512-dimensional Morgan/Circular bit fingerprints, which are then transformed into 50-dimensional vectors via principal component analysis. From this representation, we can further transform the data into two-dimensional vectors using the t-distributed stochastic neighbor embedding (t-SNE) algorithm; these two-dimensional vectors can then be plotted to better understand the data distributions. Our approach to visualization is very similar to the method described in this blog post by Patrick Walters.
In the following t-SNE plots, we show the complete AQSolDB dataset in blue, with ligands from the OChem and ChEMBL holdout sets in orange. We observe that the OChem ligands are distributed amongst the ligands in the AQSolDB set. In contrast, the ChEMBL data set has only a partial overlap with the AQSolDB set, and many of its compounds lie far from the training data in chemical space. We might assume that these compounds are likely to be the outliers contributing to the poor performance.
Figure 4-10. t-SNE visualization of AQSolDB fingerprints (blue dots) and OChem (orange dots, left) or ChEMBL (orange dots, right) fingerprints showing their distributions in chemical space.
To test this intuition, the following figure shows the same distributions of the AQSolDB, OChem, and ChEMBL data, but points in the holdout set are colored by their absolute error. While the error distribution for the OChem set is fairly homogeneous, for the ChEMBL data, clusters on the data periphery show higher than average error. This indicates that the model does not generalize well to these areas of chemical space.
Figure 4-11. Distribution of absolute errors for the OChem (left) and ChEMBL (right) holdout sets in projected chemical space.
The conclusion of this analysis is that DeepAutoQSAR models are powerful tools for molecular property prediction, but the models are naturally limited by the size and diversity of their training sets. The solubility model here performs well on its training and holdout data set, but its accuracy for predicting the solubility of other compounds will depend on the data distribution of those compounds.
Because of this, we recommend that you train DeepAutoQSAR models on your own data sets if you intend to run inference on similar ligands. Global models can be useful, but take care to ensure that models are not used for ill-posed conditions arising from data drift.
5. Pretrained Models Available For Download
As a complement to Schrödinger’s DeepAutoQSAR Benchmark Study of the TDC ADMET prediction challenges, we provide a collection of DeepAutoQSAR models with this tutorial. These models have been trained on the benchmark data sets and can also be downloaded with the links below. These models can be incorporated into Maestro to make predictions for matching experimental endpoints. You can find all models in the Pretrained Models directory of the tutorial zip archive.
Please note that these models are not intended to be global predictive models and users should be aware of the distributional differences between compounds in these training sets and their own ligands.
Finally, these pre-trained models were created without holdout sets to avoid the issues of set selection randomness and bias. Since there is no holdout set, it is expected that there are no metrics showing the model's performance when loaded into the DeepAutoQSAR panel.
Dataset Abbr. | Dataset Type | Dataset Description and TDC Link
6. Conclusion and References
In this tutorial, you trained a DeepAutoQSAR model on the AQSolDB, a large, publicly available data set of ligands and associated water solubility labels. The structure and ensemble composition of this model was observed and then the model was used to predict solubilities for two other publicly available datasets, one from the OChem database and one from ChEMBL. Results showed that DeepAutoQSAR can construct highly accurate models when trained on similar ligands and that model performance can degrade as data shifts away from the training data distribution. Knowing this, many pre-trained machine learning models for several ADMET endpoints have been provided with this tutorial, allowing users to utilize the models with an understanding of their strengths and limitations.
7. Glossary of Terms
Entries - a simplified view of the Project Table that allows you to perform basic operations such as selection and inclusion
included - the entry is represented in the Workspace, the circle in the In column is blue
Project Table - displays the contents of a project and is also an interface for performing operations on selected entries, viewing properties, and organizing structures and data
Scratch Project - a temporary project in which work is not saved, closing a scratch project removes all current work and begins a new scratch project
selected - (1) the atoms are chosen in the Workspace. These atoms are referred to as "the selection" or "the atom selection". Workspace operations are performed on the selected atoms. (2) The entry is chosen in the Entries (and Project Table) and the row for the entry is highlighted. Project operations are performed on all selected entries
Working Directory - the location that files are saved
Workspace - the 3D display area in the center of the main window, where molecular structures are displayed