Machine Learning for Sweetness

Quantitative Structure-Activity Relationship (QSAR) models are useful tools for efficiently predicting material properties across a wide range of molecules. Schrödinger’s AutoQSAR tool is easy to use and automates the generation of accurate QSAR machine learning models. For practice, tutorials that use the Materials Science Maestro suite to predict properties of small molecules, polymers and periodic systems are available: Machine Learning for Materials Science, Polymer Descriptors for Machine Learning, Cheminformatics Machine Learning for Homogeneous Catalysis and Periodic Descriptors for Inorganic Solids.

AutoQSAR models perform well when trained on ‘small’ datasets (<5000 molecules). However, in big-data scenarios, deep learning approaches have emerged as powerful tools for generating predictive models that are more accurate than conventional QSAR models (see comparison here). For example, one DeepAutoQSAR approach to QSAR is to use a graph convolutional neural network to predict material properties given a molecular structure as input. DeepAutoQSAR treats a molecule as a graph, where nodes are atoms and edges are bonds. Chemical features of the molecule (atom type, valence, charge, etc.) are attached to each node in the graph. For each atom, convolution operations are applied to neighboring atoms (and itself) to identify patterns relevant to the property of interest. Altogether, DeepAutoQSAR provides an automated way to leverage graph convolutional neural networks and accurately predict material properties for large datasets. You can read more about DeepAutoQSAR on our website.
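To make the graph picture concrete, here is a minimal sketch of a single convolution step on a toy molecular graph. It is not the DeepAutoQSAR implementation: the two-element feature vector, the mean aggregation, and the function names are all assumptions chosen for illustration.

```python
# Toy molecular graph for the heavy atoms of ethanol (C-C-O); hydrogens omitted.
# Hypothetical node features: [atomic number, number of bonded heavy atoms].
node_features = {
    0: [6.0, 1.0],  # methyl carbon
    1: [6.0, 2.0],  # methylene carbon
    2: [8.0, 1.0],  # hydroxyl oxygen
}
bonds = [(0, 1), (1, 2)]  # edges of the graph


def neighbors_with_self(n_atoms, bonds):
    """Map each atom index to the set containing itself and its bonded neighbors."""
    adjacency = {i: {i} for i in range(n_atoms)}
    for a, b in bonds:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def graph_convolution(features, bonds, weights):
    """One simplified convolution step.

    For each atom, average the features over the atom and its neighbors,
    then apply a linear map (weights) followed by a ReLU.
    """
    adjacency = neighbors_with_self(len(features), bonds)
    updated = {}
    for atom, group in adjacency.items():
        n_feat = len(features[atom])
        mean = [sum(features[j][k] for j in group) / len(group) for k in range(n_feat)]
        updated[atom] = [max(0.0, sum(w * m for w, m in zip(row, mean))) for row in weights]
    return updated


# Identity weights make the neighborhood averaging easy to inspect;
# in a trained network these weights would be learned.
identity = [[1.0, 0.0], [0.0, 1.0]]
hidden = graph_convolution(node_features, bonds, identity)
print(hidden[0], hidden[2])
```

After one step, each atom's vector already mixes in information from its bonded neighbors; stacking several such steps lets patterns spanning larger fragments of the molecule emerge.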

In this tutorial, we will use the DeepAutoQSAR panel in MS Maestro to create a machine learning model to predict whether or not a molecule is sweet, using a data set from BitterSweet (see References). Training data of ~2000 molecules are provided that are labeled sweet (1) or not sweet (0) for a binary classification task. While this dataset is small enough that AutoQSAR could also be efficiently implemented, here we will learn to use the DeepAutoQSAR workflow, which in this case results in a somewhat better model than AutoQSAR alone.

Figure 1. Tutorial workflow showing the conversion of SMILES to 3D structures, the DeepAutoQSAR panel used to build machine learning models and the output receiver operating characteristic curve after model training.

2. Creating Projects and Importing Structures

At the start of the session, change the file path to your chosen Working Directory in MS Maestro to make file navigation easier. Each session in MS Maestro begins with a default Scratch Project, which is not saved. An MS Maestro project stores all your data and has a .prj extension. A project may contain numerous entries corresponding to imported structures, as well as the output of modeling-related tasks. Once a project is saved, it is automatically saved again each time a change is made.

Structures can be built in MS Maestro or imported using File > Import Structures (or by drag and drop), and are added to the Entry List and Project Table. The Entry List is located to the left of the Workspace. The Project Table can be accessed by Ctrl+T (Cmd+T) or Window > Project Table if you would like to see an expanded view of your project data.

Double-click the Materials Science icon
- (No icon? See Starting Maestro)

Figure 2-1. Change Working Directory option.

Go to File > Change Working Directory
Find your directory, and click Choose
Pre-generated files are included for running jobs or examining output. Download the zip file here: schrodinger.com/sites/default/files/s3/release/current/Tutorials/zip/sweetness_ml.zip
After downloading the zip file, unzip the contents in your Working Directory for ease of access throughout the tutorial

Figure 2-2. Save Project panel.

Go to File > Save Project As
Change the File name to sweetness_ml_tutorial, click Save
- The project is now named sweetness_ml_tutorial.prj

Figure 2-3. Import the starting structures.

In this tutorial, we will use a data set of 1918 small molecules, defined by SMILES strings and marked as either sweet (1) or not sweet (0). The data is from BitterSweet: Building machine learning models for predicting the bitter and sweet taste of small molecules.

Let’s import these structures now:

Go to File > Import Structures
Change Files of type to Smiles (*.smi*.csv…)
Navigate to where you downloaded the provided tutorial files (presumably in your working directory) and choose Section_02 > train.csv
Click Open
- The Import SMILES panel pops up

Figure 2-4. Import SMILES settings.

For SMILES Column: choose SMILES
- The SMILES column in the CSV contains the SMILES patterns
For ENTRY TITLE Column: choose Title
Ensure Discard any additional properties is unchecked
Click OK
- The import will take a few moments

Figure 2-5. Viewing the Sweet tag in the Project Table.

Each entry has a ‘Sweet’ value associated with it (0 or 1). These can be visualized in the Project Table. Use the Property Tree to add the Sweet property (under All > [Unclassified] > Sweet). You can also add the ‘Sweet’ property to your entry list via Change table settings.
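Before importing, it can be useful to check the CSV outside of MS Maestro. The sketch below counts the class labels and flags rows without a SMILES string. The column names SMILES, Title and Sweet follow the import settings above; the three sample rows are illustrative stand-ins and are not taken from train.csv, and the column order is an assumption.

```python
import csv
import io
from collections import Counter

# Illustrative stand-in for train.csv; these rows are NOT from the tutorial data.
sample_csv = io.StringIO(
    "Title,SMILES,Sweet\n"
    "glucose,OCC1OC(O)C(O)C(O)C1O,1\n"
    "caffeine,Cn1cnc2c1c(=O)n(C)c(=O)n2C,0\n"
    "glycine,NCC(=O)O,1\n"
)


def summarize_labels(handle):
    """Return the row count, per-class counts, and titles of rows lacking a SMILES string."""
    rows = list(csv.DictReader(handle))
    missing = [row["Title"] for row in rows if not row["SMILES"].strip()]
    counts = Counter(int(row["Sweet"]) for row in rows)
    return len(rows), counts, missing


n_rows, label_counts, missing_smiles = summarize_labels(sample_csv)
print(n_rows, dict(label_counts), missing_smiles)
```

To run this on the real file, replace the `io.StringIO` object with `open("train.csv", newline="")`, adjusting the path to your working directory.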

3. Building a Machine Learning Model Using DeepAutoQSAR

In this section, we will use the DeepAutoQSAR panel to train a machine learning model for categorical sweetness prediction.

Figure 3-1. Choosing task and options.

Ensure that all 1918 entries are selected in the entry list by selecting the entry group
Go to Tasks > Materials > Informatics > DeepAutoQSAR
- The DeepAutoQSAR panel opens
Ensure that Build model is checked
For Model type, choose Classification
- Because the data is categorical (0, 1), we use a classification model
- For categorical data, the class labels should always be integers (0, 1, 2, etc., as needed)
Ensure that for Use structures from, Project Table (1918 selected entries) is chosen
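If your own data set stores classes as text rather than integers, the labels can be converted before import. The following is a minimal sketch; the `encode_labels` helper and the text labels are hypothetical and are not part of DeepAutoQSAR.

```python
def encode_labels(raw_labels, class_order):
    """Map text class labels to integer codes 0, 1, 2, ... in the given order."""
    mapping = {name: code for code, name in enumerate(class_order)}
    unknown = sorted(set(raw_labels) - set(mapping))
    if unknown:
        raise ValueError(f"labels missing from class_order: {unknown}")
    return [mapping[label] for label in raw_labels]


# Listing "not sweet" first reproduces this tutorial's convention:
# not sweet = 0, sweet = 1.
codes = encode_labels(["sweet", "not sweet", "sweet"], ["not sweet", "sweet"])
print(codes)
```

Passing the class order explicitly, rather than sorting the labels, keeps the meaning of each integer code under your control.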

Figure 3-2. Defining the training.

Change the Prediction property to Sweet via the dropdown menu
Set 90% for the Random split
- This sets how the data is divided between the training and test sets: 90% of the data is used to train the model and 10% of the data is used to test the model
- With ~2000 molecules, the 90:10 split ensures that there is significantly more data in the training set than the test set, but still enough data in the test set to assess model performance.
Set the Training time to 12 hours

Note: When the specified time limit is reached, the model currently being trained is allowed to finish, but no subsequent models are trained. The total elapsed time may therefore significantly exceed this limit if a particular model requires an extended training duration. A minimum of two replicates is trained for every model type. For datasets with ~2000 inputs like this one, a 12 hour training time can help ensure the best models are determined.
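To see what the 90% random split set above means for this data set, here is a minimal sketch of a seeded random split. The `random_split` helper and the seed are assumptions for illustration; the way DeepAutoQSAR partitions the data internally may differ.

```python
import random


def random_split(items, train_fraction=0.9, seed=0):
    """Shuffle a copy of items and split it into training and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


entries = list(range(1918))  # stand-in indices for the 1918 imported entries
train, test = random_split(entries)
print(len(train), len(test))  # prints "1726 192"
```

With 1918 entries, roughly 190 molecules are held out, which is enough to estimate classification metrics while leaving most of the data for training.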

Figure 3-3. Naming and running the job.

Change the Job name to BuildTask_Sweetness
Adjust the job settings as needed
- This job requires a Linux host
This job takes roughly 5 hours to complete. If you would like to run the job yourself, click Run. Otherwise, provided data is available for proceeding in Section 4.

Note: If you do not have remote submission to Linux from MS Maestro set up, an alternative is to use the Write option in the job settings dropdown to generate a directory containing the submission scripts, which could then be transferred and run on a Linux host.

Close the DeepAutoQSAR panel

4. Viewing the Machine Learning Model and Predicting

Using the DeepAutoQSAR panel, we can proceed to view the generated models, and use these to make predictions on a small data set.

Figure 4-1. Loading the trained model.

When the job is complete, note that no new entry group is added to the entry list. The output can be analyzed and used for predictions back in the DeepAutoQSAR panel:

Return to Tasks > Materials > Informatics > DeepAutoQSAR
- The DeepAutoQSAR panel opens
For Choose task, switch to Make Predictions
To choose the Model file click Browse, navigate to the Section_04 > BuildTask_sweetness > BuildTask_sweetness_model.qzip file, select it, and click Open
- The panel will parse the .qzip file and the Model Summary section will be populated

Figure 4-2. Viewing the Model Summary.

Begin by analyzing the Model Summary output. The data presented is a summary of the statistics of the model.

Click View Full Report

The precision, sensitivity and specificity values are reported at the F1 point of the models (i.e. the classification threshold where the harmonic mean of precision and sensitivity is maximized; this is a conventional ‘sweet spot’ for classification model analysis). Each of these metrics represents an important dimension for measuring the performance of a classification model:

AUC
- Definition: Area under the ROC curve
- Interpretation: Overall classification performance across all classification thresholds.

Precision
- Definition: TP / (TP + FP)
- Interpretation: How many of a model’s positively classified points are actually relevant?

Sensitivity (aka Recall)
- Definition: TP / (TP + FN)
- Interpretation: How good is a model at detecting positives in a data set?

Specificity
- Definition: TN / (TN + FP)
- Interpretation: How good is a model at avoiding false alarms?

Note: TP = true positive; TN = true negative; FP = false positive; FN = false negative
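The definitions above, together with the F1 point, can be sketched in a few lines. The toy labels and scores are illustrative and not taken from the tutorial data, and the helper names are hypothetical. Treating a score equal to the threshold as a positive prediction is an assumption.

```python
def confusion_counts(y_true, y_prob, threshold):
    """Count TP, FP, TN and FN when scores >= threshold are classified as positive."""
    tp = fp = tn = fn = 0
    for truth, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1 and truth == 0:
            fp += 1
        elif predicted == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def metrics_at(y_true, y_prob, threshold):
    """Precision, sensitivity, specificity and F1 at one classification threshold."""
    tp, fp, tn, fn = confusion_counts(y_true, y_prob, threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    total = precision + sensitivity
    f1 = 2 * precision * sensitivity / total if total else 0.0
    return {"precision": precision, "sensitivity": sensitivity,
            "specificity": specificity, "f1": f1}


def f1_point(y_true, y_prob):
    """Return the threshold, among the observed scores, that maximizes F1."""
    return max(sorted(set(y_prob)), key=lambda t: metrics_at(y_true, y_prob, t)["f1"])


# Illustrative labels and predicted probabilities.
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
y_prob = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

best = f1_point(y_true, y_prob)
print(best, metrics_at(y_true, y_prob, best))
```

For this toy data the F1 point falls at a threshold of 0.4, where all four positives are recovered at the cost of one false positive (precision 0.8, sensitivity 1.0, specificity 0.75).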

Figure 4-3. Viewing the Report tab.

The Report tab includes a raw copy of the JSON output of DeepAutoQSAR. This report provides detailed information about the top-performing model ensemble throughout the training process, including its evaluation metrics, the model architecture used (e.g., deep neural networks, random forests, etc.), and relevant model hyperparameters.

Click on the Plot tab

Figure 4-4. Viewing the ROC curve.

For categorical models such as this, the Plot tab shows an ROC plot. An ideal classifier would have an ROC curve that passes through the upper-left corner, with a True Positive Rate of 1, a False Positive Rate of 0, and an Area Under the Curve of 1.
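As a rough illustration of how an ROC curve and its area are obtained from predicted probabilities, consider the sketch below. It uses plain Python and the trapezoidal rule; the data are illustrative and the function names are hypothetical, not part of DeepAutoQSAR.

```python
def roc_points(y_true, y_prob):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present to draw an ROC curve")
    points = [(0.0, 0.0)]
    # Lower the threshold step by step so that more entries are called positive.
    for threshold in sorted(set(y_prob), reverse=True):
        tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
        fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Integrate the ROC curve with the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Illustrative labels and scores: one positive is ranked below one negative.
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
y_prob = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
auc_toy = area_under_curve(roc_points(y_true, y_prob))

# A perfectly separated data set reaches the upper-left corner.
auc_ideal = area_under_curve(roc_points([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]))
print(auc_toy, auc_ideal)  # prints "0.9375 1.0"
```

The toy value of 0.9375 equals the fraction of positive/negative pairs that are ranked correctly (15 of 16), which is another common way to read the AUC.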

Close the DeepAutoQSAR Report Viewer

Note: For regression models, a scatter plot and regression line are shown.

Figure 4-5. Imported structures in the entry list.

Now, we will use the constructed model to make predictions on an unseen data set of organic molecules that were not in the training data. These molecules have known sweetness values from BitterSweet, which we can use to assess the quality of the model for making predictions outside the training set.

Close the DeepAutoQSAR panel (or simply minimize it or adjust to the side of your window – we will return to it in a moment)
Go to File > Import Structures
Navigate to where you downloaded the provided tutorial files (presumably in your working directory), choose Section_04 > test.csv and click Open. If you need a refresher on how to import SMILES strings, revisit Section 2.
- 214 new entries are added to the entry list.
Select the entire test (214) group from the entry list
- Recall that selecting means to highlight the group in the entry list

Figure 4-6. Settings for the prediction job in the DeepAutoQSAR panel.

Return to the DeepAutoQSAR panel
Ensure that the panel reflects the progress from the above steps: Make Predictions is selected, the .qzip file is loaded and the Model Summary is shown
In the Make Predictions section of the panel, ensure that Project Table (214 selected entries) is chosen for Use structures from
For Output property name, input Predicted_Sweetness
- This will be the name of the predicted property in the project table

Figure 4-7. Running the prediction job.

Change the Job name to PredictTask_Sweetness
Adjust the job settings as needed
- This job requires a Linux host
This job takes roughly 5 minutes to complete. If you would like to run the job yourself, click Run. Otherwise, provided data is available for import from the provided tutorial files: Section_04 > PredictTask_Sweetness > PredictTask_Sweetness_output.maegz
Close the DeepAutoQSAR panel

Figure 4-8. Viewing the output in the Project Table.

When the job is complete or after importing, a new entry group titled PredictTask_Sweetness_output (214) is added to the entry list, containing the same 214 entries, but now with predicted sweetness properties. The data can be analyzed in the Project Table.

Open the Project Table
Use the Property Tree to include the Predicted Sweetness score and uncertainty properties (check the properties of interest under All > Maestro)

We can see predicted scores for the various molecules as well as uncertainty values. Note that the output values are probabilities. We can apply a simple cutoff of <0.5 as 0 and >0.5 as 1 for making sweetness predictions with our categorical model.

To compare these values to the known values we can export data to a spreadsheet (Project Table > Data > Export > Spreadsheet) and analyze.

The analysis between the predicted and actual values shows that the model predicted the sweetness of 86% of the unseen compounds correctly. View the sweetness_prediction_analysis.xlsx or .csv spreadsheet in the provided tutorial files for more details on the analysis. The results suggest that the ML model derived from the DeepAutoQSAR panel could generalize to an unseen data set of organic molecules. While this tutorial uses a relatively small dataset, one could expect that a larger training set would further improve prediction accuracy.
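The comparison described above can also be scripted. The sketch below applies the 0.5 cutoff to an exported spreadsheet and reports the fraction of correct predictions. The column headers, the four sample rows, and the handling of a score of exactly 0.5 are assumptions; check the headers in your own export, which may differ.

```python
import csv
import io

# Illustrative stand-in for a spreadsheet exported from the Project Table;
# these rows are NOT the tutorial results.
exported = io.StringIO(
    "Title,Sweet,Predicted_Sweetness\n"
    "mol_a,1,0.93\n"
    "mol_b,0,0.12\n"
    "mol_c,1,0.41\n"
    "mol_d,0,0.08\n"
)


def score_predictions(handle, cutoff=0.5):
    """Return the accuracy and the titles of misclassified entries."""
    rows = list(csv.DictReader(handle))
    correct = 0
    wrong = []
    for row in rows:
        # Assumption: a probability equal to the cutoff is assigned to class 1.
        predicted = 1 if float(row["Predicted_Sweetness"]) >= cutoff else 0
        if predicted == int(row["Sweet"]):
            correct += 1
        else:
            wrong.append(row["Title"])
    return correct / len(rows), wrong


accuracy, misclassified = score_predictions(exported)
print(f"{accuracy:.0%}", misclassified)  # prints "75% ['mol_c']"
```

Listing the misclassified entries alongside the accuracy makes it easier to spot whether the errors cluster in a particular chemical class.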

5. Conclusion and References

In this tutorial, we learned how to use the DeepAutoQSAR panel to build machine learning models to accurately classify whether a molecule is sweet or not.


For further learning:

For introductory content, focused on navigating the Schrödinger Materials Science interface, an Introduction to Materials Science Maestro tutorial is available. Please visit the materials science training website for access to 100+ tutorials. For scientific inquiries or technical troubleshooting, submit a ticket to our Technical Support Scientists at help@schrodinger.com.

For self-paced, asynchronous, online courses in Materials Science modeling, including access to Schrödinger software, please visit the Schrödinger Online Learning portal on our website.

For some related practice, proceed to explore other relevant tutorials:


For further reading:

Help documentation on DeepAutoQSAR
BitterSweet: Building machine learning models for predicting the bitter and sweet taste of small molecules. DOI:10.1038/s41598-019-43664-y
BitterDB: Taste ligands and receptors database. DOI:10.1093/nar/gky974
BitterDB: a database of bitter compounds. DOI:10.1093/nar/gkr755
DeepAutoQSAR: Scalable, Intuitive, Deep-learning QSAR models for Big Data Applications (Schrödinger white paper)
DeepAutoQSAR Hardware Benchmark (Schrödinger white paper)
Design of Organic Electronic Materials With a Goal-Directed Generative Model Powered by Deep Neural Networks and High-Throughput Molecular Simulations. DOI:10.3389/fchem.2021.800370
Active Learning Accelerates Design and Optimization of Hole-Transporting Materials for Organic Electronics. DOI:10.3389/fchem.2021.800371

6. Glossary of Terms

Entry List - a simplified view of the Project Table that allows you to perform basic operations such as selection and inclusion

Included - the entry is represented in the Workspace, the circle in the In column is blue

Project Table - displays the contents of a project and is also an interface for performing operations on selected entries, viewing properties, and organizing structures and data

Recent actions - This is a list of your recent actions, which you can use to reopen a panel, displayed below the Browse row. (Right-click to delete.)

Scratch Project - a temporary project in which work is not saved; closing a scratch project removes all current work and begins a new scratch project

Selected - (1) the atoms are chosen in the Workspace. These atoms are referred to as "the selection" or "the atom selection". Workspace operations are performed on the selected atoms. (2) The entry is chosen in the Entry List (and Project Table) and the row for the entry is highlighted. Project operations are performed on all selected entries

Working Directory - the location where files are saved

Workspace - the 3D display area in the center of the main window, where molecular structures are displayed