Molecular Dynamics Descriptors for Machine Learning

Tutorial Created with Software Release: 2023-4

Topics: Consumer Packaged Goods, Informatics and Team Collaboration, Pharmaceutical Formulations, Polymeric Materials

Methodology: All-Atom Molecular Dynamics, Machine Learning

Products Used: DeepAutoQSAR, MS Informatics, MS Maestro

This tutorial is written for use with a 3-button mouse with a scroll wheel.

Words found in the Glossary of Terms are shown like this: Workspace

Abstract:

In this tutorial, we will learn to generate descriptors using molecular dynamics simulations, which can be used to build and improve machine learning models for predicting material properties.

Tutorial Content

Introduction to Molecular Dynamics Descriptors

Creating Projects and Importing Structures

Generating Molecular Dynamics Descriptors

Building Machine Learning Models Using DeepAutoQSAR

Viewing the Machine Learning Models

Conclusion and References

Glossary of Terms

1. Introduction to Molecular Dynamics Descriptors

Accurately measuring complex material properties such as viscosity, which quantifies a fluid's resistance to flow or deformation, can be expensive when done through trial-and-error experimentation. Physics-based models that can predict these material properties are a promising alternative that reduces the need for extensive experimentation. However, even physics-based models can be computationally expensive, taking on the order of days to complete a single prediction. To speed up predictions, we can leverage computationally efficient machine learning (ML) models that capture the complex relationship between molecular structure and bulk material properties. In this tutorial, we demonstrate how descriptors from physics-based models can be used in conjunction with traditional cheminformatics descriptors to improve the accuracy of ML models that predict experimental viscosity. We focus on a small dataset of ~200 examples, a size typical of materials science applications, and demonstrate that physics-based descriptors are useful at this small data scale.

This tutorial provides step-by-step instructions for calculating molecular dynamics (MD) descriptors using the Materials Science Maestro interface. The tutorial also demonstrates the utility of these descriptors by constructing ML models with DeepAutoQSAR for predicting viscosity. We will generate ML models with and without the MD descriptors to demonstrate their effectiveness in predicting viscosity. The overall workflow is summarized in the figure below:

Tutorial workflow showing the conversion of SMILES to molecular structures, the molecular dynamics descriptors and DeepAutoQSAR panels, and the output scatterplots.

The Molecular Dynamics Descriptors panel automates the calculation of several MD descriptors, but we will focus on the following eight: density, free volume %, heat of vaporization, radius of gyration, three solubility parameters, and specific heat. The panel takes a molecular structure or formulation as input, prepares an MD simulation by populating a periodic box with that molecule or formulation, computes MD descriptors after a short MD simulation, and tabulates the descriptors for subsequent ML model building.
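
To make one of these descriptors concrete, the following is a minimal sketch of the standard mass-weighted radius of gyration for a single molecular conformation. It is not the panel's implementation (see the help documentation for how the panel samples and averages over the simulation); the function name and the two-atom coordinates are hypothetical and chosen only for illustration.

```python
import math

def radius_of_gyration(coords, masses):
    """Mass-weighted radius of gyration for one conformation.

    coords: list of (x, y, z) tuples, e.g. in angstroms
    masses: list of atomic masses in the same order as coords
    """
    total_mass = sum(masses)
    # Center of mass along each Cartesian axis
    com = [
        sum(m * r[axis] for m, r in zip(masses, coords)) / total_mass
        for axis in range(3)
    ]
    # Mass-weighted mean squared distance from the center of mass
    msd = sum(
        m * sum((r[axis] - com[axis]) ** 2 for axis in range(3))
        for m, r in zip(masses, coords)
    ) / total_mass
    return math.sqrt(msd)

# Hypothetical input: two equal-mass atoms placed 2 angstroms apart
rg = radius_of_gyration([(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)], [12.0, 12.0])
print(rg)
```

In an MD setting, a value like this would typically be averaged over many molecules and trajectory frames before being used as a descriptor.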

For complete background on the Molecular Dynamics Descriptors panel, including a complete summary of the available descriptors, see the help documentation.

For more information about building machine learning models in Materials Science Maestro, see the introductory sections of the Machine Learning for Materials Science tutorial. To learn about using pre-built machine learning models to predict properties, please refer to the Machine Learning Property Prediction tutorial.

For an introduction to using physics-based methods alone for soft matter property prediction, please refer to any of the following tutorials: Polymer Property Prediction, Viscosity, Surface Tension and Dielectric Properties.

2. Creating Projects and Importing Structures

At the start of the session, change the file path to your chosen Working Directory in MS Maestro to make file navigation easier. Each session in MS Maestro begins with a default Scratch Project, which is not saved. An MS Maestro project stores all your data and has a .prj extension. A project may contain numerous entries corresponding to imported structures, as well as the output of modeling-related tasks. Once a project is saved, the project is automatically saved each time a change is made.

Structures can be built in MS Maestro or imported using File > Import Structures (or by drag-and-drop), and are added to the Entry List and Project Table. The Entry List is located to the left of the Workspace. The Project Table can be accessed with Ctrl+T (Cmd+T) or Window > Project Table if you would like to see an expanded view of your project data.

Double-click the Materials Science icon
- (No icon? See Starting Maestro)

Figure 2-1. Change Working Directory option.

Go to File > Change Working Directory
Find your directory, and click Choose
Pre-generated files are included for running jobs or examining output. Download the zip file here: schrodinger.com/sites/default/files/s3/release/current/Tutorials/zip/md_descriptors.zip
After downloading the zip file, unzip the contents in your Working Directory for ease of access throughout the tutorial

Figure 2-2. Save Project panel.

Go to File > Save Project As
Change the File name to MD_descriptors_tutorial, click Save
- The project is now named MD_descriptors_tutorial.prj

Figure 2-3. Import the starting structures.

In this tutorial, we will use a data set of 200 small molecules. For tutorial purposes, this data set is a randomly selected subset of the complete data set found in the recent literature (DOI:10.26434/chemrxiv-2023-1qfw8). The small molecules and their experimentally determined viscosities are from scientific literature, publications, and online databases.

A .csv file is available in the provided files. It contains SMILES strings for each molecule, as well as corresponding viscosity data and literature references. Let’s import these structures now:

Go to File > Import Structures
Choose input_train.csv from the provided tutorial files
Click Open
- The Import SMILES panel pops up

Figure 2-4. Import SMILES settings.

For SMILES Column: choose CANON_SMILES
For ENTRY TITLE Column: choose Name
Ensure Discard any additional properties is unchecked
Click OK

Figure 2-5. The entry list and a stylized molecule after importing.

The entry list is updated to include the 200 entries. Feel free to stylize and visualize any of the provided structures.

Note: Hydrogen atoms are not added when importing from SMILES. This will not impact this exercise, but it is good to be aware of if running any quantum mechanical calculations that require hydrogens to be present.

Figure 2-6. The Project Table with some imported data displayed.

Each imported molecule also has a viscosity value as determined and reported in the literature (see the Reference column in the provided .csv file). For building machine learning models, we will predict the log transform of viscosity to ameliorate the skewed distribution of viscosity values.
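
As a rough illustration of why the log transform helps, the sketch below compares the skewness of a set of raw values with that of their logarithms. The viscosity values are hypothetical placeholders, not taken from the tutorial data set, and the use of base 10 is an assumption, since the tutorial does not state which logarithm base the log(Viscosity) column uses.

```python
import math

def skewness(values):
    """Population skewness: third central moment over variance^(3/2)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical viscosities spanning several orders of magnitude
viscosities = [0.3, 0.9, 1.2, 2.5, 40.0, 950.0]

# Base-10 logarithm is an assumption made for this sketch
log_viscosities = [math.log10(v) for v in viscosities]

skew_raw = skewness(viscosities)
skew_log = skewness(log_viscosities)
print(f"raw skewness: {skew_raw:.2f}, log skewness: {skew_log:.2f}")
```

A few very viscous liquids dominate the raw distribution, while the log-transformed values are spread more evenly, which generally gives a regression model a better-behaved target.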

To view the imported data in MS Maestro (e.g. Viscosity, log(Viscosity) or the literature references), open the Project Table. If any data of interest are not displayed in a column, add them via the Property Tree under All > Canvas > Secondary.

3. Generating Molecular Dynamics Descriptors

Prior to creating ML models, we will first generate molecular dynamics descriptors for the structures using the MD Descriptors panel.

Figure 3-1. Preparing the MD Descriptors job.

Ensure that all 200 entries are selected (use Shift + Click or create an entry group)
Go to Tasks > Materials > Informatics > Molecular Dynamics
- The MD Descriptors panel opens
Keep Use pure materials from selected
- If using a dataset with formulations you would select Use formulations from
Change the Temperature to 298 K
- The provided data was gathered mostly at or around room temperature
Change the Job name to md_descriptors_viscosity

The MD descriptors panel performs high-throughput, routine molecular dynamics simulations on the selected structures. The protocol includes automated construction of a disordered system, equilibration and tabulation of the eight MD descriptors of interest. The workflow resembles the steps taught in the Disordered System Building and Molecular Dynamics Multistage Workflows tutorial for single component systems. For complete details on the methods underlying the MD Descriptors workflow, please visit the help documentation as well as the publication used to construct the tutorial (DOI:10.26434/chemrxiv-2023-1qfw8).

Figure 3-2. Job settings.

Adjust the job settings as needed
- Note that the run time for this job is highly dependent on available compute resources, requiring CPU and GPU compute nodes for each input structure.

For tutorial purposes, we will not run the job here, and will instead import the output of a pre-run calculation.

Close the MD Descriptors panel

Figure 3-3. Importing the output.

Go to File > Import Structures
Navigate to where you downloaded the tutorial files. Open Section_03 > md_descriptors_viscosity and choose the md_descriptors_viscosity-out.maegz file
Click Open
- A new entry group is added to the entry list. The entry group contains the same 200 molecules from the original input_train (200) entry group, but now these entries have associated molecular dynamics descriptors

Figure 3-4. Viewing some of the descriptors in the Project Table.

We can see these descriptors by opening the Project Table.

Open the Project Table

The properties may not appear by default. To add some of the properties as columns in the Project Table:

Go to the Property Tree, expand All > Materials Science > Primary and All > Materials Science > Secondary, and select any properties you want to display

The MD Descriptors include Density, Free Volume %, Heat of Vaporization, Radius of Gyration, three solubility parameters and Specific Heat.

Note: You can export directly from the Project Table to spreadsheet form if needed by clicking Data > Export > Spreadsheet

Close the Project Table before proceeding to the next section

4. Building Machine Learning Models Using DeepAutoQSAR

In this section, we will build a Quantitative Structure-Activity Relationship (QSAR) model to predict the log(viscosity) of an input structure. We will build two types of models: (1) without the MD descriptors and (2) with the MD descriptors. We will then compare the quality of the models to assess the importance of including MD descriptors in this case.

Figure 4-1. Selecting the entries and opening the DeepAutoQSAR panel.

Ensure that the entire md_descriptors_viscosity-out (200) entry group is selected
- Be sure you have the new entry group selected, containing all of the descriptor data, as opposed to the original structures
Go to Tasks > Browse All > Discovery Informatics and QSAR > DeepAutoQSAR
- The DeepAutoQSAR panel opens

DeepAutoQSAR is one of the main machine learning model-building tools available for materials informatics. DeepAutoQSAR treats a molecule as a graph, where nodes are atoms and edges are bonds. Chemical features of the molecule (atom type, valence, charge, etc.) are attached to each node in the graph. For each atom, convolution operations are applied to neighboring atoms (and itself) to identify patterns relevant to the property of interest. Altogether, DeepAutoQSAR provides an automated way to leverage graph convolutional neural networks and accurately predict material properties for large datasets. You can read more about DeepAutoQSAR on our website or in the help documentation.
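
The idea of a molecule as a graph with per-atom features can be sketched in a few lines. The example below is only a conceptual illustration and not DeepAutoQSAR's implementation: it builds a graph for the heavy atoms of ethanol, attaches a one-hot element feature to each node, and performs one neighborhood-averaging step over each atom and its bonded neighbors. A real graph convolutional network would additionally apply learned weights and a nonlinearity at each step; the two-element vocabulary and the `aggregate` function are hypothetical.

```python
# Ethanol heavy atoms as a graph: C0 - C1 - O2
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]

# One-hot node features over a tiny, hypothetical element vocabulary
vocab = ["C", "O"]
features = [[1.0 if a == el else 0.0 for el in vocab] for a in atoms]

# Adjacency list in which every atom is also its own neighbor (self-loop)
neighbors = {i: [i] for i in range(len(atoms))}
for a, b in bonds:
    neighbors[a].append(b)
    neighbors[b].append(a)

def aggregate(features, neighbors):
    """One step: average each feature over the atom and its bonded neighbors."""
    updated = []
    for i in range(len(features)):
        nbrs = neighbors[i]
        updated.append([
            sum(features[j][k] for j in nbrs) / len(nbrs)
            for k in range(len(features[i]))
        ])
    return updated

h1 = aggregate(features, neighbors)
for atom, vec in zip(atoms, h1):
    print(atom, vec)
```

After one step, the two carbon atoms no longer share the same feature vector, because only one of them is bonded to oxygen. Repeating the step lets information from more distant atoms reach each node.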

Figure 4-2. Setting the Options in the DeepAutoQSAR panel.

Ensure that Build model is checked
Change the Model type to Regression (numeric)
- The log(viscosity) property that we will predict is a continuous numeric value
Ensure that for Use structures from, Project Table (selected entries) is chosen
Change the Prediction property dropdown to log(Viscosity)
For Training set, choose Custom split
- Set Split on property to Train (0) vs Holdout (1)
- Set the Split threshold to property <= 0.00

In this example, rather than randomly splitting the training and holdout data, we have pre-selected a group of 20 random molecules as the holdout set. This lets us directly compare the quality of the machine learning models with and without the molecular dynamics descriptors.

Maintain the rest of the panel defaults
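
The custom split configured above can be pictured as a simple threshold on a flag property. The sketch below assumes that entries satisfying the threshold (flag <= 0.00) form the training set and the rest form the holdout set; the entry names are hypothetical placeholders, and the function is an illustration rather than the panel's own code.

```python
def custom_split(entries, flag_key="Train (0) vs Holdout (1)", threshold=0.0):
    """Split entries into (train, holdout) by thresholding a flag property."""
    # Assumption: entries at or below the threshold are used for training
    train = [e for e in entries if e[flag_key] <= threshold]
    holdout = [e for e in entries if e[flag_key] > threshold]
    return train, holdout

# Hypothetical entries; names are placeholders, not tutorial molecules
entries = [
    {"Name": "mol_a", "Train (0) vs Holdout (1)": 0},
    {"Name": "mol_b", "Train (0) vs Holdout (1)": 1},
    {"Name": "mol_c", "Train (0) vs Holdout (1)": 0},
    {"Name": "mol_d", "Train (0) vs Holdout (1)": 0},
]

train, holdout = custom_split(entries)
print(len(train), "training entries;", len(holdout), "holdout entries")
```

Because the flag is stored with the data rather than drawn at random for each job, both model-building runs see exactly the same training and holdout molecules.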

Figure 4-3. Naming and running the job.

Change the Job name to BuildTask_viscosity_nomddescriptors
Adjust the job settings as needed
- This job requires a CPU or GPU host and takes about 4 hours to complete. The results in the provided tutorial files were generated on a CPU host with 16 processors.
If you would like to perform the calculation, click Run. Otherwise, we will import pre-generated results in the next section.

Figure 4-4. Adding the MD Descriptors.

Keep the DeepAutoQSAR panel open. We will now also run the job with the MD Descriptors included.

Click Add Descriptors
Check the eight main MD descriptors:
- Density, Free Volume, Heat of Vaporization, Radius of Gyration, Three Solubility Parameters and Specific Heat
Click Select
- The panel updates to mention that the Model includes 8 extra descriptors

Figure 4-5. Naming and running the job.

Keep the remaining Options the same to enable a direct comparison of the machine learning models.

Change the Job name to BuildTask_viscosity_mddescriptors
Adjust the job settings as needed
- This job requires a CPU or GPU host and takes about 4 hours to complete. The results in the provided tutorial files were generated on a CPU host with 16 processors.
If you would like to perform the calculation, click Run. Otherwise, we will import pre-generated results in the next section.

5. Viewing the Machine Learning Models

We can proceed to view the machine learning models that were generated, both with and without MD descriptors.

Figure 5-1. Loading the models without MD descriptors.

When the job is complete, note that no new entry group is added to the entry list. The output can be analyzed back in the DeepAutoQSAR panel.

If closed, reopen the DeepAutoQSAR panel
For Choose task, switch to Make Predictions
Click Browse

First, let’s look at the results from the models without MD descriptors.

Navigate to where you downloaded the tutorial files and choose Section_05 > BuildTask_viscosity_nomdddescriptors > BuildTask_viscosity_nomddescriptors_model.qzip
Click Open

Figure 5-2. Viewing the model summary.

Once the model loads, the high-level statistics are printed in the Model Summary.

Immediately we can see that the model is relatively poor, with an r² of 0.1047.

Click View Full Report

Figure 5-3. Viewing the scatterplot.

Go to the Plot tab

The plot shows the predicted versus experimental values for the 20 holdout data points. It is clear that the ML model is not effective and should not be used for making predictions in this case.

Figure 5-4. Loading the models with MD Descriptors.

Close the DeepAutoQSAR Report Viewer and return to the panel
Click Browse again

Now, let’s look at the results from the models including MD descriptors.

Navigate to where you downloaded the tutorial files and choose Section_05 > BuildTask_viscosity_mdddescriptors > BuildTask_viscosity_mddescriptors_model.qzip
Click Open
Once the model loads, the high-level statistics are printed in the Model Summary

Immediately we can see that the model is far better performing, with an r² of 0.7709.

Click View Full Report

Figure 5-5. Viewing the scatterplot.

Go to the Plot tab

The plot shows the predicted versus experimental values for the 20 holdout data points. It is clear that the ML model with MD descriptors is better at predicting experimental viscosities than the ML model without MD descriptors.
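
For readers who want to reproduce a holdout statistic by hand, the following is a minimal sketch of the coefficient of determination, one common definition of r². The report may define its r² differently (for example, as a squared correlation coefficient), so treat this as an illustration rather than the panel's exact formula. The values are hypothetical and are not the tutorial's holdout data.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical experimental values
y_true = [0.0, 1.0, 2.0, 3.0]

# A perfect model reproduces every experimental value
r2_perfect = r_squared(y_true, [0.0, 1.0, 2.0, 3.0])

# A model that always predicts the mean explains none of the variance
r2_mean_only = r_squared(y_true, [1.5, 1.5, 1.5, 1.5])

print(r2_perfect, r2_mean_only)
```

Under this definition, a value near 0 means the model does little better than predicting the average, while a value near 1 means the predictions track the experimental values closely.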

6. Conclusion and References

In this tutorial, we learned how to use the MD Descriptors panel to generate descriptors. We then learned how to use those descriptors to build ML models using the DeepAutoQSAR panel. Finally, we compared the effectiveness of the ML models with and without the inclusion of the MD descriptors. We observed that including MD descriptors significantly improved the prediction accuracy of experimental viscosity. While this tutorial is focused on predicting liquid viscosity, one can imagine these workflows being applied to other material properties such as melting point, glass transition temperatures, and so on.

For further learning:

For introductory content, focused on navigating the Schrödinger Materials Science interface, an Introduction to Materials Science Maestro tutorial is available. Please visit the materials science training website for access to 70+ tutorials. For scientific inquiries or technical troubleshooting, submit a ticket to our Technical Support Scientists at help@schrodinger.com.

For self-paced, asynchronous, online courses in Materials Science modeling, including access to Schrödinger software, please visit the Schrödinger Online Learning portal on our website.

For further reading:

Help documentation on Molecular Dynamics Descriptors and DeepAutoQSAR panels
Advancing Material Property Prediction: Using Physics-Informed Machine Learning Models for Viscosity. DOI:10.26434/chemrxiv-2023-1qfw8
DeepAutoQSAR Hardware Benchmark (Schrödinger white paper)

7. Glossary of Terms

Entry List - a simplified view of the Project Table that allows you to perform basic operations such as selection and inclusion

Included - the entry is represented in the Workspace, the circle in the In column is blue

Project Table - displays the contents of a project and is also an interface for performing operations on selected entries, viewing properties, and organizing structures and data

Recent actions - This is a list of your recent actions, which you can use to reopen a panel, displayed below the Browse row. (Right-click to delete.)

Scratch Project - a temporary project in which work is not saved, closing a scratch project removes all current work and begins a new scratch project

Selected - (1) the atoms are chosen in the Workspace. These atoms are referred to as "the selection" or "the atom selection". Workspace operations are performed on the selected atoms. (2) The entry is chosen in the Entry List (and Project Table) and the row for the entry is highlighted. Project operations are performed on all selected entries

Working Directory - the location where files are saved

Workspace - the 3D display area in the center of the main window, where molecular structures are displayed