Molecular Dynamics Descriptors for Machine Learning
Tutorial Created with Software Release: 2023-4
Topics: Consumer Packaged Goods , Informatics and Team Collaboration , Pharmaceutical Formulations , Polymeric Materials
Methodology: All-Atom Molecular Dynamics , Machine Learning
Products Used: DeepAutoQSAR , MS Informatics , MS Maestro
This tutorial is written for use with a 3-button mouse with a scroll wheel.
Words found in the Glossary of Terms are shown like this: Workspace (the 3D display area in the center of the main window, where molecular structures are displayed)
Abstract:
In this tutorial, we will learn to generate descriptors using molecular dynamics simulations, which can be used to build and improve machine learning models for predicting material properties.
Tutorial Content
1. Introduction to Molecular Dynamics Descriptors
Accurately measuring complex material properties like viscosity, which quantifies a fluid's resistance to flow or deformation, can be expensive when done through trial-and-error experimentation. Physics-based models that can predict these material properties are a promising alternative that reduces the need for extensive experimentation. However, even physics-based models can be computationally expensive and can take on the order of days to complete a single prediction. To speed up predictions, we can leverage computationally efficient machine learning (ML) models that learn the complex relationship between molecular structure and bulk material property. In this tutorial, we demonstrate how descriptors from physics-based models can be used in conjunction with traditional cheminformatics descriptors to improve the accuracy of ML models that predict experimental viscosity. We focus on a small dataset of ~200 examples, a size often encountered in materials science applications, and demonstrate that physics-based descriptors are useful at the small data scale.
This tutorial provides step-by-step instructions for calculating MD descriptors using the Materials Science Maestro interface. The tutorial also demonstrates the utility of these descriptors by constructing ML models with DeepAutoQSAR for predicting viscosity. We will generate ML models with and without the MD descriptors to demonstrate their effectiveness in predicting viscosity. The overall workflow is summarized in the figure below:
Tutorial workflow showing the conversion of SMILES to molecular structures, the molecular dynamics descriptors and DeepAutoQSAR panels, and the output scatterplots.
The Molecular Dynamics Descriptors panel automates the calculation of several MD descriptors, but we will focus on the following eight: density, free volume %, heat of vaporization, radius of gyration, three solubility parameters and specific heat. The panel takes a molecular structure or formulation as input, prepares an MD simulation by populating a periodic box with that molecule or formulation, computes MD descriptors after a short MD simulation, and tabulates the descriptors for subsequent ML model building.
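To make two of these descriptors concrete, the sketch below computes a mass-weighted radius of gyration for a single molecule and a bulk density for a periodic box, using the standard textbook definitions. It is only an illustration: the function names are hypothetical, and the exact definitions and averaging used by the Molecular Dynamics Descriptors panel are described in the help documentation, not here.

```python
import math

AMU_TO_GRAMS = 1.66053906660e-24   # one atomic mass unit in grams
CUBIC_ANGSTROM_TO_CM3 = 1.0e-24    # one cubic angstrom in cubic centimeters


def radius_of_gyration(coords, masses):
    """Mass-weighted radius of gyration, in the same length unit as coords.

    Hypothetical helper: coords is a list of (x, y, z) tuples, masses a list of floats.
    """
    total_mass = sum(masses)
    # Center of mass of the molecule
    center = [
        sum(m * c[axis] for m, c in zip(masses, coords)) / total_mass
        for axis in range(3)
    ]
    # Mass-weighted mean squared distance from the center of mass
    weighted_sq = sum(
        m * sum((c[axis] - center[axis]) ** 2 for axis in range(3))
        for m, c in zip(masses, coords)
    )
    return math.sqrt(weighted_sq / total_mass)


def density_g_per_cm3(total_mass_amu, box_volume_cubic_angstrom):
    """Bulk density of a periodic box from its total mass and volume."""
    grams = total_mass_amu * AMU_TO_GRAMS
    cm3 = box_volume_cubic_angstrom * CUBIC_ANGSTROM_TO_CM3
    return grams / cm3


# Two equal point masses 2 angstroms apart: each sits 1 angstrom from the center of mass.
rg = radius_of_gyration([(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)], [12.0, 12.0])

# An illustrative box holding 18015 amu in 30000 cubic angstroms.
rho = density_g_per_cm3(18015.0, 30000.0)
print(rg, rho)
```

In an actual MD workflow these quantities would be averaged over many simulation frames rather than evaluated on a single configuration.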
For complete background on the Molecular Dynamics Descriptors panel, including a complete summary of the available descriptors, see the help documentation.
For more information about building machine learning models in Materials Science Maestro, see the introductory sections of the Machine Learning for Materials Science tutorial. To learn about using pre-built machine learning models to predict properties, please refer to the Machine Learning Property Prediction tutorial.
For an introduction to using physics-based methods alone for soft matter property prediction, please refer to any of the following tutorials: Polymer Property Prediction, Viscosity, Surface Tension and Dielectric Properties.
2. Creating Projects and Importing Structures
At the start of the session, change the file path to your chosen Working Directory in MS Maestro to make file navigation easier. Each session in MS Maestro begins with a default Scratch Project, which is not saved. An MS Maestro project stores all your data and has a .prj extension. A project may contain numerous entries corresponding to imported structures, as well as the output of modeling-related tasks. Once a project is saved, the project is automatically saved each time a change is made.
Structures can be built in MS Maestro or can be imported using File > Import Structures (or dragged and dropped), and are added to the Entry List and Project Table. The Entry List is located to the left of the Workspace. The Project Table can be accessed by Ctrl+T (Cmd+T) or Window > Project Table if you would like to see an expanded view of your project data.
- Double-click the Materials Science icon
- (No icon? See Starting Maestro)
- Go to File > Change Working Directory
- Find your directory, and click Choose
- Pre-generated files are included for running jobs or examining output. Download the zip file here: schrodinger.com/sites/default/files/s3/release/current/Tutorials/zip/md_descriptors.zip
- After downloading the zip file, unzip the contents in your Working Directory for ease of access throughout the tutorial
- Go to File > Save Project As
- Change the File name to MD_descriptors_tutorial, click Save
- The project is now named MD_descriptors_tutorial.prj
In this tutorial, we will use a data set of 200 small molecules. For tutorial purposes, this data set is a randomly selected subset of the complete data set found in the recent literature (DOI:10.26434/chemrxiv-2023-1qfw8). The small molecules and their experimentally determined viscosities are from scientific literature, publications, and online databases.
A .csv is available in the provided files which contains SMILES strings for each molecule, as well as corresponding viscosity data and literature references. Let’s import these structures now:
- Go to File > Import Structures
- Choose input_train.csv from the provided tutorial files
- Click Open
- The Import SMILES panel pops up
- For SMILES Column: choose CANON_SMILES
- For ENTRY TITLE Column: choose Name
- Ensure Discard any additional properties is unchecked
- Click OK
The entry list is updated to include the 200 entries. Feel free to stylize and visualize any of the provided structures.
Note: Hydrogen atoms are not added when importing from SMILES. This will not impact this exercise, but it is good to be aware of if running any quantum mechanical calculations that require hydrogens to be present.
Each imported molecule also has a viscosity value as determined and reported in the literature (see the Reference column in the provided .csv file). For building machine learning models, we will predict the log transform of viscosity to ameliorate the skewed distribution of viscosity values.
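The effect of the log transform can be sketched outside the interface with a few lines of Python. The sample rows below are placeholders, not measurements from the tutorial data set; the Name and CANON_SMILES column names come from the import step above, while the Viscosity column name and the choice of base 10 are assumptions made for illustration.

```python
import csv
import io
import math

# Placeholder rows only: the viscosity numbers are NOT experimental values.
SAMPLE_CSV = """Name,CANON_SMILES,Viscosity
mol_a,CCO,1.0
mol_b,CCCCCCCCO,10.0
mol_c,OCC(O)CO,1000.0
"""


def add_log_viscosity(rows, column="Viscosity"):
    """Return copies of the rows with an added base-10 log of the viscosity.

    The column name and the log base are assumptions for this sketch.
    """
    transformed = []
    for row in rows:
        value = float(row[column])
        if value <= 0.0:
            raise ValueError(f"viscosity must be positive, got {value}")
        new_row = dict(row)
        new_row["log(Viscosity)"] = math.log10(value)
        transformed.append(new_row)
    return transformed


rows = add_log_viscosity(list(csv.DictReader(io.StringIO(SAMPLE_CSV))))
for row in rows:
    print(row["Name"], row["log(Viscosity)"])
```

Values spanning three orders of magnitude collapse onto an evenly spaced scale after the transform, which is why a skewed target is usually easier to fit in log space.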
To view the imported data in MS Maestro (e.g. Viscosity, log(Viscosity) or the literature references), open the Project Table. If any data of interest are not displayed in a column, add them via the Property Tree under All > Canvas > Secondary.
3. Generating Molecular Dynamics Descriptors
Prior to creating ML models, we will first generate molecular dynamics descriptors for the structures using the MD Descriptors panel.
- Ensure that all 200 entries are selected (use Shift + Click or create an entry group)
- Go to Tasks > Materials > Informatics > Molecular Dynamics
- The MD Descriptors panel opens
- Keep Use pure materials from selected
- If using a dataset with formulations you would select Use formulations from
- Change the Temperature to 298 K
- The provided data was gathered mostly at or around room temperature
- Change the Job name to md_descriptors_viscosity
The MD descriptors panel performs high-throughput, routine molecular dynamics simulations on the selected structures. The protocol includes automated construction of a disordered system, equilibration and tabulation of the eight MD descriptors of interest. The workflow resembles the steps taught in the Disordered System Building and Molecular Dynamics Multistage Workflows tutorial for single component systems. For complete details on the methods underlying the MD Descriptors workflow, please visit the help documentation as well as the publication used to construct the tutorial (DOI:10.26434/chemrxiv-2023-1qfw8).
- Adjust the job settings as needed
- Note that the run time for this job is highly dependent on available compute resources, requiring CPU and GPU compute nodes for each input structure.
For tutorial purposes, we will not run the job here, and will instead import the output of a pre-run calculation.
- Close the MD Descriptors panel
- Go to File > Import Structures
- Navigate to where you downloaded the tutorial files. Open Section_03 > md_descriptors_viscosity and choose the md_descriptors_viscosity-out.maegz file
- Click Open
- A new entry group is added to the entry list. The entry group contains the same 200 molecules from the original input_train (200) entry group, but now these entries have associated molecular dynamics descriptors
We can see these descriptors by opening the Project Table.
The properties may not appear by default. To add some of the properties as columns in the Project Table:
- Go to the Property Tree, expand All > Materials Science > Primary and All > Materials Science > Secondary, and select any of the properties you want to display
The MD Descriptors include Density, Free Volume %, Heat of Vaporization, Radius of Gyration, three solubility parameters and Specific Heat.
Note: You can export directly from the Project Table to spreadsheet form if needed by clicking Data > Export > Spreadsheet
- Close the Project Table before proceeding to the next section
4. Building Machine Learning Models Using DeepAutoQSAR
In this section, we will build a Quantitative Structure-Activity Relationship (QSAR) model to predict the log(viscosity) of an input structure. We will build two models: (1) without the MD descriptors and (2) with the MD descriptors. We will then compare the quality of the two models to assess how much the MD descriptors contribute for this particular property.
- Ensure that the entire md_descriptors_viscosity-out (200) entry group is selected
- Be sure you have the new entry group selected, containing all of the descriptor data, as opposed to the original structures
- Go to Tasks > Browse All > Discovery Informatics and QSAR > DeepAutoQSAR
- The DeepAutoQSAR panel opens
DeepAutoQSAR is one of the main machine learning model-building tools available for materials informatics. DeepAutoQSAR treats a molecule as a graph, where nodes are atoms and edges are bonds. Chemical features of the molecule (atom type, valence, charge, etc.) are attached to each node in the graph. For each atom, convolution operations are applied to neighboring atoms (and itself) to identify patterns relevant to the property of interest. Altogether, DeepAutoQSAR provides an automated way to leverage graph convolutional neural networks and accurately predict material properties for large datasets. You can read more about DeepAutoQSAR on our website or in the help documentation.
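The idea of a convolution over an atom and its bonded neighbors can be illustrated with a deliberately simplified sketch. This is not DeepAutoQSAR's architecture: it omits the learned weights and nonlinearities of a real graph convolutional network and only shows the neighbor-aggregation step, with a hypothetical function name and a toy two-value feature per atom.

```python
def graph_conv_step(features, bonds):
    """One simplified aggregation step: average each atom's features with its neighbors'.

    features: list of per-atom feature vectors (nodes)
    bonds: list of (i, j) index pairs (edges)
    A real graph convolution would also apply learned weights; this sketch does not.
    """
    n_atoms = len(features)
    # Each atom's neighborhood includes the atom itself
    neighborhood = {i: {i} for i in range(n_atoms)}
    for a, b in bonds:
        neighborhood[a].add(b)
        neighborhood[b].add(a)

    n_features = len(features[0])
    updated = []
    for i in range(n_atoms):
        group = neighborhood[i]
        updated.append(
            [sum(features[j][k] for j in group) / len(group) for k in range(n_features)]
        )
    return updated


# Heavy atoms of ethanol (C-C-O) with toy features [is_carbon, is_oxygen]
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
bonds = [(0, 1), (1, 2)]
updated = graph_conv_step(features, bonds)
print(updated)
```

After one step, the central carbon already carries information about the oxygen next to it. Stacking several such steps lets each atom see progressively larger parts of the molecule.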
- Ensure that Build model is checked
- Change the Model type to Regression (numeric)
- The log(viscosity) property that we will predict is a continuous numeric value
- Ensure that for Use structures from, Project Table (selected entries) is chosen
- Change the Prediction property dropdown to log(Viscosity)
- For Training set, choose Custom split
- Set Split on property to Train (0) vs Holdout (1)
- Set the Split threshold to property <= 0.00
In this example, rather than randomly splitting the training and holdout data at build time, we have pre-selected a random group of 20 molecules as the holdout set. Fixing the holdout set in advance lets us directly compare the quality of the machine learning models built with and without the molecular dynamics descriptors on the same molecules.
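The logic of a threshold-based custom split can be sketched as follows. This assumes that entries whose split property satisfies the threshold condition (property <= 0.00) form the training set and the rest form the holdout set, which is our reading of the property name; the entry dictionaries and the function name are hypothetical stand-ins for the project data.

```python
SPLIT_PROPERTY = "Train (0) vs Holdout (1)"


def split_by_threshold(entries, prop=SPLIT_PROPERTY, threshold=0.0):
    """Split entries into (train, holdout) on a numeric flag.

    Assumption: entries with prop <= threshold are training data.
    """
    train = [entry for entry in entries if entry[prop] <= threshold]
    holdout = [entry for entry in entries if entry[prop] > threshold]
    return train, holdout


# Hypothetical stand-in for the 200 project entries, 20 of them flagged as holdout
entries = [
    {"Name": f"mol_{i:03d}", SPLIT_PROPERTY: 1 if i < 20 else 0}
    for i in range(200)
]
train, holdout = split_by_threshold(entries)
print(len(train), len(holdout))  # 180 20
```

Because the flag is stored with each entry, both model builds see exactly the same training and holdout molecules.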
- Maintain the rest of the panel defaults
- Change the Job name to BuildTask_viscosity_nomddescriptors
- Adjust the job settings as needed
- This job requires a CPU or GPU host. The job will be completed in about 4 hours. In the provided tutorial files, a CPU host with 16 processors was used.
- If you would like to perform the calculation, click Run. Otherwise, we will import pre-generated results in the next section.
Keep the DeepAutoQSAR panel open. We will now also run the job with the MD Descriptors included.
- Click Add Descriptors
- Check the eight main MD descriptors:
- Density, Free Volume, Heat of Vaporization, Radius of Gyration, Three Solubility Parameters and Specific Heat
- Click Select
- The panel updates to mention that the Model includes 8 extra descriptors
Keep the remaining Options the same to enable a direct comparison of the machine learning models.
- Change the Job name to BuildTask_viscosity_mddescriptors
- Adjust the job settings as needed
- This job requires a CPU or GPU host. The job will be completed in about 4 hours. In the provided tutorial files, a CPU host with 16 processors was used.
- If you would like to perform the calculation, click Run. Otherwise, we will import pre-generated results in the next section.
5. Viewing the Machine Learning Models
We can proceed to view the machine learning models that were generated, both with and without MD descriptors.
When the job is complete, note that no new entry group is added to the entry list. The output can be analyzed back in the DeepAutoQSAR panel.
- If closed, reopen the DeepAutoQSAR panel
- For Choose task, switch to Make Predictions
- Click Browse
First, let’s look at the results from the models without MD descriptors.
- Navigate to where you downloaded the tutorial files and choose Section_05 > BuildTask_viscosity_nomdddescriptors > BuildTask_viscosity_nomddescriptors_model.qzip
- Click Open
Once the model loads, the high level statistics are printed in the Model Summary.
Immediately we can see that the model is relatively poor, with an r2 of 0.1047.
- Click View Full Report
- Go to the Plot tab
The plot shows the predicted versus experimental for the 20 holdout data points. It is clear that the ML model is not effective and should not be used for making predictions in this case.
- Close the DeepAutoQSAR Report Viewer and return to the panel
- Click Browse again
Now, let’s look at the results from the models including MD descriptors.
- Navigate to where you downloaded the tutorial files and choose Section_05 > BuildTask_viscosity_mdddescriptors > BuildTask_viscosity_mddescriptors_model.qzip
- Click Open
- Once the model loads, the high level statistics are printed in the Model Summary
Immediately we can see that the model is far better performing, with an r2 of 0.7709.
- Click View Full Report
- Go to the Plot tab
The plot shows the predicted versus experimental for the 20 holdout data points. It is clear that the ML model with MD descriptors is better at predicting experimental viscosities as compared to the ML model without MD descriptors.
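For readers who export the holdout predictions and want to check a goodness-of-fit value themselves, the sketch below computes the coefficient of determination from observed and predicted values. It uses the common 1 - SS_res / SS_tot definition; whether the r2 reported in the Model Summary follows this definition or another (such as the squared Pearson correlation) is not stated in this tutorial, so treat this as an assumption. The numbers in the example are placeholders, not tutorial results.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("observed and predicted must be non-empty and equal in length")
    mean_observed = sum(observed) / len(observed)
    # Residual sum of squares: error of the model's predictions
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: error of always predicting the mean
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Placeholder values only
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
score = r_squared(observed, predicted)
print(round(score, 4))
```

A value near 1 means the predictions track the experimental values closely, while a value near 0 means the model does little better than predicting the mean of the holdout set.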
6. Conclusion and References
In this tutorial, we learned how to use the MD Descriptors panel to generate descriptors. We then learned how to use those descriptors to build ML models using the DeepAutoQSAR panel. Finally, we compared the effectiveness of the ML models with and without the inclusion of the MD descriptors. We observed that including MD descriptors significantly improved the prediction accuracy of experimental viscosity. While this tutorial is focused on predicting liquid viscosity, one can imagine these workflows being applied to other material properties such as melting point, glass transition temperatures, and so on.
For further learning:
For introductory content, focused on navigating the Schrödinger Materials Science interface, an Introduction to Materials Science Maestro tutorial is available. Please visit the materials science training website for access to 70+ tutorials. For scientific inquiries or technical troubleshooting, submit a ticket to our Technical Support Scientists at help@schrodinger.com.
For self-paced, asynchronous, online courses in Materials Science modeling, including access to Schrödinger software, please visit the Schrödinger Online Learning portal on our website.
For some related practice, proceed to explore other relevant tutorials:
- For more machine learning:
- Machine Learning for Materials Science
- Periodic Descriptors for Inorganic Solids
- Optoelectronics Active Learning
- Machine Learning for Sweetness
- Machine Learning for Ionic Conductivity
- Cheminformatics Machine Learning for Homogeneous Catalysis
- Machine Learning Property Prediction
- Machine Learning for Formulations
- Optimizing Viscosity and Cost in Formulations with Missing Structural Data
- For relevant physics-based workflows:
For further reading:
- Help documentation on Molecular Dynamics Descriptors and DeepAutoQSAR panels
- Advancing Material Property Prediction: Using Physics-Informed Machine Learning Models for Viscosity. DOI:10.26434/chemrxiv-2023-1qfw8
- DeepAutoQSAR Hardware Benchmark (Schrödinger white paper)
7. Glossary of Terms
Entry List - a simplified view of the Project Table that allows you to perform basic operations such as selection and inclusion
Included - the entry is represented in the Workspace, the circle in the In column is blue
Project Table - displays the contents of a project and is also an interface for performing operations on selected entries, viewing properties, and organizing structures and data
Recent actions - This is a list of your recent actions, which you can use to reopen a panel, displayed below the Browse row. (Right-click to delete.)
Scratch Project - a temporary project in which work is not saved, closing a scratch project removes all current work and begins a new scratch project
Selected - (1) the atoms are chosen in the Workspace. These atoms are referred to as "the selection" or "the atom selection". Workspace operations are performed on the selected atoms. (2) The entry is chosen in the Entry List (and Project Table) and the row for the entry is highlighted. Project operations are performed on all selected entries
Working Directory - the location where files are saved
Workspace - the 3D display area in the center of the main window, where molecular structures are displayed