Machine Learning for Formulations

Tutorial Created with Software Release: 2025-3
Topics: Consumer Packaged Goods, Informatics and Team Collaboration, Pharmaceutical Formulations, Polymeric Materials
Methodology: Machine Learning
Products Used: AutoQSAR, MS Informatics, MS Maestro

Tutorial files

212 MB

This tutorial is written for use with a 3-button mouse with a scroll wheel.
Words found in the Glossary of Terms are shown like this: Workspace (the 3D display area in the center of the main window, where molecular structures are displayed)

 

Abstract:

 

In this tutorial, we will learn to use the Formulation Machine Learning panel to build and apply machine learning models to predict the density of multicomponent mixtures.

 

Tutorial Content
  1. Introduction to Machine Learning for Formulations
  2. Creating Projects and Importing Structures
  3. Building a Machine Learning Model for Formulations
  4. Applying the Machine Learning Model to a Test Set
  5. Conclusion and References
  6. Glossary of Terms

1. Introduction to Machine Learning for Formulations

A formulation is a multicomponent mixture prepared according to a specific process and a particular composition of each component. There are many approaches for determining the properties of a formulation, such as traditional experimentation or physics-based simulations. While physics-based simulations can reduce extensive trial-and-error experimentation, they can be computationally expensive, taking on the order of hours to days to compute a single property value. To speed up predictions, we can leverage machine learning models that capture the complex relationship between a formulation and its properties at a lower computational expense than physics-based models.

In this tutorial, we demonstrate how to predict density based on formulation data. Specifically, we train a machine learning model using 500 formulation examples and their corresponding densities. The formulations are characterized by the number of components, the molecular structures of the components, and the percent compositions. The data set was created by performing molecular dynamics simulations of solvent mixtures selected by miscibility tables from the CRC handbook. We considered multiple solvents as miscible with each other if each pair of solvents is experimentally miscible. For these mixtures, we arbitrarily varied the number of components from 1 to 5. From the molecular dynamics simulations, we extracted the density of the mixture, which is a representative formulation property relevant to many applications (e.g. battery electrolytes). Once the machine learning model is trained, we can visualize its performance, and apply it towards the prediction of density of new, unseen formulations. The overall workflow is summarized in the figure below:
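The pairwise miscibility criterion described above can be sketched in a few lines of Python. This is only an illustration: the `MISCIBLE_PAIRS` table and its solvent names are hypothetical placeholders rather than the CRC handbook data used to build the data set, and `is_miscible_mixture` is a helper name invented for this sketch.

```python
from itertools import combinations

# Hypothetical miscibility table: each frozenset is a pair treated as miscible.
# These entries are illustrative placeholders, not data from the tutorial.
MISCIBLE_PAIRS = {
    frozenset({"water", "ethanol"}),
    frozenset({"water", "acetone"}),
    frozenset({"ethanol", "acetone"}),
    frozenset({"ethanol", "hexane"}),
}


def is_miscible_mixture(solvents, miscible_pairs=MISCIBLE_PAIRS):
    """Return True if every pair of distinct solvents appears in the table."""
    unique = sorted(set(solvents))
    # A single-component system has no pairs to check, so all() returns True.
    return all(
        frozenset(pair) in miscible_pairs for pair in combinations(unique, 2)
    )


if __name__ == "__main__":
    print(is_miscible_mixture(["water", "ethanol", "acetone"]))
    print(is_miscible_mixture(["water", "ethanol", "hexane"]))
```

In this sketch, a mixture is rejected as soon as a single pair is missing from the table, which mirrors the "each pair of solvents" requirement.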

Figure 1. General workflow for using the formulation machine learning panel to develop formulation-property relationships.

Key to the workflow is leveraging the Formulation Machine Learning panel, which facilitates model building, visualization of model performance, and prediction of new formulation examples.

For more information about building machine learning models in Materials Science Maestro, see the introductory sections of the Machine Learning for Materials Science tutorial. To learn about using pre-built machine learning models to predict properties, please refer to the Machine Learning Property Prediction tutorial.

For an introduction to using physics-based methods for formulation property prediction, please refer to any of the following tutorials: Disordered System Building and Molecular Dynamics Multistage Workflows, Building Solvated Systems, or Molecular Dynamics Simulations for Active Pharmaceutical Ingredient (API) Miscibility.

2. Creating Projects and Importing Structures

At the start of the session, change the file path to your chosen Working Directory (the location where files are saved) in MS Maestro to make file navigation easier. Each session in MS Maestro begins with a default Scratch Project, a temporary project in which work is not saved. Closing a scratch project removes all current work and begins a new scratch project. An MS Maestro project stores all your data and has a .prj extension. A project may contain numerous entries corresponding to imported structures, as well as the output of modeling-related tasks. Once a project is saved, the project is automatically saved each time a change is made.

Structures can be built in MS Maestro or imported using File > Import Structures (or by drag and drop), and are added to the Entry List and Project Table. The Entry List, a simplified view of the Project Table that allows you to perform basic operations such as selection and inclusion, is located to the left of the Workspace (the 3D display area in the center of the main window, where molecular structures are displayed). The Project Table displays the contents of a project and is also an interface for performing operations on selected entries, viewing properties, and organizing structures and data. It can be accessed with Ctrl+T (Cmd+T) or Window > Project Table if you would like to see an expanded view of your project data.

  1. Double-click the Materials Science icon

Figure 2-1. Change Working Directory option.

  1. Go to File > Change Working Directory
  2. Find your directory, and click Choose
  3. Pre-generated files are included for running jobs or examining output. Download the zip file here: schrodinger.com/sites/default/files/s3/release/current/Tutorials/zip/ml_formulations.zip
  4. After downloading the zip file, unzip the contents in your Working Directory for ease of access throughout the tutorial

Figure 2-2. Save Project panel.

  1. Go to File > Save Project As
  2. Change the File name to ml_formulations_tutorial, click Save
    • The project is now named ml_formulations_tutorial.prj

3. Building a Machine Learning Model for Formulations

In this section, we will use the Formulation Machine Learning panel to build a model for predicting density of formulations.

For tutorial purposes, we have randomly selected 500 data points from a full data set of >30,000 possible combinations of systems with up to 5 components (though note that the panel can handle thousands of data points).

The data is provided in the tutorial files: train.csv. Each data point contains an ID, up to 5 SMILES strings, compositions made of up to 5 components, a descriptive label, a density (g/cm3), and a count of the number of components. Some example rows are shown here:

Notice that for each component of the mixture (a SMILES string identifying the molecule), there is a corresponding composition percentage. Some entries for SMILES or composition are empty, which means that those components do not exist. For example, a 2-component system will only have 2 SMILES and compositions filled and all other SMILES or composition columns empty. Compositions must sum up to 100%. When preparing your own formulation data set, be sure to follow a similar template. For more on preparing input data for the Formulation Machine Learning panel, please see the help documentation.
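Before loading your own data set, it can help to check the template rules described above: each SMILES has a matching composition, and the compositions sum to 100%. The following is a minimal sketch using only the Python standard library. The column names (`id`, `smiles_1`, `composition_1`, ...) are assumptions made for illustration and may not match the headers in train.csv, and the sample rows are made up.

```python
import csv
import io
import math

# Hypothetical column names; check train.csv for the actual headers.
SMILES_COLS = [f"smiles_{i}" for i in range(1, 6)]
COMP_COLS = [f"composition_{i}" for i in range(1, 6)]


def validate_row(row, tol=1e-6):
    """Return a list of problems found in one formulation row."""
    problems = []
    total = 0.0
    n_components = 0
    for s_col, c_col in zip(SMILES_COLS, COMP_COLS):
        smiles = (row.get(s_col) or "").strip()
        comp = (row.get(c_col) or "").strip()
        if bool(smiles) != bool(comp):
            problems.append(f"{s_col}/{c_col}: fill both or leave both empty")
            continue
        if smiles:
            total += float(comp)
            n_components += 1
    if n_components == 0:
        problems.append("no components")
    elif not math.isclose(total, 100.0, abs_tol=tol):
        problems.append(f"compositions sum to {total:g}, expected 100")
    return problems


def validate_csv(text):
    """Map each row's id to the list of problems found in that row."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["id"]: validate_row(row) for row in reader}


# Illustrative rows only (water "O" and ethanol "CCO"); not taken from train.csv.
SAMPLE = (
    "id,smiles_1,composition_1,smiles_2,composition_2\n"
    "a,O,100,,\n"
    "b,O,60,CCO,30\n"
    "c,O,50,CCO,\n"
)

if __name__ == "__main__":
    for name, problems in validate_csv(SAMPLE).items():
        print(name, problems or "ok")
```

Columns that are absent from the file are treated as empty components, so the same check works for files with fewer than 5 component columns.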

Figure 3-1. The Formulation Machine Learning panel after opening.

The provided .csv file is loaded directly into the Formulation Machine Learning panel (not into MS Maestro):

  1. Go to Tasks > Materials > Informatics > Formulations Machine Learning

 

Figure 3-2. Loading the training data into the panel.

  1. Keep Formulation type as Simple
  2. Click Load Training Data …
  3. Navigate to the provided files (typically in your Working Directory), choose train.csv, and click Open
    • The panel is populated with the training data

Figure 3-3. Viewing the training data in the panel.

Take a moment to view the data in the panel. Scrolling down, we can see examples with the number of components varying from 1 to 5, as well as the density and other imported data columns.

Note that the data is editable within the panel. To alter any molecules, click on the corresponding SMILES string to open the 2D Sketcher. Click on the row to edit the values. Importantly, editing the data in the panel will not edit the imported .csv file (though the Export Data button can be used to create a new file). For this example, we will not edit the data.

Figure 3-4. Plotting the distribution of the training data.

The panel offers a useful way to visualize the distribution of the data set prior to model construction.

  1. On the right side of the panel, for Properties, choose num_components and then density
    • The plot updates to show the distribution of densities in the data set with respect to the number of components
    • Statistical parameters such as R2, root-mean-squared error (RMSE) and Pearson’s r correlation coefficient are printed, and can be toggled on and off with the checkboxes above the plot

Figure 3-5. Setting the parameters on the Build tab.

  1. Go to the Build tab
  2. Maintain the Featurizers as 4 items
    • Clicking the dropdown enables specific selection of features. In this case, we maintain the default of generating all of the available features
  3. Change the Machine learning models to all
    • We include all of the available machine learning models
  4. Change the Target property to density
    • The target property is that which we wish to predict with the machine learning model
  5. Maintain Descriptors as None
    • If you had additional descriptors available with your dataset, you could refer to them here
  6. Change the Hyperparameter tuning steps to 20

The “Hyperparameter tuning steps” setting dictates the total number of model architectures explored during hyperparameter selection. Increasing this value increases the computation time, but it may yield more accurate models by allowing the automated hyperparameter selection workflow to explore more candidates. Here, hyperparameters refer to the choice of featurizers and models selected in the “Featurizers” and “Machine learning models” dropdowns, respectively. Hence, a value of 20 means that a total of 20 models are explored, with the hyperparameters of each new model selected by Bayesian optimization based on the hyperparameters and performance of the previous models. For more information about the automated machine learning model development, please see the DeepAutoQSAR white paper. The final model uses an ensemble of the 3 top-performing models to generate predictions and uncertainties. As a result, a minimum value of 3 is required for this parameter.
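To make the ensemble idea concrete, here is a minimal sketch of how a prediction and an uncertainty could be derived from the top 3 of several candidate models. It assumes, purely for illustration, that the prediction is the mean of the selected models' outputs and the uncertainty is their sample standard deviation. The panel's actual aggregation method is not described in this tutorial, the Bayesian optimization step is not modeled, and `ensemble_predict` and the toy models are hypothetical.

```python
from statistics import mean, stdev


def ensemble_predict(candidates, x, top_k=3):
    """Predict with the top_k best-scoring candidate models.

    candidates: list of (validation_score, predict_fn), higher score is better.
    Returns (mean prediction, sample standard deviation across the models).
    """
    if len(candidates) < top_k:
        # Mirrors the requirement that at least 3 models must be explored.
        raise ValueError(f"need at least {top_k} candidate models")
    best = sorted(candidates, key=lambda c: c[0], reverse=True)[:top_k]
    predictions = [predict(x) for _, predict in best]
    return mean(predictions), stdev(predictions)


if __name__ == "__main__":
    # Toy stand-ins for trained models; scores and functions are made up.
    toy_candidates = [
        (0.90, lambda x: x + 0.1),
        (0.80, lambda x: x),
        (0.95, lambda x: x - 0.1),
        (0.20, lambda x: x + 5.0),  # poor model, excluded from the ensemble
    ]
    prediction, uncertainty = ensemble_predict(toy_candidates, 1.0)
    print(prediction, uncertainty)
```

Under this assumption, close agreement among the three selected models gives a small uncertainty, while disagreement gives a larger one.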

Figure 3-6. Naming and running the training calculation.

  1. Change the Job name to formulation_ml_train_density
  2. Adjust the job settings as needed
    • This job requires a CPU host. The job will be completed in about 10 minutes on a CPU host
  3. If you would like to perform the calculation, click Run. Otherwise, we will import pre-generated results in the next step.

By default, the panel will automatically perform a random 90:10 train:test split on your dataset. The panel uses the training set for model training and the test set to evaluate whether the model can generalize to unseen formulations. Performance on the training and test sets is reported after building a model.
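The exact splitting procedure used by the panel is not described here. As a rough sketch of what a random 90:10 split means, the following assumes a simple seeded shuffle followed by slicing; `train_test_split` is a helper written for this example, not part of the panel.

```python
import random


def train_test_split(rows, test_fraction=0.1, seed=0):
    """Shuffle rows with a fixed seed and return (train, test) lists."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    # 500 row indices stand in for the 500 formulations in the tutorial.
    train, test = train_test_split(range(500))
    print(len(train), len(test))
```

Because the split is random, rerunning with a different seed produces different train and test sets, which is one reason retrained models can show slightly different performance.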

Figure 3-7. Loading the model file.

If you performed the calculation, the results will automatically be incorporated into the panel when the job is complete. Here, we will assume that you are proceeding with the provided files:

  1. Go to the Performance tab
  2. Click Load Model …
  3. In the provided files, go to the Section_03 > formulation_ml_train_density directory, choose the formulation_ml_train_density.mlform file and click Open
    • If you get a warning saying that the uploaded ML model was trained on an older version, feel free to click okay and proceed. If you wish to retrain the model, you can use the ML Model Manager panel.

Figure 3-8. Viewing the model.

The Performance tab is populated with the corresponding scatterplot.

Above the plot, the Featurizers and Machine learning models used in building the model are listed.

The plot contains predicted versus actual values from the training and test sets, with corresponding R2 and RMSE values in a table below.

In this case, we can clearly see that the model generalized well on the test set.

Note: If you performed the calculations yourself, you should expect slight variance in the results.

4. Applying the Machine Learning Model to a Test Set

In this section, we will learn to apply the model generated in the previous section to predict density on a set of formulations unseen by the model.

Figure 4-1. The Predict tab. 

  1. In the Formulation Machine Learning panel, go to the Predict tab
  2. Ensure that next to Load Model, the .mlform file associated with Section 3 is shown. If not, load the formulation_ml_train_density.mlform file

Figure 4-2. Loading the prediction input data. 

We can load new formulation data into the panel and use our model to predict density values.

  1. Make sure the dropdown shows Prediction Input and click Load …
  2. Navigate to the provided tutorial files and choose the test.csv file. Click Open

Figure 4-3. Viewing the prediction input data. 

The panel is updated with the data in the test.csv file. Note that the densities are provided only to compare the quality of the predictions; the machine learning model will be able to predict the density of these formulations with only SMILES and composition information.

Similar to the “Training Data” tab, the data is editable within the “Predict” tab. To alter any molecules, click on the corresponding SMILES string to open the 2D Sketcher. To edit any values, click on the row to enable editing. For this example, we will not edit the data.

Figure 4-4. Naming and running the job. 

With the trained model and input data loaded, we can run the job to predict density for each formulation.

  1. Change the Job name to formulation_ml_test_density
  2. Adjust the job settings as needed
    • This job requires a CPU host. The job will be completed in about 1 minute on a CPU host
  3. If you would like to perform the calculation, click Run. Otherwise, we will import pre-generated results in the next step.

Figure 4-5. Loading the prediction output.

If you performed the calculation, the results will automatically be incorporated into the panel when the job is complete. We will assume that you are proceeding with the provided files:

  1. Change the dropdown from Predict Input Data to Prediction Output
  2. Click Load …
  3. Navigate to the Section_04 > formulation_ml_test_density directory and choose the formulation_ml_test_density_predict.csv file
  4. Click Open

Figure 4-6. Viewing the prediction output.

The second data column contains the provided density. The fourth and fifth columns contain the outputs from the machine learning model, specifically the predicted density and associated uncertainty.

Note: If you performed the calculations yourself, you should expect slight variance in the results.

Figure 4-7. Plotting a scatter plot of the data.

The panel enables quick generation of a scatter plot to assess the performance of the model.

  1. Click on the density_predict column header to pull up the plotter tool
  2. Set the Properties from the dropdowns to density and density_predict

A scatterplot of predicted density versus provided density is shown. Statistical parameters such as R2, RMSE and Pearson’s r are printed, and can be toggled on and off with the checkboxes above the plot.

The best fit line between predicted and calculated values shows a reasonable R2 of 0.95 (an ideal model would have an R2 of 1.00). The results suggest that the ML model derived from the Formulation Machine Learning panel could generalize to unseen formulations. Furthermore, this workflow highlights the computational efficiency achieved when using ML approaches as compared to other computational (e.g. performing molecular dynamics simulations on each formulation) or experimental approaches. While this tutorial uses a relatively small dataset, one could expect that a larger training set would further improve prediction accuracy.
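For readers who want to reproduce these statistics outside the panel, here is a minimal sketch using only the Python standard library. It assumes that R2 is the coefficient of determination (1 - SS_res/SS_tot) and that RMSE is taken over all rows. The panel may define R2 differently, for example as the square of Pearson's r, so values may not match exactly. The function name and the numbers in the example are placeholders, not tutorial results.

```python
import math


def regression_metrics(actual, predicted):
    """Return R2, RMSE and Pearson's r for paired actual/predicted values."""
    n = len(actual)
    if n != len(predicted) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "pearson_r": cov / math.sqrt(ss_tot * var_p),
    }


if __name__ == "__main__":
    # Placeholder values: predictions offset from the actual values by 1.
    print(regression_metrics([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]))
```

The offset example shows why the definition matters: Pearson's r is 1 because the values are perfectly correlated, while the coefficient of determination is negative because every prediction is off by a constant.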

5. Conclusion and References

In this tutorial, we learned to use the Formulation Machine Learning panel to build and apply machine learning models to predict the density of multicomponent mixtures. While this tutorial focused on miscible solvents and densities, one could imagine extending this workflow to any mixture of chemicals with composition information, such as copolymer systems, battery electrolytes, consumer goods, and more.

For further learning:

For introductory content, focused on navigating the Schrödinger Materials Science interface, an Introduction to Materials Science Maestro tutorial is available. Please visit the materials science training website for access to 70+ tutorials. For scientific inquiries or technical troubleshooting, submit a ticket to our Technical Support Scientists at help@schrodinger.com.

For self-paced, asynchronous, online courses in Materials Science modeling, including access to Schrödinger software, please visit the Schrödinger Online Learning portal on our website.

For some related practice, proceed to explore other relevant tutorials:

For further reading:

6. Glossary of Terms

Entry List - a simplified view of the Project Table that allows you to perform basic operations such as selection and inclusion

Project Table - displays the contents of a project and is also an interface for performing operations on selected entries, viewing properties, and organizing structures and data

Scratch Project - a temporary project in which work is not saved; closing a scratch project removes all current work and begins a new scratch project

Working Directory - the location where files are saved

Workspace - the 3D display area in the center of the main window, where molecular structures are displayed