Applied Machine Learning for Formulations

Tutorial Created with Software Release: 2025-3
Topics: Catalysis & Reactivity, Consumer Packaged Goods, Organic Electronics, Pharmaceutical Formulations, Polymeric Materials, Thin Film Processing
Methodology: Machine Learning
Products Used: MS Formulation ML, MS Maestro

Tutorial files

342.6 MB

This tutorial is written for use with a 3-button mouse with a scroll wheel.
Words found in the Glossary of Terms are shown like this: Workspace (the 3D display area in the center of the main window, where molecular structures are displayed)

 

Abstract:

 

In this tutorial, we will learn to apply the Formulation Machine Learning Panel across a range of materials applications. This tutorial assumes that you have already completed the Machine Learning for Formulations tutorial.

 

Tutorial Content
  1. Introduction to Applied Machine Learning for Formulations

  2. Creating Projects and Importing Structures

  3. Building and Applying a Machine Learning Model for Solubility

  4. Building and Applying a Machine Learning Model for Viscosity

  5. Building a Machine Learning Model for Glass Transition Temperature

  6. Building a Machine Learning Model for Compression Strength

  7. Conclusion and References

  8. Glossary of Terms

1. Introduction to Applied Machine Learning for Formulations

Chemical mixtures with specific compositions of ingredients, or formulations, are ubiquitous across materials science applications. A previous tutorial (Machine Learning for Formulations) described a workflow that leverages machine learning (ML) algorithms to rapidly and accurately predict properties of mixtures using ingredient structure and composition as inputs. That tutorial focuses on predicting mixture properties obtained from molecular dynamics simulations, a specific application to solvent mixtures using computed properties.

In this tutorial, we demonstrate that the Formulation Machine Learning panel can be applied to distinct experimental datasets from the literature, showcasing the flexibility of this tool for broad materials design. We focus on training machine learning models to predict four relevant materials properties:

  1. Temperature-dependent drug solubility in pure or binary solvents. Drug solubility - or the amount of drug dissolved in solution - is useful for understanding how to better design or process drugs in various solvent environments. Drug solubility is important for pharmaceutical formulation applications to design new medicine.
  2. Temperature-dependent viscosity of pure or binary solvents. Viscosity measures a fluid's resistance to flow, or its “stickiness”; for example, honey is sticky and has a high viscosity, whereas water flows easily and has a low viscosity. Viscosity is a crucial parameter in battery, consumer goods, and pharmaceutical applications.
  3. Glass transition temperature (Tg) of copolymer systems. Tg is the temperature at which a polymer transitions between a soft, rubbery state and a hard, glassy state, which is critical for designing polymer materials that are stable in a desired temperature range.
  4. Compression strength of geopolymer concrete. Compression strength indicates how much pressure a material can withstand, which is useful for designing reliable concrete for roads and homes.

The overall workflow for one of the demonstrations is summarized in the figure below:

Figure 1. General workflow: inputting a CSV file containing ingredients and compositions, training new machine learning models with the Formulation Machine Learning panel, and predicting properties of new ingredients and compositions using a trained machine learning model.

2. Creating Projects and Importing Structures

At the start of the session, change the file path to your chosen Working Directory (the location where files are saved) in MS Maestro to make file navigation easier. Each session in MS Maestro begins with a default Scratch Project (a temporary project in which work is not saved; closing a scratch project removes all current work and begins a new scratch project). An MS Maestro project stores all your data and has a .prj extension. A project may contain numerous entries corresponding to imported structures, as well as the output of modeling-related tasks. Once a project is saved, the project is automatically saved each time a change is made.

Structures can be built in MS Maestro or imported using File > Import Structures (or by drag-and-drop), and are added to the Entry List (a simplified view of the Project Table that allows you to perform basic operations such as selection and inclusion) and the Project Table (which displays the contents of a project and is also an interface for performing operations on selected entries, viewing properties, and organizing structures and data). The Entry List is located to the left of the Workspace (the 3D display area in the center of the main window, where molecular structures are displayed). The Project Table can be accessed by Ctrl+T (Cmd+T) or Window > Project Table if you would like to see an expanded view of your project data.

  1. Double-click the Materials Science icon

Figure 2-1. Change Working Directory option.

  1. Go to File > Change Working Directory
  2. Find your directory, and click Choose
  3. Pre-generated files are included for running jobs or examining output. Download the zip file here: schrodinger.com/sites/default/files/s3/release/current/Tutorials/zip/ml_formulations_applied.zip
  4. After downloading the zip file, unzip the contents in your Working Directory (the location where files are saved) for ease of access throughout the tutorial

Figure 2-2. Save Project panel.

  1. Go to File > Save Project As
  2. Change the File name to ml_formulations_applied_tutorial, click Save
    • The project is now named ml_formulations_applied_tutorial.prj

3. Building and Applying a Machine Learning Model for Solubility

In this section, we will use the Formulation Machine Learning panel to build a model for predicting solubility of formulations. Then, we will learn to apply the model generated to predict solubility on a set of formulations unseen by the model.

Figure 3-1. The Formulation Machine Learning panel after opening.

The provided .csv file is loaded directly into the Formulation Machine Learning panel (not into MS Maestro):

  1. Go to Tasks > Materials > Informatics > Formulations Machine Learning

Figure 3-2. Loading the training data into the panel.

  1. Click Load CSV …
  2. Navigate to the provided files (in your Working Directory, if you unzipped them there), select Section_03 > Pharmaceutical_Formulations-2025BAOTowards-train.csv and click Open
    • The training CSV data contains the drug as SMILES_0, first solvent as SMILES_1, and second solvent (if present) as SMILES_2. The composition of the drug (comp_0) is fixed at 50%, and relative compositions of the solvents (comp_1 and comp_2) are set such that they sum up to 50%.
    • The panel is populated with the training data
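
Before loading a file of your own in this layout, it can help to confirm that the compositions follow the pattern described above. The following is a minimal sketch, not part of the panel: the column names come from the tutorial file description, while the toy rows, the assumption that compositions are stored as percent values, and the check_compositions helper are illustrative only.

```python
import csv
import io

# Toy rows in the layout described above. The SMILES strings and numbers are
# illustrative only; the real file also holds temperature and LogS columns.
TOY_CSV = """SMILES_0,comp_0,SMILES_1,comp_1,SMILES_2,comp_2
CC(=O)Nc1ccc(O)cc1,50,O,50,,0
CC(=O)Nc1ccc(O)cc1,50,O,30,CCO,20
CC(=O)Nc1ccc(O)cc1,50,O,30,CCO,25
"""

def check_compositions(text, tol=1e-6):
    """Return the 1-based data-row numbers whose compositions break the pattern.

    Assumes percent units: the drug (comp_0) at 50 and the solvents
    (comp_1 + comp_2) summing to 50.
    """
    bad_rows = []
    reader = csv.DictReader(io.StringIO(text))
    for row_number, row in enumerate(reader, start=1):
        drug = float(row["comp_0"])
        # An empty comp_2 cell is treated as "no second solvent".
        solvents = float(row["comp_1"]) + float(row["comp_2"] or 0.0)
        if abs(drug - 50.0) > tol or abs(solvents - 50.0) > tol:
            bad_rows.append(row_number)
    return bad_rows

print(check_compositions(TOY_CSV))  # [3]
```

If your compositions are stored as fractions rather than percentages, the two constants would need to change accordingly.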

Figure 3-3. Viewing the training data in the panel.

Take a moment to view the data in the panel. Note that the data is editable within the panel. To alter any molecules, click on the corresponding SMILES string to open the 2D Sketcher. Click on the row to edit the values. Importantly, editing the data in the panel will not edit the imported .csv file. If you edit the structure, you can create a new CSV file through the Export Data button to save your changes. For this example, we will not edit the data.

Figure 3-4. Showing the LogS property.

The panel offers a useful way to visualize the distribution of the data set prior to model construction.

  1. On the right side of the panel, for Properties, choose LogS
    • Statistical parameters such as R2, root-mean-squared error (RMSE) and Pearson’s r correlation coefficient are printed, and can be toggled on and off with the checkboxes above the plot

Figure 3-5. Setting the parameters on the Build tab.

  1. Go to the Build tab
  2. Select Fingerprint and Graph Representation as the Featurizers
    • Deselect all other options
  3. Select Dense Neural Networks, Set2Set, and Graph-based Models for the Machine learning models
  4. Change the Target property to LogS
    • The target property will be predicted by the machine learning model
  5. Select Temperature (K) as the Descriptors
    • If you have additional descriptors relevant to the target property, you can always add them as additional inputs into the machine learning model. In this example, solubility was measured as a function of temperature, which is why temperature is included in the model.
  6. Change the Hyperparameter tuning steps to 20

Figure 3-6. Opening the Advanced Options.

  1. Open the Advanced Options
    • If additional input features like temperature are missing for data points, you can check the Enable descriptor imputation option to train a model and infer missing features.
  2. Check Out-of-sample splitting
    • Instead of randomly splitting, we chose out-of-sample splitting because the temperature-dependent data can contain the same drug-solvent combination repeated at different temperatures. Out-of-sample splitting divides the training and testing data such that the test set contains only combinations of structures that are absent from the training set. It is a more rigorous way to measure whether a model can predict new drug-solvent combinations.
  3. Click OK to close the panel
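
To make the idea of out-of-sample splitting concrete, here is a minimal sketch of a group-based split. It is not the panel's implementation, which the tutorial does not describe; the out_of_sample_split helper, the placeholder structure names, and the toy temperatures are all assumptions for illustration. Rows are grouped by their structure combination, and whole groups are assigned to either the training set or the test set.

```python
import random
from collections import defaultdict

def out_of_sample_split(rows, key_columns, test_fraction=0.1, seed=0):
    """Split rows so that no structure combination lands in both sets."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[c] for c in key_columns)].append(row)
    keys = sorted(groups)                      # deterministic order before shuffling
    random.Random(seed).shuffle(keys)
    n_test = max(1, round(test_fraction * len(keys)))
    train = [r for k in keys[n_test:] for r in groups[k]]
    test = [r for k in keys[:n_test] for r in groups[k]]
    return train, test

# Toy data: 10 hypothetical drug-solvent pairs, each measured at 3 temperatures.
rows = [
    {"SMILES_0": f"drug_{i}", "SMILES_1": "O", "Temperature (K)": t}
    for i in range(10)
    for t in (288.15, 298.15, 308.15)
]
train_rows, test_rows = out_of_sample_split(rows, ["SMILES_0", "SMILES_1"])
print(len(train_rows), len(test_rows))  # 27 3
```

Note that in this sketch the test fraction applies to the number of combinations rather than the number of rows, so the row-level ratio can drift from 90:10 when combinations have unequal numbers of measurements.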

Figure 3-7. Naming and running the training calculation.

  1. Change the Job name to formulation_ml_train_logS
  2. Adjust the job settings as needed
    • This job requires a CPU host and takes about 1 day to complete
  3. If you would like to perform the calculation, click Run. Otherwise, we will import pre-generated results in the next step.

By default, the panel automatically performs a 90:10 train:test split on your dataset: 90% of the data is used to train the model, and the remaining 10% is held out as a test set to evaluate how well the model generalizes to unseen formulations. Performance on the train and test sets is reported after building a model.

Figure 3-8. Loading the model file.

If you performed the calculation, the results will automatically be incorporated into the panel when the job is complete. Here, we will assume that you are proceeding with the provided files:

  1. Go to the Performance tab
  2. Click Load Model …
  3. In the provided files, go to the Section_03 > formulation_ml_train_logS directory, choose the formulation_ml_train_logS.mlform file
  4. Click Open
    • If you get a warning saying that the uploaded ML model was trained on an older version, feel free to click okay and proceed. If you wish to retrain the model, you can use the ML Model Manager panel

 

Note: If you performed the calculations yourself, you should expect slight variance in the results arising from the randomness of the train/test split and machine learning model hyperparameters.


Figure 3-9. Viewing the model.

The performance tab is populated with the corresponding scatterplot.

Above the plot, the Featurizers and Machine learning models used in building the model are listed.

The plot contains predicted versus actual values from the train and test set, with corresponding R2 and RMSE values in a table below.

In this case, the model generalizes well to the test set, with an R2 of 0.85 (an ideal model would have an R2 of 1.00).
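
For readers who want to reproduce such statistics outside the panel, the sketch below computes R2, RMSE, and Pearson's r from paired actual and predicted values. It assumes R2 means the coefficient of determination (1 minus the ratio of residual to total sum of squares); the tutorial does not state which definition the panel uses, and the regression_metrics helper and the toy numbers are illustrative.

```python
import math

def regression_metrics(actual, predicted):
    """Return R2 (coefficient of determination), RMSE, and Pearson's r."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "RMSE": math.sqrt(ss_res / n),
        "pearson_r": cov / math.sqrt(ss_tot * var_p),
    }

# Toy LogS values (illustrative only).
actual = [-3.0, -2.0, -1.0, 0.0]
predicted = [-2.8, -2.1, -1.2, 0.1]
for name, value in regression_metrics(actual, predicted).items():
    print(f"{name}: {value:.3f}")
```

With this definition, R2 can be negative for a model that performs worse than predicting the mean, whereas the square of Pearson's r cannot, so the two should not be used interchangeably.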

Figure 3-10. The Predict tab. 

  1. In the Formulation Machine Learning panel, go to the Predict tab
  2. Ensure that next to the Load Model, the .mlform file is shown. If not, load the formulation_ml_train_logS.mlform file again

Figure 3-11. Loading the prediction input data. 

We can load new formulation data into the panel and use our model to predict solubility values.

  1. Make sure the dropdown shows Prediction Input and click Load …
  2. Navigate to the provided tutorial files and choose the Pharmaceutical_Formulations-2025BAOTowards-test.csv file. Click Open
    • Note that this test CSV file has unique drug-solvent combinations outside of model training to truly evaluate new formulations.

Figure 3-12. Viewing the prediction input data. 

The panel is updated with the data in the Pharmaceutical_Formulations-2025BAOTowards-test.csv file. Note that the experimental solubilities are included only to compare the quality of the predictions; the practical application of this model would be to predict the solubility of these formulations using only SMILES, composition, and temperature information (i.e. without experimental solubility).

Similar to the “Training Data” tab, the data is editable within the “Predict” tab. To alter any molecules, click on the corresponding SMILES string to open the 2D Sketcher. To edit any values, click on the row to enable editing. For this example, we will not edit the data.

Figure 3-13. Naming and running the job. 

With the trained model and input data loaded, we can run the job to predict solubility for each formulation.

  1. Change the Job name to formulation_ml_test_logS
  2. Adjust the job settings as needed
    • This job requires a CPU host and takes about 1 minute to complete
  3. If you would like to perform the calculation, click Run. Otherwise, we will import pre-generated results in the next step.

Figure 3-14. Loading the prediction output.

If you performed the calculation, the results will automatically be incorporated into the panel when the job is complete. We will assume that you are proceeding with the provided files:

  1. Change the dropdown from Prediction Input to Prediction Output
  2. Click Load …
  3. Navigate to Section_03 > formulation_ml_test_logS and choose the formulation_ml_test_logS_predict.csv file
  4. Click Open

Figure 3-15. Viewing the prediction output.

The blue text data columns contain the experimental temperature and solubility. The black text data columns contain the outputs from the machine learning model, such as the predicted solubility and the uncertainty of the predictions.
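
The prediction output is a plain CSV file, so it can also be inspected outside the panel. As a minimal sketch, the code below ranks formulations by absolute prediction error. The LogS, LogS_predict, and Temperature (K) column names follow the tutorial; the toy rows, the placeholder structure names, and the largest_errors helper are assumptions, and the real file contains additional columns such as the prediction uncertainty.

```python
import csv
import io

# Toy prediction output (illustrative values and placeholder structure names).
TOY_OUTPUT = """SMILES_0,SMILES_1,Temperature (K),LogS,LogS_predict
drug_A,O,298.15,-2.10,-2.00
drug_A,CCO,298.15,-1.20,-1.90
drug_B,O,313.15,-3.40,-3.10
"""

def largest_errors(text, top_n=2):
    """Return the top_n rows with the largest |LogS_predict - LogS|."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        row["abs_error"] = abs(float(row["LogS_predict"]) - float(row["LogS"]))
    return sorted(rows, key=lambda r: r["abs_error"], reverse=True)[:top_n]

for row in largest_errors(TOY_OUTPUT):
    print(row["SMILES_0"], row["SMILES_1"], f'{row["abs_error"]:.2f}')
```

Listing the worst-predicted formulations this way can point to solvents or drugs that are poorly represented in the training data.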

Figure 3-16. Plotting a scatter plot of the data.

The panel enables quick generation of a scatter plot to assess the performance of the model.

  1. Click on the LogS column header to pull up the plotter tool
  2. Set the Properties from the dropdowns to LogS and LogS_predict

A scatterplot of predicted solubility versus provided solubility is shown. Statistical parameters such as R2, RMSE and Pearson’s r are printed, and can be toggled on and off with the checkboxes above the plot.

The best fit line between predicted and experimental values shows a reasonable R2 of 0.90 (an ideal model would have an R2 of 1.00). The results suggest that the ML model built with the Formulation Machine Learning panel can accurately predict drug solubility for new formulations with varying solvent structure, composition, and temperature.

4. Building and Applying a Machine Learning Model for Viscosity

In this section, we will use the Formulation Machine Learning panel to build a model for predicting viscosity of formulations. Then, we will learn to apply the model generated to predict viscosity on a set of formulations unseen by the model.

Figure 4-1. The Formulation Machine Learning panel after opening.

Figure 4-2. Loading the training data into the panel.

  1. Click Load Training Data …
  2. Navigate to the provided files (in your Working Directory, if you unzipped them there), select Section_04 > Viscosity-Bilodeau2023-train.csv and click Open
    • The training CSV file contains the first solvent as SMILES_0 and second solvent (if present) as SMILES_1. Their corresponding compositions are stored as comp_0 and comp_1 such that the compositions sum up to 100%.
    • The panel is populated with the training data

Figure 4-3. Viewing the training data in the panel.

Once again, take a moment to view the data in the panel.

Figure 4-4. Viewing the logV values of the training data.

  1. On the right side of the panel, for Properties, choose logV
    • Statistical parameters such as R2, root-mean-squared error (RMSE) and Pearson’s r correlation coefficient are printed, and can be toggled on and off with the checkboxes above the plot

Figure 4-5. Setting the parameters on the Build tab.

  1. Go to the Build tab
  2. Select Fingerprint and Graph Representation as the Featurizers
    • Deselect all other options
  3. Select Dense Neural Networks, Set2Set, and Graph-based Models for the Machine learning models
  4. Change the Target property to logV
    • The target property is that which we wish to predict with the machine learning model
  5. Select Temperature (K) as the Descriptors
    • If you had additional descriptors available with your dataset, you could refer to them here
  6. Change the Hyperparameter tuning steps to 10

Figure 4-6. Opening the Advanced Options.

  1. Open the Advanced Options
  2. Check Out-of-sample splitting
    • Similar to the previous example, this dataset has temperature-dependent viscosity data. Out-of-sample splitting divides the training and testing data such that the test set contains only combinations of structures that are absent from the training set. It is a more rigorous way to measure whether a model can predict new pure or binary solvent combinations.
  3. Click OK to close the panel

Figure 4-7. Naming and running the training calculation.

  1. Change the Job name to formulation_ml_train_viscosity
  2. Adjust the job settings as needed
    • This job requires a CPU host and takes about 10 hours to complete
  3. If you would like to perform the calculation, click Run. Otherwise, we will import pre-generated results in the next step.

By default, the panel automatically performs a random 90:10 train:test split on your dataset. The panel uses the training set for model training and the test set to evaluate whether the model can generalize to unseen formulations. Performance on the train and test sets is reported after building a model.

Figure 4-8. Loading the model file.

If you performed the calculation, the results will automatically be incorporated into the panel when the job is complete. Here, we will assume that you are proceeding with the provided files:

  1. Go to the Performance tab
  2. Click Load Model …
  3. In the provided files, go to the Section_04 > formulation_ml_train_viscosity directory, choose the formulation_ml_train_viscosity.mlform file and click Open
    • If you get a warning saying that the uploaded ML model was trained on an older version, feel free to click okay and proceed. If you wish to retrain the model, you can use the ML Model Manager panel

Note: If you performed the calculations yourself, you should expect slight variance in the results.

Figure 4-9. Viewing the model.

The performance tab is populated with the corresponding scatterplot.

Above the plot, the Featurizers and Machine learning models used in building the model are listed.

The plot contains predicted versus actual values from the train and test set, with corresponding R2 and RMSE values in a table below.

In this case, we can clearly see that the model generalized well on the test set.

Figure 4-10. The Predict tab. 

  1. In the Formulation Machine Learning panel, go to the Predict tab
  2. Ensure that next to Load Model, the .mlform file is shown. If not, load the formulation_ml_train_viscosity.mlform file

Figure 4-11. Loading the prediction input data. 

We can load new formulation data into the panel and use our model to predict viscosity values.

  1. Make sure the dropdown shows Prediction Input and click Load …
  2. Navigate to the provided tutorial files and choose the Viscosity-Bilodeau2023-test.csv file. Click Open

Figure 4-12. Viewing the prediction input data. 

The panel is updated with the data in the Viscosity-Bilodeau2023-test.csv file. Note that the experimental viscosities are provided only to compare the quality of the predictions; the machine learning model will be able to predict the viscosity of these formulations with only SMILES, composition, and temperature information.

Similar to the “Training Data” tab, the data is editable within the “Predict” tab. To alter any molecules, click on the corresponding SMILES string to open the 2D Sketcher. To edit any values, click on the row to enable editing. For this example, we will not edit the data.

Figure 4-13. Naming and running the job. 

With the trained model and input data loaded, we can run the job to predict viscosity for each formulation.

  1. Change the Job name to formulation_ml_test_viscosity
  2. Adjust the job settings as needed
    • This job requires a CPU host and takes about 1 minute to complete
  3. If you would like to perform the calculation, click Run. Otherwise, we will import pre-generated results in the next step.

Figure 4-14. Loading the prediction output.

If you performed the calculation, the results will automatically be incorporated into the panel when the job is complete. We will assume that you are proceeding with the provided files:

  1. Change the dropdown from Prediction Input to Prediction Output
  2. Click Load …
  3. Navigate to Section_04 > formulation_ml_test_viscosity and choose the formulation_ml_test_viscosity_predict.csv file
  4. Click Open

Figure 4-15. Viewing the prediction output.

The first data column contains the provided temperature and the second data column contains the provided logV. The third and fourth columns contain the outputs from the machine learning model, specifically the predicted viscosity and associated uncertainty.

Figure 4-16. Plotting a scatter plot of the data.

The panel enables quick generation of a scatter plot to assess the performance of the model.

  1. Set the Properties from the dropdowns to logV and logV_predict

A scatterplot of predicted logV versus provided logV is shown. Statistical parameters such as R2, RMSE and Pearson’s r are printed, and can be toggled on and off with the checkboxes above the plot.

The best fit line between predicted and experimental values shows a reasonable R2 of 0.96 (an ideal model would have an R2 of 1.00). The results suggest that the ML model built with the Formulation Machine Learning panel can accurately predict the viscosity of pure or binary solvents with varying structure, composition, and temperature.

5. Building a Machine Learning Model for Glass Transition Temperature

In this section, we will use the Formulation Machine Learning panel to build a model for predicting the glass transition temperature for copolymers.

Figure 5-1. The Formulation Machine Learning panel after opening.

Figure 5-2. Loading the training data into the panel.

  1. Click Load Training Data …
  2. Navigate to the provided files (in your Working Directory, if you unzipped them there), select Section_05 > Copolymer_Tg_penzel1997Glass.csv and click Open
    • The training CSV file contains binary copolymer systems, where the repeat units of monomers 1 and 2 are stored as SMILES_0 and SMILES_1, respectively. The compositions (comp_0 and comp_1) denote the relative amount of each monomer in the copolymer system.
    • The panel is populated with the training data

Figure 5-3. Viewing the training data in the panel.

Take a moment to view the values in the panel. Each input structure contains a monomer repeat unit of a copolymer. The compositions represent the relative ratios of each monomer unit for a random copolymer system.

Note: Th and Lr are dummy atoms that serve as the head and tail groups of the monomer and mark where the monomers connect to form a polymer.

Figure 5-4. Viewing the training data.

  1. On the right side of the panel, for Properties, choose Tg(K)
    • Statistical parameters such as R2, root-mean-squared error (RMSE) and Pearson’s r correlation coefficient are printed, and can be toggled on and off with the checkboxes above the plot

Figure 5-5. Setting the parameters on the Build tab.

  1. Go to the Build tab
  2. Select Fingerprint and MACCS Keys as the Featurizers
  3. Select Dense Neural Networks and Set2Set for the Machine learning models
  4. Change the Target property to Tg(K)
  5. Change the Hyperparameter tuning steps to 20

Figure 5-6. Opening the Advanced Options.

  1. Open the Advanced Options
  2. Check Out-of-sample splitting
    • Instead of randomly splitting, we chose out-of-sample splitting because the copolymer dataset contains varying compositions for each copolymer combination. Out-of-sample splitting divides the training and testing data such that the test set contains only combinations of copolymer structures that are absent from the training set. It is a more rigorous way to measure whether a model can predict new copolymer combinations.
  3. Click OK to close the panel

Figure 5-7. Naming and running the training calculation.

  1. Change the Job name to formulation_ml_train_Tg
  2. Adjust the job settings as needed
    • This job requires a CPU host and takes about 20 minutes to complete
  3. If you would like to perform the calculation, click Run. Otherwise, we will import pre-generated results in the next step.

By default, the panel automatically performs a 90:10 train:test split on your dataset. The panel uses the training set for model training and the test set to evaluate whether the model can generalize to unseen formulations. Performance on the train and test sets is reported after building a model.

Figure 5-8. Loading the model file.

If you performed the calculation, the results will automatically be incorporated into the panel when the job is complete. Here, we will assume that you are proceeding with the provided files:

  1. Go to the Performance tab
  2. Click Load Model …
  3. In the provided files, go to the Section_05 > formulation_ml_train_Tg directory, choose the formulation_ml_train_Tg.mlform file and click Open
    • If you get a warning saying that the uploaded ML model was trained on an older version, feel free to click okay and proceed. If you wish to retrain the model, you can use the ML Model Manager panel

Note: If you performed the calculations yourself, you should expect slight variance in the results.

Figure 5-9. Viewing the model.

The performance tab is populated with the corresponding scatterplot.

Above the plot, the Featurizers and Machine learning models used in building the model are listed.

The plot contains predicted versus actual values from the train and test set, with corresponding R2 and RMSE values in a table below.

In this case, we can clearly see that the model generalized well on the test set.

The best fit line between predicted and experimental values shows a reasonable R2 of 0.98 (an ideal model would have an R2 of 1.00). The results suggest that the ML model built with the Formulation Machine Learning panel can accurately predict Tg as a function of monomer structure and composition for binary copolymer systems.

6. Building a Machine Learning Model for Compression Strength

In this section, we will use the Formulation Machine Learning panel to build a model for predicting the compression strength for concrete geopolymers.

Figure 6-1. The Formulation Machine Learning panel after opening.

Figure 6-2. Loading the training data into the panel.

  1. Click Load Training Data …
  2. Navigate to the provided files (in your Working Directory, if you unzipped them there), select Section_06 > Geopolymer_Concrete-Rao2018Quantitative.csv and click Open
    • The training CSV file contains 240 mixtures of fly ash (FA) and ground granulated blast-furnace slag (GGBFS) used to create the geopolymer systems, where their compositions are labeled as comp_0 and comp_1, respectively. Chemical structures are ignored since these systems cannot be easily represented as SMILES; hence, they are denoted as “MISSING” in SMILES_0 and SMILES_1. We use the composition of each ingredient as inputs to the machine learning models.
    • The panel is populated with the training data

Figure 6-3. Viewing the training data in the panel.

Take a moment to view the data in the panel. Since no SMILES structure is available, each ingredient is labeled as “comp_0” and “comp_1” to represent their individual compositions.

Figure 6-4. Viewing the training data.

  1. On the right side of the panel, for Properties, choose target_compression_strength_MPa
    • Statistical parameters such as R2, root-mean-squared error (RMSE) and Pearson’s r correlation coefficient are printed, and can be toggled on and off with the checkboxes above the plot

Figure 6-5. Setting the parameters on the Build tab and running the calculation.

  1. Go to the Build tab
  2. Select Composition only as the Featurizers
    • Deselect all other options
    • The Composition only option ignores chemical structure and uses only the composition as input to the machine learning models
  3. Select All for the Machine learning models
  4. Change the Target property to target_compression_strength_MPa
  5. Select Powderkg, Liquidkg, WC, Admixturekg, Aggregateskg, temperature as the Descriptors. The additional descriptors are experimental features that were varied and summarized below:
    • Powderkg: Powder of fly ash + GGBFS content in kg/m3
    • Liquidkg: Alkaline solution in kg/m3
    • WC: Water-to-cement ratio / Alkali-binder ratio
    • Admixturekg: Total superplasticizer in kg/m3 (4% of binder)
    • Aggregateskg: Total aggregate in kg/m3
    • temperature: Curing temperature in Celsius
  6. Change the Hyperparameter tuning steps to 20
  7. Change the Job name to formulation_ml_train_geopolymer
  8. Adjust the job settings as needed
    • This job requires a CPU host and takes about 10 minutes to complete
  9. If you would like to perform the calculation, click Run. Otherwise, we will import pre-generated results in the next step.
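
To illustrate what a composition-only input might look like, the sketch below assembles a numeric feature vector from the two compositions and the six descriptors selected in the steps above. This is an assumption about the general idea, not the panel's internal featurization; the column names come from the tutorial, while the toy row values and the composition_only_features helper are hypothetical.

```python
# Descriptor columns selected on the Build tab in the steps above.
DESCRIPTORS = ["Powderkg", "Liquidkg", "WC", "Admixturekg", "Aggregateskg", "temperature"]

def composition_only_features(row, n_ingredients=2):
    """Build [comp_0, comp_1, ...descriptors] and ignore the SMILES columns."""
    compositions = [float(row[f"comp_{i}"]) for i in range(n_ingredients)]
    extras = [float(row[name]) for name in DESCRIPTORS]
    return compositions + extras

# Hypothetical mixture: 70% FA, 30% GGBFS, cured at 60 degrees Celsius.
toy_row = {
    "SMILES_0": "MISSING", "comp_0": "70",
    "SMILES_1": "MISSING", "comp_1": "30",
    "Powderkg": "400", "Liquidkg": "160", "WC": "0.4",
    "Admixturekg": "16", "Aggregateskg": "1800", "temperature": "60",
    "target_compression_strength_MPa": "45.0",
}
print(composition_only_features(toy_row))
# [70.0, 30.0, 400.0, 160.0, 0.4, 16.0, 1800.0, 60.0]
```

Because the structures are marked as “MISSING”, nothing derived from SMILES enters the vector; the target column is likewise excluded from the inputs.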

By default, the panel automatically performs a random 90:10 train:test split on your dataset. The panel uses the training set for model training and the test set to evaluate whether the model can generalize to unseen formulations. Performance on the train and test sets is reported after building a model. Note that for this example, we use random splitting because the compositions of the same ingredients are varied across the dataset (e.g. the extent of FA versus GGBFS). If the dataset instead contained distinct formulations with varying compositions, out-of-sample splitting would be useful to evaluate whether the model can generalize to new formulations, as shown in the previous examples.
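
For comparison with out-of-sample splitting, a random 90:10 split can be sketched in a few lines. This is an illustrative sketch rather than the panel's code; the random_split helper and the fixed seed are assumptions, and integers stand in for the 240 mixtures in the dataset.

```python
import random

def random_split(rows, test_fraction=0.1, seed=0):
    """Shuffle the rows and hold out a fraction of them as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(test_fraction * len(shuffled)))
    return shuffled[n_test:], shuffled[:n_test]

# Integers stand in for the 240 mixtures in the geopolymer dataset.
train, test = random_split(range(240))
print(len(train), len(test))  # 216 24
```

Here every row is assigned independently, which is appropriate when all rows share the same ingredients and differ only in composition and processing conditions.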

Figure 6-6. Loading the model file.

If you performed the calculation, the results will automatically be incorporated into the panel when the job is complete. Here, we will assume that you are proceeding with the provided files:

  1. Go to the Performance tab
  2. Click Load Model …
  3. In the provided files, go to the Section_06 > formulation_ml_train_geopolymer directory, choose the formulation_ml_train_geopolymer.mlform file and click Open
    • If you get a warning saying that the uploaded ML model was trained on an older version, feel free to click okay and proceed. If you wish to retrain the model, you can use the ML Model Manager panel

Note: If you performed the calculations yourself, you should expect slight variance in the results.

Figure 6-7. Viewing the model.

The Performance tab is populated with the corresponding scatter plot.

Above the plot, the Featurizers and Machine learning models used in building the model are listed.

The plot shows predicted versus actual values for the training and test sets, with the corresponding R² and RMSE values in a table below.

In this case, the model generalizes well to the test set.
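For reference, the two metrics in the table can be computed from predicted and actual values as sketched below. The sketch assumes the standard definitions (R² as the coefficient of determination, RMSE as the root of the mean squared error); whether the panel uses exactly these formulas is an assumption, and the strength values are made up for illustration.

```python
import math


def rmse(actual, predicted):
    """Root-mean-square error between actual and predicted values."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical compression strengths, for illustration only.
actual = [20.0, 30.0, 40.0, 50.0]
predicted = [22.0, 29.0, 41.0, 48.0]

print(f"R2 = {r_squared(actual, predicted):.2f}")
print(f"RMSE = {rmse(actual, predicted):.2f}")
```

A model that generalizes well shows test-set values close to the training-set values; a large gap between the two usually indicates overfitting.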

  1. Go to the Feature Importance tab

Figure 6-8. Running the Feature Importance.

  1. Change the Job name to formulation_ml_feature_importance_geopolymer
  2. Click Calculate Feature Importance
    • This calculation runs in the panel and takes a couple of minutes

 

Figure 6-9. Viewing the Feature Importance.

If you performed the calculation, the results should automatically be incorporated into the panel when the job is complete. Here, we will assume that you are proceeding with the provided files:

  1. Go to the Performance tab
  2. Click Load Model …
  3. In the provided files, go to the Section_06 > formulation_ml_feature_importance_geopolymer directory, choose the formulation_ml_train_geopolymer.mlform file and click Open
    • This is an updated formulation_ml_train_geopolymer.mlform file that includes the calculated feature importance
    • If you get a warning saying that the uploaded ML model was trained on an older version, feel free to click okay and proceed. If you wish to retrain the model, you can use the ML Model Manager panel

The high test set R² of 0.99 (an ideal model would have an R² of 1.00) demonstrates that the ML model built with the Formulation Machine Learning panel can be used to screen the compression strength of geopolymer concrete. In addition, the feature importance tools show which components and conditions contribute most to strength; for example, COMPOSITION_1 shows a high Mean |SHAP| value, indicating that increasing the extent of GGBFS can lead to increased compression strength. Similarly, increasing the curing temperature or decreasing WC (water-to-cement ratio) can yield higher compression strengths.
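As a rough sketch of how such a ranking is read, the snippet below averages absolute per-sample SHAP values for each feature and then estimates the direction of each effect by correlating feature values with SHAP values. It does not call the panel or the SHAP library; the helper names and all numbers are hypothetical and chosen only to mirror the trends described above.

```python
import math


def mean_abs_shap(shap_rows, feature_names):
    """Average |SHAP| per feature across samples (magnitude of influence)."""
    n = len(shap_rows)
    return {
        name: sum(abs(row[i]) for row in shap_rows) / n
        for i, name in enumerate(feature_names)
    }


def pearson(xs, ys):
    """Pearson correlation; the sign suggests the direction of the effect."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


feature_names = ["COMPOSITION_1", "WC", "temperature"]

# Hypothetical feature values and per-sample SHAP values (one row per sample).
feature_values = [
    [0.00, 0.50, 25.0],
    [0.50, 0.45, 40.0],
    [1.00, 0.40, 60.0],
    [0.25, 0.55, 25.0],
]
shap_values = [
    [-6.0, -1.0, -2.0],
    [1.0, 0.5, 0.5],
    [8.0, 2.0, 3.0],
    [-3.0, -1.5, -1.5],
]

importance = mean_abs_shap(shap_values, feature_names)
direction = {
    name: pearson(
        [row[i] for row in feature_values],
        [row[i] for row in shap_values],
    )
    for i, name in enumerate(feature_names)
}
```

Because Mean |SHAP| is an average of absolute values, it ranks features by how strongly they affect the prediction but carries no sign. The direction of an effect comes from how the signed SHAP values change with the feature value, which is what the correlation approximates here.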

7. Conclusion and References

In this tutorial, we learned how to use the Formulation Machine Learning panel to train practical models with experimental datasets derived from the literature. These machine learning models can be applied to a broad range of materials applications, such as pharmaceutical formulations, consumer packaged goods, batteries, plastics, and solid materials. While this tutorial demonstrates the utility of the tool for a subset of literature examples, one can envision training broad formulation machine learning models for diverse properties, which can then be used for downstream screening of large formulation libraries to suggest the best candidates with tailored material properties.

For further learning:

For introductory content, focused on navigating the Schrödinger Materials Science interface, an Introduction to Materials Science Maestro tutorial is available. Please visit the materials science training website for access to 70+ tutorials. For scientific inquiries or technical troubleshooting, submit a ticket to our Technical Support Scientists at help@schrodinger.com.

For self-paced, asynchronous, online courses in Materials Science modeling, including access to Schrödinger software, please visit the Schrödinger Online Learning portal on our website.

For some related practice, proceed to explore other relevant tutorials:

For further reading:

8. Glossary of Terms

Entry List - a simplified view of the Project Table that allows you to perform basic operations such as selection and inclusion

Included - the entry is represented in the Workspace; the circle in the In column is blue

Project Table - displays the contents of a project and is also an interface for performing operations on selected entries, viewing properties, and organizing structures and data

Recent actions - This is a list of your recent actions, which you can use to reopen a panel, displayed below the Browse row. (Right-click to delete.)

Scratch Project - a temporary project in which work is not saved; closing a scratch project removes all current work and begins a new scratch project

Selected - (1) the atoms are chosen in the Workspace. These atoms are referred to as "the selection" or "the atom selection". Workspace operations are performed on the selected atoms. (2) The entry is chosen in the Entry List (and Project Table) and the row for the entry is highlighted. Project operations are performed on all selected entries

Working Directory - the location where files are saved

Workspace - the 3D display area in the center of the main window, where molecular structures are displayed