Optimizing Viscosity and Cost in Formulations with Missing Structural Data

Tutorial Created with Software Release: 2025-4
Topics: Consumer Packaged Goods
Methodology: Machine Learning
Products Used: MS Formulation ML, MS Maestro

Tutorial files

This tutorial is written for use with a 3-button mouse with a scroll wheel.
Words found in the Glossary of Terms are defined at the end of this tutorial, for example: Workspace (the 3D display area in the center of the main window, where molecular structures are displayed).

 

Abstract:

 

In this tutorial, we will learn to build a machine learning (ML) model that predicts the viscosity of shampoo formulations with missing structural data, and then combine it with a linear cost model to optimize viscosity and cost.

 

Tutorial Content
  1. Introduction to Formulation Optimization
  2. Creating Projects and Importing Structures
  3. Optimization of Viscosity in Shampoo Formulations
  4. Optimization of Formulation Mixtures
  5. Conclusion and References
  6. Glossary of Terms

1. Introduction to Formulation Optimization

Multiparameter optimization (MPO), the simultaneous optimization of multiple properties, is crucial for materials design, particularly in formulations, where changes in ingredient composition can lead to drastic changes in properties. Optimization workflows that can navigate vast design spaces and potentially conflicting properties provide a promising avenue for identifying formulations that satisfy multiple criteria. The Formulation Machine Learning panel uses machine learning (ML) models to predict formulation properties from ingredient structure and composition. These ML models can then be combined with optimization approaches such as Bayesian optimization to provide data-driven experimental suggestions based on variations in ingredient structures, compositions, ingredient cost, and experimental features, in order to fine-tune properties.

In this tutorial, we will use the Formulation Machine Learning panel to train a model for predicting the viscosity of shampoo formulations. We will demonstrate how to train a model with potentially missing ingredients, which is often seen in datasets containing extremely large molecules (e.g. polymers) or unknown structures. Then, we will use the Formulation Machine Learning Optimization panel to optimize formulations for fine-tuned viscosity values and to minimize the cost of ingredients.

Figure 1. General workflow of optimizing formulations with the Formulation Machine Learning and Formulation Machine Learning Optimization panels.

2. Creating Projects and Importing Structures

At the start of the session, change the file path to your chosen Working Directory in MS Maestro to make file navigation easier. Each session in MS Maestro begins with a default Scratch Project, which is not saved. An MS Maestro project stores all your data and has a .prj extension. A project may contain numerous entries corresponding to imported structures, as well as the output of modeling-related tasks. Once a project is saved, the project is automatically saved each time a change is made.

Structures can be built in MS Maestro or imported using File > Import Structures (or by drag and drop), and are added to the Entry List and Project Table. The Entry List is located to the left of the Workspace. The Project Table can be accessed with Ctrl+T (Cmd+T) or Window > Project Table if you would like to see an expanded view of your project data.

  1. Double-click the Materials Science icon

Figure 2-1. Change Working Directory option.

  1. Go to File > Change Working Directory
  2. Find your directory, and click Choose
  3. Pre-generated files are included for running jobs or examining output. Download the zip file here: schrodinger.com/sites/default/files/s3/release/current/Tutorials/zip/ml_form_opt_shampoo.zip
  4. After downloading the zip file, unzip the contents in your Working Directory for ease of access throughout the tutorial

Figure 2-2. Save Project panel.

  1. Go to File > Save Project As
  2. Change the File name to ml_form_opt_shampoo_tutorial, click Save
    • The project is now named ml_form_opt_shampoo_tutorial.prj

3. Optimization of Viscosity in Shampoo Formulations

We will begin by training an ML model to predict the viscosity of complex mixtures using the shampoo formulation dataset from Chitre et al. This trained model will then be utilized to optimize the cost and viscosity for a shampoo formulation.

Figure 3-1. Opening the Formulation Machine Learning Panel.

  1. Go to Tasks > Materials > Informatics > Formulation Machine Learning

 

Figure 3-2. Loading the input data.

  1. Choose Complex as the Formulation type
    • With the Complex option, the panel expects two input CSV files, instead of the single input CSV file used for Simple mixtures
  2. Click Load CSV
  3. Navigate to the provided files and choose Section_03 > formulation_shampoo_input_viscosity_smal_subset500.csv file and click Open
    • A prompt appears about the requirement of group information, which will be loaded separately into the panel. Click OK.

Figure 3-3. Loading the group information.

  1. Choose Section_03 > formulation_shampoo_group_info_missing_arlypon_f.csv file and click Open
    • The panel is updated with the loaded information.

For complex formulations, the panel requires two input CSV files:

  1. Data of complex formulations with descriptors and properties (viscosity in this case)
  2. Composition of each mixture in the formulation.

 

For file #1, each row represents a distinct formulation, defined by multiple components/mixtures and their corresponding compositions. The input also includes relevant descriptors and the property to be predicted.

For file #2, the CSV file contains ingredient and composition information for all mixtures; the compositions within each mixture should sum to 100%. There is an additional column called “label”. As shown below, the SMILES string for arlypon_f is missing, and the ingredient has been given the label “arlypon_f”.
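Before loading a group-information file, it can help to check the two conditions described above. The following is a minimal sketch, not part of the panel: the column names (mixture, ingredient, smiles, composition, label) and the example rows are assumptions made for illustration, and the actual file layout may differ.

```python
import csv
import io
from collections import defaultdict

# Hypothetical group-information rows. The column names and values are
# assumptions for illustration, not the panel's documented schema.
GROUP_INFO_CSV = """mixture,ingredient,smiles,composition,label
mixture_a,water,O,80.0,
mixture_a,sodium_chloride,[Na+].[Cl-],20.0,
mixture_b,arlypon_f,,100.0,arlypon_f
"""


def check_group_info(text, tolerance=1e-6):
    """Return (mixtures whose compositions do not sum to 100, ingredients without SMILES)."""
    totals = defaultdict(float)
    missing = []
    for row in csv.DictReader(io.StringIO(text)):
        totals[row["mixture"]] += float(row["composition"])
        if not row["smiles"].strip():
            # Record the ingredient together with its label column.
            missing.append((row["ingredient"], row["label"]))
    bad_sums = {m: t for m, t in totals.items() if abs(t - 100.0) > tolerance}
    return bad_sums, missing


bad_sums, missing = check_group_info(GROUP_INFO_CSV)
print("Mixtures not summing to 100%:", bad_sums)
print("Ingredients without SMILES (ingredient, label):", missing)
```

An empty first result means every mixture sums to 100%; the second result lists the ingredients that rely on a label because no structure is available.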

 

Figure 3-4. Viewing the input data.

  1. Go to the Build tab  

Figure 3-5. Setting up the descriptors and running the job.

  1. Select Fingerprint for the Featurizers
  2. Select XGBoost for Machine learning models
  3. Select log(avg_viscosity) as the Target property
  4. Choose log(shear_rate) for Formulation descriptors
  5. Set the Hyperparameter tuning steps to 3
  6. Change the Job name to formulation_ml_shampoo_viscosity
  7. Adjust the job settings as needed
    • This job requires a CPU host. The job will be completed in about 10 minutes on a CPU host
  8. If you would like to perform the calculation, click Run. Otherwise, we will import pre-generated results.

Figure 3-6. Importing the results.

If you perform the calculation, the results appear in the Select Formulations Models panel, where they can be selected. Alternatively, the provided results can be loaded in the Performance tab to visualize the parity plot. You might see multiple models in the panel depending on which calculations you have performed. Here, we will assume that you are proceeding with the provided files:

 

  1. Go to the Performance tab
  2. Click Load Model
  3. Click Add Model from Folder and go to the Section_03 > formulation_ml_shampoo_viscosity directory, click Choose
  4. Select the formulation_ml_shampoo_viscosity model and click OK
    • Your panel might show different models than what is shown in the figure, depending on what you have recently run

 

Note: If you performed the calculations yourself, you should expect slight variance in the results.

Figure 3-7. Viewing the results.

We can see that the model generalized well on the test set. Close the panel when finished.
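A parity plot compares predicted values against observed values, and its quality is commonly summarized with RMSE and R². The sketch below shows how these two metrics are computed in general; it is not the panel's implementation, and the numbers are made-up values used only to demonstrate the arithmetic.

```python
import math


def parity_metrics(observed, predicted):
    """Return (RMSE, R^2) for paired observed and predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(observed)
    mean_obs = sum(observed) / n
    # Residual and total sums of squares.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    rmse = math.sqrt(ss_res / n)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return rmse, r2


# Made-up log(avg_viscosity) values for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
rmse, r2 = parity_metrics(observed, predicted)
print(f"RMSE = {rmse:.3f}, R^2 = {r2:.3f}")
```

Points close to the diagonal of the parity plot correspond to a small RMSE and an R² close to 1.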

4. Optimization of Formulation Mixtures

We will optimize the shampoo formulation for viscosity and cost by incorporating a linear cost model with the trained model from the previous section.

Figure 4-1. Opening the Formulation Machine Learning Optimization panel.

  1. Go to Tasks > Materials > Informatics > Formulation Machine Learning Optimization

Figure 4-2. Loading the input file.

  1. Select Complex for the Formulation Type
  2. In the Ingredients section, click Load CSV
  3. In the provided files, go to the Section_04 directory, choose the shampoo_formulation_optimization_input.csv file and click Open
    • A prompt appears about the requirement of group information. We will load that data also into the panel. Click OK.

Figure 4-3. Loading the group information.

  1. Choose Section_04 > formulation_shampoo_group_info_missing_arlypon_f.csv file and click Open

Figure 4-4. Viewing the components of mixtures.

The information is loaded into the panel. Within each group, there are multiple components. Clicking on the dropdown arrow shows more information about each mixture. Clicking on each mixture shows the composition and the structure of each component in the mixture. However, arlypon_f’s structural data is incomplete. Click OK to close the panel.

Figure 4-5. Loading the ML model.

  1. In the ML Models and Properties section, click Add ML Model
  2. Select the formulation_ml_shampoo_viscosity model and click OK

Figure 4-6. Choosing the target property range.

  1. Ensure log(avg_viscosity) is chosen as the Properties
  2. Change Objective to Middle Good
  3. Set the values of constraints as follows:
    1. Lower Good: 1.8
    2. Lower Okay: 0.4
    3. Upper Good: 2.1
    4. Upper Okay: 2.9
  4. Set the log(shear_rate) constraints as follows:
    1. Min: 2.0
    2. Max: 2.0
  5. Click OK
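The tutorial does not describe how the panel converts a predicted value into a score for the Middle Good objective. One common choice in multiparameter optimization is a piecewise-linear desirability function, sketched below under that assumption: the score is 1 between the two Good bounds, falls linearly to 0 at the Okay bounds, and is 0 beyond them. The function name is hypothetical.

```python
def middle_good_desirability(value, lower_okay, lower_good, upper_good, upper_okay):
    """Piecewise-linear score: 1 inside [lower_good, upper_good], 0 at or beyond the Okay bounds."""
    if not lower_okay < lower_good <= upper_good < upper_okay:
        raise ValueError("expected lower_okay < lower_good <= upper_good < upper_okay")
    if lower_good <= value <= upper_good:
        return 1.0
    if value <= lower_okay or value >= upper_okay:
        return 0.0
    if value < lower_good:
        # Rising ramp between Lower Okay and Lower Good.
        return (value - lower_okay) / (lower_good - lower_okay)
    # Falling ramp between Upper Good and Upper Okay.
    return (upper_okay - value) / (upper_okay - upper_good)


# Bounds taken from the steps above; the scoring shape itself is an assumption.
for v in (0.4, 1.1, 1.8, 2.0, 2.5, 2.9):
    score = middle_good_desirability(v, 0.4, 1.8, 2.1, 2.9)
    print(f"log(avg_viscosity) = {v}: score = {score:.2f}")
```

Under this assumed shape, widening the gap between an Okay bound and its Good bound makes the penalty for missing the target window more gradual.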

Figure 4-7. Loading the linear model.

  1. In the ML Models and Properties section, click Add Linear Model
  2. In the provided files, go to the Section_04 directory, choose the Ingredient_cost_estimates.csv file and click Open

Figure 4-8. Choosing the target property range.

  1. Ensure cost_usd_per_kg is chosen as the Properties
  2. Change Objective to Minimize
  3. Change Aggregator to Weighted Sum
  4. Change Good to 3.5
  5. Click OK
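A linear cost model with a Weighted Sum aggregator can be read as a composition-weighted sum of per-ingredient costs. The sketch below illustrates that reading; the ingredient names and prices are hypothetical, and the way the panel normalizes compositions internally is an assumption.

```python
def weighted_sum_cost(weight_percent, cost_usd_per_kg):
    """Formulation cost as the sum of (weight fraction * ingredient cost)."""
    total = sum(weight_percent.values())
    if abs(total - 100.0) > 1e-6:
        raise ValueError(f"compositions sum to {total}, expected 100")
    return sum(w / 100.0 * cost_usd_per_kg[name] for name, w in weight_percent.items())


# Hypothetical ingredients and prices, for illustration only.
formulation = {"water": 80.0, "surfactant_a": 15.0, "thickener_b": 5.0}
prices = {"water": 0.5, "surfactant_a": 12.0, "thickener_b": 30.0}

cost = weighted_sum_cost(formulation, prices)
print(f"Estimated cost: {cost:.2f} USD/kg")
print("Meets the Good threshold of 3.5:", cost <= 3.5)
```

With a Minimize objective, a formulation whose cost is at or below the Good value would be expected to score best on this property.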

Figure 4-9. Running the job.

Note: Ingredient descriptors can be used to incorporate domain-specific knowledge and enable the flexibility to embed custom descriptors outside of the usual chemical descriptors.

  1. Change the Job name to formulation_ml_optimization_cost_viscosity
  2. Adjust the job settings as needed
    • This job requires a CPU host. The job will be completed in about 30 minutes on a CPU host
  3. If you would like to perform the calculation, click Run. Otherwise, we will import pre-generated results in the next step

Figure 4-10. Loading the results.

  1. Go to the Results tab
  2. Click Load Optimization Results
    • If you performed the calculation, navigate to the job directory. Otherwise, proceed to import from the provided files
  3. Choose Section_04 > formulation_ml_optimization_cost_viscosity > formulation_ml_optimization_cost_viscosity.omlform and click Open

Figure 4-11. Viewing the MPO vs predicted viscosity information.

The panel is loaded with results of the calculation.

 

  1. Select log(avg_viscosity)_predict and MPO for the Properties
  2. Uncheck Same x and y
    • The MPO score and the predicted viscosity are not on the same scale

 

Note: To export the new formulations, use the Export Data button.
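Exported data can be screened outside the panel. The sketch below assumes, hypothetically, that the export is a CSV file whose columns carry the property names shown in the Results tab (log(avg_viscosity)_predict, cost_usd_per_kg_predict, MPO) plus a candidate identifier; the real export layout may differ, and the rows here are made up.

```python
import csv
import io

# Hypothetical exported rows; the column layout and values are assumptions.
EXPORTED_CSV = """candidate,log(avg_viscosity)_predict,cost_usd_per_kg_predict,MPO
f1,1.95,3.2,0.93
f2,2.40,2.9,0.71
f3,2.05,3.4,0.90
f4,1.85,4.1,0.78
"""


def shortlist(text, visc_lo=1.8, visc_hi=2.1, max_cost=3.5):
    """Keep candidates inside the Good viscosity window and cost limit, best MPO first."""
    rows = list(csv.DictReader(io.StringIO(text)))
    keep = [
        r for r in rows
        if visc_lo <= float(r["log(avg_viscosity)_predict"]) <= visc_hi
        and float(r["cost_usd_per_kg_predict"]) <= max_cost
    ]
    return sorted(keep, key=lambda r: float(r["MPO"]), reverse=True)


picked = [r["candidate"] for r in shortlist(EXPORTED_CSV)]
print("Shortlisted candidates:", picked)
```

The default thresholds reuse the Good bounds set earlier in this section; they can be loosened toward the Okay bounds if too few candidates remain.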

 

Figure 4-12. Viewing the MPO vs cost information.

  1. Change the log(avg_viscosity)_predict property to cost_usd_per_kg_predict

Figure 4-13. Viewing the MPO score over the iterations.

  1. Go to the MPO Score subtab

 

The average MPO score for this calculation was 0.88, which is considered good. Increasing the number of iterations could lead to an even higher score.

 

5. Conclusion and References

In this tutorial, we learned how to build an ML model to predict the viscosity of shampoo formulations with incomplete structural data, and how to combine it with a linear cost model to optimize viscosity and cost.

For further learning:

For introductory content, focused on navigating the Schrödinger Materials Science interface, an Introduction to Materials Science Maestro tutorial is available. Please visit the materials science training website for access to 100+ tutorials. For scientific inquiries or technical troubleshooting, submit a ticket to our Technical Support Scientists at help@schrodinger.com.

For self-paced, asynchronous, online courses in Materials Science modeling, including access to Schrödinger software, please visit the Schrödinger Online Learning portal on our website.

For some related practice, proceed to explore other relevant tutorials:

For further reading:
  • See the help documentation
  • Leveraging high-throughput molecular simulations and machine learning for the design of chemical mixtures. DOI: 10.1038/s41524-025-01552-2
  • Accelerating Formulation Design via Machine Learning: Generating a High-throughput Shampoo Formulations Dataset. DOI: 10.1038/s41597-024-03573-w

6. Glossary of Terms

Entry List - a simplified view of the Project Table that allows you to perform basic operations such as selection and inclusion

Included - the entry is represented in the Workspace, the circle in the In column is blue

Project Table - displays the contents of a project and is also an interface for performing operations on selected entries, viewing properties, and organizing structures and data

Recent actions - This is a list of your recent actions, which you can use to reopen a panel, displayed below the Browse row. (Right-click to delete.)

Scratch Project - a temporary project in which work is not saved, closing a scratch project removes all current work and begins a new scratch project

Selected - (1) the atoms are chosen in the Workspace. These atoms are referred to as "the selection" or "the atom selection". Workspace operations are performed on the selected atoms. (2) The entry is chosen in the Entry List (and Project Table) and the row for the entry is highlighted. Project operations are performed on all selected entries

Working Directory - the location where files are saved

Workspace - the 3D display area in the center of the main window, where molecular structures are displayed