Optimizing Viscosity and Cost in Formulations with Missing Structural Data
Tutorial Created with Software Release: 2025-4
Topics: Consumer Packaged Goods
Methodology: Machine Learning
Products Used: MS Formulation ML, MS Maestro
This tutorial is written for use with a 3-button mouse with a scroll wheel.
Words found in the Glossary of Terms are shown like this: Workspace
Abstract:
In this tutorial, we will learn to build a machine learning (ML) model that predicts the viscosity of shampoo formulations with missing structural data, and then use it to optimize formulations for viscosity and ingredient cost.
Tutorial Content
1. Introduction to Formulation Optimization
Multiparameter optimization (MPO), the simultaneous optimization of multiple properties, is crucial for materials design, particularly in formulations, where modifications in ingredient composition can lead to drastic changes in properties. Optimization workflows that can navigate vast design spaces and potentially conflicting properties provide a promising avenue to identify formulations that satisfy multiple criteria. The Formulation Machine Learning panel leverages machine learning (ML) models to predict formulation properties based on ingredient structure and composition. These ML models can then be combined with optimization approaches such as Bayesian optimization to provide data-driven experimental suggestions based on variations of ingredient structures, compositions, ingredient cost, and experimental features to fine-tune properties.
In this tutorial, we will use the Formulation Machine Learning panel to train a model for predicting the viscosity of shampoo formulations. We will demonstrate how to train a model when some ingredient structures are missing, a situation common in datasets containing extremely large molecules (e.g., polymers) or unknown structures. Then, we will use the Formulation Machine Learning Optimization panel to optimize formulations toward a target viscosity range while minimizing the cost of ingredients.
Figure 1. General workflow of optimizing formulations with the Formulation Machine Learning and Formulation Machine Learning Optimization panels.
2. Creating Projects and Importing Structures
At the start of the session, change the file path to your chosen Working Directory in MS Maestro to make file navigation easier. Each session in MS Maestro begins with a default Scratch Project, which is not saved. An MS Maestro project stores all your data and has a .prj extension. A project may contain numerous entries corresponding to imported structures, as well as the output of modeling-related tasks. Once a project is saved, the project is automatically saved each time a change is made.
Structures can be built in MS Maestro or imported using File > Import Structures (or by drag and drop), and are added to the Entry List and Project Table. The Entry List is located to the left of the Workspace. The Project Table can be accessed by Ctrl+T (Cmd+T) or Window > Project Table if you would like to see an expanded view of your project data.
-
Double-click the Materials Science icon
- (No icon? See Starting Maestro)
- Go to File > Change Working Directory
- Find your directory, and click Choose
- Pre-generated files are included for running jobs or examining output. Download the zip file here: schrodinger.com/sites/default/files/s3/release/current/Tutorials/zip/ml_form_opt_shampoo.zip
- After downloading the zip file, unzip the contents in your Working Directory for ease of access throughout the tutorial
- Go to File > Save Project As
-
Change the File name to ml_form_opt_shampoo_tutorial, click Save
- The project is now named ml_form_opt_shampoo_tutorial.prj
3. Optimization of Viscosity in Shampoo Formulations
We will begin by training an ML model to predict the viscosity of complex mixtures using the shampoo formulation dataset from Chitre et al. This trained model will then be utilized to optimize the cost and viscosity for a shampoo formulation.
-
Go to Tasks > Materials > Informatics > Formulation Machine Learning
- The Formulation Machine Learning panel opens.
-
Choose Complex as the Formulation type
- With the Complex option, the panel expects two input CSV files instead of the single input CSV file used for Simple mixtures
- Click Load CSV
-
Navigate to the provided files, choose the Section_03 > formulation_shampoo_input_viscosity_smal_subset500.csv file and click Open
- A prompt appears about the requirement of group information, which will be loaded separately into the panel. Click OK.
-
Choose the Section_03 > formulation_shampoo_group_info_missing_arlypon_f.csv file and click Open
- The panel is updated with the loaded information.
For Complex formulations, the panel requires two input CSV files:
- File #1: data of complex formulations with descriptors and properties (viscosity in this case)
- File #2: composition of each mixture in the formulation.
For file #1, each row represents a distinct formulation, defined by multiple components/mixtures and their corresponding compositions. The input also includes relevant descriptors and the property to be predicted.
For file #2, the CSV file contains ingredient and composition information for all mixtures; the compositions within each mixture should sum to 100%. There is an additional column called “label”. The SMILES for arlypon_f is missing, so this ingredient is identified by its label, “arlypon_f”.
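The two requirements on the group-information file (compositions within each mixture summing to 100%, and a label for any ingredient without SMILES) can be checked before loading the file into the panel. The sketch below is only an illustration: the column names (`mixture`, `ingredient`, `smiles`, `weight_pct`, `label`) and the example rows are assumptions and may not match the layout of the provided CSV files.

```python
import csv
import io
from collections import defaultdict

# Hypothetical group-information CSV. Column names and rows are assumptions
# made for illustration; they are not taken from the tutorial's files.
GROUP_INFO = """mixture,ingredient,smiles,weight_pct,label
mix_a,water,O,90,
mix_a,glycerol,OCC(O)CO,10,
mix_b,arlypon_f,,100,arlypon_f
"""


def check_group_info(text, tolerance=1e-6):
    """Return (mixtures whose totals are not 100, ingredients lacking both SMILES and label)."""
    totals = defaultdict(float)
    unlabeled = []
    for row in csv.DictReader(io.StringIO(text)):
        totals[row["mixture"]] += float(row["weight_pct"])
        has_smiles = bool((row["smiles"] or "").strip())
        has_label = bool((row["label"] or "").strip())
        # An ingredient with no structure must at least carry a label.
        if not has_smiles and not has_label:
            unlabeled.append(row["ingredient"])
    bad_totals = {m: t for m, t in totals.items() if abs(t - 100.0) > tolerance}
    return bad_totals, unlabeled


bad_totals, unlabeled = check_group_info(GROUP_INFO)
print("Mixtures not summing to 100%:", bad_totals)
print("Ingredients missing both SMILES and label:", unlabeled)
```

Both results are empty for the example data, because every mixture totals 100% and the only ingredient without SMILES carries a label.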
- Select Fingerprint for the Featurizers
- Select XGBoost for Machine learning models
- Select log(avg_viscosity) as the Target property
- Choose log(shear_rate) for Formulation descriptors
- Set the Hyperparameter tuning steps to 3
- Change the Job name to formulation_ml_shampoo_viscosity
-
Adjust the job settings as needed
- This job requires a CPU host and takes about 10 minutes to complete
- If you would like to perform the calculation, click Run. Otherwise, we will import pre-generated results.
If you perform the calculation, the results appear in the Select Formulations Models panel, where they can be selected. Alternatively, the provided results can be loaded in the Performance tab to visualize the parity plot. You might see multiple models in the panel depending on what calculations you have performed. Here, we will assume that you are proceeding with the provided files:
- Go to the Performance tab
- Click Load Model
-
Click Add Model from Folder, go to the Section_03 > formulation_ml_shampoo_viscosity directory, and click Choose
-
Select the formulation_ml_shampoo_viscosity model and click OK
- Your panel might show different models than what is shown in the figure, depending on what you have recently run
Note: If you performed the calculations yourself, you should expect slight variance in the results.
We can see that the model generalized well on the test set. Close the panel when finished.
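A parity plot compares measured and predicted values, and its quality is commonly summarized by the coefficient of determination (R²) and the root-mean-square error (RMSE). The sketch below shows how these two metrics are computed; the numbers are made-up values for illustration, not results from the tutorial, and the metrics reported by the panel may differ.

```python
import math


def parity_metrics(y_true, y_pred):
    """Return (R^2, RMSE) for paired measured and predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual sum of squares: scatter of the points around the parity line.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: scatter of the measured values around their mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    return r2, rmse


# Made-up log(avg_viscosity) values, for illustration only.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]

r2, rmse = parity_metrics(measured, predicted)
print(f"R^2 = {r2:.3f}, RMSE = {rmse:.3f}")
```

An R² close to 1 and a small RMSE on the test set, relative to the spread of the target values, are what "generalizing well" looks like in a parity plot.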
4. Optimization of Formulation Mixtures
We will optimize the shampoo formulation for viscosity and cost by incorporating a linear cost model with the trained model from the previous section.
-
Go to Tasks > Materials > Informatics > Formulation Machine Learning Optimization
- The Formulation Machine Learning Optimization panel opens.
- Select Complex for the Formulation Type
- In the Ingredients section, click Load CSV
-
In the provided files, go to the Section_04 directory, choose the shampoo_formulation_optimization_input.csv file and click Open
- A prompt appears about the requirement of group information. We will also load that data into the panel. Click OK.
-
Choose the Section_04 > formulation_shampoo_group_info_missing_arlypon_f.csv file and click Open
The information is loaded into the panel. Within each group, there are multiple components. Clicking on the dropdown arrow shows more information about each mixture. Clicking on each mixture shows the composition and the structure of each component in the mixture. However, arlypon_f’s structural data is incomplete. Click OK to close the panel.
- In the ML Models and Properties section, click Add ML Model
-
Select the formulation_ml_shampoo_viscosity model and click OK
- Ensure log(avg_viscosity) is chosen as the Properties
- Change Objective to Middle Good
-
Set the values of constraints as follows:
- Lower Good: 1.8
- Lower Okay: 0.4
- Upper Good: 2.1
- Upper Okay: 2.9
-
Set the log(shear_rate) constraints as follows:
- Min: 2.0
- Max: 2.0
- Click OK
- In the ML Models and Properties section, click Add Linear Model
-
In the provided files, go to the Section_04 directory, choose the Ingredient_cost_estimates.csv file and click Open
- Ensure cost_usd_per_kg is chosen as the Properties
- Change Objective to Minimize
- Change Aggregator to Weighted Sum
- Change Good to 3.5
- Click OK
Note: Ingredient descriptors can be used to incorporate domain-specific knowledge and enable the flexibility to embed custom descriptors outside of the usual chemical descriptors.
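The linear cost model added in the preceding steps can be pictured as a composition-weighted sum of per-ingredient costs. The sketch below is an assumption about what a "Weighted Sum" aggregator computes, not a description of the panel's internals; the ingredient names, weight fractions, and prices are hypothetical.

```python
def weighted_sum_cost(weight_fractions, cost_per_kg):
    """Cost of a formulation (USD/kg) as the fraction-weighted sum of ingredient costs.

    weight_fractions: ingredient name -> weight fraction (must sum to 1).
    cost_per_kg: ingredient name -> cost in USD per kg.
    """
    total = sum(weight_fractions.values())
    if abs(total - 1.0) > 1e-6:
        raise ValueError(f"Weight fractions sum to {total}, expected 1.0")
    return sum(frac * cost_per_kg[name] for name, frac in weight_fractions.items())


# Hypothetical formulation and prices, for illustration only.
formulation = {"water": 0.80, "surfactant_a": 0.15, "thickener_b": 0.05}
prices = {"water": 0.01, "surfactant_a": 2.00, "thickener_b": 10.00}

cost = weighted_sum_cost(formulation, prices)
print(f"Estimated cost: {cost:.3f} USD/kg")
```

Because the model is linear in the composition, a small amount of an expensive ingredient can dominate the total, which is why cost often conflicts with property targets during optimization.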
- Change the Job name to formulation_ml_optimization_cost_viscosity
-
Adjust the job settings as needed
- This job requires a CPU host and takes about 30 minutes to complete
- If you would like to perform the calculation, click Run. Otherwise, we will import pre-generated results in the next step
- Go to the Results tab
-
Click Load Optimization Results
- If you performed the calculation, navigate to the job directory. Otherwise, proceed to import from the provided files
-
Choose Section_04 > formulation_ml_optimization_cost_viscosity > formulation_ml_optimization_cost_viscosity.omlform and click Open
The panel is loaded with results of the calculation.
- Select log(avg_viscosity)_predict and MPO for the Properties
- Uncheck Same x and y
- The MPO score and the predicted viscosity are not on the same scale
Note: To export the new formulations, use the Export Data button.
- Change the log(avg_viscosity)_predict property to cost_usd_per_kg_predict
- Go to the MPO Score subtab
The average MPO score for this calculation was 0.88, which is considered good. Increasing the number of iterations could lead to an even higher score.
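To build intuition for how a Middle Good objective with Good and Okay bounds can turn a predicted property into a score between 0 and 1, the sketch below uses a trapezoidal desirability function, a common choice in multiparameter optimization. This functional form is an assumption: the tutorial does not state how the panel computes its scores. The bounds are the log(avg_viscosity) constraints set earlier in this section.

```python
def middle_good_score(value, lower_okay, lower_good, upper_good, upper_okay):
    """Trapezoidal desirability (assumed form): 1 inside the Good range,
    falling linearly to 0 at the Okay bounds, and 0 outside them."""
    if value <= lower_okay or value >= upper_okay:
        return 0.0
    if value < lower_good:
        # Rising edge between Lower Okay and Lower Good.
        return (value - lower_okay) / (lower_good - lower_okay)
    if value <= upper_good:
        return 1.0
    # Falling edge between Upper Good and Upper Okay.
    return (upper_okay - value) / (upper_okay - upper_good)


# Bounds entered for log(avg_viscosity) in this section.
BOUNDS = dict(lower_okay=0.4, lower_good=1.8, upper_good=2.1, upper_okay=2.9)

for predicted in (0.3, 1.1, 2.0, 2.5, 3.0):
    score = middle_good_score(predicted, **BOUNDS)
    print(f"log(avg_viscosity) = {predicted:.1f} -> score = {score:.2f}")
```

Under this assumed form, predictions between 1.8 and 2.1 score 1, predictions in the Okay bands are penalized in proportion to their distance from the Good range, and anything at or beyond the Okay bounds scores 0.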
5. Conclusion and References
In this tutorial, we learned how to build an ML model to predict the viscosity of shampoo formulations with incomplete structural data, and how to use it to optimize formulations for viscosity and cost.
For further learning:
For introductory content focused on navigating the Schrödinger Materials Science interface, an Introduction to Materials Science Maestro tutorial is available. Please visit the materials science training website for access to 100+ tutorials. For scientific inquiries or technical troubleshooting, submit a ticket to our Technical Support Scientists at help@schrodinger.com.
For self-paced, asynchronous, online courses in Materials Science modeling, including access to Schrödinger software, please visit the Schrödinger Online Learning portal on our website.
For some related practice, proceed to explore other relevant tutorials:
-
For more machine learning:
- Machine Learning for Formulations
- Machine Learning for Materials Science
- Periodic Descriptors for Inorganic Solids
- Molecular Dynamics Descriptors for Machine Learning
- Optoelectronics Active Learning
- Machine Learning for Sweetness
- Machine Learning for Ionic Conductivity
- Cheminformatics Machine Learning for Homogeneous Catalysis
- Machine Learning Property Prediction
- Applied Machine Learning for Formulations
For further reading:
- See the help documentation
- Leveraging high-throughput molecular simulations and machine learning for the design of chemical mixtures. DOI: 10.1038/s41524-025-01552-2
- Accelerating Formulation Design via Machine Learning: Generating a High-throughput Shampoo Formulations Dataset. DOI: 10.1038/s41597-024-03573-w
6. Glossary of Terms
Entry List - a simplified view of the Project Table that allows you to perform basic operations such as selection and inclusion
Included - the entry is represented in the Workspace; the circle in the In column is blue
Project Table - displays the contents of a project and is also an interface for performing operations on selected entries, viewing properties, and organizing structures and data
Recent actions - This is a list of your recent actions, which you can use to reopen a panel, displayed below the Browse row. (Right-click to delete.)
Scratch Project - a temporary project in which work is not saved; closing a scratch project removes all current work and begins a new scratch project
Selected - (1) the atoms are chosen in the Workspace. These atoms are referred to as "the selection" or "the atom selection". Workspace operations are performed on the selected atoms. (2) The entry is chosen in the Entry List (and Project Table) and the row for the entry is highlighted. Project operations are performed on all selected entries
Working Directory - the location where files are saved
Workspace - the 3D display area in the center of the main window, where molecular structures are displayed