Optimization of Formulations Using Machine Learning
Tutorial Created with Software Release: 2025-3
Topics: Consumer Packaged Goods, Energy Capture & Storage, Pharmaceutical Formulations
Methodology: Machine Learning
Products Used: MS Formulation ML, MS Maestro
This tutorial is written for use with a 3-button mouse with a scroll wheel.
Words found in the Glossary of Terms are shown like this: Workspace (the 3D display area in the center of the main window, where molecular structures are displayed)
Abstract:
In this tutorial, we will learn to build machine learning (ML) models to predict distinct properties of formulations and leverage these models to optimize formulations for desired target properties.
Tutorial Content
1. Introduction
Multiparameter optimization (MPO) is the process of optimizing several properties of a material simultaneously. It is challenging because the design space is expansive and the target properties often conflict. Formulations, which are mixtures of chemical ingredients in a specified composition ratio, are important in fields such as pharmaceuticals, consumer packaged goods, and energy, and a formulation product must meet a complex set of criteria to be successful. The challenge lies in the inverse design of formulations given a desired set of properties. For example, in drug design applications, increasing the drug potency might decrease its solubility. MPO focuses on finding the "sweet spot" where multiple properties are all within acceptable ranges. It involves defining criteria that assign good or bad scores to each property, then combining these individual scores into an overall "desirability" score. MPO can therefore suggest formulations that satisfy multiple property criteria, which helps experimentalists identify good formulation candidates.
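As a rough illustration of combining individual scores into one desirability score, the sketch below uses a geometric mean. This is an assumption made for illustration only: the tutorial does not state how the panel combines scores, and the function name, property names, and score values are hypothetical.

```python
import math


def overall_desirability(scores):
    """Combine per-property scores in [0, 1] with a geometric mean.

    Assumption: a geometric mean is one common way to build an overall
    desirability score; the panel may combine scores differently.
    """
    if not scores:
        raise ValueError("at least one property score is required")
    if any(s < 0.0 or s > 1.0 for s in scores.values()):
        raise ValueError("each score must lie in [0, 1]")
    if any(s == 0.0 for s in scores.values()):
        return 0.0  # one unacceptable property rejects the candidate
    log_sum = sum(math.log(s) for s in scores.values())
    return math.exp(log_sum / len(scores))


# Hypothetical candidates: one balanced, one strong on a single property only.
balanced = overall_desirability({"potency": 0.8, "solubility": 0.8})
lopsided = overall_desirability({"potency": 1.0, "solubility": 0.1})
print(f"balanced candidate: {balanced:.2f}")
print(f"lopsided candidate: {lopsided:.2f}")
```

With a geometric mean, a candidate that is excellent in one property but poor in another scores lower than a balanced candidate, which matches the "sweet spot" idea described above.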
To enable fast and accurate predictions, we leverage machine learning (ML) models that can map ingredient structure and composition to formulation properties. Please refer to the Machine Learning for Formulations tutorial for background on how these models are built using the Formulation Machine Learning panel. Using these ML models, we will optimize formulations toward a desired objective (maximizing a property, minimizing it, or reaching a specific target value) with optimization approaches such as brute force and Bayesian optimization. Together, ML models and optimization tools facilitate data-driven suggestions for experiments, in particular by tuning ingredient structure, composition, and experimental features (e.g., temperature) to fine-tune formulation properties.
In this tutorial, we will use the Formulation Machine Learning Optimization panel to optimize formulations for distinct materials applications. In Section 3, we will optimize a battery electrolyte formulation to maximize the logarithmic Coulombic efficiency (LCE), a measure of how reversibly a battery can be charged and discharged (see References). In Section 4, we will optimize a formulation of miscible solvents to achieve a target density. Finally, in Section 5, we will optimize a shampoo formulation composed of multiple mixtures, using the BASF dataset. The general workflow is as follows:
Figure 1. General workflow of optimizing formulations with the Formulation Machine Learning Optimization panel. The input training CSV file contains ingredient structures, compositions, additional features, and the target property. The Formulation Machine Learning panel is used to train ML models to predict formulation properties (e.g., LCE). Then, the Formulation Machine Learning Optimization panel is used to suggest formulation candidates that maximize the LCE of battery electrolytes.
2. Creating Projects and Importing Structures
At the start of the session, change the file path to your chosen Working Directory (the location where files are saved) in MS Maestro to make file navigation easier. Each session in MS Maestro begins with a default Scratch Project (a temporary project in which work is not saved; closing a scratch project removes all current work and begins a new scratch project). An MS Maestro project stores all your data and has a .prj extension. A project may contain numerous entries corresponding to imported structures, as well as the output of modeling-related tasks. Once a project is saved, the project is automatically saved each time a change is made.
Structures can be built in MS Maestro or imported using File > Import Structures (or by drag and drop), and are added to the Entry List (a simplified view of the Project Table that allows you to perform basic operations such as selection and inclusion) and the Project Table (which displays the contents of a project and is also an interface for performing operations on selected entries, viewing properties, and organizing structures and data). The Entry List is located to the left of the Workspace. The Project Table can be accessed with Ctrl+T (Cmd+T) or Window > Project Table if you would like to see an expanded view of your project data.
- Double-click the Materials Science icon
- (No icon? See Starting Maestro)
- Go to File > Change Working Directory
- Find your directory, and click Choose
- Pre-generated files are included for running jobs or examining output. Download the zip file here: schrodinger.com/sites/default/files/s3/release/current/Tutorials/zip/formulation_optimization.zip
- After downloading the zip file, unzip the contents in your Working Directory for ease of access throughout the tutorial
- Go to File > Save Project As
- Change the File name to ml_optimization_tutorial, click Save
- The project is now named ml_optimization_tutorial.prj
3. Optimizing Battery Electrolyte Formulation to Maximize LCE
We will first build an ML model to predict the logarithmic Coulombic efficiency (LCE) for a given composition using the Formulation Machine Learning panel. Then we will use the ML model to optimize the formulation to achieve maximum LCE. For additional examples of building ML models for formulations, refer to the Machine Learning for Formulations and the Applied Machine Learning for Formulations tutorials.
- Go to Tasks > Materials > Informatics > Formulation Machine Learning
- The Formulation Machine Learning panel opens.
- Click Load Training Data
- Navigate to the provided files, which should be in your Working Directory. Choose the Section_03 > train_tutorial_battery_electrolytes_input.csv file and click Open
- The panel is populated with the training data
The electrolyte formulation data was obtained from Kim et al.
- Go to the Build tab
- Select XGBoost, Support Vector Machine, Dense Neural Networks, Random Forest, Elastic Net, and Set2Set for the Machine learning models
- Ensure Target property is LCE
- Change the Job name to formulation_ml_battery_LCE
- Adjust the job settings as needed
- This job requires a CPU host. The job will be completed in about 10 minutes on a CPU host
- If you would like to perform the calculation, click Run. Otherwise, we will import pre-generated results in the next step
If you performed the calculation, the results will automatically be incorporated into the panel when the job is complete. Here, we will assume that you are proceeding with the provided files:
- Go to the Performance tab
- Click Load Model
- In the provided files, go to the Section_03 > formulation_ml_battery_LCE directory, choose the formulation_ml_battery_LCE.mlform file and click Open
Note: If you performed the calculations yourself, you should expect slight variance in the results.
The plot shows predicted versus actual values for the training and test sets, with the corresponding R² and RMSE values in the table below it.
In this case, the model generalized well on the test set.
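For readers who want to check such metrics by hand, the following minimal sketch computes R² and RMSE from paired actual and predicted values using their standard definitions. The LCE numbers are made up for illustration and are not taken from the tutorial's model or dataset.

```python
import math


def rmse(actual, predicted):
    """Root-mean-square error between actual and predicted values."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical test-set values, for illustration only.
actual_lce = [1.0, 1.5, 2.0, 2.5]
predicted_lce = [1.1, 1.4, 2.1, 2.4]

test_rmse = rmse(actual_lce, predicted_lce)
test_r2 = r_squared(actual_lce, predicted_lce)
print(f"RMSE = {test_rmse:.3f}, R2 = {test_r2:.3f}")
```

A model that generalizes well shows test-set values close to the training-set values: an R² near 1 and a small RMSE relative to the spread of the target property.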
In the following steps, we will utilize this ML model to optimize the electrolyte composition for maximum LCE.
- Close the Formulation Machine Learning panel
- Go to Tasks > Materials > Informatics > Formulation Machine Learning Optimization
- The Formulation Machine Learning Optimization panel opens.
Before proceeding, let’s review the inputs for the Formulation Machine Learning Optimization panel.
At the very top, the panel offers four optimization methods:
- Bayesian optimization: The Bayesian method is a sequential learning approach that iteratively trains a Gaussian process model to guide the search toward optimal formulations. This method has the advantages of well-defined prediction uncertainties and suitability for global optimization. However, because the models are trained sequentially and scale poorly to large datasets, it can be computationally expensive and challenging when screening large numbers of formulations.
- Brute Force optimization: Brute Force is a grid search method that exhaustively enumerates combinations of ingredients and compositions. This approach is fast and robust for small search spaces, but it can be expensive or even infeasible for large design spaces. Additionally, this approach lacks guidance from previous iterations to identify good candidates.
- Random optimization: Similar to Brute Force optimization, the Random approach randomly selects formulations from an exhaustive list. Instead of grid searching across composition space, random optimization uses the differential evolution algorithm to find the best compositions and can yield results faster than Brute Force. This method generally has disadvantages similar to those of the Brute Force method.
- Genetic Algorithm optimization: Genetic Algorithm uses natural selection principles to select the best formulations, similar to evolution in nature. This approach is robust to noisy and multimodal problems as well as scalable to large design spaces. However, it may converge to suboptimal solutions based on the input conditions.
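To make the cost of exhaustive enumeration concrete, here is a minimal sketch of a composition grid of the kind a brute-force search would walk through. It is not the panel's implementation; the function name and the choice of integer percentages are assumptions made for illustration.

```python
from itertools import product


def composition_grid(n_components, step=10, total=100):
    """Yield every composition (integer percent) on a grid of size `step`.

    Each component receives at least one step, and each composition
    sums to `total`. Illustrative only; not the panel's algorithm.
    """
    levels = range(step, total + 1, step)
    for combo in product(levels, repeat=n_components - 1):
        remainder = total - sum(combo)
        if remainder >= step:
            yield combo + (remainder,)


binary = list(composition_grid(2, step=10))
ternary = list(composition_grid(3, step=10))
ternary_fine = list(composition_grid(3, step=1))

print(len(binary), "binary compositions at a 10% step")
print(len(ternary), "ternary compositions at a 10% step")
print(len(ternary_fine), "ternary compositions at a 1% step")
```

The count grows quickly with the number of components and with finer step sizes, and it must be multiplied by the number of ingredient combinations, which is why exhaustive search becomes expensive for large design spaces.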
The panel is structured into two primary sections: Ingredients, and ML Models and Properties. The Ingredients section allows for precise component specification, including grouping and minimum/maximum component number constraints. For instance, in our battery electrolyte example, components are categorized into 'salt' and 'solvent' groups. We can also designate components as default inclusions for all formulations. If the Brute Force optimization approach is selected, compositional constraints such as minimum and maximum compositions can also be set. The input format for the battery electrolyte dataset is described below.
Components are defined using SMILES strings, with the 'MIN_NUM_COMPONENTS' and 'MAX_NUM_COMPONENTS' columns specifying the allowed component count within their respective groups. The 'REQUIRED' column, with a 'True' value, designates components for inclusion in all formulations. Components are categorized into groups based on their properties, as indicated in the 'GROUP' column. The spreadsheet specifies that the optimization should choose one salt and either a single or a binary solvent system. Since the SMILES 'FC(F)COCCOCC(F)(F)' has True set in the 'REQUIRED' column for Solvent, it will always occupy one of the two available solvent slots.
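The following sketch shows how such a table constrains the ingredient choices. The 'GROUP', 'MIN_NUM_COMPONENTS', 'MAX_NUM_COMPONENTS', and 'REQUIRED' column names come from the description above; the 'SMILES' header, the placeholder ingredient names (SALT_A, SOLVENT_Y, and so on, standing in for real SMILES strings), and the enumeration logic are assumptions made for illustration and may differ from the actual file and panel behavior.

```python
import csv
import io
from itertools import combinations, product

# Hypothetical miniature ingredients file; only the required solvent is a
# SMILES string quoted from the tutorial, the rest are placeholders.
INGREDIENTS_CSV = """SMILES,GROUP,MIN_NUM_COMPONENTS,MAX_NUM_COMPONENTS,REQUIRED
SALT_A,Salt,1,1,False
SALT_B,Salt,1,1,False
FC(F)COCCOCC(F)(F),Solvent,1,2,True
SOLVENT_Y,Solvent,1,2,False
SOLVENT_Z,Solvent,1,2,False
"""


def allowed_selections(rows):
    """Enumerate ingredient selections that satisfy the group constraints."""
    groups = {}
    for row in rows:
        groups.setdefault(row["GROUP"], []).append(row)

    per_group = []
    for members in groups.values():
        low = int(members[0]["MIN_NUM_COMPONENTS"])
        high = int(members[0]["MAX_NUM_COMPONENTS"])
        required = [m["SMILES"] for m in members if m["REQUIRED"] == "True"]
        optional = [m["SMILES"] for m in members if m["REQUIRED"] != "True"]
        choices = []
        for size in range(low, high + 1):
            n_optional = size - len(required)
            if n_optional < 0:
                continue  # required components alone exceed this size
            for picked in combinations(optional, n_optional):
                choices.append(tuple(required) + picked)
        per_group.append(choices)

    # One choice from every group, flattened into a single tuple.
    return [sum(selection, ()) for selection in product(*per_group)]


rows = list(csv.DictReader(io.StringIO(INGREDIENTS_CSV)))
selections = allowed_selections(rows)
for selection in selections:
    print(selection)
```

In this toy table, two salts and three solvent choices (the required solvent alone, or paired with one of two others) give six ingredient selections, each containing the required solvent.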
The ML Models and Properties section defines the machine learning model used for formulation optimization and specifies the target property, including maximization, minimization, or a desired value range.
- In the Ingredients section, click Load CSV
- In the provided files, go to the Section_03 directory, choose the form_battery.csv file and click Open
The panel reflects the provided data. Notably, FC(F)COCCOCC(F)(F) is designated as 'REQUIRED' with a 'True' value, indicating its inclusion in all formulations.
The panel also displays component counts per group: 13 for 'Salt' and 46 for 'Solvent'. In addition to the .csv file input, users can directly adjust minimum/maximum component counts per group using the 'At least' and 'At most' text boxes within the panel. Note that changes in the panel do not modify the original .csv input file.
- In the ML Models and Properties section, click Add ML Model
- In the provided files, go to the Section_03 > formulation_ml_battery_LCE directory, choose the formulation_ml_battery_LCE.mlform file and click Open
- In the popup window, ensure Properties is set to LCE
- Choose Maximize as the Objective
- Alternatively, you can adjust the range of 'Good' or 'Okay' values by either moving the dashed sliders or by directly entering the corresponding numerical values.
- Click OK
- Increase the Number of Trials to 500
- Check Stop Early
- Change the Job name to formulation_ml_optimization_battery
- Adjust the job settings as needed
- This job requires a CPU host. The job will be completed in about 10 minutes on a CPU host
- If you would like to perform the calculation, click Run. Otherwise, we will import pre-generated results in the next step
- Go to the Results tab
- Click Load Optimization Results
- If you performed the calculation, navigate to the job directory. Otherwise, proceed to import from the provided files
- Choose Section_03 > formulation_ml_optimization_battery > formulation_ml_optimization_battery.omlform and click Open
The panel is loaded with results of the calculation. Each formulation is shown in the list along with the composition of each component and corresponding MPO score. By default, the formulations are rank ordered according to their MPO scores.
- Go to the MPO Score subtab
This plot depicts the Average MPO score progression for the top ten formulations during optimization. The iteration process was terminated prior to 500 iterations as the average score converged, showing no further significant improvement. Proceed to explore the other options in the panel.
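The idea behind stopping once the average score has converged can be sketched as follows. The tutorial does not document the panel's actual Stop Early criterion, so the `patience` and `min_improvement` parameters, the function name, and the synthetic score curve are all hypothetical.

```python
def should_stop(avg_top_scores, patience=20, min_improvement=1e-3):
    """Return True when the best average score has not improved recently.

    Hypothetical criterion: compare the best score in the last `patience`
    iterations with the best score seen before them.
    """
    if len(avg_top_scores) <= patience:
        return False
    best_before = max(avg_top_scores[:-patience])
    best_recent = max(avg_top_scores[-patience:])
    return best_recent - best_before < min_improvement


# Synthetic score history that rises and then plateaus at 0.9.
history = []
for iteration in range(500):
    score = min(0.9, 0.3 + 0.02 * iteration)
    history.append(score)
    if should_stop(history):
        break

stopped_at = iteration + 1
print(f"stopped after {stopped_at} of 500 iterations")
```

In this toy run the loop ends well before the 500-trial budget, mirroring the behavior seen in the MPO Score plot.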
- Close the Formulation Machine Learning Optimization panel.
4. Optimizing Density of Miscible Solvents Formulation
We will now apply the optimization steps outlined in the preceding section to optimize the density of miscible solvent formulations, targeting a defined range of values. We will use a pre-trained ML model. The steps to do that are similar to those mentioned in Section 3. Please refer to the Machine Learning for Formulations tutorial for detailed steps to build the model.
- Go to Tasks > Materials > Informatics > Formulation Machine Learning Optimization
- The Formulation Machine Learning Optimization panel opens.
- In the Ingredients section, click Load CSV
- In the provided files, go to the Section_04 directory, choose the optimize_input.csv file and click Open
- Change the At most number of components to 3
- This will use at least 1 solvent and at most 3 solvents for each formulation
- In the ML Models and Properties section, click Add ML Model
- In the provided files, go to the Section_04 > formulation_ml_miscible_density directory, choose the formulation_ml_miscible_density.mlform file and click Open
- Change Objective to Middle Good
- Change the values of constraints as follows:
- Lower Good: 1.0
- Lower Okay: 0.9
- Upper Good: 1.10
- Upper Okay: 1.2
- Click OK
This ensures that formulations are scored highest when their predicted density falls within the narrowly defined 'Good' range.
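One way to picture a "Middle Good" objective is as a trapezoidal score: 1 inside the Good range, 0 outside the Okay range, and a ramp in between. The linear ramp below is an assumption made for illustration; the tutorial does not specify the exact shape the panel uses, and the function name is hypothetical. The limits are the density values entered above.

```python
def middle_good_score(value, lower_okay, lower_good, upper_good, upper_okay):
    """Trapezoidal desirability: 1 in the Good range, 0 outside Okay.

    Assumption: linear ramps between the Okay and Good limits.
    """
    if lower_good <= value <= upper_good:
        return 1.0
    if value <= lower_okay or value >= upper_okay:
        return 0.0
    if value < lower_good:
        return (value - lower_okay) / (lower_good - lower_okay)
    return (upper_okay - value) / (upper_okay - upper_good)


# Constraint values from the steps above.
density_limits = dict(lower_okay=0.9, lower_good=1.0, upper_good=1.1, upper_okay=1.2)

for density in (0.85, 0.95, 1.05, 1.15, 1.25):
    score = middle_good_score(density, **density_limits)
    print(f"density {density:.2f} -> score {score:.2f}")
```

Under this sketch, a predicted density of 1.05 scores 1, values between an Okay and a Good limit receive partial credit, and values outside 0.9 to 1.2 score 0.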
- Change the Job name to formulation_ml_optimization_miscible
- Adjust the job settings as needed
- This job requires a CPU host. The job will be completed in about 10 minutes on a CPU host
- If you would like to perform the calculation, click Run. Otherwise, we will import pre-generated results in the next step
- Go to the Results tab
- Click Load Optimization Results
- If you performed the calculation, navigate to the job directory. Otherwise, proceed to import from the provided files
- Choose Section_04 > formulation_ml_optimization_miscible > formulation_ml_optimization_miscible.omlform and click Open
- Go to the MPO Score subtab
The plot shows that the average MPO score is ~0.90, which indicates that the workflow found formulations with the desired density values.
5. Optimization of Formulation Mixtures
A "complex mixture" refers to a blend of multiple mixtures. Complex mixtures, in contrast to simple formulations, contain numerous components that collectively determine product characteristics. As in the previous sections, we will begin by training an ML model, here to predict the viscosity of complex mixtures using the shampoo formulation dataset from Chitre et al. This trained model will then be used to optimize the shampoo formulation.
- Go to Tasks > Materials > Informatics > Formulation Machine Learning
- The Formulation Machine Learning panel opens.
For complex formulations, the panel requires two input CSV files:
- Data of the complex formulations with descriptors and properties (viscosity in this case)
- Composition of each mixture in the formulation
For file #1, the input data is structured as follows:
Each row represents a distinct formulation, defined by multiple components/mixtures and their corresponding compositions. The input also includes relevant descriptors and the property to be predicted.
For file #2, the component-specific information is structured as follows:
The input CSV file includes the SMILES strings and compositions for each ingredient within a mixture component. It contains ingredient and composition information for all mixtures, and the compositions within each mixture should sum to 100%.
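A quick consistency check on the second file can catch mixtures whose compositions do not sum to 100%. The sketch below assumes hypothetical column names ('MIXTURE', 'SMILES', 'COMPOSITION') and made-up rows; the actual headers in formulation_shampoo_group_info.csv may differ, so adjust the names before using it on real data.

```python
import csv
import io
from collections import defaultdict

# Hypothetical group-information table. 'O' is the SMILES string for water;
# the other ingredient names are placeholders, not real SMILES.
GROUP_INFO_CSV = """MIXTURE,SMILES,COMPOSITION
surfactant_blend,INGREDIENT_1,30
surfactant_blend,O,70
thickener_blend,INGREDIENT_2,55
thickener_blend,O,40
"""


def mixtures_not_summing_to_100(csv_text, tolerance=1e-6):
    """Return {mixture name: total} for mixtures whose total is not 100%."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["MIXTURE"]] += float(row["COMPOSITION"])
    return {
        name: total
        for name, total in totals.items()
        if abs(total - 100.0) > tolerance
    }


problems = mixtures_not_summing_to_100(GROUP_INFO_CSV)
print(problems)
```

Here the second placeholder mixture totals 95% and is reported, while the first passes the check.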
- Choose Complex as the Formulation type
- By choosing the Complex option, the panel expects two input CSV files instead of the single input CSV file used for Simple mixtures
- Click Load Training Data
- Navigate to the provided files, choose the Section_05 > formulation_shampoo_input.csv file and click Open
- A prompt appears about the requirement of group information, which will be loaded separately into the panel. Click OK.
- Choose the Section_05 > formulation_shampoo_group_info.csv file and click Open
- The panel is updated with the loaded information.
- Go to the Build tab
- Select the four Featurizers as shown in the figure
- Select All for Machine learning models
- Select log(avg_viscosity) as the Target property
- Choose log(shear_rate) for Formulation descriptors
- Change the Job name to formulation_ml_shampoo
- Adjust the job settings as needed
- This job requires a CPU host. The job will be completed in about 45 minutes on a CPU host
- If you would like to perform the calculation, click Run. Otherwise, we will import pre-generated results.
If you performed the calculation, the results will automatically be incorporated into the panel when the job is complete. Here, we will assume that you are proceeding with the provided files:
- Go to the Performance tab
- Click Load Model
- In the provided files, go to the Section_05 > formulation_ml_shampoo directory, choose the formulation_ml_shampoo.mlform file and click Open
Note: If you performed the calculations yourself, you should expect slight variance in the results.
We can see that the model generalized well on the test set.
We will use the trained ML model for optimizing the formulation with desired viscosity.
- Go to Tasks > Materials > Informatics > Formulation Machine Learning Optimization
- The Formulation Machine Learning Optimization panel opens.
- In the Ingredients section, click Load CSV
- In the provided files, go to the Section_05 directory, choose the shampoo_formulation_optimization_input.csv file and click Open
- A prompt appears about the requirement of group information, which we will also load into the panel. Click OK.
- Choose the Section_05 > formulation_shampoo_group_info.csv file and click Open
The information is loaded into the panel. Within each group, there are multiple components. Clicking on the dropdown arrow shows more information about each mixture. Clicking on each mixture shows the composition and the structure of each component in the mixture.
- Change the Optimization Method to Brute Force
- Water is set to be always present because, in the original shampoo dataset, it usually makes up 60-90% of the mixture. By forcing water to be always present, we guide the optimization algorithm to select reasonable shampoo formulations. The step size of 1% means that the algorithm varies the composition in increments of 1%.
- In the ML Models and Properties section, click Add ML Model
- In the provided files, go to the Section_05 > formulation_ml_shampoo directory, choose the formulation_ml_shampoo.mlform file and click Open
- Ensure log(avg_viscosity) is chosen as the Property
- Change Objective to Middle Good
- Set the values of constraints as follows:
- Lower Good: 1.3
- Lower Okay: 0.5
- Upper Good: 2.0
- Upper Okay: 2.8
- Click OK
- Change the Job name to formulation_ml_optimization_shampoo
- Adjust the job settings as needed
- This job requires a CPU host. The job will be completed in about 5 minutes on a CPU host
- If you would like to perform the calculation, click Run. Otherwise, we will import pre-generated results in the next step
- Go to the Results tab
- Click Load Optimization Results
- If you performed the calculation, navigate to the job directory. Otherwise, proceed to import from the provided files
- Choose Section_05 > formulation_ml_optimization_shampoo > formulation_ml_optimization_shampoo.omlform and click Open
The panel is loaded with results of the calculation. Water is present in all the formulations in varying composition. The formulations are rank ordered according to their MPO scores.
- Go to the MPO Score subtab
The high average MPO score of ~0.91 indicates that the optimization algorithm found formulations with the targeted viscosity.
6. Conclusion and References
In this tutorial, we learned how to optimize formulations using machine learning models for various materials applications. For all examples, we first trained a formulation-property model using the Formulation Machine Learning panel; then, we used the Formulation Machine Learning Optimization panel to optimize various formulations. This tutorial focused on optimizing the LCE of battery electrolyte systems, the density of miscible solvent systems, and the viscosity of complex shampoo formulations using a BASF dataset. The Formulation Machine Learning Optimization panel can be used to suggest formulations while balancing multiple properties, which paves the way to creating unique formulations with fine-tuned properties for broad materials applications.
For further learning:
For introductory content, focused on navigating the Schrödinger Materials Science interface, an Introduction to Materials Science Maestro tutorial is available. Please visit the materials science training website for access to 70+ tutorials. For scientific inquiries or technical troubleshooting, submit a ticket to our Technical Support Scientists at help@schrodinger.com.
For self-paced, asynchronous, online courses in Materials Science modeling, including access to Schrödinger software, please visit the Schrödinger Online Learning portal on our website.
For some related practice, proceed to explore other relevant tutorials:
For more machine learning:
- Machine Learning for Formulations
- Machine Learning for Materials Science
- Periodic Descriptors for Inorganic Solids
- Molecular Dynamics Descriptors for Machine Learning
- Optoelectronics Active Learning
- Machine Learning for Sweetness
- Machine Learning for Ionic Conductivity
- Cheminformatics Machine Learning for Homogeneous Catalysis
- Machine Learning Property Prediction
- Applied Machine Learning for Formulations
- Optimizing Viscosity and Cost in Formulations with Missing Structural Data
For further reading:
- Data-driven electrolyte design for lithium metal anodes, DOI:10.1073/pnas.2214357120
- Formulation Graphs for Mapping Structure-Composition of Battery Electrolytes to Device Performance, DOI:10.1021/acs.jcim.3c01030
- Oxygen Assisted Lithium-Iodine Batteries: Towards Practical Iodine Cathodes and Viable Lithium Metal Protection Strategies, DOI:10.1002/admi.202300058
- Accelerating Formulation Design via Machine Learning: Generating a High-throughput Shampoo Formulations Dataset, DOI:10.1038/s41597-024-03573-w
7. Glossary of Terms
Entry List - a simplified view of the Project Table that allows you to perform basic operations such as selection and inclusion
Included - the entry is represented in the Workspace, the circle in the In column is blue
Project Table - displays the contents of a project and is also an interface for performing operations on selected entries, viewing properties, and organizing structures and data
Recent actions - This is a list of your recent actions, which you can use to reopen a panel, displayed below the Browse row. (Right-click to delete.)
Scratch Project - a temporary project in which work is not saved, closing a scratch project removes all current work and begins a new scratch project
Selected - (1) the atoms are chosen in the Workspace. These atoms are referred to as "the selection" or "the atom selection". Workspace operations are performed on the selected atoms. (2) The entry is chosen in the Entry List (and Project Table) and the row for the entry is highlighted. Project operations are performed on all selected entries
Working Directory - the location where files are saved
Workspace - the 3D display area in the center of the main window, where molecular structures are displayed