Optimization of Formulations Using Machine Learning

Tutorial Created with Software Release: 2025-3

Topics: Consumer Packaged Goods, Energy Capture & Storage, Pharmaceutical Formulations

Methodology: Machine Learning

Products Used: MS Formulation ML, MS Maestro

Tutorial files

166.3 MB

This tutorial is written for use with a 3-button mouse with a scroll wheel.

Words found in the Glossary of Terms are shown like this: Workspace


Abstract:

In this tutorial, we will learn to build machine learning (ML) models to predict distinct properties of formulations and leverage these models to optimize formulations for desired target properties.

Tutorial Content

Introduction

Creating Projects and Importing Structures

Optimizing Battery Electrolyte Formulation to Maximize LCE

Optimizing Density of Miscible Solvents Formulation

Optimization of Formulation Mixtures

Conclusion and References

Glossary of Terms

1. Introduction

Multiparameter optimization (MPO) is the process of simultaneously optimizing multiple properties of a material. It is challenging because the design space is expansive and the target properties often conflict. In particular, formulations, which are mixtures of chemical ingredients with a specified composition ratio, are important in fields like pharmaceuticals, consumer packaged goods, and energy. Formulation products must meet a complex set of criteria to be successful. The challenge lies in the inverse design of formulations given a desired set of properties. For example, in drug design applications, increasing the drug potency might decrease its solubility. MPO focuses on finding the "sweet spot" where multiple properties are all within acceptable ranges. MPO involves defining criteria that assign good or bad scores to each property, then combining these individual scores into an overall "desirability" score. MPO can then suggest formulations that satisfy multiple property criteria, which helps experimentalists identify good formulation candidates.
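As a rough illustration of combining individual scores into an overall desirability score, the sketch below uses a geometric mean. This is only one common choice and is an assumption here; the tutorial does not state which formula the panel uses, and the function name mpo_score is hypothetical.

```python
import math


def mpo_score(desirabilities):
    """Combine per-property desirability scores (each in [0, 1]) into one score.

    Assumption: a geometric mean is used, so a single failing property
    (score 0) zeroes the overall score. The panel's formula may differ.
    """
    if not desirabilities:
        raise ValueError("at least one property score is required")
    if any(d < 0.0 or d > 1.0 for d in desirabilities):
        raise ValueError("scores must lie in [0, 1]")
    return math.prod(desirabilities) ** (1.0 / len(desirabilities))


# Hypothetical two-property example (e.g. potency and solubility scores).
balanced = mpo_score([0.8, 0.8])    # both properties acceptable
lopsided = mpo_score([1.0, 0.25])   # one excellent, one poor
print(balanced, lopsided)
```

With this choice, a candidate that is acceptable on every property outranks one that excels on a single property and fails another, which matches the "sweet spot" idea.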

To enable fast and accurate predictions, we leverage machine learning (ML) models that map ingredient structure and composition to formulation properties. Please refer to the Machine Learning for Formulations tutorial for background on how these models are built using the Formulation Machine Learning panel. Using these ML models, we will optimize formulations toward a desired target (maximize, minimize, or reach a specific target value) with optimization approaches like brute force and Bayesian optimization. Together, ML models and optimization tools facilitate data-driven suggestions for experiments, in particular by tuning ingredient structure, composition, and experimental features (e.g. temperature) to fine-tune formulation properties.

In this tutorial, we will use the Formulation Machine Learning Optimization panel to optimize formulations for distinct materials applications. In Section 3, we will optimize a battery electrolyte formulation to maximize the logarithmic Coulombic efficiency (LCE), which reflects how reversibly a battery can be charged and discharged (see References). In Section 4, we will optimize a formulation of miscible solvents to achieve a target density. Finally, in Section 5, we will optimize a shampoo formulated as a mixture of formulations, using the BASF dataset. The general workflow is as follows:

Figure 1. General workflow of optimizing formulations with the Formulation Machine Learning Optimization panel. The input training CSV file contains ingredient structure, compositions, additional features, and target property. The Formulation Machine Learning panel is used to train ML models to predict formulation properties (e.g. LCE). Then, the Formulation Machine Learning Optimization panel is used to suggest formulation candidates that maximize the LCE of battery electrolytes.

2. Creating Projects and Importing Structures

At the start of the session, change the file path to your chosen Working Directory in MS Maestro to make file navigation easier. Each session in MS Maestro begins with a default Scratch Project, which is not saved. An MS Maestro project stores all your data and has a .prj extension. A project may contain numerous entries corresponding to imported structures, as well as the output of modeling-related tasks. Once a project is saved, it is automatically saved each time a change is made.

Structures can be built in MS Maestro or imported using File > Import Structures (or by drag and drop), and are added to the Entry List and Project Table. The Entry List is located to the left of the Workspace. The Project Table can be accessed with Ctrl+T (Cmd+T) or Window > Project Table if you would like to see an expanded view of your project data.

Double-click the Materials Science icon
- (No icon? See Starting Maestro)

Figure 2-1. Change Working Directory option.

Go to File > Change Working Directory
Find your directory, and click Choose
Pre-generated files are included for running jobs or examining output. Download the zip file here: schrodinger.com/sites/default/files/s3/release/current/Tutorials/zip/formulation_optimization.zip
After downloading the zip file, unzip the contents in your Working Directory for ease of access throughout the tutorial

Figure 2-2. Save Project panel.

Go to File > Save Project As
Change the File name to ml_optimization_tutorial, click Save
- The project is now named ml_optimization_tutorial.prj

3. Optimizing Battery Electrolyte Formulation to Maximize LCE

We will first build an ML model to predict the logarithmic Coulombic efficiency (LCE) for a given composition using the Formulation Machine Learning panel. Then we will use the ML model to optimize the formulation to achieve maximum LCE. For additional examples of building ML models for formulations, refer to the Machine Learning for Formulations and the Applied Machine Learning for Formulations tutorials.
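For orientation, LCE is a log-transformed Coulombic efficiency (CE). The minimal sketch below assumes the definition LCE = -log10(1 - CE), which is our reading of Kim et al. (see References); check the paper before relying on it. The function name is hypothetical and is not part of any Schrödinger API.

```python
import math


def lce_from_ce(ce):
    """Convert a Coulombic efficiency (fraction in [0, 1)) to LCE.

    Assumed definition: LCE = -log10(1 - CE).
    """
    if not 0.0 <= ce < 1.0:
        raise ValueError("CE must be a fraction in [0, 1)")
    return -math.log10(1.0 - ce)


# Under this definition, each additional "nine" of CE adds about one LCE unit.
for ce in (0.9, 0.99, 0.999):
    print(ce, round(lce_from_ce(ce), 3))
```

The log transform spreads out CE values that are crowded near 1, so small efficiency gains at the high end remain visible to the model.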

Figure 3-1. Loading the input files.

Go to Tasks > Materials > Informatics > Formulation Machine Learning
- The Formulation Machine Learning panel opens.
Click Load Training Data
Navigate to the provided files, presumably in your working directory. Choose Section_03 > train_tutorial_battery_electrolytes_input.csv file and click Open
- The panel is populated with the training data

The electrolyte formulation data was obtained from Kim et al.

Figure 3-2. Setting up the panel and running the job.

Go to the Build tab
Select XGBoost, Support Vector Machine, Dense Neural Networks, Random Forest, Elastic Net, and Set2Set for the Machine learning models
Ensure Target property is LCE
Change the Job name to formulation_ml_battery_LCE
Adjust the job settings as needed
- This job requires a CPU host. The job will be completed in about 10 minutes on a CPU host
If you would like to perform the calculation, click Run. Otherwise, we will import pre-generated results in the next step

Figure 3-3. Loading the results.

If you performed the calculation, the results will automatically be incorporated into the panel when the job is complete. Here, we will assume that you are proceeding with the provided files:

Go to the Performance tab
Click Load Model
In the provided files, go to the Section_03 > formulation_ml_battery_LCE directory, choose the formulation_ml_battery_LCE.mlform file and click Open

Note: If you performed the calculations yourself, you should expect slight variance in the results.

Figure 3-4. Viewing the results.

The plot contains predicted versus actual values from the train and test set, with corresponding R² and RMSE values in a table below.

In this case, the model generalized well on the test set.
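The R² and RMSE values reported in the table can be reproduced from predicted and actual values with the standard formulas. The sketch below uses made-up numbers, not the tutorial's results, and plain Python rather than the panel's own code.

```python
import math


def rmse(actual, predicted):
    """Root-mean-square error between actual and predicted values."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Illustrative values only.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(f"R2 = {r_squared(actual, predicted):.3f}, RMSE = {rmse(actual, predicted):.3f}")
```

A model that generalizes well shows test-set R² and RMSE close to the training-set values; a large gap between the two suggests overfitting.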

In the following steps, we will utilize this ML model to optimize the electrolyte composition for maximum LCE.

Close the Formulation Machine Learning panel

Figure 3-5. Opening the Formulation Machine Learning Optimization panel.

Go to Tasks > Materials > Informatics > Formulation Machine Learning Optimization
- The Formulation Machine Learning Optimization panel opens.

Before proceeding, let’s review the inputs for the Formulation Machine Learning Optimization panel.

At the very top, the panel offers four optimization methods:

Bayesian optimization: The Bayesian method is a sequential learning approach that iteratively trains a Gaussian Process model to guide the search toward optimal formulations. This method offers well-defined prediction uncertainties and is well suited to global optimization. However, because the models are trained sequentially and scale poorly with large datasets, it can be computationally expensive and challenging when screening large numbers of formulations.
Brute Force optimization: Brute Force is a grid search method that exhaustively enumerates combinations of ingredients and compositions. This approach is fast and robust for small search spaces, but it can be expensive or even infeasible for large design spaces. Additionally, this approach lacks guidance from previous iterations to identify good candidates.
Random optimization: Similar to Brute Force optimization, the Random approach randomly selects formulations from an exhaustive list. Instead of grid searching across composition space, random optimization uses the differential evolution algorithm to find the best compositions and can yield faster results than Brute Force. This method generally has similar disadvantages to the Brute Force method.
Genetic Algorithm optimization: Genetic Algorithm uses natural selection principles to select the best formulations, similar to evolution in nature. This approach is robust to noisy and multimodal problems as well as scalable to large design spaces. However, it may converge to suboptimal solutions based on the input conditions.
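To see why exhaustive enumeration becomes infeasible for large design spaces, we can count the composition grid points for a single fixed set of ingredients. The sketch below is a standard stars-and-bars count, not the panel's implementation; it assumes compositions are multiples of a step size, sum to 100%, and may be zero.

```python
from math import comb


def composition_grid_size(n_components, step_percent):
    """Number of ways to split 100% among n components in multiples of a step.

    Stars-and-bars count, with zero fractions allowed.
    """
    if n_components < 1 or step_percent < 1 or 100 % step_percent != 0:
        raise ValueError("need n_components >= 1 and a step that divides 100")
    steps = 100 // step_percent
    return comb(steps + n_components - 1, n_components - 1)


# Growth of the grid with the number of components at a 1% step.
for n in (2, 3, 4, 5):
    print(n, composition_grid_size(n, 1))
```

This count applies to one ingredient combination; the total search space multiplies it by the number of ingredient combinations, which is why guided methods become attractive as the component list grows.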

The panel is structured into two primary sections: Ingredients and ML Models and Properties. The Ingredients section allows for precise component specification, including grouping and constraints on the minimum and maximum number of components. For instance, in our battery electrolyte example, components are categorized into 'salt' and 'solvent' groups. We can also designate components as default inclusions for all formulations. If the Brute Force optimization approach is selected, compositional constraints like minimum and maximum compositions can be set. An example format for the battery electrolyte dataset is shown below:

Components are defined using SMILES strings, with 'MIN_NUM_COMPONENTS' and 'MAX_NUM_COMPONENTS' columns specifying the allowed component count within their respective groups. The 'REQUIRED' column, with a 'True' value, designates components for inclusion in all formulations. Components are categorized into groups based on their properties, as indicated in the 'GROUP' column. The spreadsheet shows that we want the optimization to choose one salt and either a single-solvent or a binary-solvent system. Because the solvent with SMILES 'FC(F)COCCOCC(F)(F)' has 'True' in the 'REQUIRED' column, it always occupies one of the two available solvent slots.
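The column layout described above can be sketched with Python's csv module. The 'GROUP', 'MIN_NUM_COMPONENTS', 'MAX_NUM_COMPONENTS', and 'REQUIRED' names come from the text; the 'SMILES' column name and every row except the required solvent are illustrative assumptions and are not the contents of form_battery.csv.

```python
import csv
import io
from collections import defaultdict

# "SMILES" is an assumed column name; the other four names come from the text.
FIELDS = ["SMILES", "GROUP", "MIN_NUM_COMPONENTS", "MAX_NUM_COMPONENTS", "REQUIRED"]

# Illustrative rows only: one salt, the required solvent, two optional solvents.
ROWS = [
    ["[Li+].F[P-](F)(F)(F)(F)F", "Salt", 1, 1, False],
    ["FC(F)COCCOCC(F)(F)", "Solvent", 1, 2, True],
    ["COCCOC", "Solvent", 1, 2, False],
    ["O=C1OCCO1", "Solvent", 1, 2, False],
]


def write_csv(rows):
    """Serialize the header and rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(FIELDS)
    writer.writerows(rows)
    return buffer.getvalue()


def summarize(csv_text):
    """Collect per-group counts, component limits, and required SMILES."""
    groups = defaultdict(lambda: {"count": 0, "min": None, "max": None, "required": []})
    for row in csv.DictReader(io.StringIO(csv_text)):
        info = groups[row["GROUP"]]
        info["count"] += 1
        info["min"] = int(row["MIN_NUM_COMPONENTS"])
        info["max"] = int(row["MAX_NUM_COMPONENTS"])
        if row["REQUIRED"].strip().lower() == "true":
            info["required"].append(row["SMILES"])
    return dict(groups)


summary = summarize(write_csv(ROWS))
print(summary)
```

Summarizing the file this way before loading it is a quick check that each group carries the intended limits and that only the intended components are marked as required.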

The ML Models and Properties section defines the machine learning model used for formulation optimization and specifies the target property, including maximization, minimization, or a desired value range.

Figure 3-6. Loading the input file.

In the Ingredients section, click Load CSV
In the provided files, go to the Section_03 directory, choose the form_battery.csv file and click Open

Figure 3-6. Viewing the components list.

The panel reflects the provided data. Notably, FC(F)COCCOCC(F)(F) is designated as 'REQUIRED' with a 'True' value, indicating its inclusion in all formulations.

The panel also displays component counts per group: 13 for 'Salt' and 46 for 'Solvent.' In addition to the .csv file input, users can directly adjust minimum/maximum component counts per group using the 'At least' and 'At most' text boxes within the panel. Note that changes in the panel do not modify the original .csv input file.

Figure 3-7. Loading the ML model.

In the ML Models and Properties section, click Add ML Model
In the provided files, go to the Section_03 > formulation_ml_battery_LCE directory, choose the formulation_ml_battery_LCE.mlform file and click Open

Figure 3-8. Setting the range of target property.

In the popup window, ensure Properties is set to LCE
Choose Maximize as the Objective
- Alternatively, you can adjust the range of 'Good' or 'Okay' values by either moving the dashed sliders or by directly entering the corresponding numerical values.
Click OK

Figure 3-9. Running the job.

Increase the Number of Trials to 500
Check Stop Early
Change the Job name to formulation_ml_optimization_battery
Adjust the job settings as needed
- This job requires a CPU host. The job will be completed in about 10 minutes on a CPU host
If you would like to perform the calculation, click Run. Otherwise, we will import pre-generated results in the next step

Figure 3-10. Loading the results.

Go to the Results tab
Click Load Optimization Results
- If you performed the calculation, navigate to the job directory. Otherwise, proceed to import from the provided files
Choose Section_03 > formulation_ml_optimization_battery > formulation_ml_optimization_battery.omlform and click Open

Figure 3-11. Viewing the output formulations.

The panel is loaded with results of the calculation. Each formulation is shown in the list along with the composition of each component and corresponding MPO score. By default, the formulations are rank ordered according to their MPO scores.

Figure 3-12. MPO score.

Go to the MPO Score subtab

This plot depicts the Average MPO score progression for the top ten formulations during optimization. The iteration process was terminated prior to 500 iterations as the average score converged, showing no further significant improvement. Proceed to explore the other options in the panel.
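The early termination described above can be pictured as a patience-based convergence check on the average MPO score. The tutorial does not specify the criterion behind Stop Early, so the rule, the patience and min_delta parameters, and the score trace below are all assumptions for illustration.

```python
def should_stop(history, patience=20, min_delta=1e-3):
    """Return True when the best score has not improved by at least
    min_delta over the last `patience` iterations (assumed rule)."""
    if len(history) <= patience:
        return False
    best_before = max(history[:-patience])
    best_recent = max(history[-patience:])
    return best_recent - best_before < min_delta


def run_with_early_stop(scores, patience, min_delta=1e-3):
    """Feed scores one iteration at a time and report where the run stops."""
    history = []
    for iteration, score in enumerate(scores, start=1):
        history.append(score)
        if should_stop(history, patience, min_delta):
            return iteration
    return len(scores)


# Synthetic trace: the score climbs for 8 iterations, then plateaus at 0.8.
trace = [0.1 * i for i in range(1, 9)] + [0.8] * 30
stopped_at = run_with_early_stop(trace, patience=5)
print(f"stopped after {stopped_at} of {len(trace)} iterations")
```

A larger patience makes the run less likely to stop on a temporary plateau, at the cost of more model evaluations.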

Close the Formulation Machine Learning Optimization panel.

4. Optimizing Density of Miscible Solvents Formulation

We will now apply the optimization steps outlined in the preceding section to optimize the density of miscible solvent formulations, targeting a defined range of values. We will use a pre-trained ML model. The steps to do that are similar to those mentioned in Section 3. Please refer to the Machine Learning for Formulations tutorial for detailed steps to build the model.

Figure 4-1. Opening the Formulation Machine Learning Optimization panel.

Go to Tasks > Materials > Informatics > Formulation Machine Learning Optimization
- The Formulation Machine Learning Optimization panel opens.

Figure 4-2. Loading the input file.

In the Ingredients section, click Load CSV
In the provided files, go to the Section_04 directory, choose the optimize_input.csv file and click Open

Figure 4-3. Loading the ML model.

Change the At most number of components to 3
- This will use at least 1 solvent and at most 3 solvents for each formulation
In the ML Models and Properties section, click Add ML Model
In the provided files, go to the Section_04 > formulation_ml_miscible_density directory, choose the formulation_ml_miscible_density.mlform file and click Open

Figure 4-4. Setting the target property range.

Change Objective to Middle Good
Change the values of constraints as follows:
- Lower Good: 1.0
- Lower Okay: 0.9
- Upper Good: 1.10
- Upper Okay: 1.2
Click OK

This steers the optimization toward formulations whose density falls within a narrowly defined range.
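One way to read the four Middle Good constraints is as a trapezoidal desirability function: a score of 1 between Lower Good and Upper Good, falling to 0 at Lower Okay and Upper Okay. The piecewise-linear shape below is an assumption made for illustration; the tutorial does not document how the panel interpolates between the Good and Okay limits.

```python
def middle_good(value, lower_okay, lower_good, upper_good, upper_okay):
    """Trapezoidal desirability in [0, 1] (assumed piecewise-linear shape)."""
    if not lower_okay < lower_good <= upper_good < upper_okay:
        raise ValueError("expected lower_okay < lower_good <= upper_good < upper_okay")
    if lower_good <= value <= upper_good:
        return 1.0
    if lower_okay < value < lower_good:
        return (value - lower_okay) / (lower_good - lower_okay)
    if upper_good < value < upper_okay:
        return (upper_okay - value) / (upper_okay - upper_good)
    return 0.0


# Density limits entered in this section.
limits = dict(lower_okay=0.9, lower_good=1.0, upper_good=1.10, upper_okay=1.2)
for density in (0.85, 0.95, 1.05, 1.15, 1.25):
    print(density, round(middle_good(density, **limits), 3))
```

Under this reading, a predicted density of 1.05 scores 1, while 0.95 and 1.15 receive partial credit and values outside 0.9 to 1.2 score 0.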

Figure 4-5. Running the job.

Change the Job name to formulation_ml_optimization_miscible
Adjust the job settings as needed
- This job requires a CPU host. The job will be completed in about 10 minutes on a CPU host
If you would like to perform the calculation, click Run. Otherwise, we will import pre-generated results in the next step

Figure 4-6. Loading the results.

Go to the Results tab
Click Load Optimization Results
- If you performed the calculation, navigate to the job directory. Otherwise, proceed to import from the provided files
Choose Section_04 > formulation_ml_optimization_miscible > formulation_ml_optimization_miscible.omlform and click Open

Figure 4-7. MPO score.

Go to the MPO Score subtab

The plot shows that the average MPO score is ~0.90, which indicates that the workflow found formulations with the desired density values.

5. Optimization of Formulation Mixtures

A "complex mixture" refers to a blend of multiple mixtures. Complex mixtures, in contrast to simple formulations, contain numerous components that collectively determine product characteristics. Similar to the previous sections, we will begin by training an ML model to predict the viscosity of the complex mixtures using the shampoo formulation dataset from Chitre et al. This trained model will then be utilized to optimize the shampoo formulation.

Figure 5-1. Opening the Formulation Machine Learning panel.

Go to Tasks > Materials > Informatics > Formulation Machine Learning
- The Formulation Machine Learning panel opens.

For complex formulations, the panel requires two input CSV files:

1. Data of complex formulations with descriptors and properties (viscosity in this case)
2. Composition of each mixture in the formulation

For file #1, the input data is structured as follows:

Each row represents a distinct formulation, defined by multiple components/mixtures and their corresponding compositions. The input also includes relevant descriptors and the property to be predicted.

For file #2, the component-specific information is structured as follows:

The input CSV file above includes the SMILES strings and compositions for each ingredient within a mixture component. It contains ingredient and composition information for all mixtures, and the compositions within each mixture should sum to 100%.
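Before loading the group information file, it can help to confirm that each mixture's compositions sum to 100%. The sketch below works on in-memory rows; the mixture names, SMILES, and percentages are made up, and the three-field row layout is an assumption rather than the actual layout of formulation_shampoo_group_info.csv.

```python
from collections import defaultdict


def find_bad_mixtures(rows, tolerance=1e-6):
    """Return {mixture: total} for mixtures whose percentages do not sum to 100.

    Each row is (mixture_name, smiles, percent); this layout is assumed.
    """
    totals = defaultdict(float)
    for mixture, _smiles, percent in rows:
        totals[mixture] += percent
    return {name: total for name, total in totals.items()
            if abs(total - 100.0) > tolerance}


# Hypothetical data: mixture_A is complete, mixture_B is missing 5%.
rows = [
    ("mixture_A", "O", 70.0),
    ("mixture_A", "CCO", 30.0),
    ("mixture_B", "O", 80.0),
    ("mixture_B", "OCC(O)CO", 15.0),
]
bad = find_bad_mixtures(rows)
print(bad)
```

An empty result means every mixture is fully specified; any entry returned points to a mixture whose ingredient list should be corrected before training.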

Figure 5-2. Loading the input data.

Choose Complex as the Formulation type
- By choosing the Complex option, the panel expects the two input CSV files instead of one input CSV file for Simple mixtures
Click Load Training Data
Navigate to the provided files and choose Section_05 > formulation_shampoo_input.csv file and click Open
- A prompt appears about the requirement of group information, which will be loaded separately into the panel. Click OK.

Figure 5-3. Loading the group information.

Choose Section_05 > formulation_shampoo_group_info.csv file and click Open
- The panel is updated with the loaded information.

Figure 5-4. Setting up the Featurizers.

Go to the Build tab
Select the four Featurizers as shown in the figure

Figure 5-5. Setting up the descriptors and running the job.

Select All for Machine learning models
Select log(avg_viscosity) as the Target property
Choose log(shear_rate) for Formulation descriptors
Change the Job name to formulation_ml_shampoo
Adjust the job settings as needed
- This job requires a CPU host. The job will be completed in about 45 minutes on a CPU host
If you would like to perform the calculation, click Run. Otherwise, we will import pre-generated results.

Figure 5-6. Importing the results.

If you performed the calculation, the results will automatically be incorporated into the panel when the job is complete. Here, we will assume that you are proceeding with the provided files:

Go to the Performance tab
Click Load Model
In the provided files, go to the Section_05 > formulation_ml_shampoo directory, choose the formulation_ml_shampoo.mlform file and click Open

Note: If you performed the calculations yourself, you should expect slight variance in the results.

Figure 5-7. Viewing the results.

We can see that the model generalized well on the test set.

We will use the trained ML model to optimize the formulation for a desired viscosity.

Figure 5-8. Opening the Formulation Machine Learning Optimization panel.

Go to Tasks > Materials > Informatics > Formulation Machine Learning Optimization
- The Formulation Machine Learning Optimization panel opens.

Figure 5-9. Loading the input file.

In the Ingredients section, click Load CSV
In the provided files, go to the Section_05 directory, choose the shampoo_formulation_optimization_input.csv file and click Open
- A prompt appears about the requirement of group information. We will load that data also into the panel. Click OK.

Figure 5-10. Loading the group information.

Choose Section_05 > formulation_shampoo_group_info.csv file and click Open

Figure 5-11. Viewing the components of mixtures.

The information is loaded into the panel. Within each group, there are multiple components. Clicking on the dropdown arrow shows more information about each mixture. Clicking on each mixture shows the composition and the structure of each component in the mixture.

Figure 5-12. Setting up the optimization method.

Change the Optimization Method to Brute Force
- Water is set to be always present because, in the original shampoo dataset, it usually makes up 60-90% of the mixture. By forcing water to be always present, we guide the optimization algorithm to select reasonable shampoo formulations. The step size of 1% means that the algorithm varies each composition in increments of 1%.
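The effect of the step size can be sketched by enumerating a composition grid directly. The code below is an illustration of grid search and not the panel's implementation; it assumes integer percentages, a water range of 60-90% taken from the dataset description above, a step that divides the water bounds and 100, and that every other component receives at least one step.

```python
def split_total(total, parts, step):
    """Yield tuples of `parts` percentages, each >= step, that sum to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(step, total - step * (parts - 1) + 1, step):
        for rest in split_total(total - first, parts - 1, step):
            yield (first,) + rest


def enumerate_compositions(n_others, step, water_min=60, water_max=90):
    """Yield (water, other_1, ..., other_n) percentages that sum to 100.

    Assumes `step` divides water_min, water_max, and 100.
    """
    for water in range(water_min, water_max + 1, step):
        for others in split_total(100 - water, n_others, step):
            yield (water,) + others


coarse = list(enumerate_compositions(n_others=2, step=10))
fine = list(enumerate_compositions(n_others=2, step=1))
print(len(coarse), len(fine))
print(coarse)
```

Refining the step from 10% to 1% raises the number of candidate compositions for water plus two other components from 6 to 744, which shows how quickly the grid grows as the step shrinks.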

Figure 5-13. Loading the ML model.

In the ML Models and Properties section, click Add ML Model
In the provided files, go to the Section_05 > formulation_ml_shampoo directory, choose the formulation_ml_shampoo.mlform file and click Open

Figure 5-14. Choosing the target property range.

Ensure log(avg_viscosity) is chosen as the Property
Change Objective to Middle Good
Set the values of constraints as follows:
- Lower Good: 1.3
- Lower Okay: 0.5
- Upper Good: 2.0
- Upper Okay: 2.8
Click OK

Figure 5-15. Running the job.

Change the Job name to formulation_ml_optimization_shampoo
Adjust the job settings as needed
- This job requires a CPU host. The job will be completed in about 5 minutes on a CPU host
If you would like to perform the calculation, click Run. Otherwise, we will import pre-generated results in the next step

Figure 5-16. Loading the results.

Go to the Results tab
Click Load Optimization Results
- If you performed the calculation, navigate to the job directory. Otherwise, proceed to import from the provided files
Choose Section_05 > formulation_ml_optimization_shampoo > formulation_ml_optimization_shampoo.omlform and click Open

Figure 5-17. Viewing the formulation information.

The panel is loaded with results of the calculation. Water is present in all the formulations in varying composition. The formulations are rank ordered according to their MPO scores.

Figure 5-18. MPO score.

Go to the MPO Score subtab

The high average MPO score of ~0.91 indicates that the optimization algorithm has produced formulations with the targeted viscosity.

6. Conclusion and References

In this tutorial, we learned how to optimize formulations using machine learning models for various materials applications. Each example started from a formulation-property model built with the Formulation Machine Learning panel, either trained during the tutorial or provided pre-trained; we then used the Formulation Machine Learning Optimization panel to optimize the formulations. This tutorial focused on optimizing LCE of battery electrolyte systems, density of miscible solvent systems, and viscosity of complex shampoo formulations using a BASF dataset. The Formulation Machine Learning Optimization panel can be used to suggest formulations while balancing multiple properties, which paves the way toward creating unique formulations with fine-tuned properties for broad materials applications.

For further learning:

For introductory content, focused on navigating the Schrödinger Materials Science interface, an Introduction to Materials Science Maestro tutorial is available. Please visit the materials science training website for access to 70+ tutorials. For scientific inquiries or technical troubleshooting, submit a ticket to our Technical Support Scientists at help@schrodinger.com.

For self-paced, asynchronous, online courses in Materials Science modeling, including access to Schrödinger software, please visit the Schrödinger Online Learning portal on our website.

For some related practice, proceed to explore other relevant tutorials:

For more machine learning:

For further reading:

Data-driven electrolyte design for lithium metal anodes, DOI:10.1073/pnas.2214357120
Formulation Graphs for Mapping Structure-Composition of Battery Electrolytes to Device Performance, DOI:10.1021/acs.jcim.3c01030
Oxygen Assisted Lithium-Iodine Batteries: Towards Practical Iodine Cathodes and Viable Lithium Metal Protection Strategies, DOI:10.1002/admi.202300058
Accelerating Formulation Design via Machine Learning: Generating a High-throughput Shampoo Formulations Dataset, DOI:10.1038/s41597-024-03573-w

7. Glossary of Terms

Entry List - a simplified view of the Project Table that allows you to perform basic operations such as selection and inclusion

Included - the entry is represented in the Workspace, the circle in the In column is blue

Project Table - displays the contents of a project and is also an interface for performing operations on selected entries, viewing properties, and organizing structures and data

Recent actions - This is a list of your recent actions, which you can use to reopen a panel, displayed below the Browse row. (Right-click to delete.)

Scratch Project - a temporary project in which work is not saved, closing a scratch project removes all current work and begins a new scratch project

Selected - (1) the atoms are chosen in the Workspace. These atoms are referred to as "the selection" or "the atom selection". Workspace operations are performed on the selected atoms. (2) The entry is chosen in the Entry List (and Project Table) and the row for the entry is highlighted. Project operations are performed on all selected entries

Working Directory - the location where files are saved

Workspace - the 3D display area in the center of the main window, where molecular structures are displayed