Polymer Descriptors for Machine Learning
Tutorial Created with Software Release: 2025-2
Topics: Consumer Packaged Goods, Informatics and Team Collaboration, Pharmaceutical Formulations, Polymeric Materials
Methodology: All-Atom Molecular Dynamics, Machine Learning
Products Used: AutoQSAR, MS Informatics, MS Maestro
This tutorial is written for use with a 3-button mouse with a scroll wheel.
Words found in the Glossary of Terms are shown like this: Workspace (the 3D display area in the center of the main window, where molecular structures are displayed)
Abstract:
In this tutorial, we will learn to generate descriptors for polymers, which can be used to build machine learning models.
Tutorial Content
1. Introduction to Polymer Descriptors
Modeling polymers is challenging because the number of monomers is arbitrary, which affects the chain length, molecular mass, and large-scale properties. As an alternative to modeling the entire polymer, analyzing the monomer structure alone with machine learning (ML) methods can accelerate the design of new polymers. ML models can accurately predict bulk polymer properties a few orders of magnitude faster than ab initio or molecular dynamics calculations and trial-and-error experimental approaches.
To develop ML models for polymer systems, molecular descriptors that numerically encode chemical information about the polymer are required. To account for the arbitrary number of monomers in a polymer chain, the Materials Science Maestro suite enables the generation of polymer descriptors that are independent of chain length. Examples of these descriptors include the fraction of rotatable bonds, the fraction of fused ring atoms, and the fraction of atoms in the backbone. Once descriptors are generated, they can be used in a machine learning model to predict bulk polymer properties, such as glass transition temperature (Tg), dielectric properties, and mechanical behavior. Such an approach can be more efficient than either performing experiments on a wide range of materials or running a large number of costly simulations.
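To see why fraction-type descriptors do not depend on chain length, consider the minimal Python sketch below. It is an illustration only, not how the Polymer Descriptors panel computes its values: the RepeatUnitCounts class, the chain_counts helper, and the atom and bond counts are all hypothetical, and end groups are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepeatUnitCounts:
    """Hypothetical per-repeat-unit counts (not a Schrödinger data structure)."""
    heavy_atoms: int
    backbone_atoms: int
    bonds: int            # includes the one bond linking to the next unit
    rotatable_bonds: int


def chain_counts(unit: RepeatUnitCounts, n_units: int) -> RepeatUnitCounts:
    """Counts for an idealized chain of n_units repeat units, ignoring end groups."""
    return RepeatUnitCounts(
        heavy_atoms=unit.heavy_atoms * n_units,
        backbone_atoms=unit.backbone_atoms * n_units,
        bonds=unit.bonds * n_units,
        rotatable_bonds=unit.rotatable_bonds * n_units,
    )


def fraction_descriptors(counts: RepeatUnitCounts) -> dict:
    """Ratios of counts: the chain-length factor cancels out."""
    return {
        "fraction_backbone_atoms": counts.backbone_atoms / counts.heavy_atoms,
        "fraction_rotatable_bonds": counts.rotatable_bonds / counts.bonds,
    }


# Illustrative counts only; they are not taken from the tutorial data set.
example_unit = RepeatUnitCounts(heavy_atoms=8, backbone_atoms=2, bonds=9, rotatable_bonds=3)

for n in (1, 10, 1000):
    chain = chain_counts(example_unit, n)
    print(n, chain.heavy_atoms, fraction_descriptors(chain))
```

The heavy atom count grows with the number of repeat units, while both fractions stay the same for every chain length. That is the property that makes such descriptors usable when the chain length is arbitrary.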
This tutorial provides step-by-step instructions to calculate polymer descriptors using the Materials Science Maestro interface. This tutorial also demonstrates the utility of these descriptors by constructing ML models with AutoQSAR for predicting Tg. Finally, the best ML models are used to predict Tg for a small set of polymers that were not used during model training. The overall workflow is summarized in the figure below:
Figure 1. Tutorial workflow showing the conversion of SMILES to monomer structures, the polymer descriptor and AutoQSAR panel, and the output property predictions.
For background on the Polymer Descriptors panel, which is described in this tutorial, see the help documentation.
For more information about building machine learning models in Materials Science Maestro, see the introductory sections of the Machine Learning for Materials Science tutorial. To learn about using pre-built machine learning models to predict polymer properties, please refer to the Machine Learning Property Prediction tutorial.
For an introduction to building (with the Polymer Builder) and equilibrating (with Molecular Dynamics) homopolymers in Materials Science Maestro, see the Building, Equilibrating and Analyzing Amorphous Polymers tutorial.
2. Creating Projects and Importing Structures
At the start of the session, change the file path to your chosen Working Directory (the location where files are saved) in MS Maestro to make file navigation easier. Each session in MS Maestro begins with a default Scratch Project (a temporary project in which work is not saved; closing a scratch project removes all current work and begins a new scratch project). An MS Maestro project stores all your data and has a .prj extension. A project may contain numerous entries corresponding to imported structures, as well as the output of modeling-related tasks. Once a project is saved, it is automatically saved each time a change is made.
Structures can be built in MS Maestro or can be imported using File > Import Structures (or by drag and drop), and are added to the Entry List (a simplified view of the Project Table that allows you to perform basic operations such as selection and inclusion) and the Project Table (which displays the contents of a project and is also an interface for performing operations on selected entries, viewing properties, and organizing structures and data). The Entry List is located to the left of the Workspace. The Project Table can be accessed by Ctrl+T (Cmd+T) or Window > Project Table if you would like to see an expanded view of your project data.
- Double-click the Materials Science icon
- (No icon? See Starting Maestro)
- Go to File > Change Working Directory
- Find your directory, and click Choose
- Pre-generated files are included for running jobs or examining output. Download the zip file here: schrodinger.com/sites/default/files/s3/release/current/Tutorials/zip/polymer_descriptors.zip
- After downloading the zip file, unzip the contents in your Working Directory for ease of access throughout the tutorial
- Go to File > Save Project As
- Change the File name to polymer_descriptors_tutorial, click Save
- The project is now named polymer_descriptors_tutorial.prj
In this tutorial, we will use a data set of 155 polymers, defined by SMILES strings with the head and tail denoted with the SMILES expression, [At] (see the .csv in the provided files). Let’s import these structures now:
- Go to File > Import Structures
- Change Files of type to Smiles (*.smi*.csv…)
- Select monomers_training.csv from the provided tutorial files
- Click Open
- The Import SMILES panel pops up
Note: If the head and tail of each monomer are not denoted in the SMILES strings, they can instead be marked using the Mark Monomer Head and Tail panel.
- For SMILES Column: choose smiles
- For ENTRY TITLE Column: choose id
- Ensure Discard any additional properties is unchecked
- Click OK
The entry list is updated to include the 155 entries. Feel free to stylize and visualize any of the provided structures.
Note: Hydrogen atoms are not added when importing from SMILES. This will not impact this exercise, but it is good to be aware of if running any quantum mechanical calculations.
Each polymer also has a Tg value as determined and reported in the literature (Bicerano, J. Prediction of Polymer Properties. Marcel Dekker Inc.: New York, 1996 & Afzal, M.A.F. et al. ACS Appl. Polym. Mater. 2021, 3, 2, 620-630). These can be visualized in the Project Table under the Tg property.
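Before importing your own data, it can help to confirm that every SMILES string carries both a head and a tail marker. The sketch below is one way to do that with the Python standard library, assuming a CSV with the id and smiles columns used in the import steps above. The sample rows and the check_monomer_rows helper are hypothetical and are not taken from monomers_training.csv.

```python
import csv
import io

HEAD_TAIL_MARKER = "[At]"  # head/tail marker used in this tutorial's SMILES strings

# Hypothetical rows for illustration; the last one is deliberately malformed.
SAMPLE_CSV = """id,smiles
polyethylene,[At]CC[At]
polystyrene,[At]CC([At])c1ccccc1
bad_example,CC[At]
"""


def check_monomer_rows(csv_text):
    """Return (valid_ids, problems), where problems lists (id, marker_count)."""
    valid_ids, problems = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        n_markers = row["smiles"].count(HEAD_TAIL_MARKER)
        if n_markers == 2:  # exactly one head and one tail
            valid_ids.append(row["id"])
        else:
            problems.append((row["id"], n_markers))
    return valid_ids, problems


valid_ids, problems = check_monomer_rows(SAMPLE_CSV)
print("valid:", valid_ids)
print("problems:", problems)
```

To check a real file, the text could be read with open(path, newline="") and passed to the same function. This only counts markers; it does not verify that the SMILES strings are chemically valid.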
3. Generating Polymer Descriptors
To create an ML model we will first need to generate descriptors for the structures using the Polymer Descriptors panel.
- Ensure that all 155 entries are selected (use Shift + Click or create an entry group)
- Go to Tasks > Materials > Informatics > Polymer
- The Polymer Descriptors panel opens
- Ensure that Polymer fingerprints and Polymer descriptors are checked
- The polymer fingerprints, constructed using RDKit, are topological torsion fingerprints that are invariant to the number of repeat units
- The polymer descriptors, also constructed using RDKit, are likewise invariant to the number of repeat units; they include backbone length and fraction of fused rings
- Change the Job name to polymer_descriptors_train
- Adjust the job settings as needed
- This job requires a CPU host. The job can be completed in about 1 minute.
For a complete description of the Polymer Descriptors panel, see the help documentation.
- Close the Polymer Descriptors panel
When the job finishes, a new entry group entitled polymer_descriptors_train-out1 (155) is added to the Entry List. The group contains all of the same structures, but now each entry is also associated with the various descriptors.
We can see these descriptors by opening the Project Table.
The properties may not appear by default. To add some of the properties as columns in the Project Table:
- Go to the Property Tree, expand All > Materials Science > Secondary, and select any of the properties you want to display
Note: You can export directly from the Project Table to spreadsheet form if needed by clicking Data > Export > Spreadsheet
- Close the Project Table before proceeding to the next section
4. Building a Machine Learning Model Using AutoQSAR
In the following section, we want to build AutoQSAR models to predict the Tg of a homopolymer using the polymer descriptors generated in the previous section.
- Ensure that the entire polymer_descriptors_train-out1 (155) entry group is selected
- Be sure you have the new entry group selected, containing all of the descriptor data, as opposed to the original structures
- Go to Tasks > Materials > Informatics > AutoQSAR
- The AutoQSAR panel opens
If you are interested in more background for utilizing the AutoQSAR panel, see the Machine Learning for Materials Science tutorial.
- Ensure that Build model is checked
- Ensure that for Use structures from, Project Table (selected entries) is chosen
- Change the Property to be fit dropdown to Tg (canvas)
- This is the glass transition temperature (Tg)
- Maintain 75% for the Random training set
- This sets how the data are divided between the training and test sets: 75% of the data is used to train the model and the remaining 25% is used to test it
- With this relatively small data set, the 75:25 split ensures that there is significantly more data in the training set than the test set, but still enough data in the test set to assess model performance
- Input 30 for Number of models to keep
- AutoQSAR tries a variety of ML models on a variety of different training and test splits. Increasing to 30 models helps ensure that we sample over a variety of different training:test splits, which reduces the risk of serendipitously picking a split in which all of the ML models perform well
- Click Advanced Options
- Maintain 50 for the Number of models to build for each model type
- Change the Maximum allowed correlation between any pair of individual variables to 0.99
- A higher correlation threshold allows AutoQSAR to use descriptors that are strongly linearly correlated with each other, which may yield better results
- Uncheck Binary fingerprints and Numeric descriptors
- These default descriptors are not relevant for polymer systems
- Check Other Properties from and click Structures...
- From the Show family dropdown, select Materials Science
- The descriptors calculated in Section 3 appear in the Available properties list
- Click Select All and Add
- Then click OK to save the Advanced Options
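Two of the settings above, the random training set percentage and the maximum allowed correlation between variables, can be sketched in plain Python. This is a conceptual illustration under stated assumptions, not AutoQSAR's implementation: the function names, the rounding of the training set size, and the greedy filtering order are all hypothetical, and the example descriptor columns are made up.

```python
import math
import random


def random_split(entry_ids, train_fraction=0.75, seed=0):
    """Shuffle entry IDs and split them into (train, test) lists."""
    ids = list(entry_ids)
    random.Random(seed).shuffle(ids)
    n_train = round(train_fraction * len(ids))  # rounding rule is an assumption
    return ids[:n_train], ids[n_train:]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def drop_correlated(columns, max_corr=0.99):
    """Greedily keep columns whose |r| with every already-kept column is <= max_corr."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= max_corr for k in kept):
            kept.append(name)
    return kept


# 155 entries split 75:25, as in this tutorial.
train, test = random_split(range(155))
print(len(train), len(test))

# Made-up descriptor columns: "b" is an exact multiple of "a".
columns = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [1, 0, 2, 1]}
print(drop_correlated(columns, max_corr=0.99))
```

Changing the seed produces a different split, which is the idea behind keeping models built on several splits. Raising max_corr toward 1 lets more nearly redundant descriptors through the filter.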
5. Viewing the Machine Learning Model and Predicting
We can proceed to view the machine learning models that were generated, and use these to make predictions on a small data set.
When the job is complete, note that no new entry group is added to the entry list.
- Return to Tasks > Materials > Informatics > AutoQSAR
- The AutoQSAR panel opens
- For Choose task, switch to View model and make prediction
- From the dropdown, select qsar_build_polymers.qzip
- The Model Report section of the panel shows the scores for the best models
- Click on the + button to expand the model report, which shows the performance of the best models
- Click to Highlight the best model (the first row by default)
- Click the Report Details button
The Report Details pop-up shows the scores, important features, predicted values, and errors for the training and test data.
- Click Scatter Plot
The scatter plot allows further visualization of the model.
Note: You can Save Image of the scatter plot if you would like to save a .png file.
- Feel free to visualize any of the other models. When you are finished, close the Scatter Plot window and all of the other windows associated with the AutoQSAR panel
Provided with the tutorial files are ten additional structures with known Tg values. We will now proceed to import these structures, calculate descriptors and then test how the model performs in predicting their Tg values.
- Go to File > Import Structures
Repeat the importing steps used at the end of Section 2, this time importing the Section_05 > monomers_test.csv file.
Ten new entries are added to the entry list.
- Repeat all of the steps in Section 3 for these ten entries to generate the descriptors
- Name the job polymer_descriptors_test_set. It should complete very quickly on a CPU host
- Check the Project Table to see that you have generated descriptors
- Ensure that the new entry group is selected: polymer_descriptors_test_set-out1 (10)
- Be sure you have the entry group selected containing all of the descriptor data, as opposed to the original structures
- Return to Tasks > Materials > Informatics > AutoQSAR
- The AutoQSAR panel opens
- Ensure that Choose task is still set to View model and make prediction
- Ensure that from the dropdown, qsar_build_polymers.qzip is selected
- In the Make Prediction section of the panel, maintain the defaults:
- Keep the entry group selected
- Use All models
- Maintain Y for the AutoQSAR Prediction. This is going to be the output property name: Pred Y
- Change the Job name to qsar_test_polymers_ten
- Adjust the job settings as needed
- This job requires a CPU host. The job can be completed in about 2 minutes on a 12 CPU host
- Click Run
Note: Consensus prediction averages the results of the retained models, which can often increase the accuracy of the predictions.
When the job is complete, a new entry group is added to the entry list entitled qsar_test_polymers_ten-out1 (10) containing the same ten structures. These structures now have the predicted Tg values associated with them.
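The consensus idea mentioned in the note above can be sketched as follows. This is a minimal illustration, not AutoQSAR's code: the consensus function, the model and entry names, and the numbers are all hypothetical, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev


def consensus(predictions_by_model):
    """Return {entry: (mean, sample standard deviation)} across all models."""
    first_model = next(iter(predictions_by_model.values()))
    result = {}
    for entry in first_model:
        values = [preds[entry] for preds in predictions_by_model.values()]
        result[entry] = (mean(values), stdev(values))
    return result


# Made-up predictions from three hypothetical models (not tutorial results).
predictions_by_model = {
    "model_1": {"polymer_A": 350.0, "polymer_B": 400.0},
    "model_2": {"polymer_A": 360.0, "polymer_B": 410.0},
    "model_3": {"polymer_A": 340.0, "polymer_B": 390.0},
}

for entry, (avg, spread) in consensus(predictions_by_model).items():
    print(entry, avg, spread)
```

A small spread across models suggests the retained models agree on an entry, while a large spread flags a prediction that deserves more caution.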
- Close the AutoQSAR panel
- Select the qsar_test_polymers_ten-out1 (10) entry group
- Open the Project Table
- The predicted values are listed for the new entries in the Pred Y column, as well as their standard deviations
To compare these values to the known values we will draw a scatter plot.
- Go to Window > Manage Charts
- For X-Axis select Tg
- These are the actual values of the target property
- For Y-Axis select Pred Y
- These are the ML predicted values
- Check Best fit
- A regression line and equation are added
Feel free to stylize the graph and save the image if you wish.
The best fit line between predicted and actual values shows a reasonable R² of 0.88 (an ideal model would have an R² of 1.00). The results suggest that the ML model derived from the Polymer Descriptors and AutoQSAR panels could generalize to unseen polymers. Furthermore, this workflow highlights the computational efficiency achieved when using ML approaches as compared to other computational (e.g., ab initio calculations) or experimental approaches. While this tutorial uses a relatively small data set, a larger training set would be expected to further improve prediction accuracy.
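For readers who export the Tg and Pred Y columns to a spreadsheet, the best fit line and its R² can be reproduced with a few lines of Python. The sketch below uses ordinary least squares. The best_fit_r2 function and the example numbers are hypothetical and are not the tutorial's results, and it is an assumption that the chart's Best fit option uses the same formula.

```python
def best_fit_r2(actual, predicted):
    """Least-squares line predicted = slope * actual + intercept, plus its R^2."""
    n = len(actual)
    mean_x = sum(actual) / n
    mean_y = sum(predicted) / n
    sxx = sum((x - mean_x) ** 2 for x in actual)
    syy = sum((y - mean_y) ** 2 for y in predicted)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(actual, predicted))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r2 = sxy ** 2 / (sxx * syy)  # squared Pearson correlation
    return slope, intercept, r2


# Made-up values for illustration only.
actual_tg = [300.0, 350.0, 400.0, 450.0]
predicted_tg = [310.0, 340.0, 410.0, 440.0]

slope, intercept, r2 = best_fit_r2(actual_tg, predicted_tg)
print(f"slope={slope:.3f} intercept={intercept:.1f} R2={r2:.3f}")
```

Note that the R² of a best fit line measures how linear the relationship is, not how close the predictions are to the actual values: a model that is consistently offset from the true values could still score highly. Checking that the slope is near 1 and the intercept is near 0 complements the R² value.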
6. Conclusion and References
In this tutorial, we learned how to use the Polymer Descriptors panel to generate descriptors for polymer systems. We then learned how to use those descriptors to build ML models using the AutoQSAR panel. Finally, we used the models to make predictions on additional test examples that they had not seen before. Altogether, the Polymer Descriptors panel enables ML approaches for property predictions in polymer systems, which could be used to rapidly screen materials for desired properties.
For further learning:
For introductory content, focused on navigating the Schrödinger Materials Science interface, an Introduction to Materials Science Maestro tutorial is available. Please visit the materials science training website for access to 70+ tutorials. For scientific inquiries or technical troubleshooting, submit a ticket to our Technical Support Scientists at help@schrodinger.com.
For self-paced, asynchronous, online courses in Materials Science modeling, including access to Schrödinger software, please visit the Schrödinger Online Learning portal on our website.
For some related practice, proceed to explore other relevant tutorials:
- For more machine learning:
- Machine Learning for Materials Science
- Periodic Descriptors for Inorganic Solids
- Optoelectronics Active Learning
- Machine Learning for Sweetness
- Cheminformatics Machine Learning for Homogeneous Catalysis
- Machine Learning Property Prediction
- Machine Learning for Ionic Conductivity
- Molecular Dynamics Descriptors for Machine Learning
- Machine Learning for Formulations
- Optimizing Viscosity and Cost in Formulations with Missing Structural Data
- For general polymer workflows:
For further reading:
- Help documentation on Polymer Descriptors and AutoQSAR panels
- Bicerano, J. Prediction of Polymer Properties. Marcel Dekker Inc.: New York, 1996
- Afzal, M.A.F. et al ACS Appl. Polym. Mater. 2021, 3, 2, 620-630. DOI:10.1021/acsapm.0c00524
- Design of Organic Electronic Materials With a Goal-Directed Generative Model Powered by Deep Neural Networks and High-Throughput Molecular Simulations. DOI:10.3389/fchem.2021.800370
- Active Learning Accelerates Design and Optimization of Hole-Transporting Materials for Organic Electronics. DOI:10.3389/fchem.2021.800371
- DeepAutoQSAR Hardware Benchmark (Schrödinger white paper)
7. Glossary of Terms
Entry List - a simplified view of the Project Table that allows you to perform basic operations such as selection and inclusion
Included - the entry is represented in the Workspace, the circle in the In column is blue
Project Table - displays the contents of a project and is also an interface for performing operations on selected entries, viewing properties, and organizing structures and data
Recent actions - This is a list of your recent actions, which you can use to reopen a panel, displayed below the Browse row. (Right-click to delete.)
Scratch Project - a temporary project in which work is not saved, closing a scratch project removes all current work and begins a new scratch project
Selected - (1) the atoms are chosen in the Workspace. These atoms are referred to as "the selection" or "the atom selection". Workspace operations are performed on the selected atoms. (2) The entry is chosen in the Entry List (and Project Table) and the row for the entry is highlighted. Project operations are performed on all selected entries
Working Directory - the location where files are saved
Workspace - the 3D display area in the center of the main window, where molecular structures are displayed