Polymer Descriptors for Machine Learning

Modeling polymers is challenging because the number of monomers in a chain is arbitrary, which affects the chain length, molecular mass, and large-scale properties. As an alternative to modeling the entire polymer, analyzing the monomer structure alone with machine learning (ML) methods can accelerate the design of new polymers. ML models can accurately predict bulk polymer properties a few orders of magnitude faster than ab initio or molecular dynamics calculations and trial-and-error experimental approaches.

To develop ML models for polymer systems, molecular descriptors that numerically encode chemical information about the polymer are required. To account for the arbitrary number of monomers in a polymer chain, the Materials Science Maestro suite enables the generation of polymer descriptors that are independent of chain length. Examples of these descriptors are fraction of rotatable bonds, fraction of fused ring atoms, fraction of atoms in the backbone, and more. Once descriptors are generated, they can be used in a machine learning model to predict bulk polymer properties, such as glass transition temperature (T_g), dielectric properties, and mechanical behavior. Such an approach can be efficient as compared to either performing experiments on a wide range of materials or running excessive, costly simulations.
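To illustrate why fraction-type descriptors do not depend on chain length, here is a minimal sketch in plain Python. It is not the Maestro implementation: the atom counts and function names are hypothetical, and end-group atoms are ignored for simplicity.

```python
from fractions import Fraction

# Hypothetical repeat unit: 8 heavy atoms, 6 of them in a ring.
# The numbers are chosen for illustration only.
ATOMS_PER_UNIT = 8
RING_ATOMS_PER_UNIT = 6


def ring_atom_fraction(n_units: int) -> Fraction:
    """Fraction of ring atoms in an idealized chain of n_units repeat units."""
    total_atoms = n_units * ATOMS_PER_UNIT
    ring_atoms = n_units * RING_ATOMS_PER_UNIT
    return Fraction(ring_atoms, total_atoms)


def atom_count(n_units: int) -> int:
    """An extensive descriptor: it grows linearly with chain length."""
    return n_units * ATOMS_PER_UNIT


for n in (1, 10, 1000):
    print(n, atom_count(n), ring_atom_fraction(n))
```

The atom count grows with the number of repeat units, while the ring-atom fraction stays at 3/4 for every chain length, which is the property that makes such descriptors usable when the degree of polymerization is unknown.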

This tutorial provides step-by-step instructions to calculate polymer descriptors using the Materials Science Maestro interface. This tutorial also demonstrates the utility of these descriptors by constructing ML models with AutoQSAR for predicting T_g. Finally, the best ML models are used to predict T_g for a small set of polymers that were not used during model training. The overall workflow is summarized in the figure below:

Figure 1. Tutorial workflow showing the conversion of SMILES to monomer structures, the polymer descriptor and AutoQSAR panel, and the output property predictions.

For background on the Polymer Descriptors panel which will be described in this tutorial, see the help documentation.

For more information about building machine learning models in Materials Science Maestro, see the introductory sections of the Machine Learning for Materials Science tutorial. To learn about using pre-built machine learning models to predict polymer properties, please refer to the Machine Learning Property Prediction tutorial.

For an introduction to building (with the Polymer Builder) and equilibrating (with Molecular Dynamics) homopolymers in Materials Science Maestro, see the Building, Equilibrating and Analyzing Amorphous Polymers tutorial.

2. Creating Projects and Importing Structures

At the start of the session, change the file path to your chosen Working Directory (the location where files are saved) to make file navigation easier. Each session in MS Maestro begins with a default Scratch Project, a temporary project in which work is not saved; closing a scratch project removes all current work and begins a new scratch project. An MS Maestro project stores all your data and has a .prj extension. A project may contain numerous entries corresponding to imported structures, as well as the output of modeling-related tasks. Once a project is saved, the project is automatically saved each time a change is made.

Structures can be built in MS Maestro or imported using File > Import Structures (or by drag and drop), and are added to the Entry List and Project Table. The Entry List, a simplified view of the Project Table that allows you to perform basic operations such as selection and inclusion, is located to the left of the Workspace, the 3D display area in the center of the main window where molecular structures are displayed. The Project Table displays the contents of a project and is also an interface for performing operations on selected entries, viewing properties, and organizing structures and data. It can be accessed by Ctrl+T (Cmd+T) or Window > Project Table if you would like to see an expanded view of your project data.

Double-click the Materials Science icon
- (No icon? See Starting Maestro)

Figure 2-1. Change Working Directory option.

Go to File > Change Working Directory
Find your directory, and click Choose
Pre-generated files are included for running jobs or examining output. Download the zip file here: schrodinger.com/sites/default/files/s3/release/current/Tutorials/zip/polymer_descriptors.zip
After downloading the zip file, unzip the contents in your Working Directory for ease of access throughout the tutorial

Figure 2-2. Save Project panel.

Go to File > Save Project As
Change the File name to polymer_descriptors_tutorial, click Save
- The project is now named polymer_descriptors_tutorial.prj

Figure 2-3. Import the starting structures.

In this tutorial, we will use a data set of 155 polymers, defined by SMILES strings with the head and tail denoted with the SMILES expression, [At] (see the .csv in the provided files). Let’s import these structures now:

Go to File > Import Structures
Change Files of type to Smiles (*.smi*.csv…)
Select monomers_training.csv from the provided tutorial files
Click Open
- The Import SMILES panel pops up

Note: If the head and tail of each monomer are not denoted in the SMILES strings, they can be marked using the Mark Monomer Head and Tail panel.
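Before importing, it can be useful to confirm that every SMILES string carries both head and tail markers. The following is a minimal sketch using only the Python standard library. The sample rows are hypothetical stand-ins for the contents of monomers_training.csv, the column names id and smiles follow the import settings used in this tutorial, and the helper name find_unmarked is an assumption for illustration.

```python
import csv
import io

MARKER = "[At]"  # head/tail placeholder used in the tutorial's SMILES strings

# Hypothetical rows that mimic the id and smiles columns of the input file.
SAMPLE_CSV = """id,smiles
poly_a,[At]CC[At]
poly_b,[At]CC([At])c1ccccc1
poly_c,CC(C)=O
"""


def find_unmarked(csv_text: str) -> list[str]:
    """Return the ids of rows whose SMILES do not contain exactly two markers."""
    bad = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["smiles"].count(MARKER) != 2:
            bad.append(row["id"])
    return bad


print(find_unmarked(SAMPLE_CSV))
```

To check a real file, the same function could be given the text returned by reading the CSV from disk. Entries it reports would need their head and tail marked before descriptors are generated.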

Figure 2-4. Import SMILES settings.

For SMILES Column: choose smiles
For ENTRY TITLE Column: choose id
Ensure Discard any additional properties is unchecked
Click OK

Figure 2-5. The entry list and a stylized molecule after importing.

The entry list is updated to include the 155 entries. Feel free to stylize and visualize any of the provided structures.

Note: Hydrogen atoms are not added when importing from SMILES. This will not impact this exercise, but it is good to be aware of if running any quantum mechanical calculations.

Each polymer also has a T_g value as determined and reported in the literature (Bicerano, J. Prediction of Polymer Properties. Marcel Dekker Inc.: New York, 1996 & Afzal, M.A.F. et al. ACS Appl. Polym. Mater. 2021, 3, 2, 620-630). These can be viewed in the Project Table under the Tg property.

3. Generating Polymer Descriptors

To create an ML model, we first need to generate descriptors for the structures using the Polymer Descriptors panel.

Figure 3-1. Choosing descriptor options and running the job.

Ensure that all 155 entries are selected in the Entry List (use Shift + Click or create an entry group)
Go to Tasks > Materials > Informatics > Polymer
- The Polymer Descriptors panel opens
Ensure that Polymer fingerprints and Polymer descriptors are checked
- The polymer fingerprints, constructed using RDKit, are topological torsion fingerprints that are invariant to the number of repeat units
- The polymer descriptors, also constructed using RDKit, are likewise invariant to the length of the repeat unit; they include, for example, backbone length and fraction of fused rings
Change the Job name to polymer_descriptors_train
Adjust the job settings as needed
- This job requires a CPU host. The job can be completed in about 1 minute.

For a complete description of the Polymer Descriptors panel, see the help documentation.

Figure 3-2. Viewing some of the descriptors in the Project Table.

Close the Polymer Descriptors panel

When the job finishes, a new entry group entitled polymer_descriptors_train-out1 (155) is added to the Entry List. The group contains all of the same structures, but now each entry is also associated with the various descriptors.

We can see these descriptors by opening the Project Table:

Open the Project Table

The properties may not appear by default. To add some of the properties as columns in the Project Table:

Go to the Property Tree, expand All > Materials Science > Secondary and select any of the properties whose values you want to see

Note: You can export directly from the Project Table to spreadsheet form if needed by clicking Data > Export > Spreadsheet

Close the Project Table before proceeding to the next section

4. Building a Machine Learning Model Using AutoQSAR

In the following section, we want to build AutoQSAR models to predict the T_g of a homopolymer using the polymer descriptors generated in the previous section.

Figure 4-1. Parameterizing the AutoQSAR panel.

Ensure that the entire polymer_descriptors_train-out1 (155) entry group is selected
- Be sure you have the new entry group selected, containing all of the descriptor data, as opposed to the original structures
Go to Tasks > Materials > Informatics > AutoQSAR
- The AutoQSAR panel opens

If you are interested in more background for utilizing the AutoQSAR panel, see the Machine Learning for Materials Science tutorial.

Ensure that Build model is checked
Ensure that for Use structures from, Project Table (selected entries) is chosen
Change the Property to be fit dropdown to Tg (canvas)
- This is the glass transition temperature (T_g)
Maintain 75% for the Random training set
- This is the percentage of the data assigned to the training set: 75% of the data is used to train the model and the remaining 25% is used to test it
- With this relatively small data set, the 75:25 split ensures that there is significantly more data in the training set than the test set, but still enough data in the test set to assess model performance
Input 30 for Number of models to keep
- AutoQSAR tries a variety of ML models on a variety of different training and test splits. Increasing to 30 models helps ensure that we sample a variety of training:test splits, rather than relying on a single split in which all of the ML models happen to perform well
Click Advanced Options
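The idea behind evaluating models on several random 75:25 splits can be sketched in a few lines of Python. This is an illustration only and not how AutoQSAR is implemented: the function name random_split, the use of integer entry indices, and the rounding of the training-set size are all assumptions.

```python
import random


def random_split(n_entries: int, train_fraction: float, seed: int):
    """Shuffle entry indices and split them into training and test lists."""
    indices = list(range(n_entries))
    random.Random(seed).shuffle(indices)
    # Rounding to the nearest integer is an assumption made for this sketch.
    n_train = round(n_entries * train_fraction)
    return indices[:n_train], indices[n_train:]


# 155 entries and a 75% training fraction, as in this tutorial.
for seed in range(3):
    train, test = random_split(155, 0.75, seed)
    print(seed, len(train), len(test), sorted(test)[:5])
```

Each seed yields a different partition of the same 155 entries, so a model that scores well on only one particular test set is easier to spot when several splits are compared.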

Figure 4-2. AutoQSAR advanced options.

Maintain 50 for the Number of models to build for each model type
Change the Maximum allowed correlation between any pair of individual variables to 0.99
- A higher correlation threshold allows AutoQSAR to use descriptors that are linearly correlated with each other, which may give better results
Uncheck Binary fingerprints and Numeric descriptors
- These default descriptors are not relevant for polymer systems
Check Other Properties from and click Structures...
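To make the effect of the correlation threshold concrete, the sketch below filters descriptor columns by pairwise Pearson correlation. It is only an illustration of the general technique: the greedy keep-first strategy, the function names, and the descriptor values are hypothetical, and AutoQSAR's internal filtering may differ.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)  # assumes neither column is constant


def filter_correlated(columns, max_corr):
    """Greedily keep descriptors whose |r| with every kept one is <= max_corr."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= max_corr for k in kept):
            kept.append(name)
    return kept


# Hypothetical descriptor values for five polymers.
descriptors = {
    "frac_rotatable": [0.10, 0.20, 0.30, 0.40, 0.50],
    "frac_rotatable_x2": [0.20, 0.40, 0.60, 0.80, 1.00],  # scaled copy of the first
    "frac_ring_atoms": [0.60, 0.10, 0.45, 0.05, 0.30],
}

print(filter_correlated(descriptors, max_corr=0.99))
print(filter_correlated(descriptors, max_corr=0.30))
```

With the 0.99 threshold only the scaled copy is discarded, whereas the stricter 0.30 threshold also removes the moderately correlated ring-atom column. This is the sense in which a higher threshold leaves more descriptors available to the models.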

Figure 4-3. Selecting descriptors.

From the Show family dropdown, select Materials Science
- The descriptors calculated in Section 3 appear in the Available properties list
Click Select All and Add
Then click OK to save the Advanced Options

Figure 4-4. Naming and running the job.

Change the Job name to qsar_build_polymers
Adjust the job settings as needed
- This job requires a CPU host. The job can be completed in about 5 minutes on a 12 CPU host
Click Run
Close the AutoQSAR panel

5. Viewing the Machine Learning Model and Predicting

We can proceed to view the machine learning models that were generated, and use these to make predictions on a small data set.

Figure 5-1. Loading the models.

When the job is complete, note that no new entry group is added to the entry list.

Return to Tasks > Materials > Informatics > AutoQSAR
- The AutoQSAR panel opens
For Choose task, switch to View model and make prediction
From the dropdown, select qsar_build_polymers.qzip
- The Model Report section of the panel shows the scores for the best models
Click on the + button to expand the model report, which shows the performance of the best models

Figure 5-2. Viewing the models.

Click to Highlight the best model (the first row by default)
Click the Report Details button

Figure 5-3. Viewing the Report Details.

The Report Details pop-up shows the scores, important features, predicted values, and errors for the training and test data

Click Scatter Plot

Figure 5-4. Viewing the Scatter Plot.

The scatter plot allows further visualization of the model.

Note: You can Save Image of the scatter plot if you would like to save a .png file.

Feel free to visualize any of the other models. When you are finished, close the Scatter Plot window and all of the other windows associated with the AutoQSAR panel

Figure 5-5. Importing the test set.

Provided with the tutorial files are ten additional structures with known T_g values. We will now proceed to import these structures, calculate descriptors and then test how the model performs in predicting their T_g values.

Go to File > Import Structures

Repeat the importing steps used at the end of Section 2, this time importing the Section_05 > monomers_test.csv file.

Ten new entries are added to the entry list.

Repeat all of the steps in Section 3 for these ten entries to generate the descriptors
- Name the job polymer_descriptors_test_set. It should complete very quickly on a CPU host
- Check the Project Table to see that you have generated descriptors

Figure 5-6. Selecting the descriptor output and opening the AutoQSAR panel.

Ensure that the new entry group is selected: polymer_descriptors_test_set-out1 (10)
- Be sure you have the entry group selected containing all of the descriptor data, as opposed to the original structures
Return to Tasks > Materials > Informatics > AutoQSAR
- The AutoQSAR panel opens
Ensure that Choose task is still set to View model and make prediction
Ensure that from the dropdown, qsar_build_polymers.qzip is selected

Figure 5-7. Naming and running the prediction job.

In the Make Prediction section of the panel, maintain the defaults:
- Keep the entry group selected
- Use All models
- Maintain Y for the AutoQSAR Prediction. This is going to be the output property name: Pred Y
Change the Job name to qsar_test_polymers_ten
Adjust the job settings as needed
- This job requires a CPU host. The job can be completed in about 2 minutes on a 12 CPU host
Click Run

Note: Consensus prediction averages the results of the retained models, which can often increase the accuracy of the predictions.
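As a rough illustration of consensus prediction, the sketch below averages the predictions of several models for each polymer and reports the spread between them. The prediction values and names are hypothetical, and the use of the sample standard deviation is an assumption; the exact statistic reported by AutoQSAR is not specified here.

```python
from statistics import mean, stdev

# Hypothetical T_g predictions from three retained models for two polymers.
predictions = {
    "polymer_1": [350.0, 360.0, 370.0],
    "polymer_2": [410.0, 400.0, 420.0],
}


def consensus(values):
    """Return the mean prediction and its sample standard deviation."""
    return mean(values), stdev(values)


for name, values in predictions.items():
    avg, spread = consensus(values)
    print(f"{name}: Pred Y = {avg:.1f}, std = {spread:.1f}")
```

A small standard deviation indicates that the retained models agree on a polymer, while a large one flags a prediction that deserves more caution.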

Figure 5-8. The predicted values in the Project Table.

When the job is complete, a new entry group is added to the entry list entitled qsar_test_polymers_ten-out1 (10) containing the same ten structures. These structures now have the predicted T_g values associated with them.

Close the AutoQSAR panel
Select the qsar_test_polymers_ten-out1 (10) entry group
Open the Project Table
- The predicted values are listed for the new entries in the Pred Y column, as well as their standard deviations

To compare these values to the known values we will draw a scatter plot.

Go to Window > Manage Charts

Figure 5-9. Choosing Scatterplot option.

Click Create and choose Scatterplot

Figure 5-10. A scatter plot of the predicted data versus the known values for the test set.

For X-Axis select Tg
- These are the actual values of the target property
For Y-Axis select Pred Y
- These are the ML predicted values
Check Best fit
- A regression line and its equation are added

Feel free to stylize the graph and save the image if you wish.

The best fit line between predicted and actual values shows a reasonable R² of 0.88 (an ideal model would have an R² of 1.00). The results suggest that the ML model derived from the polymer descriptors and AutoQSAR panel could generalize to unseen polymers. Furthermore, this workflow highlights the computational efficiency achieved when using ML approaches as compared to other computational (e.g. ab initio calculations) or experimental approaches. While this tutorial uses a relatively small dataset, one could envision a larger training set would further improve prediction accuracy.
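For readers who export the values and want to reproduce the best-fit statistics outside Maestro, the following sketch computes a least-squares line and its R² in plain Python. The data points are hypothetical and are not the tutorial's results, and the function name linear_fit_r2 is an assumption for illustration.

```python
def linear_fit_r2(x, y):
    """Least-squares line y = slope * x + intercept and its R^2."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Hypothetical known and predicted values, for illustration only.
tg_known = [300.0, 350.0, 400.0, 450.0]
tg_pred = [310.0, 340.0, 410.0, 440.0]

slope, intercept, r2 = linear_fit_r2(tg_known, tg_pred)
print(f"slope={slope:.3f} intercept={intercept:.1f} R2={r2:.3f}")
```

Note that the R² of a best-fit line measures how well a straight line relates predicted to known values. A slope near 1 and an intercept near 0 are additionally needed for the predictions themselves to match the known values.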

6. Conclusion and References

In this tutorial, we learned how to use the Polymer Descriptors panel to generate descriptors for polymer systems. We then learned how to use those descriptors to build ML models using the AutoQSAR panel. Finally, we used the model to make predictions on additional test examples that the model had not seen before. Altogether, the Polymer Descriptors panel enables ML approaches for property predictions in polymer systems, which could be used to rapidly screen materials for desired properties.

For further learning:

For introductory content, focused on navigating the Schrödinger Materials Science interface, an Introduction to Materials Science Maestro tutorial is available. Please visit the materials science training website for access to 70+ tutorials. For scientific inquiries or technical troubleshooting, submit a ticket to our Technical Support Scientists at help@schrodinger.com.

For self-paced, asynchronous, online courses in Materials Science modeling, including access to Schrödinger software, please visit the Schrödinger Online Learning portal on our website.

For some related practice, proceed to explore other relevant tutorials:

For further reading:

Help documentation on Polymer Descriptors and AutoQSAR panels
Bicerano, J. Prediction of Polymer Properties. Marcel Dekker Inc.: New York, 1996
Afzal, M.A.F. et al. ACS Appl. Polym. Mater. 2021, 3, 2, 620-630. DOI:10.1021/acsapm.0c00524
Design of Organic Electronic Materials With a Goal-Directed Generative Model Powered by Deep Neural Networks and High-Throughput Molecular Simulations. DOI:10.3389/fchem.2021.800370
Active Learning Accelerates Design and Optimization of Hole-Transporting Materials for Organic Electronics. DOI:10.3389/fchem.2021.800371
DeepAutoQSAR Hardware Benchmark (Schrödinger white paper)

7. Glossary of Terms

Entry List - a simplified view of the Project Table that allows you to perform basic operations such as selection and inclusion

Included - the entry is represented in the Workspace, the circle in the In column is blue

Project Table - displays the contents of a project and is also an interface for performing operations on selected entries, viewing properties, and organizing structures and data

Recent actions - This is a list of your recent actions, which you can use to reopen a panel, displayed below the Browse row. (Right-click to delete.)

Scratch Project - a temporary project in which work is not saved; closing a scratch project removes all current work and begins a new scratch project

Selected - (1) the atoms are chosen in the Workspace. These atoms are referred to as "the selection" or "the atom selection". Workspace operations are performed on the selected atoms. (2) The entry is chosen in the Entry List (and Project Table) and the row for the entry is highlighted. Project operations are performed on all selected entries

Working Directory - the location where files are saved

Workspace - the 3D display area in the center of the main window, where molecular structures are displayed