Periodic Descriptors for Inorganic Solids

Tutorial Created with Software Release: 2024-2
Topics: Catalysis & Reactivity, Energy Capture & Storage, Informatics and Team Collaboration, Metals, Alloys & Ceramics, Thin Film Processing
Methodology: Machine Learning, Periodic Quantum Mechanics
Products Used: AutoQSAR, MS Informatics, MS Maestro


This tutorial is written for use with a 3-button mouse with a scroll wheel.
Words found in the Glossary of Terms are shown like this: Workspace (the 3D display area in the center of the main window, where molecular structures are displayed)

 

Abstract:

 

In this tutorial, we will learn to generate descriptors for inorganic, periodic crystal systems which can be used to build machine learning models.

 

Tutorial Content
  1. Introduction to Periodic Descriptors
  2. Setting Up the Environment
  3. Creating Projects and Importing Structures
  4. Generating Periodic Descriptors
  5. Building a Machine Learning Model Using AutoQSAR
  6. Viewing the Machine Learning Model and Predicting
  7. Conclusion and References
  8. Glossary of Terms

1. Introduction to Periodic Descriptors

Machine learning (ML) methods can accelerate the design of new materials by predicting material properties a few orders of magnitude faster than ab initio calculations and with comparable accuracy. One class of materials of particular interest is inorganic crystals, such as bulk metals, perovskites, composite materials, and more. These structures are often represented as a crystal unit cell, which is the smallest repeat unit that captures the three-dimensional pattern of the entire crystal. Figure 1 shows an example of an inorganic crystal structure that has a collection of atoms bounded by periodic boundary conditions (shown as black lines). The arbitrary size and shape of these periodic boundaries make crystal systems challenging to encode numerically as descriptors, which are values that capture chemical information about the crystal and are often a prerequisite for developing ML models to predict its properties. To resolve this challenge, descriptors are computed by accounting for the periodic nature of these crystal structures; these are collectively referred to as periodic descriptors.
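
To make "accounting for the periodic nature" more concrete, the following is a minimal sketch of the minimum-image convention, under the simplifying assumption of a cubic cell. The function name and the coordinates are hypothetical and are not taken from the Periodic Descriptors panel, whose internal implementation is not described here.

```python
import math


def minimum_image_distance(frac_a, frac_b, cell_edge):
    """Shortest distance between two atoms in a cubic cell of edge `cell_edge`.

    Positions are fractional coordinates. Periodic images are taken into
    account by wrapping each component of the separation into [-0.5, 0.5]
    before converting it to a Cartesian length.
    """
    squared = 0.0
    for a, b in zip(frac_a, frac_b):
        delta = b - a
        delta -= round(delta)  # nearest periodic image along this axis
        squared += (delta * cell_edge) ** 2
    return math.sqrt(squared)


# Two atoms near opposite faces of a 4.0 Å cubic cell (hypothetical values)
atom_a, atom_b, edge = (0.05, 0.0, 0.0), (0.95, 0.0, 0.0), 4.0
naive = math.dist(atom_a, atom_b) * edge
periodic = minimum_image_distance(atom_a, atom_b, edge)
print(f"ignoring periodicity: {naive:.2f} Å, minimum image: {periodic:.2f} Å")
```

A descriptor that ignored the cell boundaries would treat these two atoms as distant, although they are close neighbors through the periodic boundary.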

Once periodic descriptors are generated, they can be used in an ML model to predict bulk properties, such as ionic conductivity, band gap, bulk modulus, formation energy, and more. Figure 1 summarizes the workflow for this tutorial, which provides step-by-step instructions to calculate two types of periodic descriptors using the Materials Science Maestro interface: Matminer and Smooth Overlap of Atomic Positions (SOAP) descriptors. This tutorial further demonstrates the utility of periodic descriptors by constructing ML models with AutoQSAR to predict the bulk modulus of periodic systems.

Figure 1. Tutorial workflow showing an example of an inorganic crystal structure with periodic boundary conditions, the periodic descriptor and AutoQSAR panel, and the output property predictions.

For background on the Periodic Descriptors panel which will be described in this tutorial, see the help documentation.

For more information about building machine learning models in Materials Science Maestro, see the introductory sections of the Machine Learning for Materials Science tutorial.

For practice working with crystal structures in Materials Science Maestro, see the Building and Manipulating Crystal Structures tutorial.

2. Setting Up the Environment

By default, the AutoQSAR panel calculates binary fingerprints and numeric descriptors. However, these descriptors are not suitable to use for periodic systems. To ignore these descriptors, the environment variable SCHRODINGER_AUTOQSAR_IGNORE_STRUCTURES must be set to the “any” value before building AutoQSAR ML models for periodic systems. This section will explain the steps for doing so depending on your operating system. For complete information about setting environment variables, visit the knowledge base.

Figure 2-1. Setting up the environment variable on Mac.

On Mac OS:

Make sure you do not have a Materials Science Maestro session open

  1. Open the terminal
  2. Set your environment variable by running: launchctl setenv SCHRODINGER_AUTOQSAR_IGNORE_STRUCTURES any
    • Note that the command above should be entered on a single line
  3. Close the terminal
    • Note that you must close the terminal for the environment variable to be set
    • Proceed to Section 3 to open the software

Figure 2-2. Setting up the environment variable on Linux (bash is shown).

On Linux:

Make sure you do not have a Materials Science Maestro session open

  1. Open the terminal
  2. Set your environment variable by running the command for your shell:
    • For csh/tcsh: setenv SCHRODINGER_AUTOQSAR_IGNORE_STRUCTURES any
    • For bash: export SCHRODINGER_AUTOQSAR_IGNORE_STRUCTURES=any
    • Note that each command above should be entered on a single line
  3. Close the terminal
    • Note that you must close the terminal for the environment variable to be set
    • Proceed to Section 3 to open the software

Figure 2-3. Setting up the environment variable on Windows.

On Windows:

Make sure you do not have a Materials Science Maestro session open

  1. Open the Environment Variables dialog box
    • This is accessed from Search > Edit environment variables for your account
  2. In the User variables for [your username] section, click New to open the New User Variable dialog box
  3. Set the Variable name to SCHRODINGER_AUTOQSAR_IGNORE_STRUCTURES and the Variable value to any
  4. Close the Environment Variables dialog box
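
As an optional sanity check on any operating system, the short Python sketch below reports whether the variable is visible to a process with the expected value. The helper name is hypothetical. The script only sees the environment inherited by the Python process itself, which may differ from the environment that Materials Science Maestro is launched with.

```python
import os

VARIABLE = "SCHRODINGER_AUTOQSAR_IGNORE_STRUCTURES"


def autoqsar_ignores_structures(environ=os.environ):
    """Return True if the variable is set to the value this tutorial requires."""
    return environ.get(VARIABLE) == "any"


if autoqsar_ignores_structures():
    print(f"{VARIABLE} is set to 'any'")
else:
    print(f"{VARIABLE} is not set to 'any'; repeat the steps for your operating system")
```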

3. Creating Projects and Importing Structures

At the start of the session, change the file path to your chosen Working Directory in MS Maestro to make file navigation easier. Each session in MS Maestro begins with a default Scratch Project, which is not saved. An MS Maestro project stores all your data and has a .prj extension. A project may contain numerous entries corresponding to imported structures, as well as the output of modeling-related tasks. Once a project is saved, the project is automatically saved each time a change is made.

Structures can be built in MS Maestro or imported using File > Import Structures (or by drag and drop), and are added to the Entry List and Project Table. The Entry List is located to the left of the Workspace. The Project Table can be accessed with Ctrl+T (Cmd+T) or Window > Project Table if you would like to see an expanded view of your project data.

  1. Double-click the Materials Science icon

Figure 3-1. Change Working Directory option.

  1. Go to File > Change Working Directory
  2. Find your directory, and click Choose
  3. Pre-generated files are included for running jobs or examining output. Download the zip file here: schrodinger.com/sites/default/files/s3/release/current/Tutorials/zip/periodic_descriptors_inorganic.zip
  4. After downloading the zip file, unzip the contents in your Working Directory for ease of access throughout the tutorial

Figure 3-2. Save Project panel.

  1. Go to File > Save Project As
  2. Change the File name to periodic_descriptors_tutorial, click Save
    • The project is now named periodic_descriptors_tutorial.prj

Figure 3-3. Import the starting structures.

In this tutorial, we will use a data set of 500 inorganic crystal structures. To import these structures:

  1. Go to File > Import Structures
  2. Choose train_500.mae from the provided tutorial files
  3. Click Open
    • A new entry group is added to the entry list containing 500 entries

Feel free to stylize and visualize any of the provided structures. If you are unfamiliar with working with periodic systems in Materials Science Maestro, refer to the Building and Manipulating Crystal Structures tutorial.

The provided data set includes a known bulk modulus value for each crystal structure. These values can be viewed in the Project Table under the K VRH property.
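
K VRH is the Voigt-Reuss-Hill average of the bulk modulus (see Section 5). As a rough illustration of what that average means, the sketch below computes the Voigt and Reuss bounds from the upper-left 3x3 blocks of the elastic stiffness and compliance matrices (Voigt notation) and takes their arithmetic mean. The cubic elastic constants are hypothetical, and the function names are illustrative rather than part of any Schrödinger API.

```python
def voigt_bulk_modulus(c):
    """Voigt (uniform strain) bound from the upper-left 3x3 stiffness block."""
    return (c[0][0] + c[1][1] + c[2][2] + 2.0 * (c[0][1] + c[0][2] + c[1][2])) / 9.0


def reuss_bulk_modulus(s):
    """Reuss (uniform stress) bound from the upper-left 3x3 compliance block."""
    return 1.0 / (s[0][0] + s[1][1] + s[2][2] + 2.0 * (s[0][1] + s[0][2] + s[1][2]))


def hill_average(k_voigt, k_reuss):
    """Voigt-Reuss-Hill average: arithmetic mean of the two bounds."""
    return 0.5 * (k_voigt + k_reuss)


# Hypothetical cubic crystal with C11 = 200 GPa and C12 = 100 GPa
c11, c12 = 200.0, 100.0
stiffness = [[c11, c12, c12], [c12, c11, c12], [c12, c12, c11]]

# Analytic compliances for cubic symmetry
denom = (c11 - c12) * (c11 + 2.0 * c12)
s11, s12 = (c11 + c12) / denom, -c12 / denom
compliance = [[s11, s12, s12], [s12, s11, s12], [s12, s12, s11]]

k_vrh = hill_average(voigt_bulk_modulus(stiffness), reuss_bulk_modulus(compliance))
print(f"K VRH = {k_vrh:.1f} GPa")
```

For a cubic crystal the two bounds coincide at (C11 + 2·C12)/3; for lower symmetries they differ, and the Hill average lies between them.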

4. Generating Periodic Descriptors

To create an ML model, we will first need to generate descriptors for the structures using the Periodic Descriptors panel.

Figure 4-1. Choosing descriptor options.

  1. Ensure that the entire train_500 (500) entry group is selected in the Entry List
  2. Go to Tasks > Materials > Informatics > Periodic
    • The Periodic Descriptors panel opens
  3. Ensure that Element Descriptors, Oxidation state descriptors and Structure descriptors are checked
    • These are the Matminer descriptors (see the References)
    • Intercalation descriptors are not relevant in this case with pure solids. These descriptors are often used when modeling Li-ion battery materials
  4. Check 3D-based SOAP descriptors with dimensionality reduction via PCA
    • These are the SOAP descriptors (see the References)
    • Principal component analysis (PCA) is used to reduce the dimensions of SOAP descriptors
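
To give a feel for what a composition-based element descriptor is, here is a minimal sketch that summarizes a formula with statistics over one elemental property. It is written in the spirit of composition featurizers and does not use the Matminer API or reproduce the panel's actual descriptor set. The property table (atomic numbers for a handful of elements) and the function name are illustrative assumptions.

```python
ATOMIC_NUMBER = {"O": 8, "Mg": 12, "Al": 13, "Ti": 22, "Sr": 38}  # small illustrative table


def composition_statistics(composition, element_property=ATOMIC_NUMBER):
    """Fraction-weighted mean, minimum, maximum and range of an elemental property.

    `composition` maps element symbols to their counts in the formula unit.
    """
    total = sum(composition.values())
    values = [element_property[el] for el in composition]
    mean = sum(element_property[el] * n / total for el, n in composition.items())
    return {
        "mean": mean,
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
    }


features = composition_statistics({"Sr": 1, "Ti": 1, "O": 3})  # SrTiO3, a perovskite
print(features)
```

Each statistic becomes one numeric column per structure, which is the form an ML model needs regardless of how many atoms the unit cell contains.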

Figure 4-2. Parameterize the SOAP descriptors.

  1. Ensure that the Create new PCA radio button is checked
  2. Click Detect
    • The elements present in the dataset will appear, which will be used to calculate SOAP descriptors
  3. Set the Number of principal components to 3
    • You can choose any number between 2 and 10
    • Generally, a higher number of components equates to more SOAP descriptor information being stored

 

Note: The size of the SOAP descriptor vector can grow rapidly with the number of distinct elements in a system. Therefore, the Periodic Descriptors panel uses PCA to reduce the dimensions of this vector while retaining as much of its variance as possible. The resulting columns are called principal components. You can choose to have between 2 and 10 columns in your final feature vector. There is no universally accepted method of choosing the number of components. Read more about PCA here.
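
The idea behind the PCA step can be sketched in a few lines of dependency-free Python. This is an illustration only: it finds the leading principal components by power iteration on the covariance matrix, which is a simple textbook approach and not necessarily what the panel does internally. The function name and the toy data are hypothetical; in practice a numerical library would be used.

```python
import math


def pca_reduce(rows, n_components, n_iter=500):
    """Project `rows` (equal-length feature lists) onto the leading principal components."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered features
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)] for i in range(d)]

    components = []
    for _ in range(n_components):
        v = [1.0 / math.sqrt(d)] * d  # fixed starting vector keeps the sketch deterministic
        for _step in range(n_iter):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break  # no variance left to explain
            v = [x / norm for x in w]
        eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
        components.append(v)
        # Deflate: remove the variance already explained by this component
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)] for i in range(d)]

    return [[sum(c[j] * v[j] for j in range(d)) for v in components] for c in centered]


# Six hypothetical 3-feature descriptor vectors reduced to 2 principal components
descriptors = [[3, 0, 0], [-3, 0, 0], [0, 2, 0], [0, -2, 0], [0, 0, 1], [0, 0, -1]]
reduced = pca_reduce(descriptors, n_components=2)
```

In this toy data the third feature carries the least variance, so it is the one discarded when only two components are kept.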

Figure 4-3. Naming and running the job.

  1. Change the Job name to inorganic_descriptors
  2. Adjust the job settings as needed
    • This job requires a CPU host. The job can be completed in about 2 minutes.
  3. Click Run

Figure 4-4. Viewing some of the descriptors in the Project Table.

  1. Close the Periodic Descriptors panel

When the job finishes, a new entry group entitled inorganic_descriptors-out (500) is incorporated and added to the Entry List. The group contains all of the same structures as the original group, but now each entry is also associated with the various descriptors.

We can see these descriptors by opening the Project Table in the next step.

  1. Open the Project Table

The properties may not appear by default. To add some of the properties as columns in the Project Table:

  1. Go to the Property Tree, expand All > Materials Science > Secondary and select any of the properties whose values you want to display

Note: You can export directly from the Project Table to spreadsheet form if needed by clicking Data > Export > Spreadsheet

  1. Close the Project Table before proceeding to the next section

5. Building a Machine Learning Model Using AutoQSAR

Now that we have descriptors for the 500 structures, we can proceed to build an ML model using AutoQSAR.

Figure 5-1. Parameterizing the AutoQSAR panel.

  1. Ensure that the entire inorganic_descriptors-out1 (500) entry group is selected in the Entry List
    • Make sure you have the new entry group selected, containing all of the descriptor data, as opposed to the original structures
  2. Go to Tasks > Materials > Informatics > AutoQSAR
    • The AutoQSAR panel opens

 

If you are interested in more background for utilizing the AutoQSAR panel, see the Machine Learning for Materials Science tutorial. 

 

  1. Ensure that Build model is checked
  2. Ensure that for Use structures from, Project Table (selected entries) is chosen
  3. Change the Property to be fit dropdown to K VRH (User)
    • You may need to scroll down to find this property.
    • This property is the Voigt-Reuss-Hill (VRH) average of the bulk modulus
  4. Change the Random training set to 80%
    • This sets the split between training and test sets: 80% of the data is used to train the model and the remaining 20% is used to test it. An 80:20 training/testing split is typical for machine learning models; other splits are also reasonable depending on the dataset size
    • Maintain 10 for Number of models to keep
    • AutoQSAR performs extensive model selection. We arbitrarily choose to retain the 10 best models to check their performance. We can choose to keep more models or even just the single best model. If multiple models are kept, they can be used together for consensus prediction (see Section 6)
  5. Click Advanced Options
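
The random 80:20 split described above can be pictured with the following minimal sketch. It is not AutoQSAR's implementation. The function name and the fixed seed are assumptions made so that the example is reproducible.

```python
import random


def random_split(entries, train_fraction=0.8, seed=0):
    """Shuffle a copy of `entries` and split it into training and test lists."""
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


# Entry indices stand in for the 500 structures of the data set
train, test = random_split(range(500))
print(len(train), len(test))
```

With 500 entries this yields 400 training and 100 test entries, and no entry appears in both sets.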

Figure 5-2. AutoQSAR advanced options.

  1. Maintain 50 for the Number of models to build for each model type and 0.80 for the Maximum allowed correlation between any pair of individual variables
    • The removal of correlated independent variables will help tackle the problem of multicollinearity
  2. Uncheck Binary fingerprints and Numeric descriptors
    • These default descriptors are not calculable for periodic systems
  3. Check Other Properties from and click Structures...

Figure 5-3. Selecting descriptors.

  1. From the Show family dropdown, select Materials Science
    • The descriptors calculated in Section 4 appear in the Available properties list
  2. Click Select All and Add
  3. Then click OK to save the Advanced Options

Figure 5-4. Naming and running the job.

  1. Change the Job name to qsar_build_inorganic
  2. Adjust the job settings as needed
    • This job requires a CPU host. The job can be completed in about 10 minutes on a 12 CPU host
  3. Click Run
    • If there is an error upon running, it is likely that you did not properly set the environment variable in Section 2. Try to repeat the steps in Section 2 (make sure to close the software first). If you still have trouble, please contact education@schrodinger.com
  4. Close the AutoQSAR panel

6. Viewing the Machine Learning Model and Predicting

We can proceed to view the machine learning models that were generated, and use these to make predictions on a small data set.

Figure 6-1. Loading the models.

When the job is complete, note that no new entry group is added to the entry list.

 

  1. Return to Tasks > Materials > Informatics > AutoQSAR
    • The AutoQSAR panel opens
  2. For Choose task, switch to View model and make prediction
  3. From the dropdown, select qsar_build_inorganic.qzip
    • The Model Report section of the panel shows the scores for the best models
  4. Click on the + button to expand the model report, which shows the performance of the 10 best models

Figure 6-2. Viewing the models.

  1. Click to Highlight the best model (the first row by default)
  2. Click the Report Details button

Figure 6-3. Viewing the Report Details.

The Report Details pop-up shows the scores, important features, predicted values and errors of the training and test data.

 

  1. Click Scatter Plot

Figure 6-4. Viewing the Scatter Plot.

The scatter plot allows further visualization of the model.

 

 

 

Note: You can Save Image of the scatter plot if you would like to save a .png file.

 

  1. Feel free to visualize any of the other models. When you are finished, close the Scatter Plot window and all of the other windows associated with the AutoQSAR panel

Figure 6-5. Importing the test set.

Provided with the tutorial files are 20 additional structures with known bulk modulus values. We will now proceed to import these structures, calculate descriptors, and then test how the model performs in predicting their bulk modulus values.

  1. Go to File > Import Structures
  2. Select Section_06 > test_20.mae from the provided tutorial files
  3. Click Open
    • A new entry group containing 20 entries is added to the Entry List
  4. Repeat all of the steps in Section 4 for these 20 entries to generate the descriptors
    • Be sure to click the Detect button again when parameterizing the panel
    • Name the job inorganic_descriptors_test_set. It should complete very quickly on a CPU host
    • Check the Project Table to see that you have generated descriptors

Figure 6-6. Selecting the descriptor output and opening the AutoQSAR panel.

  1. Ensure that the new entry group is selected in the Entry List: inorganic_descriptors_test_set-out1 (20)
    • Be sure you selected the entry group containing all of the descriptor data, as opposed to the original structures
  2. Return to Tasks > Materials > Informatics > AutoQSAR
    • The AutoQSAR panel opens
  3. Ensure that Choose task is still set to View model and make prediction
  4. Ensure that from the dropdown, qsar_build_inorganic.qzip is selected

Figure 6-7. Naming and running the prediction job.

  1. In the Make Prediction section of the panel, maintain the defaults:
    • Keep the entry group selected
    • For Model to test, use All models
    • Maintain Y for the AutoQSAR Prediction. This is going to be the output property name: Pred Y
  2. Change the Job name to qsar_test_inorganic_20
  3. Adjust the job settings as needed
    • This job requires a CPU host. The job can be completed in about 5 minutes on a 12 CPU host
  4. Click Run

Note: Consensus prediction averages the results of the retained models. This can often increase the accuracy of the predictions.
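
A consensus prediction of this kind can be sketched as the mean of the individual model predictions for an entry, with the spread across models as a rough indication of their agreement. The function name and values are hypothetical, and the use of the sample standard deviation is an assumption; the exact statistic reported by AutoQSAR is not specified here.

```python
import statistics


def consensus_prediction(model_predictions):
    """Mean and sample standard deviation of one entry's predictions across models."""
    return statistics.mean(model_predictions), statistics.stdev(model_predictions)


# Hypothetical bulk modulus predictions (GPa) from three retained models
mean, spread = consensus_prediction([100.0, 110.0, 90.0])
print(f"consensus = {mean:.1f} GPa, standard deviation = {spread:.1f} GPa")
```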

Figure 6-8. The predicted values in the Project Table.

When the job is complete, a new entry group is added to the entry list entitled qsar_test_inorganic_20-out (20) containing the same 20 structures. These structures now have the predicted bulk modulus values associated with them.

  1. Close the AutoQSAR panel
  2. Select the qsar_test_inorganic_20-out1 (20) entry group
  3. Open the Project Table
    • The predicted values are listed for the new entries in the Pred Y column, as well as their standard deviations

To compare these values to the known values we will draw a scatter plot.

  1. Click the Manage Plots button

Figure 6-9. A scatter plot of the predicted data versus the known values for the test set.

  1. Click Create > Scatterplot
  2. For X-Axis select K VRH
    • These are the actual values of the target property
  3. For Y-Axis select Pred Y
    • These are the ML predicted values
  4. Check Best fit line
    • A regression line is added

 

Feel free to stylize the graph and save an image if you wish.

The best fit line between predicted and actual values shows a high R² of 0.95 (an ideal model would have an R² of 1.00), which suggests that the ML model from AutoQSAR can make accurate predictions for new crystal structures. The overall workflow also highlights the improved computational efficiency of ML approaches compared to ab initio calculations or physical experiments: ML can deliver property predictions in minutes, versus hours to days for the other approaches. While this tutorial uses a relatively small dataset, a larger training set would be expected to further improve prediction accuracy.
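
When interpreting this number, it helps to distinguish two related quantities. The R² of a best fit line equals the squared Pearson correlation between the two columns, whereas the coefficient of determination measures how close the predictions are to the actual values themselves. The sketch below computes both. The function names and data are hypothetical, and it is an assumption that the plot reports the first quantity.

```python
def r_squared_of_fit_line(x, y):
    """Squared Pearson correlation, i.e. R² of the least-squares line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


def coefficient_of_determination(actual, predicted):
    """1 - SS_res / SS_tot, measuring agreement with the line predicted = actual."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical predictions that are perfectly correlated but systematically too large
known = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 4.0, 6.0, 8.0]
fit_r2 = r_squared_of_fit_line(known, predicted)
direct_r2 = coefficient_of_determination(known, predicted)
print(fit_r2, direct_r2)
```

A high R² for the best fit line therefore shows that predictions track the known values closely, and it is worth also checking that the fitted line lies near the diagonal.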

7. Conclusion and References

In this tutorial, we learned how to use the Periodic Descriptors panel to generate Matminer and SOAP descriptors for crystal systems. We then learned how to use those descriptors to build ML models using the AutoQSAR panel. Finally, we used the models to make predictions on additional test examples that they had not seen before. Altogether, the Periodic Descriptors panel enables ML approaches for property predictions in crystal systems, which could be used to rapidly screen materials for desired properties.

For further learning:

For introductory content, focused on navigating the Schrödinger Materials Science interface, an Introduction to Materials Science Maestro tutorial is available. Please visit the materials science training website for access to 70+ tutorials. For scientific inquiries or technical troubleshooting, submit a ticket to our Technical Support Scientists at help@schrodinger.com.

For self-paced, asynchronous, online courses in Materials Science modeling, including access to Schrödinger software, please visit the Schrödinger Online Learning portal on our website.


For further reading:
  • For Matminer descriptors, read more here
  • For SOAP descriptors, read more here
  • For more information about Voigt-Reuss-Hill averaging, visit here
  • The data in this tutorial is from: Charting the complete elastic properties of inorganic crystalline compounds. DOI:10.1038/sdata.2015.9
  • Help documentation on Periodic Descriptors and AutoQSAR panels
  • Design of Organic Electronic Materials With a Goal-Directed Generative Model Powered by Deep Neural Networks and High-Throughput Molecular Simulations. DOI:10.3389/fchem.2021.800370
  • DeepAutoQSAR Hardware Benchmark (Schrödinger white paper)
  • Active Learning Accelerates Design and Optimization of Hole-Transporting Materials for Organic Electronics. DOI:10.3389/fchem.2021.800371

8. Glossary of Terms

Entry List - a simplified view of the Project Table that allows you to perform basic operations such as selection and inclusion

Included - the entry is represented in the Workspace, the circle in the In column is blue

Project Table - displays the contents of a project and is also an interface for performing operations on selected entries, viewing properties, and organizing structures and data

Recent actions - This is a list of your recent actions, which you can use to reopen a panel, displayed below the Browse row. (Right-click to delete.)

Scratch Project - a temporary project in which work is not saved, closing a scratch project removes all current work and begins a new scratch project

Selected - (1) the atoms are chosen in the Workspace. These atoms are referred to as "the selection" or "the atom selection". Workspace operations are performed on the selected atoms. (2) The entry is chosen in the Entry List (and Project Table) and the row for the entry is highlighted. Project operations are performed on all selected entries

Working Directory - the location where files are saved

Workspace - the 3D display area in the center of the main window, where molecular structures are displayed