Cheminformatics Machine Learning for Homogeneous Catalysis
Tutorial Created with Software Release: 2024-1
Topics: Catalysis & Reactivity , Energy Capture & Storage , Metals, Alloys & Ceramics , Thin Film Processing
Methodology: Machine Learning
Products Used: DeepAutoQSAR , MS Informatics , MS Maestro
0.3 GB
This tutorial is written for use with a 3-button mouse with a scroll wheel.
Words found in the Glossary of Terms are shown like this: Workspace (the 3D display area in the center of the main window, where molecular structures are displayed)
Abstract:
In this tutorial, we will learn to develop and use a machine learning model to predict reaction rate constants for iridium catalysts.
Tutorial Content
1. Introduction
Discovering new catalysts for improved reactivity or selectivity is challenging because of the large number of laborious experiments or stepwise quantum mechanical calculations necessary to explore the catalyst design space. As an alternative to these approaches, machine learning (ML) is a promising avenue for rapidly screening catalysts for enhanced properties (see References for recent literature examples).
A useful ML tool is Quantitative Structure-Activity Relationships (QSAR), which can efficiently predict material properties for a wide range of molecules. Schrödinger’s AutoQSAR tool automates the generation of accurate QSAR models, which allows users to leverage machine learning without extensive background knowledge. For a complete description of how AutoQSAR automatically tests various models and makes selections, visit the Machine Learning for Materials Science tutorial.
DeepAutoQSAR integrates graph convolutional neural networks into the traditional AutoQSAR workflow, where DeepAutoQSAR treats a molecule as a graph consisting of nodes as atoms and edges as bonds. DeepAutoQSAR has been found to outperform traditional AutoQSAR for ‘large’ datasets (>5000 molecules) and perform similarly to traditional AutoQSAR for ‘small’ datasets (<5000 molecules) (see comparison here). A distinct advantage of DeepAutoQSAR is its ability to identify hidden patterns relevant to the property of interest through a series of convolution operations. You can read more about DeepAutoQSAR on our webpage as well as the references therein.
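To make the "molecule as a graph" idea concrete, the sketch below builds a tiny graph by hand and applies two rounds of neighbor averaging. This is a toy illustration only, not the DeepAutoQSAR implementation: the one-hot element features, the `neighbors` and `convolve` helpers, and the unweighted averaging are all assumptions chosen for clarity. Real graph convolutional networks use learned weights and nonlinear activations.

```python
# Toy illustration only: NOT the DeepAutoQSAR implementation.
# Heavy-atom graph of ethanol (C-C-O); hydrogens omitted for brevity.
atoms = ["C", "C", "O"]      # nodes
bonds = [(0, 1), (1, 2)]     # edges (undirected)

# Hypothetical one-hot features over the element vocabulary ["C", "O"]
vocab = ["C", "O"]
features = [[1.0 if a == v else 0.0 for v in vocab] for a in atoms]


def neighbors(n_atoms, bonds):
    """Build an adjacency list from the bond list."""
    adj = {i: [] for i in range(n_atoms)}
    for i, j in bonds:
        adj[i].append(j)
        adj[j].append(i)
    return adj


def convolve(features, adj):
    """One unweighted convolution step: average each atom with its bonded neighbors."""
    out = []
    for i, own in enumerate(features):
        group = [own] + [features[j] for j in adj[i]]
        out.append([sum(col) / len(group) for col in zip(*group)])
    return out


adj = neighbors(len(atoms), bonds)
h1 = convolve(features, adj)   # each atom now "sees" its direct neighbors
h2 = convolve(h1, adj)         # each atom now "sees" atoms two bonds away
```

After one step the terminal carbon carries no information about the oxygen; after two steps it does. Stacking convolution steps in this way is what lets a graph model pick up structural patterns that extend beyond a single bond.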
In this tutorial, we will use the DeepAutoQSAR panel in MS Maestro to create a machine learning model to predict rate constants for a radical reaction (reductive dehalogenation of aryl halide) catalyzed by a series of organometallic iridium complexes. The experimental data set is taken from a recent publication by Mdluli et al. (High-throughput Synthesis and Screening of Iridium(III) Photocatalysts for the Fast and Chemoselective Dehalogenation of Aryl Bromides. DOI:10.1021/acscatal.0c02247). This experimental dataset explores a series of ~1000 [Ir(C^N)2(N^N)]+ photocatalysts (octahedral iridium complexes with three bidentate ligands) and measures rate constants using high-throughput colorimetric monitoring.
Herein, we use a data set of 863 of the iridium complexes and the experimental rate constants to train and evaluate machine learning models. The DeepAutoQSAR panel is used to generate a model to predict rate constants by training on the structure of each Ir complex and the associated rate constant. To test the generalizability of the model, rate constants are predicted for an unseen set of 50 complexes. The overall workflow is summarized in Figure 1.
Figure 1. Tutorial workflow showing the input Ir complexes, DeepAutoQSAR panel used to build machine learning models, and the output parity plot after model training. After training the model, an unseen test set was used to evaluate model performance. The workflow subsequently shows the Ir complexes, output predictions, and parity plot for the test set.
Note that while this dataset is small enough that AutoQSAR could also be used, this tutorial focuses on using DeepAutoQSAR, which produces slightly more accurate predictions than traditional AutoQSAR.
For additional practice with the DeepAutoQSAR workflow, but with a categorical classification task, see the Machine Learning for Sweetness tutorial.
For additional practice with AutoQSAR, tutorials are available using the Materials Science Maestro suite to predict properties of small molecules, polymers and periodic systems: Machine Learning for Materials Science, Polymer Descriptors for Machine Learning and Periodic Descriptors for Inorganic Solids.
To learn about using pre-built machine learning models to predict volatility of organometallic complexes, please refer to the Machine Learning Property Prediction tutorial.
For alternative computational approaches for catalyst discovery, namely elucidating reaction mechanisms via various workflows, visit the Locating Transition States: Part 1 and Part 2 tutorials, as well as the RxnProfiler with Polyethylene Insertion tutorial.
2. Creating Projects and Importing Structures
At the start of the session, change the file path to your chosen Working Directory in MS Maestro to make file navigation easier. Each session in MS Maestro begins with a default Scratch Project, which is not saved. An MS Maestro project stores all your data and has a .prj extension. A project may contain numerous entries corresponding to imported structures, as well as the output of modeling-related tasks. Once a project is saved, the project is automatically saved each time a change is made.
Structures can be built in MS Maestro or can be imported using File > Import Structures (or dragged and dropped), and are added to the Entry List and Project Table. The Entry List is located to the left of the Workspace. The Project Table can be accessed by Ctrl+T (Cmd+T) or Window > Project Table if you would like to see an expanded view of your project data.
- Double-click the Materials Science icon
- (No icon? See Starting Maestro)
- Go to File > Change Working Directory
- Find your directory, and click Choose
- Pre-generated files are included for running jobs or examining output. Download the zip file here: schrodinger.com/sites/default/files/s3/release/current/Tutorials/zip/catalysis_ml.zip
- After downloading the zip file, unzip the contents in your Working Directory for ease of access throughout the tutorial
- Go to File > Save Project As
- Change the File name to catalysis_ml_tutorial, click Save
- The project is now named catalysis_ml_tutorial.prj
Let’s import the data set:
- Go to File > Import Structures
- Navigate to where you downloaded the provided tutorial files (presumably in your working directory) and choose train.mae from the provided tutorial files
- Click Open
The entry list is updated to include the 863 entries. Feel free to stylize and visualize any of the provided structures.
Note: The model complexes were prepared using Materials Science Maestro structure building capabilities (see the Organometallic Complexes tutorial for relevant workflows).
Each entry has a rate constant associated with it. These can be visualized in the Project Table. Use the Property Tree to add the rate constant property (under All > Materials Science > Secondary > rate constant).
3. Building a Machine Learning Model Using DeepAutoQSAR
In this section, we will use the DeepAutoQSAR panel to train a machine learning model for rate constant prediction. For a complete description of how AutoQSAR automatically tests various models and makes selections, visit the Machine Learning for Materials Science tutorial.
- Ensure that all 863 entries are selected in the entry list (use Shift + Click or click on the entry group header)
- Go to Tasks > Browse All > Discovery Informatics and QSAR > DeepAutoQSAR
- The DeepAutoQSAR panel opens
- Ensure that Build model is checked
- For Model type, choose Regression
- Because the data is numerical and continuous, we use the regression Model type
- Ensure that for Use structures from, Project Table (863 selected entries) is chosen
- Change the Prediction property dropdown to rate constant
- Set 90% for the Random split
- This sets how the data is divided between the train and test sets: 90% of the data is used to train the model and 10% of the data is used to test the model
- With 863 entries, the 90:10 split ensures that there is significantly more data in the training set than in the test set, while still leaving enough data in the test set (roughly 86 entries) to assess model performance
- Set the Training time to 10 hours
- For datasets with >800 inputs like this one, a 10 hour training time is sufficient to ensure the best models are determined
- Change the Job name to BuildTask_rate_constant
Adjust the job settings as needed. This job requires a Linux host and will run for the prescribed 10 hours. If you would like to run the job yourself, click Run. Otherwise, proceed to Section 4, where steps are provided for importing the pre-computed models.
- Close the DeepAutoQSAR panel
Note: If you do not have remote submission to Linux from MS Maestro set up, an alternative is to use the Write option in the job settings dropdown to generate a directory containing the submission scripts, which could then be transferred and run on a Linux host.
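As an aside, the 90:10 random split chosen in the panel can be pictured with a few lines of Python. This is a minimal sketch under stated assumptions, not the code DeepAutoQSAR runs: the `random_split` helper, the fixed seed, and the rounding rule are hypothetical, and the panel may round or shuffle differently.

```python
import random


def random_split(entries, train_fraction=0.9, seed=0):
    """Shuffle a copy of the entries and split it into (train, test) lists.

    Hypothetical helper for illustration; the rounding rule is an assumption.
    """
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


# Stand-in identifiers for the 863 imported entries
entry_ids = list(range(863))
train, test = random_split(entry_ids)
print(len(train), len(test))
```

With this rounding rule, 863 entries divide into 777 training and 86 test entries. Every entry lands in exactly one of the two sets, so the model statistics reported on the test set come from structures the model did not train on.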
4. Viewing the Machine Learning Model and Predicting
Using the DeepAutoQSAR panel, we can proceed to view the generated models, and use these to make predictions on an unseen data set.
When the job is complete, note that no new entry group is added to the entry list. The output can be analyzed and used for predictions back in the DeepAutoQSAR panel:
- Return to Tasks > Browse All > Discovery Informatics and QSAR > DeepAutoQSAR
- The DeepAutoQSAR panel opens
- For Choose task, switch to Make Predictions
- To choose the Model file click Browse, navigate to the Section_03 > BuildTask_rate_constant > BuildTask_rate_constant_model.qzip file (either in the provided files or from your job, depending on whether you ran the job in Section 3) and click Open
- The panel will parse the .qzip file and the Model Summary section will be populated
Begin by analyzing the Model Summary output. The data presented is a summary of the statistics of the model on the test set. We observe that the DeepAutoQSAR model achieves a high R2 of ~0.80 (denoted as r2) and a low root-mean-squared error (rmse) of ~0.20 (an ideal model would have an R2 of 1 and an RMSE of 0).
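For readers who want to see what these two statistics measure, the sketch below computes them with their standard textbook definitions. The `rmse` and `r_squared` helpers and the four data points are made up for illustration; they are not taken from the tutorial data set, and we assume, without confirmation from the panel documentation, that the reported r2 is the usual coefficient of determination.

```python
import math


def rmse(actual, predicted):
    """Root-mean-squared error: square root of the mean squared residual."""
    residuals = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(residuals) / len(residuals))


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up values for illustration only
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]

print(round(r_squared(actual, predicted), 3))
print(round(rmse(actual, predicted), 3))
```

A model that reproduces every value exactly gives an R2 of 1 and an RMSE of 0, which is why those are quoted as the ideal. Note that RMSE carries the units of the predicted property, so whether a given RMSE counts as low depends on the spread of that property.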
- Click View Full Report
The Report tab includes a raw copy of the JSON output of DeepAutoQSAR. This report contains information on the top four best-performing model ensembles, including their metrics, the type of method used (e.g. dNN or random forest) and relevant model meta-parameters.
For a complete description of how AutoQSAR automatically tests various models and makes selections, visit the Machine Learning for Materials Science tutorial.
- Click on the Plot tab
For regression models such as this, the Plot tab shows a parity plot.
- Close the DeepAutoQSAR Report Viewer
Now, we will use the trained model to make predictions on an unseen data set of iridium complexes that were not in the training data. These complexes have known rate constants from the same experimental study, which we can use to assess the quality of the model for making predictions outside the training set.
- Close the DeepAutoQSAR panel (or simply minimize it or adjust to the side of your window – we will return to it in a moment)
- Go to File > Import Structures
- Navigate to where you downloaded the provided tutorial files (presumably in your working directory), choose test.mae and click Open
- A new entry group titled test (50) is added to the entry list
- Select the entire test (50) group from the entry list
- Recall that select means to highlight the group in the entry list
- Return to the DeepAutoQSAR panel
- Ensure that the panel reflects the progress from the above steps: Make Predictions is selected, the .qzip file is loaded and the Model Summary is shown
- In the Make Predictions section of the panel, ensure that Project Table (50 selected entries) is chosen for Use structures from
- For Output property name, maintain PredictTask
- This will be the name of the predicted property in the project table
- Change the Job name to PredictTask_test_set
Adjust the job settings as needed. This job requires a Linux host. The job can be completed in about 5 minutes, which is many orders of magnitude faster than computing the rate constants of 50 systems from first principles. If you do not have access to a Linux host or do not wish to run the job, you can instead import Section_04 > PredictTask_test_set > PredictTask_test_set_output.maegz from the provided tutorial files.
- Click Run
- Close the DeepAutoQSAR panel
Note: If you do not have remote submission to Linux from MS Maestro set up, an alternative is to use the Write option in the job settings dropdown to generate a directory containing the submission scripts, which could then be transferred and run on a Linux host.
When the job is complete or after importing, a new entry group titled PredictTask_test_set_output1 (50) is added to the entry list, containing the same 50 entries, but now with a predicted rate constant property. The data can be analyzed in the Project Table.
- Open the Project Table
- Use the Property Tree to include the Predicted Task score and uncertainty properties (Check the properties of interest under All > Maestro > Predict Task score/uncertainty)
We can see predicted scores for the various molecules as well as uncertainty values.
To compare these values to the known values we will draw a scatter plot.
- Click New Scatter Plot
- For X-Axis select matsci_rate_constant
- These are the actual values of the target property
- For Y-Axis select PredictTask score
- These are the ML predicted values
- Check Best fit line
- A regression line and equation are added
Feel free to stylize the graph and save an image.
The best fit line between predicted and actual values shows a reasonable R2 of 0.81 (an ideal model would have an R2 of 1.00). The results suggest that the ML model derived from the DeepAutoQSAR panel could generalize to unseen iridium complexes. Furthermore, this workflow highlights the computational efficiency achieved when using ML approaches as compared to other computational (e.g. ab initio calculations) or experimental approaches. While this tutorial uses a relatively small dataset, one could expect that a larger training set would further improve prediction accuracy.
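The best fit line drawn on a scatter plot can be reproduced with an ordinary least-squares fit, sketched below. The `best_fit` helper and the data points are hypothetical and are not values from the test set. We assume the plot reports the squared Pearson correlation as the R2 of the line, which is the usual convention for a best fit line but is not confirmed by the tutorial.

```python
def best_fit(x, y):
    """Ordinary least-squares line y = slope * x + intercept, plus squared Pearson r."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r2 = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r2


# Made-up values for illustration only
actual = [1.0, 2.0, 3.0, 4.0]       # x-axis: measured property
predicted = [1.1, 1.9, 3.2, 3.8]    # y-axis: ML prediction

slope, intercept, r2 = best_fit(actual, predicted)
print(round(slope, 2), round(intercept, 2), round(r2, 3))
```

When reading such a plot, it helps to check the slope and intercept as well as the R2: a fit line can correlate strongly with the data while still deviating from the ideal parity line (slope 1, intercept 0), in which case the predictions are systematically scaled or shifted.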
5. Conclusion and References
In this tutorial, we learned how to use the DeepAutoQSAR panel to build machine learning models to predict experimentally determined rate constants for a series of iridium complexes. The DeepAutoQSAR model can generalize to unseen data sets and generate fast predictions (~seconds-minutes) as compared to ab initio or experimental measurements (~hours-days), enabling the screening of catalysts for enhanced reaction rates. While this tutorial focuses on reaction rate constants of iridium complexes, the workflow can be extended to other catalyst types and properties.
For further learning:
For introductory content, focused on navigating the Schrödinger Materials Science interface, an Introduction to Materials Science Maestro tutorial is available. Please visit the materials science training website for access to 70+ tutorials. For scientific inquiries or technical troubleshooting, submit a ticket to our Technical Support Scientists at help@schrodinger.com.
For self-paced, asynchronous, online courses in Materials Science modeling, including access to Schrödinger software, please visit the Schrödinger Online Learning portal on our website.
For some related practice, proceed to explore other relevant tutorials:
- For more machine learning:
- Machine Learning for Materials Science
- Polymer Descriptors for Machine Learning
- Periodic Descriptors for Inorganic Solids
- Optoelectronics Active Learning
- Machine Learning for Sweetness
- Machine Learning Property Prediction
- Machine Learning for Ionic Conductivity
- Molecular Dynamics Descriptors for Machine Learning
- Machine Learning for Formulations
- For transition state searching with quantum mechanical methods in molecular or periodic systems:
For further reading:
- Help documentation on DeepAutoQSAR
- High-throughput Synthesis and Screening of Iridium(III) Photocatalysts for the Fast and Chemoselective Dehalogenation of Aryl Bromides. DOI:10.1021/acscatal.0c02247
- DeepAutoQSAR: Scalable, Intuitive, Deep-learning QSAR models for Big Data Applications (Schrödinger white paper)
- DeepAutoQSAR Hardware Benchmark (Schrödinger white paper)
- Design of Organic Electronic Materials With a Goal-Directed Generative Model Powered by Deep Neural Networks and High-Throughput Molecular Simulations. DOI:10.3389/fchem.2021.800370
- Active Learning Accelerates Design and Optimization of Hole-Transporting Materials for Organic Electronics. DOI:10.3389/fchem.2021.800371
- Some recent publications applying machine learning methods in catalysis and reactivity:
- Machine Learning in Catalysis, From Proposal to Practicing. DOI:10.1021/acsomega.9b03673
- Accelerated dinuclear palladium catalyst identification through unsupervised machine learning. DOI:10.1126/science.abj0999
- Univariate classification of phosphine ligation state and reactivity in cross-coupling catalysis. DOI:10.1126/science.abj4213
- Catalytic Performance of Cycloalkyl-Fused Aryliminopyridyl Nickel Complexes towards Ethylene Polymerization by QSPR Modeling. DOI:10.3390/catal11080920
6. Glossary of Terms
Entry List - a simplified view of the Project Table that allows you to perform basic operations such as selection and inclusion
Included - the entry is represented in the Workspace, the circle in the In column is blue
Project Table - displays the contents of a project and is also an interface for performing operations on selected entries, viewing properties, and organizing structures and data
Recent actions - This is a list of your recent actions, which you can use to reopen a panel, displayed below the Browse row. (Right-click to delete.)
Scratch Project - a temporary project in which work is not saved, closing a scratch project removes all current work and begins a new scratch project
Selected - (1) the atoms are chosen in the Workspace. These atoms are referred to as "the selection" or "the atom selection". Workspace operations are performed on the selected atoms. (2) The entry is chosen in the Entry List (and Project Table) and the row for the entry is highlighted. Project operations are performed on all selected entries
Working Directory - the location where files are saved
Workspace - the 3D display area in the center of the main window, where molecular structures are displayed