Formulation Machine Learning Featurizer and Model Information

Featurizer Information

| Featurizer | Description | Array length for one molecule | Array length for formulations used in ML models | Applicable ML Models | Reference |
| --- | --- | --- | --- | --- | --- |
| MACCS Keys | 167-bit MACCS keys, which are 2D structure fingerprints commonly used for molecular similarity measurement and virtual screening. For formulations, the compositionally weighted average, standard deviation, min, max, and median of this featurizer are used. | 167 | 835 | XGBoost, Support Vector Machine, Dense Neural Network, Random Forest, Elastic Net, Set2Set | RDKit: Open-source cheminformatics. |
| Fingerprint | Morgan fingerprints, which encode groups of atoms into a binary vector of 1 (present) and 0 (not present) with a size of 500-2060 and a radius of 2-4. For formulations, the compositionally weighted average, standard deviation, min, max, and median of this featurizer are used. | 500-2060 | 2500-10300 | XGBoost, Support Vector Machine, Dense Neural Network, Random Forest, Elastic Net, Set2Set | RDKit: Open-source cheminformatics. |
| Learned Fingerprint | Learned molecular fingerprints generated using a graph-based encoder and a multitask physicochemical property decoder trained on ~13 million drug-like molecules. They produce a continuous vector that captures structural and chemical patterns learned directly from data. For formulations, the compositionally weighted average, standard deviation, min, max, and median of this featurizer are used. | 768 | 3840 | XGBoost, Support Vector Machine, Dense Neural Network, Random Forest, Elastic Net, Set2Set | |
| Matminer | Matminer descriptors, which are often used to featurize materials such as inorganic crystals. For formulations, the compositionally weighted average, standard deviation, min, max, and median of this featurizer are used. | 132 | 660 | XGBoost, Support Vector Machine, Dense Neural Network, Random Forest, Elastic Net, Set2Set | Matminer: An open source toolkit for materials data mining |
| RDKit Descriptors | 200 RDKit descriptors, which compute molecular features such as the number of carbon atoms and the molecular weight. For formulations, the compositionally weighted average, standard deviation, min, max, and median of this featurizer are used. | 200 | 1000 | XGBoost, Support Vector Machine, Dense Neural Network, Random Forest, Elastic Net, Set2Set | RDKit: Open-source cheminformatics. |
| All Descriptors | Concatenates Fingerprint, Matminer, MACCS Keys, and RDKit Descriptors. For formulations, the compositionally weighted average, standard deviation, min, max, and median of this featurizer are used. | 999-2559 | 4995-12795 | XGBoost, Support Vector Machine, Dense Neural Network, Random Forest, Elastic Net, Set2Set | |
| Graph Representation | Treats atoms as nodes and bonds as edges. Each node consists of 75 atomic features and the composition of the ingredient. The 75 atomic features describe each heavy atom, for example through one-hot encodings of atomic number, implicit valence, formal charge, atomic degree, number of radical electrons, hybridization, and aromaticity. | | 75 atomic features per node | Set2Set, Graph-based Models | Benchmark Study |
| User Input Features Only | Uses only the additional features specified by the user in the "Descriptors" dropdown. | | Number of user-defined features | XGBoost, Support Vector Machine, Dense Neural Network, Random Forest, Elastic Net | |
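
The formulation array lengths in the table are five times the single-molecule lengths (for example, 167 × 5 = 835 for MACCS Keys), because five statistics are concatenated per feature. The following is a minimal sketch of that aggregation, not the product's implementation. The function name `formulation_features` is hypothetical, and it assumes that the average and standard deviation are weighted by normalized composition fractions while min, max, and median are taken over the ingredients without weighting; the source does not specify these details.

```python
import math
from statistics import median


def formulation_features(ingredient_vectors, fractions):
    """Aggregate per-ingredient feature vectors into one formulation vector.

    Hypothetical helper. Output layout (an assumption):
    [weighted mean | weighted std | min | max | median],
    so the result is five times the per-molecule length.
    """
    if not ingredient_vectors or len(ingredient_vectors) != len(fractions):
        raise ValueError("need one composition fraction per ingredient")
    total = sum(fractions)
    if total <= 0:
        raise ValueError("composition fractions must sum to a positive value")
    weights = [f / total for f in fractions]  # normalize the composition

    n_features = len(ingredient_vectors[0])
    means, stds, mins, maxs, medians = [], [], [], [], []
    for j in range(n_features):
        column = [vec[j] for vec in ingredient_vectors]
        mean = sum(w * x for w, x in zip(weights, column))
        # Assumption: population-style weighted variance around the weighted mean.
        var = sum(w * (x - mean) ** 2 for w, x in zip(weights, column))
        means.append(mean)
        stds.append(math.sqrt(var))
        # Assumption: min, max, and median ignore the composition weights.
        mins.append(min(column))
        maxs.append(max(column))
        medians.append(median(column))
    return means + stds + mins + maxs + medians


# Toy 3-bit "fingerprints" for a two-ingredient formulation (75% / 25%).
toy = formulation_features([[1, 0, 1], [0, 0, 1]], [0.75, 0.25])
print(len(toy))  # 15, i.e. 5 statistics x 3 features
```

Applied to two 167-bit MACCS vectors, the same function returns 835 values, matching the table.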

Model Information

| Model | Description | Is feature importance available? | Applicable featurizers | Reference |
| --- | --- | --- | --- | --- |
| Elastic Net | Linear model that combines the ridge and lasso regression penalties to minimize prediction error while keeping the coefficients of the linear equation small. | Yes | Fingerprint, Matminer, MACCS Keys, RDKit Descriptors, Composition Only, All Descriptors, User Input Features Only | |
| Random Forest | Aggregates multiple decision trees to improve accuracy and reduce overfitting. | Yes | Fingerprint, Matminer, MACCS Keys, RDKit Descriptors, Composition Only, All Descriptors, User Input Features Only | |
| XGBoost | Uses gradient-boosted decision trees to make predictions. | Yes | Fingerprint, Matminer, MACCS Keys, RDKit Descriptors, Composition Only, All Descriptors, User Input Features Only | XGBoost: A Scalable Tree Boosting System |
| Support Vector Machine | Uses non-linear kernel functions to map input data to a higher-dimensional space, where it finds the hyperplane that minimizes the error between predicted and actual values. | Yes | Fingerprint, Matminer, MACCS Keys, RDKit Descriptors, Composition Only, All Descriptors, User Input Features Only | |
| Dense Neural Network | Uses an artificial neural network in which each neuron is fully connected to every neuron in the next layer, forming dense connections that map input data to a property. | Yes | Fingerprint, Matminer, MACCS Keys, RDKit Descriptors, Composition Only, All Descriptors, User Input Features Only | |
| Set2Set | Global pooling operator based on iterative content-based attention. | No | Fingerprint, Matminer, MACCS Keys, RDKit Descriptors, Composition Only, All Descriptors, Graph Representation | Order Matters: Sequence to sequence for sets |
| Graph-based Models | Uses a variety of graph-based approaches, such as graph convolutional neural network, GraphSAGE, GIN, TopK, SAGPool, GlobalAttention, Set2Set, and SortPool. This model requires the "Graph Representation" featurizer to be toggled on. | No | Graph Representation | Benchmark Study |
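
The compatibility rules in the model table can be written as a small lookup, which makes it easier to check which models accept a given featurizer and which of those report feature importance. This is only an illustrative sketch: the names `MODELS` and `models_for` are hypothetical and not part of any product API, and the data is transcribed from the "Applicable featurizers" column of the model table alone.

```python
# Featurizers shared by every model except "Graph-based Models" (per the model table).
VECTOR_FEATURIZERS = (
    "Fingerprint", "Matminer", "MACCS Keys", "RDKit Descriptors",
    "Composition Only", "All Descriptors",
)
TABULAR = VECTOR_FEATURIZERS + ("User Input Features Only",)

# Hypothetical lookup transcribed from the model table.
MODELS = {
    "Elastic Net": {"feature_importance": True, "featurizers": TABULAR},
    "Random Forest": {"feature_importance": True, "featurizers": TABULAR},
    "XGBoost": {"feature_importance": True, "featurizers": TABULAR},
    "Support Vector Machine": {"feature_importance": True, "featurizers": TABULAR},
    "Dense Neural Network": {"feature_importance": True, "featurizers": TABULAR},
    "Set2Set": {
        "feature_importance": False,
        "featurizers": VECTOR_FEATURIZERS + ("Graph Representation",),
    },
    "Graph-based Models": {
        "feature_importance": False,
        "featurizers": ("Graph Representation",),
    },
}


def models_for(featurizer, need_importance=False):
    """Return the sorted model names that accept the given featurizer.

    With need_importance=True, keep only models that report feature importance.
    """
    return sorted(
        name
        for name, info in MODELS.items()
        if featurizer in info["featurizers"]
        and (info["feature_importance"] or not need_importance)
    )


print(models_for("Graph Representation"))
print(models_for("MACCS Keys", need_importance=True))
```

The first call lists only Set2Set and Graph-based Models, which reflects the requirement that graph-based models be paired with the "Graph Representation" featurizer.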