Formulation Machine Learning Featurizer and Model Information
Featurizer Information
| Featurizer | Description | Array length for one molecule | Array length for formulations used in ML models | Applicable ML Models | Reference |
| --- | --- | --- | --- | --- | --- |
| MACCS Keys | 167-bit MACCS keys, which are 2D structure fingerprints commonly used for molecular similarity measurement and virtual screening. For formulations, the compositionally weighted average, standard deviation, min, max, and median of this featurizer are used. | 167 | 835 | XGBoost, Support Vector Machine, Dense Neural Network, Random Forest, Elastic Net, Set2Set | RDKit: Open-source cheminformatics. |
| Fingerprint | Morgan fingerprints, which encode groups of atoms into a binary vector (1 = present, 0 = not present) with a size of 500-2060 bits and a radius of 2-4. For formulations, the compositionally weighted average, standard deviation, min, max, and median of this featurizer are used. | 500-2060 | 2500-10300 | XGBoost, Support Vector Machine, Dense Neural Network, Random Forest, Elastic Net, Set2Set | RDKit: Open-source cheminformatics. |
| Learned Fingerprint | Learned molecular fingerprints generated by a graph-based encoder and a multitask physicochemical property decoder trained on ~13 million drug-like molecules. The result is a continuous vector that captures structural and chemical patterns learned directly from data. For formulations, the compositionally weighted average, standard deviation, min, max, and median of this featurizer are used. | 768 | 3840 | XGBoost, Support Vector Machine, Dense Neural Network, Random Forest, Elastic Net, Set2Set | |
| Matminer | Matminer descriptors, which are often used to featurize materials such as inorganic crystals. For formulations, the compositionally weighted average, standard deviation, min, max, and median of this featurizer are used. | 132 | 660 | XGBoost, Support Vector Machine, Dense Neural Network, Random Forest, Elastic Net, Set2Set | |
| RDKit Descriptors | 200 RDKit descriptors, which capture molecular features such as the number of carbon atoms and the molecular weight. For formulations, the compositionally weighted average, standard deviation, min, max, and median of this featurizer are used. | 200 | 1000 | XGBoost, Support Vector Machine, Dense Neural Network, Random Forest, Elastic Net, Set2Set | RDKit: Open-source cheminformatics. |
| All Descriptors | Concatenates Fingerprint, Matminer, MACCS Keys, and RDKit Descriptors. For formulations, the compositionally weighted average, standard deviation, min, max, and median of this featurizer are used. | 999-2559 | 4995-12795 | XGBoost, Support Vector Machine, Dense Neural Network, Random Forest, Elastic Net, Set2Set | |
| Graph Representation | Treats atoms as nodes and bonds as edges. Each node consists of 75 atomic features and the composition of the ingredient. The 75 atomic features are used to featurize each heavy atom; examples include one-hot encodings of atomic number, implicit valence, formal charge, atomic degree, number of radical electrons, hybridization, and aromaticity. | 75 atomic features per node | 75 atomic features per node | Set2Set, Graph-based Models | |
| User Input Features Only | Uses only the additional features specified by the user in the "Descriptors" dropdown. | Number of user-defined features | Number of user-defined features | XGBoost, Support Vector Machine, Dense Neural Network, Random Forest, Elastic Net | |
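
The formulation array lengths in the table above are five times the single-molecule lengths because five statistics are concatenated per feature. The sketch below illustrates that pooling step in plain Python. It is not the tool's implementation: the function name `pool_formulation` is hypothetical, and it assumes that the composition weights apply to the average and standard deviation while min, max, and median are taken directly over the ingredients. The source does not specify these details.

```python
import statistics


def pool_formulation(ingredient_vectors, fractions):
    """Pool per-ingredient feature vectors into one formulation vector.

    Hypothetical helper. Returns the concatenation
    [weighted mean | weighted std | min | max | median],
    so the output is 5x the per-molecule feature length.
    """
    if len(ingredient_vectors) != len(fractions):
        raise ValueError("one fraction per ingredient is required")
    total = sum(fractions)
    weights = [f / total for f in fractions]  # normalize so weights sum to 1
    n_features = len(ingredient_vectors[0])

    mean, std, low, high, med = [], [], [], [], []
    for j in range(n_features):
        column = [vec[j] for vec in ingredient_vectors]
        mu = sum(w * x for w, x in zip(weights, column))
        var = sum(w * (x - mu) ** 2 for w, x in zip(weights, column))
        mean.append(mu)
        std.append(var ** 0.5)
        # Assumption: min, max, and median ignore the composition weights.
        low.append(min(column))
        high.append(max(column))
        med.append(statistics.median(column))
    return mean + std + low + high + med


# Two made-up ingredients with 3 features each, mixed 75:25.
features = [[1.0, 0.0, 2.0], [0.0, 1.0, 2.0]]
pooled = pool_formulation(features, [0.75, 0.25])
print(len(pooled))  # 15 = 5 statistics x 3 features
```

Under the same assumption, a 167-bit MACCS vector per ingredient yields 5 × 167 = 835 values per formulation, and the shortest All Descriptors vector (500 + 132 + 167 + 200 = 999) yields 4995, matching the table.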
Model Information
| Model | Description | Is feature importance available? | Applicable featurizers | Reference |
| --- | --- | --- | --- | --- |
| Elastic Net | Linear model that combines the ridge (L2) and lasso (L1) regression penalties, minimizing prediction error while keeping the coefficients of the linear equation small. | Yes | Fingerprint, Matminer, MACCS Keys, RDKit Descriptors, Composition Only, All Descriptors, User Input Features Only | |
| Random Forest | Aggregates multiple decision trees to improve accuracy and reduce overfitting. | Yes | Fingerprint, Matminer, MACCS Keys, RDKit Descriptors, Composition Only, All Descriptors, User Input Features Only | |
| XGBoost | Uses gradient boosted decision trees to make predictions. | Yes | Fingerprint, Matminer, MACCS Keys, RDKit Descriptors, Composition Only, All Descriptors, User Input Features Only | XGBoost: A Scalable Tree Boosting System |
| Support Vector Machine | Uses non-linear kernel functions to map input data into a higher-dimensional space, then finds the optimal hyperplane in that space to minimize the error between predicted and actual values. | Yes | Fingerprint, Matminer, MACCS Keys, RDKit Descriptors, Composition Only, All Descriptors, User Input Features Only | |
| Dense Neural Network | Uses an artificial neural network in which each neuron is fully connected to every neuron in the next layer, forming dense connections that map input data to a property. | Yes | Fingerprint, Matminer, MACCS Keys, RDKit Descriptors, Composition Only, All Descriptors, User Input Features Only | |
| Set2Set | Global pooling operator based on iterative content-based attention. | No | Fingerprint, Matminer, MACCS Keys, RDKit Descriptors, Composition Only, All Descriptors, Graph Representation | Order Matters: Sequence to sequence for sets |
| Graph-based Models | Uses a variety of graph-based approaches, such as graph convolutional neural networks, GraphSAGE, GIN, TopK, SAGPool, GlobalAttention, Set2Set, and SortPool. This model requires the "Graph Representation" featurizer to be toggled on. | No | Graph Representation | Benchmark Study |
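
As a reference for the Elastic Net entry in the table above, the combination of ridge and lasso penalties can be written as a single objective. The form below follows one common parameterization (the one documented for scikit-learn's `ElasticNet`). Whether this tool uses the same parameterization is an assumption, and the symbols $\lambda$ (overall penalty strength) and $\rho$ (L1 mixing ratio) are introduced here for illustration only.

```latex
\min_{\beta}\;
\frac{1}{2n}\,\lVert y - X\beta \rVert_2^2
\;+\; \lambda\,\rho\,\lVert \beta \rVert_1
\;+\; \frac{\lambda\,(1-\rho)}{2}\,\lVert \beta \rVert_2^2
```

Setting $\rho = 1$ recovers lasso regression and $\rho = 0$ recovers ridge regression. Both penalty terms shrink the coefficients $\beta$ toward zero, which is what keeps them small.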