AutoQSAR Panel
Generate QSAR models for a chosen property of a set of compounds using traditional QSAR methods, and apply the models to other compounds.
To open this panel: click the Tasks button and browse to Discovery Informatics and QSAR → AutoQSAR.
- Overview
- Features
- Additional Resources
AutoQSAR Overview
The purpose of the AutoQSAR panel is to automatically produce robust QSAR models with minimal user input or expertise. This panel uses the traditional approach, in which QSAR models are built with standard QSAR techniques. The traditional approach is suitable for data sets of up to several thousand structures; for larger data sets you can use a deep learning approach powered by DeepAutoQSAR (see DeepAutoQSAR Panel).
In the traditional methods, models are based on a large pool of several hundred descriptors and fingerprints, use a variety of methods, and draw on best practices to evaluate and select the best of the generated models. An overview of the methods is given below. It is recommended that you read AutoQSAR Best Practices before choosing your data set and developing your models.
AutoQSAR provides capabilities for both numeric models that yield continuously valued predictions and categorical models that yield discrete predictions for two or more classes. Numeric models are built using ensemble best subsets multiple linear regression (MLR), partial least-squares regression (PLS), kernel-based partial least-squares regression (KPLS), and principal components regression (PCR). Categorical models are built using Naïve Bayes classification (Bayes) and ensemble recursive partitioning (RP).
Several hundred 2D descriptors (molecular, topological, and feature counts) are automatically generated for model building. In addition, 2D fingerprints for Bayes and KPLS models are generated, retaining the 10,000 most informative bits for each fingerprint type (linear, radial, dendritic, and molprint2D). As many of these descriptors are strongly correlated, a selection process is applied to identify the maximally informative subset for which all the correlation coefficients are below a specified threshold. Descriptors that have the same value for more than 90% of the structures are also eliminated, as these cannot contribute significantly to the variation in the property.
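The descriptor filtering just described can be sketched as follows. This is an illustrative reimplementation, not the AutoQSAR code: the 90% near-constant cutoff comes from the text above, while the greedy selection order and the 0.95 correlation cutoff are assumptions of the sketch.

```python
import numpy as np

def prune_descriptors(X, names, near_constant_frac=0.90, corr_cutoff=0.95):
    """Illustrative descriptor pruning: drop columns with the same value
    for more than near_constant_frac of structures, then greedily keep a
    subset whose pairwise |r| stays below corr_cutoff.  The 0.95 cutoff
    and the greedy order are assumptions, not AutoQSAR's actual logic."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    keep = []
    for j, name in enumerate(names):
        # Drop descriptors with the same value for >90% of the structures.
        _, counts = np.unique(X[:, j], return_counts=True)
        if counts.max() / n > near_constant_frac:
            continue
        # Keep only if |correlation| with every already-kept column is low.
        if all(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) < corr_cutoff
               for k in keep):
            keep.append(j)
    return [names[j] for j in keep]
```

For example, a column that perfectly tracks another and a constant column are both eliminated, leaving a single informative descriptor.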
The informative descriptors and fingerprints are used to build a large number of numeric or categorical models, where a given model is trained against a particular random subset of the input structures. The model is applied to the remaining input structures, and the accuracy of those predictions is used to arrive at an optimal number of factors for KPLS, PCR, and PLS models, and to assign an overall ranking score to the model. Models that perform well on the test set and have highly consistent training and test set statistics are given higher ranking scores. (The ranking score increases with the accuracy of the test set predictions, and decreases as the accuracy of the training and test set predictions diverge.)
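The actual ranking formula is Equation (1) of Ref 1 and is not reproduced here; the toy function below only illustrates the qualitative behavior just described, in which the score rises with test-set accuracy and falls as the training and test statistics diverge.

```python
def illustrative_rank_score(r2_train, q2_test, divergence_weight=1.0):
    """NOT the published AutoQSAR score (see Eq. (1) of Ref 1).
    A toy score with the same qualitative behavior: it increases with
    test-set accuracy (Q^2) and decreases as the training and test
    statistics diverge."""
    return q2_test - divergence_weight * abs(r2_train - q2_test)
```

Under this toy score a consistent model (R² = 0.80, Q² = 0.75) outranks an overfit one (R² = 0.95, Q² = 0.60).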
In addition to the training and test set statistics, a "null hypothesis" is generated, in which the variation in the property is assumed to be predicted by the molecular weight. The Q2 value for this hypothesis is reported for each model. If it is close to the Q2 value for the model itself, it implies that the property for the test set is simply explained by variation in the molecular weight.
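A minimal sketch of this null-hypothesis check: fit the property to molecular weight alone on the training set, then compute Q² on the test set. The least-squares fit and the SS_tot convention (taken about the test-set mean) are assumptions of the sketch, not a statement of AutoQSAR's exact convention.

```python
import numpy as np

def q2_mw_null(mw_train, y_train, mw_test, y_test):
    """Q^2 for the "null hypothesis" model: an ordinary least-squares
    line of property vs. molecular weight, scored on the test set as
    Q^2 = 1 - SS_res / SS_tot."""
    slope, intercept = np.polyfit(mw_train, y_train, 1)
    pred = slope * np.asarray(mw_test, dtype=float) + intercept
    y_test = np.asarray(y_test, dtype=float)
    ss_res = np.sum((y_test - pred) ** 2)
    ss_tot = np.sum((y_test - y_test.mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```

A model whose Q² barely exceeds this value is adding little beyond a size effect.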
The domain of applicability of a model is assessed by calculating a similarity score. For each structure in the training set, dendritic fingerprints are generated and combined to form a modal fingerprint. The Tanimoto similarity of the fingerprints of each structure to the modal fingerprint is calculated and the mean and standard deviation of these similarities is evaluated. For each structure whose properties are predicted, the Tanimoto similarity is evaluated using the same procedure, and this is compared with the training set mean. Structures whose similarity is outside two standard deviations of the mean are flagged as outside the domain of applicability of the model. When multiple models are used (with different training sets), the values are averaged over the models.
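The procedure can be sketched with fingerprints held as Python sets of on-bit indices. The majority-vote construction of the modal fingerprint is an assumption of this sketch; the document does not specify how the training fingerprints are combined.

```python
import statistics

def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints held as sets of on-bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def applicability_domain(train_fps, query_fps, majority=0.5, n_sd=2.0):
    """Domain-of-applicability sketch.  Assumption: the "modal
    fingerprint" keeps bits set in at least half of the training
    structures.  A query is flagged (True) as outside the domain when
    its Tanimoto similarity to the modal fingerprint falls more than
    n_sd standard deviations from the training-set mean."""
    counts = {}
    for fp in train_fps:
        for bit in fp:
            counts[bit] = counts.get(bit, 0) + 1
    modal = {b for b, c in counts.items() if c / len(train_fps) >= majority}
    sims = [tanimoto(fp, modal) for fp in train_fps]
    mean, sd = statistics.mean(sims), statistics.pstdev(sims)
    return [abs(tanimoto(fp, modal) - mean) > n_sd * sd for fp in query_fps]
```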
AutoQSAR Panel Features
- QSAR Task section
- Build Model section
- Options section
- Model Report section
- Make Prediction section
- Job toolbar
- Status bar
QSAR Task section
In this section, you choose the task to perform. The panel is updated to show the relevant settings for the task you choose.
- Choose Task options
-
Choose the task to perform. The choice affects the features displayed in the rest of the panel.
-
Build model—Build QSAR models, using the settings made in the Build Model section and the Options section. These sections are displayed when you select this option.
When the build job finishes, the generated QSAR models are stored in an archive named jobname.qzip, and the path to this archive is added to the project. Do not rename or move this archive if you want it to be available from the project later on.
-
View model and make prediction—Examine a report on the QSAR models built, or make a prediction of properties for new structures. The Model Report section and Make Prediction section are displayed when you select this option.
When you select this option, the option menu is activated, and you can choose a model from the menu, or choose From File to load a QSAR model from a file (.qzip). If you choose the latter, you can specify the file in the File name text box, or click the Browse button to locate it. The model is then loaded from this file and added to the option menu.
-
Build Model section
In this section you specify the structures to use for the training and test sets, the property to be predicted, and its type.
- Use structures from option menu
-
Specify the source of the structures to use for the model.
-
Project Table (selected entries)—Use the entries that are selected in the Project Table as the source of the structures.
-
File—Use the specified file as the source of the structures. All structures in the file are used. The supported file types are Maestro (.mae, .mae.gz, .maegz) and SD (.sdf, .sdf.gz, .sdfgz). When you select this option, a File name text box and Browse button are displayed, so you can enter the file name in the text box, or click Browse and navigate to the file.
-
(no structures)—Use CSV (.csv) files that contain descriptors but no structures.
-
- Use validation set option and Select button
-
Select structures to be used in a separate validation set. The structures are taken from the input set, and are not used in model building. Instead, they are used for the final testing to generate a Q-squared score for the models; the prediction on which the score is based is a consensus of all the models (see under Model to test). Click Select to choose the ligands in the Select Validation Set dialog box.
- Prediction property option menu
-
Choose the property that is to be predicted by the QSAR model. The option menu lists the available properties. These are taken from the file or the Project Table, depending on the structure source. All structures must have a value for this property.
- Property type option menu
-
Specify the property type. The choices are Numerical or Categorical. The settings available for the property depend on the choice you make. Properties that are numerical (integer, real) can be treated as categorical, if you want to do predictions of property ranges rather than property values. String-valued properties can only be treated as categorical.
Options section
In this section you specify options for building the QSAR model. The options depend on the choices made for the property type.
- Number of categories box
-
Specify the number of categories to use for the prediction property. This applies if the prediction property has integer or real values, and the type is set to Categorical. For string-valued properties, the number of categories is determined automatically. By default, the property is split into ranges of equal width, with each range defining a category. You can change the way in which the categories are defined in the Advanced Options dialog box. This box is only present if you chose Categorical for the property type.
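The default equal-width categorization can be sketched as below; `equal_width_categories` is a hypothetical helper written for this illustration, and the 1-based category labels are a choice of the sketch.

```python
import numpy as np

def equal_width_categories(values, n_categories):
    """Split the range of a numeric property into n equal-width bins,
    each bin defining a category (labeled 1..n here).  Digitizing
    against the interior edges puts the maximum value in the top bin."""
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), n_categories + 1)
    return np.digitize(values, edges[1:-1]) + 1
```

For example, splitting values spanning 0 to 6 into two categories assigns everything below 3 to category 1 and the rest to category 2.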
- Random training set box
-
Specify the percentage of structures to use for the training set; the remaining structures are assigned to the test set. Structures that are not placed in the validation set are partitioned into training and test sets using this percentage. During model building, different random selections of the training and test sets are made, to evaluate the statistical validity of the models.
- Number of models to keep box
-
Specify the number of QSAR models to keep, out of all the models that are generated. The selection is made from the top of the ranked list of models, i.e. the best models are kept. Only available for the Traditional method.
- Advanced Options button
-
Make additional settings, for definition of categorical properties from numerical data, for model building, and for manual selection of independent variables. Opens the AutoQSAR - Advanced Options dialog box.
Model Report section
This section displays a report on the models built.
- Model text
-
This text is the abbreviated version of the report, and is shown by default. It gives the score and Q-squared value of the best-ranked model from the set of models generated in the run.
- Expand button
-
Click this button to expand the report to show details for each model kept. Click again to hide the detailed report.
- Report table
-
This table lists the models that were kept. The Model Code column gives a unique identifier for the model. The table columns for numerical models are described below. By default only the Score and Q^2 columns are shown. For categorical models, the columns train(n) give the fraction of training set compounds in category n that are correctly identified as being in category n, and test(n) is the corresponding quantity for test set compounds in category n. For a 2-category model, the train(1) and test(1) columns are the specificity values, and train(2) and test(2) are the sensitivity values, as calculated from the confusion matrices.
Model Code—Unique label for the model, which includes details of the method.
Score—Ranking score for the model. See Equation (1) in the referenced paper for the formula used to calculate the score. [Ref 1]
S.D.—Standard deviation of the model.
R^2—R-squared value (coefficient of determination) for the training set.
RMSE—Root-mean-square error of the test set predictions.
Q^2—Q-squared value (the R-squared for the test set).
Q^2 MW (Null Hypothesis)—Q-squared value for a model in which the property is simply fit to the molecular weight.
- Report Details button
-
Show details for a particular model in a dialog box. You must select the table row for the model you want details of before you click this button. The dialog box lists the statistics shown in the table as well as the observed and predicted values of the property and the error in the prediction for each structure. The confusion matrices are reported with rows representing experimental data and columns representing predicted values. In this dialog box, you can show a scatter plot of the predicted vs observed property value, and save it as an image. You can also save the report as a plain text file.
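The per-category fractions train(n) and test(n) described for the report table follow directly from a confusion matrix laid out as above (rows experimental, columns predicted):

```python
import numpy as np

def per_category_fractions(confusion):
    """Fraction of compounds in each experimental category n that are
    predicted as category n: diagonal / row sum, with rows holding
    experimental classes and columns predicted classes.  For a
    2-category model, element 0 is the specificity and element 1 the
    sensitivity."""
    confusion = np.asarray(confusion, dtype=float)
    return np.diag(confusion) / confusion.sum(axis=1)
```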
- Visualize Model button
-
Visualize atomic contributions to the selected KPLS model. Opens the 2D Viewer Panel, where the structures are displayed in 2D in a grid view.
For each structure, each atom that contributed to a fingerprint used in building the model is marked with a colored disk that represents the value of the contribution to the property due to that atom. The disks are blue for negative values and red for positive values. The color saturation indicates the magnitude of the contribution. Atoms that did not appear in any fingerprint are not marked with a disk.
- Show More button
-
Show the full set of table columns. The button text changes to Show Less so you can hide all but the default columns.
Make Prediction section
In this section you choose structures to make predictions for and the model to apply. When the job finishes, the structures with the predicted values are incorporated into the project.
- Use structures from option menu
-
Specify the source of the structures that you want to make predictions for.
-
Project Table—Use the entries that are selected in the Project Table as the source of the structures.
-
File—Use the specified file as the source of the structures. All structures in the file are used. The supported file types are Maestro (.mae, .mae.gz, .maegz) and SD (.sdf, .sdf.gz, .sdfgz). When you select this option, a File name text box and Browse button are displayed, so you can enter the file name in the text box, or click Browse and navigate to the file.
-
(no structures)—Use files that contain descriptors but no structures.
-
- Model to test option menu
-
Choose the model to use for making the predictions.
-
All models (consensus prediction)—Use all of the models to predict the property, and take the average of the predicted properties.
-
Best model (model-code)—Use the best-ranked model to predict the property. The model code is shown in parentheses.
-
Selected models (consensus prediction)—Use the models that are selected in the report table to predict the property, and take the average of the predicted properties.
Consensus models can often provide more accurate results than those of a single model.
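For numeric properties, the consensus prediction is just the per-structure mean over the chosen models' predictions, e.g.:

```python
import numpy as np

def consensus_prediction(per_model_predictions):
    """Average the predictions of several models for each structure
    (rows = models, columns = structures)."""
    return np.asarray(per_model_predictions, dtype=float).mean(axis=0)
```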
-
- Prediction property title text
-
This noneditable text area shows the title of the property being predicted.
- AutoQSAR Prediction text box
-
Enter a label for the predicted property. This label is included in the property name reported in Maestro, which is Pred label for numeric values, label Class and label Prob for categorical values. The label must not contain white space; use an underscore instead, as underscores are replaced with spaces when the property name is used in Maestro. The label is used in other properties as well—see AutoQSAR Output Properties for a list with descriptions of the properties.
Job toolbar
Manage job submission and settings. See Job Toolbar for a description of this toolbar.
The Job Settings button opens the AutoQSAR - Job Settings Dialog Box, where you can make settings for running the job.
Status bar
The status bar displays information about the current job settings and status for the panel. The settings include the job name, the task name and task settings (if any), the number of subjobs (if any), the host name, and the job incorporation setting. The job status can include messages about job start, job completion, and incorporation.
Use the Reset button to reset the panel to its default settings and clear any data from the panel.
The status bar also contains the Help button, which opens the help topic for the panel in your browser. If the panel is used by one or more tutorials, hovering over the Help button displays a button that you can click to display a list of tutorials (or you can right-click the Help button instead). Choosing a tutorial opens the tutorial topic.