Build and Apply QSAR/QSPR Model — Build Task

Build individual QSAR models from a selection of regression and classification methods.

To display the options for this task, open the Build and Apply QSAR/QSPR Model Panel and select Build.

Build Task Features

Method option menu

Select the regression or classification model you want to build. The following models are available:

  • Partial Least-Squares Regression (PLS)—a multiple linear regression model based on variables derived from a partial least squares analysis of the input variables.

    This model is useful for large numbers of X variables with both underdetermined and overdetermined systems. PLS “factors” are constructed as linear combinations of the input X variables by correlating them with the Y variable. If poor test results are obtained, an alternative is principal components regression.

  • Multiple Linear Regression (MLR)—this model is useful for a smaller number of X variables, and (many) more Y values than X variables. For larger numbers of X variables, consider the partial least squares model instead.

  • Principal Components Regression (PCR)—a multiple linear regression model based on variables derived from a principal components analysis of the input variables.

    This model is useful when there are many X variables. It seeks to capture the bulk of the variation in the X variables in a few linear combinations of those variables, the principal components. Because the analysis is independent of Y, the fit to the training set is usually not as good as with a partial least squares model, but the test set performance can be better, because the derived variables are not biased toward the Y values of the training set.

  • Recursive Partitioning (RP)—use recursive partitioning to build a model that can be used to predict property ranges or categories.

  • Bayes Classification—build a Bayes model from binary or continuous training data to predict the probability that a molecule belongs to each category, or each property range, of the Y variable. Using more than 10 categories is unlikely to produce meaningful predictions.

  • Kernel-Based Partial Least-Squares Regression (KPLS)—useful for large numbers of X variables, with both underdetermined and overdetermined systems.

    KPLS is available with fingerprints as X variables. KPLS “factors” are constructed as linear combinations of the input X variables by correlating them with the Y variable. If poor test results are obtained, an alternative is principal components regression.

For more information on the methods, see Model-Building Methods.

Structures option menu

Choose the structure source for building the QSAR model.

  • Project Table (n selected entries)—Use the entries that are currently selected in the Project Table or Entry List. The number of entries selected is shown on the menu item. An icon is displayed to the right which you can click to open the Project Table and select entries. When this option is selected, a Load button is displayed to the right.
  • File—Use the specified file. When this option is selected, the File name text box and Browse button are displayed.
Open Project Table button

Open the Project Table panel, so you can select the entries for the structure source.

File name text box and Browse button

Enter the file name in this text box, or click Browse and navigate to the file. The name of the file you selected is displayed in the text box.

Property to predict (Y) options
Select property option menu

Select a property to use as the Y variable.

Filter to option menu

Filter the Select property option menu so that it shows only numeric properties or only categorical properties.

Value groupings option menu

For the recursive partitioning and Bayes classification models, choose whether to group the values of the Y variable into bins or to use its unique values.

Bins option and text box

Group the values of the Y variable into bins. Enter the number of bins in the text box.
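
As a rough illustration of what binning does to the Y values, the sketch below assigns each value to one of a given number of equal-width bins. The function name equal_width_bins is hypothetical, and equal-width bins are an assumption made for the example; the bins actually used by the panel can be adjusted with the Define button.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width bins (0-based index)."""
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so everything falls in the first bin.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would otherwise land one past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Illustrative Y values only; not taken from any real data set.
activities = [4.2, 5.0, 5.9, 6.1, 7.3, 8.8]
print(equal_width_bins(activities, 3))
```

Each structure is then treated as a member of its bin, so the model predicts a property range rather than a numeric value.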

Define button

Click Define to modify the number and width of the bins. Opens the Define Bins Dialog Box.

Unique values option

Choose this option to filter out duplicate values and use the unique values of the property as the Y variable.

Descriptor properties (X) section and Select Properties For Descriptors dialog box

Click the Plus icon to open the Select Properties For Descriptors Dialog Box.

Fingerprints options

For the Bayes classification and KPLS regression models, use this option to add fingerprints as X variables. Click Change to select the type of fingerprint from the list. These fingerprints are available:

  • Linear
  • Radial
  • MolPrint2D
  • Atom Pairs
  • Atom Triplets
  • Topological Torsions
  • Dendritic
Advanced Options menu

For the KPLS regression model, click this button to set filtering criteria for the fingerprints. Opens the KPLS — Advanced Fingerprints Pane.

Listed entry properties option

For the Bayes classification and KPLS regression models, use this option to add selected properties as X variables. Click on the Plus icon to open the Select Properties For Descriptors Dialog Box.

Descriptor properties list

Selected properties from the Select Properties For Descriptors Dialog Box are listed here. Click on the - (minus) icon to remove properties from this list.

Model settings options

Set options specific to the model selected from the Method option menu.

Partial Least-Squares (PLS) Regression Settings
Max. # of factors text box

Specify the maximum number of PLS factors to use in the regression model. Regression models are built for increasing numbers of PLS factors up to this number. It is rarely useful to build models with more than a few PLS factors, as such models tend to be overfit.

Stop adding factors when SD drops below N option and text box

Select this option to stop adding PLS factors when the standard deviation of the regression drops below the value specified in the text box. Using this option could result in fewer PLS factors than the number specified in the Max. # of factors box.

Autoscale X variables option

Scale the X variables by dividing the values of each property by the standard deviation of the values of that property.
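
The scaling can be sketched in a few lines of Python. This is only an illustration: the function name autoscale is hypothetical, and the use of the sample standard deviation (rather than the population value) is an assumption, since the choice is not stated here.

```python
import statistics


def autoscale(columns):
    """Divide each descriptor column by its own standard deviation."""
    scaled = {}
    for name, values in columns.items():
        sd = statistics.stdev(values)  # sample SD; an assumption for this sketch
        if sd == 0:
            raise ValueError(f"{name} is constant and cannot be scaled")
        scaled[name] = [v / sd for v in values]
    return scaled


# Illustrative descriptor values on very different scales.
descriptors = {"mol_weight": [180.0, 250.0, 410.0], "logP": [1.2, 2.5, 3.1]}
for name, values in autoscale(descriptors).items():
    print(name, [round(v, 3) for v in values])
```

After scaling, every column has unit standard deviation, so properties with large numeric ranges no longer dominate the regression.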

Eliminate X variables with low t-value option

Eliminate X variables whose t-value is less than the value given in the text box. The t-value is the ratio of the coefficient of the variable in the fitted model to the standard error of the model. Small t-values indicate that the variable is not contributing significantly to the model.

Multiple Linear Regression (MLR) settings
Use all specified X variables option

Select this option to build the model from all the properties that were selected for X variables.

Choose best subsets option and pane

Select this option to build the model from the best subsets of the properties that were selected for X variables. The subsets are selected using a simulated annealing Monte Carlo technique. Click Options to set the parameters for the subset selection. Opens the Best Subsets Options Pane.

Suppress Y intercept option

Select this option to force the regression line to pass through the origin.
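
For a single X variable, the effect of suppressing the intercept can be illustrated as follows. This is a minimal sketch of ordinary least squares with and without an intercept term; the function name fit_line and its suppress_intercept argument are hypothetical and do not correspond to anything in the panel.

```python
def fit_line(x, y, suppress_intercept=False):
    """Least-squares fit of y = slope * x + intercept for one X variable."""
    if suppress_intercept:
        # With the intercept fixed at zero, the slope is sum(xy) / sum(x^2).
        slope = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
        return slope, 0.0
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


x = [1.0, 2.0, 3.0]
y = [3.0, 5.0, 7.0]  # exactly y = 2x + 1
print(fit_line(x, y))                           # free intercept
print(fit_line(x, y, suppress_intercept=True))  # line forced through the origin
```

When the data have a genuine offset, as in this example, forcing the line through the origin changes the slope to compensate, so the option is best reserved for cases where a zero intercept is physically justified.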

Principal Components Regression (PCR) settings
Max. # of principal components text box

Set the maximum number of principal components to generate for the regression variables.

Autoscale X variables option

Scale the values of each property by dividing them by the standard deviation of that property.

Recursive Partitioning (RP) settings
Single tree model option

Build only a single tree using recursive partitioning.

Ensemble model option

Build a number of trees and use the ensemble of all trees for prediction. This approach filters out noise and corrects the biases in a single tree.
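
One common way to combine an ensemble of trees is a majority vote, sketched below. How the trees are actually combined is not stated here, so majority voting is an assumption, and the three one-split "trees", the descriptor names, and the function ensemble_predict are all hypothetical stand-ins for fitted models.

```python
from collections import Counter


def ensemble_predict(trees, descriptors):
    """Return the category predicted by the largest number of trees."""
    votes = Counter(tree(descriptors) for tree in trees)
    return votes.most_common(1)[0][0]


# Hypothetical one-split "trees"; real trees would be fitted to training data.
trees = [
    lambda d: "active" if d["logP"] < 3.0 else "inactive",
    lambda d: "active" if d["mw"] < 400 else "inactive",
    lambda d: "active" if d["hbd"] <= 2 else "inactive",
]

molecule = {"logP": 3.5, "mw": 350, "hbd": 1}
print(ensemble_predict(trees, molecule))
```

Here the first tree disagrees with the other two, and the vote overrides it, which is the sense in which an ensemble can correct the bias of a single tree.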

Advanced Options button

Set options for model building. Opens the Recursive Partitioning — Advanced Options Pane.

Bayes classification settings
Advanced Options button

Click this button to set Kullback-Leibler acceptance criteria for fingerprint data, and the smoothing coefficient. Opens the Bayes Classification — Advanced Options Pane.

Kernel-Based Partial Least-Squares (KPLS) Regression settings
Max. # of factors text box

Specify the maximum number of KPLS factors to use in the regression model. Regression models are built for increasing numbers of KPLS factors up to this number.

Stop adding factors when SD drops too low option and text box

Select this option to stop adding KPLS factors when the standard deviation of the regression drops below the value specified in the text box. Using this option could result in fewer KPLS factors than the number specified in Max. # of factors.

Kernel nonlinearity text box and slider

Change the kernel nonlinearity value. The nonlinearity value is 1/sigma, so small values are almost linear, and large values are very nonlinear. Higher nonlinearity typically leads to tighter fitting, but it also tends to give poorer predictions on new compounds.
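
To see why the nonlinearity value behaves this way, consider a Gaussian kernel with width sigma. The functional form of the kernel is not given here, so the Gaussian form is an assumption made purely for illustration, and the function name gaussian_kernel is hypothetical.

```python
import math


def gaussian_kernel(x, y, nonlinearity):
    """Gaussian kernel with sigma = 1 / nonlinearity (assumed kernel form)."""
    sigma = 1.0 / nonlinearity
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


a = [0.0, 0.0]
b = [1.0, 1.0]
for value in (0.1, 1.0, 5.0):
    print(value, gaussian_kernel(a, b, value))
```

With a small nonlinearity value the kernel changes slowly with distance, so all structures influence each other and the model is smooth. With a large value the kernel falls off sharply, so each prediction is dominated by the nearest training structures, which fits the training set tightly but generalizes less well.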

Reset button

Reset the Kernel nonlinearity to its default value.

Training set assignment option menu and N % of structures text box

Choose the method of assigning structures to the training set from the Training set assignment option menu, and enter the percentage of structures to assign to the training set in the text box.

  • Random %—Select this option to assign the training set randomly, and specify the percentage of structures to include in the training set. The rest are assigned to the test set.

  • Previous Set—Select this option to use the same training set assignment as the starting model. Only available if the Start with existing model option is used.

  • Custom—Select this option to manually assign structures to either training set or test set. Use the Review sets pane to perform the assignment.
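
The Random % assignment described above amounts to a shuffled split of the structures. The sketch below shows the idea; the function name random_split and its seed argument are hypothetical, and rounding the training set size to the nearest whole structure is an assumption.

```python
import random


def random_split(entries, train_percent, seed=None):
    """Randomly assign train_percent of the entries to the training set."""
    rng = random.Random(seed)
    shuffled = list(entries)
    rng.shuffle(shuffled)
    # Rounding to the nearest whole structure is an assumption of this sketch.
    n_train = round(len(shuffled) * train_percent / 100)
    return shuffled[:n_train], shuffled[n_train:]


entries = [f"mol_{i}" for i in range(20)]
training, test = random_split(entries, 75, seed=1)
print(len(training), "training structures,", len(test), "test structures")
```

Every structure ends up in exactly one of the two sets, and fixing the seed makes the assignment reproducible.
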
Reset button

Reset the training set assignment to its default values.

Advanced training options menu

For the KPLS regression model, choose whether to calculate an uncertainty interval for the test set predictions by bootstrapping.

Calculate uncertainty on test set predictions option

Calculate the uncertainty by bootstrapping: the training set is sampled randomly with replacement to generate a new training set of the same size (which therefore contains duplicates), a model is built from it, and predictions are made for the test set. This procedure is repeated a specified number of times, and the standard deviation of the resulting predictions for each test set structure is taken as the uncertainty.

Bootstrapping cycles text box

Specify the number of times a random sample is made and a prediction obtained in the uncertainty calculations. This number determines how many values are used in calculating the standard deviation, and should be at least 5.
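
The bootstrapping procedure can be sketched as follows. This is not the KPLS implementation: a one-variable least-squares line through the origin stands in for the real model, and the names fit_slope and bootstrap_uncertainty are hypothetical. The sketch resamples the training set with replacement in each cycle, refits the stand-in model, predicts the test set, and reports the standard deviation of the predictions for each test structure.

```python
import random
import statistics


def fit_slope(pairs):
    """Stand-in model: least-squares line through the origin, y = slope * x."""
    return sum(x * y for x, y in pairs) / sum(x * x for x, _ in pairs)


def bootstrap_uncertainty(train, test_x, cycles=10, seed=0):
    """Standard deviation of the test predictions over the bootstrap cycles."""
    if cycles < 2:
        raise ValueError("at least 2 cycles are needed for a standard deviation")
    rng = random.Random(seed)
    predictions = [[] for _ in test_x]
    for _ in range(cycles):
        # Resample the training set with replacement, keeping its size.
        sample = [rng.choice(train) for _ in train]
        slope = fit_slope(sample)
        for i, x in enumerate(test_x):
            predictions[i].append(slope * x)
    return [statistics.stdev(p) for p in predictions]


# Illustrative (x, y) training pairs and test x values.
train = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
print(bootstrap_uncertainty(train, [1.5, 2.5], cycles=20))
```

Because each cycle contributes only one prediction per test structure, a small number of cycles gives a noisy estimate of the standard deviation, which is why a minimum number of cycles is recommended.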

Review sets pane

Review and modify the assignment of structures to either training set or test set. Opens the Review sets pane.

The structures are listed in a table. The Set column contains Training if the structure is assigned to the training set and Test if it is assigned to the test set. If you chose a custom training set assignment, you can select one or more rows in the table, right-click, and choose Move to alternate set to reassign the selected structures to the other set.