Model-Building Methods
The sections below define terms and methods used in building QSAR/QSPR models. They briefly introduce some regression methods available for model-building: partial least squares, kernel-based partial least squares, principal component regression, and multiple linear regression.
Independent and Dependent Variables
Dependent variable
The dependent variable (or response variable) is the variable that is being fitted to in a regression model. It is referred to as dependent as it is assumed that its values are dependent on the values of independent variables that will be used to generate the predictive model. This variable is also referred to as the dependent descriptor or the activity property (in a QSAR model).
Independent variables
The independent variables are the variables that are being used to fit a regression to a dependent variable in partial least squares, principal component analysis, or multiple linear regression. They are referred to as independent as their values are assumed not to depend on the values of the dependent variable. The term independent descriptors is also used.
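As an illustrative sketch, the independent descriptors are typically arranged as a matrix with one row per molecule, and the dependent variable as a vector with one observed activity per molecule. The descriptor names and numbers below are invented for illustration:

```python
import numpy as np

# Hypothetical data: 4 molecules, 3 independent descriptors each
# (e.g., molecular weight, logP, polar surface area -- illustrative values only)
X = np.array([
    [180.2, 1.5, 40.5],
    [250.7, 2.1, 63.3],
    [310.4, 3.0, 78.9],
    [150.1, 0.8, 25.2],
])

# Dependent variable: one measured activity value per molecule
y = np.array([5.2, 6.1, 7.0, 4.8])

# A regression model is fitted so that X predicts y; each row of X
# must correspond to one entry of y.
assert X.shape[0] == y.shape[0]
```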
Partial Least Squares
The partial least squares (PLS) method generates linear equations that describe the relationship between a dependent descriptor and a number of factors derived from a set of independent descriptors. The PLS procedure works by extracting successive linear combinations of the predictors (also called factors, components, or latent vectors) that explain both the independent and the dependent variation. That is, PLS balances two objectives, seeking factors that explain both response variation and predictor variation.
Partial least squares is particularly valuable because it can be applied in cases where the number of independent descriptors is greater than the number of molecules.
Partial least squares is similar to principal component analysis, but the goals of the two methods in extracting factors differ. In PLS one is concerned with the variance in both the dependent and independent descriptors, while in PCA one is trying to explain the maximum variance possible in only the independent descriptors, without reference to the dependent descriptor.
Kernel-Based Partial Least Squares
Kernel-based partial least-squares (KPLS) regression is an extension of the partial least-squares method that introduces some nonlinearity into the scalar products of independent variables used in the regression via a “kernel”, which is some nonlinear function of these scalar products [7]. In Canvas, the kernel is a Gaussian function,
Kij = exp(−dij² / (2σ²))        (1)
where dij is the Euclidean distance between independent variables i and j, and 1/σ is the nonlinearity parameter. This kernel replaces the simple scalar products of the independent variables in the regression. In Canvas, no automatic tuning of σ is done. Small values of 1/σ give nearly linear behavior, and large values give strongly nonlinear behavior. Higher nonlinearity typically leads to tighter fitting, but it also tends to give poorer predictions.
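The kernel matrix computation can be sketched in plain NumPy, assuming the common Gaussian form exp(−d²/(2σ²)); the function name is illustrative. The example shows how σ controls the linear-to-nonlinear transition:

```python
import numpy as np

def gaussian_kernel(X, sigma):
    """Kernel matrix K[i, j] = exp(-d_ij**2 / (2 * sigma**2)), where d_ij is
    the Euclidean distance between rows i and j of X. Illustrative sketch."""
    # Squared pairwise Euclidean distances via broadcasting
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Two points at Euclidean distance 5
X = np.array([[0.0, 0.0], [3.0, 4.0]])

# Large sigma (small 1/sigma): off-diagonal entries approach 1 -> nearly linear
K_lin = gaussian_kernel(X, sigma=100.0)

# Small sigma (large 1/sigma): off-diagonal entries approach 0 -> highly nonlinear
K_nonlin = gaussian_kernel(X, sigma=0.5)
```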
Principal Component Regression
Principal components analysis (PCA) transforms a pool of independent variables into a set of orthogonal latent factors that explain the variance in the corresponding data space. The first factor corresponds to the axis of maximum variance in the data space, while each succeeding factor accounts for as much of the remaining variance as possible.
Principal components regression (PCR) utilizes a small number of PCA factors as independent variables in a least-squares fit of a single dependent variable. PCR is appropriate in cases where the original pool of independent variables exceeds the number of observations or molecules.
PCR is similar to partial least-squares (PLS) regression, but PCR does not utilize the dependent variable in the construction of the latent factors. As such, PCR factors do not correlate as strongly with the dependent variable as PLS factors, and a larger number of PCR factors is needed to achieve a given level of fit to the dependent variable.
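A minimal PCR sketch with scikit-learn on synthetic data: PCA extracts a small number of orthogonal factors (without reference to the dependent variable), and ordinary least squares then fits the response to those factors:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)

# 15 "molecules", 40 descriptors: more descriptors than observations
X = rng.normal(size=(15, 40))
y = X @ rng.normal(size=40)  # synthetic linear response

# Step 1: PCA reduces the descriptor pool to 5 orthogonal latent factors.
# Step 2: least-squares regression of y on those factors.
pcr = make_pipeline(PCA(n_components=5), LinearRegression())
pcr.fit(X, y)

r2 = pcr.score(X, y)  # training-set coefficient of determination
```

Because the PCA step ignores y, the 5 factors capture predictor variance only; a PLS model with the same number of factors would typically fit y more closely.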
Multiple Linear Regression
Multiple linear regression (MLR) generates linear equations that describe the relationship between a set of independent descriptors and a dependent descriptor. As used in Canvas, MLR fits a linear model to the dependent descriptor using the following relationship:
Pj = Σi ciχij + c0        (2)
In the above equation, Pj is the property or activity that is to be predicted for each molecule j, the ci values are the regression coefficients, χij is the ith independent property for molecule j, and c0 is a constant. Values of the coefficients and c0 are fitted to give Pj values that reproduce the dependent value for the jth molecule.
In general, when fitting data using MLR it is advisable to use a data set with at least five times as many molecules as there are independent descriptors.
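The fit in Eq. (2) can be sketched as an ordinary least-squares solve. The coefficient values below are invented for illustration, and the data set respects the five-to-one guideline (15 molecules, 3 descriptors):

```python
import numpy as np

rng = np.random.default_rng(2)

# 15 molecules, 3 independent descriptors (chi[j, i] = i-th descriptor of molecule j)
chi = rng.normal(size=(15, 3))

# Synthetic, noise-free dependent values with known coefficients
# P_j = 0.8*chi_1j - 0.4*chi_2j + 0.2*chi_3j + 1.5
true_c = np.array([0.8, -0.4, 0.2])
P = chi @ true_c + 1.5

# Augment the descriptor matrix with a column of ones so that the
# constant c0 is fitted along with the coefficients c_i
A = np.hstack([chi, np.ones((15, 1))])
coeffs, *_ = np.linalg.lstsq(A, P, rcond=None)

c = coeffs[:3]   # fitted regression coefficients c_i
c0 = coeffs[3]   # fitted constant term
```

With noise-free data the least-squares solve recovers the generating coefficients; with real activity data the fit is approximate, which is why the five-to-one ratio of molecules to descriptors helps guard against overfitting.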