Model-Building Methods
The sections below define terms and methods used in building QSAR/QSPR models. They briefly introduce some regression methods available for model-building: partial least squares, kernel-based partial least squares, principal component regression, and multiple linear regression.
Independent and Dependent Variables
Dependent variable
The dependent variable (or response variable) is the variable that is being fitted to in a regression model. It is referred to as dependent as it is assumed that its values are dependent on the values of independent variables that will be used to generate the predictive model. This variable is also referred to as the dependent descriptor or the activity property (in a QSAR model).
Independent variables
The independent variables are the variables that are being used to fit a regression to a dependent variable in partial least squares, principal component analysis, or multiple linear regression. They are referred to as independent as their values are assumed not to depend on the values of the dependent variable. The term independent descriptors is also used.
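As an illustrative sketch, the independent descriptors are typically arranged as a matrix with one row per molecule, and the dependent variable as a vector with one observed activity per molecule. The descriptor names and numbers below are invented for illustration:

```python
import numpy as np

# Hypothetical data: 4 molecules, 3 independent descriptors each
# (e.g., molecular weight, logP, polar surface area -- illustrative values only)
X = np.array([
    [180.2, 1.5, 40.5],
    [250.7, 2.1, 63.3],
    [310.4, 3.0, 78.9],
    [150.1, 0.8, 25.2],
])

# Dependent variable: one measured activity value per molecule
y = np.array([5.2, 6.1, 7.0, 4.8])

# A regression model is fitted so that X predicts y; each row of X
# must correspond to one entry of y.
assert X.shape[0] == y.shape[0]
```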
Partial Least Squares
The partial least squares (PLS) method generates linear equations that describe the relationship between a dependent descriptor and a number of factors derived from a set of independent descriptors. The PLS procedure works by extracting successive linear combinations of the predictors (also called factors, components, or latent vectors) that explain both the independent and the dependent variation. That is, PLS balances two objectives, seeking factors that explain both response variation and predictor variation.
Partial least squares is particularly valuable because it can be applied in cases where the number of independent descriptors is greater than the number of molecules.
Partial least squares is similar to principal component analysis, but the goals of the two methods in extracting factors differ. In PLS one is concerned with the variance in both the dependent and independent descriptors, while in PCA one is trying to explain the maximum variance possible in only the independent descriptors, without reference to the dependent descriptor.
Kernel-Based Partial Least Squares
Kernel-based partial least-squares (KPLS) regression is an extension of the partial least-squares method that introduces some nonlinearity into the scalar products of independent variables used in the regression via a “kernel”, which is some nonlinear function of these scalar products [7]. In Canvas, the kernel is a Gaussian function,
Kij = exp(−dij² / (2σ²))        (1)
where dij is the Euclidean distance between independent variables i and j, and 1/σ is the nonlinearity parameter. This kernel replaces the simple scalar products of the independent variables in the regression. In Canvas, no automatic tuning of σ is done. Small values of 1/σ give nearly linear behavior, and large values give strongly nonlinear behavior. Higher nonlinearity typically leads to tighter fitting, but it also tends to give poorer predictions.
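The kernel matrix computation can be sketched in plain NumPy, assuming the common Gaussian form exp(−d²/(2σ²)); the function name is illustrative. The example shows how σ controls the linear-to-nonlinear transition:

```python
import numpy as np

def gaussian_kernel(X, sigma):
    """Kernel matrix K[i, j] = exp(-d_ij**2 / (2 * sigma**2)), where d_ij is
    the Euclidean distance between rows i and j of X. Illustrative sketch."""
    # Squared pairwise Euclidean distances via broadcasting
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Two points at Euclidean distance 5
X = np.array([[0.0, 0.0], [3.0, 4.0]])

# Large sigma (small 1/sigma): off-diagonal entries approach 1 -> nearly linear
K_lin = gaussian_kernel(X, sigma=100.0)

# Small sigma (large 1/sigma): off-diagonal entries approach 0 -> highly nonlinear
K_nonlin = gaussian_kernel(X, sigma=0.5)
```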
Principal Component Regression
Principal components analysis (PCA) transforms a pool of independent variables into a set of orthogonal latent factors that explain the variance in the corresponding data space. The first factor corresponds to the axis of maximum variance in the data space, while each succeeding factor accounts for as much of the remaining variance as possible.
Principal components regression (PCR) utilizes a small number of PCA factors as independent variables in a least-squares fit of a single dependent variable. PCR is appropriate in cases where the original pool of independent variables exceeds the number of observations or molecules.
PCR is similar to partial least-squares (PLS) regression, but PCR does not utilize the dependent variable in the construction of the latent factors. As such, PCR factors do not correlate as strongly with the dependent variable as PLS factors, and a larger number of PCR factors is needed to achieve a given level of fit to the dependent variable.
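A minimal PCR sketch with scikit-learn on synthetic data: PCA extracts a small number of orthogonal factors (without reference to the dependent variable), and ordinary least squares then fits the response to those factors:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)

# 15 "molecules", 40 descriptors: more descriptors than observations
X = rng.normal(size=(15, 40))
y = X @ rng.normal(size=40)  # synthetic linear response

# Step 1: PCA reduces the descriptor pool to 5 orthogonal latent factors.
# Step 2: least-squares regression of y on those factors.
pcr = make_pipeline(PCA(n_components=5), LinearRegression())
pcr.fit(X, y)

r2 = pcr.score(X, y)  # training-set coefficient of determination
```

Because the PCA step ignores y, the 5 factors capture predictor variance only; a PLS model with the same number of factors would typically fit y more closely.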
Multiple Linear Regression
Multiple linear regression (MLR) generates linear equations that describe the relationship between a set of independent descriptors and a dependent descriptor. As used in Canvas, MLR fits a linear model to the dependent descriptor using the following relationship:
Pj = Σi ciχij + c0        (2)
In the above equation, Pj is the property or activity that is to be predicted for each molecule j, the ci values are the regression coefficients, χij is the ith independent property for molecule j, and c0 is a constant. Values of the coefficients and c0 are fitted to give Pj values that reproduce the dependent value for the jth molecule.
In general, when fitting data using MLR it is advisable to use a data set with at least five times as many molecules as there are independent descriptors.
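The fit in Eq. (2) can be sketched as an ordinary least-squares solve. The coefficient values below are invented for illustration, and the data set respects the five-to-one guideline (15 molecules, 3 descriptors):

```python
import numpy as np

rng = np.random.default_rng(2)

# 15 molecules, 3 independent descriptors (chi[j, i] = i-th descriptor of molecule j)
chi = rng.normal(size=(15, 3))

# Synthetic, noise-free dependent values with known coefficients
# P_j = 0.8*chi_1j - 0.4*chi_2j + 0.2*chi_3j + 1.5
true_c = np.array([0.8, -0.4, 0.2])
P = chi @ true_c + 1.5

# Augment the descriptor matrix with a column of ones so that the
# constant c0 is fitted along with the coefficients c_i
A = np.hstack([chi, np.ones((15, 1))])
coeffs, *_ = np.linalg.lstsq(A, P, rcond=None)

c = coeffs[:3]   # fitted regression coefficients c_i
c0 = coeffs[3]   # fitted constant term
```

With noise-free data the least-squares solve recovers the generating coefficients; with real activity data the fit is approximate, which is why the five-to-one ratio of molecules to descriptors helps guard against overfitting.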