Bivariate Statistics: Covariance and Correlation

The statistics in this section describe the relationship between two variables in terms of covariance and correlation. These statistics are also applied to pairs of variables in models with multiple independent variables.

Correlation coefficient

See Pearson r or R-squared.

Correlation matrix

The correlation matrix reports the Pearson r value for every pair of selected variables. Because the matrix is symmetric, only the half matrix is generated.

Covariance

(1)

The covariance measures the extent to which two variables vary together. A positive covariance indicates that larger-than-average values of one variable tend to be paired with larger-than-average values of the second variable. A negative covariance indicates that larger-than-average values of one variable tend to be paired with smaller-than-average values of the second variable. A zero covariance indicates that the two variables have no linear relationship; it does not by itself establish that they are independent. The covariance depends on the scale of the variables involved and is most useful when the variables are measured on the same scale.

For a scatter plot of x and y, the sign of the covariance gives the direction of the best-fit line through the scatter. A negative covariance indicates a line that slopes downward to the right, a positive covariance indicates a line that slopes upward to the right, and a zero covariance indicates that the best-fit line is horizontal.
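The scale dependence described above can be seen in a short Python sketch. The function name `covariance` and the `sample` flag are illustrative assumptions, as is the choice between dividing by n - 1 and dividing by n; the text does not say which normalization Eq. (1) uses.

```python
def covariance(x, y, sample=True):
    """Covariance of paired values x and y.

    sample=True divides by n - 1, sample=False divides by n.
    Which of the two the documented equation uses is an assumption.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations from the two means.
    total = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return total / (n - 1 if sample else n)


# Hypothetical data: the same heights expressed in cm and in m.
heights_cm = [150.0, 160.0, 170.0, 180.0]
weights_kg = [50.0, 58.0, 66.0, 74.0]

cov_cm = covariance(heights_cm, weights_kg)
cov_m = covariance([h / 100 for h in heights_cm], weights_kg)
print(cov_cm, cov_m)
```

Rescaling one variable by a factor of 100 rescales the covariance by the same factor, even though the relationship between the variables is unchanged.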

Pearson r

(2)

The Pearson r is a correlation coefficient that measures the extent to which two variables are proportional to one another. In other words, the Pearson r provides a measure of linear association between variables. Calculated Pearson r values lie on a scale from -1.0 to +1.0, with negative values indicating that the least-squares line between variables x and y slopes downward to the right and positive values indicating that it slopes upward to the right. A value of zero indicates no linear correlation between the two variables. The Pearson r is independent of the scale of the variables (unlike the covariance). Note that R is sometimes used instead of r.
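As an illustration, the sketch below computes the Pearson r in its standard form: the sum of products of deviations divided by the square root of the product of the two sums of squared deviations. The function name `pearson_r` and the sample data are assumptions made for this example.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of paired values x and y."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)


x = [1.0, 2.0, 3.0, 4.0, 5.0]
r_up = pearson_r(x, [2.0 * v + 1.0 for v in x])      # exactly linear, rising
r_down = pearson_r(x, [-3.0 * v for v in x])         # exactly linear, falling
r_scaled = pearson_r([1000.0 * v for v in x], [2.0 * v + 1.0 for v in x])
r_curve = pearson_r(x, [(v - 3.0) ** 2 for v in x])  # symmetric parabola
print(r_up, r_down, r_scaled, r_curve)
```

The last case gives an r of zero even though y is completely determined by x, which is why a zero value is best read as "no linear correlation". Squaring any of these values gives the corresponding r-squared.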

R-squared

In Canvas, r-squared is the square of the Pearson r correlation coefficient. Its value ranges from 0.0 to 1.0, with a value of zero indicating that the two variables have no linear correlation and a value of one indicating that the variables are perfectly correlated, either positively or negatively. Like the Pearson r, r-squared is independent of the scale of the two variables.

Spearman rho

The Spearman rho is a rank-order correlation coefficient. It measures the strength of association between two variables using the ranking of the data rather than the data values themselves. The Spearman rho is interpreted in the same way as the Pearson r statistic.

Ties in ranking (data points with the same value) are given the mean rank of the tied observations. For example, if four points have equal values and occupy ranks 5, 6, 7, and 8 in the sample, the rank assigned to all four is 6.5. In the definitions below, rank(x_i) is the rank of the point x_i, ties(x_i) is the number of times the value x_i occurs, and in ε(x) the sum is over the number of tied values.
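The mean-rank treatment of ties can be sketched in Python as follows. The helper names `mean_ranks` and `spearman_rho` are illustrative, and the sketch assumes that rho is computed as the Pearson r of the mean ranks; it does not reproduce the tie-correction terms ties(x_i) and ε(x) mentioned in the text.

```python
import math


def mean_ranks(values):
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the run while the next sorted value equals the current one.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of ranks start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Pearson r applied to the mean ranks of x and y (assumed definition)."""
    rx, ry = mean_ranks(x), mean_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)


# Four equal values occupying ranks 5 to 8 all receive rank 6.5.
tied = mean_ranks([1, 2, 3, 4, 7, 7, 7, 7, 9])
# A monotonic but nonlinear relationship still has perfectly matching ranks.
rho = spearman_rho([1, 2, 3, 4, 5], [1, 4, 9, 16, 25])
print(tied, rho)
```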

Kendall tau

The Kendall tau is a rank-order correlation coefficient. It measures the strength of association between two variables using the ranking of the data rather than the data values themselves. The Kendall tau is interpreted in the same way as the Pearson r statistic. It is defined in Eq. (3), where:

P is the number of concordant pairs of ranks

Q is the number of discordant pairs of ranks

Y₀ is the number of ties in the ranks of two x’s

X₀ is the number of ties in the ranks of two y’s

(3)

To calculate the Kendall tau, the half matrix of data pairs is analyzed; that is, (x_i, y_i) and (x_j, y_j) are compared for all pairs with i < j. Each pair that shows the same rank order in both data sets is counted as concordant. Each pair that shows the opposite rank order in the two data sets is counted as discordant. The rank order can be determined by the following expression:

(4)

Ties are counted in the Y₀ and X₀ variables.
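The pair-counting procedure can be sketched in Python as shown below. This is a minimal illustration under stated assumptions: the rank order of a pair is taken from the sign of (x_i - x_j)(y_i - y_j), and the counts are combined in the common tie-corrected form (P - Q) / sqrt((P + Q + Y₀)(P + Q + X₀)). Pairs tied in both variables are assumed to be ignored. The function name `kendall_tau` is hypothetical.

```python
import math


def kendall_tau(x, y):
    """Tie-corrected Kendall tau from pair counts (assumed form)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    p = q = 0    # concordant and discordant pairs
    y0 = x0 = 0  # pairs tied only in x, pairs tied only in y
    for i in range(n - 1):
        for j in range(i + 1, n):  # half matrix: each pair visited once
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            product = dx * dy
            if product > 0:
                p += 1
            elif product < 0:
                q += 1
            elif dx == 0 and dy != 0:
                y0 += 1  # tie between two x's
            elif dy == 0 and dx != 0:
                x0 += 1  # tie between two y's
            # pairs tied in both x and y are not counted (assumption)
    denominator = math.sqrt((p + q + y0) * (p + q + x0))
    if denominator == 0:
        raise ValueError("tau is undefined when every pair is tied")
    return (p - q) / denominator


tau_no_ties = kendall_tau([1, 2, 3, 4], [1, 3, 2, 4])  # 5 concordant, 1 discordant
tau_ties = kendall_tau([1, 2, 2, 3], [1, 2, 3, 3])     # 4 concordant, 1 tie each
print(tau_no_ties, tau_ties)
```

Because the denominator is symmetric in Y₀ and X₀, the result does not depend on which label is attached to ties in x and which to ties in y.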

Predictive index

This index is a predictor of rank ordering [16], with values from +1 to −1. A value of +1 indicates perfect prediction of rank, a value of −1 indicates predictions that are completely wrong, and a value of 0 indicates random predictions.

(5)

where E_i are the experimental values and P_i are the predictions.
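One way such an index can be computed is sketched below in Python. The form used here is an assumption rather than a transcription of Eq. (5): each pair (i, j) is scored +1 if the prediction orders it the same way as the experiment, −1 if it orders it the opposite way, and 0 if the two predictions are equal, and each score is weighted by |E_j − E_i| so that pairs with larger experimental differences count more. The function name `predictive_index` is hypothetical.

```python
def predictive_index(experimental, predicted):
    """Weighted rank-ordering index in [-1, +1] (assumed form, see lead-in)."""
    n = len(experimental)
    if n != len(predicted) or n < 2:
        raise ValueError("inputs must have the same length, at least 2")
    weighted_sum = 0.0
    weight_total = 0.0
    for i in range(n - 1):
        for j in range(i + 1, n):
            de = experimental[j] - experimental[i]
            dp = predicted[j] - predicted[i]
            weight = abs(de)  # assumed weighting by experimental difference
            if de * dp > 0:
                score = 1     # pair ranked in the correct order
            elif de * dp < 0:
                score = -1    # pair ranked in the wrong order
            else:
                score = 0     # tied predictions (or zero-weight pair)
            weighted_sum += weight * score
            weight_total += weight
    if weight_total == 0:
        raise ValueError("index is undefined when all experimental values are equal")
    return weighted_sum / weight_total


E = [1.0, 2.0, 3.0, 4.0]
pi_perfect = predictive_index(E, [10.0, 20.0, 30.0, 40.0])
pi_reversed = predictive_index(E, [40.0, 30.0, 20.0, 10.0])
pi_flat = predictive_index(E, [5.0, 5.0, 5.0, 5.0])
print(pi_perfect, pi_reversed, pi_flat)
```

With these inputs the three cases reproduce the limiting values described above: correct ordering, fully reversed ordering, and predictions that carry no ranking information.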