Illustrations of bivariate correlation

Richard Kennaway

(Last revised 2 January 2006.)

The examples shown here are for bivariate normal distributions in which both variables are scaled to have zero means and identical standard deviations. If the product-moment correlation is c, the probability density function is A exp(−(x² + y² − 2cxy)/2) for some constant A.

The general implication of the results presented below is that if one wants to make reliable predictions of the value of one variable from another in individual cases, the correlation coefficient serves primarily as an indication that no such predictions can be made. As the final table of this page shows, to be able to predict with 95% confidence which decile one variable lies in, given the exact value of the other, a correlation of at least 0.997 is required. Even to predict only the sign of one variable from the other with 95% confidence requires a correlation above 0.995. Data showing so high a correlation are so strongly related that no-one would bother to measure the correlation in the first place.

One may have no concern with whether a prediction is correct in any particular case, but only with the overall success rate. In this case, a correlation of at least 0.5 is required to correctly predict just the sign of one variable from the other 2/3 of the time. Note that a 50% success rate is already guaranteed by chance.

The computations below have been made only for the bivariate normal distribution, but I believe the results are unlikely to be substantially different for other bivariate distributions.

Scatter plots

The figure shows scatter plots of 100 points for bivariate normal distributions of various correlations.
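
To reproduce plots of this kind, here is a minimal Python sketch using numpy and matplotlib (my addition; the correlation values are illustrative, chosen from the tables below):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)

    def bivariate_sample(c, n=100):
        # y = c*x + sqrt(1 - c^2) * noise gives unit variances and correlation c
        x = rng.standard_normal(n)
        y = c * x + np.sqrt(1.0 - c * c) * rng.standard_normal(n)
        return x, y

    correlations = [0, 0.3, 0.5, 0.8, 0.9, 0.95, 0.99, 0.99995]
    fig, axes = plt.subplots(2, 4, figsize=(12, 6), sharex=True, sharey=True)
    for ax, c in zip(axes.flat, correlations):
        x, y = bivariate_sample(c)
        ax.scatter(x, y, s=8)
        ax.set_title(f"c = {c}")
    fig.tight_layout()
    plt.show()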

Contours

This figure shows contours of the density function of bivariate normal distributions for the same correlations as in the preceding figure. All of the contours of the bivariate normal distribution are concentric ellipses of the same shape, so only one need be plotted for each value of c.

The straight line is the regression line showing the most likely value of y given x. It is the straight line joining the two points on the ellipse where the tangent is vertical.
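
As a quick check of the vertical-tangency claim (my addition): along a contour, the tangent is vertical where the partial derivative of the quadratic form with respect to y vanishes,

\[
\frac{\partial}{\partial y}\left(x^2 + y^2 - 2cxy\right) \;=\; 2y - 2cx \;=\; 0
\quad\Longrightarrow\quad y = cx,
\]

so the points of vertical tangency lie exactly on the regression line y = cx. Equivalently, the conditional mean of y given x is cx; with the variables standardised to unit variance, y given x is normal with mean cx and variance 1 − c², a fact the later sections rely on.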

Explained variance and mutual information

When the product-moment correlation is c, the square of the correlation, c², is sometimes described as the proportion of the variance of y that is explained by, or can be attributed to, the variation of x. The remainder, 1 − c², is the proportion not so explained. (However, correlation is not causality: what this usage colloquially calls an explanation is not necessarily an explanation in the ordinary sense of the word.)

The improvement ratio is the ratio of the standard deviation of y when no information is given about x to the standard deviation of y given the exact value of x; it equals 1/√(1 − c²).

The mutual information of x and y is the base-2 logarithm of the improvement ratio. This is the amount of information, in bits, which one obtains about y from knowing the exact value of x.

All of these quantities are tabulated below.

Correlation   Variance in y     Variance in y     Improvement   Mutual
              attributed to x   unaccounted for   ratio         information (bits)
0                  0%               100%              1             0
0.2                4%                96%              1.02          0.028
0.3                9%                91%              1.05          0.068
0.4               16%                84%              1.09          0.13
0.5               25%                75%              1.15          0.20
0.8               64%                36%              1.67          0.74
0.866             75%                25%              2             1
0.9               81%                19%              2.29          1.20
0.95              90.25%             9.75%            3.20          1.68
0.99              98%                 2%              7.09          2.83
0.995             99%                 1%             10             3.32
0.99995           99.99%              0.01%         100             6.64
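
These columns can be recomputed directly from c; here is a minimal Python sketch (my addition, standard library only):

    import math

    def explained_variance(c):
        # fraction of the variance of y attributed to x
        return c * c

    def improvement_ratio(c):
        # sd(y with x unknown) / sd(y given x) = 1 / sqrt(1 - c^2)
        return 1.0 / math.sqrt(1.0 - c * c)

    def mutual_information_bits(c):
        # base-2 logarithm of the improvement ratio
        return math.log2(improvement_ratio(c))

    for c in [0, 0.2, 0.3, 0.4, 0.5, 0.8, 0.866, 0.9, 0.95, 0.99, 0.995, 0.99995]:
        print(f"c = {c:<8} attributed = {100 * explained_variance(c):6.2f}%  "
              f"ratio = {improvement_ratio(c):7.2f}  "
              f"MI = {mutual_information_bits(c):5.3f} bits")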

Binary classification and screening

To predict the sign of y, given x, the best that one can do is guess that y has the same sign as x. The first quantity tabulated below is the proportion of cases in which this guess is correct. A closed formula for this can be given: cos⁻¹(−c)/π.

When x is close to zero, the guess will be little better than chance, but when x is large, the guess will be more reliable. Given the correlation coefficient, we can ask, how large must x be for the prediction of the sign of y to be correct most of the time? The last four columns of the table tabulate this for values of "most" equal to 95% and 99%. For each confidence value, the minimum absolute value of x in standard deviations is given, together with the probability that the absolute value of x is at least that large.

The extreme values in the first few rows of the table are primarily of theoretical interest: no real distribution can even be observed, let alone measured, at 11 standard deviations from the mean, and 4.3×10⁻²⁸% is equivalent to one water molecule out of about seven tonnes of water. *

[*: The only place where I can imagine there might be a counterexample to this is particle physics. One particle going zig when the other 2.3×10²⁹ go zag might be within the bounds of detection.]

Correlation   Probability of correct   95% confidence                99% confidence
              sign estimation          min |x|    prop. of such x    min |x|    prop. of such x
0                 50%                  undefined     0%              undefined     0%
0.2               56%                  8.06          7.5×10⁻¹⁴%      11.40         4.3×10⁻²⁸%
0.3               60%                  5.23          1.71×10⁻⁵%       7.40         1.40×10⁻¹¹%
0.4               63%                  3.77          1.67×10⁻²%       5.33         9.91×10⁻⁶%
0.5               67%                  2.85          0.4%             4.03         0.006%
0.8               80%                  1.23          21.7%            1.74         8.1%
0.866             83.3%                0.95          34.2%            1.34         17.9%
0.9               85.6%                0.80          42.6%            1.13         26.0%
0.95              89.9%                0.54          58.9%            0.76         44.4%
0.99              95.5%                0.23          81.5%            0.33         74.0%
0.995             97.8%                0.17          86.9%            0.23         81.5%
0.99995           99.68%               0.016         98.7%            0.023        98.1%
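
The table follows from the conditional distribution noted above: P(y > 0 | x) = Φ(cx/√(1 − c²)), so the minimum |x| for a given confidence is √(1 − c²)·Φ⁻¹(confidence)/c. A minimal Python sketch (my addition, standard library only):

    import math
    from statistics import NormalDist

    phi = NormalDist()  # standard normal

    def p_correct_sign(c):
        # probability that x and y have the same sign: cos^-1(-c) / pi
        return math.acos(-c) / math.pi

    def min_x_for_confidence(c, confidence):
        # P(y > 0 | x) = Phi(c * x / sqrt(1 - c^2)) >= confidence
        # => x >= sqrt(1 - c^2) * Phi^-1(confidence) / c
        return math.sqrt(1.0 - c * c) * phi.inv_cdf(confidence) / c

    def proportion_at_least(x_min):
        # proportion of the population with |x| >= x_min
        return 2.0 * (1.0 - phi.cdf(x_min))

    for c in [0.5, 0.9, 0.99]:
        x95 = min_x_for_confidence(c, 0.95)
        print(f"c = {c}: P(correct sign) = {p_correct_sign(c):.3f}, "
              f"min |x| (95%) = {x95:.2f}, "
              f"proportion of such x = {proportion_at_least(x95):.1%}")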

Decile classification

We may wish to do more than merely predict the sign of y. The best prediction of the exact value of y from x is to guess that y is equal to cx. Given x, in what proportion of cases will this estimate of y differ from the correct value by less than an amount δ, where δ is such that only 10% of the whole population has a value of y within the range (cx − δ ... cx + δ)? This is tabulated below. (For the bivariate normal distribution, this proportion happens to be independent of x.)

Correlation   Probability of estimate
              within ± half a decile
0                 10%
0.8               25%
0.9               36%
0.95              47%
0.99              78%
0.995             89%
0.997             95%
0.9985            99%
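
A Monte Carlo sketch of this computation (my addition, requiring numpy and scipy, and under one reading of the definition above: the estimate counts as correct when the true y lands within the decile-width probability band centred on the estimate, i.e. when |Φ(y) − Φ(cx)| < 0.05):

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)

    def p_within_half_decile(c, n=1_000_000):
        # sample the standard bivariate normal with correlation c
        x = rng.standard_normal(n)
        y = c * x + np.sqrt(1.0 - c * c) * rng.standard_normal(n)
        # count how often y falls within +/- half a decile (in probability)
        # of the estimate c*x
        return np.mean(np.abs(norm.cdf(y) - norm.cdf(c * x)) < 0.05)

    for c in [0, 0.8, 0.9, 0.95, 0.99]:
        print(f"c = {c}: {p_within_half_decile(c):.2f}")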

Other matters

To be added to this page when I get round to it.

  1. Errors in estimating c from a finite sample.
  2. Correlations that may be observed in non-random subpopulations of a finite population of known correlation.
  3. Multivariate distributions. Note that one may easily construct a distribution -- and not a particularly artificial one -- in which there are three variables A, B, and C, the correlations of A with B and of B with C are positive, and the correlation of A with C is negative. The most extreme case possible is for the first two correlations to be 0.5 and the third -0.5 (see the sketch following this list). This implies that when dealing with low correlations, transitivity fails. Even when there are causal links involved, if A tends to cause B and B tends to cause C, it does not follow that A tends to cause C; it is quite possible that A tends to prevent C.
  4. Other distributions. In particular, the case where the two variables are both binary.
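
Regarding item 3: a symmetric matrix with unit diagonal is a valid correlation matrix exactly when it is positive semidefinite, and the matrix for the extreme case quoted above has a zero eigenvalue, confirming that it lies on the boundary of what is possible. A minimal numpy sketch (my addition):

    import numpy as np

    # correlations: A-B = 0.5, B-C = 0.5, A-C = -0.5
    R = np.array([[ 1.0, 0.5, -0.5],
                  [ 0.5, 1.0,  0.5],
                  [-0.5, 0.5,  1.0]])

    print(np.linalg.eigvalsh(R))  # approx. [0, 1.5, 1.5]: the smallest
                                  # eigenvalue is zero (up to rounding), so no
                                  # more extreme combination is possible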