NAG Library g02 - Correlation and Regression Analysis
|
Scope of the Chapter
|
|
This chapter is concerned with two techniques – correlation analysis and regression modelling – both of which aim to determine the inter-relationships among two or more variables.
Other chapters of the NAG C Library which cover similar problems are Chapters e02 and e04. Chapter e02 routines may be used to fit linear models by criteria other than least-squares, and also for polynomial regression; Chapter e04 routines may be used to fit nonlinear models and linearly constrained linear models.
|
|
Background to the Problems
|
|
|
Correlation
|
|
|
Aims of correlation analysis
|
|
Correlation analysis provides a single summary statistic – the correlation coefficient – describing the strength of the association between two variables. The most common types of association which are investigated by correlation analysis are linear relationships, and there are a number of forms of linear correlation coefficients for use with different types of data.
|
|
Correlation coefficients
|
|
The (Pearson) product-moment correlation coefficients measure a linear relationship, while Kendall's tau and Spearman's rank order correlation coefficients measure monotonicity only. All three coefficients range from -1.0 to +1.0. A coefficient of zero always indicates that no linear relationship exists; a coefficient of +1.0 implies a "perfect" positive relationship (i.e., an increase in one variable is always associated with a corresponding increase in the other variable); and a coefficient of -1.0 indicates a "perfect" negative relationship (i.e., an increase in one variable is always associated with a corresponding decrease in the other variable).
Consider the bivariate scattergrams in Figure 1: (a) and (b) show strictly linear functions for which the values of the product-moment correlation coefficient, and (since a linear function is also monotonic) both Kendall's tau and Spearman's rank order coefficients, would be +1.0 and -1.0 respectively. However, though the relationships in figures (c) and (d) are respectively monotonically increasing and monotonically decreasing, for which both Kendall's and Spearman's non-parametric coefficients would be +1.0 (in (c)) and -1.0 (in (d)), the functions are nonlinear so that the product-moment coefficients would not take such "perfect" extreme values. There is no obvious relationship between the variables in figure (e), so all three coefficients would assume values close to zero, while in figure (f), though there is an obvious parabolic relationship between the two variables, it would not be detected by any of the correlation coefficients, which would again take values near to zero; it is important therefore to examine scattergrams as well as the correlation coefficients.
In order to decide which type of correlation is the most appropriate, it is necessary to appreciate the different groups into which variables may be classified. Variables are generally divided into four types of scales: the nominal scale, the ordinal scale, the interval scale, and the ratio scale. The nominal scale is used only to categorise data; for each category a name, perhaps numeric, is assigned so that two different categories will be identified by distinct names. The ordinal scale, as well as categorising the observations, orders the categories. Each category is assigned a distinct identifying symbol, in such a way that the order of the symbols corresponds to the order of the categories. (The most common system for ordinal variables is to assign numerical identifiers to the categories, though if they have previously been assigned alphabetic characters, these may be transformed to a numerical system by any convenient method which preserves the ordering of the categories.) The interval scale not only categorises and orders the observations, but also quantifies the comparison between categories; this necessitates a common unit of measurement and an arbitrary zero-point. Finally, the ratio scale is similar to the interval scale, except that it has an absolute (as opposed to arbitrary) zero-point.
For a more complete discussion of these four types of scales, and some examples, the user is referred to Churchman and Ratoosh (1959) and Hays (1970).
Product-moment correlation coefficients are used with variables which are interval (or ratio) scales; these coefficients measure the amount of spread about the linear least-squares equation. For a product-moment correlation coefficient, r, based on n pairs of observations, testing against the null hypothesis that there is no correlation between the two variables, the statistic
    t = r * sqrt((n - 2) / (1 - r^2))
has a Student's t-distribution with n - 2 degrees of freedom; its significance can be tested accordingly.
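For illustration only (this is not a NAG routine and the data values are hypothetical), the product-moment coefficient and its t-statistic can be computed as in the following sketch:

#include <math.h>
#include <stdio.h>

/* Compute the product-moment correlation coefficient r for n paired
   observations, and the statistic t = r*sqrt((n-2)/(1-r^2)), which is
   referred to a Student's t-distribution with n-2 degrees of freedom. */
static double pearson_r(const double x[], const double y[], int n)
{
    double sx = 0.0, sy = 0.0, sxx = 0.0, syy = 0.0, sxy = 0.0;
    for (int i = 0; i < n; i++) {
        sx += x[i]; sy += y[i];
        sxx += x[i] * x[i]; syy += y[i] * y[i];
        sxy += x[i] * y[i];
    }
    double cov = sxy - sx * sy / n;     /* corrected sum of cross-products */
    double vx  = sxx - sx * sx / n;     /* corrected sum of squares of x   */
    double vy  = syy - sy * sy / n;     /* corrected sum of squares of y   */
    return cov / sqrt(vx * vy);
}

int main(void)
{
    /* hypothetical data for illustration only */
    double x[] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0};
    double y[] = {2.1, 3.9, 6.2, 7.8, 10.1, 12.2};
    int n = 6;

    double r = pearson_r(x, y, n);
    double t = r * sqrt((n - 2) / (1.0 - r * r));
    printf("r = %f, t = %f on %d degrees of freedom\n", r, t, n - 2);
    return 0;
}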
Ranked and ordinal scale data are generally analysed by non-parametric methods – usually either Spearman's or Kendall's tau rank-order correlation coefficients, which, as their names suggest, operate solely on the ranks, or relative orders, of the data values. Interval or ratio scale variables may also be validly analysed by non-parametric methods, but such techniques are statistically less powerful than a product-moment method. For a Spearman rank-order correlation coefficient, R, based on n pairs of observations, testing against the null hypothesis that there is no correlation between the two variables, for large samples the statistic
    R * sqrt((n - 2) / (1 - R^2))
has approximately a Student's t-distribution with n - 2 degrees of freedom, and may be treated accordingly. (This is similar to the product-moment correlation coefficient, r, see above.) Kendall's tau coefficient, based on n pairs of observations, has, for large samples, an approximately Normal distribution with mean zero and standard deviation
    sqrt((4n + 10) / (9n(n - 1)))
when tested against the null hypothesis that there is no correlation between the two variables; the coefficient should therefore be divided by this standard deviation and tested against the standard Normal distribution, N(0,1).
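Similarly, a minimal sketch (hypothetical data, ties ignored) of Kendall's tau and its large-sample Normal test statistic:

#include <math.h>
#include <stdio.h>

/* Kendall's tau for n pairs (no ties assumed), computed from the counts of
   concordant and discordant pairs. */
static double kendall_tau(const double x[], const double y[], int n)
{
    int concordant = 0, discordant = 0;
    for (int i = 0; i < n - 1; i++)
        for (int j = i + 1; j < n; j++) {
            double s = (x[i] - x[j]) * (y[i] - y[j]);
            if (s > 0.0) concordant++;
            else if (s < 0.0) discordant++;
        }
    return (concordant - discordant) / (0.5 * n * (n - 1));
}

int main(void)
{
    /* hypothetical data for illustration only */
    double x[] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0};
    double y[] = {1.5, 2.2, 2.0, 4.8, 4.1, 6.6, 7.0, 9.3};
    int n = 8;

    double tau = kendall_tau(x, y, n);
    double sd  = sqrt((4.0 * n + 10.0) / (9.0 * n * (n - 1.0)));
    printf("tau = %f, z = %f\n", tau, tau / sd);  /* z is referred to N(0,1) */
    return 0;
}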
When the number of ordinal categories a variable takes is large, and the number of ties is relatively small, Spearman's rank-order correlation coefficients have advantages over Kendall's tau; conversely, when the number of categories is small, or there are a large number of ties, Kendall's tau is usually preferred. Thus when the ordinal scale is more or less continuous, Spearman's rank-order coefficients are preferred, whereas Kendall's tau is used when the data is grouped into a smaller number of categories; both measures do however include corrections for the occurrence of ties, and the basic concepts underlying the two coefficients are quite similar. The absolute value of Kendall's tau coefficient tends to be slightly smaller than Spearman's coefficient for the same set of data.
There is no authoritative dictum on the selection of correlation coefficients – particularly on the advisability of using correlations with ordinal data. This is a matter of discretion for the user.
|
|
|
Regression
|
|
|
Aims of regression modelling
|
|
In regression analysis the relationship between one specific random variable, the dependent or response variable, and one or more known variables, called the independent variables or covariates, is studied. This relationship is represented by a mathematical model, or an equation, which associates the dependent variable with the independent variables, together with a set of relevant assumptions. The independent variables are related to the dependent variable by a function, called the regression function, which involves a set of unknown parameters. Values of the parameters which give the best fit for a given set of data are obtained; these values are known as the estimates of the parameters.
The reasons for using a regression model are twofold. The first is to obtain a description of the relationship between the variables as an indicator of possible causality. The second reason is to predict the value of the dependent variable from a set of values of the independent variables. Accordingly, the most usual statistical problems involved in regression analysis are:
|
to obtain best estimates of the unknown regression parameters;
|
|
to test hypotheses about these parameters;
|
|
to determine the adequacy of the assumed model; and
|
|
to verify the set of relevant assumptions.
|
|
|
Linear regression models
|
|
If the regression model is linear in the parameters (but not necessarily in the independent variables), then the model is said to be linear; otherwise the model is classified as nonlinear.
The most elementary form of regression model is the simple linear regression of the dependent variable, y, on a single independent variable, x, which takes the form
    E(y) = β_0 + β_1 x    (1)
where E(y) is the expected or average value of y, and β_0 and β_1 are the parameters whose values are to be estimated, or, if the regression is required to pass through the origin (i.e., no constant term),
    E(y) = β_1 x    (2)
where β_1 is the only unknown parameter.
An extension of this is multiple linear regression, in which the dependent variable, y, is regressed on the p (p > 1) independent variables, x_1, x_2, ..., x_p, which takes the form
    E(y) = β_0 + β_1 x_1 + β_2 x_2 + ... + β_p x_p    (3)
where β_0, β_1, ..., β_p are the unknown parameters.
A special case of multiple linear regression is polynomial linear regression, in which the p independent variables are in fact powers of the same single variable x (i.e., x_j = x^j, for j = 1, 2, ..., p).
In this case, the model defined by (3) becomes
    E(y) = β_0 + β_1 x + β_2 x^2 + ... + β_p x^p    (4)
There are a great variety of nonlinear regression models; one of the most common is exponential regression, in which the equation may take the form
    E(y) = a + b exp(c x)    (5)
It should be noted that equation (4) represents a linear regression, since even though the equation is not linear in the independent variable, x, it is linear in the parameters β_0, β_1, ..., β_p, whereas the regression model of equation (5) is nonlinear, as it is nonlinear in the parameters (a, b and c).
|
|
Fitting the regression model – least-squares estimation
|
|
The method used to determine values for the parameters is, based on a given set of data, to minimize the sums of squares of the differences between the observed values of the dependent variable and the values predicted by the regression equation for that set of data – hence the term least-squares estimation. For example, if a regression model of the type given by equation (3), namely
    E(y) = β_0 x_0 + β_1 x_1 + ... + β_p x_p,
where x_0 = 1 for all observations, is to be fitted to the n data points
    (x_11, x_21, ..., x_p1, y_1), (x_12, x_22, ..., x_p2, y_2), ..., (x_1n, x_2n, ..., x_pn, y_n)    (6)
such that
    y_i = β_0 + β_1 x_1i + β_2 x_2i + ... + β_p x_pi + e_i,  i = 1, 2, ..., n,
where the e_i are unknown independent random errors with E(e_i) = 0 and var(e_i) = σ^2, σ^2 being a constant, then the method used is to calculate the estimates of the regression parameters by minimizing
    Σ_{i=1}^{n} e_i^2.    (7)
If the errors do not have constant variance, i.e.,
    var(e_i) = σ_i^2 = σ^2 / w_i,
then weighted least-squares estimation is used in which
    Σ_{i=1}^{n} w_i e_i^2
is minimized. For a more complete discussion of these least-squares regression methods, and details of the mathematical techniques used, see Draper and Smith (1985) or Kendall and Stuart (1973).
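As a concrete (hypothetical) illustration of the criterion being minimized, the following sketch fits the simple linear model E(y) = β_0 + β_1 x by minimizing Σ w_i e_i^2; with all weights equal to one this reduces to ordinary least-squares:

#include <stdio.h>

/* Weighted least-squares fit of y = b0 + b1*x.  The closed-form solution
   uses weighted means and weighted sums of squares and cross-products. */
static void wls_simple(const double x[], const double y[], const double w[],
                       int n, double *b0, double *b1)
{
    double sw = 0.0, swx = 0.0, swy = 0.0, swxx = 0.0, swxy = 0.0;
    for (int i = 0; i < n; i++) {
        sw   += w[i];
        swx  += w[i] * x[i];
        swy  += w[i] * y[i];
        swxx += w[i] * x[i] * x[i];
        swxy += w[i] * x[i] * y[i];
    }
    double xbar = swx / sw, ybar = swy / sw;
    *b1 = (swxy - sw * xbar * ybar) / (swxx - sw * xbar * xbar);
    *b0 = ybar - (*b1) * xbar;
}

int main(void)
{
    /* hypothetical data for illustration only; unit weights give OLS */
    double x[] = {0.0, 1.0, 2.0, 3.0, 4.0};
    double y[] = {1.1, 2.9, 5.2, 6.8, 9.1};
    double w[] = {1.0, 1.0, 1.0, 1.0, 1.0};
    double b0, b1;

    wls_simple(x, y, w, 5, &b0, &b1);
    printf("b0 = %f, b1 = %f\n", b0, b1);
    return 0;
}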
|
|
Regression models and designed experiments
|
|
One application of regression models is in the analysis of designed experiments. In this case the model relates the dependent variable to qualitative independent variables known as factors. Factors may take a number of different values known as levels. For example, in an experiment in which one of four different treatments is applied, the model will have one factor with four levels. Each level of the factor can be represented by a dummy variable taking the values 0 or 1. So in the example there are four dummy variables x_j, for j = 1, 2, 3, 4, such that:
    x_ij = 1 if observation i received treatment j, and x_ij = 0 otherwise;
along with a variable for the mean, x_0:
    x_i0 = 1 for all observations i.
If there were 7 observations the data would take the form of a design matrix with one row per observation and one column for x_0 and for each dummy variable (see the hypothetical layout below).
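The original table is not reproduced here; the following hypothetical assignment of the four treatments to seven observations (treatment 1 once, treatments 2 to 4 twice each) illustrates the layout of x_0 and the dummy variables:

    obs   x_0  x_1  x_2  x_3  x_4   treatment
     1     1    1    0    0    0        1
     2     1    0    1    0    0        2
     3     1    0    1    0    0        2
     4     1    0    0    1    0        3
     5     1    0    0    1    0        3
     6     1    0    0    0    1        4
     7     1    0    0    0    1        4

Note that in every row x_0 = x_1 + x_2 + x_3 + x_4, which is the source of the intrinsic aliasing discussed next.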
Models which include factors are sometimes known as General Linear (Regression) Models. When dummy variables are used it is common for the model not to be of full rank. In the case above, the model would not be of full rank because
    x_i0 = x_i1 + x_i2 + x_i3 + x_i4, for all i.
This means that the effect of x_0 cannot be distinguished from the combined effect of x_1, x_2, x_3 and x_4. This is known as aliasing. In this situation the aliasing can be deduced from the experimental design and the model to be fitted; in such situations it is known as intrinsic aliasing. In the example above, no matter how many times each treatment is replicated (other than 0), the aliasing will still be present. If the aliasing is due to a particular data set to which the model is to be fitted, then it is known as extrinsic aliasing. If in the example above observation 1 were missing (it being the only observation receiving treatment 1 in the hypothetical layout above), then the term β_1 would also be aliased. In general intrinsic aliasing may be overcome by changing the model, e.g., by removing x_0 or one of the dummy variables from the model, or by introducing constraints on the parameters, e.g., β_1 + β_2 + β_3 + β_4 = 0.
If aliasing is present then there will no longer be a unique set of least-squares estimates for the parameters of the model, but the fitted values will still have a unique estimate. Some linear functions of the parameters will also have unique estimates; these are known as estimable functions. In the example given above the functions (β_0 + β_1) and (β_1 − β_2) are both estimable.
|
|
Selecting the regression model
|
|
In many situations there are several possible independent variables, not all of which may be needed in the model. In order to select a suitable set of independent variables, two basic approaches can be used.
|
a. All possible regressions
|
|
In this case all the possible combinations of independent variables are fitted and the one considered the best is selected. To choose the best, two conflicting criteria have to be balanced. One is the fit of the model as measured by the residual sum of squares. This will decrease as more variables are added to the model. The second criterion is the desire to have a model with a small number of significant terms. To aid in the choice of model, statistics such as R^2, which gives the proportion of variation explained by the model, and Mallows' C_p, which tries to balance the size of the residual sum of squares against the number of terms in the model, can be used.
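For reference (the precise definition is not given in this chapter introduction), a commonly used form of the C_p statistic is
    C_p = RSS_p / s^2 − (n − 2p),
where RSS_p is the residual sum of squares of a candidate model with p parameters, s^2 is the residual mean square from the model containing all candidate variables, and n is the number of observations; a model with little bias is expected to give a value of C_p close to p.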
|
|
b. Stepwise model building
|
|
In stepwise model building the regression model is constructed recursively, adding or deleting the independent variables one at a time. When the model is built up the procedure is known as forward selection. The first step is to choose the single variable which is the best predictor. The second independent variable to be added to the regression equation is that which provides the best fit in conjunction with the first variable. Further variables are then added in this recursive fashion, adding at each step the optimum variable, given the other variables already in the equation. Alternatively, backward elimination can be used. This starts with all variables in the model and then drops the variables one at a time, the variable dropped being the one which has the least effect on the fit of the model at that stage. There are also hybrid techniques which combine forward selection with backward elimination.
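The following is a schematic sketch of the forward-selection loop in C; model_rss() is a hypothetical placeholder for a routine that refits the model for the currently selected variables (in practice a routine such as g02eec performs the equivalent step), and the stopping rule shown is a simple drop-in-RSS threshold rather than the F-based criterion normally used:

#include <stdio.h>

#define NVARS 6

/* Hypothetical placeholder: refit the regression for the variables flagged
   in in_model[] and return the residual sum of squares.  Here it simply
   returns a made-up decreasing value so that the example runs. */
static double model_rss(const int in_model[])
{
    double rss = 100.0;
    for (int j = 0; j < NVARS; j++)
        if (in_model[j]) rss -= 10.0 / (j + 1);
    return rss;
}

int main(void)
{
    int in_model[NVARS] = {0};                 /* 1 if variable j is in the model */
    double current_rss = model_rss(in_model);  /* RSS of the null model */
    double min_rss_drop = 1.0;                 /* simple entry criterion */

    for (;;) {
        int best = -1;
        double best_rss = current_rss;
        for (int j = 0; j < NVARS; j++) {
            if (in_model[j]) continue;
            in_model[j] = 1;                   /* try adding variable j */
            double rss = model_rss(in_model);
            in_model[j] = 0;
            if (rss < best_rss) { best_rss = rss; best = j; }
        }
        if (best < 0 || current_rss - best_rss < min_rss_drop)
            break;                             /* no worthwhile improvement left */
        in_model[best] = 1;
        current_rss = best_rss;
        printf("added variable %d, RSS now %.2f\n", best, current_rss);
    }
    return 0;
}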
|
|
|
Examining the fit of the model
|
|
Having fitted a model two questions need to be asked: first, "are all the terms in the model needed?" and second, "is there some systematic lack of fit?". To answer the first question either confidence intervals can be computed for the parameters or t-tests can be calculated to test hypotheses about the regression parameters – for example, whether the value of the parameter, β_i, is significantly different from a specified value, β*_i (often zero). If the estimate of β_i is b_i and its standard error is se(b_i), then the t-statistic is
    (b_i − β*_i) / se(b_i).
It should be noted that both the tests and the confidence intervals may not be independent. Alternatively F-tests based on the residual sums of squares for different models can also be used to test the significance of terms in the model. If model 1, giving residual sum of squares RSS_1 with degrees of freedom ν_1, is a sub-model of model 2, giving residual sum of squares RSS_2 with degrees of freedom ν_2, i.e., all terms in model 1 are also in model 2, then to test if the extra terms in model 2 are needed the F-statistic
    F = [(RSS_1 − RSS_2) / (ν_1 − ν_2)] / (RSS_2 / ν_2)
may be used. These tests and confidence intervals require the additional assumption that the errors, e_i, are Normally distributed.
To check for systematic lack of fit the residuals, r_i = y_i − ŷ_i, where ŷ_i is the fitted value, should be examined. If the model is correct then they should be random with no discernible pattern. Due to the way they are calculated the residuals do not have constant variance. Now the vector of fitted values can be written as a linear combination of the vector of observations of the dependent variable, y, i.e., ŷ = Hy. The variance-covariance matrix of the residuals is then (I − H)σ^2, I being the identity matrix. The diagonal elements of H, h_ii, can therefore be used to standardize the residuals. The h_ii are a measure of the effect of the ith observation on the fitted model and are sometimes known as leverages.
If the observations were taken serially the residuals may also be used to test the assumption of the independence of the e_i and hence the independence of the observations.
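As a small illustration (hypothetical data, not a NAG routine), the leverages h_ii just described can be used to standardize the raw residuals as r_i / (s * sqrt(1 − h_ii)), where s^2 is the residual mean square:

#include <math.h>
#include <stdio.h>

/* Standardize raw residuals using the leverages h[i]; n is the number of
   observations and p the number of parameters in the fitted model. */
static void standardize(const double resid[], const double h[], int n, int p,
                        double out[])
{
    double rss = 0.0;
    for (int i = 0; i < n; i++) rss += resid[i] * resid[i];
    double s = sqrt(rss / (n - p));              /* residual standard error */
    for (int i = 0; i < n; i++)
        out[i] = resid[i] / (s * sqrt(1.0 - h[i]));
}

int main(void)
{
    /* hypothetical residuals and leverages from a fit with p = 2 parameters */
    double resid[] = {0.3, -0.5, 0.1, 0.4, -0.3};
    double h[]     = {0.6, 0.3, 0.2, 0.3, 0.6};
    double z[5];

    standardize(resid, h, 5, 2, z);
    for (int i = 0; i < 5; i++)
        printf("observation %d: standardized residual %.3f\n", i + 1, z[i]);
    return 0;
}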
|
|
Computational methods
|
|
Let X be the n by p matrix of independent variables and y be the vector of the n values of the dependent variable. To find the least-squares estimates of the vector of parameters, β̂, the QR decomposition of X is found, i.e.,
    X = Q [ R ; 0 ],
where [ R ; 0 ] denotes the p by p upper triangular matrix R stacked above an (n − p) by p zero matrix, and Q is an n by n orthogonal matrix. If R is of full rank then β̂ is the solution to
    R β̂ = c_1,
where c = Q^T y and c_1 is the first p rows of c. If R is not of full rank, a solution is obtained by means of a singular value decomposition (SVD) of R,
    R = Q_* [ D 0 ; 0 0 ] P^T,
where D is a k by k diagonal matrix with non-zero diagonal elements, k being the rank of R (and of X), and Q_* and P are p by p orthogonal matrices. This gives the solution
    β̂ = P_1 D^{-1} (Q_*1)^T c_1,
P_1 being the first k columns of P and Q_*1 being the first k columns of Q_*.
This will be only one of the possible solutions. Other estimates may be obtained by applying constraints to the parameters. If weighted regression with a vector of weights w is required then both X and y are premultiplied by w^{1/2}.
The method described above will, in general, be more accurate than methods based on forming X^T X (or a scaled version) and then solving the normal equations
    (X^T X) β̂ = X^T y.
|
|
Robust estimation
|
|
Least-squares regression can be greatly affected by a small number of unusual, atypical, or extreme observations. To protect against such occurrences, robust regression methods have been developed. These methods aim to give less weight to an observation which seems to be out of line with the rest of the data given the model under consideration; that is, they seek to bound the influence of any single observation. For a discussion of influence in regression, see Hampel et al. (1986) and Huber (1981).
There are two ways in which an observation for a regression model can be considered atypical. The values of the independent variables for the observation may be atypical or the residual from the model may be large.
The first problem of atypical values of the independent variables can be tackled by calculating weights for each observation which reflect how atypical it is, i.e., a strongly atypical observation would have a low weight. There are several ways of finding suitable weights; some are discussed in Hampel et al. (1986).
The second problem is tackled by bounding the contribution of the individual observations to the criterion to be minimized. When minimizing (7) a set of linear equations is formed, the solution of which gives the least-squares estimates. The equations are
    Σ_{i=1}^{n} r_i x_ij = 0,  j = 0, 1, ..., p,
where r_i is the ith residual. These equations are replaced by
    Σ_{i=1}^{n} ψ(r_i / σ) x_ij = 0,  j = 0, 1, ..., p,    (8)
where σ^2 is the variance of the e_i, and ψ is a suitable function which down weights large values of the standardized residuals r_i / σ. There are several suggested forms for ψ, one of which is Huber's function,
    ψ(t) = −c if t < −c;  ψ(t) = t if |t| ≤ c;  ψ(t) = c if t > c,    (9)
for a suitable constant c.
The solution to (8) gives the M-estimates of the regression coefficients. The weights can be included in (8) to protect against both types of extreme value. The parameter σ can be estimated by the median absolute deviation of the residuals or as a solution to, in the unweighted case,
    Σ_{i=1}^{n} χ(r_i / σ̂) = (n − k) β*,
where χ is a suitable function, k is the number of parameters in the model, and β* is a constant chosen to make the estimate unbiased. χ is often chosen to be ψ^2 / 2, where ψ is given in (9). Another form of robust regression is to minimize the sum of absolute deviations, i.e.,
    Σ_{i=1}^{n} |r_i|.
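A minimal sketch of Huber's psi function from equation (9), together with the corresponding weight used in iterative fitting (the cut-off value c shown is just a common illustrative choice):

#include <stdio.h>

/* Huber's psi function: standardized residuals t = r_i/sigma larger in
   magnitude than c are pulled back to +/-c, so extreme observations
   contribute less to the estimating equations (8). */
static double huber_psi(double t, double c)
{
    if (t >  c) return  c;
    if (t < -c) return -c;
    return t;
}

/* Weight form psi(t)/t, as used in iterative weighted least-squares. */
static double huber_weight(double t, double c)
{
    return (t == 0.0) ? 1.0 : huber_psi(t, c) / t;
}

int main(void)
{
    double c = 1.345;   /* a commonly quoted cut-off for the Huber function */
    double t[] = {-3.0, -1.0, 0.0, 0.5, 2.5};

    for (int i = 0; i < 5; i++)
        printf("t = %5.2f  psi = %6.3f  weight = %5.3f\n",
               t[i], huber_psi(t[i], c), huber_weight(t[i], c));
    return 0;
}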
For details of robust regression, see Hampel et al. (1986) and Huber (1981).
Robust regressions using least absolute deviations can be computed using routines in Chapter e02.
|
|
Generalized linear models
|
|
Generalized linear models are an extension of the general linear regression model discussed above. They allow a wide range of models to be fitted. These include certain nonlinear regression models, logistic and probit regression models for binary data, and log-linear models for contingency tables. A generalized linear model consists of three basic components:
|
a. A suitable distribution for the dependent variable Y. The following distributions are common: Normal, binomial, Poisson and gamma.
|
|
In addition to the obvious uses of models with these distributions, it should be noted that the Poisson distribution can be used in the analysis of contingency tables while the gamma distribution can be used to model variance components. The effect of the choice of distribution is to define the relationship between the expected value of Y, μ = E(Y), and its variance, and so a generalized linear model with one of the above distributions may be used in a wider context when that relationship holds.
|
|
b. A linear model – η = β_0 + β_1 x_1 + ... + β_p x_p; η is known as the linear predictor.
|
|
c. A link function, g(.), which relates the expected value of the distribution, μ = E(Y), to the linear predictor, i.e., η = g(μ). For the binomial distribution, with π the probability of a success, the following links are commonly used:
|
|
logistic link: η = ln(π / (1 − π));
|
|
probit link: η = Φ^{-1}(π), where Φ is the cumulative Normal distribution function;
|
|
complementary log-log link: η = ln(−ln(1 − π)).
|
|
For the Normal, Poisson, and gamma distributions:
|
|
exponent link: η = μ^a, for a given constant a;
|
|
identity link: η = μ;
|
|
log link: η = ln(μ);
|
|
square root link: η = sqrt(μ);
|
|
reciprocal link: η = 1/μ.
|
|
For each distribution there is a canonical link. For the canonical link there exist sufficient statistics for the parameters. The canonical links are: identity for the Normal distribution, logistic for the binomial distribution, log for the Poisson distribution, and reciprocal for the gamma distribution.
|
|
For the general linear regression model described above the three components are:
|
Normal distribution for the dependent variable;
|
|
Linear model – η = β_0 + β_1 x_1 + ... + β_p x_p;
|
|
identity link – η = μ.
|
The model is fitted by maximum likelihood; this is equivalent to least-squares in the case of the Normal distribution. The residual sum of squares used in regression models is generalized to the concept of deviance. The deviance is based on the logarithm of the ratio of the likelihood of the fitted model to that of the full model, in which μ̂_i = y_i, where μ̂_i is the estimated value of μ_i. For the Normal distribution the deviance is the residual sum of squares. Except for the case of the Normal distribution with the identity link, the χ^2- and F-tests based on the deviance are only approximate; also the estimates of the parameters will only be approximately Normally distributed. Thus only approximate t- or z-tests may be performed on the parameter values and approximate confidence intervals computed.
The estimates are found by using an iterative weighted least-squares procedure. This is equivalent to the Fisher scoring method, in which the Hessian matrix used in the Newton–Raphson method is replaced by its expected value. In the case of canonical links the Fisher scoring method and the Newton–Raphson method are identical. Starting values for the iterative procedure are obtained by replacing the μ_i by y_i in the appropriate equations.
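The following sketch (hypothetical data; for simplicity it starts from β = 0 rather than from the substitution of y_i for μ_i described above) shows the iterative weighted least-squares idea for the simplest case of a binomial (0/1) response with a logistic link and one covariate:

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* hypothetical data for illustration only */
    double x[] = {0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0};
    double y[] = {0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0};
    int n = 8;

    double b0 = 0.0, b1 = 0.0;                 /* starting values */
    for (int iter = 0; iter < 25; iter++) {
        double sw = 0, swx = 0, swz = 0, swxx = 0, swxz = 0;
        for (int i = 0; i < n; i++) {
            double eta = b0 + b1 * x[i];       /* linear predictor       */
            double p   = 1.0 / (1.0 + exp(-eta));
            double w   = p * (1.0 - p);        /* iterative weight       */
            double z   = eta + (y[i] - p) / w; /* working response       */
            sw += w; swx += w * x[i]; swz += w * z;
            swxx += w * x[i] * x[i]; swxz += w * x[i] * z;
        }
        /* weighted least-squares step for the two parameters */
        double xbar = swx / sw, zbar = swz / sw;
        double new_b1 = (swxz - sw * xbar * zbar) / (swxx - sw * xbar * xbar);
        double new_b0 = zbar - new_b1 * xbar;
        double change = fabs(new_b0 - b0) + fabs(new_b1 - b1);
        b0 = new_b0; b1 = new_b1;
        if (change < 1e-8) break;              /* converged */
    }
    printf("b0 = %f, b1 = %f\n", b0, b1);
    return 0;
}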
|
|
|
|
Recommendations on Choice and Use of Available Functions
|
|
|
Correlation
|
|
|
Non-parametric correlation
|
|
g02brc (nag_ken_spe_corr_coeff) computes Kendall and/or Spearman non-parametric rank correlation coefficients. The function allows for a subset of variables to be selected and for observations to be excluded from the calculations if, for example, they contain missing values.
|
|
|
Regression
|
|
|
Simple linear regression
|
|
Two functions are provided for simple linear regression. The function g02cac (nag_simple_linear_regression) calculates the parameter estimates for a simple linear regression with or without a constant term. The function g02cbc (nag_regress_confid_interval) calculates fitted values, residuals and confidence intervals for both the fitted line and individual observations. This function produces the information required for various regression plots.
|
|
Multiple linear regression – general linear model
|
|
|
g02dac (nag_regsn_mult_linear) fits a general linear regression model using a QR decomposition, together with an SVD if the model is not of full rank. The results returned include: residual sum of squares, parameter estimates, their standard errors and variance-covariance matrix, residuals and leverages. There are also several routines to modify the model fitted by g02dac (nag_regsn_mult_linear) and to aid in the interpretation of the model.
|
|
g02dkc (nag_regsn_mult_linear_tran_model) calculates the estimates of the parameters for a given set of constraints (e.g., the parameters for the levels of a factor sum to zero) for a model which is not of full rank and for which the SVD solution has been obtained.
|
Note: g02dec (nag_regsn_mult_linear_add_var) also allows the user to initialise a model building process and then to build up the model by adding variables one at a time.
|
|
Selecting regression models
|
|
To aid the selection of a regression model the following routines are available.
|
g02eac (nag_all_regsn) computes the residual sums of squares for all possible regressions for a given set of independent variables. The routine allows some variables to be forced into all regressions.
|
|
g02eec (nag_step_regsn) enables the user to fit a model by forward selection. The user may call g02eec (nag_step_regsn) a number of times. At each call the routine will calculate the changes in the residual sum of squares from adding each of the variables not already included in the model, select the variable which gives the largest change and then, if the change in residual sum of squares meets the given criterion, add it to the model.
|
|
|
Residuals
|
|
|
The following standardized residuals and measures of influence can be computed from the residuals and leverages produced by the regression routines in this chapter:
|
Internally studentized residual;
|
|
Externally studentized residual;
|
|
Cook's D statistic;
|
|
Atkinson's T statistic.
|
|
|
Generalized linear models
|
|
There are four routines for fitting generalized linear models, one for each of the distributions listed above: g02gac (nag_glm_normal) for the Normal distribution, g02gbc (nag_glm_binomial) for the binomial distribution, g02gcc (nag_glm_poisson) for the Poisson distribution, and g02gdc (nag_glm_gamma) for the gamma distribution. The output includes: the deviance, parameter estimates and their standard errors, fitted values, residuals and leverages.
While g02gac (nag_glm_normal) can be used to fit linear regression models (i.e., by using an identity link) this is not recommended, as g02dac (nag_regsn_mult_linear) will fit these models more efficiently. g02gcc (nag_glm_poisson) can be used to fit log-linear models to contingency tables.
In addition to the routines to fit the models there are two routines to aid the interpretation of the model if a model which is not of full rank has been fitted, i.e., aliasing is present.
|
g02gkc (nag_glm_tran_model) computes parameter estimates for a set of constraints, (e.g., sum of effects for a factor is zero), from the SVD solution provided by the fitting routine.
|
g02gnc (nag_glm_est_func) gives the estimate of an estimable function of the parameters, along with its standard error.
|
|
|
Polynomial regression and non-linear regression
|
|
No routines are currently provided in this chapter for polynomial regression. Users wishing to perform polynomial regressions do, however, have three alternatives. They can use the multiple linear regression routine, g02dac (nag_regsn_mult_linear), with a set of independent variables which are in fact simply the same single variable raised to different powers; they can use the routine g04eac (nag_dummy_vars) to compute orthogonal polynomials which can then be used with g02dac (nag_regsn_mult_linear); or they can use the routines in Chapter e02 (Curve and Surface Fitting), which fit polynomials to sets of data points using the techniques of orthogonal polynomials. This last course is to be preferred, since it is more efficient and liable to be more accurate, but in some cases more statistical information may be required than is provided by those routines, and it may then be necessary to use the routines of this chapter.
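If the multiple linear regression route is taken, the only extra work is constructing the power terms; a small hedged sketch (hypothetical data, with x centred first as a simple precaution against ill-conditioning) might look like this:

#include <stdio.h>

#define N 6
#define P 3   /* degree of polynomial */

int main(void)
{
    double x[N] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0};  /* hypothetical data */
    double X[N][P];                                /* columns xc, xc^2, xc^3 */

    double xbar = 0.0;
    for (int i = 0; i < N; i++) xbar += x[i];
    xbar /= N;

    for (int i = 0; i < N; i++) {
        double xc = x[i] - xbar;   /* centred value */
        double pw = 1.0;
        for (int j = 0; j < P; j++) {
            pw *= xc;              /* xc^(j+1) */
            X[i][j] = pw;
        }
    }

    /* X now holds the independent variables for a multiple linear regression */
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < P; j++) printf("%10.3f ", X[i][j]);
        printf("\n");
    }
    return 0;
}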
More general nonlinear regression models may be fitted using the optimization routines in Chapter e04, which contains routines to minimize the function
    Σ_{i=1}^{n} e_i^2,  where e_i = y_i − f(x_i; β),
f being the chosen nonlinear regression function, and where the regression parameters β are the variables of the minimization problem.
|
|
|
|
See Also
|
|
Atkinson A C (1986) Plots, Transformations and Regressions Clarendon Press, Oxford
Churchman C W and Ratoosh P (1959) Measurement Definitions and Theory Wiley
Cook R D and Weisberg S (1982) Residuals and Influence in Regression Chapman and Hall
Draper N R and Smith H (1985) Applied Regression Analysis (2nd Edition) Wiley
Hammarling S (1985) The singular value decomposition in multivariate statistics SIGNUM Newsl. 20 (3) 2–25
Hampel F R, Ronchetti E M, Rousseeuw P J and Stahel W A (1986) Robust Statistics. The Approach Based on Influence Functions Wiley
Hays W L (1970) Statistics Holt, Rinehart and Winston
Huber P J (1981) Robust Statistics Wiley
Kendall M G and Stuart A (1973) The Advanced Theory of Statistics (Volume 2) (3rd Edition) Griffin
McCullagh P and Nelder J A (1983) Generalized Linear Models Chapman and Hall
Searle S R (1971) Linear Models Wiley
Siegel S (1956) Non-parametric Statistics for the Behavioral Sciences McGraw–Hill
Weisberg S (1985) Applied Linear Regression Wiley
|
|