Multicollinearity

Multicollinearity is a phenomenon in which two or more independent variables, used collectively to model a dependent variable, are highly correlated with one another, i.e., they can be linearly predicted from each other [1]. For example, if canopy height and canopy volume are highly correlated, a yield model that uses both will suffer from multicollinearity. In common regression methods such as ordinary least squares, multicollinearity causes drastic fluctuations in the estimated coefficients, weakens their statistical significance, and reduces the statistical power of the analysis [2].

When the number of independent variables is small, understanding and handling multicollinearity is relatively simple, and in some cases its effects are negligible because overall model prediction is not affected; it is mainly the interpretation of individual coefficients that suffers. As the number of input variables grows, however, the issue becomes much harder to manage. Hyperspectral data, for example, contain hundreds of highly correlated bands that must be handled before modeling, usually through the feature selection and feature engineering techniques discussed in the next section.
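
As a concrete illustration, the following Python sketch computes variance inflation factors (VIFs), a standard multicollinearity diagnostic, on synthetic data mimicking the canopy example above. The data, the variable names, and the rule-of-thumb threshold of about 10 are illustrative assumptions, not taken from the cited works.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic stand-in for the canopy example: volume is nearly a linear
# function of height, while soil pH is roughly independent of both.
rng = np.random.default_rng(0)
n = 200
height = rng.normal(10.0, 2.0, size=n)
volume = 3.0 * height + rng.normal(0.0, 0.5, size=n)
soil_ph = rng.normal(6.5, 0.3, size=n)

X = sm.add_constant(pd.DataFrame(
    {"height": height, "volume": volume, "soil_ph": soil_ph}))

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on
# all remaining features; values above ~10 are commonly read as a red flag.
for j, name in enumerate(X.columns):
    if name == "const":
        continue  # the intercept column is not a predictor of interest
    print(f"{name}: VIF = {variance_inflation_factor(X.values, j):.1f}")
```

In this setup, height and volume receive very large VIFs while soil_ph stays near 1, correctly flagging the collinear pair.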

Machine learning techniques are generally less sensitive to multicollinearity than conventional statistical models because of their working principles; nevertheless, if left unaddressed, multicollinearity adds to the problem's complexity and can cause fluctuating predictions and overfitting [3].
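
A minimal sketch of the coefficient-fluctuation effect, assuming scikit-learn and synthetic data (both illustrative choices, not the setup of [3]): refitting ordinary least squares on bootstrap resamples of two nearly identical predictors shows large swings in the individual coefficients, while a ridge penalty, one common way of mitigating multicollinearity, stabilizes them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)      # nearly a copy of x1
X = np.column_stack([x1, x2])
y = 2.0 * x1 + rng.normal(scale=0.5, size=n)  # only x1 truly drives y

def coef_spread(model, n_boot=200):
    """Refit the model on bootstrap resamples; return per-coefficient std."""
    coefs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        coefs.append(model.fit(X[idx], y[idx]).coef_.copy())
    return np.std(coefs, axis=0)

print("OLS coefficient std:  ", coef_spread(LinearRegression()))
print("Ridge coefficient std:", coef_spread(Ridge(alpha=1.0)))
```

Note that the fitted values themselves remain stable in both cases; it is the split of the effect between the two collinear coefficients that the least squares fit cannot pin down.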

References

[1] M. P. Allen, “The problem of multicollinearity,” in Understanding Regression Analysis, 1997, pp. 176–180.

[2] M. H. Kutner, C. J. Nachtsheim, J. Neter, and W. Li, Applied Linear Statistical Models, 5th ed. New York: McGraw-Hill Irwin, 2005.

[3] A. Garg and K. Tai, “Comparison of statistical and machine learning methods in modelling of data with multicollinearity,” International Journal of Modelling, Identification and Control, vol. 18, no. 4, pp. 295–312, 2013.