Variance inflation factors have a lower bound of one and no upper bound. The value of the VIF tells you how much the variance (i.e. the standard error squared) of each coefficient is inflated. A VIF of 1.9, for example, indicates that the variance of that coefficient is 90% higher than you would expect if there were no multicollinearity, that is, if the variable had no correlation with the other predictors.
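Since the VIF is computed as 1 / (1 − R2), where R2 comes from regressing that predictor on all of the other predictors (a formula discussed in more detail below), you can work backwards from the example:

    VIF = 1 / (1 − R2) = 1.9   →   R2 = 1 − 1/1.9 ≈ 0.47
    standard error inflation = √1.9 ≈ 1.38

In other words, a VIF of 1.9 means that roughly 47% of that predictor's variance is shared with the other predictors, and its standard error is about 38% larger than it would be with no multicollinearity.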
The exact size of VIF at which problems begin is a point of contention. What is known is that as the VIF increases, your regression results become less dependable. In general, a VIF greater than 10 indicates substantial correlation and should be considered concerning. Some authors recommend a lower, more conservative threshold of 2.5.
A high VIF is not always a cause for concern. For example, if you include products or powers of other variables in your regression, such as x and x2, you may see high VIFs. Large VIFs for dummy variables representing nominal variables with three or more categories are also usually not a problem.
What is an acceptable variance inflation factor?
The majority of research studies use a VIF (Variance Inflation Factor) > 10 as a criterion for multicollinearity, though some use a lower threshold of 5 or even 2.5.
When deciding on a VIF threshold, keep in mind that multicollinearity is less of an issue with a big sample size than it is with a small one.
As a result, here is a list of references for various VIF thresholds that are recommended for detecting collinearity in a multivariable (linear or logistic) model:
What do you think of VIF and tolerance?
Multicollinearity may be present when the coefficients of the variables are not individually significant (the null hypothesis cannot be rejected in a t-test) yet together the variables explain the variance of the dependent variable, with the F-test rejected and a high coefficient of determination (R2). Comparing these tests is one technique for detecting multicollinearity.
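As a rough illustration of that check (the DataFrame df with a response column "y" is an assumed, illustrative setup), the sketch below fits an OLS model with statsmodels and flags the pattern of individually insignificant coefficients combined with a significant overall F-test:

    # Sketch: flag the "no significant t-tests, but significant F-test" symptom.
    # Assumes a pandas DataFrame `df` with a response column "y"; the rest are predictors.
    import statsmodels.api as sm

    X = sm.add_constant(df.drop(columns=["y"]))
    model = sm.OLS(df["y"], X).fit()

    individually_insignificant = (model.pvalues.drop("const") > 0.05).all()
    jointly_significant = model.f_pvalue < 0.05

    if individually_insignificant and jointly_significant and model.rsquared > 0.5:
        print("Possible multicollinearity: no single coefficient is significant,")
        print("but the predictors are jointly significant with R2 =", round(model.rsquared, 2))

The 0.05 and 0.5 cutoffs here are only illustrative; the point is the combination of symptoms, not any particular threshold.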
Another commonly used approach for detecting multicollinearity in a regression model is the VIF. It measures how much collinearity has inflated the variance (or standard error) of an estimated regression coefficient.
Use of Variance Inflation Factor
The VIF for the ith independent variable is 1 / (1 − Ri2), where Ri2 is the unadjusted coefficient of determination obtained by regressing the ith independent variable on the remaining ones. Tolerance is the reciprocal of the VIF, i.e. 1 − Ri2. Either VIF or tolerance can be used to detect multicollinearity, depending on preference.
If Ri2 equals 0, none of the variance of the ith independent variable is explained by the remaining independent variables. The VIF and tolerance are then both equal to 1, the ith independent variable is unrelated to the others, and multicollinearity is absent from the regression model. In that case the variance of the ith regression coefficient is not inflated.
In general, a VIF greater than 4 or a tolerance less than 0.25 suggests the presence of multicollinearity, and further analysis is required. There is severe multicollinearity that needs to be adjusted when VIF is greater than 10 or tolerance is less than 0.1.
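For reference, these cutoffs map directly onto how much of a predictor's variance is shared with the other predictors:

    VIF = 4    ↔   tolerance = 0.25   ↔   Ri2 = 0.75
    VIF = 10   ↔   tolerance = 0.10   ↔   Ri2 = 0.90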
There are, however, cases where high VIFs can be safely ignored. Three such scenarios are as follows:
1. The high VIFs occur only in control variables, not in the variables of interest. In this case, the variables of interest are not collinear with the control variables, so their regression coefficients are not adversely affected.
2. Multicollinearity has no detrimental effects when the high VIFs are caused by including products or powers of other variables. For example, a regression model may include both x and x2 as independent variables.
3. High VIFs for dummy variables representing a categorical variable with three or more categories do not necessarily indicate a multicollinearity problem. If only a small proportion of cases fall in the reference category, the dummy variables will have high VIFs regardless of whether the categorical variable is associated with the other variables in the model.
Correction of Multicollinearity
Multicollinearity inflates coefficient variances and produces type II errors, so detecting and correcting it is important. The following are two basic and widely used methods for correcting multicollinearity:
1. Remove one (or more) of the highly correlated variables. Because these variables provide redundant information, removing them will not have a significant impact on the coefficient of determination.
2. Instead of OLS regression, use principal component analysis (PCA) or partial least squares (PLS) regression. PLS regression can condense a large number of variables into a smaller set that are uncorrelated with one another. PCA likewise creates new, uncorrelated variables while minimizing information loss, which improves a model's predictive ability.
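As one illustration of the second approach, the sketch below uses scikit-learn to replace a set of correlated predictors with uncorrelated principal components before fitting an ordinary regression; the predictor DataFrame X and response y are illustrative names, not from the text above:

    # Sketch: replace correlated predictors with uncorrelated principal components.
    # Assumes `X` is a DataFrame of correlated predictors and `y` is the response.
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline

    pcr = make_pipeline(
        StandardScaler(),          # PCA is sensitive to scale, so standardize first
        PCA(n_components=0.95),    # keep enough components for 95% of the variance
        LinearRegression(),        # the components are uncorrelated by construction
    )
    pcr.fit(X, y)
    print("Components kept:", pcr.named_steps["pca"].n_components_)

Because the principal components are orthogonal, the VIF of every component in the final regression is exactly 1.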
What does the VIF tell you?
The variance inflation factor (VIF) is a measure of the amount of multicollinearity in a set of multiple regression variables. Mathematically, the VIF for a regression model variable is equal to the ratio of the overall model variance to the variance of a model that includes only that single independent variable. This ratio is calculated for each independent variable. A high VIF indicates that the associated independent variable is highly collinear with the other variables in the model.
What do you do about multicollinearity in your data?
Various methods can be used to detect multicollinearity. The most prevalent one, the VIF (Variance Inflation Factor), is the topic of this article.
The VIF measures the strength of the correlation between the independent variables. It is calculated by regressing each variable against all of the other variables.
An independent variable’s VIF score indicates how well it is explained by other independent variables.
The R2 value is used to measure how well an independent variable is explained by the other independent variables. A high R2 value means that the variable is strongly correlated with the others. The VIF, given below, captures this:

    VIF = 1 / (1 − R2)
The closer the R2 value is to 1, the greater the VIF and the higher the multicollinearity associated with that independent variable.
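A minimal sketch of this calculation with statsmodels follows; the DataFrame df, assumed here to contain only the numeric independent variables, is an illustrative name:

    # Sketch: compute the VIF for each predictor with statsmodels.
    # Assumes `df` is a pandas DataFrame containing only the numeric predictors.
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    def vif_table(df):
        """Return each predictor's VIF, computed with an intercept included."""
        X = sm.add_constant(df)   # VIFs are normally computed with an intercept in the model
        return pd.DataFrame({
            "variable": [c for c in X.columns if c != "const"],
            "VIF": [variance_inflation_factor(X.values, i)
                    for i, c in enumerate(X.columns) if c != "const"],
        })

    print(vif_table(df).sort_values("VIF", ascending=False))

The statsmodels helper performs exactly the regress-each-predictor-on-the-others calculation described above, and the resulting VIFs can be compared against whichever threshold you adopt.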
Is a VIF of 1 a good number?
There are various rules we can follow to see if our VIFs are within acceptable limits. A popular rule of thumb in practice is that if the VIF is greater than ten, the multicollinearity is high. We’re in good shape in our scenario, with values around 1, and we can continue with our regression.
Is VIF suitable for logistic regression?
The Variance Inflation Factor (VIF) approach is used to check for multicollinearity among the independent variables. A VIF score greater than 10 indicates that a variable is highly correlated with the others. As a result, such variables are omitted from the logistic regression model.
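One common way to put this into practice is sketched below; it assumes a predictor DataFrame X and a binary response y (both illustrative names), repeatedly drops the predictor with the largest VIF until every remaining VIF is below 10, and then fits the logistic regression with statsmodels:

    # Sketch: drop high-VIF predictors, then fit a logistic regression.
    # Assumes `X` is a DataFrame of predictors and `y` is a 0/1 response.
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    def drop_high_vif(X, threshold=10.0):
        X = X.copy()
        while True:
            exog = sm.add_constant(X).values
            vifs = {col: variance_inflation_factor(exog, i + 1)   # +1 skips the constant
                    for i, col in enumerate(X.columns)}
            worst, worst_vif = max(vifs.items(), key=lambda kv: kv[1])
            if worst_vif <= threshold or X.shape[1] == 1:
                return X
            X = X.drop(columns=[worst])

    X_reduced = drop_high_vif(X)
    logit_result = sm.Logit(y, sm.add_constant(X_reduced)).fit()
    print(logit_result.summary())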
What is a poor VIF?
Multicollinearity is a common issue when estimating linear or generalized linear models, such as logistic regression and Cox regression. It arises when predictor variables are highly correlated, resulting in unreliable and unstable estimates of the regression coefficients. Most data analysts know that multicollinearity is not a good thing. However, many are unaware that it can be safely ignored in a number of scenarios.
Before we look at those scenarios, let's look at the most commonly used multicollinearity diagnostic, the variance inflation factor (VIF). The VIF for a predictor can be obtained by regressing that predictor on all of the other predictors and calculating the R2 from that regression. The VIF is then just 1/(1 − R2).
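That recipe can be carried out by hand. The sketch below (the predictor DataFrame X and column name target are illustrative) regresses one predictor on all of the others with ordinary least squares and converts the resulting R2 into a VIF:

    # Sketch: compute one predictor's VIF "by hand" from the R2 of a regression
    # of that predictor on all of the other predictors.
    # Assumes `X` is a DataFrame of predictors and `target` names one of its columns.
    import numpy as np

    def manual_vif(X, target):
        y = X[target].values
        others = X.drop(columns=[target]).values
        A = np.column_stack([np.ones(len(others)), others])   # add an intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        residuals = y - A @ coef
        r2 = 1 - residuals.var() / y.var()
        return 1.0 / (1.0 - r2)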
The variance inflation factor is so named because it tells you how much a coefficient's variance is "inflated" by linear dependence with other predictors. A VIF of 1.8 indicates that the variance (the square of the standard error) of a given coefficient is 80% higher than it would be if that predictor were completely uncorrelated with all the other predictors.
The VIF has a lower bound of 1 but no upper bound. Authorities disagree about how high a VIF must be to be considered a problem. Personally, I become concerned when a VIF is greater than 2.50, which corresponds to an R2 of .60 with the other variables.
There are at least three instances in which a high VIF is not an issue and may be safely ignored, regardless of your criterion for what constitutes a high VIF:
1. The variables with high VIFs are control variables, while the variables of interest have low VIFs. The key point is that multicollinearity only affects the variables that are collinear with one another: it raises the standard errors of their coefficients and can make those coefficients unstable in a variety of ways. As long as the collinear variables are used only as controls and are not collinear with your variables of interest, there is no problem. The coefficients of the variables of interest are unaffected, and the control variables still perform their function as controls.
Here's an example from my own work: the sample consists of colleges in the United States, the dependent variable is graduation rate, and the variable of interest is a dummy indicator for public versus private colleges. Average SAT and ACT scores for entering freshmen serve as two control variables. These two variables have a correlation above .9, implying VIFs of at least 5.26 for each. The VIF for the public/private indicator, on the other hand, is only 1.04, so there is nothing to worry about and no need to eliminate one of the two controls.
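The 5.26 figure follows directly from the VIF formula: a correlation above .9 between the two test-score controls implies an R2 of at least .81 when either one is regressed on the other predictors, so

    VIF ≥ 1 / (1 − 0.81) = 1 / 0.19 ≈ 5.26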
2. The high VIFs are caused by the inclusion of powers or products of other variables. If a regression model includes both x and x2, there is a good chance the two variables will be highly correlated. Similarly, if the model includes x, z, and xz, both x and z are likely to be highly correlated with their product. This is not a cause for concern, because the p-value for x2 or xz is unaffected by the multicollinearity. This is easy to demonstrate: you can greatly reduce the correlations by "centering" the variables (i.e. subtracting their means) before creating the powers or products, yet the p-value for x2 or xz will be exactly the same whether or not you center. And all of the results for the other variables (including the R2 but excluding the lower-order terms) will be the same in either case. So the multicollinearity has no adverse consequences.
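A quick way to convince yourself of this is a small simulation (the data below are made up purely for illustration): fit the model with raw x and x2, then again after centering x, and compare both the correlation between the two terms and the p-value on the squared term:

    # Sketch: centering x before squaring reduces the correlation with x2,
    # but leaves the p-value of the squared term unchanged. Simulated data.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x = rng.uniform(1, 10, size=500)
    y = 2 + 0.5 * x + 0.3 * x**2 + rng.normal(scale=3, size=500)

    def fit(xvals):
        X = sm.add_constant(np.column_stack([xvals, xvals**2]))
        return sm.OLS(y, X).fit()

    raw, centered = fit(x), fit(x - x.mean())
    print("corr(x, x2), raw:     ", np.corrcoef(x, x**2)[0, 1])                            # near 1
    print("corr(x, x2), centered:", np.corrcoef(x - x.mean(), (x - x.mean())**2)[0, 1])    # near 0
    print("p-value for x2, raw:     ", raw.pvalues[2])
    print("p-value for x2, centered:", centered.pvalues[2])    # identical to the raw p-value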
3. The variables with high VIFs are indicator (dummy) variables that represent a categorical variable with three or more categories. If the proportion of cases in the reference category is small, the indicator variables will have high VIFs even if the categorical variable is not associated with the other variables in the regression model.
Consider the following scenario: a marital status variable has three categories: currently married, never married, and formerly married. Formerly married is the reference category, with indicator variables for the other two. As the proportion of people in the reference category shrinks, the correlation between those two indicators becomes more strongly negative. If 45 percent of people are never married, 45 percent are married, and 10 percent are formerly married, the VIFs for the married and never-married indicators will be at least 3.0.
Is this an issue? It does mean that the p-values for the indicator variables may be high. But the overall test that all indicators have zero coefficients is unaffected by the high VIFs, and nothing else in the regression is affected. If you really want to avoid the high VIFs, choose a reference category with a larger fraction of the cases. That may be desirable in order to avoid situations where none of the individual indicators is statistically significant even though the overall set of indicators is.
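To put numbers on the marital status example, the simulation sketch below (made-up data with the same 45/45/10 split) computes the correlation between the two indicators and their VIFs:

    # Sketch: dummy variables for a 45% / 45% / 10% categorical variable.
    # The two non-reference indicators are strongly negatively correlated,
    # which pushes their VIFs to about 3 even with no other predictors involved.
    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    rng = np.random.default_rng(1)
    status = rng.choice(["married", "never", "former"], size=10_000, p=[0.45, 0.45, 0.10])
    d_married = (status == "married").astype(float)
    d_never = (status == "never").astype(float)

    print("correlation:", np.corrcoef(d_married, d_never)[0, 1])   # about -0.82
    exog = sm.add_constant(np.column_stack([d_married, d_never]))
    print("VIF, married:      ", variance_inflation_factor(exog, 1))   # about 3.0
    print("VIF, never married:", variance_inflation_factor(exog, 2))   # about 3.0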