[ML] Basic Assumptions of Linear Regression

This post was written to prepare for a presentation at the Data Science English Study Group.

How to make a good linear regression model?

  • To build a good model with linear regression analysis, the data must satisfy four basic assumptions.
  • If these four assumptions are not satisfied, a proper linear regression model cannot be built.
  • The four basic assumptions are:
    1. Linearity
    2. Independence
    3. Equal variance
    4. Normality

Linearity

  • Linearity is an important basic assumption in linear regression analysis.

image

  • Looking at the table, it can be seen that the variable that does not have a linear relationship with Sepal.Length is Sepal.Width.
  • When a linear regression model is fitted to this data, the p-value of Sepal.Width is 0.152.
  • Therefore, Sepal.Width has no statistically significant influence on Sepal.Length.
  • If some of the variables do not satisfy linearity:
    1. Try adding a new variable.
    2. Try transforming the variables with logarithms, exponentials, or roots.
    3. Try removing the variables that do not satisfy linearity.
    4. Fit the linear regression model anyway and let a variable selection method decide.
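
Option 2 above (transforming a variable) can be sketched as follows. This is only an illustration on synthetic data, not the iris table from the post: the data, the helper name `r_squared`, and the choice of a log transform are all assumptions made for the example. It compares how well a straight line fits before and after the transform.

```python
import numpy as np

def r_squared(x, y):
    """R^2 of a simple least-squares line y = a + b*x."""
    slope, intercept = np.polyfit(x, y, 1)  # highest degree first
    residuals = y - (intercept + slope * x)
    return 1.0 - residuals.var() / y.var()

# Synthetic data (assumption): y depends on log(x), not on x itself.
rng = np.random.default_rng(0)
x = rng.uniform(1.0, 100.0, size=200)
y = 2.0 * np.log(x) + rng.normal(0.0, 0.2, size=200)

r2_raw = r_squared(x, y)          # straight line in x
r2_log = r_squared(np.log(x), y)  # straight line in log(x)
print(f"R^2 with x: {r2_raw:.3f}, with log(x): {r2_log:.3f}")
```

If the transformed variable gives a clearly higher R^2, the relationship is closer to linear on the transformed scale, which is the situation option 2 is meant for.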

Independence

  • Independence means that the independent variables are not correlated with one another.

image

  • Looking at the table, a variable with a high correlation was deliberately added.
  • Although a variable was originally significant, adding many similar variables made it insignificant.
  • This is called multicollinearity (in Korean, ‘다중공선성’).
  • Variables that cause multicollinearity should be removed.
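
One common way to find which variables cause multicollinearity is the variance inflation factor (VIF), which regresses each independent variable on the others. The post does not say which diagnostic it used, so the sketch below is an assumption: the function name `vif`, the synthetic columns, and the often-quoted "VIF above 10" rule of thumb are illustrative only.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n_samples, n_features)."""
    n, p = X.shape
    factors = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept.
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r2 = 1.0 - residuals.var() / target.var()
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

# Synthetic data (assumption): x3 is almost a copy of x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
x2 = rng.normal(size=300)
x3 = x1 + rng.normal(scale=0.1, size=300)
X = np.column_stack([x1, x2, x3])

vifs = vif(X)
print(np.round(vifs, 1))  # x1 and x3 should stand out, x2 should stay near 1
```

In this setup, dropping either x1 or x3 would bring the remaining VIF values back toward 1, which matches the advice to remove the variables that cause multicollinearity.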

Equal variance

  • Equal variance (homoscedasticity) means that the residuals have the same variance across the whole range of the data.
  • The same variance means that the residuals are evenly scattered without a specific pattern.

image

  • Looking at the table, I created an unusual ydata variable.
  • As a result of the regression analysis, the model is not significant.
  • Let’s look at the distribution of standardized residuals.

image

  • The standardized residuals do not satisfy equal variance and show a specific pattern of four clusters.
  • This suggests that important variables were left out of the analysis data.
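
A rough numerical companion to the residual plot is to compare the residual variance for low and high fitted values. The sketch below is a crude check under stated assumptions, not a formal test and not the analysis from the post: the helper `variance_ratio`, the split into two halves, and the synthetic data are all made up for illustration.

```python
import numpy as np

def variance_ratio(x, y):
    """Residual variance for high fitted values divided by that for low ones."""
    slope, intercept = np.polyfit(x, y, 1)
    fitted = intercept + slope * x
    residuals = y - fitted
    order = np.argsort(fitted)  # sort observations by fitted value
    half = len(order) // 2
    low = residuals[order[:half]]
    high = residuals[order[half:]]
    return high.var() / low.var()

# Synthetic data (assumption): one response with constant noise,
# one whose noise grows with x.
rng = np.random.default_rng(0)
x = rng.uniform(1.0, 10.0, size=400)
y_equal = 3.0 * x + rng.normal(0.0, 1.0, size=400)
y_unequal = 3.0 * x + rng.normal(0.0, 1.0, size=400) * x

ratio_equal = variance_ratio(x, y_equal)
ratio_unequal = variance_ratio(x, y_unequal)
print(f"equal: {ratio_equal:.2f}, unequal: {ratio_unequal:.2f}")
```

A ratio close to 1 is consistent with equal variance, while a ratio far from 1 indicates that the spread of the residuals changes with the fitted values.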

Normality

  • Normality means that the residuals follow a normal distribution.

image

  • Looking at the table, I created a variable ydata that is concentrated on one side.
  • The null hypothesis is that “there is no difference from the normal distribution”; it is rejected because the p-value is 0.001.
  • To satisfy normality, a remedy similar to the one used for equal variance is needed.
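
The idea of data "concentrated on one side" can be illustrated with sample skewness. This is a minimal sketch and not the normality test behind the p-value above: the helper `skewness`, the lognormal data, and the log transform used as a remedy are assumptions chosen for the example.

```python
import numpy as np

def skewness(values):
    """Sample skewness: third central moment over variance to the power 1.5."""
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# Synthetic data (assumption): a right-skewed variable standing in for ydata.
rng = np.random.default_rng(0)
ydata = rng.lognormal(mean=0.0, sigma=1.0, size=500)

skew_raw = skewness(ydata)          # strongly positive: long right tail
skew_log = skewness(np.log(ydata))  # near zero after the log transform
print(f"skewness raw: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

Skewness near zero does not prove normality, but a large value is a quick sign that the distribution leans to one side and that a transformation may be worth trying.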
