This post is preparation for a presentation at the Data Science English Study Group.
How to make a good linear regression model?
- To build a good model with linear regression analysis, the data must satisfy four basic assumptions.
- If these four assumptions are not satisfied, a proper linear regression model cannot be built.
- The four basic assumptions are:
- Linearity
- Independence
- Equal variance
- Normality
Linearity
- Linearity is an important basic assumption in linear regression analysis.
- Looking at the table, it can be seen that the variable that does not have a linear relationship with Sepal.Length is Sepal.Width.
- Fitting a linear regression model to this data gives a p-value of 0.152 for Sepal.Width.
- Therefore, it has no statistically significant influence on Sepal.Length.
- If some of the variables do not satisfy linearity:
- Try adding a new variable.
- Try transforming the variables with logarithms, exponents, or roots.
- Try removing the variables that do not satisfy linearity.
- Fit the linear regression model anyway and let a variable selection method decide which variables to keep.
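The transformation advice above can be illustrated with a small sketch. It uses only the Python standard library and made-up data (not the iris data from the table); the helper name `pearson_r` is an assumption introduced here for illustration. The idea is that a curved relationship shows a weaker linear correlation than the same data after a log transform.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up data: y grows exponentially with x, so the raw relationship is curved.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [math.exp(0.5 * v) for v in x]

r_raw = pearson_r(x, y)                         # linear correlation with the raw y
r_log = pearson_r(x, [math.log(v) for v in y])  # log(y) is linear in x

print(f"r with raw y: {r_raw:.3f}")
print(f"r with log y: {r_log:.3f}")
```

If the correlation improves clearly after the transform, the transformed variable is probably a better candidate for the linear model than the raw one.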
Independence
- Independence means that there is no correlation among the independent variables.
- Looking at the table, I deliberately added a variable that is highly correlated with an existing one.
- Although the original variable was significant, adding similar variables made its result insignificant.
- This is multicollinearity (in Korean, ‘다중공선성’).
- Variables that cause multicollinearity should be removed.
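A rough way to spot multicollinearity is the variance inflation factor (VIF). The sketch below is a minimal illustration with made-up data, assuming a model with exactly two predictors, where the VIF reduces to 1 / (1 - r²). The helper names `pearson_r` and `vif_two_predictors` are hypothetical, and the threshold of 10 is only a commonly quoted rule of thumb, not something stated in this post.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def vif_two_predictors(x1, x2):
    """VIF for a model with exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)

# Made-up predictors (assumption for illustration only).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2_similar = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]  # almost 2 * x1
x2_unrelated = [5.0, 1.0, 4.0, 2.0, 6.0, 3.0, 2.0, 5.0]    # little relation to x1

vif_similar = vif_two_predictors(x1, x2_similar)
vif_unrelated = vif_two_predictors(x1, x2_unrelated)

print(f"VIF with a near-duplicate predictor: {vif_similar:.1f}")
print(f"VIF with an unrelated predictor:     {vif_unrelated:.2f}")
```

With more than two predictors, each VIF comes from regressing one predictor on all the others, which is what statistical packages usually do.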
Equal variance
- Equal variance (homoscedasticity) means that the residuals have the same variance everywhere.
- The same variance means that the residuals are evenly scattered without a specific pattern.
- Looking at the table, I created an irregular ydata.
- The regression analysis on it does not produce a significant model.
- Let’s look at the distribution of standardized residuals.
- The standardized residuals do not satisfy equal variance and show a specific pattern of four clusters.
- This suggests that important variables were left out of the analysis data.
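The residual check described above can be sketched in a few lines. This is a simplified illustration with made-up data, not the ydata from the table. It standardizes residuals by dividing by the residual standard error, which ignores leverage (statistical packages typically account for it), and then compares the residual spread in the low-x half against the high-x half. The helper `fit_line` is a hypothetical name.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Made-up data: the noise grows with x, so the variance is not equal.
x = list(range(1, 13))
y = [2.0 * v + (0.3 if i % 2 == 0 else -0.3) * v for i, v in enumerate(x)]

intercept, slope = fit_line(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Simplified standardization: divide by the residual standard error
# (n - 2 degrees of freedom); leverage is ignored in this sketch.
rse = math.sqrt(sum(r ** 2 for r in residuals) / (len(x) - 2))
std_residuals = [r / rse for r in residuals]

# x is sorted, so the first half of the list is the low-x half.
half = len(x) // 2
spread_low = sum(r ** 2 for r in std_residuals[:half]) / half
spread_high = sum(r ** 2 for r in std_residuals[half:]) / half
spread_ratio = spread_high / spread_low

print(f"mean squared standardized residual, low x:  {spread_low:.2f}")
print(f"mean squared standardized residual, high x: {spread_high:.2f}")
print(f"ratio (high / low): {spread_ratio:.2f}")
```

A ratio far from 1 hints that the spread changes across the range of x. A plot of standardized residuals against fitted values remains the more informative check.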
Normality
- Normality asks whether the residuals follow a normal distribution.
- Looking at the table, I created a variable ydata that is concentrated on one side.
- The null hypothesis is that “there is no difference from the normal distribution”, and it is rejected because the p-value is 0.001.
- To satisfy normality, a method similar to the one used to fix unequal variance is needed.
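The post does not say which normality test produced the p-value above. As an illustration only, the sketch below uses the Jarque-Bera test, chosen here because it can be written with the Python standard library alone. It relies on the large-sample chi-square approximation with 2 degrees of freedom, so its p-values are rough for small samples. The data are made up, and the function name `jarque_bera` is an assumption.

```python
import math
from statistics import NormalDist

def jarque_bera(values):
    """Return the Jarque-Bera statistic and its asymptotic p-value."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    # Survival function of a chi-square distribution with 2 degrees of freedom.
    p_value = math.exp(-jb / 2.0)
    return jb, p_value

# Made-up samples built from evenly spaced quantiles (no randomness).
n = 200
probs = [(i + 0.5) / n for i in range(n)]
symmetric = [NormalDist().inv_cdf(p) for p in probs]  # bell-shaped
skewed = [-math.log(1.0 - p) for p in probs]          # concentrated on one side

jb_symmetric, p_symmetric = jarque_bera(symmetric)
jb_skewed, p_skewed = jarque_bera(skewed)

print(f"symmetric sample: JB = {jb_symmetric:.3f}, p = {p_symmetric:.3f}")
print(f"skewed sample:    JB = {jb_skewed:.3f}, p = {p_skewed:.3g}")
```

A small p-value rejects the hypothesis that the data follow a normal distribution, which matches the reasoning in the bullet above.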