Data cleaning: handling missing values and outliers by filling missing values with a regression equation (with Python code)

Editor: Python

How to fill missing values with a regression equation (with Python code)

1. Background:

During data cleaning we often run into outliers and missing values, and outliers are sometimes treated as missing values. Common ways to handle missing values include deletion, filling with a summary statistic (mean, median, etc.), and filling with values predicted by a regression equation.
Deleting records is simple, but when the data set is already small it shrinks the sample further and may change the original distribution of the response variable, which makes the analysis less accurate. Treating outliers as missing values has the advantage that the information in the other variables can be used to model and fill in the outliers (missing values). This article explores how to use a regression equation to predict and fill in outliers and missing values.

2. Application scenarios:

Regression imputation selects several independent variables that can predict the variable with missing values, and estimates the missing values by fitting a regression equation. The method makes full use of the information in the original data set, but it has drawbacks: 1. Although the estimate is unbiased, it ignores the random error, so it underestimates the standard deviation and other measures of spread. 2. It assumes that the variable with missing values is linearly related to the other variables, which does not always hold in practice. Statistical tools can help check this, but the judgment often also relies on the modeler's practical experience and business knowledge.
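
The first drawback can be seen in a small simulation. The sketch below is not from the original article: it uses synthetic data (the variables x and y, the coefficients, and the 30% missing rate are all assumptions), fits the regression with NumPy's least-squares solver, and compares the standard deviation before and after filling.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): y depends linearly on x plus noise.
n = 2000
x = rng.normal(0.0, 1.0, size=n)
y = 2.0 + 1.5 * x + rng.normal(0.0, 1.0, size=n)

# Hide about 30% of y completely at random.
missing = rng.random(n) < 0.3

# Fit y = b0 + b1 * x on the complete rows by ordinary least squares.
X_obs = np.column_stack([np.ones((~missing).sum()), x[~missing]])
coef, *_ = np.linalg.lstsq(X_obs, y[~missing], rcond=None)

# Deterministic regression imputation: fill each gap with the fitted value,
# which drops the random error term for every filled row.
y_filled = y.copy()
y_filled[missing] = coef[0] + coef[1] * x[missing]

std_true = y.std(ddof=1)
std_filled = y_filled.std(ddof=1)
print(f"std of the full data:  {std_true:.3f}")
print(f"std after imputation:  {std_filled:.3f}")
```

Because every filled value sits exactly on the fitted line, the filled column shows less spread than the true one.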

3. Methods and steps :

a. Determine the variable (feature column) whose missing values will be filled.
b. Split the original data set:

Based on the variable to be filled, split the original data set into 2 subsets (1. rows with no missing values: dataset_train; 2. rows where the variable is missing: dataset_pred).

c. Analyze and test the correlation of the relevant variables:

Use domain experience to decide which attribute columns are related to the variable to be filled, then use statistical tools on dataset_train to verify the correlation between the selected attribute columns and that variable.

d. Model and predict :

Fit a linear regression model on dataset_train, then apply the fitted model to dataset_pred to predict the missing values of the variable.

e. Merge to restore the data set:

Merge the two subsets back into one data set, ready for subsequent modeling.
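
Steps a to e can be sketched as follows. This is a minimal illustration under stated assumptions, not the article's original code: the data are synthetic, the column names x1, x2, and target are hypothetical, and the regression is fitted with np.linalg.lstsq rather than a dedicated modeling library.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "target" has gaps; "x1" and "x2" are fully observed predictors.
rng = np.random.default_rng(42)
n = 60
df = pd.DataFrame({
    "x1": rng.normal(50, 5, n),
    "x2": rng.normal(10, 2, n),
})
df["target"] = 3.0 + 0.4 * df["x1"] - 0.8 * df["x2"] + rng.normal(0, 0.5, n)
df.loc[[4, 17, 33], "target"] = np.nan

features = ["x1", "x2"]

# a./b. Pick the column to fill and split on whether it is missing.
is_missing = df["target"].isna()
dataset_train = df[~is_missing]
dataset_pred = df[is_missing]

# c. Check the correlation between each predictor and the target.
print(dataset_train[features + ["target"]].corr()["target"])

# d. Fit ordinary least squares on dataset_train and predict for dataset_pred.
def design(frame):
    """Prepend an intercept column to the predictor matrix."""
    return np.column_stack([np.ones(len(frame)), frame[features].to_numpy()])

coef, *_ = np.linalg.lstsq(
    design(dataset_train), dataset_train["target"].to_numpy(), rcond=None
)
predicted = design(dataset_pred) @ coef

# e. Write the predictions back and restore the original row order.
dataset_pred = dataset_pred.assign(target=predicted)
df_filled = pd.concat([dataset_train, dataset_pred]).sort_index()
print(df_filled.loc[[4, 17, 33]])
```

Sorting by the index after the merge puts the filled rows back in their original positions, so later steps see the data set in the same order as before the split.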

4. Sample code :

Data set description:
The data set is an excerpt of the raw data from a case study on calculating strength (the response variable is "strength").
In this case "force" is an important feature but has missing values. We try to fill them with values predicted by a regression equation, as data preparation for building a model that predicts "strength".

  1. Load the data and determine the feature whose missing values need to be filled:

  2. Split the data set:
    The "force" feature is missing for "item" 3, 12, 16, and 26. Since the data set is small, check whether the variable "force" follows a normal distribution.
    The test gives a p-value of 0.612, which is greater than 0.05, so normality is not rejected and the feature is treated as normally distributed.

  3. Analyze and test the correlation of the relevant variables:
    From practical experience, the welding pull force is affected by temperature (temp), time (duration), and the amount of solder paste (paste_qty), so we use these 3 factors to build the regression equation. Before that, check the correlation between each factor and the variable to be predicted (force).
    The results show that each of the chosen factors has some correlation with the "force" attribute.

  4. Model and predict :
    The fitted regression equation is:
    force = 12.246 + 0.238 * temp - 0.262 * time + 4.419 * paste_qty
    Here time refers to the time factor (duration). The model is then used to estimate the 4 missing values of "force".

  5. Merge to restore the data set:
    Check the "force" feature for "item" 3, 12, 16, and 26. The originally missing values have now been filled with the values predicted by the regression equation.
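
As an illustration of step 4, the reported equation can be applied directly to the rows where "force" is missing. The sketch below uses the coefficients stated in the article, but the rows in the DataFrame are invented for illustration and are not the article's data. The column names temp, duration, and paste_qty follow the factor names given in step 3, and the helper predict_force is hypothetical.

```python
import numpy as np
import pandas as pd

# Coefficients reported in the article for the fitted equation.
INTERCEPT, B_TEMP, B_DURATION, B_PASTE = 12.246, 0.238, -0.262, 4.419

def predict_force(temp, duration, paste_qty):
    """Evaluate the article's fitted regression equation for force."""
    return INTERCEPT + B_TEMP * temp + B_DURATION * duration + B_PASTE * paste_qty

# Hypothetical rows: these values are invented, not taken from the article's data set.
df = pd.DataFrame({
    "item": [1, 2, 3, 4],
    "temp": [240.0, 245.0, 250.0, 238.0],
    "duration": [30.0, 32.0, 28.0, 31.0],
    "paste_qty": [1.2, 1.1, 1.3, 1.0],
    "force": [66.8, 67.0, np.nan, 65.2],
})

# Fill only the rows where "force" is missing; observed values stay untouched.
mask = df["force"].isna()
df.loc[mask, "force"] = predict_force(
    df.loc[mask, "temp"], df.loc[mask, "duration"], df.loc[mask, "paste_qty"]
)
print(df.loc[df["item"] == 3])
```

Restricting the assignment to the masked rows keeps the measured "force" values as they were and only replaces the gaps.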

(Note: This article focuses on the procedure, so it does not pay special attention to model performance. In real production use, however, the performance of the model built along the way must be assessed, and the model chosen according to the actual situation.)
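
One way to check the performance of the imputation model, offered here as a sketch rather than as the article's method, is to hold out some of the complete rows, predict them as if they were missing, and compare the predictions with the known values. The data below are synthetic, and the 25% hold-out share and the helper add_intercept are assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic complete rows (an assumption): three predictors and a linear target.
n = 200
X = rng.normal(size=(n, 3))
y = 1.0 + X @ np.array([0.5, -0.3, 2.0]) + rng.normal(0, 0.4, n)

# Hold out 25% of the complete rows to play the role of "missing" values.
idx = rng.permutation(n)
test_idx, train_idx = idx[: n // 4], idx[n // 4 :]

def add_intercept(a):
    """Prepend a column of ones so the model has an intercept term."""
    return np.column_stack([np.ones(len(a)), a])

# Fit on the training rows only, then predict the held-out rows.
coef, *_ = np.linalg.lstsq(add_intercept(X[train_idx]), y[train_idx], rcond=None)
y_hat = add_intercept(X[test_idx]) @ coef

# Root mean squared error and R^2 on the held-out rows.
residual = y[test_idx] - y_hat
rmse = np.sqrt(np.mean(residual ** 2))
r2 = 1.0 - np.sum(residual ** 2) / np.sum((y[test_idx] - y[test_idx].mean()) ** 2)
print(f"hold-out RMSE: {rmse:.3f}, R^2: {r2:.3f}")
```

A low R^2 or a large RMSE on the held-out rows suggests that the chosen predictors do not explain the variable well enough for the filled values to be trusted.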
