
Feature Engineering: basic operation of data cleaning (with Python code)

Editor: Python

Data cleaning methods and steps

The purpose of data cleaning is to improve the performance of the mathematical model by analyzing the original data set for incomplete and erroneous data and by cleaning up abnormal and duplicate records.

Real-world data comes in all kinds of states: for various reasons, data sets contain missing values, errors, and duplicates. Data cleaning (data cleansing) works through a series of cleaning steps, chosen according to the actual situation, to correct erroneous entries, identify abnormal data, delete duplicate values, and output the cleaned data in a format suitable for modeling.

Basic steps of data cleaning:

  1. Identify and handle missing values
  2. Identify and handle outliers
  3. Delete duplicate values

1. Identify and handle missing values

Missing values arise for many reasons. Some observations are simply not recorded during data collection. Others appear because there was no recording standard when the data was entered. For an attribute such as "number of children", for example, you may encounter entries like "No", "nothing", "individual", "zero", "0 individual", "NA", "?", and so on. Some of these mean the value 0, while others are the result of omissions or incorrect entries. Before starting to clean the data, it is best to get an overall picture of the data set by browsing it or by using visualization tools, so that you can make sound judgments and decisions during cleaning.

- Identification of missing values:

Methods to check whether a data set contains missing values:
.info(): shows the number of rows, the non-null count of each column (which reveals missing values), and the data type of each column
.isnull(): flags each missing value; chain .sum() to count the missing values by column
Note: pandas only treats None and NaN as missing values. Entries such as "NA", "nothing", or "?" are regarded as valid data and lead to wrong results, so before using the methods above to detect missing values it is best to browse the original data set with .head() or .sample() first.
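The original code for this step is not available, so the following is only a minimal sketch of the checks described above. The DataFrame and its column names (name, children, income) are hypothetical and chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data set: "children" mixes real values, a true missing
# value (None) and placeholder strings that pandas does not treat as missing.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid", "Dee"],
    "children": ["2", "NA", None, "?"],
    "income": [3000.0, np.nan, 4200.0, 5100.0],
})

# Browse the raw data first to spot placeholder strings such as "NA" or "?".
print(df.head())

# Row count, non-null count per column and data type of each column.
df.info()

# Count the missing values by column.
missing_per_column = df.isnull().sum()
print(missing_per_column)
```

Only the None and NaN entries are counted here; the strings "NA" and "?" still pass as valid data, which is why browsing the data first matters.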

Handling erroneous entries: replace "NA" and "nothing" with the missing value NaN.
Check: after the replacement, the entries "NA" and "nothing" should appear as NaN.
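One possible way to do this replacement is sketched below, assuming the placeholder strings are known in advance. The column name children and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical column with placeholder strings standing in for missing data.
df = pd.DataFrame({"children": ["2", "NA", "nothing", "0", "?"]})

# Replace the known placeholder strings with NaN so pandas sees them as missing.
cleaned = df.replace(["NA", "nothing"], np.nan)

# Check: the two placeholders are now counted as missing values.
n_missing = cleaned["children"].isnull().sum()
print(cleaned)
print(n_missing)
```

The "?" entry is left untouched in this sketch; whether it should also become NaN depends on what it means in the actual data.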

- How to deal with missing values:
  1. Delete records with missing values, using the DataFrame methods dropna() and drop().
    Depending on the actual situation, you can delete either the sample records (rows) or the attributes (columns) that contain missing values, for example after considering what proportion of the valid data the affected records make up. In my own modeling experience, deleting records with missing values outright does not always degrade the model; in some cases the model even performs better than one trained on filled values. If the missing values occur in the response variable, it is generally recommended to delete those records, because the accuracy of the filled values would strongly affect the accuracy of the model.

  2. Fill the missing values with 0: .fillna(0). Note that .fillna("0") would insert the string "0" rather than the number.

  3. Fill the missing values with a statistic (mean, median, mode, etc.), for example fillna(median) or fillna(mean). You can also use the Imputer class from scikit-learn to quickly fill the missing values of an entire data set; newer scikit-learn releases provide SimpleImputer in sklearn.impute in place of the older Imputer class.
    The imputer class can also be used in a machine learning pipeline, which is not covered further here.

  4. Fit a regression model to predict the missing values and fill them in, for example linear regression or tree-based regression.
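The four options above can be sketched as follows. This is an illustration under stated assumptions rather than the author's original code: the data set and the column names age and income are hypothetical, and SimpleImputer and LinearRegression are used as one possible choice of imputer and regression model.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

# Hypothetical data: "income" has two missing values, "age" is complete.
df = pd.DataFrame({
    "age": [25.0, 30.0, 35.0, 40.0, 45.0],
    "income": [2500.0, 3000.0, np.nan, 4000.0, np.nan],
})

# 1. Delete the records (rows) that contain missing values.
dropped = df.dropna()

# 2. Fill the missing values with the number 0.
zero_filled = df.fillna(0)

# 3a. Fill with a column statistic, here the median of the observed values.
median_filled = df.fillna({"income": df["income"].median()})

# 3b. The same idea with scikit-learn, applied to every column at once.
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# 4. Predict the missing values with a regression fitted on the complete rows.
known = df[df["income"].notna()]
unknown = df[df["income"].isna()]
model = LinearRegression().fit(known[["age"]], known["income"])
regression_filled = df.copy()
regression_filled.loc[unknown.index, "income"] = model.predict(unknown[["age"]])
```

Which option fits best depends on the data: deletion is simplest when few records are affected, while regression filling only helps if another attribute really does predict the missing one.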

2. Identify and handle outliers

Besides obviously erroneous data, data cleaning also has to deal with abnormal data. Outliers are individual values in a sample that deviate markedly from the other observed values of the same attribute. Most outliers are clearly larger or smaller than the other values and are easy to spot; less obvious ones can be identified by statistical tests and then removed.

- Identification of outliers
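
The original material for this step is not available. As an illustration of statistical identification, the sketch below applies the interquartile range (IQR) rule, a common convention that is assumed here and not necessarily the method the author used. The sample values are hypothetical.

```python
import pandas as pd

# Hypothetical measurements with one value far away from the rest.
values = pd.Series([10, 12, 11, 13, 12, 11, 95])

# Quartiles and interquartile range of the sample.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Conventional fences: anything beyond 1.5 * IQR from the quartiles is flagged.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

is_outlier = (values < lower) | (values > upper)
outliers = values[is_outlier]
print(outliers)
```

The factor 1.5 is a convention, not a law; a flagged value is a candidate for review, not automatically an error.
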
- Common methods for handling outliers:

When you encounter outliers, it is best to first understand in detail why they occurred, since this may reveal an opportunity to better stabilize or control the performance of the underlying process. Outliers must be handled according to the actual situation. After marking and recording them, you can choose one of the common approaches below:

  1. Delete the records that contain outliers;
  2. Replace them with a statistic such as the mean or the mode;
  3. Treat them as missing values and apply the missing-value methods;
  4. Keep them for the time being and analyze the performance of the model;
  5. Leave them untreated.
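
The first three approaches can be sketched as follows, assuming the outliers have already been flagged. The data set, the column name height_cm, and the plausible range of 100 to 250 are hypothetical choices made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one implausible height.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "height_cm": [170.0, 165.0, 172.0, 999.0, 168.0],
})

# Mark the outliers first (here with an assumed plausible range).
mask = (df["height_cm"] < 100) | (df["height_cm"] > 250)

# 1. Delete the records that contain outliers.
deleted = df[~mask]

# 2. Replace outliers with a statistic of the remaining values (median here).
replaced = df.copy()
replaced.loc[mask, "height_cm"] = df.loc[~mask, "height_cm"].median()

# 3. Treat outliers as missing values, to be handled by the missing-value methods.
as_missing = df.copy()
as_missing.loc[mask, "height_cm"] = np.nan
```

Working on copies keeps the original values available, which makes it easy to compare model performance with and without the treatment.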

3. Delete duplicate values

Example: inspect the duplicate values across the entire data set, count them, identify which records are duplicates, and finally remove them.
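
The original example code is not available; the sketch below shows one way to carry out these steps with pandas. The data set and the column names name and score are hypothetical.

```python
import pandas as pd

# Hypothetical data: the third row repeats the first row exactly.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Ann", "Cid", "Bob"],
    "score": [90, 85, 90, 70, 80],
})

# True for every row that repeats an earlier row in all columns.
dup_mask = df.duplicated()

# Count the duplicate rows.
n_dups = dup_mask.sum()

# Show every row involved in a duplication, including the first occurrence.
dup_rows = df[df.duplicated(keep=False)]

# Remove the duplicates, keeping the first occurrence of each row.
deduped = df.drop_duplicates().reset_index(drop=True)
print(n_dups)
print(dup_rows)
print(deduped)
```

By default rows count as duplicates only when all columns match, so the two Bob rows with different scores are both kept; the subset parameter of duplicated() and drop_duplicates() restricts the comparison to chosen columns.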

