程序師世界是廣大編程愛好者互助、分享、學習的平台,程序師世界有你更精彩!
首頁
編程語言
C語言|JAVA編程
Python編程
網頁編程
ASP編程|PHP編程
JSP編程
數據庫知識
MYSQL數據庫|SqlServer數據庫
Oracle數據庫|DB2數據庫
您现在的位置: 程式師世界 >> 編程語言 >  >> 更多編程語言 >> Python

Operating experience of applying winsorize tail reduction in Python

編輯:Python

Recently, I found that , When shrinking the tail, the data is filled in the place that is originally empty or invalid value . Traditional research will eliminate null values and then shrink the tail , However, some data sets that do not need to eliminate null values need to eliminate extreme values , Therefore, the tail contraction cannot be omitted . Make some records based on your own operating experience :

To save in Excel Take the data in as an example :

from scipy.stats.mstats import winsorizeimport pandas as pddf = pd.read_excel('Excel.xlsx', engine='openpyxl', header=0)df_list=["a","b","c"]# Column names that need to be shortened

1: Direct application Winsorize, Do not consider null and invalid values , Tailing results may cause some null values to be filled with data

for i in df_list(): df[i]=winsorize(df[i],limits=[0.01, 0.01])# The continuous data in the specified column is 1% and 99% Shrinking tail of (Winsorize) Handle

2.1: Mask null and invalid values , Only for other values Winsorize Handle , The shrinking result does not change the original null and invalid values

for i in df_list(): df[i]=np.where(df[i].isnull(), np.nan, winsorize(np.ma.masked_invalid(df[i]),limits=(0.01,0.01)))#np.where(condition, x, y), Satisfy condition yes x, otherwise y# Here to judge whether the value is null , If yes, it is blank , If not, mask null and invalid values 1% and 99% Tail reduction treatment

2.2:winsorize Provided parameters , But I didn't succeed in this method … For reference only

for i in df_list(): df[i]=winsorize(df[i],limits=[0.01, 0.01], nan_policy='omit')

3: Mask null and invalid values , All values are Winsorize Handle , The shrinking result does not change the original null and invalid values , With the method 2 The difference is the method 3 There is no change in the length of data that needs to be shortened

for i in df_list(): mask = df[i].notna() df.loc[mask,i] = winsorize(df[i].loc[mask],limits=[0.01, 0.01]) # This mask It's just one. bool index, Indicate where nan # For example, a column of data is [1, NaN, 2], If you use df['A'].isnan() What you get is a [False, True, False] Array of # This array is called mask, It can be dataframe Select the specific data in

I encountered the problem of negative infinity in the follow-up descriptive statistics , So replace it with a null value

# If you need to replace infinite value with null value df=df.replace(-np.Inf,np.NaN)

( I would like to thank Mr. Zhang who took the trouble to provide me with reference 、 Miss li 、 Miss sun !)

Reference article :

1.Winsorize But in Python Ignored in nan

2. of numpy.ma.masked_invalid Usage of

3.Python Data analysis - Tail reduction treatment

summary

This is about Python Application in Winsorize This is the end of the article on tail reduction , More about Python application Winsorize Please search the previous articles of SDN or continue to browse the relevant articles below. I hope you will support SDN more in the future !



  1. 上一篇文章:
  2. 下一篇文章:
Copyright © 程式師世界 All Rights Reserved