
Four ways to bin numerical values with pandas

Use pandas' between, cut, qcut, and value_counts to discretize numerical variables.

Binning is a common data preprocessing technique, also called bucketing or discretization. It groups ranges of continuous data into "bins" or "buckets". In this article, we discuss 4 ways to bin numerical values with the Python pandas library.

Let's create the following synthetic data to demonstrate:

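import pandas as pd # version 1.3.5
import numpy as np

def create_df():
    # 1000 random integer scores from 0 to 100 (the upper bound 101 is exclusive)
    df = pd.DataFrame({'score': np.random.randint(0, 101, 1000)})
    return df

# Build the demo DataFrame and preview its first rows
df = create_df()
df.head()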

The data contains the test scores of 1000 students, each between 0 and 100. The task is to convert the numerical scores into the grades "A", "B", and "C", where "A" is the best grade and "C" is the worst.
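Because np.random.randint is unseeded, every run produces different scores, so the counts printed later in this article will not reproduce exactly. A minimal sketch of a repeatable variant follows; the function name create_seeded_df and the seed 42 are illustrative assumptions, not part of the original code, and the resulting counts will differ from the ones shown in this article.

```python
import numpy as np
import pandas as pd

def create_seeded_df(seed=42, n=1000):
    """Return n random integer scores in [0, 100]; same seed, same data."""
    rng = np.random.default_rng(seed)
    # integers() excludes the upper bound, so 101 allows a score of 100.
    return pd.DataFrame({"score": rng.integers(0, 101, size=n)})

df = create_seeded_df()
print(df["score"].describe())
```

Calling create_seeded_df() twice with the same seed returns identical frames, which makes it easier to compare the four methods side by side.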

1、between & loc

The pandas .between method returns a Boolean vector that is True wherever the corresponding Series element lies between the boundary values left and right.

It takes the following three parameters:

  • left: the left boundary

  • right: the right boundary

  • inclusive: which boundaries to include. Accepted values are {"both", "neither", "left", "right"}.

Students are graded according to the following interval rules:
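To make the four inclusive settings concrete, here is a small sketch on five hypothetical scores that sit on or next to the boundaries 50 and 80. It assumes pandas 1.3 or later, where inclusive accepts these strings.

```python
import pandas as pd

# Five hypothetical scores on or next to the boundaries 50 and 80.
scores = pd.Series([50, 51, 79, 80, 81])

# One Boolean mask per setting of `inclusive`.
both = scores.between(50, 80, inclusive="both")        # [50, 80]
neither = scores.between(50, 80, inclusive="neither")  # (50, 80)
left = scores.between(50, 80, inclusive="left")        # [50, 80)
right = scores.between(50, 80, inclusive="right")      # (50, 80]

print(pd.DataFrame({"score": scores, "both": both, "neither": neither,
                    "left": left, "right": right}))
```

Only the rows for 50 and 80 change between settings; 51 and 79 are always inside and 81 is always outside.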

  • A: (80, 100]

  • B: (50, 80]

  • C: [0, 50]

A square bracket, [ or ], means the boundary value is included; a parenthesis, ( or ), means it is excluded. We need to determine which interval each score falls into and assign the corresponding grade. Note how the inclusive argument below controls whether each boundary is included:

df.loc[df['score'].between(0, 50, inclusive='both'), 'grade'] = 'C'    # [0, 50]
df.loc[df['score'].between(50, 80, inclusive='right'), 'grade'] = 'B'  # (50, 80]
df.loc[df['score'].between(80, 100, inclusive='right'), 'grade'] = 'A' # (80, 100]

Here is the number of students in each grade:

df.grade.value_counts()
C 488
B 310
A 202
Name: grade, dtype: int64

This method requires a separate line of code for each bin, so it is only practical when there are few bins.
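If the between approach is still preferred with more bins, the per-bin lines can be driven from a table of rules. The following is a sketch under that assumption; the rules list and the small sample of scores are hypothetical and only mirror the three intervals used above.

```python
import pandas as pd

# Hypothetical scores, including values that sit on the boundaries.
df = pd.DataFrame({"score": [0, 35, 50, 51, 80, 81, 100]})

# (left, right, inclusive, grade) for each interval from the article.
rules = [
    (0, 50, "both", "C"),     # [0, 50]
    (50, 80, "right", "B"),   # (50, 80]
    (80, 100, "right", "A"),  # (80, 100]
]

# Apply one between + loc assignment per rule.
for left, right, inclusive, grade in rules:
    mask = df["score"].between(left, right, inclusive=inclusive)
    df.loc[mask, "grade"] = grade

print(df)
```

Adding a grade then means adding one tuple to rules rather than another line of assignment code.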

2、cut

The cut function sorts values into discrete intervals. It is also useful for converting a continuous variable into a categorical one.

The parameters of cut are as follows:

  • x: the array to be binned. Must be one-dimensional.

  • bins: a sequence of scalars that defines the bin edges, which allows bins of non-uniform width.

  • labels: the labels for the returned bins. Must have the same length as the resulting bins, that is, one fewer than the number of bin edges.

  • include_lowest: (bool) whether the first interval should include its left boundary.

bins = [0, 50, 80, 100] 
labels = ['C', 'B', 'A'] 
df['grade'] = pd.cut(x = df['score'], 
                      bins = bins, 
                      labels = labels, 
                      include_lowest = True)

Here bins is a list holding the bin boundary values, and labels is a list holding the corresponding bin labels.

View the number of students in each grade:

df.grade.value_counts()
C 488
B 310
A 202
Name: grade, dtype: int64

The result is the same as in the previous example.
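The include_lowest argument matters here because a score of exactly 0 sits on the left edge of the first bin. A short sketch on four hypothetical scores shows the difference:

```python
import pandas as pd

scores = pd.Series([0, 50, 51, 100])
bins = [0, 50, 80, 100]
labels = ["C", "B", "A"]

# Default: intervals are (0, 50], (50, 80], (80, 100], so 0 falls outside.
without_lowest = pd.cut(scores, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left: [0, 50].
with_lowest = pd.cut(scores, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"score": scores,
                    "default": without_lowest,
                    "include_lowest": with_lowest}))
```

Without include_lowest=True, the score 0 is assigned NaN instead of "C".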

3、qcut

qcut discretizes a variable into equal-sized buckets based on rank or on sample quantiles [3].

In the previous examples, we defined the score interval for each grade ourselves, which left the number of students per grade uneven. In the following example, we try to split the students into 3 grades with (approximately) equal numbers of students. The sample contains 1000 students, so each bin should hold about 333 students.

qcut parameters:

  • x: the input array to be binned. Must be one-dimensional.

  • q: the number of quantiles. 10 means deciles, 4 means quartiles, and so on. It can also be an array of quantiles, for example [0, .25, .5, .75, 1.] for quartiles.

  • labels: the labels for the bins. Must have the same length as the resulting bins.

  • retbins: (bool) whether to also return the bin edges along with the binned values.

df['grade'], cut_bin = pd.qcut(df['score'], 
                          q = 3, 
                          labels = ['C', 'B', 'A'], 
                          retbins = True) 
df.head()

If retbins is set to True, the bin boundaries are returned as well.

print(cut_bin)
>> [  0.  36.  68. 100.]

The score intervals are as follows:

  • C:[0, 36]

  • B:(36, 68]

  • A:(68, 100]

Use .value_counts() to check how many students are in each grade. Ideally, each bin should hold about 333 students.

df.grade.value_counts()
C 340
A 331
B 329
Name: grade, dtype: int64
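The q parameter also accepts explicit quantile edges. The sketch below uses a deterministic stand-in for the scores (each integer from 0 to 99 exactly once) and a hypothetical fourth label "D" to split the data into quartiles; neither is part of the original example.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the scores: each integer 0..99 exactly once.
scores = pd.Series(np.arange(100))

# q given as explicit quantile edges (quartiles) instead of an integer.
grades, edges = pd.qcut(scores,
                        q=[0, 0.25, 0.5, 0.75, 1.0],
                        labels=["D", "C", "B", "A"],
                        retbins=True)

print(edges)                               # bin edges computed from the data
print(grades.value_counts().sort_index())  # number of scores per label
```

With 100 distinct values, each of the four bins receives 25 scores. On real data with many tied values, the bins can only be approximately equal, as in the 340 / 331 / 329 split above.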

4、value_counts

Although pandas .value_counts is usually used to count the unique values in a Series, its bins parameter can also group the values into half-open bins.

df['score'].value_counts(bins = 3, sort = False)

By default, .value_counts sorts the returned Series in descending order of counts. Setting sort to False returns the Series in ascending order of its index, that is, ordered by bin.

(-0.101, 33.333] 310
(33.333, 66.667] 340
(66.667, 100.0] 350
Name: score, dtype: int64

The index of the returned Series gives the range of each bin; a square bracket means the boundary value is included and a parenthesis means it is excluded. The values of the Series are the number of records in each bin.

Unlike .qcut, the number of records in each bin is not necessarily (approximately) equal. .value_counts does not put the same number of records into each category; instead, it splits the score range into 3 equal-width parts based on the lowest and highest scores. The minimum score is 0 and the maximum is 100, so each of the 3 parts spans about 33.33 points. This also explains why the bin boundaries are multiples of 33.33.

We can also define the bin boundaries ourselves by passing in a list of boundaries.
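The difference between equal-width and equal-frequency binning is easiest to see on skewed data. The twelve scores in this sketch are hypothetical and deliberately lopsided: ten low values and two high ones.

```python
import pandas as pd

# Hypothetical, deliberately skewed scores: ten low values, two high ones.
scores = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 90, 100])

# Equal-width bins: the range 0..100 is cut into three parts of ~33.33 each.
equal_width = scores.value_counts(bins=3, sort=False)

# Equal-frequency bins: edges are placed at the 1/3 and 2/3 quantiles.
equal_freq = pd.qcut(scores, q=3).value_counts(sort=False)

print(equal_width)
print(equal_freq)
```

The equal-width bins hold 10, 0, and 2 scores, while the quantile-based bins hold 4 scores each.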

df['score'].value_counts(bins = [0,50,80,100], sort = False)
(-0.001, 50.0] 488
(50.0, 80.0] 310
(80.0, 100.0] 202
Name: score, dtype: int64

This gives the same result as examples 1 and 2.

Summary

In this article, we covered how to use .between, .cut, .qcut, and .value_counts to bin continuous values. The source code for this article is here:
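The agreement between methods 1, 2, and 4 can be checked programmatically. This is a sketch on seeded stand-in data; the seed 0 and the column names g_between and g_cut are illustrative assumptions, and the counts will differ from the ones printed in this article.

```python
import numpy as np
import pandas as pd

# Seeded stand-in for the article's data so the check is repeatable.
rng = np.random.default_rng(0)  # the seed 0 is an arbitrary choice
df = pd.DataFrame({"score": rng.integers(0, 101, size=1000)})

bins = [0, 50, 80, 100]
grades = ["C", "B", "A"]

# Method 1: between + loc
df.loc[df["score"].between(0, 50, inclusive="both"), "g_between"] = "C"
df.loc[df["score"].between(50, 80, inclusive="right"), "g_between"] = "B"
df.loc[df["score"].between(80, 100, inclusive="right"), "g_between"] = "A"

# Method 2: cut
df["g_cut"] = pd.cut(df["score"], bins=bins, labels=grades,
                     include_lowest=True)

# Method 4: value_counts with bins (counts only, ordered by bin)
counts_vc = df["score"].value_counts(bins=bins, sort=False).tolist()

# Count students per grade, in C, B, A order, for methods 1 and 2.
counts_between = [int((df["g_between"] == g).sum()) for g in grades]
counts_cut = [int((df["g_cut"] == g).sum()) for g in grades]

print(counts_between, counts_cut, counts_vc)
```

All three lists should be identical, because each method applies the same intervals [0, 50], (50, 80], and (80, 100].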

[1]

source : https://colab.research.google.com/drive/1yWTl2OzOnxG0jCdmeIN8nV1MoX3KQQ_1%3Fusp%3Dsharing
