
[Python basics] Learn to use pandas to process categorical data!



Datawhale practical content
Author: Geng Yuanhao, Datawhale member, East China Normal University

Categorical data describes the type of a thing: it is obtained by classifying or grouping phenomena according to some attribute. Put plainly, a categorical variable takes only a limited, usually fixed, number of possible values. Examples include gender and blood type.

Today, let's learn how pandas handles categorical data. The article covers the following topics:

Contents of this article

1. Creation and properties of category

1.1. Creation of categorical variables

1.2. Structure of categorical variables

1.3. Category modification

2. Sorting of categorical variables

2.1. Establishment of order

2.2. Sorting

3. Comparison of categorical variables

3.1. Comparison with a scalar or an equal-length sequence

3.2. Comparison with another categorical variable

4. Questions and exercises

4.1. Questions

4.2. Exercises

First, read in the data:

import pandas as pd
import numpy as np
df = pd.read_csv('data/table.csv')
df.head()

1. Creation and properties of category

1.1. Creation of categorical variables

(a) Create with Series

pd.Series(["a", "b", "c", "a"], dtype="category")

(b) Create by specifying the type of a DataFrame column

temp_df = pd.DataFrame({'A':pd.Series(["a", "b", "c", "a"], dtype="category"),'B':list('abcd')})
temp_df.dtypes

(c) Create with the built-in Categorical type

cat = pd.Categorical(["a", "b", "c", "a"], categories=['a','b','c'])
pd.Series(cat)

(d) Create with the cut function; by default the labels are intervals

pd.cut(np.random.randint(0,60,5), [0,10,30,60])

You can also specify strings as the labels

pd.cut(np.random.randint(0,60,5), [0,10,30,60], right=False, labels=['0-10','10-30','30-60'])
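Because the calls above use random input, the effect of right=False is hard to see from one run. The following is a minimal sketch with fixed input values (the name depths and the chosen numbers are illustrative assumptions, not from the original data) that shows where boundary values land in each case.

```python
import pandas as pd

depths = [0, 9, 10, 29, 59]  # fixed, illustrative values instead of random ones
bins = [0, 10, 30, 60]

# Default: right-closed intervals (0, 10], (10, 30], (30, 60].
# The value 0 lies outside the first interval, so it becomes NaN.
default_cut = pd.cut(depths, bins)

# right=False: left-closed intervals [0, 10), [10, 30), [30, 60), with string labels.
labeled_cut = pd.cut(depths, bins, right=False, labels=['0-10', '10-30', '30-60'])

print(default_cut)
print(labeled_cut)
print(labeled_cut.ordered)  # cut produces an ordered categorical
```

Note that the value 10 falls in the first bin with the default settings but in the second bin with right=False.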

1.2. Structure of categorical variables

A categorical variable consists of three parts: the element values (values), the categories (categories), and whether it is ordered (ordered). As the output above shows, a categorical variable created with the cut function is ordered by default. The following describes how to get or modify these properties.

(a) The describe method

This method summarizes a categorical sequence: the number of non-missing values, the number of distinct element values (not the number of categories), the most frequent element, and its frequency.

s = pd.Series(pd.Categorical(["a", "b", "c", "a",np.nan], categories=['a','b','c','d']))
s.describe()

(b) The categories and ordered attributes show the categories and whether the variable is ordered

s.cat.categories

Index(['a', 'b', 'c', 'd'], dtype='object')

s.cat.ordered

False
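As a supplementary sketch of the three-part structure, the snippet below also prints the integer codes that pandas stores for each element (the cat.codes accessor is not used in the original text and is shown here only for illustration).

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.Categorical(["a", "b", "c", "a", np.nan], categories=['a', 'b', 'c', 'd']))

# Each element is stored as the position of its category; -1 marks a missing value.
print(s.cat.codes.tolist())
print(list(s.cat.categories))
print(s.cat.ordered)

# describe counts distinct observed values (3), not the number of categories (4).
summary = s.describe()
print(summary)
```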

1.3. Category modification

(a) Use set_categories to replace the set of categories. Values that remain in the new categories are unchanged, while values whose category is no longer listed become NaN

s = pd.Series(pd.Categorical(["a", "b", "c", "a",np.nan], categories=['a','b','c','d']))
s.cat.set_categories(['new_a','c'])
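To make the effect of this call concrete, here is a small sketch that inspects the result (the variable name result is introduced only for this illustration).

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.Categorical(["a", "b", "c", "a", np.nan], categories=['a', 'b', 'c', 'd']))
result = s.cat.set_categories(['new_a', 'c'])

# Only 'c' survives: 'a' and 'b' are no longer categories, so those elements become NaN.
# 'new_a' is a valid category but no element takes that value.
print(result.tolist())
print(list(result.cat.categories))

# The original Series is not modified; set_categories returns a new object.
print(s.tolist())
```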

(b) Use rename_categories to rename categories. Note that this method changes both the categories and the corresponding values

s = pd.Series(pd.Categorical(["a", "b", "c", "a",np.nan], categories=['a','b','c','d']))
s.cat.rename_categories(['new_%s'%i for i in s.cat.categories])

A dictionary can be used to rename only selected categories

s.cat.rename_categories({'a':'new_a','b':'new_b'})

(c) Use add_categories to add categories

s = pd.Series(pd.Categorical(["a", "b", "c", "a",np.nan], categories=['a','b','c','d']))
s.cat.add_categories(['e'])

(d) Use remove_categories to remove categories

s = pd.Series(pd.Categorical(["a", "b", "c", "a",np.nan], categories=['a','b','c','d']))
s.cat.remove_categories(['d'])

(e) Remove the categories that no element value uses

s = pd.Series(pd.Categorical(["a", "b", "c", "a",np.nan], categories=['a','b','c','d']))
s.cat.remove_unused_categories()
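The examples above only add or remove the unused category 'd'. As a hedged supplement, the sketch below shows what happens in two edge cases: removing a category that is in use, and adding a category that already exists (the variable names removed, trimmed and add_failed are illustrative).

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.Categorical(["a", "b", "c", "a", np.nan], categories=['a', 'b', 'c', 'd']))

# Removing a category that is in use turns the matching elements into NaN.
removed = s.cat.remove_categories(['a'])
print(removed.tolist())

# Dropping unused categories only discards 'd'; the element values are untouched.
trimmed = s.cat.remove_unused_categories()
print(list(trimmed.cat.categories))

# Adding a category that already exists is rejected with a ValueError.
add_failed = False
try:
    s.cat.add_categories(['a'])
except ValueError as exc:
    add_failed = True
    print('ValueError:', exc)
```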

2. Sorting of categorical variables

As mentioned earlier, categorical data is either ordered or unordered. This is easy to understand: for example, score ranges from low to high form an ordered variable, whereas exam subjects are generally regarded as an unordered variable.

2.1. Establishment of order

(a) In general, the as_ordered method converts a sequence into an ordered variable

s = pd.Series(["a", "d", "c", "a"]).astype('category').cat.as_ordered()
s

To turn it back into an unordered variable, just use as_unordered

s.cat.as_unordered()

(b) Use the ordered parameter of the set_categories method

pd.Series(["a", "d", "c", "a"]).astype('category').cat.set_categories(['a','c','d'],ordered=True)

(c) Use the reorder_categories method. Its distinguishing feature is that the new categories must be the same set as the original categories

s = pd.Series(["a", "d", "c", "a"]).astype('category')
s.cat.reorder_categories(['a','c','d'],ordered=True)

#s.cat.reorder_categories(['a','c'],ordered=True) # raises ValueError: a category is missing
#s.cat.reorder_categories(['a','c','d','e'],ordered=True) # raises ValueError: 'e' is not an existing category
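The two failing calls can be checked without aborting a script by catching the exception. This is a minimal sketch; the reversed order ['d', 'c', 'a'] and the names reordered, bad and errors are chosen only for illustration.

```python
import pandas as pd

s = pd.Series(["a", "d", "c", "a"]).astype('category')

# A valid reordering: same set of categories, different order.
reordered = s.cat.reorder_categories(['d', 'c', 'a'], ordered=True)
print(reordered.min(), reordered.max())  # min and max follow the category order

# Both invalid calls raise ValueError because the set of categories differs.
errors = []
for bad in (['a', 'c'], ['a', 'c', 'd', 'e']):
    try:
        s.cat.reorder_categories(bad, ordered=True)
    except ValueError as exc:
        errors.append(str(exc))
print(errors)
```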

2.2. Sorting

The value sorting and index sorting introduced in Chapter 1 both apply to categorical variables

s = pd.Series(np.random.choice(['perfect','good','fair','bad','awful'],50)).astype('category')
# Assign the result back: set_categories returns a new Series and leaves s unchanged
s = s.cat.set_categories(['perfect','good','fair','bad','awful'][::-1],ordered=True)
s.head()

s.sort_values(ascending=False).head()

df_sort = pd.DataFrame({'cat':s.values,'value':np.random.randn(50)}).set_index('cat')
df_sort.head()

df_sort.sort_index().head()
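Since the example above draws random data, a small deterministic sketch may help show that sorting follows the category order rather than alphabetical order. The level names 'low', 'medium', 'high' and the variable names are illustrative assumptions.

```python
import pandas as pd

levels = pd.Series(['medium', 'low', 'high', 'low']).astype('category')

# Without an explicit order the categories are sorted alphabetically: high, low, medium.
print(list(levels.cat.categories))

# Establish the intended order low < medium < high.
levels = levels.cat.set_categories(['low', 'medium', 'high'], ordered=True)

# Value sorting follows the category order.
sorted_values = levels.sort_values()
print(sorted_values.tolist())

# Index sorting follows the category order as well.
frame = pd.DataFrame({'level': levels.values, 'value': [1, 2, 3, 4]}).set_index('level')
sorted_frame = frame.sort_index()
print(sorted_frame)
```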

3. Comparison of categorical variables

3.1. Comparison with a scalar or an equal-length sequence

(a) Comparison with a scalar

s = pd.Series(["a", "d", "c", "a"]).astype('category')
s == 'a'

(b) Comparison with an equal-length sequence

s == list('abcd')

3.2. Comparison with another categorical variable

(a) Equality comparison (== and !=): the two categorical variables must have exactly the same categories.

s = pd.Series(["a", "d", "c", "a"]).astype('category')
s == s

s != s

s_new = s.cat.set_categories(['a','d','e'])
#s == s_new # raises TypeError: the categories differ

(b) Inequality comparison (>=, <=, <, >): the two categorical variables must satisfy two conditions: ① the categories are exactly the same, ② the ordering is exactly the same

s = pd.Series(["a", "d", "c", "a"]).astype('category')
#s >= s # raises TypeError: the variable is unordered

s = pd.Series(["a", "d", "c", "a"]).astype('category').cat.reorder_categories(['a','c','d'],ordered=True)
s >= s
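The comparison rules of this section can be verified in one pass. The sketch below catches the errors instead of commenting the lines out; the helper list failures is introduced only for illustration.

```python
import pandas as pd

s = pd.Series(["a", "d", "c", "a"]).astype('category')
s_new = s.cat.set_categories(['a', 'd', 'e'])

failures = []

# Equality between categoricals with different categories is rejected.
try:
    s == s_new
except TypeError as exc:
    failures.append(('==', str(exc)))

# Inequality on an unordered categorical is rejected.
try:
    s >= s
except TypeError as exc:
    failures.append(('>=', str(exc)))

# Once an order is established, inequality comparisons work,
# both against the same variable and against a scalar that is one of the categories.
ordered = s.cat.reorder_categories(['a', 'c', 'd'], ordered=True)
print((ordered >= ordered).tolist())
print((ordered > 'a').tolist())
print(failures)
```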

4. Questions and exercises

4.1. Questions

[Question 1] How is the union_categoricals method used? What does it do?

To combine categoricals that do not necessarily have the same categories, use the union_categoricals function, which combines a list-like of categoricals. The new categories are the union of the categories being combined, as shown below:

from pandas.api.types import union_categoricals
a = pd.Categorical(['b','c'])
b = pd.Categorical(['a','b'])
union_categoricals([a,b])

By default, the resulting categories are ordered as they appear in the data. If you want the categories sorted, pass sort_categories=True.
union_categoricals also works when combining two categoricals that have the same categories and the same ordering information.
union_categoricals may recode the integer codes of the categories when combining categoricals.
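The three remarks above can be observed directly. This is a minimal sketch reusing the categoricals a and b from the example; the names merged and merged_sorted are illustrative.

```python
import pandas as pd
from pandas.api.types import union_categoricals

a = pd.Categorical(['b', 'c'])
b = pd.Categorical(['a', 'b'])

# Default: categories appear in the order in which they are encountered.
merged = union_categoricals([a, b])
print(list(merged.categories), merged.codes.tolist())

# sort_categories=True sorts the categories, which changes the integer codes
# while leaving the element values the same.
merged_sorted = union_categoricals([a, b], sort_categories=True)
print(list(merged_sorted.categories), merged_sorted.codes.tolist())
print(list(merged_sorted))
```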

[Question 2] When two sequences are concatenated vertically with the concat method, is the result always a categorical variable? In which cases is it not?
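The original text leaves this question open. A small experiment suggests an answer: the result appears to stay categorical when both inputs share the same categories, and falls back to a non-categorical dtype otherwise. The sketch below is one way to check this, with illustrative variable names; it is not a complete characterization of every case.

```python
import pandas as pd

same_1 = pd.Series(['a', 'b'], dtype='category')
same_2 = pd.Series(['b', 'a'], dtype='category')   # same categories: a, b
other = pd.Series(['a', 'c'], dtype='category')    # different categories: a, c
plain = pd.Series(['a', 'b'])                      # not categorical at all

same_result = pd.concat([same_1, same_2])
other_result = pd.concat([same_1, other])
plain_result = pd.concat([same_1, plain])

# Check each result dtype against CategoricalDtype.
for name, result in [('same', same_result), ('other', other_result), ('plain', plain_result)]:
    print(name, result.dtype, isinstance(result.dtype, pd.CategoricalDtype))
```

If a categorical result is needed despite differing categories, one option is to combine the underlying categoricals with union_categoricals first.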

[Question 3] When the groupby method or the value_counts method is used, how do the results for categorical variables differ from those for ordinary variables?

For categorical variables, groupby and value_counts aggregate over the categories.
For ordinary variables, groupby and value_counts aggregate over the unique values (excluding NA).
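The difference shows up when a category has no observations. In this sketch (all names illustrative) the unused category 'c' is reported with a count of 0 for the categorical Series but is absent for the ordinary one. For groupby, observed=False is passed explicitly, since the default of that parameter differs between pandas versions.

```python
import numpy as np
import pandas as pd

cat_s = pd.Series(pd.Categorical(['a', 'b', 'a', np.nan], categories=['a', 'b', 'c']))
obj_s = pd.Series(['a', 'b', 'a', np.nan])

# value_counts on a categorical reports every category, including the unused 'c'.
cat_counts = cat_s.value_counts()
# value_counts on an ordinary Series reports only the values that occur (NA excluded).
obj_counts = obj_s.value_counts()
print(cat_counts)
print(obj_counts)

# groupby with observed=False also produces a group for the unused category.
frame = pd.DataFrame({'key': cat_s, 'value': [1, 2, 3, 4]})
grouped = frame.groupby('key', observed=False)['value'].sum()
print(grouped)
```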

[Question 4] The following code illustrates a "pitfall" of creating a categorical variable from a Series. How can it be avoided? (Hint: use the copy parameter of Series)

cat = pd.Categorical([1, 2, 3, 10], categories=[1, 2, 3, 4, 10])
s = pd.Series(cat, name="cat")
cat

s.iloc[0:2] = 10
cat
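Following the hint, one possible way to avoid the pitfall is to pass copy=True so that the Series does not share data with the original Categorical. This is a sketch of that approach; whether the unmodified code above actually changes cat depends on the pandas version and its copy semantics.

```python
import pandas as pd

cat = pd.Categorical([1, 2, 3, 10], categories=[1, 2, 3, 4, 10])

# copy=True gives the Series its own data, independent of cat.
s = pd.Series(cat, name="cat", copy=True)
s.iloc[0:2] = 10

print(s.tolist())   # the Series is modified
print(list(cat))    # the original Categorical is left untouched
```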

4.2. Exercises

[Exercise 1] Continuing with the earthquake data set from Chapter 4, solve the following problems:

(a) Divide the depth into seven levels with the bins [0,5,10,15,20,30,50,np.inf], use the depth grades Ⅰ,Ⅱ,Ⅲ,Ⅳ,Ⅴ,Ⅵ,Ⅶ as the index, and sort from shallow to deep.

Use the cut method to bin the depth column, set that column as the index, and then sort by index.

df = pd.read_csv('data/Earthquake.csv')
df_result = df.copy()
df_result[' depth '] = pd.cut(df[' depth '],[0,5,10,15,20,30,50,np.inf], right=False, labels=['Ⅰ','Ⅱ','Ⅲ','Ⅳ','Ⅴ','Ⅵ','Ⅶ'])
df_result = df_result.set_index(' depth ').sort_index()
df_result.head()

(b) Building on (a), divide the intensity into 4 levels with the bins [0,3,4,5,np.inf], then sort the depth and intensity grades of the southern region using a multi-level index.

This is very similar to (a): use the cut method to bin both the depth and the intensity, set the index to the depth and earthquake intensity columns, and then sort by index.

df[' earthquake intensity '] = pd.cut(df[' earthquake intensity '],[0,3,4,5,np.inf], right=False, labels=['Ⅰ','Ⅱ','Ⅲ','Ⅳ'])
df[' depth '] = pd.cut(df[' depth '],[0,5,10,15,20,30,50,np.inf], right=False, labels=['Ⅰ','Ⅱ','Ⅲ','Ⅳ','Ⅴ','Ⅵ','Ⅶ'])
df_ds = df.set_index([' depth ',' earthquake intensity '])
df_ds.sort_index()

[Exercise 2] For categorical variables, calling the reshaping functions from Chapter 4 exposes a bug (not fixed in the current version). Take the crosstab function: according to the official documentation, even categories that never occur should appear in the reshaped summary, but in practice they do not. The example below, for instance, is missing the row 'c' and the column 'f' that should have appeared. Based on this problem, try to design a my_crosstab function that returns the functionally correct result.

Because the categories necessarily contain every possible value, take the categories of the first argument as the index and those of the second argument as the columns, build a DataFrame of zeros, and then add 1 at the position of each pair of values that actually occurs.

foo = pd.Categorical(['b','a'], categories=['a', 'b', 'c'])
bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])
def my_crosstab(a, b):
    # Use the arguments a and b (not the globals foo and bar) so the function works for any input
    s1 = pd.Series(list(a.categories), name='row')
    s2 = list(b.categories)
    df = pd.DataFrame(np.zeros((len(s1), len(s2)),int),index=s1, columns=s2)
    index_1 = list(a)
    index_2 = list(b)
    for loc in zip(index_1, index_2):
        # Accumulate so that repeated (row, column) pairs are counted
        df.loc[loc] += 1
    return df
my_crosstab(foo, bar)
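As an alternative sketch, not the author's solution, the same table can be obtained by running crosstab on the plain values and then reindexing against the full category lists. The function name crosstab_all_categories is hypothetical, and the approach assumes the inputs contain no missing values.

```python
import numpy as np
import pandas as pd

def crosstab_all_categories(a, b):
    """Cross-tabulate two Categoricals, keeping categories that never occur."""
    # Count co-occurrences on the plain values, so only observed labels appear.
    observed = pd.crosstab(np.asarray(a), np.asarray(b))
    # Reindex against the full category lists; missing rows and columns are filled with 0.
    return observed.reindex(index=list(a.categories),
                            columns=list(b.categories),
                            fill_value=0)

foo = pd.Categorical(['b', 'a'], categories=['a', 'b', 'c'])
bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])
table = crosstab_all_categories(foo, bar)
print(table)
```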