Categorical data in pandas

Official account: Youer cottage. Author: Peter. Editor: Pete.

Hello everyone, I'm Peter.

This article introduces the Categorical type, which pandas uses for categorical data. It stores values as integer-based category codes, which can give better performance and lower memory usage.

Background: counting repeated values

A Series often contains repeated values. We frequently need to extract the distinct values and count how often each one occurs:

import numpy as np
import pandas as pd
data = pd.Series(["Chinese language and literature", "mathematics", "English", "mathematics", "English", "Geography", "Chinese language and literature", "Chinese language and literature"])
data
0    Chinese language and literature
1                        mathematics
2                            English
3                        mathematics
4                            English
5                          Geography
6    Chinese language and literature
7    Chinese language and literature
dtype: object
# 1. Extract the distinct values
pd.unique(data)
array(['Chinese language and literature', 'mathematics', 'English', 'Geography'], dtype=object)
# 2. Count the occurrences of each value
data.value_counts()
Chinese language and literature    3
mathematics                        2
English                            2
Geography                          1
dtype: int64

Categorical (dictionary-encoded) representation

Representing values as integers is called a categorical or dictionary-encoded representation. The array of distinct values can be called the categories, dictionary, or levels of the data.

df = pd.Series([0, 1, 1, 0] * 2)
df
0    0
1    1
2    1
3    0
4    0
5    1
6    1
7    0
dtype: int64
# dim is the dimension table
dim = pd.Series(["Chinese language and literature", "mathematics"])
dim
0    Chinese language and literature
1                        mathematics
dtype: object

How do we map each 0 in df to Chinese language and literature and each 1 to mathematics? Use the **take** method:

df1 = dim.take(df)
df1
0    Chinese language and literature
1                        mathematics
1                        mathematics
0    Chinese language and literature
0    Chinese language and literature
1                        mathematics
1                        mathematics
0    Chinese language and literature
dtype: object
type(df1)  # the result is a Series
pandas.core.series.Series
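
Going the other way, from raw values to integer codes plus a table of distinct values, can be sketched with pd.factorize. This is only a minimal illustration; the labels in raw are made up and are not part of the example above:

```python
import pandas as pd

# Hypothetical raw labels, used only for illustration
raw = pd.Series(["apple", "orange", "apple", "apple", "orange"])

# factorize returns one integer code per row and the distinct values
codes, uniques = pd.factorize(raw)

# take maps every code back to its label, reversing the encoding
restored = uniques.take(codes)

print(codes)           # integer code for each row
print(list(uniques))   # distinct values in order of first appearance
print(list(restored))  # the original labels again
```

Here uniques plays the role of the dimension table and codes the role of the integer Series.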

Creating the Categorical type

Creating a Categorical instance

The following example illustrates how the Categorical type is used:

subjects = ["Chinese language and literature", "mathematics", "Chinese language and literature", "Chinese language and literature"] * 2
N = len(subjects)
df2 = pd.DataFrame({
    "subject": subjects,
    "id": np.arange(N),  # consecutive integers
    "score": np.random.randint(3, 15, size=N),  # random integers in [3, 15)
    "height": np.random.uniform(165, 180, size=N)  # uniformly distributed floats
},
columns=["id", "subject", "score", "height"])  # specify the column order
df2

The subject column can be converted to the Categorical type:

subject_cat = df2["subject"].astype("category")
subject_cat

subject_cat has two notable characteristics:

  • Its values are not a NumPy array; they have the category dtype
  • It contains two distinct values: Chinese language and literature, and mathematics
s = subject_cat.values
s
['Chinese language and literature', 'mathematics', 'Chinese language and literature', 'Chinese language and literature', 'Chinese language and literature', 'mathematics', 'Chinese language and literature', 'Chinese language and literature']
Categories (2, object): ['Chinese language and literature', 'mathematics']
type(s)
pandas.core.arrays.categorical.Categorical
s.categories  # view the categories
Index(['Chinese language and literature', 'mathematics'], dtype='object')
s.codes  # view the category codes
array([0, 1, 0, 0, 0, 1, 0, 0], dtype=int8)
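
The relationship between categories and codes can be checked with a small sketch: indexing the categories with the codes should rebuild the original values. The single-letter labels below are illustrative only:

```python
import numpy as np
import pandas as pd

# Hypothetical labels; pandas sorts the inferred categories
cat = pd.Categorical(["b", "a", "b", "c"])

# Each code is a position in the categories index
rebuilt = np.asarray(cat.categories)[cat.codes]

print(list(cat.categories))  # sorted distinct values
print(cat.codes)             # position of each value in categories
print(list(rebuilt))         # original values recovered from codes
```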

How to create a Categorical object

There are three main ways:

  • Convert a DataFrame column to the category dtype
  • Call pandas.Categorical directly
  • Use the from_codes constructor, which requires that you already have the categories and the codes
# Method 1
df2["subject"] = df2["subject"].astype("category")
df2.subject
0    Chinese language and literature
1                        mathematics
2    Chinese language and literature
3    Chinese language and literature
4    Chinese language and literature
5                        mathematics
6    Chinese language and literature
7    Chinese language and literature
Name: subject, dtype: category
Categories (2, object): ['Chinese language and literature', 'mathematics']
# Method 2
fruit = pd.Categorical(["Apple", "Banana", "grapes", "Apple", "Apple", "Banana"])
fruit
['Apple', 'Banana', 'grapes', 'Apple', 'Apple', 'Banana']
Categories (3, object): ['Apple', 'Banana', 'grapes']
# Method 3
categories = ["height", "score", "subject"]
codes = [0, 1, 0, 2, 1, 0]
my_data = pd.Categorical.from_codes(codes, categories)
my_data
['height', 'score', 'height', 'subject', 'score', 'height']
Categories (3, object): ['height', 'score', 'subject']
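
When the set of categories is known in advance, it can also be passed explicitly through the categories parameter of pd.Categorical. The sketch below uses made-up size labels and shows that a value outside the given categories becomes missing, with code -1:

```python
import pandas as pd

# Hypothetical labels; "huge" is deliberately not among the categories
sizes = pd.Categorical(
    ["small", "large", "medium", "huge"],
    categories=["small", "medium", "large"],
)

print(list(sizes.categories))  # order is kept as passed in, not sorted
print(sizes.codes)             # -1 marks a value outside the categories
print(sizes.isna())            # the unknown value is treated as missing
```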

By default, a categorical conversion does not assume any ordering of the categories. Pass ordered=True to indicate that the order is meaningful:

pd.Categorical.from_codes(codes, categories, ordered=True)
['height', 'score', 'height', 'subject', 'score', 'height']
Categories (3, object): ['height' < 'score' < 'subject']

In the output above, height < score means that height comes before score in the ordering. If a categorical instance is unordered, as_ordered returns an ordered version of it:

# my_data is unordered
my_data.as_ordered()
['height', 'score', 'height', 'subject', 'score', 'height']
Categories (3, object): ['height' < 'score' < 'subject']
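
What an ordering buys in practice can be sketched as follows: ordered categoricals support min, max and order comparisons, while unordered ones reject them. The variable names are illustrative:

```python
import pandas as pd

categories = ["height", "score", "subject"]
codes = [0, 1, 0, 2, 1, 0]

ordered = pd.Series(pd.Categorical.from_codes(codes, categories, ordered=True))

# min/max and comparisons follow the category order, not alphabetical order
lowest, highest = ordered.min(), ordered.max()
above_height = (ordered > "height").tolist()
print(lowest, highest)
print(above_height)

# The same comparison on an unordered categorical raises TypeError
unordered = ordered.cat.as_unordered()
try:
    unordered > "height"
    comparison_failed = False
except TypeError:
    comparison_failed = True
print(comparison_failed)
```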

Computations with Categorical objects

Statistical computations

np.random.seed(12345)
data1 = np.random.randn(100)
data1[:10]
array([-0.20470766, 0.47894334, -0.51943872, -0.5557303 , 1.96578057,
 1.39340583, 0.09290788, 0.28174615, 0.76902257, 1.24643474])
# Cut data1 into 4 quantile bins (quartiles)
bins_1 = pd.qcut(data1, 4)
bins_1
[(-0.717, 0.106], (0.106, 0.761], (-0.717, 0.106], (-0.717, 0.106], (0.761, 3.249], ..., (0.761, 3.249], (0.106, 0.761], (-2.371, -0.717], (0.106, 0.761], (0.106, 0.761]]
Length: 100
Categories (4, interval[float64]): [(-2.371, -0.717] < (-0.717, 0.106] < (0.106, 0.761] < (0.761, 3.249]]

The result above is a Categorical object:

  • It has 4 categories
  • The minimum and maximum of the whole data set sit at the outer edges of the first and last bins
# Label the 4 quantile bins with quartile names: Q1, Q2, Q3, Q4
bins_2 = pd.qcut(data1, 4, labels=["Q1", "Q2", "Q3", "Q4"])
bins_2
['Q2', 'Q3', 'Q2', 'Q2', 'Q4', ..., 'Q4', 'Q3', 'Q1', 'Q3', 'Q3']
Length: 100
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']
bins_2.codes[:10]
array([1, 2, 1, 1, 3, 3, 1, 2, 3, 3], dtype=int8)
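
To see what quantile binning means, the sketch below compares pd.qcut with pd.cut on freshly generated data (a different random sample from data1, chosen only for illustration). qcut aims for equally populated bins, while cut uses equal-width bins:

```python
import numpy as np
import pandas as pd

# Illustrative sample, not the data1 array used above
rng = np.random.default_rng(0)
values = rng.standard_normal(100)

quantile_bins = pd.qcut(values, 4, labels=["Q1", "Q2", "Q3", "Q4"])
width_bins = pd.cut(values, 4)

# Equal counts per quantile bin; equal-width bins are usually uneven
quantile_counts = pd.Series(quantile_bins).value_counts(sort=False)
width_counts = pd.Series(width_bins).value_counts(sort=False)
print(quantile_counts)
print(width_counts)
```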

Use groupby to compute summary statistics:

bins_2 = pd.Series(bins_2, name="quartile")  # name the Series quartile
bins_2
0     Q2
1     Q3
2     Q2
3     Q2
4     Q4
      ..
95    Q4
96    Q3
97    Q1
98    Q3
99    Q3
Name: quartile, Length: 100, dtype: category
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']

The following example groups data1 by bins_2 and computes three summary statistics:

results = pd.Series(data1).groupby(bins_2).agg(["count", "min", "max"]).reset_index()
results
results["quartile"]  # the quartile column retains the original categorical information
0    Q1
1    Q2
2    Q3
3    Q4
Name: quartile, dtype: category
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']
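
One detail worth knowing when grouping by a categorical key is the observed parameter of groupby, which controls whether categories that never occur still get a group. A minimal sketch with made-up data:

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0])
# Category "c" is declared but never occurs in the key
key = pd.Series(pd.Categorical(["a", "b", "a"], categories=["a", "b", "c"]))

# observed=False keeps every declared category, even empty ones
all_groups = values.groupby(key, observed=False).count()
# observed=True keeps only categories that actually occur
seen_groups = values.groupby(key, observed=True).count()

print(all_groups)
print(seen_groups)
```

Passing observed explicitly also avoids depending on the default, which differs between pandas versions.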

Memory savings with categoricals

N = 10000000  # 10 million rows
data3 = pd.Series(np.random.randn(N))
labels3 = pd.Series(["foo", "bar", "baz", "quz"] * (N // 4))
categories3 = labels3.astype("category")  # convert to categorical
# Compare the memory usage of the two
print("data3: ", data3.memory_usage())
print("categories3: ", categories3.memory_usage())
data3: 80000128
categories3: 10000332
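
A like-for-like comparison is the string labels against their categorical version. The sketch below uses a smaller, arbitrarily chosen n and deep=True so that the string contents themselves are counted, not only the references to them:

```python
import pandas as pd

n = 100_000  # smaller than above, chosen only to keep the sketch fast
labels = pd.Series(["foo", "bar", "baz", "quz"] * (n // 4))
cats = labels.astype("category")

# deep=True includes the memory held by the string values
labels_bytes = labels.memory_usage(deep=True)
cats_bytes = cats.memory_usage(deep=True)

print(labels_bytes, cats_bytes)
print(cats.cat.codes.dtype)  # small integer codes, one per row
```

The saving comes from storing one small integer per row plus a single copy of each distinct label.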

Categorical methods

Accessing categorical information

Categorical methods are accessed mainly through the special cat accessor:

data
0    Chinese language and literature
1                        mathematics
2                            English
3                        mathematics
4                            English
5                          Geography
6    Chinese language and literature
7    Chinese language and literature
dtype: object
cat_data = data.astype("category")
cat_data  # categorical data
0    Chinese language and literature
1                        mathematics
2                            English
3                        mathematics
4                            English
5                          Geography
6    Chinese language and literature
7    Chinese language and literature
dtype: category
Categories (4, object): ['Chinese language and literature', 'English', 'Geography', 'mathematics']
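
As a small sketch of what the cat accessor exposes on a categorical Series (the single-letter labels are illustrative only):

```python
import pandas as pd

cat_s = pd.Series(["a", "b", "c", "a"], dtype="category")

print(cat_s.cat.categories)  # Index of the distinct categories
print(cat_s.cat.codes)       # Series of integer codes, aligned with cat_s
print(cat_s.cat.ordered)     # whether the categories are ordered
```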

Adding categories

Suppose the actual set of categories goes beyond the 4 values observed in the data:

actual_cat = ["Chinese language and literature", "mathematics", "English", "Geography", "biology"]
cat_data2 = cat_data.cat.set_categories(actual_cat)
cat_data2

The categories of the result above now include "biology".

cat_data.value_counts()
Chinese language and literature    3
English                            2
mathematics                        2
Geography                          1
dtype: int64
cat_data2.value_counts()  # the result below includes "biology"
Chinese language and literature    3
mathematics                        2
English                            2
Geography                          1
biology                            0
dtype: int64
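
If the goal is only to append categories, add_categories is an alternative to set_categories. The sketch below, with made-up labels, contrasts the two: add_categories keeps every existing category, whereas set_categories replaces the whole set, so values whose category is left out become missing:

```python
import pandas as pd

s = pd.Series(["a", "b", "a"], dtype="category")

# Append a category that does not occur in the data yet
added = s.cat.add_categories(["c"])

# Replace the full set of categories; "b" is dropped here
replaced = s.cat.set_categories(["a", "c"])

print(list(added.cat.categories))
print(list(replaced.cat.categories))
print(replaced.isna().tolist())  # the former "b" value is now missing
```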

Removing categories

cat_data3 = cat_data[cat_data.isin(["Chinese language and literature", "mathematics"])]  # keep only Chinese language and literature, and mathematics
cat_data3
0    Chinese language and literature
1                        mathematics
3                        mathematics
6    Chinese language and literature
7    Chinese language and literature
dtype: category
Categories (4, object): ['Chinese language and literature', 'English', 'Geography', 'mathematics']
cat_data3.cat.remove_unused_categories()  # remove the unused categories
0    Chinese language and literature
1                        mathematics
3                        mathematics
6    Chinese language and literature
7    Chinese language and literature
dtype: category
Categories (2, object): ['Chinese language and literature', 'mathematics']
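
Filtering first and then calling remove_unused_categories is one route. A related method is remove_categories, which removes named categories directly and turns the affected values into missing values. A minimal sketch with illustrative labels:

```python
import pandas as pd

s = pd.Series(["a", "b", "a", "c"], dtype="category")

# Rows stay in place; values in the removed category become NaN
removed = s.cat.remove_categories(["b"])

print(list(removed.cat.categories))
print(removed.isna().tolist())
print(len(removed) == len(s))  # no rows are dropped
```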

Creating dummy variables

Converting categorical data into dummy variables is also known as one-hot encoding. Each distinct category becomes a column of the resulting DataFrame; see the following example:

data4 = pd.Series(["col1", "col2", "col3", "col4"] * 2, dtype="category")
data4
0    col1
1    col2
2    col3
3    col4
4    col1
5    col2
6    col3
7    col4
dtype: category
Categories (4, object): ['col1', 'col2', 'col3', 'col4']
pd.get_dummies(data4)  # get_dummies: converts one-dimensional categorical data into a DataFrame of dummy variables
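
The shape of the result can be checked with a short sketch. It passes dtype=int, which is an addition here and not part of the call above, so that the indicator columns are 0/1 integers regardless of the pandas version's default:

```python
import pandas as pd

data = pd.Series(["col1", "col2", "col3", "col4"] * 2, dtype="category")

# One indicator column per category, one row per original value
dummies = pd.get_dummies(data, dtype=int)

print(dummies)
print(dummies.sum(axis=1).tolist())  # exactly one 1 in every row
```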

Summary of categorical methods

  • add_categories: append new (unused) categories after the existing ones
  • as_ordered: make the categories ordered
  • as_unordered: make the categories unordered
  • remove_categories: remove the given categories; values that belonged to them are set to null
  • remove_unused_categories: remove all categories that do not appear in the data
  • rename_categories: replace the category names; the number of categories does not change
  • reorder_categories: reorder the categories; the new list must contain exactly the existing categories
  • set_categories: replace the categories with the specified set of new categories; categories can be added or removed
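
Two of the methods listed above, rename_categories and reorder_categories, can be sketched on a small Series with made-up labels:

```python
import pandas as pd

s = pd.Series(["low", "high", "mid", "low"], dtype="category")

# rename_categories changes the labels but keeps the codes
renamed = s.cat.rename_categories({"low": "L", "mid": "M", "high": "H"})

# reorder_categories changes the category order; ordered=True makes it meaningful
reordered = s.cat.reorder_categories(["low", "mid", "high"], ordered=True)

print(renamed.tolist())
print(list(reordered.cat.categories))
print(reordered.max())  # largest value according to the new order
```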
