Pandas has a special data type called category. It represents a categorical variable and is commonly used in statistics for things such as gender, blood type, classification, and rating levels. It is somewhat like an enum in Java.
This article walks through how to use category in detail. The examples assume import pandas as pd and import numpy as np.
When creating a Series, pass dtype="category" to create a category. A category has two parts: the order and the literal values.
In [1]: s = pd.Series(["a", "b", "c", "a"], dtype="category")
In [2]: s
Out[2]:
0 a
1 b
2 c
3 a
dtype: category
Categories (3, object): ['a', 'b', 'c']
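The two parts of a category can be inspected directly. The following is a minimal sketch, assuming the `.cat` accessor of a reasonably recent pandas release: `categories` holds the literal values and `codes` holds each element's integer position within them.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# The literal values: one entry per distinct category.
print(list(s.cat.categories))  # ['a', 'b', 'c']

# The integer codes: each element's position within the categories.
print(s.cat.codes.tolist())    # [0, 1, 2, 0]
```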
A Series in a DataFrame can be converted to category:
In [3]: df = pd.DataFrame({"A": ["a", "b", "c", "a"]})
In [4]: df["B"] = df["A"].astype("category")
In [5]: df["B"]
Out[5]:
0 a
1 b
2 c
3 a
Name: B, dtype: category
Categories (3, object): ['a', 'b', 'c']
You can also create a pandas.Categorical and pass it to Series as an argument:
In [10]: raw_cat = pd.Categorical(
....:     ["a", "b", "c", "a"], categories=["b", "c", "d"], ordered=False
....: )
....:
In [11]: s = pd.Series(raw_cat)
In [12]: s
Out[12]:
0 NaN
1 b
2 c
3 NaN
dtype: category
Categories (3, object): ['b', 'c', 'd']
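The NaN entries above come from values that are not listed in `categories`. As a small sketch (the variable names are only for illustration), the integer codes make this visible: such values get the code -1.

```python
import pandas as pd

raw_cat = pd.Categorical(["a", "b", "c", "a"], categories=["b", "c", "d"])

# "a" is not one of the declared categories, so its code is -1.
codes = raw_cat.codes.tolist()
print(codes)    # [-1, 0, 1, -1]

# A code of -1 is displayed as NaN and reported as missing.
missing = pd.Series(raw_cat).isna().tolist()
print(missing)  # [True, False, False, True]
```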
When creating a DataFrame, you can also pass dtype="category":
In [17]: df = pd.DataFrame({"A": list("abca"), "B": list("bccd")}, dtype="category")
In [18]: df.dtypes
Out[18]:
A category
B category
dtype: object
Both A and B in the DataFrame are categories:
In [19]: df["A"]
Out[19]:
0 a
1 b
2 c
3 a
Name: A, dtype: category
Categories (3, object): ['a', 'b', 'c']
In [20]: df["B"]
Out[20]:
0 b
1 c
2 c
3 d
Name: B, dtype: category
Categories (3, object): ['b', 'c', 'd']
Or use df.astype("category") to convert every Series in the DataFrame to category:
In [21]: df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})
In [22]: df_cat = df.astype("category")
In [23]: df_cat.dtypes
Out[23]:
A category
B category
dtype: object
By default, a category created with dtype="category" uses two default behaviors: the categories are inferred from the data, and the categories are unordered.
You can explicitly create a CategoricalDtype to change the two defaults above:
In [26]: from pandas.api.types import CategoricalDtype
In [27]: s = pd.Series(["a", "b", "c", "a"])
In [28]: cat_type = CategoricalDtype(categories=["b", "c", "d"], ordered=True)
In [29]: s_cat = s.astype(cat_type)
In [30]: s_cat
Out[30]:
0 NaN
1 b
2 c
3 NaN
dtype: category
Categories (3, object): ['b' < 'c' < 'd']
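To see the difference between the defaults and an explicit dtype side by side, here is a short sketch. The variable names `default_cat` and `explicit_cat` are made up for this illustration.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Defaults: categories are inferred from the data (and sorted), unordered.
default_cat = pd.Series(["b", "a", "c", "a"], dtype="category")
print(list(default_cat.cat.categories))   # ['a', 'b', 'c']
print(default_cat.cat.ordered)            # False

# Explicit dtype: we choose both the categories and the ordered flag.
dtype = CategoricalDtype(categories=["c", "b", "a"], ordered=True)
explicit_cat = pd.Series(["b", "a", "c", "a"]).astype(dtype)
print(list(explicit_cat.cat.categories))  # ['c', 'b', 'a']
print(explicit_cat.cat.ordered)           # True
```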
Similarly, a CategoricalDtype can also be used with a DataFrame:
In [31]: from pandas.api.types import CategoricalDtype
In [32]: df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})
In [33]: cat_type = CategoricalDtype(categories=list("abcd"), ordered=True)
In [34]: df_cat = df.astype(cat_type)
In [35]: df_cat["A"]
Out[35]:
0 a
1 b
2 c
3 a
Name: A, dtype: category
Categories (4, object): ['a' < 'b' < 'c' < 'd']
In [36]: df_cat["B"]
Out[36]:
0 b
1 c
2 c
3 d
Name: B, dtype: category
Categories (4, object): ['a' < 'b' < 'c' < 'd']
Use Series.astype(original_dtype) or np.asarray(categorical) to convert a category back to its original type:
In [39]: s = pd.Series(["a", "b", "c", "a"])
In [40]: s
Out[40]:
0 a
1 b
2 c
3 a
dtype: object
In [41]: s2 = s.astype("category")
In [42]: s2
Out[42]:
0 a
1 b
2 c
3 a
dtype: category
Categories (3, object): ['a', 'b', 'c']
In [43]: s2.astype(str)
Out[43]:
0 a
1 b
2 c
3 a
dtype: object
In [44]: np.asarray(s2)
Out[44]: array(['a', 'b', 'c', 'a'], dtype=object)
Categorical data has two attributes, categories and ordered, which are available through s.cat.categories and s.cat.ordered:
In [57]: s = pd.Series(["a", "b", "c", "a"], dtype="category")
In [58]: s.cat.categories
Out[58]: Index(['a', 'b', 'c'], dtype='object')
In [59]: s.cat.ordered
Out[59]: False
Changing the order of the categories, here by passing them explicitly at creation:
In [60]: s = pd.Series(pd.Categorical(["a", "b", "c", "a"], categories=["c", "b", "a"]))
In [61]: s.cat.categories
Out[61]: Index(['c', 'b', 'a'], dtype='object')
In [62]: s.cat.ordered
Out[62]: False
Categories can be renamed by assigning to s.cat.categories (this assignment works on older pandas releases; it was removed in pandas 2.0):
In [67]: s = pd.Series(["a", "b", "c", "a"], dtype="category")
In [68]: s
Out[68]:
0 a
1 b
2 c
3 a
dtype: category
Categories (3, object): ['a', 'b', 'c']
In [69]: s.cat.categories = ["Group %s" % g for g in s.cat.categories]
In [70]: s
Out[70]:
0 Group a
1 Group b
2 Group c
3 Group a
dtype: category
Categories (3, object): ['Group a', 'Group b', 'Group c']
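The same rename can be written without assigning to the attribute, which may matter on newer pandas releases. This is a sketch that passes a callable to `rename_categories`; the name `renamed` is only for illustration.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# The callable is applied to every existing category label.
renamed = s.cat.rename_categories(lambda g: f"Group {g}")

print(list(renamed.cat.categories))  # ['Group a', 'Group b', 'Group c']
print(renamed.tolist())              # ['Group a', 'Group b', 'Group c', 'Group a']
```

Unlike the assignment form, this returns a new Series and leaves `s` unchanged.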
rename_categories achieves the same effect:
In [71]: s = s.cat.rename_categories([1, 2, 3])
In [72]: s
Out[72]:
0 1
1 2
2 3
3 1
dtype: category
Categories (3, int64): [1, 2, 3]
Or pass a dictionary:
# You can also pass a dict-like object to map the renaming
In [73]: s = s.cat.rename_categories({1: "x", 2: "y", 3: "z"})
In [74]: s
Out[74]:
0 x
1 y
2 z
3 x
dtype: category
Categories (3, object): ['x', 'y', 'z']
Use add_categories to add categories:
In [77]: s = s.cat.add_categories([4])
In [78]: s.cat.categories
Out[78]: Index(['x', 'y', 'z', 4], dtype='object')
In [79]: s
Out[79]:
0 x
1 y
2 z
3 x
dtype: category
Categories (4, object): ['x', 'y', 'z', 4]
Use remove_categories to remove categories:
In [80]: s = s.cat.remove_categories([4])
In [81]: s
Out[81]:
0 x
1 y
2 z
3 x
dtype: category
Categories (3, object): ['x', 'y', 'z']
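The category removed above was not used by any value. As a sketch of the other case (the name `trimmed` is only for illustration), removing a category that is in use turns the affected values into NaN:

```python
import pandas as pd

s = pd.Series(["x", "y", "z", "x"], dtype="category")

# "x" is in use, so rows 0 and 3 lose their value.
trimmed = s.cat.remove_categories(["x"])

print(list(trimmed.cat.categories))  # ['y', 'z']
print(trimmed.isna().tolist())       # [True, False, False, True]
```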
Use remove_unused_categories to drop categories that do not occur in the data:
In [82]: s = pd.Series(pd.Categorical(["a", "b", "a"], categories=["a", "b", "c", "d"]))
In [83]: s
Out[83]:
0 a
1 b
2 a
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']
In [84]: s.cat.remove_unused_categories()
Out[84]:
0 a
1 b
2 a
dtype: category
Categories (2, object): ['a', 'b']
set_categories() adds and removes categories in a single operation:
In [85]: s = pd.Series(["one", "two", "four", "-"], dtype="category")
In [86]: s
Out[86]:
0 one
1 two
2 four
3 -
dtype: category
Categories (4, object): ['-', 'four', 'one', 'two']
In [87]: s = s.cat.set_categories(["one", "two", "three", "four"])
In [88]: s
Out[88]:
0 one
1 two
2 four
3 NaN
dtype: category
Categories (4, object): ['one', 'two', 'three', 'four']
If a category is created with ordered=True, the order of its categories is meaningful, so the values can be sorted by that order and min() and max() are available:
In [91]: s = pd.Series(["a", "b", "c", "a"]).astype(CategoricalDtype(ordered=True))
In [92]: s.sort_values(inplace=True)
In [93]: s
Out[93]:
0 a
3 a
1 b
2 c
dtype: category
Categories (3, object): ['a' < 'b' < 'c']
In [94]: s.min(), s.max()
Out[94]: ('a', 'c')
Use as_ordered() or as_unordered() to mark the categories as ordered or unordered:
In [95]: s.cat.as_ordered()
Out[95]:
0 a
3 a
1 b
2 c
dtype: category
Categories (3, object): ['a' < 'b' < 'c']
In [96]: s.cat.as_unordered()
Out[96]:
0 a
3 a
1 b
2 c
dtype: category
Categories (3, object): ['a', 'b', 'c']
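The ordered flag decides whether order-dependent reductions are allowed. This is a small sketch of the difference; the `raised` flag is a name invented for this illustration.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")  # unordered by default

# min() has no meaning without an order, so pandas raises TypeError.
try:
    s.min()
    raised = False
except TypeError:
    raised = True
print(raised)  # True

# After as_ordered(), the same reduction works.
smallest = s.cat.as_ordered().min()
print(smallest)  # a
```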
Categorical.reorder_categories() reorders the existing categories:
In [103]: s = pd.Series([1, 2, 3, 1], dtype="category")
In [104]: s = s.cat.reorder_categories([2, 3, 1], ordered=True)
In [105]: s
Out[105]:
0 1
1 2
2 3
3 1
dtype: category
Categories (3, int64): [2 < 3 < 1]
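reorder_categories expects exactly the existing categories in a new order. The sketch below (variable names are illustrative) shows what happens when one is left out, and how set_categories behaves with the same input:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 1], dtype="category")

# Leaving out category 1 is rejected: the new list must contain the same items.
try:
    s.cat.reorder_categories([2, 3], ordered=True)
    error_name = None
except ValueError as exc:
    error_name = type(exc).__name__
print(error_name)  # ValueError

# set_categories accepts a different set; values of dropped categories become NaN.
subset = s.cat.set_categories([2, 3], ordered=True)
print(subset.isna().tolist())  # [True, False, False, True]
```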
sort_values supports sorting by multiple columns:
In [109]: dfs = pd.DataFrame(
.....:     {
.....:         "A": pd.Categorical(
.....:             list("bbeebbaa"),
.....:             categories=["e", "a", "b"],
.....:             ordered=True,
.....:         ),
.....:         "B": [1, 2, 1, 2, 2, 1, 2, 1],
.....:     }
.....: )
.....:
In [110]: dfs.sort_values(by=["A", "B"])
Out[110]:
A B
2 e 1
3 e 2
7 a 1
6 a 2
0 b 1
5 b 1
1 b 2
4 b 2
If ordered=True was set at creation, categorical values can be compared with one another. The operators ==, !=, >, >=, <, and <= are supported.
In [113]: cat = pd.Series([1, 2, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))
In [114]: cat_base = pd.Series([2, 2, 2]).astype(CategoricalDtype([3, 2, 1], ordered=True))
In [115]: cat_base2 = pd.Series([2, 2, 2]).astype(CategoricalDtype(ordered=True))
In [119]: cat > cat_base
Out[119]:
0 True
1 False
2 False
dtype: bool
In [120]: cat > 2
Out[120]:
0 True
1 False
2 False
dtype: bool
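The session defines cat_base2 but never compares against it. As a sketch of what to expect, comparing two categoricals whose categories differ raises TypeError. The fallback via `np.asarray` shown here compares the raw values, so it ignores the category order.

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

cat = pd.Series([1, 2, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))
cat_base2 = pd.Series([2, 2, 2]).astype(CategoricalDtype(ordered=True))

# The categories differ ([3, 2, 1] versus [2]), so pandas refuses to compare.
try:
    cat > cat_base2
    raised = False
except TypeError:
    raised = True
print(raised)  # True

# Comparing the underlying values uses plain numeric order, not category order.
raw = np.asarray(cat) > np.asarray(cat_base2)
print(raw.tolist())  # [False, False, True]
```

Note that the raw comparison gives the opposite answer to `cat > 2` above, because the categorical order there is 3 < 2 < 1.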
A category is in essence still a Series, so most Series operations also work on it, for example Series.min(), Series.max() and Series.mode().
value_counts:
In [131]: s = pd.Series(pd.Categorical(["a", "b", "c", "c"], categories=["c", "a", "b", "d"]))
In [132]: s.value_counts()
Out[132]:
c 2
a 1
b 1
d 0
dtype: int64
DataFrame.sum() (the level argument used here works on older pandas releases; it was removed in pandas 2.0):
In [133]: columns = pd.Categorical(
.....:     ["One", "One", "Two"], categories=["One", "Two", "Three"], ordered=True
.....: )
.....:
In [134]: df = pd.DataFrame(
.....:     data=[[1, 2, 3], [4, 5, 6]],
.....:     columns=pd.MultiIndex.from_arrays([["A", "B", "B"], columns]),
.....: )
.....:
In [135]: df.sum(axis=1, level=1)
Out[135]:
One Two Three
0 3 3 0
1 9 6 0
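On pandas releases where sum() no longer accepts a level argument, one possible equivalent is to group on the transposed frame. This is a sketch under that assumption, not the original author's method; `totals` is a name chosen for illustration.

```python
import pandas as pd

columns = pd.Categorical(
    ["One", "One", "Two"], categories=["One", "Two", "Three"], ordered=True
)
df = pd.DataFrame(
    data=[[1, 2, 3], [4, 5, 6]],
    columns=pd.MultiIndex.from_arrays([["A", "B", "B"], columns]),
)

# Transpose so the column levels become row levels, group the rows by the
# second level, sum each group, then transpose back.
# observed=False asks pandas to keep categories that have no data.
totals = df.T.groupby(level=1, observed=False).sum().T

print(totals["One"].tolist())  # [3, 9]
print(totals["Two"].tolist())  # [3, 6]
```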
Groupby:
In [136]: cats = pd.Categorical(
.....:     ["a", "b", "b", "b", "c", "c", "c"], categories=["a", "b", "c", "d"]
.....: )
.....:
In [137]: df = pd.DataFrame({"cats": cats, "values": [1, 2, 2, 2, 3, 4, 5]})
In [138]: df.groupby("cats").mean()
Out[138]:
values
cats
a 1.0
b 2.0
c 4.0
d NaN
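The row for d appears above even though no value uses that category. As a sketch, the `observed` parameter of groupby controls this; passing it explicitly avoids relying on the default, which differs between pandas versions.

```python
import pandas as pd

cats = pd.Categorical(
    ["a", "b", "b", "b", "c", "c", "c"], categories=["a", "b", "c", "d"]
)
df = pd.DataFrame({"cats": cats, "values": [1, 2, 2, 2, 3, 4, 5]})

# observed=False keeps every declared category, including the unused "d".
all_groups = df.groupby("cats", observed=False).mean()
print(list(all_groups.index))   # ['a', 'b', 'c', 'd']

# observed=True keeps only the categories that occur in the data.
seen_groups = df.groupby("cats", observed=True).mean()
print(list(seen_groups.index))  # ['a', 'b', 'c']
```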
In [139]: cats2 = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])
In [140]: df2 = pd.DataFrame(
.....:     {
.....:         "cats": cats2,
.....:         "B": ["c", "d", "c", "d"],
.....:         "values": [1, 2, 3, 4],
.....:     }
.....: )
.....:
In [141]: df2.groupby(["cats", "B"]).mean()
Out[141]:
values
cats B
a c 1.0
d 2.0
b c 3.0
d 4.0
c c NaN
d NaN
Pivot tables:
In [142]: raw_cat = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])
In [143]: df = pd.DataFrame({"A": raw_cat, "B": ["c", "d", "c", "d"], "values": [1, 2, 3, 4]})
In [144]: pd.pivot_table(df, values="values", index=["A", "B"])
Out[144]:
values
A B
a c 1
d 2
b c 3
d 4
This article is also available at http://www.flydean.com/08-python-pandas-category/