程序師世界是廣大編程愛好者互助、分享、學習的平台,程序師世界有你更精彩!
首頁
編程語言
C語言|JAVA編程
Python編程
網頁編程
ASP編程|PHP編程
JSP編程
數據庫知識
MYSQL數據庫|SqlServer數據庫
Oracle數據庫|DB2數據庫
您现在的位置: 程式師世界 >> 編程語言 >  >> 更多編程語言 >> Python

Python: DateFrame教程

編輯:Python

目錄

創建、讀寫、顯示

創建DataFrame與Series

讀取、保存數據文件

設置顯示格式

列數據對齊

索引、選擇

取行

取列

同時取行列

將某列設為行索引

條件選取

匯總函數與映射

匯總函數:describe(),unique(),value_counts()

映射:map()與apply()

連接兩列文本

分組與排序

分組分析

多層索引

排序

數據類型與缺失值處理

數據類型及轉換

缺失值處理

重命名、更改列順序、添加新行列

重命名列名與行名

更改列順序

添加新列

數據框合並

相同列名的數據框縱向拼接

橫向拼接或融合數據框


導入庫

import pandas as pd

創建、讀寫、顯示

創建DataFrame與Series

>>> pd.DataFrame({'Yes':[50,21],"No":[131,2]})
Yes No
0 50 131
1 21 2
>>> #數據框的內容也可以為字符串
>>> pd.DataFrame({'Bob':['T like it.','It was awful.'],'Sue':['Pretty good.','Bland.']})
Bob Sue
0 T like it. Pretty good.
1 It was awful. Bland.
>>> #設置行索引
>>> pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'],
'Sue': ['Pretty good.', 'Bland.']},
index=['Product A', 'Product B'])
Bob Sue
Product A I liked it. Pretty good.
Product B It was awful. Bland.
>>> #創建Series
>>> pd.Series([1,2,3,4,5])
0 1
1 2
2 3
3 4
4 5
dtype: int64
>>> #Series無列名,只有一個總體的名稱
>>> pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')
2015 Sales 30
2016 Sales 35
2017 Sales 40
Name: Product A, dtype: int64

讀取、保存數據文件

>>> wine_reviews = pd.read_csv('D:\DOCUMENT\PRO\PYTHON\DataFrameTurtorial\winemag-data-130k-v2.csv')
>>> wine_reviews
Unnamed: 0 country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 0 Italy Aromas... Vulkà ... 87 NaN Sicily... Etna NaN Kerin ... @kerin... Nicosi... White ... Nicosia
1 1 Portugal This i... Avidagos 87 15.0 Douro NaN NaN Roger ... @vossr... Quinta... Portug... Quinta...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
129969 129969 France A dry ... NaN 90 32.0 Alsace Alsace NaN Roger ... @vossr... Domain... Pinot ... Domain...
129970 129970 France Big, r... Lieu-d... 90 21.0 Alsace Alsace NaN Roger ... @vossr... Domain... Gewürz... Domain...
[129971 rows x 14 columns]
>>> #顯示行列數
>>> wine_reviews.shape
(129971, 14)
>>> #顯示首尾幾行
>>> wine_reviews.head() #默認顯示前5行,head(3)顯示前3行
Unnamed: 0 country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 0 Italy Aromas... Vulkà ... 87 NaN Sicily... Etna NaN Kerin ... @kerin... Nicosi... White ... Nicosia
1 1 Portugal This i... Avidagos 87 15.0 Douro NaN NaN Roger ... @vossr... Quinta... Portug... Quinta...
2 2 US Tart a... NaN 87 14.0 Oregon Willam... Willam... Paul G... @paulg... Rainst... Pinot ... Rainstorm
3 3 US Pineap... Reserv... 87 13.0 Michigan Lake M... NaN Alexan... NaN St. Ju... Riesling St. Ju...
4 4 US Much l... Vintne... 87 65.0 Oregon Willam... Willam... Paul G... @paulg... Sweet ... Pinot ... Sweet ...
>>> wine_reviews.tail() #默認顯示後5行,tail(3)顯示後3行
Unnamed: 0 country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
129966 129966 Germany Notes ... Braune... 90 28.0 Mosel NaN NaN Anna L... NaN Dr. H.... Riesling Dr. H....
129967 129967 US Citati... NaN 90 75.0 Oregon Oregon Oregon... Paul G... @paulg... Citati... Pinot ... Citation
129968 129968 France Well-d... Kritt 90 30.0 Alsace Alsace NaN Roger ... @vossr... Domain... Gewürz... Domain...
129969 129969 France A dry ... NaN 90 32.0 Alsace Alsace NaN Roger ... @vossr... Domain... Pinot ... Domain...
129970 129970 France Big, r... Lieu-d... 90 21.0 Alsace Alsace NaN Roger ... @vossr... Domain... Gewürz... Domain...
>>> #創建時可指定索引
>>> wine_reviews = pd.read_csv('D:\DOCUMENT\PRO\PYTHON\DataFrameTurtorial\winemag-data-130k-v2.csv', index_col=0)
>>> wine_reviews
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas... Vulkà ... 87 NaN Sicily... Etna NaN Kerin ... @kerin... Nicosi... White ... Nicosia
1 Portugal This i... Avidagos 87 15.0 Douro NaN NaN Roger ... @vossr... Quinta... Portug... Quinta...
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129969 France A dry ... NaN 90 32.0 Alsace Alsace NaN Roger ... @vossr... Domain... Pinot ... Domain...
129970 France Big, r... Lieu-d... 90 21.0 Alsace Alsace NaN Roger ... @vossr... Domain... Gewürz... Domain...
[129971 rows x 13 columns]
>>> #將數據框保存為csv文件
>>> wine_reviews.to_csv(path)

設置顯示格式

#顯示所有列(參數設置為None代表顯示所有列,也可以自行設置數字)
pd.set_option('display.max_columns',None)
#顯示所有行
pd.set_option('display.max_rows',None)
#設置數據的顯示長度,默認為50
pd.set_option('max_colwidth',200)
#禁止自動換行(設置為Flase不自動換行,True反之)
pd.set_option('expand_frame_repr', False)

列數據對齊

#列數據對齊
>>> df #對齊前
f10 f12 f14 f2 f23 f3 f8 f9
0 79 000001 平安銀行 2150 142 -129 30 1442
1 61 000002 萬 科A 3000 155 57 39 840
2 0 000003 PT金田A 0 0 0 0 0
>>> pd.set_option('display.unicode.ambiguous_as_wide', True)
>>> df
f10 f12 f14 f2 f23 f3 f8 f9
0 79 000001 平安銀行 2150 142 -129 30 1442
1 61 000002 萬 科A 3000 155 57 39 840
2 0 000003 PT金田A 0 0 0 0 0
>>> pd.set_option('display.unicode.east_asian_width', True) #只用這一行似乎也可
>>> df #對齊後
f10 f12 f14 f2 f23 f3 f8 f9
0 79 000001 平安銀行 2150 142 -129 30 1442
1 61 000002 萬 科A 3000 155 57 39 840
2 0 000003 PT金田A 0 0 0 0 0
>>> pd.set_option('display.width', 180) # 設置打印寬度(**重要**)
>>> df
f10 f12 f14 f2 f23 f3 f8 f9
0 79 000001 平安銀行 2150 142 -129 30 1442
1 61 000002 萬 科A 3000 155 57 39 840
2 0 000003 PT金田A 0 0 0 0 0

索引、選擇

>>> #導入數據並設置最大顯示行數為5
>>> import pandas as pd
>>> reviews = pd.read_csv("winemag-data-130k-v2.csv", index_col=0)
>>> pd.set_option('max_rows', 5)
>>> reviews
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas... Vulkà ... 87 NaN Sicily... Etna NaN Kerin ... @kerin... Nicosi... White ... Nicosia
1 Portugal This i... Avidagos 87 15.0 Douro NaN NaN Roger ... @vossr... Quinta... Portug... Quinta...
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129969 France A dry ... NaN 90 32.0 Alsace Alsace NaN Roger ... @vossr... Domain... Pinot ... Domain...
129970 France Big, r... Lieu-d... 90 21.0 Alsace Alsace NaN Roger ... @vossr... Domain... Gewürz... Domain...
[129971 rows x 13 columns]

取行

>>> reviews.iloc[0] #用行號訪問第一行,返回Series
>>> reviews.iloc[[0]] #用行號訪問第一行,返回DataFrame
>>> reviews.iloc[-5:] #取後5行,返回DataFrame
>>> reviews.iloc[2:4] #第三行與第四行
>>> reviews[2:4] #取第三行與第四行
>>> reviews.loc[[2,4]] #取不連續的多行
country ... winery
2 US ... Rainstorm
4 US ... Sweet Cheeks
[2 rows x 13 columns]

取列

>>> #取一列
>>> reviews['country'] #用列名訪問,返回Series
>>> reviews.country #用.訪問,名稱中有空格時不適用,返回Series
>>> reviews.loc[:,'country'] #同reviews['country'],返回Series
>>> reviews.iloc[:,0] #或用列位置索引訪問,選取第1列,返回Series
>>> reviews[['country']] #取country列,返回DataFrame
>>> reviews.loc[:,['country']] #同reviews[['country']]
>>> reviews.iloc[:,[0]] #取第0列,用位置索引訪問,返回DataFrame
country
0 Italy
1 Portugal
... ...
129969 France
129970 France
[129971 rows x 1 columns]
>>> #取多列,返回DataFrame
>>> reviews.loc[:'country':'points'] #取從country到points(包含points)的所有列,country列需在points列之前
>>> reviews.iloc[:,0:4] #取第0列到第4列(不包含第四列)的所有列
>>> reviews[['country','points']] #取不連續的country與points列、
>>> reviews.loc[:,['country','points']] #同reviews[['country','points']]
>>> reviews.iloc[:,[0,3]] #取不連續的第0列與第2列
country points
0 Italy 87
1 Portugal 87
... ... ...
129969 France 90
129970 France 90
[129971 rows x 2 columns]

同時取行列

>>> #取單個值,返回類型為值的類型
>>> reviews['country'][0] #取country列第一個值
>>> reviews.loc[0,'country'] #取第0行country列的值
>>> reviews.iloc[1,0] #取第一行第0列的值
>>> #取多個值
>>> reviews.iloc[1:4,0] #選取第2行到第4行的第一列,連續索引,返回Series
>>> reviews.iloc[[1,3],0] #選取第2行第4行的第一列,分散索引,返回Series
>>> reviews.iloc[[1,3],[0]] #選取第2行第4行的第一列,分散索引,返回DataFrame
>>> reviews.iloc[[1,3],2:5] #選取第2行第4行的第3列到第5列,連續索引,返回DataFrame
>>> reviews.iloc[[1,3],[2,5]] #選取第2行第4行的第3列和第6列,分散索引,返回DataFrame
>>> reviews.loc[1,['country','points']] #選取第2行的country與points列,分散索引,返回Series
>>> reviews.loc[[1],['country','points']] #選取第2行的country與points列,分散索引,返回DataFrame
>>> reviews.loc[[1,3],['country','points']] #選取第2行與第4行的country與points列,分散索引,返回DataFrame
>>> reviews.loc[[1,3],'country':'points'] #選取第2行與第4行的country到points列,返回DataFrame
country ... points
1 Portugal ... 87
3 US ... 87
[2 rows x 4 columns]

將某列設為行索引

>>> reviews.set_index('title')
country ... winery
title ...
Nicosia 2013 Vulkà Bianco (Etna) Italy ... Nicosia
Quinta dos Avidagos 2011 Avidagos Red (Douro) Portugal ... Quinta dos Avidagos
... ... ... ...
Domaine Marcel Deiss 2012 Pinot Gris (Alsace) France ... Domaine Marcel Deiss
Domaine Schoffit 2012 Lieu-dit Harth Cuvée Caro... France ... Domaine Schoffit
[129971 rows x 12 columns]

條件選取

>>> #判斷每行country是否為Italy,返回Series
>>> reviews.country == 'Italy'
0 True
1 False
...
129969 False
129970 False
Name: country, Length: 129971, dtype: bool
>>> reviews[reviews.country == 'Italy'] #選取country為Italy的行
>>> reviews.loc[reviews.country == 'Italy'] #同上,可不用loc
>>> reviews[reviews['points'] == 90] #選取points為90的行
>>> reviews.loc[(reviews.country == 'Italy') & (reviews.points >= 90)] #選取country為Italy且points>=90的行
>>> reviews.loc[(reviews.country == 'Italy') | (reviews.points >= 90)] #選取country為Italy或points>=90的行
>>> reviews.loc[reviews.country.isin(['Italy', 'France'])] #選取country為Italy或France的行
>>> reviews.loc[reviews.price.isnull()] #選取price為空的行
>>> reviews.loc[reviews.price.notnull()] #選取price為非空的行
>>> reviews[reviews['description'].str.contains('great')] #選取description列中包含great的行,有空值報錯時添加參數na=False
#,即reviews[reviews['description'].str.contains('great',na=False)]
>>> reviews[~reviews['description'].str.contains('great')] #選取description列中不包含great的行
country ... winery
0 Italy ... Nicosia
1 Portugal ... Quinta dos Avidagos
... ... ... ...
129969 France ... Domaine Marcel Deiss
129970 France ... Domaine Schoffit
[125196 rows x 13 columns]

匯總函數與映射

導入數據

>>> #導入數據並設置最大顯示行數為5
>>> import pandas as pd
>>> reviews = pd.read_csv("winemag-data-130k-v2.csv", index_col=0)
>>> pd.set_option('max_rows', 5)
>>> reviews
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas... Vulkà ... 87 NaN Sicily... Etna NaN Kerin ... @kerin... Nicosi... White ... Nicosia
1 Portugal This i... Avidagos 87 15.0 Douro NaN NaN Roger ... @vossr... Quinta... Portug... Quinta...
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129969 France A dry ... NaN 90 32.0 Alsace Alsace NaN Roger ... @vossr... Domain... Pinot ... Domain...
129970 France Big, r... Lieu-d... 90 21.0 Alsace Alsace NaN Roger ... @vossr... Domain... Gewürz... Domain...
[129971 rows x 13 columns]

匯總函數:describe(),unique(),value_counts()

>>> reviews.points.describe() #返回數值列數據的均值、最大最小值等性質
count 129971...
mean 88.447138
...
75% 91.000000
max 100.00...
Name: points, Length: 8, dtype: float64
>>> reviews.taster_name.describe() #返回文本列文本的一些性質
count 103727
unique 19
top Roger ...
freq 25514
Name: taster_name, dtype: object
>>> reviews.points.mean() #取points列的平均值
>>> reviews.taster_name.unique() #查看taster_name列數據有多少種,返回數組
array(['Kerin O’Keefe', 'Roger Voss', 'Paul Gregutt',
'Alexander Peartree', 'Michael Schachner', 'Anna Lee C. Iijima',
'Virginie Boone', 'Matt Kettmann', nan, 'Sean P. Sullivan',
'Jim Gordon', 'Joe Czerwinski', 'Anne Krebiehl\xa0MW',
'Lauren Buzzeo', 'Mike DeSimone', 'Jeff Jenssen',
'Susan Kostrzewa', 'Carrie Dykes', 'Fiona Adams',
'Christina Pickard'], dtype=object)
>>> reviews.taster_name.value_counts() #查看taster_name列數據每種有多少個
Roger Voss 25514
Michael Schachner 15134
...
Fiona Adams 27
Christina Pickard 6
Name: taster_name, Length: 19, dtype: int64

映射:map()與apply()

>>> #數據列扣除平均值
>>> review_points_mean = reviews.points.mean()
>>> reviews.points-review_points_mean #返回Series,扣除平均值最快的方法,結果同下
>>> reviews.points.map(lambda p: p - review_points_mean) #返回Series,不改變原DataFrame
0 -1.447138
1 -1.447138
...
129969 1.552862
129970 1.552862
Name: points, Length: 129971, dtype: float64
>>> #apply()方法逐行更改數據,較慢
>>> def remean_points(row):
row.points = row.points - review_points_mean
return row
>>> reviews.apply(remean_points, axis='columns') #返回新的DataFrame,不改變原DataFrame
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas... Vulkà ... -1.447138 NaN Sicily... Etna NaN Kerin ... @kerin... Nicosi... White ... Nicosia
1 Portugal This i... Avidagos -1.447138 15.0 Douro NaN NaN Roger ... @vossr... Quinta... Portug... Quinta...
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129969 France A dry ... NaN 1.552862 32.0 Alsace Alsace NaN Roger ... @vossr... Domain... Pinot ... Domain...
129970 France Big, r... Lieu-d... 1.552862 21.0 Alsace Alsace NaN Roger ... @vossr... Domain... Gewürz... Domain...
[129971 rows x 13 columns]

連接兩列文本

>>> reviews.country + " - " + reviews.region_1
0 Italy ...
1 NaN
...
129969 France...
129970 France...
Length: 129971, dtype: object

分組與排序

導入數據

>>> #導入數據並設置最大顯示行數為5
>>> import pandas as pd
>>> reviews = pd.read_csv("winemag-data-130k-v2.csv", index_col=0)
>>> pd.set_option('max_rows', 5)
>>> reviews
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas... Vulkà ... 87 NaN Sicily... Etna NaN Kerin ... @kerin... Nicosi... White ... Nicosia
1 Portugal This i... Avidagos 87 15.0 Douro NaN NaN Roger ... @vossr... Quinta... Portug... Quinta...
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129969 France A dry ... NaN 90 32.0 Alsace Alsace NaN Roger ... @vossr... Domain... Pinot ... Domain...
129970 France Big, r... Lieu-d... 90 21.0 Alsace Alsace NaN Roger ... @vossr... Domain... Gewürz... Domain...
[129971 rows x 13 columns]

分組分析

>>> #按points不同分類,並統計每種points有多少行,返回Series
>>> reviews.groupby('points').size() #方法1
>>> reviews.groupby('points').points.count() #方法2
>>> reviews.points.value_counts() #方法3,順序可能不同,該方法沒有Series列名
88 17207
87 16933
...
99 33
100 19
Name: points, Length: 21, dtype: int64
>>> #按points不同分類,並統計每種points種的最小price
>>> reviews.groupby('points').price.min()
points
80 5.0
81 5.0
...
99 44.0
100 80.0
Name: price, Length: 21, dtype: float64
>>> #按winery不同分類,查看每類中第一行的title
>>> reviews.groupby('winery').apply(lambda df: df.title.iloc[0])
winery
1+1=3 1+1=3 ...
10 Knots 10 Kno...
...
àMaurice àMauri...
Štoka Štoka ...
Length: 16757, dtype: object
>>> #根據多列(country,province)分類,並找出每類中points最高的行
>>> reviews.groupby(['country', 'province']).apply(lambda df: df.loc[df.points.idxmax()])
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
country province
Argentina Mendoza... Argentina If the... Nicasi... 97 120.0 Mendoz... Mendoza NaN Michae... @wines... Bodega... Malbec Bodega...
Other Argentina Take n... Reserva 95 90.0 Other Salta NaN Michae... @wines... Colomé... Malbec Colomé
... ... ... ... ... ... ... ... ... ... ... ... ... ...
Uruguay San Jose Uruguay Baked,... El Pre... 87 50.0 San Jose NaN NaN Michae... @wines... Castil... Red Blend Castil...
Uruguay Uruguay Cherry... Blend ... 91 22.0 Uruguay NaN NaN Michae... @wines... Narbon... Tannat... Narbona
[425 rows x 13 columns]
>>> #按country分類,並統計每類有多少行及每類的最大值最小值
>>> reviews.groupby(['country']).price.agg([len, min, max])
len min max
country
Argentina 3800.0 4.0 230.0
Armenia 2.0 14.0 15.0
... ... ... ...
Ukraine 14.0 6.0 13.0
Uruguay 109.0 10.0 130.0
[43 rows x 3 columns]

多層索引

>>> #按country和province分類,查看每類有多少個
>>> countries_reviewed = reviews.groupby(['country', 'province']).description.agg([len])
>>> countries_reviewed
len
country province
Argentina Mendoza... 3264
Other 536
... ...
Uruguay San Jose 3
Uruguay 24
[425 rows x 1 columns]
>>> #查看數據類型
>>> mi = countries_reviewed.index
>>> type(mi)
<class 'pandas.core.indexes.multi.MultiIndex'>
>>> #重置行索引
>>> countries_reviewed.reset_index()
country province len
0 Argentina Mendoz... 3264
1 Argentina Other 536
.. ... ... ...
423 Uruguay San Jose 3
424 Uruguay Uruguay 24
[425 rows x 3 columns]

排序

>>> #分類結果按種類數目(值)排序
>>> countries_reviewed = countries_reviewed.reset_index()
>>> countries_reviewed.sort_values(by='len') #升序(默認)
>>> countries_reviewed.sort_values(by='len', ascending=False) #降序
country province len
392 US Califo... 36247
415 US Washin... 8639
.. ... ... ...
63 Chile Coelemu 1
149 Greece Beotia 1
[425 rows x 3 columns]
>>> #分類結果按索引升序排序
>>> countries_reviewed.sort_index()
country province len
0 Argentina Mendoz... 3264
1 Argentina Other 536
.. ... ... ...
423 Uruguay San Jose 3
424 Uruguay Uruguay 24
[425 rows x 3 columns]
>>> #分類結果多條件排序
>>> countries_reviewed.sort_values(by=['country', 'len'])
country province len
1 Argentina Other 536
0 Argentina Mendoz... 3264
.. ... ... ...
424 Uruguay Uruguay 24
419 Uruguay Canelones 43
[425 rows x 3 columns]

數據類型與缺失值處理

導入數據

>>> #導入數據並設置最大顯示行數為5
>>> import pandas as pd
>>> reviews = pd.read_csv("winemag-data-130k-v2.csv", index_col=0)
>>> pd.set_option('max_rows', 5)
>>> reviews
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas... Vulkà ... 87 NaN Sicily... Etna NaN Kerin ... @kerin... Nicosi... White ... Nicosia
1 Portugal This i... Avidagos 87 15.0 Douro NaN NaN Roger ... @vossr... Quinta... Portug... Quinta...
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129969 France A dry ... NaN 90 32.0 Alsace Alsace NaN Roger ... @vossr... Domain... Pinot ... Domain...
129970 France Big, r... Lieu-d... 90 21.0 Alsace Alsace NaN Roger ... @vossr... Domain... Gewürz... Domain...
[129971 rows x 13 columns]

數據類型及轉換

>>> #查看列數據類型
>>> reviews.index.dtype #查看索引列數據類型,dtype('int64')
>>> reviews.price.dtype #查看price列數據類型,dtype('float64')
>>> reviews.dtypes
country object
description object
...
variety object
winery object
Length: 13, dtype: object
>>> #轉換數據類型
>>> reviews.points.astype('float64')
0 87.0
1 87.0
...
129969 90.0
129970 90.0
Name: points, Length: 129971, dtype: float64
>>> #將列數據轉換為列表
>>> reviews['country'].tolist()
>>> #或
>>> list(reviews['country'])

缺失值處理

缺失值標記為NaN(Not a Number),總為float64類型

>>> #選擇country值缺失的行
>>> reviews[pd.isnull(reviews.country)] #反過來為notnull
country ... winery
913 NaN ... Gotsa Family Wines
3131 NaN ... Barton & Guestier
... ... ... ...
129590 NaN ... Büyülübağ
129900 NaN ... Psagot
[63 rows x 13 columns]
>>> #用Unknown填充region_2缺失的行
>>> reviews.region_2.fillna("Unknown")
0 Unknown
1 Unknown
...
129969 Unknown
129970 Unknown
Name: region_2, Length: 129971, dtype: object
>>> #值替換
>>> reviews.taster_twitter_handle.replace("@kerinokeefe", "@kerino") #將taster_twitter_handle列中的@kerinokeefe替換為@kerino
0 @kerino
1 @vossroger
...
129969 @vossroger
129970 @vossroger
Name: taster_twitter_handle, Length: 129971, dtype: object

重命名、更改列順序、添加新行列

導入數據

>>> #導入數據並設置最大顯示行數為5
>>> import pandas as pd
>>> reviews = pd.read_csv("winemag-data-130k-v2.csv", index_col=0)
>>> pd.set_option('max_rows', 5)
>>> reviews
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas... Vulkà ... 87 NaN Sicily... Etna NaN Kerin ... @kerin... Nicosi... White ... Nicosia
1 Portugal This i... Avidagos 87 15.0 Douro NaN NaN Roger ... @vossr... Quinta... Portug... Quinta...
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129969 France A dry ... NaN 90 32.0 Alsace Alsace NaN Roger ... @vossr... Domain... Pinot ... Domain...
129970 France Big, r... Lieu-d... 90 21.0 Alsace Alsace NaN Roger ... @vossr... Domain... Gewürz... Domain...
[129971 rows x 13 columns]

重命名列名與行名

>>> #重命名列名
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> #方法1:采用字典格式設置列名
>>> df.rename(columns={"A": "a", "B": "c"},inplace=True)
>>>#或
>>>df.rename(columns=dict(A='a',B='b'),inplace=True)
>>> #方法2:設置columns,按原來的順序全部列出
>>> df.columns=['a','c']
>>> df
a c
0 1 4
1 2 5
2 3 6
>>> #重命名行索引
>>> df.rename(index={0: "x", 1: "y", 2: "z"},inplace=True)
>>> df.rename({0: "x", 1: "y", 2: "z"}, axis='index',inplace=True)
>>> df
a c
x 1 4
y 2 5
z 3 6
#命名索引列名稱
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> df.rename_axis('number',axis='rows')
A B
number
0 1 4
1 2 5
2 3 6

更改列順序

>>> #方法1:設置列名順序列表並應用
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> newColumnOrderList=['B','A']
>>> df=df[newColumnOrderList]
>>> #方法2:取出列,刪除原來的列,在新的位置插入取出的列
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> df_B=df.B
>>> df.drop('B',axis=1,inplace=True)
>>> df.insert(0,'new_B',df_B) #插入新列時還可以對新列重命名
>>> df
new_B A
0 4 1
1 5 2
2 6 3

添加新列

>>> reviews['critic'] = 'everyone' #新建critic列並賦值everyone,操作類似字典
>>> reviews['critic']
0 everyone
1 everyone
...
129969 everyone
129970 everyone
Name: critic, Length: 129971, dtype: object
#迭代賦值
>>> reviews['index_backwards'] = range(len(reviews), 0, -1) #迭代賦值
>>> reviews['index_backwards'] = list(range(len(reviews), 0, -1))
>>> reviews['index_backwards']
0 129971
1 129970
...
129969 2
129970 1
Name: index_backwards, Length: 129971, dtype: int64

數據框合並

相同列名的數據框縱向拼接

>>> #相同列名的數據框縱向拼接
>>> df1 = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
>>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('BA'))
>>> #方法1:append()方法
>>> #拼接兩個DataFrame
>>> df1.append(df2,ignore_index=True) #生成新的DataFrame,不改變原有的DataFrame。ignore_index設為True用於重置行索引
A B
0 1 2
1 3 4
2 6 5
3 8 7
>>> #拼接多個Series
>>> s1 = pd.Series(['a', 'b'])
>>> s2 = pd.Series(['c', 'd'])
>>> pd.concat([s1,s2],ignore_index=True) #
0 a
1 b
2 c
3 d
dtype: object
>>> #方法2:caoncat()方法,可以拼接多個列名相同的DataFrame
>>> df1 = pd.DataFrame([['a', 1], ['b', 2]],columns=['letter', 'number'])
>>> df2 = pd.DataFrame([['c', 3], ['d', 4]],columns=['letter', 'number'])
>>> pd.concat([df1, df2])
letter number
0 a 1
1 b 2
0 c 3
1 d 4

橫向拼接或融合數據框

>>> df1 = pd.DataFrame([['a', 1], ['b', 2]],columns=['letter', 'number'])
>>> df4 = pd.DataFrame([['bird', 'polly'], ['monkey', 'george']],columns=['animal', 'name'])
>>> #方法1:pd.concat()方法
>>> pd.concat([df1,df4],axis=1)
letter number animal name
0 a 1 bird polly
1 b 2 monkey george
>>> #方法2:join()方法
>>> df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
>>> df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2'],'B': ['B0', 'B1', 'B2']})
>>> #橫向合並兩個數據框,保留所有列名
>>> df1.join(df2, lsuffix='_df1', rsuffix='_df2')
key_df1 A key_df2 B
0 K0 A0 K0 B0
1 K1 A1 K1 B1
2 K2 A2 K2 B2
3 K3 A3 NaN NaN
4 K4 A4 NaN NaN
5 K5 A5 NaN NaN
>>> #如果要用key列融合兩個數據框,需將key列設為兩個數據框的索引
>>> df1.set_index('key').join(df2.set_index('key'))
A B
key
K0 A0 B0
K1 A1 B1
K2 A2 B2
K3 A3 NaN
K4 A4 NaN
K5 A5 NaN
>>> #還可采用設置on參數的方法融合兩個數據框
>>> df1.join(df2.set_index('key'), on='key')
key A B
0 K0 A0 B0
1 K1 A1 B1
2 K2 A2 B2
3 K3 A3 NaN
4 K4 A4 NaN
5 K5 A5 NaN

此外,合並方法還有df.merge()等。


  1. 上一篇文章:
  2. 下一篇文章:
Copyright © 程式師世界 All Rights Reserved