程序師世界是廣大編程愛好者互助、分享、學習的平台,程序師世界有你更精彩!
首頁
編程語言
C語言|JAVA編程
Python編程
網頁編程
ASP編程|PHP編程
JSP編程
數據庫知識
MYSQL數據庫|SqlServer數據庫
Oracle數據庫|DB2數據庫
您现在的位置: 程式師世界 >> 編程語言 >  >> 更多編程語言 >> Python

Python: dateframe tutorial

編輯:Python

Catalog

establish 、 Reading and writing 、 Show

establish DataFrame And Series

Read 、 Save data file

Set the display format

Column data alignment

Indexes 、 choice

Take row

Fetch

Simultaneous row and column retrieval

Set a column as a row index

Conditional selection

Summarize functions and mappings

Aggregate function :describe(),unique(),value_counts()

mapping :map() And apply()

Connect two columns of text

Grouping and sorting

Group analysis

Multi level index

Sort

Data type and missing value processing

And data types

Missing value processing

rename 、 Change column order 、 Add a new row

Rename column and row names

Change column order

Add new column

Data frame merge

Data frames with the same column names are spliced vertically

Horizontal splicing or fusion of data frames


Import library

import pandas as pd

establish 、 Reading and writing 、 Show

establish DataFrame And Series

>>> pd.DataFrame({'Yes':[50,21],"No":[131,2]})
Yes No
0 50 131
1 21 2
>>> # The contents of the data frame can also be strings
>>> pd.DataFrame({'Bob':['T like it.','It was awful.'],'Sue':['Pretty good.','Bland.']})
Bob Sue
0 T like it. Pretty good.
1 It was awful. Bland.
>>> # Set row index
>>> pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'],
'Sue': ['Pretty good.', 'Bland.']},
index=['Product A', 'Product B'])
Bob Sue
Product A I liked it. Pretty good.
Product B It was awful. Bland.
>>> # establish Series
>>> pd.Series([1,2,3,4,5])
0 1
1 2
2 3
3 4
4 5
dtype: int64
>>> #Series No listing , There is only one overall name
>>> pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')
2015 Sales 30
2016 Sales 35
2017 Sales 40
Name: Product A, dtype: int64

Read 、 Save data file

>>> wine_reviews = pd.read_csv('D:\DOCUMENT\PRO\PYTHON\DataFrameTurtorial\winemag-data-130k-v2.csv')
>>> wine_reviews
Unnamed: 0 country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 0 Italy Aromas... Vulkà ... 87 NaN Sicily... Etna NaN Kerin ... @kerin... Nicosi... White ... Nicosia
1 1 Portugal This i... Avidagos 87 15.0 Douro NaN NaN Roger ... @vossr... Quinta... Portug... Quinta...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
129969 129969 France A dry ... NaN 90 32.0 Alsace Alsace NaN Roger ... @vossr... Domain... Pinot ... Domain...
129970 129970 France Big, r... Lieu-d... 90 21.0 Alsace Alsace NaN Roger ... @vossr... Domain... Gewürz... Domain...
[129971 rows x 14 columns]
>>> # Display the number of rows and columns
>>> wine_reviews.shape
(129971, 14)
>>> # Show the first and last lines
>>> wine_reviews.head() # Before default display 5 That's ok ,head(3) Before display 3 That's ok
Unnamed: 0 country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 0 Italy Aromas... Vulkà ... 87 NaN Sicily... Etna NaN Kerin ... @kerin... Nicosi... White ... Nicosia
1 1 Portugal This i... Avidagos 87 15.0 Douro NaN NaN Roger ... @vossr... Quinta... Portug... Quinta...
2 2 US Tart a... NaN 87 14.0 Oregon Willam... Willam... Paul G... @paulg... Rainst... Pinot ... Rainstorm
3 3 US Pineap... Reserv... 87 13.0 Michigan Lake M... NaN Alexan... NaN St. Ju... Riesling St. Ju...
4 4 US Much l... Vintne... 87 65.0 Oregon Willam... Willam... Paul G... @paulg... Sweet ... Pinot ... Sweet ...
>>> wine_reviews.tail() # After default display 5 That's ok ,tail(3) After the show 3 That's ok
Unnamed: 0 country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
129966 129966 Germany Notes ... Braune... 90 28.0 Mosel NaN NaN Anna L... NaN Dr. H.... Riesling Dr. H....
129967 129967 US Citati... NaN 90 75.0 Oregon Oregon Oregon... Paul G... @paulg... Citati... Pinot ... Citation
129968 129968 France Well-d... Kritt 90 30.0 Alsace Alsace NaN Roger ... @vossr... Domain... Gewürz... Domain...
129969 129969 France A dry ... NaN 90 32.0 Alsace Alsace NaN Roger ... @vossr... Domain... Pinot ... Domain...
129970 129970 France Big, r... Lieu-d... 90 21.0 Alsace Alsace NaN Roger ... @vossr... Domain... Gewürz... Domain...
>>> # The index can be specified during creation
>>> wine_reviews = pd.read_csv('D:\DOCUMENT\PRO\PYTHON\DataFrameTurtorial\winemag-data-130k-v2.csv', index_col=0)
>>> wine_reviews
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas... Vulkà ... 87 NaN Sicily... Etna NaN Kerin ... @kerin... Nicosi... White ... Nicosia
1 Portugal This i... Avidagos 87 15.0 Douro NaN NaN Roger ... @vossr... Quinta... Portug... Quinta...
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129969 France A dry ... NaN 90 32.0 Alsace Alsace NaN Roger ... @vossr... Domain... Pinot ... Domain...
129970 France Big, r... Lieu-d... 90 21.0 Alsace Alsace NaN Roger ... @vossr... Domain... Gewürz... Domain...
[129971 rows x 13 columns]
>>> # Save the data frame as csv file
>>> wine_reviews.to_csv(path)

Set the display format

# Show all columns ( Parameter set to None Represents the display of all columns , You can also set your own number )
pd.set_option('display.max_columns',None)
# Show all lines
pd.set_option('display.max_rows',None)
# Set the display length of the data , The default is 50
pd.set_option('max_colwidth',200)
# No word wrap ( Set to Flase Don't wrap ,True conversely )
pd.set_option('expand_frame_repr', False)

Column data alignment

# Column data alignment
>>> df # Before alignment
f10 f12 f14 f2 f23 f3 f8 f9
0 79 000001 Ping An Bank 2150 142 -129 30 1442
1 61 000002 ten thousand Families, A 3000 155 57 39 840
2 0 000003 PT Gold fields A 0 0 0 0 0
>>> pd.set_option('display.unicode.ambiguous_as_wide', True)
>>> df
f10 f12 f14 f2 f23 f3 f8 f9
0 79 000001 Ping An Bank 2150 142 -129 30 1442
1 61 000002 ten thousand Families, A 3000 155 57 39 840
2 0 000003 PT Gold fields A 0 0 0 0 0
>>> pd.set_option('display.unicode.east_asian_width', True) # It seems that this line alone can
>>> df # After alignment
f10 f12 f14 f2 f23 f3 f8 f9
0 79 000001 Ping An Bank 2150 142 -129 30 1442
1 61 000002 ten thousand Families, A 3000 155 57 39 840
2 0 000003 PT Gold fields A 0 0 0 0 0
>>> pd.set_option('display.width', 180) # Set print width (** important **)
>>> df
f10 f12 f14 f2 f23 f3 f8 f9
0 79 000001 Ping An Bank 2150 142 -129 30 1442
1 61 000002 ten thousand Families, A 3000 155 57 39 840
2 0 000003 PT Gold fields A 0 0 0 0 0

Indexes 、 choice

>>> # Import the data and set the maximum number of rows displayed to 5
>>> import pandas as pd
>>> reviews = pd.read_csv("winemag-data-130k-v2.csv", index_col=0)
>>> pd.set_option('max_rows', 5)
>>> reviews
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas... Vulkà ... 87 NaN Sicily... Etna NaN Kerin ... @kerin... Nicosi... White ... Nicosia
1 Portugal This i... Avidagos 87 15.0 Douro NaN NaN Roger ... @vossr... Quinta... Portug... Quinta...
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129969 France A dry ... NaN 90 32.0 Alsace Alsace NaN Roger ... @vossr... Domain... Pinot ... Domain...
129970 France Big, r... Lieu-d... 90 21.0 Alsace Alsace NaN Roger ... @vossr... Domain... Gewürz... Domain...
[129971 rows x 13 columns]

Take row

>>> reviews.iloc[0] # Access the first line with the line number , return Series
>>> reviews.iloc[[0]] # Access the first line with the line number , return DataFrame
>>> reviews.iloc[-5:] # After taking 5 That's ok , return DataFrame
>>> reviews.iloc[2:4] # The third and fourth lines
>>> reviews[2:4] # Take the third and fourth lines
>>> reviews.loc[[2,4]] # Take discontinuous multiple lines
country ... winery
2 US ... Rainstorm
4 US ... Sweet Cheeks
[2 rows x 13 columns]

Fetch

>>> # Take a column
>>> reviews['country'] # Access with column names , return Series
>>> reviews.country # use . visit , Not applicable if there are spaces in the name , return Series
>>> reviews.loc[:,'country'] # Same as reviews['country'], return Series
>>> reviews.iloc[:,0] # Or use the column position index to access , Select the first 1 Column , return Series
>>> reviews[['country']] # take country Column , return DataFrame
>>> reviews.loc[:,['country']] # Same as reviews[['country']]
>>> reviews.iloc[:,[0]] # Take the first place 0 Column , Access by location index , return DataFrame
country
0 Italy
1 Portugal
... ...
129969 France
129970 France
[129971 rows x 1 columns]
>>> # Take multiple columns , return DataFrame
>>> reviews.loc[:'country':'points'] # From country To points( contain points) All columns of ,country Column needs to be in points Before the column
>>> reviews.iloc[:,0:4] # Take the first place 0 Column to the first 4 Column ( The fourth column... Is not included ) All columns of
>>> reviews[['country','points']] # Take discontinuous country And points Column 、
>>> reviews.loc[:,['country','points']] # Same as reviews[['country','points']]
>>> reviews.iloc[:,[0,3]] # Take the discontinuous second 0 Column and column 2 Column
country points
0 Italy 87
1 Portugal 87
... ... ...
129969 France 90
129970 France 90
[129971 rows x 2 columns]

Simultaneous row and column retrieval

>>> # Take a single value , The return type is the type of value
>>> reviews['country'][0] # take country Column first value
>>> reviews.loc[0,'country'] # Take the first place 0 That's ok country The value of the column
>>> reviews.iloc[1,0] # Take the first line 0 The value of the column
>>> # Take multiple values
>>> reviews.iloc[1:4,0] # Select the first 2 Go to the first place 4 First column of row , Continuous index , return Series
>>> reviews.iloc[[1,3],0] # Select the first 2 Xing di 4 First column of row , Decentralized index , return Series
>>> reviews.iloc[[1,3],[0]] # Select the first 2 Xing di 4 First column of row , Decentralized index , return DataFrame
>>> reviews.iloc[[1,3],2:5] # Select the first 2 Xing di 4 OK, No 3 Column to the first 5 Column , Continuous index , return DataFrame
>>> reviews.iloc[[1,3],[2,5]] # Select the first 2 Xing di 4 OK, No 3 Column and the first 6 Column , Decentralized index , return DataFrame
>>> reviews.loc[1,['country','points']] # Select the first 2 Yes country And points Column , Decentralized index , return Series
>>> reviews.loc[[1],['country','points']] # Select the first 2 Yes country And points Column , Decentralized index , return DataFrame
>>> reviews.loc[[1,3],['country','points']] # Select the first 2 Lines and 4 Yes country And points Column , Decentralized index , return DataFrame
>>> reviews.loc[[1,3],'country':'points'] # Select the first 2 Lines and 4 Yes country To points Column , return DataFrame
country ... points
1 Portugal ... 87
3 US ... 87
[2 rows x 4 columns]

Set a column as a row index

>>> reviews.set_index('title')
country ... winery
title ...
Nicosia 2013 Vulkà Bianco (Etna) Italy ... Nicosia
Quinta dos Avidagos 2011 Avidagos Red (Douro) Portugal ... Quinta dos Avidagos
... ... ... ...
Domaine Marcel Deiss 2012 Pinot Gris (Alsace) France ... Domaine Marcel Deiss
Domaine Schoffit 2012 Lieu-dit Harth Cuvée Caro... France ... Domaine Schoffit
[129971 rows x 12 columns]

Conditional selection

>>> # Judge each line country Is it Italy, return Series
>>> reviews.country == 'Italy'
0 True
1 False
...
129969 False
129970 False
Name: country, Length: 129971, dtype: bool
>>> reviews[reviews.country == 'Italy'] # selection country by Italy The line of
>>> reviews.loc[reviews.country == 'Italy'] # ditto , It's not necessary loc
>>> reviews[reviews['points'] == 90] # selection points by 90 The line of
>>> reviews.loc[(reviews.country == 'Italy') & (reviews.points >= 90)] # selection country by Italy And points>=90 The line of
>>> reviews.loc[(reviews.country == 'Italy') | (reviews.points >= 90)] # selection country by Italy or points>=90 The line of
>>> reviews.loc[reviews.country.isin(['Italy', 'France'])] # selection country by Italy or France The line of
>>> reviews.loc[reviews.price.isnull()] # selection price Empty lines
>>> reviews.loc[reviews.price.notnull()] # selection price Is a non empty line
>>> reviews[reviews['description'].str.contains('great')] # selection description Column contains great The line of , Add a parameter when there is a null value and an error is reported na=False
#, namely reviews[reviews['description'].str.contains('great',na=False)]
>>> reviews[~reviews['description'].str.contains('great')] # selection description Column does not contain great The line of
country ... winery
0 Italy ... Nicosia
1 Portugal ... Quinta dos Avidagos
... ... ... ...
129969 France ... Domaine Marcel Deiss
129970 France ... Domaine Schoffit
[125196 rows x 13 columns]

Summarize functions and mappings

Import data

>>> # Import the data and set the maximum number of rows displayed to 5
>>> import pandas as pd
>>> reviews = pd.read_csv("winemag-data-130k-v2.csv", index_col=0)
>>> pd.set_option('max_rows', 5)
>>> reviews
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas... Vulkà ... 87 NaN Sicily... Etna NaN Kerin ... @kerin... Nicosi... White ... Nicosia
1 Portugal This i... Avidagos 87 15.0 Douro NaN NaN Roger ... @vossr... Quinta... Portug... Quinta...
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129969 France A dry ... NaN 90 32.0 Alsace Alsace NaN Roger ... @vossr... Domain... Pinot ... Domain...
129970 France Big, r... Lieu-d... 90 21.0 Alsace Alsace NaN Roger ... @vossr... Domain... Gewürz... Domain...
[129971 rows x 13 columns]

Aggregate function :describe(),unique(),value_counts()

>>> reviews.points.describe() # Returns the mean value of numeric column data 、 Properties such as maximum and minimum
count 129971...
mean 88.447138
...
75% 91.000000
max 100.00...
Name: points, Length: 8, dtype: float64
>>> reviews.taster_name.describe() # Returns some properties of the text column text
count 103727
unique 19
top Roger ...
freq 25514
Name: taster_name, dtype: object
>>> reviews.points.mean() # take points The average of the columns
>>> reviews.taster_name.unique() # see taster_name How many kinds of column data , Returns an array of
array(['Kerin O’Keefe', 'Roger Voss', 'Paul Gregutt',
'Alexander Peartree', 'Michael Schachner', 'Anna Lee C. Iijima',
'Virginie Boone', 'Matt Kettmann', nan, 'Sean P. Sullivan',
'Jim Gordon', 'Joe Czerwinski', 'Anne Krebiehl\xa0MW',
'Lauren Buzzeo', 'Mike DeSimone', 'Jeff Jenssen',
'Susan Kostrzewa', 'Carrie Dykes', 'Fiona Adams',
'Christina Pickard'], dtype=object)
>>> reviews.taster_name.value_counts() # see taster_name How many columns of data each
Roger Voss 25514
Michael Schachner 15134
...
Fiona Adams 27
Christina Pickard 6
Name: taster_name, Length: 19, dtype: int64

mapping :map() And apply()

>>> # Data column deduction average
>>> review_points_mean = reviews.points.mean()
>>> reviews.points-review_points_mean # return Series, The quickest way to deduct the average , The results are the same as below
>>> reviews.points.map(lambda p: p - review_points_mean) # return Series, Do not change the original DataFrame
0 -1.447138
1 -1.447138
...
129969 1.552862
129970 1.552862
Name: points, Length: 129971, dtype: float64
>>> #apply() Method to change the data row by row , slower
>>> def remean_points(row):
row.points = row.points - review_points_mean
return row
>>> reviews.apply(remean_points, axis='columns') # Return to new DataFrame, Do not change the original DataFrame
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas... Vulkà ... -1.447138 NaN Sicily... Etna NaN Kerin ... @kerin... Nicosi... White ... Nicosia
1 Portugal This i... Avidagos -1.447138 15.0 Douro NaN NaN Roger ... @vossr... Quinta... Portug... Quinta...
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129969 France A dry ... NaN 1.552862 32.0 Alsace Alsace NaN Roger ... @vossr... Domain... Pinot ... Domain...
129970 France Big, r... Lieu-d... 1.552862 21.0 Alsace Alsace NaN Roger ... @vossr... Domain... Gewürz... Domain...
[129971 rows x 13 columns]

Connect two columns of text

>>> reviews.country + " - " + reviews.region_1
0 Italy ...
1 NaN
...
129969 France...
129970 France...
Length: 129971, dtype: object

Grouping and sorting

Import data

>>> # Import the data and set the maximum number of rows displayed to 5
>>> import pandas as pd
>>> reviews = pd.read_csv("winemag-data-130k-v2.csv", index_col=0)
>>> pd.set_option('max_rows', 5)
>>> reviews
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas... Vulkà ... 87 NaN Sicily... Etna NaN Kerin ... @kerin... Nicosi... White ... Nicosia
1 Portugal This i... Avidagos 87 15.0 Douro NaN NaN Roger ... @vossr... Quinta... Portug... Quinta...
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129969 France A dry ... NaN 90 32.0 Alsace Alsace NaN Roger ... @vossr... Domain... Pinot ... Domain...
129970 France Big, r... Lieu-d... 90 21.0 Alsace Alsace NaN Roger ... @vossr... Domain... Gewürz... Domain...
[129971 rows x 13 columns]

Group analysis

>>> # Press points Different categories , And count each points How many lines , return Series
>>> reviews.groupby('points').size() # Method 1
>>> reviews.groupby('points').points.count() # Method 2
>>> reviews.points.value_counts() # Method 3, The order may be different , This method does not Series Name
88 17207
87 16933
...
99 33
100 19
Name: points, Length: 21, dtype: int64
>>> # Press points Different categories , And count each points The smallest species price
>>> reviews.groupby('points').price.min()
points
80 5.0
81 5.0
...
99 44.0
100 80.0
Name: price, Length: 21, dtype: float64
>>> # Press winery Different categories , View the first line of each class title
>>> reviews.groupby('winery').apply(lambda df: df.title.iloc[0])
winery
1+1=3 1+1=3 ...
10 Knots 10 Kno...
...
àMaurice àMauri...
Štoka Štoka ...
Length: 16757, dtype: object
>>> # According to multiple columns (country,province) classification , And find out in each category points The highest line
>>> reviews.groupby(['country', 'province']).apply(lambda df: df.loc[df.points.idxmax()])
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
country province
Argentina Mendoza... Argentina If the... Nicasi... 97 120.0 Mendoz... Mendoza NaN Michae... @wines... Bodega... Malbec Bodega...
Other Argentina Take n... Reserva 95 90.0 Other Salta NaN Michae... @wines... Colomé... Malbec Colomé
... ... ... ... ... ... ... ... ... ... ... ... ... ...
Uruguay San Jose Uruguay Baked,... El Pre... 87 50.0 San Jose NaN NaN Michae... @wines... Castil... Red Blend Castil...
Uruguay Uruguay Cherry... Blend ... 91 22.0 Uruguay NaN NaN Michae... @wines... Narbon... Tannat... Narbona
[425 rows x 13 columns]
>>> # Press country classification , And count the number of rows in each category and the maximum and minimum values of each category
>>> reviews.groupby(['country']).price.agg([len, min, max])
len min max
country
Argentina 3800.0 4.0 230.0
Armenia 2.0 14.0 15.0
... ... ... ...
Ukraine 14.0 6.0 13.0
Uruguay 109.0 10.0 130.0
[43 rows x 3 columns]

Multi level index

>>> # Press country and province classification , See how many each category has
>>> countries_reviewed = reviews.groupby(['country', 'province']).description.agg([len])
>>> countries_reviewed
len
country province
Argentina Mendoza... 3264
Other 536
... ...
Uruguay San Jose 3
Uruguay 24
[425 rows x 1 columns]
>>> # View data type
>>> mi = countries_reviewed.index
>>> type(mi)
<class 'pandas.core.indexes.multi.MultiIndex'>
>>> # Reset row index
>>> countries_reviewed.reset_index()
country province len
0 Argentina Mendoz... 3264
1 Argentina Other 536
.. ... ... ...
423 Uruguay San Jose 3
424 Uruguay Uruguay 24
[425 rows x 3 columns]

Sort

>>> # The classification results are classified according to the number of categories ( value ) Sort
>>> countries_reviewed = countries_reviewed.reset_index()
>>> countries_reviewed.sort_values(by='len') # Ascending ( Default )
>>> countries_reviewed.sort_values(by='len', ascending=False) # Descending
country province len
392 US Califo... 36247
415 US Washin... 8639
.. ... ... ...
63 Chile Coelemu 1
149 Greece Beotia 1
[425 rows x 3 columns]
>>> # The classification results are sorted by index in ascending order
>>> countries_reviewed.sort_index()
country province len
0 Argentina Mendoz... 3264
1 Argentina Other 536
.. ... ... ...
423 Uruguay San Jose 3
424 Uruguay Uruguay 24
[425 rows x 3 columns]
>>> # Multi condition sorting of classification results
>>> countries_reviewed.sort_values(by=['country', 'len'])
country province len
1 Argentina Other 536
0 Argentina Mendoz... 3264
.. ... ... ...
424 Uruguay Uruguay 24
419 Uruguay Canelones 43
[425 rows x 3 columns]

Data type and missing value processing

Import data

>>> # Import the data and set the maximum number of rows displayed to 5
>>> import pandas as pd
>>> reviews = pd.read_csv("winemag-data-130k-v2.csv", index_col=0)
>>> pd.set_option('max_rows', 5)
>>> reviews
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas... Vulkà ... 87 NaN Sicily... Etna NaN Kerin ... @kerin... Nicosi... White ... Nicosia
1 Portugal This i... Avidagos 87 15.0 Douro NaN NaN Roger ... @vossr... Quinta... Portug... Quinta...
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129969 France A dry ... NaN 90 32.0 Alsace Alsace NaN Roger ... @vossr... Domain... Pinot ... Domain...
129970 France Big, r... Lieu-d... 90 21.0 Alsace Alsace NaN Roger ... @vossr... Domain... Gewürz... Domain...
[129971 rows x 13 columns]

And data types

>>> # View column data types
>>> reviews.index.dtype # View index column data types ,dtype('int64')
>>> reviews.price.dtype # see price Column data type ,dtype('float64')
>>> reviews.dtypes
country object
description object
...
variety object
winery object
Length: 13, dtype: object
>>> # Convert data type
>>> reviews.points.astype('float64')
0 87.0
1 87.0
...
129969 90.0
129970 90.0
Name: points, Length: 129971, dtype: float64
>>> # Convert column data to a list
>>> reviews['country'].tolist()
>>> # or
>>> list(reviews['country'])

Missing value processing

Missing values are marked as NaN(Not a Number), Always for float64 type

>>> # choice country The missing line
>>> reviews[pd.isnull(reviews.country)] # The reverse is notnull
country ... winery
913 NaN ... Gotsa Family Wines
3131 NaN ... Barton & Guestier
... ... ... ...
129590 NaN ... Büyülübağ
129900 NaN ... Psagot
[63 rows x 13 columns]
>>> # use Unknown fill region_2 Missing line
>>> reviews.region_2.fillna("Unknown")
0 Unknown
1 Unknown
...
129969 Unknown
129970 Unknown
Name: region_2, Length: 129971, dtype: object
>>> # Value substitution
>>> reviews.taster_twitter_handle.replace("@kerinokeefe", "@kerino") # take taster_twitter_handle In column @kerinokeefe Replace with @kerino
0 @kerino
1 @vossroger
...
129969 @vossroger
129970 @vossroger
Name: taster_twitter_handle, Length: 129971, dtype: object

rename 、 Change column order 、 Add a new row

Import data

>>> # Import the data and set the maximum number of rows displayed to 5
>>> import pandas as pd
>>> reviews = pd.read_csv("winemag-data-130k-v2.csv", index_col=0)
>>> pd.set_option('max_rows', 5)
>>> reviews
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas... Vulkà ... 87 NaN Sicily... Etna NaN Kerin ... @kerin... Nicosi... White ... Nicosia
1 Portugal This i... Avidagos 87 15.0 Douro NaN NaN Roger ... @vossr... Quinta... Portug... Quinta...
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129969 France A dry ... NaN 90 32.0 Alsace Alsace NaN Roger ... @vossr... Domain... Pinot ... Domain...
129970 France Big, r... Lieu-d... 90 21.0 Alsace Alsace NaN Roger ... @vossr... Domain... Gewürz... Domain...
[129971 rows x 13 columns]

Rename column and row names

>>> # Rename column name
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> # Method 1: Set column names in dictionary format
>>> df.rename(columns={"A": "a", "B": "c"},inplace=True)
>>># or
>>>df.rename(columns=dict(A='a',B='b'),inplace=True)
>>> # Method 2: Set up columns, List all in the original order
>>> df.columns=['a','c']
>>> df
a c
0 1 4
1 2 5
2 3 6
>>> # Rename row index
>>> df.rename(index={0: "x", 1: "y", 2: "z"},inplace=True)
>>> df.rename({0: "x", 1: "y", 2: "z"}, axis='index',inplace=True)
>>> df
a c
x 1 4
y 2 5
z 3 6
# Named index column name
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> df.rename_axis('number',axis='rows')
A B
number
0 1 4
1 2 5
2 3 6

Change column order

>>> # Method 1: Set the column name order list and apply
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> newColumnOrderList=['B','A']
>>> df=df[newColumnOrderList]
>>> # Method 2: Take out a column , Delete the original column , Insert the extracted column in the new location
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> df_B=df.B
>>> df.drop('B',axis=1,inplace=True)
>>> df.insert(0,'new_B',df_B) # When you insert a new column, you can also rename the new column
>>> df
new_B A
0 4 1
1 5 2
2 6 3

Add new column

>>> reviews['critic'] = 'everyone' # newly build critic Column and assign values everyone, The operation is similar to the dictionary
>>> reviews['critic']
0 everyone
1 everyone
...
129969 everyone
129970 everyone
Name: critic, Length: 129971, dtype: object
# Iterative assignment
>>> reviews['index_backwards'] = range(len(reviews), 0, -1) # Iterative assignment
>>> reviews['index_backwards'] = list(range(len(reviews), 0, -1))
>>> reviews['index_backwards']
0 129971
1 129970
...
129969 2
129970 1
Name: index_backwards, Length: 129971, dtype: int64

Data frame merge

Data frames with the same column names are spliced vertically

>>> # Data frames with the same column names are spliced vertically
>>> df1 = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
>>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('BA'))
>>> # Method 1:append() Method
>>> # Splice two DataFrame
>>> df1.append(df2,ignore_index=True) # Generate a new DataFrame, Don't change the original DataFrame.ignore_index Set to True Used to reset row index
A B
0 1 2
1 3 4
2 6 5
3 8 7
>>> # Splicing multiple Series
>>> s1 = pd.Series(['a', 'b'])
>>> s2 = pd.Series(['c', 'd'])
>>> pd.concat([s1,s2],ignore_index=True) #
0 a
1 b
2 c
3 d
dtype: object
>>> # Method 2:caoncat() Method , You can splice multiple columns with the same name DataFrame
>>> df1 = pd.DataFrame([['a', 1], ['b', 2]],columns=['letter', 'number'])
>>> df2 = pd.DataFrame([['c', 3], ['d', 4]],columns=['letter', 'number'])
>>> pd.concat([df1, df2])
letter number
0 a 1
1 b 2
0 c 3
1 d 4

Horizontal splicing or fusion of data frames

>>> df1 = pd.DataFrame([['a', 1], ['b', 2]],columns=['letter', 'number'])
>>> df4 = pd.DataFrame([['bird', 'polly'], ['monkey', 'george']],columns=['animal', 'name'])
>>> # Method 1:pd.concat() Method
>>> pd.concat([df1,df4],axis=1)
letter number animal name
0 a 1 bird polly
1 b 2 monkey george
>>> # Method 2:join() Method
>>> df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
>>> df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2'],'B': ['B0', 'B1', 'B2']})
>>> # Merge two data frames horizontally , Keep all column names
>>> df1.join(df2, lsuffix='_df1', rsuffix='_df2')
key_df1 A key_df2 B
0 K0 A0 K0 B0
1 K1 A1 K1 B1
2 K2 A2 K2 B2
3 K3 A3 NaN NaN
4 K4 A4 NaN NaN
5 K5 A5 NaN NaN
>>> # If you want to use key Columns fuse two data frames , Need to key Set the column as the index of two data frames
>>> df1.set_index('key').join(df2.set_index('key'))
A B
key
K0 A0 B0
K1 A1 B1
K2 A2 B2
K3 A3 NaN
K4 A4 NaN
K5 A5 NaN
>>> # Settings can also be used on Parameter method fuses two data frames
>>> df1.join(df2.set_index('key'), on='key')
key A B
0 K0 A0 B0
1 K1 A1 B1
2 K2 A2 B2
3 K3 A3 NaN
4 K4 A4 NaN
5 K5 A5 NaN

Besides , There are other ways to merge df.merge() etc. .


  1. 上一篇文章:
  2. 下一篇文章:
Copyright © 程式師世界 All Rights Reserved