
Reading files with pandas


Reading files
- Reading csv and txt files with read_csv()
- Reading Excel files with read_excel()
  - Underlying libraries
  - Call signature
  - Parameters
  - Examples

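Reading files

Reading csv and txt files with read_csv()

About CSV files:

CSV (comma-separated values) is a simple file format that lays out tabular data in a fixed structure.
A CSV file stores tabular data, such as a spreadsheet or a database export, as plain text, which makes it a common format for exchanging data.
A CSV file can be opened in Excel, where its rows and columns map onto the spreadsheet grid.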

Usage:

pandas.read_csv(filepath_or_buffer, sep=",", header=0, names=["col1", "col2", "col3"], encoding="utf-8", usecols=[1, 2, 3])
filepath_or_buffer: path to the file.
sep: the separator used in the source file.
header: row number to use for the column names. When names is not passed, the default behaves like header=0, so the first line becomes the column names; header=None means the file has no header line.
names: column names to assign, or to rename to. After names is given, columns can still be selected by the positions 0, 1, 2, ...; positions start at 0, and 0 means the first column of the original data.
usecols: positions or names of the columns to read. Slice notation is not accepted. Positions also start at 0, with 0 meaning the first column.
For the other parameters, see read_excel below.
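Since read_csv() is also used for txt files, the sep parameter is usually the only thing that changes between the two. The sketch below illustrates this with two hypothetical in-memory texts (the column names are borrowed from the weather data, the values are made up) rather than real files.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-ins for a .csv file and a tab-delimited .txt file.
csv_text = "ymd,bWendu,yWendu\n2018-01-01,3,-6\n2018-01-02,2,-5\n"
txt_text = "ymd\tbWendu\tyWendu\n2018-01-01\t3\t-6\n2018-01-02\t2\t-5\n"

df_csv = pd.read_csv(io.StringIO(csv_text), sep=",")
df_txt = pd.read_csv(io.StringIO(txt_text), sep="\t")

# Both calls produce the same table; only the separator differs.
print(df_csv.equals(df_txt))  # True
print(df_txt)
```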

import pandas as pd

# Raw strings keep the backslashes in the Windows path literal.
df = pd.read_csv(r".\study\weather.txt", sep=",")
print("df------\n", df.head())
# Select columns by position. names is passed without header=0,
# so the file's own header line is read as the first data row.
df1 = pd.read_csv(r".\study\weather.txt", sep=",", names=["a", "b", "c", "d", "e", "f", "g", "h", "i"], usecols=[1, 2, 3, 4, 5])
print("df1------\n", df1.head())
# Select columns by name.
df2 = pd.read_csv(r".\study\weather.txt", sep=",", names=["a", "b", "c", "d", "e", "f", "g", "h", "i"], usecols=["b", "d", "e"])
print("df2------\n", df2.head())

df------
ymd bWendu yWendu tianqi fengxiang fengli aqi aqiInfo aqiLevel
0 2018-01-01 3℃ -6℃ 晴~多雲 東北風 1-2級 59 良 2
1 2018-01-02 2℃ -5℃ 陰~多雲 東北風 1-2級 49 優 1
2 2018-01-03 2℃ -5℃ 多雲 北風 1-2級 28 優 1
3 2018-01-04 0℃ -8℃ 陰 東北風 1-2級 28 優 1
4 2018-01-05 3℃ -6℃ 多雲~晴 西北風 1-2級 50 優 1
df1------
b c d e f
0 bWendu yWendu tianqi fengxiang fengli
1 3℃ -6℃ 晴~多雲 東北風 1-2級
2 2℃ -5℃ 陰~多雲 東北風 1-2級
3 2℃ -5℃ 多雲 北風 1-2級
4 0℃ -8℃ 陰 東北風 1-2級
df2------
b d e
0 bWendu tianqi fengxiang
1 3℃ 晴~多雲 東北風
2 2℃ 陰~多雲 東北風
3 2℃ 多雲 北風
4 0℃ 陰 東北風
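In the df1 and df2 output above, the file's own header line (bWendu, yWendu, ...) appears as row 0. This happens because names was passed without header=0, so pandas treats every line as data. A minimal sketch of the difference, using a hypothetical three-column excerpt held in memory instead of weather.txt:

```python
import io
import pandas as pd

# Hypothetical three-column excerpt standing in for weather.txt.
text = "ymd,bWendu,tianqi\n2018-01-01,3℃,晴~多雲\n2018-01-02,2℃,陰~多雲\n"

# names alone: the file's header line is kept as the first data row.
kept = pd.read_csv(io.StringIO(text), names=["a", "b", "c"])

# names plus header=0: the file's header line is discarded and replaced.
replaced = pd.read_csv(io.StringIO(text), names=["a", "b", "c"], header=0)

print(kept.iloc[0].tolist())      # ['ymd', 'bWendu', 'tianqi']
print(replaced.iloc[0].tolist())  # ['2018-01-01', '3℃', '晴~多雲']
```

Passing header=0 together with names is the usual way to rename columns while dropping the original header line.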

Reading Excel files with read_excel()

Underlying libraries

pandas.read_excel() relies on the openpyxl and xlrd libraries internally.
If Excel handling is all you need, you can also use openpyxl directly.
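As a rough illustration of using openpyxl on its own, the sketch below builds a tiny workbook in memory and reads it back without pandas. The sheet contents are hypothetical (modeled on the student sheet used later), and an in-memory buffer stands in for a real .xlsx file.

```python
import io
from openpyxl import Workbook, load_workbook

# Build a small workbook in memory (hypothetical data).
wb = Workbook()
ws = wb.active
ws.title = "student"
ws.append(["name", "age"])
ws.append(["劉一", 18])
ws.append(["花二", 40])

buffer = io.BytesIO()
wb.save(buffer)
buffer.seek(0)

# Read it back without pandas: each row comes back as a tuple of cell values.
rows = list(load_workbook(buffer)["student"].iter_rows(values_only=True))
print(rows)
```

This returns plain Python tuples; read_excel() adds header handling, type inference and the DataFrame structure on top.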

Call signature

pandas.read_excel(
io,
sheet_name=0,
header=0,
names=None,
index_col=None,
usecols=None,
skiprows=None,
nrows=None
)

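Parameters

io: path to the file.
sheet_name: which sheet or sheets to read.
- The default is 0. Positions start at 0, so this is the first sheet. Returns a DataFrame.
- sheet_name=1: the second sheet, as a DataFrame.
- sheet_name="Sheet1": the sheet named "Sheet1", as a DataFrame.
- sheet_name=[1, 2, "Sheet3"]: the second and third sheets plus the sheet named "Sheet3". Returns a dict whose keys are 1, 2 and "Sheet3".
- None: all sheets. Returns a dict keyed by sheet name.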
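The return types listed above can be checked with a small round trip. This sketch writes a hypothetical two-sheet workbook to memory (it assumes openpyxl is installed) and reads it back with three different sheet_name values; the helper name read is made up for the example.

```python
import io
import pandas as pd

# Hypothetical two-sheet workbook written to memory (requires openpyxl).
student = pd.DataFrame({"name": ["劉一", "花二"], "age": [18, 40]})
vegetables = pd.DataFrame({"菜名": ["spinach", "cucumber"], "單價": [4.5, 5.0]})

buffer = io.BytesIO()
with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
    student.to_excel(writer, sheet_name="student", index=False)
    vegetables.to_excel(writer, sheet_name="vegetables", index=False)
data = buffer.getvalue()


def read(sheet_name):
    # Hypothetical helper: a fresh buffer for each call, so every read starts at the beginning.
    return pd.read_excel(io.BytesIO(data), sheet_name=sheet_name)


by_position = read(1)             # a single DataFrame (the second sheet)
mixed = read([0, "vegetables"])   # dict keyed by 0 and "vegetables"
everything = read(None)           # dict keyed by sheet name

print(type(by_position).__name__)  # DataFrame
print(list(mixed.keys()))          # [0, 'vegetables']
print(list(everything.keys()))     # ['student', 'vegetables']
```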

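header: which row to use as the header; accepts an int or a list of int.
The default is 0, so the first row is used as the header. header=None means the source has no header row to use, and pandas labels the columns 0, 1, 2, 3, ...
names: custom header names, passed as a list.
index_col: the column or columns to use as the row index; accepts an int or a list of int. The default is None, in which case pandas generates the integers 0, 1, 2, 3, ... as the row labels.
If a list is passed, the row index becomes a MultiIndex.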
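To make the single-column and list forms of index_col concrete, here is a small sketch. It uses read_csv() on in-memory text as a stand-in, on the assumption that index_col behaves the same way there, so no Excel file is needed; the data is a hypothetical excerpt of the student sheet.

```python
import io
import pandas as pd

# Hypothetical excerpt of the student sheet, held in memory as CSV text.
text = "name,sex,address,score\n劉一,女,上海,100\n花二,男,上海,99\n張三,男,北京,80\n"

# A single column as the row index.
single = pd.read_csv(io.StringIO(text), index_col=0)

# A list of columns gives a MultiIndex (multi-level row index).
multi = pd.read_csv(io.StringIO(text), index_col=[2, 1])

print(single.loc["張三", "score"])         # 80
print(list(multi.index.names))             # ['address', 'sex']
print(multi.loc[("上海", "男"), "name"])    # 花二
```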

usecols: the columns to parse; accepts a str, a list-like, or a callable. The default is None.
- If None, all columns are parsed.
- If str, a comma-separated list of Excel column letters and column ranges (e.g. "A:E" or "A,C,E:F"). Ranges include both ends.
- If list of int, the positions of the columns to parse.
- If list of str, the names of the columns to parse.
- If callable, it is called with each column name, and the column is parsed when it returns True.
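The list and callable forms of usecols can be compared side by side. The sketch below again uses read_csv() on hypothetical in-memory text as a stand-in; the Excel letter-range string form is specific to read_excel() and is not shown here.

```python
import io
import pandas as pd

# Hypothetical excerpt of the student sheet, held in memory as CSV text.
text = "name,age,sex,address,score\n劉一,18,女,上海,100\n花二,40,男,上海,99\n"

by_position = pd.read_csv(io.StringIO(text), usecols=[0, 2])
by_name = pd.read_csv(io.StringIO(text), usecols=["name", "sex"])
# The callable receives each column name and returns True to keep the column.
by_callable = pd.read_csv(io.StringIO(text), usecols=lambda col: col in {"name", "sex"})

print(list(by_position.columns))  # ['name', 'sex']
print(by_position.equals(by_name) and by_name.equals(by_callable))  # True
```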

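dtype: data type for the columns, e.g. {"a": "float64"}. The default is None, which leaves the inferred types unchanged.
skiprows: rows to skip (optional); accepts a list-like, an int, or a callable.
- An int is the number of rows to skip at the start: 1 skips the first row. A list holds 0-based row numbers to skip.
nrows: number of rows to read, typically used with large files. An int; the default is None, which reads everything.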
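The list and callable forms of skiprows are easy to mix up, so a short sketch may help. It uses read_csv() on hypothetical in-memory text as a stand-in, assuming skiprows, nrows and dtype behave the same way as in read_excel().

```python
import io
import pandas as pd

# Hypothetical excerpt of the student sheet, held in memory as CSV text.
text = (
    "name,age,score\n"
    "劉一,18,100\n"
    "花二,40,99\n"
    "張三,25,80\n"
    "李四,30,40\n"
)

# List form: 0-based line numbers, where line 0 is the header line.
skip_list = pd.read_csv(io.StringIO(text), skiprows=[1, 2])

# Callable form: called with each line number; returning True skips that line.
skip_odd = pd.read_csv(io.StringIO(text), skiprows=lambda i: i % 2 == 1)

# nrows counts data rows after the header; dtype overrides the inferred type.
head = pd.read_csv(io.StringIO(text), nrows=2, dtype={"age": "float64"})

print(skip_list["name"].tolist())  # ['張三', '李四']
print(skip_odd["name"].tolist())   # ['花二', '李四']
print(head["age"].tolist())        # [18.0, 40.0]
```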

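converters: functions applied to the values of specific columns, passed as a dict mapping columns to functions. It can be combined with usecols.
- Keys are column names or column positions; values are functions, either named functions or Python lambda expressions.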
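One common use of converters is keeping leading zeros that type inference would otherwise drop. The sketch below shows the idea with read_csv() on hypothetical in-memory text. Note one assumption that differs from Excel: in read_csv() each converter receives the cell as a string, so the score converter calls int() first.

```python
import io
import pandas as pd

# Hypothetical data: the "code" column has leading zeros.
text = "code,score\n010101,100\n020202,99\n"

inferred = pd.read_csv(io.StringIO(text))
kept = pd.read_csv(
    io.StringIO(text),
    converters={
        "code": str,                       # keep the raw text, zeros included
        "score": lambda x: int(x) + 1000,  # called once per cell value
    },
)

print(inferred["code"].tolist())  # [10101, 20202]
print(kept["code"].tolist())      # ['010101', '020202']
print(kept["score"].tolist())     # [1100, 1099]
```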

Examples

The source Excel file has two sheets: the first is named student and the second vegetables.

# sheet_name demo
import pandas as pd
# sheet_name=None returns a dict: keys are sheet names, values are DataFrames
df1 = pd.read_excel(r".\study\test_excel.xlsx", sheet_name=None)
print("df1---------------\n", df1, type(df1["student"]))
# With the default sheet_name, the first sheet is read and a DataFrame is returned
df2 = pd.read_excel(r".\study\test_excel.xlsx")
print("df2---------------\n", df2)
# Sheets requested by position are keyed by that number; sheets requested by name are keyed by the name. 0 is the first sheet
df3 = pd.read_excel(r".\study\test_excel.xlsx", sheet_name=[0,"vegetables"])
print("df3---------------\n", df3)

df1---------------
{'student': name age sex address score
0 劉一 18 女 上海 100
1 花二 40 男 上海 99
2 張三 25 男 北京 80
3 李四 30 男 西安 40
4 王五 70 男 青島 70
5 孫六 65 女 泰州 90, 'vegetables': 序號 菜名 單價 產地
0 1 spinach 4.5 崇明
1 2 cucumber 5.0 奉賢
2 3 tomato 6.0 惠南
3 4 green bean 8.0 金山} <class 'pandas.core.frame.DataFrame'>
df2---------------
name age sex address score
0 劉一 18 女 上海 100
1 花二 40 男 上海 99
2 張三 25 男 北京 80
3 李四 30 男 西安 40
4 王五 70 男 青島 70
5 孫六 65 女 泰州 90
df3---------------
{0: name age sex address score
0 劉一 18 女 上海 100
1 花二 40 男 上海 99
2 張三 25 男 北京 80
3 李四 30 男 西安 40
4 王五 70 男 青島 70
5 孫六 65 女 泰州 90, 'vegetables': 序號 菜名 單價 產地
0 1 spinach 4.5 崇明
1 2 cucumber 5.0 奉賢
2 3 tomato 6.0 惠南
3 4 green bean 8.0 金山}

# header and names demo
# header: which row to use as the header; accepts an int or a list of int. The default is 0 (the first row). header=None means no header row is used, and pandas labels the columns 0, 1, 2, 3, ...
# names: custom header names, passed as a list.
import pandas as pd
# header=[0, 1]: rows 1 and 2 together form the header
df1 = pd.read_excel(r".\study\test_excel.xlsx", header=[0,1])
print("df1---------------\n", df1)
# header=None: no header row is used; pandas labels the columns 0, 1, 2, 3, ...
df2 = pd.read_excel(r".\study\test_excel.xlsx", header=None)
print("df2---------------\n", df2)
# names: custom header names; the number of names must match the number of columns in the sheet
df3 = pd.read_excel(r".\study\test_excel.xlsx", names=["a", "b", "c", "d", "e"])
print("df3---------------\n", df3)

df1---------------
name age sex address score
劉一 18 女 上海 100
0 花二 40 男 上海 99
1 張三 25 男 北京 80
2 李四 30 男 西安 40
3 王五 70 男 青島 70
4 孫六 65 女 泰州 90
df2---------------
0 1 2 3 4
0 name age sex address score
1 劉一 18 女 上海 100
2 花二 40 男 上海 99
3 張三 25 男 北京 80
4 李四 30 男 西安 40
5 王五 70 男 青島 70
6 孫六 65 女 泰州 90
df3---------------
a b c d e
0 劉一 18 女 上海 100
1 花二 40 男 上海 99
2 張三 25 男 北京 80
3 李四 30 男 西安 40
4 王五 70 男 青島 70
5 孫六 65 女 泰州 90

# index_col, skiprows, nrows and usecols demo
import pandas as pd
# index_col=[0]: use the first column as the row index
df1 = pd.read_excel(r".\study\test_excel.xlsx", index_col=[0])
print("df1____1---------\n", df1)
# Rows can now be selected by the labels taken from the first column
print("df1____2---------\n", df1.loc[["劉一", "李四"], "address"])
# index_col defaults to None: the row index is the generated integers 0, 1, 2, 3, ...
df2 = pd.read_excel(r".\study\test_excel.xlsx")
print("df2---------\n", df2)
# usecols=[0, 2]: read the 1st and 3rd columns, with the default integer row index
df4 = pd.read_excel(r".\study\test_excel.xlsx", usecols=[0,2])
print("df4---------\n", df4)
# usecols as an Excel-style range string, selecting several columns
df5 = pd.read_excel(r".\study\test_excel.xlsx", usecols="A:C,E")
print("df5---------\n", df5)
# Read the first 4 columns and skip the 1st and 2nd data rows. The skiprows list is 0-based, and row 0 is the header row
df6 = pd.read_excel(r".\study\test_excel.xlsx", usecols=[0,1,2,3], skiprows=[1,2])
print("df6---------\n", df6)
# Read the 1st and 4th columns, first 2 data rows only
df7 = pd.read_excel(r".\study\test_excel.xlsx", usecols=[0,3], nrows=2)
print("df7---------\n", df7)

df1____1---------
age sex address score
name
劉一 18 女 上海 100
花二 40 男 上海 99
張三 25 男 北京 80
李四 30 男 西安 40
王五 70 男 青島 70
孫六 65 女 泰州 90
df1____2---------
name
劉一 上海
李四 西安
Name: address, dtype: object
df2---------
name age sex address score
0 劉一 18 女 上海 100
1 花二 40 男 上海 99
2 張三 25 男 北京 80
3 李四 30 男 西安 40
4 王五 70 男 青島 70
5 孫六 65 女 泰州 90
df4---------
name sex
0 劉一 女
1 花二 男
2 張三 男
3 李四 男
4 王五 男
5 孫六 女
df5---------
name age sex score
0 劉一 18 女 100
1 花二 40 男 99
2 張三 25 男 80
3 李四 30 男 40
4 王五 70 男 70
5 孫六 65 女 90
df6---------
name age sex address
0 張三 25 男 北京
1 李四 30 男 西安
2 王五 70 男 青島
3 孫六 65 女 泰州
df7---------
name address
0 劉一 上海
1 花二 上海

# dtype demo
import pandas as pd
df1 = pd.read_excel(
    r".\study\test_excel.xlsx",
    names=["a", "b", "c", "d", "e"],
    dtype={"b": "float64", "e": "float64"},  # columns b and e are read as float64
)
print(df1)

 a b c d e
0 劉一 18.0 女 上海 100.0
1 花二 40.0 男 上海 99.0
2 張三 25.0 男 北京 80.0
3 李四 30.0 男 西安 40.0
4 王五 70.0 男 青島 70.0
5 孫六 65.0 女 泰州 90.0

# converters demo
# When values in a column have leading zeros (e.g. 010101), the DataFrame returned by pd.read_excel() can treat the column as int, so 010101 becomes 10101.
# To keep the data intact in that case, read the column as str.
# A converter is called once per cell, with the cell value as its argument.
import pandas as pd

df1 = pd.read_excel(
    r".\study\test_excel.xlsx",
    usecols=[0, 2, 4],  # the 1st, 3rd and 5th columns of the sheet
    converters={
        0: lambda x: x + "同學",  # 0 is the first entry of usecols=[0, 2, 4], i.e. name
        "sex": lambda x: x + "孩子",  # sex is column 2 of the sheet
        "score": lambda x: x + 1000,  # score is column 4 of the sheet
    },
)
print("df1---------\n", df1)
print(type(df1.loc[1, "score"]))


def join_str(value):
    # Add 100 to ints; append a suffix to anything else (assumed to be str)
    if isinstance(value, int):
        result = value + 100
    else:
        result = value + "乖寶"
    return result


df2 = pd.read_excel(
    r".\study\test_excel.xlsx",
    usecols=[0, 2, 4],  # the 1st, 3rd and 5th columns of the sheet
    converters={
        0: join_str,
        1: join_str,
        2: str,  # read as str; per the output below, 2 maps to column 4 of the sheet (score)
    },
)
print("df2---------\n", df2)
print(type(df2.loc[1, "score"]))

df1---------
name sex score
0 劉一同學 女孩子 1100
1 花二同學 男孩子 1099
2 張三同學 男孩子 1080
3 李四同學 男孩子 1040
4 王五同學 男孩子 1070
5 孫六同學 女孩子 1090
<class 'numpy.int64'>
df2---------
name sex score
0 劉一乖寶 女乖寶 100
1 花二乖寶 男乖寶 99
2 張三乖寶 男乖寶 80
3 李四乖寶 男乖寶 40
4 王五乖寶 男乖寶 70
5 孫六乖寶 女乖寶 90
<class 'str'>