
Pandas statistical analysis: date data processing, time series, downsampling, upsampling, merging multiple Excel tables, stock market data analysis, and fixing garbled Chinese characters

  • This blog post is adapted from *Python Data Analysis from Beginner to Proficient*, edited by Mingri Technology (明日科技)

4.8 Date data processing

4.8.1 DataFrame Date data conversion

  • In daily work, one troublesome issue is that dates can be written in many different formats: the same date, February 14, 2020, may appear in many forms, as shown in Figure 4.57. These formats must be unified before any follow-up work can proceed. Pandas provides the to_datetime() method to solve this problem.
  • The to_datetime() method batch-converts date data, which is practical and convenient when processing big data, and it can convert dates into whatever format you need, for example turning 2/14/20 and 14-2-2020 into the date format 2020-02-14. The syntax of the to_datetime() method is as follows:
pandas.to_datetime(arg, errors='raise', dayfirst=False, yearfirst=False, utc=None, box=True, format=None, exact=True, unit=None, infer_datetime_format=False, origin='unix', cache=False)
  • arg: string, datetime, or array of strings
  • errors: one of ignore, raise or coerce; the default in current pandas is raise:
    – ignore: invalid parsing returns the original value
    – raise: invalid parsing raises an exception
    – coerce: invalid parsing is set to NaT, i.e. data that cannot be converted to a date becomes NaT
  • dayfirst: whether the day comes first; Boolean, default False. For example, for 02/09/2020, dayfirst=True parses the first component as the day, giving 2020-09-02; dayfirst=False keeps the original order, giving 2020-02-09
  • yearfirst: whether the year comes first; Boolean, default False. For example, for 14-Feb-20, yearfirst=True parses the first component as the year, giving 2014-02-20; yearfirst=False gives 2020-02-14
  • utc: Boolean, default None; if True, returns UTC (Coordinated Universal Time)
  • box: Boolean, default True. True returns a DatetimeIndex; False returns an ndarray (this parameter has been removed in recent pandas versions)
  • format: string used to parse the time, default None
  • exact: Boolean, default True. If True, the format must match exactly; if False, the format may match anywhere in the target string
  • unit: the unit of the argument (D, s, ms, us, ns), default None
  • infer_datetime_format: default False. If no format is given, try to infer it from the first datetime string
  • origin: default unix. Defines the reference date; values are parsed as a number of units since this date
  • cache: default False. If True, a cache of unique converted dates is used, which can significantly speed up parsing of duplicate date strings, especially ones with time-zone offsets. The cache is only used when there are at least 50 values; out-of-range values make the cache unusable and may slow parsing down
  • Return value: datetime
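As a quick, self-contained sketch of the errors and dayfirst parameters described above (the sample strings are made up):

```python
import pandas as pd

# errors='coerce' turns unparseable values into NaT instead of raising
s = pd.to_datetime(pd.Series(['2020-02-14', 'not a date']), errors='coerce')
print(s.isna().tolist())  # [False, True]

# dayfirst=True parses '02/09/2020' as September 2 rather than February 9
d = pd.to_datetime('02/09/2020', dayfirst=True)
print(d.date())  # 2020-09-02
```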

Convert various date strings to the specified date format

  • Convert various formats of February 14, 2020 into the date format; the program code is as follows:
import pandas as pd
df = pd.DataFrame({'Original date': ['14-Feb-20', '02/14/2020', '2020.02.14', '2020/02/14', '20200214']})
# Note: pandas 2.x infers a single format for the whole column; mixed formats
# like these may need pd.to_datetime(..., format='mixed')
df['Converted date'] = pd.to_datetime(df['Original date'])
print(df)

  • to_datetime() can also combine multiple columns of a DataFrame object, such as year, month and day columns, into a single date column. The keys are the common date abbreviations. Combination requirements:
  • Mandatory: year, month, day
  • Optional: hour, minute, second, millisecond, microsecond, nanosecond

Convert a set of data into date data

import pandas as pd
df = pd.DataFrame({
    'year': [2018, 2019, 2020],
    'month': [1, 3, 2],
    'day': [4, 5, 14],
    'hour': [13, 8, 2],
    'minute': [23, 12, 14],
    'second': [2, 4, 0]})
df['Combined date'] = pd.to_datetime(df)
print(df)

4.8.2 dt Use of objects

  • The dt object is the accessor for the date attributes of a Series object's values. Through it you can get the year, month, day, weekday, quarter, etc. from a date, and also check, for example, whether a date falls at the end of the year. The syntax is as follows:
Series.dt
  • Return value: a series with the same index as the original series. Raises an error if the Series does not contain date values.
  • The dt object provides attributes and methods such as year, month, day, dayofweek, dayofyear, is_leap_year and quarter. For example, year gets the year, quarter directly gets the quarter of each date, and day_name() (which replaces the older weekday_name attribute) directly gets the name of the weekday for each date.

Get the year, month, day, weekday, etc. from a date

import pandas as pd
df = pd.DataFrame({'Original date': ['2019.1.05', '2019.2.15', '2019.3.25', '2019.6.25', '2019.9.15', '2019.12.31']})
df['date'] = pd.to_datetime(df['Original date'])
df

df['year'], df['month'], df['day'] = df['date'].dt.year, df['date'].dt.month, df['date'].dt.day
df['weekday'] = df['date'].dt.day_name()
df['quarter'] = df['date'].dt.quarter
df['is year end'] = df['date'].dt.is_year_end
df

4.8.3 Get the data of the date range

  • To get data for a date range, index the DataFrame object directly with a date or a date range, but the date must first be set as the index. Examples are as follows:

  • Get the data for 2018

df1['2018']
  • Get the data for 2017 through 2018
df1['2017':'2018']
  • Get the data for one month (July 2018)
df1['2018-07']
  • Get the data for a specific day (May 6, 2018)
df1['2018-05-06':'2018-05-06']
  • Note: recent pandas versions drop partial-string row selection via [], so write df1.loc['2018'], df1.loc['2018-07'], and so on instead.
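The selections above assume df1 already has a DatetimeIndex. A minimal self-contained sketch (the sales figures are made up) looks like this, using .loc as recent pandas requires:

```python
import pandas as pd

# 10 days of made-up sales spanning the 2017/2018 year boundary
idx = pd.date_range('2017-12-30', periods=10, freq='D')
df1 = pd.DataFrame({'sales': range(10)}, index=idx)

# rows falling in 2018 (partial-string indexing on a DatetimeIndex)
print(df1.loc['2018'])
# rows in January 2018
print(df1.loc['2018-01'])
# a single day
print(df1.loc['2018-01-03':'2018-01-03'])
```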

Get the order data of the specified date range

  • Get the orders from May 11 to June 10, 2018; the results are shown in the figure below:
import pandas as pd
df = pd.read_excel('mingribooks.xls')
df1 = df[['Order payment time', 'Buyer member name', 'Contact phone', 'Actual amount paid']]
df1 = df1.sort_values(by=['Order payment time'])
df1

df1 = df1.set_index('Order payment time')  # set the date as the index
df1

# Get the data for a given interval
df2 = df1['2018-05-11':'2018-06-10']
df2

4.8.4 Statistics and display data by different periods

1. Statistics by period

  • Statistics by period mainly use the DataFrame object's resample() method combined with a data-aggregation function. resample() performs frequency conversion and resampling of time series; combined with an aggregation function it produces yearly, monthly, daily, weekly or quarterly statistics. Examples are shown below; the index must be of date type.
  • (1) Annual statistics; the code is as follows:
df1.resample('AS').sum()
  • (2) Quarterly statistics; the code is as follows:
df1.resample('Q').sum()
  • (3) Monthly statistics; the code is as follows:
df1.resample('M').sum()
  • (4) Weekly statistics; the code is as follows:
df1.resample('W').sum()
  • (5) Daily statistics; the code is as follows:
df1.resample('D').sum()
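A self-contained sketch of these statistics on made-up daily data (60 days of a constant amount, so the totals are easy to check):

```python
import pandas as pd

# 60 days starting 2020-01-01, amount 1 per day
idx = pd.date_range('2020-01-01', periods=60, freq='D')
df1 = pd.DataFrame({'amount': [1] * 60}, index=idx)

# monthly totals: 31 for January, 29 for February (2020 is a leap year)
print(df1.resample('M').sum())
# weekly totals
print(df1.resample('W').sum().head())
```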

2. Display data by period

  • The DataFrame object's to_period() method converts a timestamp to a period so that data can be displayed by period; the prerequisite is that the date is set as the index. The syntax is as follows:
DataFrame.to_period(freq=None, axis=0, copy=True)
  • freq: string, the frequency of the period index, default None
  • axis: the axis to convert; axis=0 is the row index, axis=1 the column index. The default is 0, i.e. the row index.
  • copy: whether to copy the data, default True; if False, the data is not copied.
  • Return value: a time series with a period index
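A minimal sketch of to_period() on a monthly-resampled series (the data is invented):

```python
import pandas as pd

# 90 days of a constant value starting 2020-01-01
idx = pd.date_range('2020-01-01', periods=90, freq='D')
s = pd.Series(1, index=idx)

# monthly sums, with the index displayed as periods such as 2020-01
monthly = s.resample('M').sum().to_period('M')
print(monthly)
```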

Get different periods from dates

  • Get different periods from dates; the main code is as follows:
import pandas as pd
aa = r'TB2018.xls'
df = pd.DataFrame(pd.read_excel(aa))
df1 = df[['Order payment time', 'Buyer member name', 'Contact phone', 'Actual amount paid']]
df1 = df1.set_index('Order payment time')  # set the date as the index
df1

Monthly statistics
# Monthly statistics
# 'MS' labels each period with the first day of the month; 'M' with the last day
df1.resample('M').sum()

Quarterly statistics
# Quarterly statistics
# 'QS' labels each period with the first day of the quarter; 'Q' with the last day
df1.resample('QS').sum()

Annual statistics
# Annual statistics
# 'AS' labels each period with the first day of the year; 'A' with the last day
df1.resample('AS').sum()

Statistics by year and display data
# Statistics by year and display data
# 'AS' labels each period with the first day of the year; 'A' with the last day
df1.resample('AS').sum().to_period('A')

Statistics and display data by quarter
# Statistics and display data by quarter
df1.resample('Q').sum().to_period('Q')

Statistics by month and display data
# Monthly statistics
# 'MS' labels each period with the first day of the month; 'M' with the last day
df1.resample('M').sum().to_period('M')

Count and display data by week
df1.resample('W').sum().to_period('W').head()

4.9 The time series

4.9.1 Resampling (the resample() method)

  • In the previous sections we learned how to generate time indexes of different frequencies: hourly, daily, weekly, monthly, and so on. What if we want to convert data between frequencies? In Pandas, adjusting the frequency of a time series is called resampling, i.e. converting a time series from one frequency to another, for example from a daily frequency to a 5-day frequency, as shown in Figure 4.67.
  • Resampling mainly uses the resample() method, which performs resampling and frequency conversion of regular time series, including downsampling and upsampling. First, let's look at the resample() method; its syntax is as follows:
DataFrame.resample(rule, how=None, axis=0, fill_method=None, closed=None, label=None, convention='start', kind=None, loffset=None, limit=None, base=0, level=None)
  • rule: string; the offset string or object representing the target frequency
  • how: name of a function, or an array function, used to produce the aggregated values, e.g. mean, ohlc or np.max; the default is mean, and other common values are first, last, median, max and min (recent pandas drops this parameter in favor of method calls such as .resample('M').sum())
  • axis: integer; axis=0 resamples along the row index (the default), axis=1 along the columns
  • fill_method: fill method used when upsampling, ffill() (fill with the previous value) or bfill() (fill with the next value); default None
  • closed: when downsampling, which side of each time interval is closed, as with intervals in mathematics; the value is 'right' (left-open, right-closed) or 'left' (left-closed, right-open), with 'right' the default for most frequencies
  • label: when downsampling, which bin edge labels the aggregated value, e.g. whether the interval 10:30-10:35 is labeled 10:30 or 10:35; default None
  • convention: when resampling periods, the convention for converting low frequency to high frequency; the value is 'start' or 'end', default 'start'
  • kind: aggregate to periods ('period') or timestamps ('timestamp'); defaults to the index type of the time series; default None
  • loffset: time adjustment applied to the aggregated labels, default None; e.g. '-1s' or Second(-1) moves the labels one second earlier
  • limit: when filling forward or backward, the maximum number of periods to fill; default None
  • base: integer, default 0. For frequencies that evenly subdivide one day, the "origin" of the aggregated intervals; e.g. for a 5min frequency, base can range over 0-4
  • on: string, optional, default None. For a DataFrame object, resample using a column instead of the index; the column must be datetime-like
  • level: string, optional, default None. For a MultiIndex, the level (name or number) to resample on; the level must be datetime-like
  • Return value: a resampler object

Convert a one-minute time series into a 3-minute time series

  • First create a time series of 9 one-minute points, then use the resample() method to convert it into a 3-minute time series, summing the values in each bin, as shown in Figure 4.68.
  • The program code is as follows:
import pandas as pd
index = pd.date_range('02/02/2020', periods=9, freq='T')
series = pd.Series(range(9), index=index)
series

series.resample('3T').sum()

4.9.2 Downsampling

  • Downsampling converts a period from a higher frequency to a lower one, for example converting 5-minute stock trading data to daily data, or daily sales statistics to weekly statistics.
  • Downsampling involves data aggregation. For example, when daily data becomes weekly data, the 7 days of each week must be aggregated, usually by summing, averaging, and so on.

Statistics of sales data by week

import pandas as pd
df=pd.read_excel('time.xls')
df1 = df.set_index('Order payment time')  # set 'Order payment time' as the index
df1

df1.resample('W').sum().head()

df1.resample('W',closed='left').sum()
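The effect of closed can be seen on a small synthetic series (the dates and values below are made up):

```python
import pandas as pd

# One week of made-up daily sales, Sunday 2020-02-02 through Saturday 2020-02-08
idx = pd.date_range('2020-02-02', periods=7, freq='D')
s = pd.Series(1, index=idx)

# Default for 'W' bins is closed='right': the Sunday itself closes the previous bin,
# so it lands alone and the remaining six days fall in the next bin
print(s.resample('W').sum())                 # two bins: 1 and 6
# closed='left': each weekly bin includes its left edge instead,
# so all seven days fall in a single bin
print(s.resample('W', closed='left').sum())  # one bin: 7
```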

4.9.3 Upsampling

  • Upsampling converts from a lower frequency to a higher one. No aggregation is needed when converting low-frequency data to high-frequency data; resampling it to, say, a daily frequency introduces missing values by default.
  • For example, data originally aggregated by week is now broken down by day. Upsampling therefore involves filling, and the result depends on the fill method. There are three fill options:
    – No fill: empty values become NaN; use the asfreq() method.
    – Fill with the previous value: use the ffill() or pad() method. As a mnemonic, ffill() starts with "f" for forward.
    – Fill with the next value: use the bfill() method; "b" stands for backward.

Produce statistics every 6 hours

  • Create a time series starting on 2020-02-02 and spanning two days, with the values 1 and 2 for the two days. Upsample it to 6-hour statistics, filling the empty values in different ways. The program code is as follows:
import pandas as pd
import numpy as np
rng = pd.date_range('20200202', periods=2)
s1 = pd.Series(np.arange(1,3), index=rng)
s1

s1_6h_asfreq = s1.resample('6H').asfreq()
print(s1_6h_asfreq)
s1_6h_pad = s1.resample('6H').pad()
print(s1_6h_pad)
s1_6h_ffill = s1.resample('6H').ffill()
print(s1_6h_ffill)
s1_6h_bfill = s1.resample('6H').bfill()
print(s1_6h_bfill)

4.9.4 Time series data summary (the ohlc() function)

  • In finance we often see open, close, high and low prices, and resampled data in Pandas can produce the same kind of summary: calling the ohlc() function yields the first value (open), last value (close), maximum (high) and minimum (low) of each group. The syntax of the ohlc() function is as follows:
resample.ohlc()
  • The ohlc() function returns a DataFrame object with the open, high, low and close values of each group of data.

Compute the open, high, low and close values

  • For each 5-minute group of the time series below, use the ohlc() function to get the first, highest, lowest and last value of the group. The program code is as follows:
import pandas as pd
import numpy as np
rng = pd.date_range('2/2/2020',periods=12,freq='T')
s1 = pd.Series(np.arange(12),index=rng)
s1

s1.resample('5min').ohlc()

4.9.5 Moving-window calculations (the rolling() function)

  • Resampling yields whatever low-frequency data we want, but those values still describe single time points, which raises a problem: data at a single point fluctuates a lot, so one point cannot represent its own behavior well. Hence the concept of the moving window: in short, to improve reliability, each point's value is extended to an interval around that point, and judgments are made over the interval; this interval is the window.
  • As an example, Figure 4.74 shows a moving-window diagram. The time series holds daily sales data for the 1st through the 15th. With a window of 3 moving from left to right, each point takes the 3-day mean as its value; for example, the value for the 3rd is the mean of the 1st, 2nd and 3rd.
  • With the diagram above, you should now understand the moving window. In Pandas, the rolling() function performs moving-window calculations; the syntax is as follows:
DataFrame.rolling(window, min_periods=None, center=False, win_type=None, on=None, axis=0, closed=None)
  • window: the size of the window, either an int or an offset. An int is the number of observations used to compute each statistic, counting backward from the current point; an offset is the time span of the window
  • min_periods: the minimum number of observations per window; windows with fewer observations yield NA. The value may be an int, default None; with an offset window the default is 1
  • center: whether to place the label at the center of the window; Boolean, default False, i.e. the label sits at the right edge
  • win_type: the type of window, i.e. the weighting function applied to it; string, default None
  • on: optional. For a DataFrame object, the column over which to compute the moving window; the value is a column name
  • axis: integer; axis=0 computes down the rows (the default), axis=1 across the columns
  • closed: defines which side of the interval is closed; supported for offset windows, whose default is right-closed. 'left' can be specified as needed
  • Return value: a window or rolling subclass for the particular operation

Create Taobao daily sales data

  • First create a set of daily Taobao sales data; the program code is as follows:
import pandas as pd
index=pd.date_range('20200201','20200215')
data=[3,6,7,4,2,1,3,8,9,10,12,15,13,22,14]
s1_data=pd.Series(data,index=index)
s1_data

Use the rolling() function to compute the 3-day mean

  • Use the rolling() function to compute the 3-day mean from 2020-02-01 to 2020-02-15, with a window of 3. The code is as follows:
s1_data.rolling(3).mean()

  • Run the program and look at how rolling() computes. As the window starts moving, the values at the first time point, 2020-02-01, and the second, 2020-02-02, are empty: with a window of 3 there is not enough earlier data, so the mean is empty. At the third time point, 2020-02-03, the window covers 2020-02-01 through 2020-02-03, so the 3-day mean is 5.333333 (the mean of 3, 6 and 7); and so on.

Use the current day's data to represent the window

  • When computing the window for the first time point, 2020-02-01, the data falls short of the window length 3, but the day's own data is available. Can it stand in for the window? Yes: set the min_periods parameter, the minimum number of observations per window. Windows with fewer observations show as empty; windows with at least that many get a value. The main code is as follows:
s1_data.rolling(3, min_periods=1).mean()

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

index = pd.date_range('20200201', '20200215')
np.random.seed(2)
data = np.random.randint(20, size=len(index))  # random daily sales data
ser_data = pd.Series(data, index=index)
plt.figure(figsize=(15, 5))
ser_data.plot(style='r--')                   # raw data
ser_data.rolling(3).mean().plot(style='b')   # 3-day moving average

4.10 Comprehensive application

4.10.1 Case 1: Merging multiple Excel tables

  • In daily work we handle piles of data almost every day; the desktop ends up covered with Excel files and looks messy. In fact, Excel tables of the same kind can be merged together, so no data is lost and the data can still be analyzed effectively. Use the concat() method to merge all the specified Excel tables; the program code is as follows:
import pandas as pd
import glob

filearray = []
filelocation = glob.glob(r'./aa/*.xlsx')  # all Excel files in the specified directory
# Traverse the specified directory
for filename in filelocation:
    filearray.append(filename)
    print(filename)
res = pd.read_excel(filearray[0])  # read the first Excel file
# Read the remaining Excel files in order and merge them
for i in range(1, len(filearray)):
    A = pd.read_excel(filearray[i])
    res = pd.concat([res, A], ignore_index=True, sort=False)  # ignore the indexes
print(res.index)
# Write to an Excel file and save
writer = pd.ExcelWriter('all.xlsx')
res.to_excel(writer, 'sheet1')
writer.save()  # in recent pandas use writer.close() instead

4.10.2 Case 2: Stock market data analysis

  • Stock data includes indicators such as the opening price, closing price, lowest price and trading volume. Among these, the closing price is the benchmark for the day's market and the basis for the next trading day's opening price, and it can be used to forecast the future market, so investors generally use the closing price as the basis for valuation when analyzing the market.
  • Use the rolling() function to compute a stock's 20-day, 50-day and 200-day average closing prices and plot the trend chart. The program code is as follows:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
aa =r'000001.xlsx'
df = pd.DataFrame(pd.read_excel(aa))
df['date'] = pd.to_datetime(df['date'])  # convert the column to datetime type
df

df = df.set_index('date')  # set date as the index
df = df[['close']]  # extract the closing-price data
df

df['20-day'] = np.round(df['close'].rolling(window=20, center=False).mean(), 2)
df['50-day'] = np.round(df['close'].rolling(window=50, center=False).mean(), 2)
df['200-day'] = np.round(df['close'].rolling(window=200, center=False).mean(), 2)
plt.rcParams['font.sans-serif'] = ['SimHei']  # fix garbled Chinese characters in plots
df.plot(secondary_y=['close', '20-day', '50-day', '200-day'], grid=True)
plt.legend(('close', '20-day', '50-day', '200-day'), loc='upper right')
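Without the 000001.xlsx data file, the same moving-average logic can be exercised on synthetic prices (the random walk below is invented purely for illustration):

```python
import numpy as np
import pandas as pd

# Made-up closing prices: a random walk of 300 trading days
rng = np.random.default_rng(0)
idx = pd.date_range('2020-01-01', periods=300, freq='D')
df = pd.Series(100 + rng.normal(0, 1, 300).cumsum(), index=idx, name='close').to_frame()

# Rolling averages, rounded to 2 decimals as in the example above
for w in (20, 50, 200):
    df[f'{w}-day'] = np.round(df['close'].rolling(window=w).mean(), 2)

# The first window-1 rows of each average are NaN
print(df[['20-day', '50-day', '200-day']].tail())
```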

