程序師世界是廣大編程愛好者互助、分享、學習的平台,程序師世界有你更精彩!
首頁
編程語言
C語言|JAVA編程
Python編程
網頁編程
ASP編程|PHP編程
JSP編程
數據庫知識
MYSQL數據庫|SqlServer數據庫
Oracle數據庫|DB2數據庫
您现在的位置: 程式師世界 >> 編程語言 >  >> 更多編程語言 >> Python

Python Excel data analysis based on pandas (I)

編輯:Python

The purpose of writing this article is to record what we have seen and practiced : Today's share is based on SODA In the open data set of the competition , Simple cleaning of all-in-one card data set , Data can be obtained directly SODA Take a look around the official website , The data content is as follows :

According to the data description :0 Represents the all-in-one card number ,1 Representative date ,2 Represents time ,3 Represents where the card is used ,4 Means of transportation ,5 It's for cost ,6 Represents the type of card . So what are we going to do with this powerful information , First of all, clear the operation goal, and then formulate the cleaning strategy around the goal , by the way , The data is csv Format , I changed it artificially , Replace the number title with ,cardId,date,time,station,way,cost,discount When the text title coexists xls Format , Actually , Not doing so will not affect the next analysis , Because this article will use pandas In fact, reading has been provided excel and csv Methods ;

Well, back to the point , Our homework content , In fact, we know that the normal bus operation process should be : Arrival -> Departure , In this process, there are cardId A set of behaviors sent by the same card , A group of normal consumption behaviors should be : Arrival (cost = 0)-> Departure (cost > 0), Through the interpretation of these behaviors, we can extract a passenger from the data ( One cartoon ), From which stop to which stop , Midway expenses , Travel time and other information . Okay , With the homework goal, we should analyze the data well , The important point of cleaning is to eliminate dirty data , In fact, in many data , There will be the following disturbing information :(1) Only inbound or outbound information , Missing inbound or outbound information ;(2) There is inbound and outbound information, but both consumption results are 0, Or the two consumption results are not 0;

For the above situation, we need to propose dirty data , Because the existence of dirty data will affect our normal data analysis behavior , Of course, dirty data also has its value , For example, we can analyze dirty data , The occurrence factors of dirty data , For the situation (1) The possible situation of dirty data may be because a passenger evades tickets , Or caused by missing brushing , If many passengers are missing information at a certain station , Does that mean that there is a problem with the machine on the site or that there is a special organization near the site .....

All right, man , In the wrong , It's the code above , This paper mainly uses pandas,xlrd,xlwt If a module is missing during operation , Please fill it up by yourself , Ah, no , It is installed by itself :

import pandas as pd
import xlrd
import xlwt

# First we need to read the data :
df=pd.read_excel(excel_Path)# If it is csv Just use read_csv

# Read the data that the travel mode is subway
df = df[df['way'] == ' metro ']
# Connect the time and date fields , Integrated as datetime Format
df["date"] = df["date"].apply(lambda x:x.strftime("%Y-%m-%d"))
df["time"] = df["time"].apply(lambda x:x.strftime("%H:%M:%S"))
df["time"] = df['date']+' '+df['time']
df["time"] = pd.to_datetime(df["time"])
# Sort according to card number and date
df = df.sort_values(['cardId','time'])

Through the above code, you can basically have an effect :

              cardId        date                         time               station      way   cost   discount
290     2856179  2016-03-01 2016-03-01 14:59:03   9 Dapuqiao on line   metro     0       Non preferential
291     2856179  2016-03-01 2016-03-01 15:13:46   9 Line No. Guilin road   metro     3       Discount
746     2903472  2016-03-01 2016-03-01 17:03:36   7 Line Changshou Road   metro     0       Non preferential
745     2903472  2016-03-01 2016-03-01 17:20:03  10 Line No. Xintiandi   metro     3       Non preferential
298   100661308  2016-03-01 2016-03-01 08:01:28    1 Xinzhuang of line   metro     0       Non preferential

go on:

# Add new column
#shift Function is used to move the whole row , A positive number means the whole moves down , Squeeze out the last ; A negative number means the whole moves up , Squeeze out the top
df['id_latter'] = df['cardId'].shift(-1)
df['id_former'] = df['cardId'].shift(1)
df['total_cost'] = df['cost'].shift(-1)
df['des_station'] = df['station'].shift(-1)
df['des_time'] = df['time'].shift(-1)

At this point, some of the preparations we need are basically completed , Now, clean up the dirty data : According to what we said before, we can make a logical judgment :

def judge_vaild(df):
    if (df['cost'] == 0) & (df['id_latter'] > 0) & (df['cardId'] == df['id_latter']):
        vaild = 1
    elif (df['cost'] > 0) & (df['total_cost'] == 0) & (df['cardId'] == df['id_former']):
        vaild = 1
    else:
        vaild = -1
    return vaild

Data is logically filtered one by one , And it has been sorted by the door . Logically, it is to judge if the current cost be equal to 0 So in theory, you should enter the site , Corresponding next station cost Should be greater than 0, Then this record should be valid data , If at present cost Greater than 0, In theory, it should be out of the site , Then the last record of this record is cost It should be 0.

Then delete the dirty data :

df.drop(index=(df.loc[(df['vaild'] ==-1)].index),inplace=True)

In fact, so far , The data cleaning is basically completed , I can simply print and output excel:

df['cost_time'] = df.apply(lambda x: x['des_time']-x['time'],axis=1)
df.drop(index=(df.loc[(df['cost']!=0)].index),inplace=True)
df.to_excel(out_path)
df = df[['cardId','station','des_station','total_cost','cost_time','way','discount']]

give the result as follows :

         cardId         station          des_station  total_cost cost_time way discount
290     2856179   9 Dapuqiao on line      9 Line No. Guilin road         3.0  00:14:43   metro       Non preferential
746     2903472   7 Line Changshou Road     10 Line No. Xintiandi         3.0  00:16:27   metro       Non preferential
298   100661308    1 Xinzhuang of line     1 Huangpi South Road         4.0  00:51:36   metro       Non preferential
296   100661308  1 Huangpi South Road       1 Xinzhuang of line         4.0  00:33:13   metro       Non preferential


In fact, the data analysis results should be displayed with visual charts and the like , How , Not yet get To this skill , Today's sharing is just to see other people's objects and practice it with your own hands . For the specific content and visualization results of this article, you can go to WeChat official account search :Yuan Data analysis of , Of course, the official account has nothing to do with bloggers , But it is the reference object of this article , The practice content and reference code of this article are from an article shared by him .
 


  1. 上一篇文章:
  2. 下一篇文章:
Copyright © 程式師世界 All Rights Reserved