
The Save-pandas Plan (20): Counting the monthly order volume of a retail store


    • / Data requirement
    • / Demand processing
    • / Summary

Recently I have noticed that many friends around me are reluctant to use pandas and are switching to other data-manipulation libraries. As a data worker who talks about pandas all day long, I wrote this series to help more people fall in love with pandas.

How articles in this series are named:

Series name (number of the article in the series): the specific requirement the article addresses

Platform:

  • Windows 10
  • Python 3.8
  • pandas >=1.2.4

/ Data requirement

Recently I was reading a book, published in 2020, about data processing with pandas. One section covers statistics on online retail data in which every item of every order is recorded as a separate row, so when you only care about orders, the same order number appears many times. This article discusses how to count the number of orders per month. The data is read as follows:

import pandas as pd
df = pd.read_csv('Online_Retail.csv.zip', parse_dates=['InvoiceDate'])
df_new = df.dropna().copy()
# Build a numeric year-month key from the invoice date (e.g. 201012 for December 2010)
df_new['YearMonth'] = df_new['InvoiceDate'].map(lambda x: 100 * x.year + x.month)
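
The same year-month key can also be computed without a Python-level lambda. The following is a minimal sketch, not the book's or the article's code; the three sample timestamps are made up for illustration and stand in for the InvoiceDate column.

```python
import pandas as pd

# Hypothetical sample timestamps standing in for df_new['InvoiceDate']
dates = pd.Series(pd.to_datetime([
    '2010-12-01 08:26',
    '2011-01-04 10:00',
    '2011-11-30 23:59',
]))

# Row-by-row version, as used in the article
year_month_map = dates.map(lambda x: 100 * x.year + x.month)

# Vectorized version using the .dt accessor
year_month_dt = dates.dt.year * 100 + dates.dt.month

print(year_month_dt.tolist())  # [201012, 201101, 201111]
```

Both versions produce the same values; the .dt form simply lets pandas do the arithmetic on whole columns.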

PS: The data can be downloaded from GitHub:
https://github.com/lk-itween/FunnyCodeRepository/raw/main/PandasSaved/data/Online_Retail.csv.zip

After the steps above, df_new.shape is (406829, 9).

/ Demand processing

Since we only care about orders, repeated order numbers would make the statistics inaccurate, so the order numbers must be de-duplicated before counting.
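
To see why de-duplication matters, here is a small sketch on made-up data (the toy frame and its order numbers are hypothetical, not taken from the dataset). Counting rows per month counts items, while counting distinct order numbers per month counts orders. The nunique call is used here only as a cross-check and is not one of the methods discussed in this article.

```python
import pandas as pd

# Hypothetical toy data: one row per item, so order A1 appears three times
toy = pd.DataFrame({
    'InvoiceNo': ['A1', 'A1', 'A1', 'B2', 'C3', 'C3'],
    'YearMonth': [201012, 201012, 201012, 201012, 201101, 201101],
})

# Counting rows per month counts items, not orders
rows_per_month = toy['YearMonth'].value_counts().sort_index()

# Counting distinct order numbers per month gives the order volume
orders_per_month = toy.groupby('YearMonth')['InvoiceNo'].nunique()

print(rows_per_month.to_dict())    # {201012: 4, 201101: 2}
print(orders_per_month.to_dict())  # {201012: 2, 201101: 1}
```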

  • Method 1: the book's approach, which takes the unique values per order and then counts them.
df_new.groupby('InvoiceNo')['YearMonth'].unique().value_counts().sort_index()

pandas has been updated many times since 2020, so the code from the book may no longer run as written. Here it raises an error, because after unique() each row holds an array-like collection of months rather than a single value, and value_counts cannot handle that.

Changing the code as follows meets the requirement.

df_new.groupby('InvoiceNo')['YearMonth'].unique().explode().value_counts().sort_index()
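
The effect of explode() can be seen on a small example. This is a sketch on hypothetical toy data, not the retail dataset: unique() leaves one array-like value per order, and explode() flattens those into one row per (order, month) pair so that value_counts can count them.

```python
import pandas as pd

# Hypothetical toy data: one row per item
toy = pd.DataFrame({
    'InvoiceNo': ['A1', 'A1', 'A1', 'B2', 'C3', 'C3'],
    'YearMonth': [201012, 201012, 201012, 201012, 201101, 201101],
})

# One array-like value of distinct months per order number
months_per_order = toy.groupby('InvoiceNo')['YearMonth'].unique()

# explode() turns each array element into its own row, keeping the order number as index
flat = months_per_order.explode()

# Now every row is a plain month value, so it can be counted
monthly_orders = flat.value_counts().sort_index()
print(monthly_orders.to_dict())  # {201012: 2, 201101: 1}
```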

(Manual watermark: original on CSDN by The fate of the sleepers, https://blog.csdn.net/weixin_46281427?spm=1011.2124.3001.5343 , official account A11Dot send)

  • Method 2: apply value_counts to the groupby result to de-duplicate, then count again.
df_new.groupby('InvoiceNo')['YearMonth'].value_counts().reset_index(name='count')['YearMonth'].value_counts().sort_index()

The first value_counts de-duplicates YearMonth within each order. At that point the columns we need are in the index, so reset_index turns the index back into ordinary columns, and a second value_counts on YearMonth gives the monthly order volume.
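
The chain can be split into its intermediate steps to make this easier to follow. This is a sketch on hypothetical toy data; the variable names step1 and step2 are introduced here for illustration only.

```python
import pandas as pd

# Hypothetical toy data: one row per item
toy = pd.DataFrame({
    'InvoiceNo': ['A1', 'A1', 'A1', 'B2', 'C3', 'C3'],
    'YearMonth': [201012, 201012, 201012, 201012, 201101, 201101],
})

# Step 1: one entry per (InvoiceNo, YearMonth) pair; the values are item-row counts
step1 = toy.groupby('InvoiceNo')['YearMonth'].value_counts()

# Step 2: move both index levels back into columns, naming the counts 'count'
step2 = step1.reset_index(name='count')
print(list(step2.columns))  # ['InvoiceNo', 'YearMonth', 'count']

# Step 3: each order now appears once per month, so counting months counts orders
monthly_orders = step2['YearMonth'].value_counts().sort_index()
print(monthly_orders.to_dict())  # {201012: 2, 201101: 1}
```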

On the same computer, this method is faster than the one from the book, probably because unique takes some time to process. However, it makes the logic harder to follow. In pandas, de-duplication can be done directly with drop_duplicates.

  • Method 3: de-duplicate with drop_duplicates, then count.
df_new[['InvoiceNo', 'YearMonth']].drop_duplicates()['YearMonth'].value_counts().sort_index()

Compared with the first two methods, the code is much shorter and the processing time is also lower.
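
The three methods can be compared with a rough timing sketch. Everything below is an assumption for illustration: the data is synthetic (random order numbers and four made-up months), the function names are hypothetical, and the timings will vary by machine and pandas version, so no particular numbers are claimed here.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data: each order belongs to exactly one month and has several item rows
rng = np.random.default_rng(0)
n_orders, n_rows = 5_000, 50_000
order_month = rng.choice([201012, 201101, 201102, 201103], size=n_orders)
order_ids = rng.integers(0, n_orders, size=n_rows)
data = pd.DataFrame({'InvoiceNo': order_ids, 'YearMonth': order_month[order_ids]})


def method_one(df):
    # unique per order, flatten, then count
    return df.groupby('InvoiceNo')['YearMonth'].unique().explode().value_counts().sort_index()


def method_two(df):
    # value_counts per order, reset the index, then count again
    counts = df.groupby('InvoiceNo')['YearMonth'].value_counts().reset_index(name='count')
    return counts['YearMonth'].value_counts().sort_index()


def method_three(df):
    # drop duplicate (order, month) pairs, then count
    return df[['InvoiceNo', 'YearMonth']].drop_duplicates()['YearMonth'].value_counts().sort_index()


methods = {'method one': method_one, 'method two': method_two, 'method three': method_three}

# Normalize each result to a plain dict so the three can be compared directly
results = {
    name: {int(k): int(v) for k, v in func(data).items()}
    for name, func in methods.items()
}

for name, func in methods.items():
    seconds = timeit.timeit(lambda: func(data), number=3)
    print(f'{name}: {seconds / 3:.4f} s per run')
```

Besides the timings, the results dictionary lets you confirm that all three methods return the same monthly counts.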

/ Summary

This article takes an example from the book, reproduces its code, and then optimizes the processing step by step with currently available pandas methods, explaining the similarities and differences between the approaches until the requirement is met. The source data is linked at the beginning of the article.

Watch the sky , Listen to the wind and rain at dawn .


Made on June 22, 2002

