
Classic case of pandas data analysis


Author: Peter. Editor: Peter

Hello everyone, I'm Peter~

I've written many articles about Pandas. This one works through a simple, comprehensive example, organized into the following parts:

How to simulate data yourself
Several data processing methods
Data statistics and visualization
The user RFM model
The user repurchase cycle

Building the data

The data used in this case is simulated by the author. It consists of two datasets, fruit data and order information, which are then merged.

import pandas as pd
import numpy as np
import random
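from datetime import datetime, timedelta  # explicit names; a wildcard import also pulls in datetime.time, which "import time" then shadows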
import time
import plotly.express as px
import plotly.graph_objects as go
import plotly as py
# Used for drawing subplots
from plotly.subplots import make_subplots

1. The time field
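The code for this step is not shown in the text. A minimal sketch, assuming order dates are drawn at random from every day between 2019 and 2021; the date bounds, the sample size `n_orders`, and the seed are illustrative assumptions rather than the author's values:

```python
import numpy as np
import pandas as pd

# Assumed bounds: every calendar day from 2019 through 2021
all_days = pd.date_range(start="2019-01-01", end="2021-12-31", freq="D")

# Assumed sample size and seed, chosen only for reproducibility
rng = np.random.default_rng(42)
n_orders = 1000

# Draw order dates with replacement, then sort them chronologically
sampled = rng.choice(all_days.to_numpy(), size=n_orders, replace=True)
time_range = pd.Series(sampled).sort_values(ignore_index=True)
```

The resulting `time_range` can be passed as the "time" column when the order table is built.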

2. Fruits and users
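The code for this step is also missing. One possible sketch draws one fruit and one customer per order from two small lists. The fruit names, most of the customer names, the seed, and `n_orders` are hypothetical; the text only implies that there are 8 fruits and mentions the users xiaoming, Michk, and Mike:

```python
import random

# Hypothetical fruit and customer names (8 fruits, to match the 8 prices used later)
fruits = ["banana", "apple", "grape", "orange", "cherry", "pear", "peach", "plum"]
names = ["xiaoming", "Michk", "Mike", "Jimmy", "Tom", "Peter"]

random.seed(0)    # assumed seed, for reproducibility only
n_orders = 1000   # assumed; in the full pipeline this would be len(time_range)

# One random fruit and one random customer per order, sampled with replacement
fruit_list = random.choices(fruits, k=n_orders)
name_list = random.choices(names, k=n_orders)
```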

3. Generate the order data

order = pd.DataFrame({
    "time":time_range,   # order time
    "fruit":fruit_list,  # fruit name
    "name":name_list,    # customer name
    # purchase quantity in kilograms
    "kilogram":np.random.choice(list(range(50,100)), size=len(time_range),replace=True)
})
order

4. Generate the fruit information data

infortmation = pd.DataFrame({
    "fruit":fruits,
    "price":[3.8, 8.9, 12.8, 6.8, 15.8, 4.9, 5.8, 7],
    "region":["South China","North China","Northwest China","Central China","Northwest China","South China","North China","Central China"]
})
infortmation

5. Merging the data

Merge the order information and the fruit information into one complete DataFrame. This df is the data processed in the rest of the article.
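The merge itself is not shown. A minimal sketch, assuming the two tables are joined on their shared "fruit" column; the tiny tables below are illustrative stand-ins, and the choice of a left join is an assumption:

```python
import pandas as pd

# Tiny hand-made stand-ins for the order and fruit tables (values are illustrative)
order = pd.DataFrame({
    "time": pd.to_datetime(["2019-01-05", "2020-03-10", "2021-07-21"]),
    "fruit": ["apple", "banana", "apple"],
    "name": ["xiaoming", "Mike", "Michk"],
    "kilogram": [60, 75, 90],
})
infortmation = pd.DataFrame({
    "fruit": ["apple", "banana"],
    "price": [8.9, 3.8],
    "region": ["North China", "South China"],
})

# A left join keeps every order and attaches the price and region of its fruit
df = pd.merge(order, infortmation, on="fruit", how="left")
```

A left join is used here so that an order would survive even if its fruit were missing from the information table; with complete data an inner join gives the same rows.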

6. Generate a new field: order amount
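The code for this step is not shown. Presumably the amount is the purchased quantity multiplied by the unit price; a sketch under that assumption, with illustrative values:

```python
import pandas as pd

# Illustrative merged rows: quantity in kilograms and price per kilogram
df = pd.DataFrame({
    "fruit": ["apple", "banana"],
    "kilogram": [60, 75],
    "price": [8.9, 3.8],
})

# Order amount = quantity purchased * unit price, rounded to 2 decimals
df["amount"] = (df["kilogram"] * df["price"]).round(2)
```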

What you can learn here:

How to generate time-related data
How to generate random data from a list (an iterable)
How to create a Pandas DataFrame yourself, including generating new fields
How to merge data in Pandas

Analysis dimension 1: Time

Monthly sales volume trend, 2019-2021

1. First extract the year and month:

df["year"] = df["time"].dt.year
df["month"] = df["time"].dt.month
# Extract the year and month at the same time
df["year_month"] = df["time"].dt.strftime('%Y%m')
df

2. View the field types:
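The output for this step is missing. A small sketch of how the types could be inspected, using a two-row illustrative frame:

```python
import pandas as pd

# Illustrative frame with only the time column
df = pd.DataFrame({"time": pd.to_datetime(["2019-01-05", "2020-03-10"])})
df["year"] = df["time"].dt.year
df["month"] = df["time"].dt.month
df["year_month"] = df["time"].dt.strftime("%Y%m")

# year and month are integer columns, while year_month holds strings
print(df.dtypes)
```

Because `year_month` is a string, it sorts and groups lexicographically, which matches chronological order for the zero-padded "%Y%m" format.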

3. Aggregate by year-month and plot:

# Count the sales volume by month
df1 = df.groupby(["year_month"])["kilogram"].sum().reset_index()
fig = px.bar(df1,x="year_month",y="kilogram",color="kilogram")
fig.update_layout(xaxis_tickangle=45) # Tilt angle
fig.show()

Sales amount trend, 2019-2021

df2 = df.groupby(["year_month"])["amount"].sum().reset_index()
df2["amount"] = df2["amount"].round(2)
fig = go.Figure()
fig.add_trace(go.Scatter(
    x=df2["year_month"],
    y=df2["amount"],
    mode='lines+markers',   # draw both lines and markers
    name='lines'))          # trace name shown in the legend
fig.update_layout(xaxis_tickangle=45)   # tilt the x-axis tick labels
fig.show()

Annual sales volume, sales amount, and average sales amount
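The code for this step is missing. A sketch using named aggregation on an illustrative frame; the name `df3` and the output column names are guesses, not taken from the source:

```python
import pandas as pd

# Illustrative orders with year, quantity, and amount
df = pd.DataFrame({
    "year": [2019, 2019, 2020, 2021],
    "kilogram": [60, 80, 75, 90],
    "amount": [534.0, 304.0, 285.0, 612.0],
})

# Per year: total kilograms sold, total amount, and mean amount per order
df3 = df.groupby("year").agg(
    kilogram_sum=("kilogram", "sum"),
    amount_sum=("amount", "sum"),
    amount_mean=("amount", "mean"),
).reset_index()
```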

Analysis dimension 2: Products

Proportion of annual fruit sales

df4 = df.groupby(["year","fruit"]).agg({"kilogram":"sum","amount":"sum"}).reset_index()
df4["year"] = df4["year"].astype(str)
df4["amount"] = df4["amount"].round(2)
from plotly.subplots import make_subplots
import plotly.graph_objects as go
fig = make_subplots(
    rows=1,
    cols=3,
    subplot_titles=["2019","2020","2021"],
    specs=[[{"type": "domain"},   # pie charts need the "domain" subplot type
            {"type": "domain"},
            {"type": "domain"}]]
)
years = df4["year"].unique().tolist()
for i, year in enumerate(years):
    # Fruit names and kilograms sold in this year
    name = df4[df4["year"] == year].fruit
    value = df4[df4["year"] == year].kilogram
    fig.add_trace(go.Pie(labels=name,
                         values=value
                         ),
                  row=1, col=i+1
                  )
fig.update_traces(
    textposition='inside',   # 'inside','outside','auto','none'
    textinfo='percent+label',
    insidetextorientation='radial',   # horizontal, radial, tangential
    hole=.3,
    hoverinfo="label+percent+name"
)
# fig.update_layout(title_text="Subplots with multiple rows and columns")
fig.show()

Comparison of annual sales amount of each fruit

years = df4["year"].unique().tolist()
for year in years:
    df5 = df4[df4["year"]==year]
    fig = go.Figure(go.Treemap(
        labels = df5["fruit"].tolist(),
        parents = df5["year"].tolist(),
        values = df5["amount"].tolist(),
        textinfo = "label+value+percent root"
    ))
    fig.show()   # one treemap per year

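Monthly change in sales amount by fruit

# Assumed aggregation: the line that defines df5 for this chart is missing from the source
df5 = df.groupby(["year_month","fruit"])["amount"].sum().reset_index()
fig = px.bar(df5,x="year_month",y="amount",color="fruit")
fig.update_layout(xaxis_tickangle=45)   # tilt the x-axis tick labels
fig.show()

The same changes can also be shown as a line chart.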

Analysis dimension 3: Region

Sales by region
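The code for this step is missing. A sketch that totals quantity and amount per region on an illustrative frame; the name `df6` and the descending sort are assumptions:

```python
import pandas as pd

# Illustrative orders with their region, quantity, and amount
df = pd.DataFrame({
    "region": ["South China", "North China", "South China", "Central China"],
    "kilogram": [60, 80, 75, 90],
    "amount": [534.0, 304.0, 285.0, 612.0],
})

# Total quantity and amount per region, largest amount first
df6 = df.groupby("region")[["kilogram", "amount"]].sum().reset_index()
df6 = df6.sort_values("amount", ascending=False, ignore_index=True)
```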

Average order amount per year in each region

df7 = df.groupby(["year","region"])["amount"].mean().reset_index()

Analysis dimension 4: Users

Comparing users by order count and amount

df8 = df.groupby(["name"]).agg({"time":"count","amount":"sum"}).reset_index().rename(columns={"time":"order_number"})
df8.style.background_gradient(cmap="Spectral_r")

Users' fruit preferences

Analyze the number of orders and the order amount for each fruit, per user:

df9 = df.groupby(["name","fruit"]).agg({"time":"count","amount":"sum"}).reset_index().rename(columns={"time":"number"})
df10 = df9.sort_values(["name","number","amount"],ascending=[True,False,False])
df10.style.bar(subset=["number","amount"],color="#a97fcf")

px.bar(df10,
       x="fruit",
       y="amount",
       # color="number",
       facet_col="name"
       )

User segmentation: the RFM model

The RFM model is an important tool for measuring customer value and profitability.

The model captures three indicators of a user's behavior: how recently they transacted, how often they transact overall, and the total amount they have spent. Together these 3 indicators describe the customer's value, and customers are divided into 8 value classes based on them:

Recency (R) is the number of days from the customer's last purchase to the present. It depends on when the analysis is run, so it changes over time. In theory, the more recent a customer's last purchase, the more likely they are to buy again.
Frequency (F) is the number of purchases the customer has made. The most frequent buyers tend to be the most loyal. Increasing how often customers purchase means capturing a larger share of their business.
Monetary value (M) is the total amount the customer has spent on purchases.

Below we use Pandas to compute each of the 3 indicators. First, F and M: the number of orders and the total amount per customer
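The code for this step is missing. A sketch on an illustrative frame, counting orders for F and summing amounts for M; the name `df_fm` and the column names "F" and "M" are assumptions:

```python
import pandas as pd

# Illustrative orders for three users
df = pd.DataFrame({
    "name": ["xiaoming", "Mike", "xiaoming", "Michk", "Mike", "xiaoming"],
    "time": pd.to_datetime(["2021-01-03", "2021-02-14", "2021-03-08",
                            "2021-04-19", "2021-05-30", "2021-12-15"]),
    "amount": [534.0, 304.0, 285.0, 612.0, 285.0, 612.0],
})

# F = number of orders per customer, M = total amount per customer
df_fm = df.groupby("name").agg(F=("time", "count"), M=("amount", "sum")).reset_index()
```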

How do we compute the R indicator?

1. First compute the difference between each order's time and the current time

2. Sort each user's differences in ascending order, so that the first row is their most recent purchase. Taking the user xiaoming as an example: the last purchase was on December 15, which is 25 days before the current time

3. Deduplicate by user, keeping the first row. This gives each user's R indicator:
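The code for steps 1 to 3 is missing; they can be sketched together as follows. The reference date `now` is an assumption, fixed at 2022-01-09 so that the xiaoming example (last purchase on December 15, 25 days ago) is reproduced; the author presumably used the actual current time:

```python
import pandas as pd

# Illustrative orders for two users
df = pd.DataFrame({
    "name": ["xiaoming", "Mike", "xiaoming", "Mike"],
    "time": pd.to_datetime(["2021-11-01", "2021-12-01", "2021-12-15", "2021-10-20"]),
})

# Assumed fixed "current time" so the result is reproducible
now = pd.Timestamp("2022-01-09")

# Step 1: days elapsed between each order and the reference time
df["R"] = (now - df["time"]).dt.days

# Step 2: sort so that each user's smallest R (most recent order) comes first
df_sorted = df.sort_values(["name", "R"], ascending=[True, True])

# Step 3: keep only the first row per user, which holds that user's R value
df_r = (df_sorted.drop_duplicates(subset="name", keep="first")[["name", "R"]]
        .reset_index(drop=True))
```

An equivalent shortcut is `df.groupby("name")["R"].min()`; the sort-and-deduplicate form is shown because it follows the steps listed in the text.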

4. Merge the data to obtain all 3 indicators:
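The merge is not shown. A sketch, assuming one table holds R per user and another holds F and M per user; the table names and values are illustrative:

```python
import pandas as pd

# Illustrative per-user tables: R in one, F and M in the other
df_r = pd.DataFrame({"name": ["Mike", "xiaoming"], "R": [39, 25]})
df_fm = pd.DataFrame({"name": ["Mike", "xiaoming"], "F": [2, 3], "M": [589.0, 1431.0]})

# Join on the user name and order the columns as R, F, M
df_rfm = pd.merge(df_r, df_fm, on="name")[["name", "R", "F", "M"]]
```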

When the dataset is large enough and there are enough users, the RFM model can be used directly to divide users into 8 types
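The article does not show how the 8 types are derived. One common approach, sketched here as an assumption, splits each indicator into high and low at its mean, which gives 2 x 2 x 2 = 8 combinations. The data, the mean threshold, and the three-character `segment` code are all illustrative:

```python
import pandas as pd

# Illustrative RFM table for four hypothetical users
df_rfm = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "R": [5, 40, 10, 60],
    "F": [12, 3, 2, 9],
    "M": [900.0, 150.0, 800.0, 200.0],
})

# Score each indicator as 1 (good) or 0 against its mean.
# A smaller R is better (more recent), while larger F and M are better.
r_score = (df_rfm["R"] < df_rfm["R"].mean()).astype(int)
f_score = (df_rfm["F"] > df_rfm["F"].mean()).astype(int)
m_score = (df_rfm["M"] > df_rfm["M"].mean()).astype(int)

# Concatenate the three bits into one of 8 possible codes, "000" to "111"
df_rfm["segment"] = r_score.astype(str) + f_score.astype(str) + m_score.astype(str)
```

Here "111" marks a user who bought recently, buys often, and spends a lot, while "000" marks the opposite. Other thresholds, such as medians or business-defined cutoffs, fit the same pattern.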

User repurchase analysis

The repurchase cycle is the time interval between two consecutive purchases. Taking the user xiaoming as an example, the first 2 repurchase cycles are 4 days and 22 days

The following is the process of solving the repurchase cycle of each user ：

1. Sort each user's purchase times in ascending order

2. Shift the time column by one row:

3. Combine the two columns and compute the difference:

The null values appear because each user's first record has no earlier record. These rows are simply dropped

Extract just the number of days as a numeric value:
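The code for these steps is missing. The sketch below uses a per-user `shift` rather than whatever combination step the author used, so it is one possible implementation rather than the original. `df16` and `day` match the names in the plotting code; `prev_time` and `cycle` are hypothetical. The dates are chosen so that xiaoming's first two cycles come out as 4 and 22 days, as in the example:

```python
import pandas as pd

# Illustrative purchases for two users
df = pd.DataFrame({
    "name": ["xiaoming", "Mike", "xiaoming", "Mike", "xiaoming"],
    "time": pd.to_datetime(["2019-01-05", "2019-02-01", "2019-01-01",
                            "2019-02-11", "2019-01-27"]),
})

# Step 1: sort each user's purchases chronologically
df_sorted = df.sort_values(["name", "time"]).reset_index(drop=True)

# Step 2: shift the time column down by one row within each user
df_sorted["prev_time"] = df_sorted.groupby("name")["time"].shift(1)

# Step 3: difference between each purchase and the previous one
df_sorted["cycle"] = df_sorted["time"] - df_sorted["prev_time"]

# Each user's first purchase has no predecessor, so its cycle is NaT; drop it
df16 = df_sorted.dropna(subset=["cycle"]).copy()

# Keep only the number of days as an integer
df16["day"] = df16["cycle"].dt.days
```

Shifting inside `groupby("name")` keeps one user's last purchase from being paired with the next user's first purchase.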

5. Repurchase cycle comparison

px.bar(df16,
       x="day",
       y="name",
       orientation="h",
       color="day",
       color_continuous_scale="spectral"   # purples
       )

In the figure above, the narrower a rectangle, the shorter the interval. The total length of a user's bar is the sum of all of that user's repurchase cycles. Next, look at the total and the average repurchase cycle for each user:
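The code for this summary is missing. A sketch, assuming `df16` holds one row per repurchase cycle with the columns "name" and "day"; the values and the name `df17` are illustrative:

```python
import pandas as pd

# Illustrative repurchase cycles, one row per cycle
df16 = pd.DataFrame({
    "name": ["Mike", "xiaoming", "xiaoming"],
    "day": [10, 4, 22],
})

# Total and average repurchase cycle per user
df17 = df16.groupby("name")["day"].agg(["sum", "mean"]).reset_index()
```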

Conclusion: Michk and Mike have relatively long overall repurchase cycles, which makes them loyal users in the long run. Their average repurchase cycles are relatively short, which indicates that they repurchase actively over short periods.

The violin plot also shows that the repurchase cycles of Michk and Mike are the most concentrated.