Analyzing an online retail platform's sales data with Python

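The data comes from Kaggle and covers four years (2015-1-2 to 2018-12-30) of partial retail data from an online shopping platform. An RFM model is built on this dataset to segment customers in finer detail. Because the data is fairly old, January 1, 2019 is used as the reference date. R (Recency): how recently the customer last purchased; F (Frequency): how often the customer purchases; M (Monetary): how much the customer spends.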
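As a minimal sketch of what the three RFM metrics mean in practice, the example below computes them for a handful of made-up orders. The toy data and the names `orders`, `reference_date`, and `rfm` are assumptions for illustration only; the column names mirror the ones used in this article.

```python
import pandas as pd

# Hypothetical toy orders (not taken from the Kaggle dataset).
orders = pd.DataFrame({
    "Customer Name": ["Ann", "Ann", "Bob", "Bob", "Bob", "Cid"],
    "Order Date": pd.to_datetime([
        "2018-11-01", "2018-12-20", "2017-03-05",
        "2017-06-10", "2018-01-15", "2016-08-30",
    ]),
    "Sales": [120.0, 80.0, 40.0, 60.0, 500.0, 15.0],
})

reference_date = pd.Timestamp("2019-01-01")  # same reference date as the article

# Days between each order and the reference date.
orders["days"] = (reference_date - orders["Order Date"]).dt.days

rfm = orders.groupby("Customer Name").agg(
    R=("days", "min"),           # recency: days since the most recent order
    F=("Order Date", "count"),   # frequency: number of order rows
    M=("Sales", "sum"),          # monetary: total spend
)
print(rfm)
```

A small R means the customer bought recently, so lower is better for R while higher is better for F and M.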

Import the required libraries and set a Chinese font for plotting

import datetime
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from pyecharts.charts import Map
from pyecharts import options as opts
from pyecharts.globals import ThemeType
# Use the SimHei font so that Chinese labels render in matplotlib figures
plt.rcParams['font.sans-serif']=['SimHei']

Loading the dataset

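The dataset has 9800 rows and 18 columns, broadly covering order ID, order date, customer ID, customer name, country, city, product category, product name, sales amount, and so on. The postal code column has 11 null values; since this article does not use that column, they are left as they are. Sales per order record has a mean of 230.77, a median of 54.49, a minimum of 0.44, and a maximum of 22638.48; the wide spread is probably because the products sold span very different price ranges.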

df=pd.read_csv(r'E:\kaggle data\train.csv')
print(df.isnull().sum())   # number of null values per column
print(df.shape)            # (rows, columns)
print(df.describe())       # summary statistics of the numeric columns

Data type handling

To analyze how many days before 2019-01-01 each purchase took place, a days column is added.

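# If the CSV stores dates day-first (dd/mm/yyyy), pass dayfirst=True here
df['Order Date']=pd.to_datetime(df['Order Date'])
# Reference date for the recency calculation
ref_str='2019-1-1'
ntime=datetime.datetime.strptime(ref_str,'%Y-%m-%d')
# Days between each order and the reference date
df['days']=(ntime-df['Order Date']).dt.days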

Geographic distribution of customers (2015-2018)

# Count the order rows in each state
df1=df.groupby('State').agg({'State':'count'})
datas=[(i,int(j)) for i,j in zip(df1.index,df1.State)]
map_=Map(init_opts=opts.InitOpts(theme=ThemeType.PURPLE_PASSION))
# The third argument is the map type passed to pyecharts (here the United States)
map_.add("", datas, "美国")
# Chart title: "Distribution of customers by region"; the legend is split into bands of 500
map_.set_global_opts(title_opts=opts.TitleOpts('顾客所在地区分布图'),
                     visualmap_opts=opts.VisualMapOpts(is_piecewise=True,
                         pieces=[{'max':2000,'min':1500,'label':'1500-2000','color':'#ff9999'},
                                 {'max':1500,'min':1000,'label':'1000-1500','color':'#66b3ff'},
                                 {'max':1000,'min':500,'label':'500-1000','color':'#99ff99'},
                                 {'max':500,'min':1,'label':'1-500','color':'#ffcc99'}]
                     ))
# Output file name: "Sales region distribution"
map_.render('销售地区分布.html')

Customer repurchase cycle

For each customer, this is the time elapsed between a repeat purchase and the purchase before it.

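# Keep only customer and order date; .copy() so later assignments do not touch df
dcc=df[['Customer Name','Order Date']].copy()
# 购买次数 = number of purchases per customer
dcc1=dcc.groupby('Customer Name').count().rename(columns={'Order Date':'购买次数'})
dcc1.reset_index(inplace=True)
# Only customers with more than one purchase can have a repurchase interval
dcc1=dcc1[dcc1['购买次数']>1]
print(dcc1)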

dcc=dcc[dcc['Customer Name'].isin(dcc1['Customer Name'].to_list())].copy()
dcc['Order Date']=pd.to_datetime(dcc['Order Date'])
dcc=dcc.sort_values(['Customer Name','Order Date'],ascending=True).reset_index(drop=True)
# Previous order date of the same customer (NaT for each customer's first order)
dcc2=dcc.groupby('Customer Name').shift(1).rename(columns={'Order Date':'Order_Date'})
dcc2=pd.concat([dcc,dcc2],axis=1)
dcc2=dcc2.dropna().reset_index(drop=True)
# 复购天数 = days since the previous purchase
dcc2['复购天数']=(dcc2['Order Date']-dcc2['Order_Date']).dt.days
print(dcc2)

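# 复购次数 = number of repurchases, 复购天数 = total days between purchases
dcc3=dcc2.groupby('Customer Name').\
    agg({'Customer Name':'count','复购天数':'sum'}).\
    rename(columns={'Customer Name':'复购次数'})
dcc3.reset_index(inplace=True)
print(dcc3.head(20))
# 平均复购时间 = average repurchase interval in days
dcc3['平均复购时间']=dcc3['复购天数']/dcc3['复购次数']
print(dcc3)
This gives the average repurchase cycle of each customer.

print(dcc3['平均复购时间'].sum()/len(dcc3))
Average repurchase cycle across all customers: 114.31627962122226 days (about 114.3 days).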
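The same repurchase-interval calculation can be sketched more compactly with `groupby(...).diff()`, which subtracts each customer's previous order date directly. This is an illustrative alternative under assumed toy data, not the code used above; `orders`, `gap_days`, `avg_cycle`, and `overall` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy orders for three customers.
orders = pd.DataFrame({
    "Customer Name": ["Ann", "Ann", "Ann", "Bob", "Bob", "Cid"],
    "Order Date": pd.to_datetime([
        "2018-01-01", "2018-01-11", "2018-02-10",
        "2018-03-01", "2018-03-31", "2018-05-05",
    ]),
})

orders = orders.sort_values(["Customer Name", "Order Date"])

# Gap in days to the same customer's previous order; NaN for a first order.
orders["gap_days"] = orders.groupby("Customer Name")["Order Date"].diff().dt.days

# Average gap per customer; customers with a single order drop out.
avg_cycle = (
    orders.dropna(subset=["gap_days"])
          .groupby("Customer Name")["gap_days"]
          .mean()
)

# Mean of the per-customer averages, matching the article's overall figure.
overall = avg_cycle.mean()
print(avg_cycle)
print(overall)
```

Averaging the per-customer averages gives every customer equal weight, regardless of how many times they bought.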

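Building the RFM pivot table

Compute the RFM metrics, extract the corresponding columns into a pivot table, and inspect how the values in this sub-dataset are distributed.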

# R = days since the most recent order, F = number of order rows, M = total sales
dc=pd.pivot_table(df,index='Customer Name',values=['Sales','days','Customer ID'],
                  aggfunc={'Sales':'sum','days':'min','Customer ID':'count'})
dc=dc[['days','Customer ID','Sales']]
dc.columns=['R','F','M']
print(dc)
# Histograms of R, F and M
dc.hist()
plt.show()

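Next, look at the distribution of purchase frequency. The RFM scores are built by splitting customers into five bands (quintiles): for example, the top 20% of customers by purchase frequency score 5 and the bottom 20% score 1. In this article, 16 or more purchases score 5, 13 to 15 score 4, 10 to 12 score 3, 7 to 9 score 2, and fewer than 7 score 1.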

dc1=dc.groupby(['F']).count().sort_values('R',ascending=False)
dc1.reset_index(inplace=True)
print(dc1)
plt.xlabel('消费次数')
plt.bar(dc1['F'],dc1['R'],label='客户数')
plt.legend()
plt.show()
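The quintile idea described above (top 20% score 5, bottom 20% score 1) can be sketched with `pandas.qcut`. This is only an illustration on made-up frequencies; the article itself uses hand-picked thresholds rather than exact quintiles, and `freq` and `f_score` are hypothetical names.

```python
import pandas as pd

# Hypothetical purchase counts for ten customers.
freq = pd.Series([2, 4, 5, 7, 8, 10, 11, 13, 15, 20], name="F")

# rank(method="first") breaks ties, so qcut always receives unique bin edges.
f_score = pd.qcut(
    freq.rank(method="first"),
    q=5,                      # five equal-sized bands
    labels=[1, 2, 3, 4, 5],   # lowest band scores 1, highest scores 5
).astype(int)

print(pd.DataFrame({"F": freq, "F_score": f_score}))
```

Exact quintiles keep the bands equally populated, whereas fixed thresholds are easier to explain to a business audience; which one fits better depends on how the scores will be used.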

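Building the RFM tiered scoring strategy

Based on the distribution of the sub-dataset, each of R, F, and M is split into 5 levels. The mean of each score is then computed: a score above the mean is coded as 1, and a score at or below the mean is coded as 0.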

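def f1(x):
    # Recency score: fewer days since the last purchase gives a higher score
    if x>=250:
        return 1
    elif 150<=x<250:
        return 2
    elif 80<=x<150:
        return 3
    elif 40<=x<80:
        return 4
    elif x<40:
        return 5
def f2(x):
    # Frequency score; x>=16 so that exactly 16 purchases is scored 5 instead of None
    if x<7:
        return 1
    elif 7<=x<10:
        return 2
    elif 10<=x<13:
        return 3
    elif 13<=x<16:
        return 4
    elif x>=16:
        return 5
def f3(x):
    # Monetary score; x>=4500 so that exactly 4500 is scored 5 instead of None
    if x>=4500:
        return 5
    elif 2500<=x<4500:
        return 4
    elif 1800<=x<2500:
        return 3
    elif 900<=x<1800:
        return 2
    elif x<900:
        return 1
dc['R_score']=dc['R'].apply(f1)
dc['F_score']=dc['F'].apply(f2)
dc['M_score']=dc['M'].apply(f3)
print(dc)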
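Threshold scoring of this kind can also be written with `pandas.cut`, which makes the band boundaries explicit in one place. The sketch below assumes the same thresholds as the recency and frequency functions above, with 16 purchases counting toward the top band as the text describes; the sample values and the names `days`, `freq`, `r_score`, and `f_score` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values chosen to sit on and around the band edges.
days = pd.Series([10, 40, 79, 80, 200, 250, 400])
freq = pd.Series([3, 7, 9, 10, 13, 16, 25])

# right=False makes each band closed on the left: [low, high)
r_score = pd.cut(
    days,
    bins=[-np.inf, 40, 80, 150, 250, np.inf],
    right=False,
    labels=[5, 4, 3, 2, 1],   # fewer days since last purchase scores higher
).astype(int)

f_score = pd.cut(
    freq,
    bins=[-np.inf, 7, 10, 13, 16, np.inf],
    right=False,
    labels=[1, 2, 3, 4, 5],   # more purchases scores higher
).astype(int)

print(r_score.tolist())
print(f_score.tolist())
```

Because the bins cover the whole number line, no value can fall through without a score, which is the failure mode that chained comparisons are prone to at boundary values.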

# Mean of each score across all customers
a=dc['R_score'].mean()
b=dc['F_score'].mean()
c=dc['M_score'].mean()
print(a,b,c)
# Code each score as 1 if it is above the mean, otherwise 0
dc['R大于均值']=(dc['R_score']>a).astype(int)
dc['F大于均值']=(dc['F_score']>b).astype(int)
dc['M大于均值']=(dc['M_score']>c).astype(int)
print(dc)

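In the author's run, the mean R, F, and M scores came out as 3.1059268600252206, 3.116339869281046, and 3.010088272383354 respectively. Next, each customer is labeled according to the RFM codes in the table below.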

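Customer segment	Behavior	Recommendation	RFM
High-value customers (重要价值客户)	Purchased recently, high purchase frequency, high cumulative spend	Personalized VIP service	111
Potential customers (潜力客户)	Purchased recently, high purchase frequency, low cumulative spend	Promote discounted items more and bundle higher-value products	110
Key customers to cultivate (重要深耕客户)	Purchased recently, low purchase frequency, high cumulative spend	Points program; communicate member events and benefits to increase customer stickiness	101
New customers (新客户)	Purchased recently, low purchase frequency, low cumulative spend	Push recent offers by email or SMS, offer free trials, raise brand awareness	100
Key customers to win back (重要唤回客户)	No purchase for more than 80 days, high purchase frequency, high cumulative spend	During major campaigns, call customers to tell them about current discounts	011
Regular customers (一般客户)	No purchase for more than 80 days, high purchase frequency, low cumulative spend	Promote sales campaigns	010
Key customers to retain (重要挽回客户)	No purchase for more than 80 days, low purchase frequency, high cumulative spend	Promote flagship products; reach out proactively during major holidays	001
Lost customers (流失客户)	No purchase for more than 80 days, low purchase frequency, low cumulative spend	Send promotional information to revive interest; otherwise set aside for now	000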

# Map each (R, F, M) above-mean code to its segment label
labels={
    (1,1,1):"重要价值客户",  # high-value customers
    (1,1,0):"潜力客户",      # potential customers
    (1,0,1):"重要深耕客户",  # key customers to cultivate
    (1,0,0):"新客户",        # new customers
    (0,1,1):"重要唤回客户",  # key customers to win back
    (0,1,0):"一般客户",      # regular customers
    (0,0,1):"重要挽回客户",  # key customers to retain
    (0,0,0):"流失客户",      # lost customers
}
def fun(x):
    return labels[(x.iloc[0],x.iloc[1],x.iloc[2])]
dc['客户分类']=dc[['R大于均值','F大于均值','M大于均值']].apply(fun,axis=1)
print(dc)
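To make the coding step concrete, the sketch below takes a few made-up scores, flags each one as above or not above its column mean, joins the three flags into a code such as `101`, and looks the code up in a dictionary. The scores, the English segment names, and the names `scores`, `flags`, `code`, and `segments` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical R/F/M scores for four customers.
scores = pd.DataFrame(
    {"R_score": [5, 1, 4, 2], "F_score": [5, 1, 1, 4], "M_score": [4, 2, 5, 1]},
    index=["Ann", "Bob", "Cid", "Dee"],
)

# 1 if the score is strictly above its column mean, otherwise 0.
flags = (scores > scores.mean()).astype(int)

# Join the three flags into a three-character code such as "101".
code = (
    flags["R_score"].astype(str)
    + flags["F_score"].astype(str)
    + flags["M_score"].astype(str)
)

# English renderings of the eight segments in the table above.
segment_names = {
    "111": "High-value customers",
    "110": "Potential customers",
    "101": "Key customers to cultivate",
    "100": "New customers",
    "011": "Key customers to win back",
    "010": "Regular customers",
    "001": "Key customers to retain",
    "000": "Lost customers",
}
segments = code.map(segment_names)
print(pd.DataFrame({"code": code, "segment": segments}))
```

Here the column means are 3, 2.75, and 3, so Ann is coded 111, Bob 000, Cid 101, and Dee 010.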

Visualizing the customer segments

Customer counts by segment and each segment's share of sales

dc1=dc.groupby('客户分类').count()
dc1.reset_index(inplace=True)
dc1=dc1.sort_values('R')
plt.title('不同类型客户的人数对比')
plt.barh(dc1['客户分类'],dc1['R'])
plt.show()

dc2=dc.groupby('客户分类').agg({'M':'sum'})
dc2.reset_index(inplace=True)
plt.title('不同类型客户的销售额百分比')
color=['#ff9999','#66b3ff','#99ff99','#ffcc99','#55B4B0','orchid','navy','yellow']
plt.pie(dc2['M'],labels=dc2['客户分类'],colors=color,autopct='%0.2f%%')
plt.show()

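Sales amount bands: customer counts and share of sales

To test the well-known 80/20 rule, each customer's cumulative purchase amount is used to show how spending differs between customers. The mean total spend per customer is 2851 yuan. Half of the mean, 1425, is used as the band width, giving ten bands. Most customers have a cumulative spend below 4276 yuan (1.5 × the mean): they make up 80.33% of customers (over 4/5) and contribute 52.09% of the store's revenue, while the remaining 19.67% of customers contribute nearly half of the sales revenue.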

print(dc['M'].mean())
print(dc.sort_values('M'))
# Band edges: multiples of half the mean spend (1425), open-ended at both ends
edges=[-np.inf,1425,2851,4276,5701,7126,8551,9976,11401,12826,np.inf]
bands=[dc[(dc['M']>=lo)&(dc['M']<hi)] for lo,hi in zip(edges[:-1],edges[1:])]
y1=[len(band) for band in bands]               # number of customers per band
y2=[band['M'].sum()/1000 for band in bands]    # sales per band, in thousands
x=np.arange(10)
bar_width=0.3
x_labels=['1425元以下','1425-2851元','2851-4276元','4276-5701元','5701-7126元','7126-8551元','8551-9976元','9976-11401元',
          '11401-12826元','12826元以上']
# Legend labels: 客户数占比 = share of customers, 销售额占比 = share of sales
plt.bar(x,y1,width=bar_width,label='客户数占比',color='orchid')
plt.bar(x+bar_width,y2,width=bar_width,label='销售额占比',color='#66b3ff')
plt.legend()
plt.xticks(x+bar_width/2,x_labels,size='small',rotation=30)
# Annotate each bar with its percentage of the total
for xi,n in zip(x,y1):
    plt.text(xi,n,"%.2f%%"%(n/sum(y1)*100),ha='center')
for xi,s in zip(x+bar_width,y2):
    plt.text(xi,s,"%.2f%%"%(s/sum(y2)*100),ha='center')
plt.show()
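A more direct way to check the 80/20 claim is to sort customers by spend and read off the cumulative share of revenue held by the top 20%. The sketch below uses made-up spend figures, so its result says nothing about the real dataset; `spend`, `cum_share`, and `top20_share` are hypothetical names.

```python
import pandas as pd

# Hypothetical total spend for ten customers (sums to 10000).
spend = pd.Series(
    [5000, 3000, 800, 500, 300, 200, 100, 50, 30, 20],
    index=[f"customer_{i}" for i in range(10)],
)

# Largest spenders first, then the running share of total revenue.
spend_sorted = spend.sort_values(ascending=False)
cum_share = spend_sorted.cumsum() / spend_sorted.sum()

# Share of revenue contributed by the top 20% of customers.
top_n = int(len(spend_sorted) * 0.2)
top20_share = cum_share.iloc[top_n - 1]
print(f"Top 20% of customers contribute {top20_share:.0%} of revenue")
```

Applied to the `M` column of the RFM table, the same calculation would give a single number to compare against the 80% benchmark, instead of reading it off the banded chart.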

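Analysis of the key customers to cultivate (重要深耕客户)

The RFM pivot table is joined to the main table and the required columns are selected. For this segment, the top three product sub-categories by number of purchases are binders (粘合剂 in the original), chairs, and labels; the top five by sales amount are copiers, binders, chairs, tables, and machines (器械). Copiers, chairs, and tables have high unit prices and are not consumables, which explains why this group buys infrequently yet spends relatively large amounts.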

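# Customers labeled 重要深耕客户 (key customers to cultivate)
dt=dc[dc['客户分类']=='重要深耕客户'].reset_index()
# Join against the full order table so that every purchase of these customers is kept;
# Sub-Category is selected because the grouping below needs it
dt=pd.merge(dt.loc[:,['Customer Name']],
            df.loc[:,['Customer Name','City','State','Sub-Category','Product Name','Sales']],
            how='left',on='Customer Name')
print(dt)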

dt1=dt.groupby(['Sub-Category']).agg({'Sales':'sum','City':'count'})
dt1=dt1.sort_values('Sales',ascending=False)
dt1.reset_index(inplace=True)
plt.style.use('ggplot')
plt.title('重要深耕用户不同产品类别的购买金额')
plt.bar(dt1['Sub-Category'],dt1['Sales'])
plt.xticks(rotation=60)
plt.show()
plt.title('重要深耕用户不同产品类别的购买次数')
plt.bar(dt1['Sub-Category'],dt1['City'])
plt.xticks(rotation=60)
plt.show()
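The contrast between purchase count and purchase amount can be seen in a small example using named aggregation. The purchases below are made up, and `purchases`, `summary`, `total_sales`, and `n_purchases` are hypothetical names; only the `Sub-Category` and `Sales` column names come from the dataset.

```python
import pandas as pd

# Hypothetical purchases by one customer segment.
purchases = pd.DataFrame({
    "Sub-Category": ["Copiers", "Binders", "Binders", "Chairs", "Binders"],
    "Sales": [900.0, 20.0, 35.0, 300.0, 15.0],
})

summary = (
    purchases.groupby("Sub-Category")
             .agg(total_sales=("Sales", "sum"),     # amount spent per sub-category
                  n_purchases=("Sales", "count"))   # number of purchases per sub-category
             .sort_values("total_sales", ascending=False)
)
print(summary)
```

In this toy data, Binders is bought most often but Copiers brings in the most money from a single purchase, the same pattern the article describes for high-priced, non-consumable items.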