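Learning big data analytics and machine learning from scratch: some brief notes on this competition. Score: 0.623537; rank: 629/5602.
I. Competition Background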
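Merchants sometimes run large promotions (such as discounts or cash coupons) on particular dates (such as Boxing Day sales, "Black Friday", or "Double 11" on November 11) to attract large numbers of new buyers. Many of these buyers, however, are one-time buyers who never purchase again, so promotions aimed at them do little to increase the shop's future sales. To mitigate this, merchants need to identify which buyers can be converted into repeat buyers. By targeting these potentially loyal customers, merchants can greatly reduce promotion costs and improve return on investment (ROI). Precise user targeting is well known to be very challenging in online advertising, especially for new buyers. With the user behavior logs that Tmall has accumulated over a long period, however, we may be able to solve this problem.
In this challenge, we provide a set of merchants and the new buyers they acquired during the "Double 11" promotion. Your task is to predict which of these new buyers will become loyal customers of the given merchants in the future. In other words, you need to predict the probability that each new buyer purchases from the same merchant again within the next six months. We provide a dataset of about 200,000 users for training and another of similar size for testing. As in other competitions, you may extract any features and train with any tools you like. You only need to submit the prediction results for evaluation.
Link: Tmall Repeat Buyer Prediction Challenge Baseline - Tianchi Competition - Alibaba Cloud Tianchi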
II. Data Exploration
First, import the required packages:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
Read the data:
train_data = pd.read_csv("data_format1/train_format1.csv")
test_data = pd.read_csv("data_format1/test_format1.csv")
user_info = pd.read_csv("data_format1/user_info_format1.csv")
user_log = pd.read_csv("data_format1/user_log_format1.csv")
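Check missing values in the user info data: age
Check missing values in the user info data: gender
Check missing values in the user info data: age or gender
Check missing values in the user behavior log data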
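The code for these checks is not shown above, so here is a minimal sketch of how they might look. It runs on a tiny made-up table rather than the real files, and it assumes that an unknown age is stored as NaN or 0 and an unknown gender as NaN or 2; please verify that encoding against the dataset description before relying on it. The names `user_info_demo` and `user_log_demo` are hypothetical.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for user_info_format1.csv (values are made up)
user_info_demo = pd.DataFrame({
    "user_id":   [1, 2, 3, 4, 5, 6],
    "age_range": [3, 0, np.nan, 5, np.nan, 2],
    "gender":    [0, 1, 2, np.nan, 1, 0],
})

# Assumption: unknown age is NaN or 0, unknown gender is NaN or 2
age_missing = user_info_demo["age_range"].isna() | (user_info_demo["age_range"] == 0)
gender_missing = user_info_demo["gender"].isna() | (user_info_demo["gender"] == 2)

missing_summary = {
    "age": int(age_missing.sum()),
    "gender": int(gender_missing.sum()),
    "age_or_gender": int((age_missing | gender_missing).sum()),
}
print(missing_summary)

# For the behavior log, a per-column NaN count is usually enough
user_log_demo = pd.DataFrame({
    "user_id":  [1, 1, 2],
    "brand_id": [10.0, np.nan, 12.0],
})
log_missing = user_log_demo.isna().sum()
print(log_missing)
```

On the real data the same expressions would be applied to `user_info` and `user_log` directly.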
Analyze the merchants
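The merchant analysis itself is not shown, so the following is only one possible sketch: count the new buyers per merchant in the training table and compute the share of them who became repeat buyers. The frame `train_demo` is made up; the column names `user_id`, `merchant_id`, and `label` are the ones used by the training table later in this post.

```python
import pandas as pd

# Tiny stand-in for train_format1.csv (values are made up)
train_demo = pd.DataFrame({
    "user_id":     [1, 2, 3, 4, 5, 6],
    "merchant_id": [100, 100, 100, 200, 200, 300],
    "label":       [0, 1, 0, 0, 0, 1],
})

# Per merchant: number of new buyers, number of repeat buyers, repeat rate
merchant_stats = (
    train_demo.groupby("merchant_id")["label"]
    .agg(buyers="count", repeat_buyers="sum", repeat_rate="mean")
    .sort_values("buyers", ascending=False)
)
print(merchant_stats)
```

Sorting by `buyers` puts the largest merchants first; merchants with very few buyers have noisy repeat rates and should be read with care.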
III. Feature Engineering
Import the required packages:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams["font.sans-serif"] = "SimHei"  # use a font with CJK glyphs so Chinese text in plots is not garbled
import seaborn as sns
import random
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from sklearn import model_selection
from sklearn.neighbors import KNeighborsRegressor
Read the data:
# Forward slashes work on Windows, Linux, and macOS alike
df_train = pd.read_csv('data_format1/train_format1.csv')
df_test = pd.read_csv('data_format1/test_format1.csv')
user_info = pd.read_csv('data_format1/user_info_format1.csv')
user_log = pd.read_csv('data_format1/user_log_format1.csv')
print(df_test.shape,df_train.shape)
print(user_info.shape,user_log.shape)
Visualize the age distribution:
fig = plt.figure(figsize = (10, 6))
x = np.array(["NULL","<18","18-24","25-29","30-34","35-39","40-49",">=50"])
# age_range codes: 1 = <18; 2 = [18,24]; 3 = [25,29]; 4 = [30,34]; 5 = [35,39]; 6 = [40,49]; 7 and 8 = >=50
# "NULL" counts both NaN and -1: == -1 alone only matches if missing ages were filled with -1 beforehand
age = user_info['age_range']
y = np.array([(age.isna() | (age == -1)).sum(),
              (age == 1).sum(),
              (age == 2).sum(),
              (age == 3).sum(),
              (age == 4).sum(),
              (age == 5).sum(),
              (age == 6).sum(),
              ((age == 7) | (age == 8)).sum()])
plt.bar(x, y, label='Number of users')
plt.legend()
plt.title('User age distribution')
The resulting plot:
Next, merge the features into the training set:
# Attach age_range and gender from the user profile
df_train = pd.merge(df_train,user_info,on="user_id",how="left")
# total_item_id: number of log records per (user, merchant) pair
total_logs_temp = user_log.groupby([user_log["user_id"],user_log["seller_id"]])["item_id"].count().reset_index()
total_logs_temp.rename(columns={"seller_id":"merchant_id","item_id":"total_item_id"},inplace=True)
df_train = pd.merge(df_train,total_logs_temp,on=["user_id","merchant_id"],how="left")
# unique_item_id: number of distinct items per (user, merchant) pair
unique_item_id = user_log.groupby(["user_id","seller_id","item_id"]).count().reset_index()[["user_id","seller_id","item_id"]]
unique_item_id_cnt = unique_item_id.groupby(["user_id","seller_id"]).count().reset_index()
unique_item_id_cnt.rename(columns={"seller_id":"merchant_id","item_id":"unique_item_id"},inplace=True)
df_train = pd.merge(df_train, unique_item_id_cnt, on=["user_id", "merchant_id"], how="left")
# total_cat_id: number of distinct categories per (user, merchant) pair
cat_id_temp = user_log.groupby(["user_id", "seller_id", "cat_id"]).count().reset_index()[["user_id", "seller_id", "cat_id"]]
cat_id_temp_cnt = cat_id_temp.groupby(["user_id", "seller_id"]).count().reset_index()
cat_id_temp_cnt.rename(columns={"seller_id":"merchant_id","cat_id":"total_cat_id"},inplace=True)
df_train = pd.merge(df_train, cat_id_temp_cnt, on=["user_id", "merchant_id"], how="left")
# total_time_temp: number of distinct time_stamp values per (user, merchant) pair
time_temp = user_log.groupby(["user_id", "seller_id", "time_stamp"]).count().reset_index()[["user_id", "seller_id", "time_stamp"]]
time_temp_cnt = time_temp.groupby(["user_id", "seller_id"]).count().reset_index()
time_temp_cnt.rename(columns={"seller_id":"merchant_id","time_stamp":"total_time_temp"},inplace=True)
df_train = pd.merge(df_train, time_temp_cnt, on=["user_id", "merchant_id"], how="left")
# clicks, shopping_cart, purchases, favourites: record counts for action_type 0, 1, 2, 3
click_temp = user_log.groupby(["user_id", "seller_id", "action_type"])["item_id"].count().reset_index()
click_temp.rename(columns={"seller_id":"merchant_id","item_id":"times"},inplace=True)
# Each boolean mask times the count keeps the count only on rows of the matching action type
click_temp["clicks"] = click_temp["action_type"] == 0
click_temp["clicks"] = click_temp["clicks"] * click_temp["times"]
click_temp["shopping_cart"] = click_temp["action_type"] == 1
click_temp["shopping_cart"] = click_temp["shopping_cart"] * click_temp["times"]
click_temp["purchases"] = click_temp["action_type"] == 2
click_temp["purchases"] = click_temp["purchases"] * click_temp["times"]
click_temp["favourites"] = click_temp["action_type"] == 3
click_temp["favourites"] = click_temp["favourites"] * click_temp["times"]
four_features = click_temp.groupby(["user_id", "merchant_id"]).sum().reset_index()
# Drop the helper columns
four_features = four_features.drop(["action_type", "times"], axis=1)
# Merge into the training set
df_train = pd.merge(df_train, four_features, on=["user_id", "merchant_id"], how="left")
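# Forward-fill missing values: each gap takes the value from the previous row
# (fillna(method="ffill") is deprecated in recent pandas; ffill() behaves the same)
df_train = df_train.ffill()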
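The same per-pair features can also be computed in fewer passes over the log. The following is a minimal sketch, not the code used for the score above: it uses named aggregation with `nunique` for the distinct counts and `pd.crosstab` for the four action counts. `log_demo` is a made-up stand-in for `user_log`; the feature names match the ones built above.

```python
import pandas as pd

# Tiny stand-in for user_log_format1.csv (values are made up)
log_demo = pd.DataFrame({
    "user_id":     [1, 1, 1, 1, 2],
    "seller_id":   [10, 10, 10, 10, 10],
    "item_id":     [5, 5, 6, 7, 5],
    "cat_id":      [1, 1, 1, 2, 1],
    "time_stamp":  [1111, 1111, 1110, 1111, 1111],
    "action_type": [0, 0, 2, 3, 0],
})

keys = ["user_id", "seller_id"]

# Record count and distinct counts in a single groupby
counts = log_demo.groupby(keys).agg(
    total_item_id=("item_id", "count"),
    unique_item_id=("item_id", "nunique"),
    total_cat_id=("cat_id", "nunique"),
    total_time_temp=("time_stamp", "nunique"),
)

# One column per action type; reindex keeps all four even if one never occurs
actions = (
    pd.crosstab([log_demo["user_id"], log_demo["seller_id"]], log_demo["action_type"])
    .reindex(columns=[0, 1, 2, 3], fill_value=0)
    .rename(columns={0: "clicks", 1: "shopping_cart", 2: "purchases", 3: "favourites"})
)

features = (
    counts.join(actions)
    .reset_index()
    .rename(columns={"seller_id": "merchant_id"})
)
print(features)
```

A single left merge of `features` onto the training table on `["user_id", "merchant_id"]` would then replace the five separate merges. The same table can be merged onto the test set, which needs identical columns before a model can score it.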
Save the engineered features to a separate file:
# Save the constructed features (index=False leaves out the row index)
df_train.to_csv("df_train.csv",index=False)
IV. Model Building
y = df_train["label"]
X = df_train.drop(["user_id", "merchant_id", "label"], axis=1)
X.head(10)
Split the data:
# Hold out 20% of the rows for validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)
Logistic regression:
# Logistic regression
Logit = LogisticRegression(solver='liblinear')
Logit.fit(X_train, y_train)
Predict = Logit.predict(X_test)
Predict_proba = Logit.predict_proba(X_test)
print(Predict[0:20])
print(Predict_proba[:])
Score = accuracy_score(y_test, Predict)
Score
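Since the task asks for a probability of repeat purchase, a ranking metric such as ROC AUC is probably more informative on the validation split than accuracy. The sketch below is an illustration on synthetic data, not the pipeline above: the class ratio (about 6% positives) and the names `X_demo`, `y_demo`, and `proba_pos` are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the feature table (class ratio is an assumption)
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=8, weights=[0.94, 0.06], random_state=10
)
# stratify keeps the positive rate the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=10, stratify=y_demo
)

model = LogisticRegression(solver="liblinear")
model.fit(X_tr, y_tr)

# Column 1 of predict_proba is the probability of the positive class
proba_pos = model.predict_proba(X_te)[:, 1]
acc = accuracy_score(y_te, model.predict(X_te))
auc = roc_auc_score(y_te, proba_pos)

# Accuracy of a "model" that always predicts the majority class
baseline_acc = accuracy_score(y_te, np.zeros_like(y_te))

print(f"accuracy: {acc:.3f}, always-0 baseline: {baseline_acc:.3f}, AUC: {auc:.3f}")
```

With so few positives, always predicting 0 already gets a high accuracy, so accuracy says little about how well the model ranks buyers. In the code above the equivalent call would be `roc_auc_score(y_test, Predict_proba[:, 1])`.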
Decision tree:
# Decision tree
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(max_depth=4,random_state=0)
tree.fit(X_train, y_train)
Predict_proba = tree.predict_proba(X_test)
print(Predict_proba[:])
print("Accuracy on training set: {:.3f}".format(tree.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(tree.score(X_test, y_test)))
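One benefit of a shallow tree is that it is easy to see which features it actually splits on. As a rough sketch (again on synthetic data, with four of the feature names from this post reused as hypothetical column labels), `feature_importances_` can be wrapped in a Series and sorted:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Column labels borrowed from the features above; the data itself is synthetic
feature_names = ["total_item_id", "unique_item_id", "clicks", "purchases"]
X_arr, y_arr = make_classification(
    n_samples=1000, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=feature_names)

tree_demo = DecisionTreeClassifier(max_depth=4, random_state=0)
tree_demo.fit(X_demo, y_arr)

# Impurity-based importances; they sum to 1 once the tree has at least one split
importances = pd.Series(
    tree_demo.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```

For the tree trained above, the equivalent would be `pd.Series(tree.feature_importances_, index=X.columns)`. Features with an importance of 0 were never used in a split at this depth.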