I have a dataframe that looks like this:
user item affinity
0 1 13 0.1
1 2 11 0.4
2 3 14 0.9
3 4 12 1.0
From it, I want to create an encoded dataset (for fastFM) like this:
user1 user2 user3 user4 item11 item12 item13 item14 affinity
1 0 0 0 0 0 1 0 0.1
0 1 0 0 1 0 0 0 0.4
0 0 1 0 0 0 0 1 0.9
0 0 0 1 0 1 0 0 1.0
Do I need sklearn's DictVectorizer? If so, is there a way to convert the original dataframe into a dict, which I can then feed to DictVectorizer, which would in turn give me the encoded dataset shown above?
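For reference, DictVectorizer can indeed produce this layout. A minimal sketch of that route (the per-row dict construction here, with the id baked into the key name so each user/item becomes its own binary feature, is an illustrative assumption, not part of the accepted answer below):

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

df = pd.DataFrame({'user': [1, 2, 3, 4],
                   'item': [13, 11, 14, 12],
                   'affinity': [0.1, 0.4, 0.9, 1.0]})

# One dict per row; keys like 'user1' / 'item13' become one-hot columns.
records = [{'user' + str(u): 1, 'item' + str(i): 1}
           for u, i in zip(df.user, df.item)]

v = DictVectorizer(sparse=False)
X = v.fit_transform(records)   # feature names are sorted alphabetically
y = df.affinity.values
print(v.feature_names_)
# ['item11', 'item12', 'item13', 'item14', 'user1', 'user2', 'user3', 'user4']
```

Note that DictVectorizer orders the columns alphabetically (items before users), not in the user-then-item order shown above, and by default it returns a scipy sparse matrix, which is what fastFM expects anyway.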
Solution:
You can use get_dummies together with concat. If the values in the user or item columns are numeric, cast them to strings first via astype:
import pandas as pd

df = pd.DataFrame({'item': {0: 13, 1: 11, 2: 14, 3: 12},
                   'affinity': {0: 0.1, 1: 0.4, 2: 0.9, 3: 1.0},
                   'user': {0: 1, 1: 2, 2: 3, 3: 4}},
                  columns=['user','item','affinity'])
print(df)
user item affinity
0 1 13 0.1
1 2 11 0.4
2 3 14 0.9
3 4 12 1.0
df1 = df.user.astype(str).str.get_dummies()
df1.columns = ['user' + str(x) for x in df1.columns]
print(df1)
user1 user2 user3 user4
0 1 0 0 0
1 0 1 0 0
2 0 0 1 0
3 0 0 0 1
df2 = df.item.astype(str).str.get_dummies()
df2.columns = ['item' + str(x) for x in df2.columns]
print(df2)
item11 item12 item13 item14
0 0 0 1 0
1 1 0 0 0
2 0 0 0 1
3 0 1 0 0
print(pd.concat([df1, df2, df.affinity], axis=1))
user1 user2 user3 user4 item11 item12 item13 item14 affinity
0 1 0 0 0 0 0 1 0 0.1
1 0 1 0 0 1 0 0 0 0.4
2 0 0 1 0 0 0 0 1 0.9
3 0 0 0 1 0 1 0 0 1.0
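The two astype/get_dummies steps above can also be collapsed into a single pd.get_dummies call on the whole frame, using its columns, prefix, and prefix_sep parameters (a sketch; prefix_sep='' reproduces the user1/item11-style column names, though the dummy columns land after affinity rather than before it):

```python
import pandas as pd

df = pd.DataFrame({'user': [1, 2, 3, 4],
                   'item': [13, 11, 14, 12],
                   'affinity': [0.1, 0.4, 0.9, 1.0]},
                  columns=['user', 'item', 'affinity'])

# One call: dummy-encode user and item, pass affinity through untouched.
encoded = pd.get_dummies(df, columns=['user', 'item'],
                         prefix=['user', 'item'], prefix_sep='')
print(encoded)
```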
Timings:
len(df)= 4:
In [49]: %timeit pd.concat([df1,df2, df.affinity], axis=1)
The slowest run took 4.91 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 690 µs per loop
len(df)= 40:
df = pd.concat([df]*10).reset_index(drop=True)
In [51]: %timeit pd.concat([df1,df2, df.affinity], axis=1)
The slowest run took 5.56 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 719 µs per loop
len(df)= 400:
df = pd.concat([df]*100).reset_index(drop=True)
In [43]: %timeit pd.concat([df1,df2, df.affinity], axis=1)
The slowest run took 4.55 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 748 µs per loop
len(df)= 4k:
df = pd.concat([df]*1000).reset_index(drop=True)
In [41]: %timeit pd.concat([df1,df2, df.affinity], axis=1)
The slowest run took 4.67 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 761 µs per loop
len(df)= 40k:
df = pd.concat([df]*10000).reset_index(drop=True)
%timeit pd.concat([df1,df2, df.affinity], axis=1)
1000 loops, best of 3: 1.83 ms per loop
len(df)= 400k:
df = pd.concat([df]*100000).reset_index(drop=True)
%timeit pd.concat([df1,df2, df.affinity], axis=1)
100 loops, best of 3: 15.6 ms per loop
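Since the question targets fastFM, which takes a scipy sparse feature matrix, the encoded frame still needs one conversion step. A minimal sketch (assuming scipy is installed; the variable names are illustrative):

```python
import pandas as pd
import scipy.sparse as sp

df = pd.DataFrame({'user': [1, 2, 3, 4],
                   'item': [13, 11, 14, 12],
                   'affinity': [0.1, 0.4, 0.9, 1.0]})

encoded = pd.get_dummies(df, columns=['user', 'item'], prefix_sep='')

# Split off the target and wrap the dummies as a sparse CSC matrix.
X = sp.csc_matrix(encoded.drop('affinity', axis=1).values.astype(float))
y = encoded.affinity.values
print(X.shape)  # (4, 8)
```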