How can I use scikit-learn's DictVectorizer to get an encoded DataFrame from a dense DataFrame in Python?

I have a DataFrame that looks like this:

   user  item  affinity
0     1    13       0.1
1     2    11       0.4
2     3    14       0.9
3     4    12       1.0

From it, I want to create an encoded dataset (for fastFM) like this:

  user1 user2 user3 user4 item11 item12 item13 item14 affinity
    1     0     0     0     0      0      1      0       0.1
    0     1     0     0     1      0      0      0       0.4
    0     0     1     0     0      0      0      1       0.9
    0     0     0     1     0      1      0      0       1.0

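Do I need sklearn's DictVectorizer? If so, is there a way to convert the original DataFrame into a dictionary that can be fed to DictVectorizer, which in turn would give me the encoded dataset shown above?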

Solution:

You can use get_dummies together with concat. If the values in the user or item column are numeric, cast them to strings with astype first:

import pandas as pd

df = pd.DataFrame({'item': {0: 13, 1: 11, 2: 14, 3: 12}, 
                   'affinity': {0: 0.1, 1: 0.4, 2: 0.9, 3: 1.0},
                   'user': {0: 1, 1: 2, 2: 3, 3: 4}},
                    columns=['user','item','affinity'])
print(df)
   user  item  affinity
0     1    13       0.1
1     2    11       0.4
2     3    14       0.9
3     4    12       1.0

# One indicator column per distinct user id, named user<id>
df1 = df['user'].astype(str).str.get_dummies()
df1.columns = ['user' + str(x) for x in df1.columns]
print(df1)
   user1  user2  user3  user4
0      1      0      0      0
1      0      1      0      0
2      0      0      1      0
3      0      0      0      1

# Same for items; bracket access avoids clashing with an attribute named item
df2 = df['item'].astype(str).str.get_dummies()
df2.columns = ['item' + str(x) for x in df2.columns]
print(df2)
   item11  item12  item13  item14
0       0       0       1       0
1       1       0       0       0
2       0       0       0       1
3       0       1       0       0

print(pd.concat([df1,df2, df.affinity], axis=1))
   user1  user2  user3  user4  item11  item12  item13  item14  affinity
0      1      0      0      0       0       0       1       0       0.1
1      0      1      0      0       1       0       0       0       0.4
2      0      0      1      0       0       0       0       1       0.9
3      0      0      0      1       0       1       0       0       1.0
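
For completeness, the DictVectorizer route that the question asks about could look roughly like the sketch below. It assumes scikit-learn is installed and that the ids should be treated as categories, which is why they are cast to strings first. The variable names (records, vec, names, encoded) are only illustrative. DictVectorizer sorts its feature names, so the item columns come out before the user columns.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

df = pd.DataFrame({'user': [1, 2, 3, 4],
                   'item': [13, 11, 14, 12],
                   'affinity': [0.1, 0.4, 0.9, 1.0]})

# Cast the ids to strings so DictVectorizer one-hot encodes them
# instead of passing them through as numeric feature values.
records = df[['user', 'item']].astype(str).to_dict(orient='records')

vec = DictVectorizer(sparse=True)  # fit_transform returns a scipy.sparse matrix
X = vec.fit_transform(records)
y = df['affinity'].to_numpy()

# Feature names look like 'user=1' and 'item=13'; drop the '=' separator.
names = [name.replace('=', '') for name in vec.feature_names_]

# Dense DataFrame only for inspection; X itself stays sparse.
encoded = pd.DataFrame(X.toarray().astype(int), columns=names)
encoded['affinity'] = y
print(encoded)
```

Here X is the sparse feature matrix and y the target vector, so the dense encoded frame is only needed if you want to look at the result.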

Timings:

len(df)= 4:

In [49]: %timeit pd.concat([df1,df2, df.affinity], axis=1)
The slowest run took 4.91 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 690 µs per loop

len(df)= 40:

df = pd.concat([df]*10).reset_index(drop=True)

In [51]: %timeit pd.concat([df1,df2, df.affinity], axis=1)
The slowest run took 5.56 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 719 µs per loop

len(df)= 400:

df = pd.concat([df]*100).reset_index(drop=True)

In [43]: %timeit pd.concat([df1,df2, df.affinity], axis=1)
The slowest run took 4.55 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 748 µs per loop

len(df)= 4k:

df = pd.concat([df]*1000).reset_index(drop=True)

In [41]: %timeit pd.concat([df1,df2, df.affinity], axis=1)
The slowest run took 4.67 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 761 µs per loop

len(df)= 40k:

df = pd.concat([df]*10000).reset_index(drop=True)

%timeit pd.concat([df1,df2, df.affinity], axis=1)
1000 loops, best of 3: 1.83 ms per loop

len(df)= 400k:

df = pd.concat([df]*100000).reset_index(drop=True)

%timeit pd.concat([df1,df2, df.affinity], axis=1)
100 loops, best of 3: 15.6 ms per loop
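
The timings above only cover the final concat. A minimal sketch for timing the whole encoding outside IPython, using the standard timeit module, might look like this. The encode helper and the chosen sizes are assumptions for illustration, and the numbers it prints will differ from machine to machine.

```python
import timeit

import pandas as pd


def encode(df):
    """One-hot encode user and item, keeping affinity as the last column."""
    users = df['user'].astype(str).str.get_dummies().add_prefix('user')
    items = df['item'].astype(str).str.get_dummies().add_prefix('item')
    return pd.concat([users, items, df['affinity']], axis=1)


base = pd.DataFrame({'user': [1, 2, 3, 4],
                     'item': [13, 11, 14, 12],
                     'affinity': [0.1, 0.4, 0.9, 1.0]})

results = {}
for repeat in (1, 10, 100, 1000):
    # Rebuild the frame from the 4-row base each time so the sizes
    # are 4, 40, 400 and 4000 rows.
    df = pd.concat([base] * repeat).reset_index(drop=True)
    # Average seconds per call over 5 runs, dummies included.
    seconds = timeit.timeit(lambda: encode(df), number=5) / 5
    results[len(df)] = seconds
    print(len(df), 'rows:', round(seconds * 1e3, 3), 'ms per call')
```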