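- The training and test data are each split across two tables (transaction and identity). A count shows that only a minority of the TransactionID values in train_transaction have a matching row in train_identity.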
# Count how many TransactionIDs in each transaction table also appear in the corresponding identity table
print(np.sum(train_transaction['TransactionID'].isin(train_identity['TransactionID'].unique())))
print(np.sum(test_transaction['TransactionID'].isin(test_identity['TransactionID'].unique())))
Output:
24.4% of TransactionIDs in train (144233 / 590540) have an associated train_identity.
28.0% of TransactionIDs in test (141907 / 506691) have an associated test_identity.
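The snippet above prints raw counts only, so the percentages have to be derived separately. A minimal sketch of that calculation follows; the helper name `identity_coverage` and the toy frames are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd


def identity_coverage(transaction_df, identity_df, key="TransactionID"):
    """Return (matched, total, percent) for transaction rows whose key
    also appears in the identity table. Hypothetical helper."""
    matched = int(transaction_df[key].isin(identity_df[key].unique()).sum())
    total = len(transaction_df)
    pct = 100.0 * matched / total if total else 0.0
    return matched, total, pct


# Toy frames standing in for train_transaction / train_identity
toy_transaction = pd.DataFrame({"TransactionID": [1, 2, 3, 4]})
toy_identity = pd.DataFrame({"TransactionID": [2, 4, 4]})

matched, total, pct = identity_coverage(toy_transaction, toy_identity)
print(f"{pct:.1f}% of TransactionIDs ({matched} / {total}) have an identity row.")
```

On the real data the same call would be `identity_coverage(train_transaction, train_identity)`, assuming both frames are already loaded.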
- TransactionDT is a time-related feature, and the TransactionDT ranges of train_transaction and test_transaction do not overlap.
import matplotlib.pyplot as plt

# Overlay histograms of TransactionDT for the train and test sets
train_transaction['TransactionDT'].plot(kind='hist',
                                        figsize=(15, 5),
                                        label='train',
                                        bins=50,
                                        title='Train vs Test TransactionDT distribution')
test_transaction['TransactionDT'].plot(kind='hist',
                                       label='test',
                                       bins=50)
plt.legend()
plt.show()
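The histogram shows the separation visually; a numeric check on the minimum and maximum values can confirm it. The sketch below is one way to do that, with the helper name `time_ranges_overlap` and the toy values being assumptions for illustration.

```python
import pandas as pd


def time_ranges_overlap(train_dt, test_dt):
    """Return True if the [min, max] intervals of the two Series intersect.
    Hypothetical helper, not from the original notebook."""
    return bool(train_dt.min() <= test_dt.max() and test_dt.min() <= train_dt.max())


# Toy values standing in for the two TransactionDT columns
toy_train = pd.Series([86400, 90000, 150000], name="TransactionDT")
toy_test = pd.Series([200000, 250000], name="TransactionDT")

print(time_ranges_overlap(toy_train, toy_test))
```

With the real tables the call would be `time_ranges_overlap(train_transaction['TransactionDT'], test_transaction['TransactionDT'])`; a result of `False` matches the claim that the two sets cover disjoint time spans.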
- Categorical Features - Transaction
  - ProductCD
  - card1 - card6
  - addr1, addr2
  - P_emaildomain
  - R_emaildomain
  - M1 - M9
- Categorical Features - Identity
  - DeviceType
  - DeviceInfo
  - id_12 - id_38
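The column lists above can be expanded into explicit names and used to cast the matching columns to the pandas `category` dtype. This is a rough sketch, assuming the ranges are inclusive and that names follow the patterns `card1`..`card6`, `M1`..`M9`, and `id_12`..`id_38`; only the P_emaildomain and R_emaildomain email columns are included. Both helper names are hypothetical.

```python
import pandas as pd


def categorical_columns():
    """Expand the documented ranges into explicit column names (assumed inclusive)."""
    transaction = (
        ["ProductCD"]
        + [f"card{i}" for i in range(1, 7)]
        + ["addr1", "addr2", "P_emaildomain", "R_emaildomain"]
        + [f"M{i}" for i in range(1, 10)]
    )
    identity = ["DeviceType", "DeviceInfo"] + [f"id_{i}" for i in range(12, 39)]
    return transaction, identity


def cast_categoricals(df, columns):
    """Return a copy of df with the listed columns that exist cast to 'category'."""
    present = [c for c in columns if c in df.columns]
    if not present:
        return df.copy()
    return df.astype({c: "category" for c in present})


# Toy frame standing in for train_transaction
toy = pd.DataFrame(
    {"ProductCD": ["W", "C"], "card1": [1, 2], "TransactionAmt": [1.0, 2.0]}
)
transaction_cols, identity_cols = categorical_columns()
print(cast_categoricals(toy, transaction_cols).dtypes)
```

Columns missing from the frame are skipped, so the same lists can be applied to the transaction and identity tables separately or to a merged frame.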