ML FE: Preprocessing the RentListingInquries dataset with feature engineering (FE) and exporting it in three file formats (CSV, TXT, and libsvm sparse TXT)
Output results
1.1、RentListingInquries_FE_train.csv
1.2、RentListingInquries_FE_test.csv
2.1、RentListingInquries_FE_train.txt
2.2、RentListingInquries_FE_test.txt
Code output
y_train after initial processing:
10        1
10000     2
100004    0
100007    2
100013    2
100014    1
100016    2
100020    2
100026    1
100027    2
100030    2
10004     2
100044    0
100048    2
10005     2
100051    1
100052    2
100053    2
100055    2
100058    2
100062    2
100063    1
100065    2
100066    2
10007     1
100071    2
100075    1
100076    2
100079    0
100081    2
         ..
99956     2
99960     1
99961     2
99964     1
99965     2
99966     2
99979     2
99980     2
99982     0
99984     2
99986     2
99987     2
99988     1
9999      1
99991     2
99992     2
99993     2
99994     2
Name: interest_level, Length: 49352, dtype: int64
train_test_sparse after final processing:
  (0, 0)	1.5
  (0, 1)	3.0
  (0, 2)	40.7145
  (0, 3)	-73.9425
  (0, 4)	3000.0
  (0, 5)	1200.0
  (0, 6)	750.0
  (0, 7)	-1.5
  (0, 8)	4.5
  (0, 9)	2016.0
  (0, 10)	6.0
  (0, 11)	24.0
  (0, 12)	4.0
  (0, 13)	176.0
  (0, 14)	7.0
  (0, 15)	95.0
  (0, 17)	1.0
  (0, 18)	1.0
  (0, 19)	1.0
  (0, 20)	1.0
  (0, 21)	1.0
  (0, 22)	1.0
  (0, 23)	1.0
  (0, 24)	1.0
  (0, 32)	1.0
  :	:
  (124010, 29)	1.0
  (124010, 30)	1.0
  (124010, 31)	1.0
  (124010, 32)	1.0
  (124010, 33)	1.0
  (124010, 34)	2.0
  (124010, 35)	12.0
  (124010, 36)	0.04446034405901145
  (124010, 37)	0.010558720013101165
  (124010, 38)	0.030099750139483926
  (124010, 39)	0.9593415298474148
  (124010, 40)	0.21662478672029833
  (124010, 41)	0.0020547768050611895
  (124010, 42)	0.7813204364746404
  (124010, 43)	0.12333335451008201
  (124010, 44)	1.2281905750572492e-07
  (124010, 45)	0.8766665226708605
  (124010, 46)	0.0004487893042226658
  (124010, 47)	0.001303620464837077
  (124010, 48)	0.9982475902309401
  (124010, 49)	3.0
  (124010, 58)	2.0
  (124010, 83)	1.0
  (124010, 107)	1.0
  (124010, 114)	1.0
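The `(row, col) value` listing above is how SciPy prints a sparse matrix: only the non-zero entries are shown. The post does not show how train_test_sparse was assembled, so the following is only a minimal sketch under the assumption that the dense engineered columns are stacked next to the CountVectorizer counts with `scipy.sparse.hstack`. The toy values and the names `dense`, `counts`, and `combined` are hypothetical.

```python
import numpy as np
from scipy import sparse

# Toy dense features (hypothetical values): 2 rows x 3 columns
dense = np.array([[1.5, 3.0, 0.0],
                  [1.0, 0.0, 2.0]])

# Toy word counts, standing in for a CountVectorizer result: 2 rows x 2 columns
counts = sparse.csr_matrix(np.array([[0, 1],
                                     [1, 0]]))

# Stack the two blocks column-wise and convert to CSR for row slicing
combined = sparse.hstack([sparse.csr_matrix(dense), counts]).tocsr()

# Printing lists only the stored non-zero entries as (row, col) value
print(combined)
```

Zeros are never stored, which is why column indices can skip in the printed listing (for example, row 0 above has no entry for column 16).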
Design approach
To be updated…
Core code
To be updated…
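from sklearn.feature_extraction.text import CountVectorizer

# Count how many feature tags each listing has
train_test['features_count'] = train_test['features'].apply(len)
# Join each listing's feature list into one string so it can be vectorized
train_test['features2'] = train_test['features'].apply(lambda x: ' '.join(x))
# Bag-of-words over at most 200 unigrams, ignoring English stop words
c_vect = CountVectorizer(stop_words='english', max_features=200, ngram_range=(1, 1))
c_vect_sparse = c_vect.fit_transform(train_test['features2'])
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
c_vect_sparse_cols = list(c_vect.get_feature_names_out())
train_test.drop(['features', 'features2'], axis=1, inplace=True)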
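To make the vectorization step concrete, here is a small self-contained sketch of the same idea on a toy DataFrame. The three listings and the names `toy`, `vect`, `counts`, and `vocab` are made up for illustration and are not taken from the dataset.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical listings: each row holds a list of feature tags (may be empty)
toy = pd.DataFrame({'features': [['Doorman', 'Elevator'],
                                 ['Elevator', 'Laundry in Building'],
                                 []]})

# Number of tags per listing
toy['features_count'] = toy['features'].apply(len)
# Join the tags into one string per listing
toy['features2'] = toy['features'].apply(' '.join)

# Same vectorizer settings as in the post
vect = CountVectorizer(stop_words='english', max_features=200, ngram_range=(1, 1))
counts = vect.fit_transform(toy['features2'])

# vocabulary_ maps each kept token to its column index in counts
vocab = sorted(vect.vocabulary_)
print(vocab)
print(counts.toarray())
```

Note that multi-word tags such as 'Laundry in Building' are split into separate tokens, and the stop word 'in' is dropped, so a tag does not map to a single column.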
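import numpy as np
from sklearn.datasets import dump_svmlight_file
# The signature is dump_svmlight_file(X, y, f): feature matrix, then labels, then output path
dump_svmlight_file(X_train_sparse, y_train, dpath + 'RentListingInquries_FE_train_libsvm.txt')
# y is required; assuming the test set has no labels, write placeholder zeros
y_test_placeholder = np.zeros(X_test_sparse.shape[0])
dump_svmlight_file(X_test_sparse, y_test_placeholder, dpath + 'RentListingInquries_FE_test_libsvm.txt')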
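As a quick sanity check on the libsvm export, the file can be written and read back with `load_svmlight_file`. The sketch below uses a tiny hypothetical matrix and a temporary directory instead of the real `dpath`, so the names `X`, `y`, and `toy_libsvm.txt` are illustrative only.

```python
import os
import tempfile

import numpy as np
from scipy import sparse
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

# Toy sparse features and integer labels (hypothetical values)
X = sparse.csr_matrix(np.array([[1.5, 0.0, 3.0],
                                [0.0, 2.0, 0.0]]))
y = np.array([2, 0])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_libsvm.txt')
    # Argument order: features, labels, file path
    dump_svmlight_file(X, y, path)

    # Each line has the form "<label> <index>:<value> ..." with zeros omitted
    with open(path) as fh:
        first_line = fh.readline().strip()

    # Pass n_features so trailing all-zero columns are not lost on load
    X_back, y_back = load_svmlight_file(path, n_features=X.shape[1])

print(first_line)
print(np.allclose(X_back.toarray(), X.toarray()))
```

Passing `n_features` on load matters when train and test files are read separately, since each file alone only reveals the highest column index it happens to contain.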