Categorical Variables: Study Notes

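Preface
I. Why categorical variables need handling: most models only accept numeric input, so the existing data is preprocessed before training so that the trained model performs better.

II. What a categorical variable is: as the name suggests, a categorical variable takes its value from a limited set of named categories rather than from a numeric scale. Examples: male, female; or automotive manufacturing, handicrafts, agriculture.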
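To make the definition concrete, here is a minimal sketch of how such columns might look in pandas. The DataFrame, its column names, and its values are made up for illustration; the sketch simply treats every non-numeric column as categorical, which is an assumption that holds for this toy data but not for every dataset.

```python
import pandas as pd

# Toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Industry": ["agriculture", "handicrafts", "automotive", "agriculture"],
    "Income": [30.5, 42.0, 55.1, 28.3],
})

# Treat every non-numeric column as categorical (an assumption for this sketch).
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# Cardinality: the number of distinct categories in each categorical column.
cardinality = {col: int(df[col].nunique()) for col in categorical_cols}

print(categorical_cols)
print(cardinality)
```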

This post follows the Kaggle tutorial and covers three approaches.

1. Drop categorical variables

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

pd.set_option('display.max_columns', None)

# Function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)


# Read the data
data = pd.read_csv('../machineLearning/temp/archive/melb_data.csv')

# Separate target from predictors
y = data.Price
X = data.drop(['Price'], axis=1)

# Divide data into training and validation subsets
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                                random_state=0)
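# Drop columns with missing values (reassign instead of inplace=True to avoid
# modifying a slice returned by train_test_split)
cols_with_missing = [col for col in X_train_full.columns if X_train_full[col].isnull().any()]
X_train_full = X_train_full.drop(cols_with_missing, axis=1)
X_valid_full = X_valid_full.drop(cols_with_missing, axis=1)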

# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
low_cardinality_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and
                        X_train_full[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]
# Keep selected columns only
my_cols = low_cardinality_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()

# Find the object (categorical) columns
s = (X_train.dtypes == 'object')
object_cols = list(s[s].index)


print("Categorical variables:")
print(object_cols)

# Drop the categorical variables outright
drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])

print("MAE from Approach 1 (Drop categorical variables):")
print(score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))
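The script above depends on a local copy of melb_data.csv. For readers without that file, the following is a minimal, self-contained sketch of the same drop-the-categorical-columns approach on synthetic data. Everything about the data (the column names, the price formula, the noise level) is invented for illustration, so the MAE it prints says nothing about the Melbourne housing data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data; all names and values are made up.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Rooms": rng.integers(1, 6, size=n),
    "Distance": rng.uniform(1.0, 30.0, size=n),
    "Type": rng.choice(["h", "u", "t"], size=n),
})
price = 100.0 * toy["Rooms"] - 5.0 * toy["Distance"] + rng.normal(0.0, 10.0, size=n)

X_tr, X_va, y_tr, y_va = train_test_split(toy, price, train_size=0.8, random_state=0)

# Approach 1: keep only the numeric columns, discarding "Type".
X_tr_num = X_tr.select_dtypes(include="number")
X_va_num = X_va.select_dtypes(include="number")

# Same model settings as score_dataset in the script above.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_tr_num, y_tr)
toy_mae = mean_absolute_error(y_va, model.predict(X_va_num))

print("MAE on synthetic data (categorical column dropped):", toy_mae)
```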

2.
