Chapter 2, Section 1: Data Cleaning and Feature Processing (Course)

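[Review & Introduction] In the previous chapter we mainly reviewed the basics, introduced some common data analysis operations, and looked at the data from several angles. In this chapter we work through the data analysis workflow: data cleaning, feature processing, data restructuring, and data visualization. These steps lay the groundwork for the final modeling and model evaluation.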

Before we start, import the numpy and pandas packages and load the data.

# Load the required libraries
import numpy as np
import pandas as pd
# Load the data from train.csv
pd_train_csv=pd.read_csv('./train.csv')
pd_train_csv
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 12 columns

2 Chapter 2: Data Cleaning and Feature Processing

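The data we receive is usually not clean: it contains missing values, outliers, and so on, and needs some processing before any further analysis or modeling. The first step after getting the data is therefore data cleaning. In this chapter we cover missing values, duplicate values, strings, and data conversion, cleaning the data into a form that can be analyzed or modeled.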

2.1 Observing and Handling Missing Values

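The data we receive often has many missing values. For example, we can see that the Cabin column contains NaN. Do other columns have missing values too, and how should they be handled?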

2.1.1 Task 1: Observing missing values

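(1) Check the number of missing values in each feature
(2) Look at the data in the Age, Cabin, and Embarked columns
Each of these can be done in several ways, so try as many as you can.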
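Task (1) can be approached in more than one way. The sketch below is only an illustration on a hypothetical four-row table (the name toy and its values are made up, not taken from train.csv); it shows isnull().sum() for per-column counts, isnull().mean() for the missing share, and info() for a combined summary.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Titanic table (not the real train.csv)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", None, "S"],
})

# Number of missing values in each column
missing_counts = toy.isnull().sum()
print(missing_counts)

# Fraction of missing values in each column
missing_share = toy.isnull().mean()
print(missing_share)

# info() lists the non-null count and dtype of every column
toy.info()

# Several columns can be viewed at once by passing a list of names
subset = toy[["Age", "Cabin", "Embarked"]]
print(subset)
```

The same calls should work unchanged on pd_train_csv.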

# Write your code here
pd_train_csv['Age']


0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64
# Write your code here
pd_train_csv.Cabin


0       NaN
1       C85
2       NaN
3      C123
4       NaN
       ... 
886     NaN
887     B42
888     NaN
889    C148
890     NaN
Name: Cabin, Length: 891, dtype: object
# Write your code here: the general form of the loc indexer is loc[*, *], where the first * selects rows and the second * selects columns
pd_train_csv.loc[:,'Embarked']


0      S
1      C
2      S
3      S
4      S
      ..
886    S
887    S
888    S
889    C
890    Q
Name: Embarked, Length: 891, dtype: object
# loc is the label-based indexer; iloc is the position-based indexer
pd_train_csv.iloc[:,5]
0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64
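
The difference between loc and iloc is easiest to see when the index labels do not match the row positions. A minimal sketch, assuming a hypothetical three-row table with the labels 10, 20, 30 (not part of the course data):

```python
import pandas as pd

# Hypothetical table whose index labels deliberately differ from positions
toy = pd.DataFrame(
    {"Name": ["A", "B", "C"], "Age": [22.0, 38.0, 26.0]},
    index=[10, 20, 30],
)

by_label = toy.loc[20, "Age"]    # row labelled 20, column labelled "Age"
by_position = toy.iloc[1, 1]     # second row, second column

# Selecting the whole Age column either way gives the same Series
age_col_label = toy.loc[:, "Age"]
age_col_pos = toy.iloc[:, 1]

print(by_label, by_position)
```

In pd_train_csv the default index runs from 0, so labels and positions happen to coincide for rows, which is why both indexers return the same Age column above.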

2.1.2 Task 2: Handling missing values

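(1) What are the usual approaches to handling missing values?

(2) Try handling the missing values in the Age column

(3) Try using different methods to handle the missing values across the whole table directly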

# General approaches to handling missing values:
# Hint: useful functions include dropna and fillna
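
A few common approaches, shown as a rough sketch on a hypothetical five-row table (toy and its values are invented for illustration): dropping rows, filling a numeric column with its median, filling a categorical column with its most frequent value, and filling a whole table from neighbouring rows. Which one is appropriate depends on the data; none of these is presented as the course's intended answer.

```python
import numpy as np
import pandas as pd

# Hypothetical data, not the real train.csv
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan, 40.0],
    "Embarked": ["S", "C", None, "S", "S"],
})

# 1) Drop the rows where Age is missing
dropped = toy.dropna(subset=["Age"])

# 2) Fill Age with a summary statistic such as the median
filled_median = toy["Age"].fillna(toy["Age"].median())

# 3) Fill a categorical column with its most frequent value
filled_mode = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

# 4) Whole table: copy the previous row's value, then the next row's
#    value for anything still missing at the top
filled_table = toy.ffill().bfill()

print(filled_table)
```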



# Write your code here
pd_train_csv[pd_train_csv['Age'].isnull()]



PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
17 18 1 2 Williams, Mr. Charles Eugene male NaN 0 0 244373 13.0000 NaN S
19 20 1 3 Masselmani, Mrs. Fatima female NaN 0 0 2649 7.2250 NaN C
26 27 0 3 Emir, Mr. Farred Chehab male NaN 0 0 2631 7.2250 NaN C
28 29 1 3 O'Dwyer, Miss. Ellen "Nellie" female NaN 0 0 330959 7.8792 NaN Q
... ... ... ... ... ... ... ... ... ... ... ... ...
859 860 0 3 Razi, Mr. Raihed male NaN 0 0 2629 7.2292 NaN C
863 864 0 3 Sage, Miss. Dorothy Edith "Dolly" female NaN 8 2 CA. 2343 69.5500 NaN S
868 869 0 3 van Melkebeke, Mr. Philemon male NaN 0 0 345777 9.5000 NaN S
878 879 0 3 Laleff, Mr. Kristo male NaN 0 0 349217 7.8958 NaN S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S

177 rows × 12 columns

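# Write your code here: .dropna removes the rows/columns that contain missing values
# Note: each call below starts again from pd_train_csv, so df ends up holding only
# the result of the last one (rows with a missing Embarked removed)
df=pd_train_csv.dropna(axis=0,how='any',subset=['Age'])
df=pd_train_csv.dropna(axis=0,how='any',subset=['Sex'])
df=pd_train_csv.dropna(axis=0,how='any',subset=['Embarked'])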

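# Write your code here: fill the remaining missing values
# Forward fill, then backward fill for gaps at the top; same result as
# fillna(method='ffill') / fillna(method='bfill'), which recent pandas releases deprecate
pd_train_csv=df.ffill()

pd_train_csv=pd_train_csv.bfill()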
pd_train_csv


PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 C85 S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 C85 S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 C123 S
... ... ... ... ... ... ... ... ... ... ... ... ...
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 C50 S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female 19.0 1 2 W./C. 6607 23.4500 B42 S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 C148 Q

889 rows × 12 columns

[Think 1] What parameters do dropna and fillna take, and how is each one used?

[Think] When searching for missing values, which is better: np.nan, None, or .isnull()? Why? If one of these fails to find the missing values, what is the reason?

# Your answer
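
One way to explore the question is to compare the three approaches on a tiny Series. The sketch below uses made-up values; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# None is stored as NaN once it sits in a float Series
ages = pd.Series([22.0, np.nan, None, 35.0])

# NaN is not equal to anything, itself included, so == finds nothing
eq_matches = int((ages == np.nan).sum())

# isnull() (an alias of isna()) detects both NaN and None
null_matches = int(ages.isnull().sum())

# A missing value stored as the text "NaN" is an ordinary string,
# so isnull() does not report it
cabin = pd.Series(["C85", "NaN", None])
text_nulls = int(cabin.isnull().sum())

print(eq_matches, null_matches, text_nulls)
```

This suggests why .isnull() is usually the safer choice, and why a search can still miss values that were written to the file as placeholder text.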



[Reference] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html

[Reference] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html

2.2 Observing and Handling Duplicate Values

For one reason or another, the data may contain duplicate values. Does it, and if so, how should they be handled?

2.2.1 Task 1: Check the data for duplicate values

# Write your code here
# duplicated() already returns a boolean mask, so no comparison with True is needed
pd_train_csv[pd_train_csv.duplicated()]


PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked

2.2.2 Task 2: Handling duplicate values

(1) What ways are there to handle duplicate values?

(2) Handle the duplicate values in our data

The more methods you try, the better.
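
As a rough illustration of the options, the sketch below runs duplicated() and drop_duplicates() on a hypothetical four-row table (invented values) and varies the keep and subset parameters.

```python
import pandas as pd

# Hypothetical data: row 2 is an exact copy of row 0
toy = pd.DataFrame({
    "Name": ["A", "B", "A", "C"],
    "Fare": [7.25, 8.05, 7.25, 7.25],
})

# Count fully duplicated rows (later copies only)
n_dup = int(toy.duplicated().sum())

# Default keep='first': one copy of each duplicated row survives
keep_first = toy.drop_duplicates()

# keep=False: every copy of a duplicated row is removed
keep_none = toy.drop_duplicates(keep=False)

# subset: judge duplicates on selected columns only
by_fare = toy.drop_duplicates(subset=["Fare"])

print(n_dup, len(keep_first), len(keep_none), len(by_fare))
```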

# Ways to handle duplicate values:

pd_train_csv.duplicated().sum()

0
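# Write your code here: .drop_duplicates() removes duplicate rows
# keep=False drops every copy of a duplicated row; the default keep='first' retains one copy
pd_train_csv = pd_train_csv.drop_duplicates(keep=False)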
pd_train_csv.head(200)

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 C85 S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 C85 S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 C123 S
... ... ... ... ... ... ... ... ... ... ... ... ...
196 197 0 3 Mernagh, Mr. Robert male 58.0 0 0 368703 7.7500 B80 Q
197 198 0 3 Olsen, Mr. Karl Siegwart Andreas male 42.0 0 1 4579 8.4042 B80 S
198 199 1 3 Madigan, Miss. Margaret "Maggie" female 42.0 0 0 370370 7.7500 B80 Q
199 200 0 2 Yrois, Miss. Henriette ("Mrs Harbeck") female 24.0 0 0 248747 13.0000 B80 S
200 201 0 3 Vande Walle, Mr. Nestor Cyriel male 28.0 0 0 345770 9.5000 B80 S

200 rows × 12 columns

2.2.3 Task 3: Save the cleaned data in CSV format

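# Write your code here
# Note: the row index is written as well (pass index=False to skip it),
# so it comes back as an 'Unnamed: 0' column when the file is read again

pd_train_csv.to_csv('./pd_train_clear.csv')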
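
The effect of the index on a CSV round trip can be checked without touching the disk. This is a small sketch using an in-memory buffer and a hypothetical two-row table; it is meant only to show where an 'Unnamed: 0' column comes from.

```python
import io
import pandas as pd

# Hypothetical two-row table
toy = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1]})

# Default: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))  # ['Unnamed: 0', 'PassengerId', 'Survived']
print(list(reloaded_clean.columns))       # ['PassengerId', 'Survived']
```

An alternative is to keep the default when saving and pass index_col=0 to pd.read_csv when loading.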

2.3 Observing and Processing Features

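Looking at the features, we can roughly divide them into two groups:
Numeric features: Survived, Pclass, Age, SibSp, Parch, Fare. Of these, Survived and Pclass are discrete numeric features, while Age, SibSp, Parch, and Fare are treated as continuous numeric features (SibSp and Parch are strictly integer counts).
Text features: Name, Sex, Cabin, Embarked, Ticket. Of these, Sex, Cabin, Embarked, and Ticket are categorical text features.
Numeric features can generally be used directly for model training, although continuous variables are sometimes discretized to make the model more stable and robust. Text features usually need to be converted into numeric features before they can be used for modeling.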

2.3.1 Task 1: Binning (discretizing) age

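(1) What is binning?

(2) Split the continuous variable Age into 5 equal-width age bins, labeled with the categories 1 to 5

(3) Split the continuous variable Age into the five age bins [0,5) [5,15) [15,30) [30,50) [50,80), labeled with the categories 1 to 5

(4) Split the continuous variable Age into five bins at the 10%, 30%, 50%, 70%, and 90% quantiles, labeled with the categories 1 to 5

(5) Save each of the results above in CSV format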
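
Binning groups the values of a continuous variable into a small number of intervals. The three variants asked for above can be sketched on a short hypothetical list of ages (the values are chosen to sit on the bin edges and are not from the dataset): pd.cut with a bin count gives equal-width bins, pd.cut with explicit edges gives custom bins, and pd.qcut cuts at quantiles.

```python
import pandas as pd

# Hypothetical ages, picked to land on the custom bin edges
ages = pd.Series([2, 5, 14, 15, 29, 30, 49, 50, 79])

# Equal-width bins: the range of the data is split into 5 equal intervals
equal_width = pd.cut(ages, bins=5, labels=[1, 2, 3, 4, 5])

# Custom edges; right=False makes each interval left-closed, e.g. [0, 5)
custom = pd.cut(ages, bins=[0, 5, 15, 30, 50, 80], right=False,
                labels=[1, 2, 3, 4, 5])

# Quantile edges; values above the 90% quantile lie outside the last
# edge and are therefore labelled NaN
by_quantile = pd.qcut(ages, q=[0, 0.1, 0.3, 0.5, 0.7, 0.9],
                      labels=[1, 2, 3, 4, 5])

print(pd.DataFrame({"Age": ages, "equal_width": equal_width,
                    "custom": custom, "by_quantile": by_quantile}))
```

Note that with right=False an age of exactly 80 would fall outside [50, 80) and become NaN, so the choice of closed side matters at the edges.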

# What is binning:
# Split the continuous variable Age into 5 equal-width age bins, labeled with the categories 1 to 5
pd_train_clear_csv=pd.read_csv('pd_train_clear.csv')
pd_train_clear_csv['cut']=pd.cut(pd_train_clear_csv['Age'],bins=5,labels=[1,2,3,4,5])
pd_train_clear_csv.head(20)
Unnamed: 0 PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked cut
0 0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 C85 S 2
1 1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C 3
2 2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 C85 S 2
3 3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S 3
4 4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 C123 S 3
5 5 6 0 3 Moran, Mr. James male 35.0 0 0 330877 8.4583 C123 Q 3
6 6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S 4
7 7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 E46 S 1
8 8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 E46 S 2
9 9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 E46 C 1
10 10 11 1 3 Sandstrom, Miss. Marguerite Rut female 4.0 1 1 PP 9549 16.7000 G6 S 1
11 11 12 1 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.5500 C103 S 4
12 12 13 0 3 Saundercock, Mr. William Henry male 20.0 0 0 A/5. 2151 8.0500 C103 S 2
13 13 14 0 3 Andersson, Mr. Anders Johan male 39.0 1 5 347082 31.2750 C103 S 3
14 14 15 0 3 Vestrom, Miss. Hulda Amanda Adolfina female 14.0 0 0 350406 7.8542 C103 S 1
15 15 16 1 2 Hewlett, Mrs. (Mary D Kingcome) female 55.0 0 0 248706 16.0000 C103 S 4
16 16 17 0 3 Rice, Master. Eugene male 2.0 4 1 382652 29.1250 C103 Q 1
17 17 18 1 2 Williams, Mr. Charles Eugene male 2.0 0 0 244373 13.0000 C103 S 1
18 18 19 0 3 Vander Planke, Mrs. Julius (Emelia Maria Vande... female 31.0 1 0 345763 18.0000 C103 S 2
19 19 20 1 3 Masselmani, Mrs. Fatima female 31.0 0 0 2649 7.2250 C103 C 2
# Write your code here: split the continuous variable Age into the age bins [0,5) [5,15) [15,30) [30,50) [50,80), labeled 1 to 5
# Note: pd.cut closes intervals on the right by default, so these edges give (0,5] (5,15] ... (50,80];
# pass right=False to get the left-closed intervals listed in the task
pd_train_clear_csv['cut1']=pd.cut(pd_train_clear_csv['Age'],bins=[0,5,15,30,50,80],labels=[1,2,3,4,5])
pd_train_clear_csv.head(20)
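# Write your code here: split the continuous variable Age at the 10%, 30%, 50%, 70%, and 90% quantiles into five bins, labeled 1 to 5
# q is the list of quantile edges; ages above the 90% quantile fall outside the last edge and get NaN
pd_train_clear_csv['cut2']=pd.qcut(pd_train_clear_csv['Age'],q=[0,0.1,0.3,0.5,0.7,0.9],labels=[1,2,3,4,5])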
pd_train_clear_csv.head(20)


Unnamed: 0 PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked cut cut1 cut2
0 0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 C85 S 2 3 2
1 1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C 3 4 5
2 2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 C85 S 2 3 3
3 3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S 3 4 4
4 4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 C123 S 3 4 4
5 5 6 0 3 Moran, Mr. James male 35.0 0 0 330877 8.4583 C123 Q 3 4 4
6 6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S 4 5 NaN
7 7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 E46 S 1 1 1
8 8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 E46 S 2 3 3
9 9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 E46 C 1 2 2
10 10 11 1 3 Sandstrom, Miss. Marguerite Rut female 4.0 1 1 PP 9549 16.7000 G6 S 1 1 1
11 11 12 1 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.5500 C103 S 4 5 NaN
12 12 13 0 3 Saundercock, Mr. William Henry male 20.0 0 0 A/5. 2151 8.0500 C103 S 2 3 2
13 13 14 0 3 Andersson, Mr. Anders Johan male 39.0 1 5 347082 31.2750 C103 S 3 4 5
14 14 15 0 3 Vestrom, Miss. Hulda Amanda Adolfina female 14.0 0 0 350406 7.8542 C103 S 1 2 2
15 15 16 1 2 Hewlett, Mrs. (Mary D Kingcome) female 55.0 0 0 248706 16.0000 C103 S 4 5 NaN
16 16 17 0 3 Rice, Master. Eugene male 2.0 4 1 382652 29.1250 C103 Q 1 1 1
17 17 18 1 2 Williams, Mr. Charles Eugene male 2.0 0 0 244373 13.0000 C103 S 1 1 1
18 18 19 0 3 Vander Planke, Mrs. Julius (Emelia Maria Vande... female 31.0 1 0 345763 18.0000 C103 S 2 4 4
19 19 20 1 3 Masselmani, Mrs. Fatima female 31.0 0 0 2649 7.2250 C103 C 2 4 4
# Write your code here
pd_train_clear_csv.to_csv('pd_train_clear_cut.csv')


[Reference] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html

[Reference] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html

2.3.2 Task 2: Converting text variables

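(1) Look at the names of the text variables and their categories
(2) Represent the text variables Sex, Cabin, and Embarked with numeric codes 1 to 5
(3) Represent the text variables Sex, Cabin, and Embarked with one-hot encoding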
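
For item (3), one possible approach is pd.get_dummies. The sketch below is an illustration on a hypothetical three-row table rather than the course's reference solution; it encodes Sex and Embarked, and Cabin could be added to the columns list in the same way.

```python
import pandas as pd

# Hypothetical three-row sample
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# One indicator column per category, named <prefix>_<category>
one_hot = pd.get_dummies(toy, columns=["Sex", "Embarked"],
                         prefix=["Sex", "Embarked"])

print(one_hot.columns.tolist())
print(one_hot.astype(int))
```

Each row has exactly one active indicator per original variable. A column with many distinct values, such as Cabin, produces one indicator column per value, so the table can become very wide.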

# Write your code here
pd_train_clear_csv=pd.read_csv('pd_train_clear_cut.csv')
pd_train_clear_csv['Sex'].value_counts()


male      577
female    312
Name: Sex, dtype: int64
  
        

# Write your code here
# pd_train_clear_csv['Cabin'].unique()
pd_train_csv['Cabin'].value_counts()


G6             24
B78            21
C78            20
C83            19
C23 C25 C27    19
               ..
E36             1
A5              1
F G63           1
D46             1
T               1
Name: Cabin, Length: 146, dtype: int64
# Write your code here: .nunique() returns the number of distinct values
# pd_train_csv['Embarked'].nunique()
pd_train_clear_csv['Cabin'].nunique()
# For a single row
# pd_train_csv.iloc[1,:].nunique()
146
# (2) Represent the text variables Sex, Cabin, and Embarked with numeric codes 1 to 5
# Method 1: replace
pd_train_clear_csv['Sex_num']=pd_train_clear_csv['Sex'].replace(['male','female'],[1,2])
pd_train_clear_csv
Unnamed: 0 Unnamed: 0.1 PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked cut cut1 Sex_num
0 0 0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 C85 S 2 3 1
1 1 1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C 3 4 2
2 2 2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 C85 S 2 3 2
3 3 3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S 3 4 2
4 4 4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 C123 S 3 4 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
884 884 886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 C50 S 2 3 1
885 885 887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S 2 3 2
886 886 888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female 19.0 1 2 W./C. 6607 23.4500 B42 S 2 3 2
887 887 889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C 2 3 1
888 888 890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 C148 Q 2 4 1

889 rows × 17 columns

# Method 2: map with a dictionary
pd_train_clear_csv['Embarked'].value_counts()
S    644
C    168
Q     77
Name: Embarked, dtype: int64
pd_train_clear_csv['Embarked_num']=pd_train_clear_csv['Embarked'].map({'S':1,'C':2,'Q':3})
pd_train_clear_csv
Unnamed: 0 Unnamed: 0.1 PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked cut cut1 Sex_num Embarked_num
0 0 0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 C85 S 2 3 1 1
1 1 1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C 3 4 2 2
2 2 2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 C85 S 2 3 2 1
3 3 3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S 3 4 2 1
4 4 4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 C123 S 3 4 1 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
884 884 886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 C50 S 2 3 1 1
885 885 887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S 2 3 2 1
886 886 888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female 19.0 1 2 W./C. 6607 23.4500 B42 S 2 3 2 1
887 887 889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C 2 3 1 2
888 888 890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 C148 Q 2 4 1 3

889 rows × 18 columns

# Method 3: LabelEncoder from sklearn.preprocessing
pd_train_clear_csv['Cabin'].value_counts()
G6             24
B78            21
C78            20
C83            19
C23 C25 C27    19
               ..
E36             1
A5              1
F G63           1
D46             1
T               1
Name: Cabin, Length: 146, dtype: int64
from sklearn.preprocessing import LabelEncoder
for feat in ['Cabin','Ticket']:
    lbl=LabelEncoder()
    # Manual alternative: number the categories in order of first appearance
    label_dict=dict(zip(pd_train_clear_csv[feat].unique(),range(pd_train_clear_csv[feat].nunique())))
    pd_train_clear_csv[feat+"_LabelEncoder_map"]=pd_train_clear_csv[feat].map(label_dict)
    # LabelEncoder numbers the categories in sorted order
    pd_train_clear_csv[feat+"_LabelEncoder"]=lbl.fit_transform(pd_train_clear_csv[feat].astype(str))

pd_train_clear_csv


Unnamed: 0 Unnamed: 0.1 PassengerId Survived Pclass Name Sex Age SibSp Parch ... Cabin Embarked cut cut1 Sex_num Embarked_num Cabin_LabelEncoder_map Cabin_LabelEncoder Ticket_LabelEncoder_map Ticket_LabelEncoder
0 0 0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 ... C85 S 2 3 1 1 0 80 0 522
1 1 1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 ... C85 C 3 4 2 2 0 80 1 595
2 2 2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 ... C85 S 2 3 2 1 0 80 2 668
3 3 3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 ... C123 S 3 4 2 1 1 54 3 48
4 4 4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 ... C123 S 3 4 1 1 1 54 4 471
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
884 884 886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 ... C50 S 2 3 1 1 143 69 676 100
885 885 887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 ... B42 S 2 3 2 1 144 29 677 14
886 886 888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female 19.0 1 2 ... B42 S 2 3 2 1 144 29 613 674
887 887 889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 ... C148 C 2 3 1 2 145 59 678 8
888 888 890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 ... C148 Q 2 4 1 3 145 59 679 465

889 rows × 22 columns

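# Notes:
# zip takes iterables as arguments and packs their corresponding elements into tuples; in Python 3 it returns an iterator over those tuples rather than a list.
# A zip object does not support len(), but it can be traversed with a for loop.
# .map performs a mapping; here it is given a dictionary.
# range produces a sequence of integers (a lazy range object in Python 3, not a list).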
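
The behaviour described in these notes can be checked with a few lines of plain Python. The cabin codes below are just sample strings.

```python
cabins = ["C85", "C123", "E46"]

# zip pairs up elements lazily and returns an iterator
pairs = zip(cabins, range(len(cabins)))

# An iterator has no length
try:
    len(pairs)
    has_len = True
except TypeError:
    has_len = False

# dict() consumes the iterator and builds the mapping used with .map
label_dict = dict(pairs)

# Once consumed, the iterator is empty
leftover = list(pairs)

# range is lazy too; list() materialises it
numbers = list(range(3))

print(has_len, label_dict, leftover, numbers)
```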

2.3.3 Task 3: Extract a Titles feature from the plain-text Name feature (Titles are Mr, Miss, Mrs, and so on)

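# Write your code here
# The pattern captures the run of letters immediately before a period;
# the raw string (r'...') keeps \. as a literal dot without an invalid-escape warning
pd_train_clear_csv['Title']=pd_train_clear_csv.Name.str.extract(r'([A-Za-z]+)\.',expand=False)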
pd_train_clear_csv


Unnamed: 0 Unnamed: 0.1 PassengerId Survived Pclass Name Sex Age SibSp Parch ... Embarked cut cut1 Sex_num Embarked_num Cabin_LabelEncoder_map Cabin_LabelEncoder Ticket_LabelEncoder_map Ticket_LabelEncoder Title
0 0 0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 ... S 2 3 1 1 0 80 0 522 Mr
1 1 1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 ... C 3 4 2 2 0 80 1 595 Mrs
2 2 2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 ... S 2 3 2 1 0 80 2 668 Miss
3 3 3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 ... S 3 4 2 1 1 54 3 48 Mrs
4 4 4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 ... S 3 4 1 1 1 54 4 471 Mr
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
884 884 886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 ... S 2 3 1 1 143 69 676 100 Rev
885 885 887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 ... S 2 3 2 1 144 29 677 14 Miss
886 886 888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female 19.0 1 2 ... S 2 3 2 1 144 29 613 674 Miss
887 887 889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 ... C 2 3 1 2 145 59 678 8 Mr
888 888 890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 ... Q 2 4 1 3 145 59 679 465 Mr

889 rows × 23 columns
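
To see what the regular expression does in isolation, here is a short sketch on three names copied from the table above. With expand=False and a single capture group, str.extract returns a Series.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Montvila, Rev. Juozas",
])

# Capture the letters that come directly before the first period
titles = names.str.extract(r"([A-Za-z]+)\.", expand=False)
print(titles.tolist())  # ['Mr', 'Miss', 'Rev']

# value_counts() then shows how often each title occurs
title_counts = titles.value_counts()
print(title_counts)
```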

# Save your final cleaned data
pd_train_clear_csv.to_csv('train_fin.csv')
# Note: an example of the LabelEncoder() method

from sklearn.preprocessing import LabelEncoder
lbl=LabelEncoder()
data=['小猫','小猫','小狗','小狗','兔子','兔子','wu']  # '小猫' = kitten, '小狗' = puppy, '兔子' = rabbit
encode=lbl.fit_transform(data)
print(encode)
[3 3 2 2 1 1 0]
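
The two numbering schemes used in the Cabin/Ticket loop earlier (first appearance versus sorted order) can also be reproduced with pandas alone. This is a sketch on four sample cabin codes and assumes nothing beyond pandas itself: pd.factorize numbers values in order of first appearance, while pd.Categorical(...).codes follows the sorted categories, which is also the order LabelEncoder uses.

```python
import pandas as pd

cabins = pd.Series(["C85", "C123", "C85", "B42"])

# Order of first appearance: C85 -> 0, C123 -> 1, B42 -> 2
first_seen_codes, first_seen_uniques = pd.factorize(cabins)

# Sorted order: B42 -> 0, C123 -> 1, C85 -> 2
sorted_codes = pd.Categorical(cabins).codes

print(first_seen_codes.tolist())  # [0, 1, 0, 2]
print(list(sorted_codes))         # [2, 1, 2, 0]
```

Either scheme yields arbitrary integers, so the codes should not be read as having a meaningful order.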
