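[Review & Introduction] The previous chapter reviewed the fundamentals: basic data analysis operations and ways to observe the data from different angles. This part walks through the data analysis workflow itself: data cleaning, feature processing, data restructuring, and data visualization. These steps lay the groundwork for the modeling and model evaluation that come at the end of the analysis.
Before starting, import the numpy and pandas packages and load the data.
# Load the required libraries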
import numpy as np
import pandas as pd
# Load the data from train.csv
pd_train_csv=pd.read_csv('./train.csv')
pd_train_csv
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 12 columns
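2 Chapter 2: Data Cleaning and Feature Processing
The data we receive is usually not clean: it contains missing values, outliers, and similar problems, and needs some processing before any further analysis or modeling. The first step after obtaining the data is therefore data cleaning. This chapter covers missing values, duplicate values, strings, and data conversion, with the goal of cleaning the data into a form that can be analyzed or modeled.
2.1 Observing and Handling Missing Values
Real data often has many missing values. For example, the Cabin column contains NaN. Do other columns have missing values too, and how should they be handled?
2.1.1 Task 1: Observe missing values
(1) Check the number of missing values in each feature
(2) Look at the data in the Age, Cabin, and Embarked columns
Each of these can be done in several ways, so try as many as you can.
# Write your code here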
pd_train_csv['Age']
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
...
886 27.0
887 19.0
888 NaN
889 26.0
890 32.0
Name: Age, Length: 891, dtype: float64
# Write your code here
pd_train_csv.Cabin
0 NaN
1 C85
2 NaN
3 C123
4 NaN
...
886 NaN
887 B42
888 NaN
889 C148
890 NaN
Name: Cabin, Length: 891, dtype: object
# Write your code here. The general form of the loc indexer is loc[*, *]: the first * selects rows, the second * selects columns
pd_train_csv.loc[:,'Embarked']
0 S
1 C
2 S
3 S
4 S
..
886 S
887 S
888 S
889 C
890 Q
Name: Embarked, Length: 891, dtype: object
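The difference between the two indexers is easiest to see when row labels and row positions do not coincide. The following is a minimal sketch on a made-up three-row frame (the values and the index labels 10, 20, 30 are illustrative assumptions, not rows from train.csv):

```python
import pandas as pd

# Toy frame with a non-default index so that labels and positions differ
toy = pd.DataFrame(
    {"Age": [22.0, 38.0, 26.0], "Embarked": ["S", "C", "S"]},
    index=[10, 20, 30],
)

# loc selects by label: all rows, the column named "Embarked"
by_label = toy.loc[:, "Embarked"]

# iloc selects by integer position: all rows, the column at position 1
by_position = toy.iloc[:, 1]

# The row labelled 20 sits at row position 1
age_by_label = toy.loc[20, "Age"]
age_by_position = toy.iloc[1, 0]

print(by_label.equals(by_position), age_by_label, age_by_position)
```

With the default 0..n-1 index the two look interchangeable for rows, which is why the distinction is easy to miss until rows have been dropped.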
# loc is the label-based indexer; iloc is the position-based indexer
pd_train_csv.iloc[:,5]
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
...
886 27.0
887 19.0
888 NaN
889 26.0
890 32.0
Name: Age, Length: 891, dtype: float64
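Part (1) of this task, counting missing values per feature, can be approached as follows. This is a minimal sketch on a small made-up frame (the rows are illustrative assumptions); the same calls should apply to the full table loaded from train.csv.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for three columns of the passenger table
toy = pd.DataFrame(
    {
        "Age": [22.0, np.nan, 26.0, np.nan],
        "Cabin": [np.nan, "C85", np.nan, np.nan],
        "Embarked": ["S", "C", None, "S"],
    }
)

# Number of missing values in each column
missing_counts = toy.isnull().sum()

# Share of missing values in each column
missing_ratio = toy.isnull().mean()

# Non-null counts and dtypes in a single summary
toy.info()

# Several columns can be inspected at once with a list of names
subset = toy[["Age", "Cabin", "Embarked"]]

print(missing_counts)
```

`info()` reports non-null counts, so the number of missing values is the row count minus that figure; `isnull().sum()` gives it directly.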
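2.1.2 Task 2: Handle missing values
(1) What are the general approaches to handling missing values?
(2) Try handling the missing values in the Age column
(3) Try different methods that handle the missing values of the whole table directly
# General approaches to handling missing values:
# Hint: the dropna and fillna functions can be used
# Write your code here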
pd_train_csv[pd_train_csv['Age'].isnull()]
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
17 | 18 | 1 | 2 | Williams, Mr. Charles Eugene | male | NaN | 0 | 0 | 244373 | 13.0000 | NaN | S |
19 | 20 | 1 | 3 | Masselmani, Mrs. Fatima | female | NaN | 0 | 0 | 2649 | 7.2250 | NaN | C |
26 | 27 | 0 | 3 | Emir, Mr. Farred Chehab | male | NaN | 0 | 0 | 2631 | 7.2250 | NaN | C |
28 | 29 | 1 | 3 | O'Dwyer, Miss. Ellen "Nellie" | female | NaN | 0 | 0 | 330959 | 7.8792 | NaN | Q |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
859 | 860 | 0 | 3 | Razi, Mr. Raihed | male | NaN | 0 | 0 | 2629 | 7.2292 | NaN | C |
863 | 864 | 0 | 3 | Sage, Miss. Dorothy Edith "Dolly" | female | NaN | 8 | 2 | CA. 2343 | 69.5500 | NaN | S |
868 | 869 | 0 | 3 | van Melkebeke, Mr. Philemon | male | NaN | 0 | 0 | 345777 | 9.5000 | NaN | S |
878 | 879 | 0 | 3 | Laleff, Mr. Kristo | male | NaN | 0 | 0 | 349217 | 7.8958 | NaN | S |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
177 rows × 12 columns
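# Write your code here. dropna removes the rows/columns that contain missing values
# Note: each call below starts again from pd_train_csv, so only the last
# assignment (subset=['Embarked']) determines df
df=pd_train_csv.dropna(axis=0,how='any',subset=['Age'])
df=pd_train_csv.dropna(axis=0,how='any',subset=['Sex'])
df=pd_train_csv.dropna(axis=0,how='any',subset=['Embarked'])
# Write your code here. fillna fills in missing values
# ffill copies the previous row's value forward; bfill then fills any gaps left at the top
# (recent pandas versions deprecate method= here in favor of .ffill() and .bfill())
pd_train_csv=df.fillna(axis=0,method='ffill')
pd_train_csv=pd_train_csv.fillna(axis=0,method='bfill')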
pd_train_csv
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | C85 | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | C85 | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | C123 | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | C50 | S |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | 19.0 | 1 | 2 | W./C. 6607 | 23.4500 | B42 | S |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | C148 | Q |
889 rows × 12 columns
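Forward and backward filling copies a neighbouring passenger's value, which is only one of several possible strategies. The sketch below shows a few alternatives on a made-up frame: dropping by `subset` or `thresh`, and filling each column with its own value. The choice of median, mode, and the placeholder string "Unknown" are assumptions for illustration, not the method used in this notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "Age": [22.0, np.nan, 26.0, 40.0],
        "Cabin": [np.nan, "C85", np.nan, np.nan],
        "Embarked": ["S", "C", None, "S"],
    }
)

# dropna with subset: keep only the rows where Age is present
age_known = toy.dropna(subset=["Age"])

# dropna with axis=1 and thresh: drop columns with fewer than 2 non-null values
enough_data = toy.dropna(axis=1, thresh=2)

# fillna with a dict: a separate fill value for each column
filled = toy.fillna(
    {
        "Age": toy["Age"].median(),             # numeric: median of the observed ages
        "Embarked": toy["Embarked"].mode()[0],  # categorical: most frequent value
        "Cabin": "Unknown",                     # placeholder label (an assumption)
    }
)

print(filled)
```

Filling with a statistic keeps every row, at the cost of flattening the variation in that column, so which option fits depends on how the column is used later.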
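[Question 1] What parameters do dropna and fillna take, and how is each used?
[Question] For finding missing values, which works better: np.nan, None, or .isnull()? Why? If one of these fails to find the missing values, what is the reason?
# Answer to the questions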
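One way to approach the second question is to compare the three checks directly. The sketch below uses a short made-up Series; the point it illustrates is that NaN never compares equal to anything, including itself, so an equality test cannot locate it.

```python
import numpy as np
import pandas as pd

# None placed in a float column is stored as NaN
ages = pd.Series([22.0, np.nan, 26.0, None])

# NaN is not equal to itself
nan_equals_itself = np.nan == np.nan

# so filtering with == np.nan matches no rows at all
found_by_equality = int((ages == np.nan).sum())

# isnull() (alias: isna()) detects both NaN and None
found_by_isnull = int(ages.isnull().sum())

print(nan_equals_itself, found_by_equality, found_by_isnull)
```

This suggests `.isnull()` is the dependable choice, while comparisons against `np.nan` or `None` silently return no matches.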
[Reference] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html
[Reference] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
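2.2 Observing and Handling Duplicate Values
For one reason or another, the data may contain duplicate values. If it does, how should they be handled?
2.2.1 Task 1: Check the data for duplicate values
# Write your code here
pd_train_csv[pd_train_csv.duplicated()]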
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|
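2.2.2 Task 2: Handle duplicate values
(1) What are the ways to handle duplicate values?
(2) Handle the duplicate values in our data
The more methods, the better.
# Ways to handle duplicate values: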
pd_train_csv.duplicated().sum()
0
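Since the passenger table has no duplicated rows, the options are easier to see on a made-up frame that does. The sketch below (names and tickets are illustrative assumptions) contrasts the `keep` settings and the `subset` parameter of `duplicated` and `drop_duplicates`.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Name": ["A", "B", "A", "C", "A"],
        "Ticket": ["t1", "t2", "t1", "t3", "t9"],
    }
)

# Full-row duplicates: later copies are flagged, the first occurrence is not
dup_mask = toy.duplicated()

# keep="first" (the default) retains one copy of each duplicated row
deduped = toy.drop_duplicates()

# keep=False removes every copy of a duplicated row
no_copies = toy.drop_duplicates(keep=False)

# subset= judges duplicates on the chosen columns only
unique_names = toy.drop_duplicates(subset=["Name"], keep="last")

print(dup_mask.tolist(), len(deduped), len(no_copies), len(unique_names))
```

Whether to keep one copy or none depends on whether a repeated row is a data-entry echo or a sign that the record is unreliable.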
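# Write your code here. drop_duplicates() removes duplicate rows
# The default keep='first' retains one copy of each duplicated row; keep=False would drop every copy
pd_train_csv = pd_train_csv.drop_duplicates()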
pd_train_csv.head(200)
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | C85 | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | C85 | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | C123 | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
196 | 197 | 0 | 3 | Mernagh, Mr. Robert | male | 58.0 | 0 | 0 | 368703 | 7.7500 | B80 | Q |
197 | 198 | 0 | 3 | Olsen, Mr. Karl Siegwart Andreas | male | 42.0 | 0 | 1 | 4579 | 8.4042 | B80 | S |
198 | 199 | 1 | 3 | Madigan, Miss. Margaret "Maggie" | female | 42.0 | 0 | 0 | 370370 | 7.7500 | B80 | Q |
199 | 200 | 0 | 2 | Yrois, Miss. Henriette ("Mrs Harbeck") | female | 24.0 | 0 | 0 | 248747 | 13.0000 | B80 | S |
200 | 201 | 0 | 3 | Vande Walle, Mr. Nestor Cyriel | male | 28.0 | 0 | 0 | 345770 | 9.5000 | B80 | S |
200 rows × 12 columns
2.2.3 Task 3: Save the cleaned data in csv format
# Write your code here
pd_train_csv.to_csv('./pd_train_clear.csv')
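By default `to_csv` also writes the index, and reading the file back turns it into a column named `Unnamed: 0`, which is where the extra columns in the later outputs of this chapter come from. The following sketch shows the effect and the `index=False` alternative; it writes to in-memory buffers rather than files so that it stays self-contained, and the two rows are made up.

```python
import io
import pandas as pd

toy = pd.DataFrame({"PassengerId": [1, 2], "Age": [22.0, 38.0]})

# Default: the index is written as an unnamed first column
buf_with_index = io.StringIO()
toy.to_csv(buf_with_index)
buf_with_index.seek(0)
reloaded_with_index = pd.read_csv(buf_with_index)

# index=False: only the data columns are written
buf_no_index = io.StringIO()
toy.to_csv(buf_no_index, index=False)
buf_no_index.seek(0)
reloaded_clean = pd.read_csv(buf_no_index)

print(reloaded_with_index.columns.tolist())
print(reloaded_clean.columns.tolist())
```

Passing `index_col=0` to `read_csv` is another way to avoid the extra column when the index is worth keeping.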
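2.3 Observing and Processing Features
Looking at the features, we can roughly divide them into two groups:
Numeric features: Survived, Pclass, Age, SibSp, Parch, Fare. Survived and Pclass are discrete numeric features; Age, SibSp, Parch, and Fare are treated as continuous numeric features (SibSp and Parch are in fact integer counts).
Text features: Name, Sex, Cabin, Embarked, Ticket. Sex, Cabin, Embarked, and Ticket are categorical text features. Numeric features can usually be used for model training directly, though continuous variables are sometimes discretized to make the model more stable and robust. Text features usually have to be converted to numeric features before they can be used for modeling.
2.3.1 Task 1: Bin (discretize) Age
(1) What is binning?
(2) Split the continuous variable Age into 5 equal-width age groups, labeled with the categorical values 1 to 5
(3) Split Age into the five groups [0,5) [5,15) [15,30) [30,50) [50,80), labeled 1 to 5
(4) Split Age at the 10%, 30%, 50%, 70%, and 90% quantiles into five groups, labeled 1 to 5
(5) Save each of the resulting datasets in csv format
# What is binning:
# Split the continuous variable Age into 5 equal-width age groups, labeled 1 to 5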
pd_train_clear_csv=pd.read_csv('pd_train_clear.csv')
pd_train_clear_csv['cut']=pd.cut(pd_train_clear_csv['Age'],bins=5,labels=[1,2,3,4,5])
pd_train_clear_csv.head(20)
Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | cut | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | C85 | S | 2 |
1 | 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 3 |
2 | 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | C85 | S | 2 |
3 | 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 3 |
4 | 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | C123 | S | 3 |
5 | 5 | 6 | 0 | 3 | Moran, Mr. James | male | 35.0 | 0 | 0 | 330877 | 8.4583 | C123 | Q | 3 |
6 | 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S | 4 |
7 | 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | E46 | S | 1 |
8 | 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | E46 | S | 2 |
9 | 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | E46 | C | 1 |
10 | 10 | 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.0 | 1 | 1 | PP 9549 | 16.7000 | G6 | S | 1 |
11 | 11 | 12 | 1 | 1 | Bonnell, Miss. Elizabeth | female | 58.0 | 0 | 0 | 113783 | 26.5500 | C103 | S | 4 |
12 | 12 | 13 | 0 | 3 | Saundercock, Mr. William Henry | male | 20.0 | 0 | 0 | A/5. 2151 | 8.0500 | C103 | S | 2 |
13 | 13 | 14 | 0 | 3 | Andersson, Mr. Anders Johan | male | 39.0 | 1 | 5 | 347082 | 31.2750 | C103 | S | 3 |
14 | 14 | 15 | 0 | 3 | Vestrom, Miss. Hulda Amanda Adolfina | female | 14.0 | 0 | 0 | 350406 | 7.8542 | C103 | S | 1 |
15 | 15 | 16 | 1 | 2 | Hewlett, Mrs. (Mary D Kingcome) | female | 55.0 | 0 | 0 | 248706 | 16.0000 | C103 | S | 4 |
16 | 16 | 17 | 0 | 3 | Rice, Master. Eugene | male | 2.0 | 4 | 1 | 382652 | 29.1250 | C103 | Q | 1 |
17 | 17 | 18 | 1 | 2 | Williams, Mr. Charles Eugene | male | 2.0 | 0 | 0 | 244373 | 13.0000 | C103 | S | 1 |
18 | 18 | 19 | 0 | 3 | Vander Planke, Mrs. Julius (Emelia Maria Vande... | female | 31.0 | 1 | 0 | 345763 | 18.0000 | C103 | S | 2 |
19 | 19 | 20 | 1 | 3 | Masselmani, Mrs. Fatima | female | 31.0 | 0 | 0 | 2649 | 7.2250 | C103 | C | 2 |
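Binning replaces each continuous value with the label of the interval it falls into. With an integer `bins`, `pd.cut` divides the observed range from the minimum to the maximum into intervals of equal width. The sketch below uses six made-up ages and `retbins=True` to expose the computed edges.

```python
import pandas as pd

ages = pd.Series([2.0, 14.0, 22.0, 35.0, 54.0, 80.0])

# bins=5 splits the range [min, max] into 5 intervals of equal width;
# retbins=True also returns the edges that were computed
codes, edges = pd.cut(ages, bins=5, labels=[1, 2, 3, 4, 5], retbins=True)

print(edges)           # 6 edges delimit 5 intervals
print(codes.tolist())  # one label per age
```

The lowest edge is placed slightly below the minimum so that the smallest value is included in the first interval. Equal width does not mean equal counts: an interval covering a sparse age range can hold very few passengers.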
# Write your code here. Split Age into the five groups [0,5) [5,15) [15,30) [30,50) [50,80), labeled 1 to 5
# right=False makes each interval closed on the left and open on the right, as the task states;
# an age of exactly 80 then falls outside the last interval and becomes NaN
pd_train_clear_csv['cut1']=pd.cut(pd_train_clear_csv['Age'],bins=[0,5,15,30,50,80],labels=[1,2,3,4,5],right=False)
pd_train_clear_csv.head(20)
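When explicit edges are passed, `pd.cut` closes each interval on the right by default, whereas the task writes the groups as left-closed intervals such as [0,5). The difference only shows at the boundary ages. A minimal sketch with made-up ages chosen to sit on the edges:

```python
import pandas as pd

ages = pd.Series([4.0, 5.0, 15.0, 30.0, 50.0, 79.0, 80.0])
edges = [0, 5, 15, 30, 50, 80]
labels = [1, 2, 3, 4, 5]

# Default right=True: intervals are (0, 5], (5, 15], ..., (50, 80]
right_closed = pd.cut(ages, bins=edges, labels=labels)

# right=False: intervals are [0, 5), [5, 15), ..., [50, 80)
left_closed = pd.cut(ages, bins=edges, labels=labels, right=False)

print(pd.DataFrame({"Age": ages, "right_closed": right_closed, "left_closed": left_closed}))
```

With `right=False` an age of exactly 80 is outside [50,80) and comes back as NaN, so the last edge would need to be raised if that value has to be kept.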
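# Write your code here. Split Age at the 10%, 30%, 50%, 70%, and 90% quantiles into five groups, labeled 1 to 5
# q is the list of quantile edges; ages above the 90% quantile fall outside the last edge and become NaN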
pd_train_clear_csv['cut2']=pd.qcut(pd_train_clear_csv['Age'],q=[0,0.1,0.3,0.5,0.7,0.9],labels=[1,2,3,4,5])
pd_train_clear_csv.head(20)
Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | cut | cut1 | cut2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | C85 | S | 2 | 3 | 2 |
1 | 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 3 | 4 | 5 |
2 | 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | C85 | S | 2 | 3 | 3 |
3 | 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 3 | 4 | 4 |
4 | 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | C123 | S | 3 | 4 | 4 |
5 | 5 | 6 | 0 | 3 | Moran, Mr. James | male | 35.0 | 0 | 0 | 330877 | 8.4583 | C123 | Q | 3 | 4 | 4 |
6 | 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S | 4 | 5 | NaN |
7 | 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | E46 | S | 1 | 1 | 1 |
8 | 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | E46 | S | 2 | 3 | 3 |
9 | 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | E46 | C | 1 | 2 | 2 |
10 | 10 | 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.0 | 1 | 1 | PP 9549 | 16.7000 | G6 | S | 1 | 1 | 1 |
11 | 11 | 12 | 1 | 1 | Bonnell, Miss. Elizabeth | female | 58.0 | 0 | 0 | 113783 | 26.5500 | C103 | S | 4 | 5 | NaN |
12 | 12 | 13 | 0 | 3 | Saundercock, Mr. William Henry | male | 20.0 | 0 | 0 | A/5. 2151 | 8.0500 | C103 | S | 2 | 3 | 2 |
13 | 13 | 14 | 0 | 3 | Andersson, Mr. Anders Johan | male | 39.0 | 1 | 5 | 347082 | 31.2750 | C103 | S | 3 | 4 | 5 |
14 | 14 | 15 | 0 | 3 | Vestrom, Miss. Hulda Amanda Adolfina | female | 14.0 | 0 | 0 | 350406 | 7.8542 | C103 | S | 1 | 2 | 2 |
15 | 15 | 16 | 1 | 2 | Hewlett, Mrs. (Mary D Kingcome) | female | 55.0 | 0 | 0 | 248706 | 16.0000 | C103 | S | 4 | 5 | NaN |
16 | 16 | 17 | 0 | 3 | Rice, Master. Eugene | male | 2.0 | 4 | 1 | 382652 | 29.1250 | C103 | Q | 1 | 1 | 1 |
17 | 17 | 18 | 1 | 2 | Williams, Mr. Charles Eugene | male | 2.0 | 0 | 0 | 244373 | 13.0000 | C103 | S | 1 | 1 | 1 |
18 | 18 | 19 | 0 | 3 | Vander Planke, Mrs. Julius (Emelia Maria Vande... | female | 31.0 | 1 | 0 | 345763 | 18.0000 | C103 | S | 2 | 4 | 4 |
19 | 19 | 20 | 1 | 3 | Masselmani, Mrs. Fatima | female | 31.0 | 0 | 0 | 2649 | 7.2250 | C103 | C | 2 | 4 | 4 |
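The NaN entries in `cut2` for the oldest passengers follow from the quantile edges stopping at 0.9: values above the 90% quantile have no interval to fall into. The sketch below reproduces this on made-up ages from 1 to 100 and shows one possible alternative reading of the task, extending the last edge to 1.0. That alternative is an assumption for illustration, not the notebook's answer.

```python
import pandas as pd

# Made-up ages 1.0 .. 100.0 so that the quantiles are easy to follow
ages = pd.Series(range(1, 101), dtype="float64")

# Edges stop at the 90% quantile: the top 10% of values get no bin
partial = pd.qcut(ages, q=[0, 0.1, 0.3, 0.5, 0.7, 0.9], labels=[1, 2, 3, 4, 5])

# Extending the last edge to 1.0 covers every value
full = pd.qcut(ages, q=[0, 0.1, 0.3, 0.5, 0.7, 1.0], labels=[1, 2, 3, 4, 5])

print(partial.isna().sum(), full.isna().sum())
```

Unlike `pd.cut`, `pd.qcut` places its edges at quantiles of the data, so the intervals are defined by the share of observations rather than by width.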
# Write your code here
pd_train_clear_csv.to_csv('pd_train_clear_cut.csv')
[Reference] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html
[Reference] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html
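2.3.2 Task 2: Convert the text variables
(1) Look at the text variable names and their categories
(2) Represent the text variables Sex, Cabin, and Embarked with numeric values such as 1, 2, 3, 4, 5
(3) Represent the text variables Sex, Cabin, and Embarked with one-hot encoding
# Write your code here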
pd_train_clear_csv=pd.read_csv('pd_train_clear_cut.csv')
pd_train_clear_csv['Sex'].value_counts()
male 577
female 312
Name: Sex, dtype: int64
# Write your code here
# pd_train_clear_csv['Cabin'].unique()
pd_train_csv['Cabin'].value_counts()
G6 24
B78 21
C78 20
C83 19
C23 C25 C27 19
..
E36 1
A5 1
F G63 1
D46 1
T 1
Name: Cabin, Length: 146, dtype: int64
# Write your code here. nunique() returns the number of distinct values
# pd_train_csv['Embarked'].nunique()
pd_train_clear_csv['Cabin'].nunique()
# For a single row:
# pd_train_csv.iloc[1,:].nunique()
146
# (2) Represent the text variables Sex, Cabin, and Embarked with numeric values
# Method 1: replace
pd_train_clear_csv['Sex_num']=pd_train_clear_csv['Sex'].replace(['male','female'],[1,2])
pd_train_clear_csv
Unnamed: 0 | Unnamed: 0.1 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | cut | cut1 | Sex_num | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | C85 | S | 2 | 3 | 1 |
1 | 1 | 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 3 | 4 | 2 |
2 | 2 | 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | C85 | S | 2 | 3 | 2 |
3 | 3 | 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 3 | 4 | 2 |
4 | 4 | 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | C123 | S | 3 | 4 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
884 | 884 | 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | C50 | S | 2 | 3 | 1 |
885 | 885 | 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S | 2 | 3 | 2 |
886 | 886 | 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | 19.0 | 1 | 2 | W./C. 6607 | 23.4500 | B42 | S | 2 | 3 | 2 |
887 | 887 | 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C | 2 | 3 | 1 |
888 | 888 | 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | C148 | Q | 2 | 4 | 1 |
889 rows × 17 columns
# Method 2: map with a dictionary
pd_train_clear_csv['Embarked'].value_counts()
S 644
C 168
Q 77
Name: Embarked, dtype: int64
pd_train_clear_csv['Embarked_num']=pd_train_clear_csv['Embarked'].map({'S':1,'C':2,'Q':3})
pd_train_clear_csv
Unnamed: 0 | Unnamed: 0.1 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | cut | cut1 | Sex_num | Embarked_num | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | C85 | S | 2 | 3 | 1 | 1 |
1 | 1 | 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 3 | 4 | 2 | 2 |
2 | 2 | 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | C85 | S | 2 | 3 | 2 | 1 |
3 | 3 | 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 3 | 4 | 2 | 1 |
4 | 4 | 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | C123 | S | 3 | 4 | 1 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
884 | 884 | 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | C50 | S | 2 | 3 | 1 | 1 |
885 | 885 | 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S | 2 | 3 | 2 | 1 |
886 | 886 | 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | 19.0 | 1 | 2 | W./C. 6607 | 23.4500 | B42 | S | 2 | 3 | 2 | 1 |
887 | 887 | 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C | 2 | 3 | 1 | 2 |
888 | 888 | 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | C148 | Q | 2 | 4 | 1 | 3 |
889 rows × 18 columns
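`replace` and `map` give the same result when every value appears in the mapping, but they differ on values that do not. A minimal sketch, where the port code "X" is a made-up value added only to show the difference:

```python
import pandas as pd

# dtype=object keeps the column as plain Python objects
ports = pd.Series(["S", "C", "Q", "X"], dtype=object)
mapping = {"S": 1, "C": 2, "Q": 3}

# map: values missing from the dict become NaN
mapped = ports.map(mapping)

# replace: values missing from the dict are left unchanged
replaced = ports.replace(mapping)

print(mapped.tolist())
print(replaced.tolist())
```

`map` makes unexpected categories visible as NaN, while `replace` leaves a mixed column of numbers and strings, so `map` is often the safer choice when the result must be fully numeric.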
# Method 3: LabelEncoder from sklearn.preprocessing
pd_train_clear_csv['Cabin'].value_counts()
G6 24
B78 21
C78 20
C83 19
C23 C25 C27 19
..
E36 1
A5 1
F G63 1
D46 1
T 1
Name: Cabin, Length: 146, dtype: int64
from sklearn.preprocessing import LabelEncoder
for feat in ['Cabin','Ticket']:
    lbl=LabelEncoder()
    # Manual mapping: number the unique values in order of first appearance
    label_dict=dict(zip(pd_train_clear_csv[feat].unique(),range(pd_train_clear_csv[feat].nunique())))
    pd_train_clear_csv[feat+"_LabelEncoder_map"]=pd_train_clear_csv[feat].map(label_dict)
    # LabelEncoder: number the unique values in sorted order
    pd_train_clear_csv[feat+"_LabelEncoder"]=lbl.fit_transform(pd_train_clear_csv[feat].astype(str))
pd_train_clear_csv
Unnamed: 0 | Unnamed: 0.1 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | ... | Cabin | Embarked | cut | cut1 | Sex_num | Embarked_num | Cabin_LabelEncoder_map | Cabin_LabelEncoder | Ticket_LabelEncoder_map | Ticket_LabelEncoder | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | ... | C85 | S | 2 | 3 | 1 | 1 | 0 | 80 | 0 | 522 |
1 | 1 | 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | ... | C85 | C | 3 | 4 | 2 | 2 | 0 | 80 | 1 | 595 |
2 | 2 | 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | ... | C85 | S | 2 | 3 | 2 | 1 | 0 | 80 | 2 | 668 |
3 | 3 | 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | ... | C123 | S | 3 | 4 | 2 | 1 | 1 | 54 | 3 | 48 |
4 | 4 | 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | ... | C123 | S | 3 | 4 | 1 | 1 | 1 | 54 | 4 | 471 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
884 | 884 | 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | ... | C50 | S | 2 | 3 | 1 | 1 | 143 | 69 | 676 | 100 |
885 | 885 | 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | ... | B42 | S | 2 | 3 | 2 | 1 | 144 | 29 | 677 | 14 |
886 | 886 | 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | 19.0 | 1 | 2 | ... | B42 | S | 2 | 3 | 2 | 1 | 144 | 29 | 613 | 674 |
887 | 887 | 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | ... | C148 | C | 2 | 3 | 1 | 2 | 145 | 59 | 678 | 8 |
888 | 888 | 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | ... | C148 | Q | 2 | 4 | 1 | 3 | 145 | 59 | 679 | 465 |
889 rows × 22 columns
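The `_LabelEncoder_map` and `_LabelEncoder` columns hold different numbers for the same value because the two approaches number the categories in different orders: the dictionary built from `unique()` follows order of first appearance, while `LabelEncoder` numbers its classes in sorted order. The sketch below reproduces both orderings with pandas alone, as a stand-in that avoids the scikit-learn dependency; the cabin codes are taken from the rows shown above.

```python
import pandas as pd

cabins = pd.Series(["C85", "C123", "C85", "B42", "C148"], dtype=object)

# Order of first appearance, like dict(zip(unique(), range(nunique())))
appearance_codes, appearance_uniques = pd.factorize(cabins)

# Sorted order of the unique strings, the ordering LabelEncoder uses
sorted_codes = cabins.astype("category").cat.codes

print(appearance_codes.tolist())
print(sorted_codes.tolist())
```

In both cases the integers are arbitrary identifiers, so a model may read an ordering into them that the categories do not have.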
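# Notes:
# zip takes iterables as arguments and packs their corresponding elements into tuples. In Python 3 it returns a lazy iterator over those tuples, not a list.
# A zip object does not support len(), but it can be traversed with a for loop
# .map applies a mapping; here the mapping is given as a dictionary
# range produces a sequence of integers (in Python 3 a lazy range object, not a list)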
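Part (3) of this task asks for one-hot encoding. One common way to do this in pandas is `pd.get_dummies`; the following is a minimal sketch on a made-up three-row frame rather than the notebook's own answer.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Sex": ["male", "female", "female"],
        "Embarked": ["S", "C", "Q"],
        "Age": [22.0, 38.0, 26.0],
    }
)

# One indicator column per category; columns not listed pass through unchanged
one_hot = pd.get_dummies(toy, columns=["Sex", "Embarked"], dtype=int)

print(one_hot.columns.tolist())
print(one_hot)
```

Each row has exactly one 1 among the indicator columns of a given feature. For a feature with many categories, such as Cabin with 146 distinct values, this produces one column per value, which is worth weighing before encoding it this way.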
2.3.3 Task 3: Extract a Title feature from the plain-text Name feature (Titles are Mr, Miss, Mrs, and so on)
# Write your code here
pd_train_clear_csv['Title']=pd_train_clear_csv.Name.str.extract(r'([A-Za-z]+)\.',expand=False)
pd_train_clear_csv
Unnamed: 0 | Unnamed: 0.1 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | ... | Embarked | cut | cut1 | Sex_num | Embarked_num | Cabin_LabelEncoder_map | Cabin_LabelEncoder | Ticket_LabelEncoder_map | Ticket_LabelEncoder | Title | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | ... | S | 2 | 3 | 1 | 1 | 0 | 80 | 0 | 522 | Mr |
1 | 1 | 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | ... | C | 3 | 4 | 2 | 2 | 0 | 80 | 1 | 595 | Mrs |
2 | 2 | 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | ... | S | 2 | 3 | 2 | 1 | 0 | 80 | 2 | 668 | Miss |
3 | 3 | 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | ... | S | 3 | 4 | 2 | 1 | 1 | 54 | 3 | 48 | Mrs |
4 | 4 | 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | ... | S | 3 | 4 | 1 | 1 | 1 | 54 | 4 | 471 | Mr |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
884 | 884 | 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | ... | S | 2 | 3 | 1 | 1 | 143 | 69 | 676 | 100 | Rev |
885 | 885 | 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | ... | S | 2 | 3 | 2 | 1 | 144 | 29 | 677 | 14 | Miss |
886 | 886 | 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | 19.0 | 1 | 2 | ... | S | 2 | 3 | 2 | 1 | 144 | 29 | 613 | 674 | Miss |
887 | 887 | 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | ... | C | 2 | 3 | 1 | 2 | 145 | 59 | 678 | 8 | Mr |
888 | 888 | 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | ... | Q | 2 | 4 | 1 | 3 | 145 | 59 | 679 | 465 | Mr |
889 rows × 23 columns
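The pattern `([A-Za-z]+)\.` captures the run of letters that sits directly before the first period in each name, which in this dataset is the title. The sketch below applies it to four names from the table and counts the results; writing the pattern as a raw string keeps Python from treating `\.` as a string escape.

```python
import pandas as pd

names = pd.Series(
    [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Montvila, Rev. Juozas",
        "Palsson, Master. Gosta Leonard",
    ]
)

# expand=False returns a Series for a single capture group
titles = names.str.extract(r"([A-Za-z]+)\.", expand=False)

# How often each title occurs
title_counts = titles.value_counts()

print(titles.tolist())
```

Running `value_counts()` on the full Title column is a quick way to spot rare titles that might be grouped together before modeling.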
# Save the final cleaned data
pd_train_clear_csv.to_csv('train_fin.csv')
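# Note: an example of LabelEncoder()
# (the sample strings below mean kitten, puppy, and rabbit; classes are numbered in sorted order)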
from sklearn.preprocessing import LabelEncoder
lbl=LabelEncoder()
data=['小猫','小猫','小狗','小狗','兔子','兔子','wu']
encode=lbl.fit_transform(data)
print(encode)
[3 3 2 2 1 1 0]