本文将主要介绍以下内容:
1. 针对内存使用优化数据集
2. 按单一条件筛选
3. 按多个条件筛选
4. 其它筛选方法
5. 处理重复项
在前面的文章中,我们练习了从DataFrame中选择单独的行,列和值。现在让我们探索如何基于一个或多个条件来筛选数据。
1. 针对内存使用优化数据集
和往常一样,让我们从导入pandas开始:
In [1]: import pandas as pd
接下来让我们看看要使用的employee.csv测试数据集,它是公司虚构的员工集合。每条记录都包括员工的名字、性别、在公司的开始日期、薪水、管理状态和所在的团队:
In [2]: pd.read_csv("employees.csv")
Out [2]: First Name Gender Start Date Salary Mgmt Team
0 Douglas Male 8/6/93 NaN True Marketing
1 Thomas Male 3/31/96 61933.0 True NaN
2 Maria Female NaN 130590.0 False Finance
3 Jerry NaN 3/4/05 138705.0 True Finance
4 Larry Male 1/24/98 101004.0 True IT
… … … … … … …
996 Phillip Male 1/31/84 42392.0 False Finance
997 Russell Male 5/20/13 96914.0 False Product
998 Larry Male 4/20/13 60500.0 False Business Dev
999 Albert Male 5/15/12 129949.0 True Sales
1000 NaN NaN NaN NaN NaN NaN
1001 rows × 6 columns
不难看出数据集在每列中都有缺失的值,实际上最后一行仅包含NaN值。在现实世界中的数据会根据导出系统的不同而变化很大,空行是常见的情况。
第一个优化可以使用parse_dates参数将Start Date列中的文本值转换为datetime对象:
In [3]: pd.read_csv("employees.csv", parse_dates = ["Start Date"]).head()
Out [3]: First Name Gender Start Date Salary Mgmt Team
0 Douglas Male 1993-08-06 NaN True Marketing
1 Thomas Male 1996-03-31 61933.0 True NaN
2 Maria Female NaT 130590.0 False Finance
3 Jerry NaN 2005-03-04 138705.0 True Finance
4 Larry Male 1998-01-24 101004.0 True IT
下面让我们将DataFrame对象分配给变量employees:
In [4]: employees = pd.read_csv("employees.csv",
parse_dates = ["Start Date"])
1.1 使用as_type方法转换数据类型
info方法返回数据集的摘要,包括列名、非空值数量、数据类型和内存消耗:
In [5]: employees.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1001 entries, 0 to 1000
Data columns (total 6 columns):
# Column Non-Null Count Dtype
0 First Name 933 non-null object
1 Gender 854 non-null object
2 Start Date 999 non-null datetime64[ns]
3 Salary 999 non-null float64
4 Mgmt 933 non-null object
5 Team 957 non-null object
dtypes: datetime64[ns](1), float64(1), object(4)
memory usage: 47.0+ KB
RangeIndex显示数据集有1001行,然后我们可以从Non-Null Count列确定每列非空值的数量,所有六列都缺少数据。当前的内存使用量约为47kb,让我们看看是否可以减少它。
astype方法用于把Pandas对象的值转换为其它数据类型。下例是把Mgmt列转换为bool数据类型,返回值是一个新的Series对象。请注意,NaN值会被转换为True值。在本文的后面,我会介绍如何删除和替换缺失值。
In [6]: employees["Mgmt"].astype(bool)
Out [6]: 0 True
1 True
2 False
3 True
4 True
...
996 False
997 False
998 False
999 True
1000 True
Name: Mgmt, Length: 1001, dtype: bool
在DataFrame中覆盖存在列或创建新列的原理和向字典添加键值对是相似的,如果存在具有指定名称的列,Pandas会使用新的Series覆盖它;如果该列不存在,Pandas会把新的Series添加到DataFrame的右侧,并通过两个数据结构中的公共索引标签进行匹配。
In [7]: employees["Mgmt"] = employees["Mgmt"].astype(bool)
让我们用info方法再查看一下内存消耗:
In [8]: employees.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1001 entries, 0 to 1000
Data columns (total 6 columns):
# Column Non-Null Count Dtype
0 First Name 933 non-null object
1 Gender 854 non-null object
2 Start Date 999 non-null datetime64[ns]
3 Salary 999 non-null float64
4 Mgmt 1001 non-null bool
5 Team 957 non-null object
dtypes: bool(1), datetime64[ns](1), float64(1), object(3)
memory usage: 40.2+ KB
啊哈!我们把内存使用量减少了近15%,这是一个很好的开始!接下来,让我们看一下Salary列。如果打开原始CSV文件会看到该列的值实际上存储为整数,出于技术原因,Pandas需要将其从整数转换为浮点值,以支持分散在各处的NaN值。如果我们尝试把列的值强制转为整数,则会抛出ValueError异常:
In [9]: employees["Salary"].astype(int)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-99-b148c8b8be90> in <module>
----> 1 employees["Salary"].astype(int)
ValueError: Cannot convert non-finite values (NA or inf) to integer
在本文的后面,我将会介绍如何完全删除最后一行。现在,我们可以使用fillna方法把NaN替换为指定的值,返回一个新的Series。例如替换为0:
In [10]: employees["Salary"].fillna(0).tail()
Out [10]: 99 42392.0
99 96914.0
998 60500.0
999 129949.0
1000 0.0
Name: Salary, dtype: float64
现在Salary列中已经没有缺少的值,可以将其值转换为整数:
In [11]: employees["Salary"].fillna(0).astype(int).head()
Out [11]: 0 0
1 61933
2 130590
3 138705
4 101004
Name: Salary, dtype: int64
转换为整数后覆盖原来的列:
In [12]: employees["Salary"] = employees["Salary"].fillna(0).astype(int)
我们已经转换了Start Date和Mgmt列,以存储比字符串更合适的数据类型。还有什么可以优化的吗?绝对有!
Pandas包含一种称为category的特殊数据类型,当列包含相对于其总数量的少量唯一值时,它是理想的选择。数量有限的常见数据包括性别、工作日、血型等。category的值存储为普通的Python对象而不是NumPy的ndarrays,并针对性能进行了优化。
之前介绍过nunique方法可以返回DataFrame每列中唯一值的数量。请注意,默认情况下将排除缺失值NaN,不过您也可以使用dropna = False参数把其计算在内。
In [13]: employees.nunique()
Out [13]: First Name 200
Gender 2
Start Date 971
Salary 994
Mgmt 2
Team 10
dtype: int64
最适合作为category类型的两列是Gender和Team,在1001行数据中,它们分别仅有2和10个唯一值。让我们再次练习使用astype方法:
In [14]: employees["Gender"].astype("category")
Out [14]: 0 Male
1 Male
2 Female
3 NaN
4 Male
...
996 Male
997 Male
998 Male
999 Male
1000 NaN
Name: Gender, Length: 1001, dtype: category
Categories (2, object): [Female, Male]
让我们覆盖原有的Gender列并检查内存使用情况。由于只有2个唯一值而不是1001个,因此内存使用再次大幅下降:
In [15]: employees["Gender"] = employees["Gender"].astype("category")
employees.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1001 entries, 0 to 1000
Data columns (total 6 columns):
# Column Non-Null Count Dtype
0 First Name 933 non-null object
1 Gender 854 non-null category
2 Start Date 999 non-null datetime64[ns]
3 Salary 1001 non-null int64
4 Mgmt 1001 non-null bool
5 Team 957 non-null object
dtypes: bool(1), category(1), datetime64[ns](1), int64(1), object(2)
memory usage: 33.5+ KB
让我们对Team列重复相同的操作,因为在1001行中只有10个唯一值:
In [16]: employees["Team"] = employees["Team"].astype("category")
employees.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1001 entries, 0 to 1000
Data columns (total 6 columns):
# Column Non-Null Count Dtype
0 First Name 933 non-null object
1 Gender 854 non-null category
2 Start Date 999 non-null datetime64[ns]
3 Salary 1001 non-null int64
4 Mgmt 1001 non-null bool
5 Team 957 non-null category
dtypes: bool(1), category(2), datetime64[ns](1), int64(1), object(1)
memory usage: 27.0+ KB
通过这些简单的方法,我们将内存使用量减少了40%以上!
2. 按单一条件筛选
在数据分析中最常见的操作也许是读取数据的子集,它是由一个数据集的部分或全部行组成的数据集。当读取子集时,一般是基于一个或多个条件来筛选数据集。
如果我们要返回所有名为Maria的员工名单,需要对First Name列中的值进行筛选。提醒一下,双等号 == 用于比较Python中两个对象的相等性。例如检查两个字符串是否相等:
In [17]: "Maria" == "Maria"
Out [17]: True
有人可能会认为将相等运算符用于Series和字符串比较会导致错误,毕竟它们是根本不同的对象。但是,Pandas可以识别出我们想要将Series中的每个值与指定的字符串进行比较。下面的代码返回一个新的布尔Series,其中True值表示在First Name列中该索引位置的值等于Maria:
In [18]: employees["First Name"] == "Maria"
Out [18]: 0 False
1 False
2 True
3 False
4 False
...
996 False
997 False
998 False
999 False
1000 False
Name: First Name, Length: 1001, dtype: bool
然后我们就可以用上述条件筛选出First Name 列的值为Maria的行:
In [19]: employees[employees["First Name"] == "Maria"]
Out [19]: First Name Gender Start Date Salary Mgmt Team
2 Maria Female NaT 130590 False Finance
198 Maria Female 1990-12-27 36067 True Product
815 Maria NaN 1986-01-18 106562 False HR
844 Maria NaN 1985-06-19 148857 False Legal
936 Maria Female 2003-03-14 96250 False Business Dev
984 Maria Female 2011-10-15 43455 False Engineering
如果我们想筛选出所有不在人力资源部门的员工,可以使用Python的 != 运算符:
In [20]: "Engineering" != "HR"
Out [20]: True
Series对象也可以与 != 运算符一起使用。在下例中,True表示给定行的Team值不等于HR,而False则表示Team的值等于HR:
In [21]: (employees["Team"] != "HR").head()
Out [21]: 0 True
1 False
2 True
3 True
4 True
Name: Team, dtype: bool
同样,我们可以用上述条件筛选出所有不在人力资源部门的员工的行:
In [22]: employees[employees["Team"] != "HR"]
Out [22]: First Name Gender Start Date Salary Mgmt Team
0 Douglas Male 1993-08-06 0 True Marketing
2 Maria Female NaT 130590 False Finance
3 Jerry NaN 2005-03-04 138705 True Finance
4 Larry Male 1998-01-24 101004 True IT
5 Dennis Male 1987-04-18 115163 False Legal
… … … … … … …
995 Henry NaN 2014-11-23 132483 False Distribution
996 Phillip Male 1984-01-31 42392 False Finance
997 Russell Male 2013-05-20 96914 False Product
998 Larry Male 2013-04-20 60500 False Business Dev
999 Albert Male 2012-05-15 129949 True Sales
866 rows x 6 columns
请注意,结果将排除NaN的行,因为NaN于任何字符串比较(无论 == 还是!=)都会返回False。
如果我们想筛选所有经理,可以直接使用employee ["Mgmt"],因为它的值已经是布尔值:
In [23]: employees[employees["Mgmt"]].head()
Out [23]: First Name Gender Start Date Salary Mgmt Team
0 Douglas Male 1993-08-06 0 True Marketing
1 Thomas Male 1996-03-31 61933 True NaN
3 Jerry NaN 2005-03-04 138705 True Finance
4 Larry Male 1998-01-24 101004 True IT
6 Ruby Female 1987-08-17 65476 True Product
我们还可以根据数学条件筛选列,下例是返回工资为六位数的行:
In [24]: high_earners = employees["Salary"] > 100000
employees[high_earners].head()
Out [24]: First Name Gender Start Date Salary Mgmt Team
2 Maria Female NaT 130590 False Finance
3 Jerry NaN 2005-03-04 138705 True Finance
4 Larry Male 1998-01-24 101004 True IT
5 Dennis Male 1987-04-18 115163 False Legal
9 Frances Female 2002-08-08 139852 True Business Dev
3. 按多个条件筛选
DataFrame也可以通过多种条件进行筛选。
3.1 与条件
我们的下一个例子是筛选所有从事业务发展的女性员工。对于给定的行,必须满足两个条件:Gender列中的值是Female和Tean列中的值是Business Dev。首先创建第一个条件:
In [25]: is_female = employees["Gender"] == "Female"
然后创建第二个条件:
In [26]: in_biz_dev = employees["Team"] == "Business Dev"
最后把两个条件用&符号分隔,&表示AND条件,必须要同时为True才满足条件:
In [27]: employees[is_female & in_biz_dev].head()
Out [27]: First Name Gender Start Date Salary Mgmt Team
9 Frances Female 2002-08-08 139852 True Business Dev
33 Jean Female 1993-12-18 119082 False Business Dev
36 Rachel Female 2009-02-16 142032 False Business Dev
38 Stephanie Female 1986-09-13 36844 True Business Dev
61 Denise Female 2001-11-06 106862 False Business Dev
我们可以使用不只两个条件,只要把所有条件用 & 符号分隔。下例是筛选所有从事业务发展的女性经理:
In [28]: is_manager = employees["Mgmt"]
employees[is_female & in_biz_dev & is_manager].head()
Out [28]: First Name Gender Start Date Salary Mgmt Team
9 Frances Female 2002-08-08 139852 True Business Dev
38 Stephanie Female 1986-09-13 36844 True Business Dev
66 Nancy Female 2012-12-15 125250 True Business Dev
92 Linda Female 2000-05-25 119009 True Business Dev
111 Bonnie Female 1999-12-17 42153 True Business Dev
3.2 或条件
我们也可以筛选只要符合以其中一个条件的行。例如,查找所有工资低于40,000或开始日期在2015年1月1日之后的员工:
In [29]: earning_below_40k = employees["Salary"] < 40000
started_after_2015 = employees["Start Date"] > "2015-01-01"
要指定或条件,可以使用管道符号 |,表示只要满足其中一个条件即可:
In [30]: employees[earning_below_40k | started_after_2015].tail()
Out [30]: First Name Gender Start Date Salary Mgmt Team
958 Gloria Female 1987-10-24 39833 False Engineering
964 Bruce Male 1980-05-07 35802 True Sales
967 Thomas Male 2016-03-12 105681 False Engineering
989 Justin NaN 1991-02-10 38344 False Legal
1000 NaN NaN NaT 0 True NaN
索引位置958、964、989和1000的行符合第一个条件,而索引967的行符合第二个条件。请注意,同时满足这两个条件的行也将包含在内。
3.3 取反~
~ 符号用于取反,True值变为False,而False值变为True。下面是一个简单的例子:
In [31]: my_series = pd.Series([True, False, True])
my_series
Out [31]: 0 True
1 False
2 True
dtype: bool
In [32]: ~my_series
Out [32]: 0 False
1 True
2 False
dtype: bool
我们可以使用该符号将收入等于或超过100,000的员工的结果集取反,也就是筛选收入少于100,000的员工:
In [33]: employees[~(employees["Salary"] >= 100000)].head()
Out [33]: First Name Gender Start Date Salary Mgmt Team
0 Douglas Male 1993-08-06 0 True Marketing
1 Thomas Male 1996-03-31 61933 True NaN
6 Ruby Female 1987-08-17 65476 True Product
7 NaN Female 2015-07-20 45906 True Finance
8 Angela Female 2005-11-22 95570 True Engineering
3.4 布尔方法
对于那些偏爱于使用方法而不是数学运算符的人,可以使用下面的方法。请注意,category类型的值除了等于外不支持其它任何数学运算。
employees["Team"].eq("Marketing") // 等于
employees["Team"].ne("Marketing") // 不等于
employees["Salary"].lt(100000) // 小于
employees["Salary"].le(100000) // 小于或等于
employees["Salary"].gt(100000) // 大于
employees["Salary"].ge(100000) // 大于或等于
4. 其它筛选方法
某些筛选操作比简单的等于或不等于更为复杂,幸运的是,Pandas自带了其它筛选的方法。
4.1 isin方法
如果我们想筛选销售,法律或市场营销团队中的所有员工,可以创建三个单独的条件,并使用 | 分隔它们:
In [34]: sales = employees["Team"] == "Sales"
legal = employees["Team"] == "Legal"
mktg = employees["Team"] == "Marketing"
employees[sales | legal | mktg].head()
Out [34]: First Name Gender Start Date Salary Mgmt Team
0 Douglas Male 1993-08-06 0 True Marketing
5 Dennis Male 1987-04-18 115163 False Legal
11 Julie Female 1997-10-26 102508 True Legal
13 Gary Male 2008-01-27 109831 False Sales
20 Lois NaN 1995-04-22 64714 True Legal
这种方法虽然可行,但不灵活。更好的解决方案是使用isin方法,因为该方法支持列表:
In [35]: all_star_teams = ["Sales", "Legal", "Marketing"]
in_team = employees["Team"].isin(all_star_teams)
employees[in_team].head()
Out [35]: First Name Gender Start Date Salary Mgmt Team
0 Douglas Male 1993-08-06 0 True Marketing
5 Dennis Male 1987-04-18 115163 False Legal
11 Julie Female 1997-10-26 102508 True Legal
13 Gary Male 2008-01-27 109831 False Sales
20 Lois NaN 1995-04-22 64714 True Legal
4.2 between方法
另一个常见的操作(尤其是在处理数值数据时)是筛选范围内的值。例如,如果我们想筛选所有年薪在80,000到90,000之间的员工。我们同样可以创建两个单独的条件,并使用 & 分隔它们:
In [36]: higher_than_80 = employees["Salary"] >= 80000
lower_than_90 = employees["Salary"] < 90000
employees[higher_than_80 & lower_than_90].head()
Out [36]: First Name Gender Start Date Salary Mgmt Team
19 Donna Female 2010-07-22 81014 False Product
31 Joyce NaN 2005-02-20 88657 False Product
35 Theresa Female 2006-10-10 85182 False Sales
45 Roger Male 1980-04-17 88010 True Sales
54 Sara Female 2007-08-15 83677 False Engineering
但更好的解决方案是使用between方法。注意,它的第一个参数(下限)是包含的,而第二个参数(上限)是不包含的:
In [37]: between_80k_and_90k = employees["Salary"].between(80000, 90000)
employees[between_80k_and_90k].head()
Out [37]: First Name Gender Start Date Salary Mgmt Team
19 Donna Female 2010-07-22 81014 False Product
31 Joyce NaN 2005-02-20 88657 False Product
35 Theresa Female 2006-10-10 85182 False Sales
45 Roger Male 1980-04-17 88010 True Sales
54 Sara Female 2007-08-15 83677 False Engineering
between方法也适用于datetime类型的列。我们可以指定表示时间范围的开始和结束日期的字符串,参数名分别为left和right。下例是筛选从1980年代开始在公司工作的所有员工:
In [38]: eighties_folk = employees["Start Date"].between(
left = "1980-01-01",
right = "1990-01-01"
)
employees[eighties_folk].head()
Out [38]: First Name Gender Start Date Salary Mgmt Team
5 Dennis Male 1987-04-18 115163 False Legal
6 Ruby Female 1987-08-17 65476 True Product
10 Louise Female 1980-08-12 63241 True NaN
12 Brandon Male 1980-12-01 112807 True HR
17 Shawn Male 1986-12-07 111737 False Product
between方法也可以用于字符串列,例如筛选所有名字以R开头的员工。我们可以使用大写字母R开始作为包含下限,然后S作为非包含上限:
In [39]: name_starts_with_r = employees["First Name"].between("R", "S")
employees[name_starts_with_r].head()
Out [39]: First Name Gender Start Date Salary Mgmt Team
6 Ruby Female 1987-08-17 65476 True Product
36 Rachel Female 2009-02-16 142032 False Business Dev
45 Roger Male 1980-04-17 88010 True Sales
67 Rachel Female 1999-08-16 51178 True Finance
78 Robin Female 1983-06-04 114797 True Sales
4.3 isnull和notnull方法
我们的数据集包含大量的缺失值,我们可以在前五行中看到其中一些:
In [40]: employees.head()
Out [40]: First Name Gender Start Date Salary Mgmt Team
0 Douglas Male 1993-08-06 0 True Marketing
1 Thomas Male 1996-03-31 61933 True NaN
2 Maria Female NaT 130590 False Finance
3 Jerry NaN 2005-03-04 138705 True Finance
4 Larry Male 1998-01-24 101004 True IT
缺失值用NaN(not a number)表示,有一个例外是缺失的datetime值,用NaT(not a time)表示。
isnull方法检查给定列中的值是否为空:
In [41]: employees["Team"].isnull().head(2)
Out [41]: 0 False
1 True
2 False
Name: Team, dtype: bool
NaT值也将被视为空值:
In [42]: employees["Start Date"].isnull().head(3)
Out [42]: 0 False
1 False
2 True
Name: Start Date, dtype: bool
notnull方法与isnull方法相反,它检查给定列中的值是否不为空:
In [43]: employees["Team"].notnull().head(2)
Out [43]: 0 True
1 False
Name: Team, dtype: bool
也可以使用isnull方法和取反~符号返回相同的结果,但notnull方法更具描述性,因此建议使用。
In [44]: (~employees["Team"].isnull()).head(2)
Out [44]: 0 True
1 False
Name: Team, dtype: bool
然后,我们可以使用这些条件从DataFrame中筛选null或非null的行。例如筛选不属于任何团队的员工:
In [45]: no_team = employees["Team"].isnull()
employees[no_team].head()
Out [45]: First Name Gender Start Date Salary Mgmt Team
1 Thomas Male 1996-03-31 61933 True NaN
10 Louise Female 1980-08-12 63241 True NaN
23 NaN Male 2012-06-14 125792 True NaN
32 NaN Male 1998-08-21 122340 True NaN
91 James NaN 2005-01-26 128771 False NaN
筛选First Name不为null的员工:
In [46]: has_name = employees["First Name"].notnull()
employees[has_name].tail()
Out [46]: First Name Gender Start Date Salary Mgmt Team
995 Henry NaN 2014-11-23 132483 False Distribution
996 Phillip Male 1984-01-31 42392 False Finance
997 Russell Male 2013-05-20 96914 False Product
998 Larry Male 2013-04-20 60500 False Business Dev
999 Albert Male 2012-05-15 129949 True Sales
4.4 处理空值
让我们先重新导入CSV将数据集恢复到原始格式:
In [47]: employees = pd.read_csv("employees.csv",
parse_dates = ["Start Date"])
In [48]: employees
Out [48]: First Name Gender Start Date Salary Mgmt Team
0 Douglas Male 1993-08-06 NaN True Marketing
1 Thomas Male 1996-03-31 61933.0 True NaN
2 Maria Female NaT 130590.0 False Finance
3 Jerry NaN 2005-03-04 138705.0 True Finance
4 Larry Male 1998-01-24 101004.0 True IT
… … … … … … …
996 Phillip Male 1984-01-31 42392.0 False Finance
997 Russell Male 2013-05-20 96914.0 False Product
998 Larry Male 2013-04-20 60500.0 False Business Dev
999 Albert Male 2012-05-15 129949.0 True Sales
1000 NaN NaN NaT NaN NaN NaN
1001 rows x 6 columns
在之前我们介绍了如何使用fillna方法用常量替换缺失值,而默认地,dropna方法可以删除包含任何NaN值的行,无论该行有一个还是全部都是NaN值。
In [49]: employees.dropna()
Out [49]: First Name Gender Start Date Salary Mgmt Team
4 Larry Male 1998-01-24 101004.0 True IT
5 Dennis Male 1987-04-18 115163.0 False Legal
6 Ruby Female 1987-08-17 65476.0 True Product
8 Angela Female 2005-11-22 95570.0 True Engineering
9 Frances Female 2002-08-08 139852.0 True Business Dev
… … … … … … …
994 George Male 2013-06-21 98874.0 True Marketing
996 Phillip Male 1984-01-31 42392.0 False Finance
997 Russell Male 2013-05-20 96914.0 False Product
998 Larry Male 2013-04-20 60500.0 False Business Dev
999 Albert Male 2012-05-15 129949.0 True Sales
761 rows x 6 columns
不正确地导出数据集通常包含空白行。我们可以把参数how设为all值,以删除所有值都为NaN或NaT的行,该数据集只有最后一行满足此条件。
In [50]: employees.dropna(how = "all").tail()
Out [50]: First Name Gender Start Date Salary Mgmt Team
995 Henry NaN 2014-11-23 132483.0 False Distribution
996 Phillip Male 1984-01-31 42392.0 False Finance
997 Russell Male 2013-05-20 96914.0 False Product
998 Larry Male 2013-04-20 60500.0 False Business Dev
999 Albert Male 2012-05-15 129949.0 True Sales
subset参数用于删除指定列中存在缺少值的行。下例删除在Gender列中存在缺失值的行:
In [51]: employees.dropna(subset = ["Gender"]).tail()
Out [51]: First Name Gender Start Date Salary Mgmt Team
994 George Male 2013-06-21 98874.0 True Marketing
996 Phillip Male 1984-01-31 42392.0 False Finance
997 Russell Male 2013-05-20 96914.0 False Product
998 Larry Male 2013-04-20 60500.0 False Business Dev
999 Albert Male 2012-05-15 129949.0 True Sales
subset参数还支持列表。如果某行在任何一个指定列中存在缺少值,则该行会被删除。下例删除Start Date列或Salary列中存在缺失值的行:
In [52]: employees.dropna(subset = ["Start Date", "Salary"]).head()
Out [52]: First Name Gender Start Date Salary Mgmt Team
1 Thomas Male 1996-03-31 61933.0 True NaN
3 Jerry NaN 2005-03-04 138705.0 True Finance
4 Larry Male 1998-01-24 101004.0 True IT
5 Dennis Male 1987-04-18 115163.0 False Legal
6 Ruby Female 1987-08-17 65476.0 True Product
thresh参数指定存在非空值的列的最小列数,如果存在非空值的列数小于该值,那么该列会被删除。下例删除存在非空值列数小于4的行:
In [53]: employees.dropna(how = "any", thresh = 4).head()
Out [53]: First Name Gender Start Date Salary Mgmt Team
0 Douglas Male 1993-08-06 NaN True Marketing
1 Thomas Male 1996-03-31 61933.0 True NaN
2 Maria Female NaT 130590.0 False Finance
3 Jerry NaN 2005-03-04 138705.0 True Finance
4 Larry Male 1998-01-24 101004.0 True IT
5. 处理重复项
Pandas提供了几种方法来识别数据集中的重复值和唯一值。
5.1 duplicated方法
duplicated方法用于识别之前是否有和当前值重复的值。在下例中,Team列在索引位置2和3的值为Finance。duplicated方法会把索引2第一次出现的Finance值标记为非重复,用(False)表示,并把所有后续出现Finance值标记为重复(True)。
In [54]: employees["Team"].head()
Out [54]: 0 Marketing
1 NaN
2 Finance
3 Finance
4 IT
Name: Team, dtype: object
In [55]: employees["Team"].duplicated().head() // 和下行代码一样
employees["Team"].duplicated(keep = "first").head()
Out [55]: 0 False
1 False
2 False
3 True
4 False
Name: Team, dtype: bool
我们也可以使用参数keep = "last" 设置把列中最后一次出现的值标记为非重复,其默认值为"first":
In [56]: employees["Team"].duplicated(keep = "last")
Out [56]: 0 True
1 True
2 True
3 True
4 True
...
996 False
997 False
998 False
999 False
1000 False
Name: Team, Length: 1001, dtype: bool
如果我们想从每个团队中只选出一名员工,那么可以使用duplicated方法,它会标记第一次出现的值为False,我们只要把结果去反就可以标记第一次出现的值为True:
In [57]: (~employees["Team"].duplicated()).head()
Out [57]: 0 True
1 True
2 True
3 False
4 True
Name: Team, dtype: bool
现在,我们可以实现从每个团队中只选出一名员工。注意,NaN值的行也会被返回:
In [58]: first_one_in_team = ~employees["Team"].duplicated()
employees[first_one_in_team]
Out [58]: First Name Gender Start Date Salary Mgmt Team
0 Douglas Male 1993-08-06 NaN True Marketing
1 Thomas Male 1996-03-31 61933.0 True NaN
2 Maria Female NaT 130590.0 False Finance
4 Larry Male 1998-01-24 101004.0 True IT
5 Dennis Male 1987-04-18 115163.0 False Legal
6 Ruby Female 1987-08-17 65476.0 True Product
8 Angela Female 2005-11-22 95570.0 True Engineering
9 Frances Female 2002-08-08 139852.0 True Business Dev
12 Brandon Male 1980-12-01 112807.0 True HR
13 Gary Male 2008-01-27 109831.0 False Sales
40 Michael Male 2008-10-10 99283.0 True Distribution
5.2 drop_duplicates方法
drop_duplicates方法提供了方便的快捷方式来实现上述操作。默认地,它将删除与先前行所有值都相等的行。由于每行中6个值的组合在我们的数据集中是唯一的,因此在这里使用默认参数没有任何作用。
subset参数可以指定列的列表,其值将用于确定行的唯一性。下例再次实现从每个团队中只选出一名员工,也就是删除Team列中与之前的值相等的行:
In [59]: employees.drop_duplicates(subset = ["Team"])
Out [59]: First Name Gender Start Date Salary Mgmt Team
0 Douglas Male 1993-08-06 NaN True Marketing
1 Thomas Male 1996-03-31 61933.0 True NaN
2 Maria Female NaT 130590.0 False Finance
4 Larry Male 1998-01-24 101004.0 True IT
5 Dennis Male 1987-04-18 115163.0 False Legal
6 Ruby Female 1987-08-17 65476.0 True Product
8 Angela Female 2005-11-22 95570.0 True Engineering
9 Frances Female 2002-08-08 139852.0 True Business Dev
12 Brandon Male 1980-12-01 112807.0 True HR
13 Gary Male 2008-01-27 109831.0 False Sales
40 Michael Male 2008-10-10 99283.0 True Distribution
drop_duplicates方法也有一个keep参数,默认值为"first",我们可以把它设为"last"以保留列中最后一次出现的值的行:
In [60]: employees.drop_duplicates(subset = ["Team"], keep = "last")
Out [60]: First Name Gender Start Date Salary Mgmt Team
988 Alice Female 2004-10-05 47638.0 False HR
989 Justin NaN 1991-02-10 38344.0 False Legal
990 Robin Female 1987-07-24 100765.0 True IT
993 Tina Female 1997-05-15 56450.0 True Engineering
994 George Male 2013-06-21 98874.0 True Marketing
995 Henry NaN 2014-11-23 132483.0 False Distribution
996 Phillip Male 1984-01-31 42392.0 False Finance
997 Russell Male 2013-05-20 96914.0 False Product
998 Larry Male 2013-04-20 60500.0 False Business Dev
999 Albert Male 2012-05-15 129949.0 True Sales
1000 NaN NaN NaT NaN NaN NaN
keep参数还有另外一个值False,它用于删除所有具有重复值的行。下例删除First Name列中的值不是唯一的行:
In [61]: employees.drop_duplicates(subset = ["First Name"], keep = False)
Out [61]: First Name Gender Start Date Salary Mgmt Team
5 Dennis Male 1987-04-18 115163.0 False Legal
8 Angela Female 2005-11-22 95570.0 True Engineering
33 Jean Female 1993-12-18 119082.0 False Business Dev
190 Carol Female 1996-03-19 57783.0 False Finance
291 Tammy Female 1984-11-11 132839.0 True IT
495 Eugene Male 1984-05-24 81077.0 False Sales
688 Brian Male 2007-04-07 93901.0 True Legal
832 Keith Male 2003-02-12 120672.0 False Legal
887 David Male 2009-12-05 92242.0 False Legal
如果subset参数指定多个列,则会使用这两列中的值组合来确定重复值。例如筛选First Name和Gender列的组合值是唯一的行:
In [62]: employees.drop_duplicates(subset = ["First Name", "Gender"]).head()
Out [62]: First Name Gender Start Date Salary Mgmt Team
0 Douglas Male 1993-08-06 NaN True Marketing
1 Thomas Male 1996-03-31 61933.0 True NaN
2 Maria Female NaT 130590.0 False Finance
3 Jerry NaN 2005-03-04 138705.0 True Finance
4 Larry Male 1998-01-24 101004.0 True IT
END O(∩_∩)O