Datawhale-Pandas-Task02-索引【更新中】

索引在pandas中是非常重要的概念,但在R中这个概念就很弱,甚至在R的DataFrame中尽量避免使用行名。

索引器

Series的索引

R没有类似的功能

# Python
s = pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'a', 'a', 'a', 'c'])
s['a']
s[['a', 'b']]
# Python
s = pd.Series(['a', 'b', 'c', 'd', 'e', 'f'], index=[1, 3, 1, 2, 5, 4])
s[1]
s[[2, 3]]

列索引

对DataFrame数据的操作,pandas的[][[]]返回的类型与R正好相反。

df = pd.read_csv('data/learn_pandas.csv', usecols = ['School', 'Grade', 'Name', 'Gender', 'Weight', 'Transfer'])
df['Name'].head()
df <- read.csv('data/learn_pandas.csv', stringsAsFactors = FALSE)
head(df[['Name']])

pandas可以通过.列名获取单列数据,跟['列名']结果一致,但列名不能有空格。

R语言也有类似的功能,不过使用的是$列名,跟[['列名']]结果一致,同样不能有空格。特别注意,R的$列名是模糊匹配,[['列名', exact = FALSE]]也可以实现模糊匹配。

# Python
df.Name.head()
# R
head(df$Name)
head(df$Na)

loc索引器

pandas的loc在行的索引功能,R基本是不具备的。

但其他方面根据字段信息选择行,跟基础的R是一致的。

R的行名相当于pandas的索引,但R的行名功能很局限,只有索引的部分功能,主要是因为R的行名不能重复。

# Python
df_demo = df.set_index('Name')
# R
df_demo <- df
rownames(df_demo) <- df_demo$Name # 报错,因为行名不能重复
# Python
df_demo.loc['Qiang Sun']
df_demo.loc['Qiang Sun', 'School'] # 返回series
df_demo.loc['Quan Zhao', 'School'] # 返回单个元素

df_demo.loc[['Qiang Sun','Quan Zhao'], ['School','Gender']]
df_demo.loc['Gaojuan You':'Gaoqiang Qian', 'School':'Gender']
# R
# Python
df_demo.loc[df_demo.Weight > 70].head()

df_demo.loc[df_demo.Grade.isin(['Freshman', 'Senior'])].head()

# R
head(df_demo[df_demo$Weight > 70, ])

iloc索引器

iloc是根据位置筛选,这部分跟R完全一致,而且R更加简洁。

# python
df_demo.iloc[1, 1] # 第二行第二列

df_demo.iloc[[0, 1], [0, 1]] # 前两行前两列

df_demo.iloc[1:4, 2:4] # 切片不包含结束端点

df_demo.iloc[lambda x: slice(1, 4)] # 传入切片为返回值的函数

df_demo.iloc[(df_demo.Weight > 80).values]

df_demo.School.iloc[1]

df_demo.School.iloc[1:5:2]
# R
df_demo[2, 2]

df_demo[c(1, 2), c(1, 2)]

df_demo[1:4, 2:4]

df_demo[2:4, ]

df_demo[df_demo$Weight > 80, ]

df_demo$School[1]

query方法

query操作起来像写SQL的where条件一样,这个方法在保证可读性的基础上,降低了代码量。

R的dplyr包更加简洁、可读。

# Python
df.query('((School == "Fudan University")&'
         ' (Grade == "Senior")&'
         ' (Weight > 70))|'
         '((School == "Peking University")&'
         ' (Grade != "Senior")&'
         ' (Weight > 80))')
df %>% 
  filter(
    (School == "Fudan University"
     & Grade == "Senior"
     & Weight > 70)
    | (School == "Peking University"
       & Grade != "Senior"
       & Weight > 80)
  )
# Python
df.query('Weight > Weight.mean()').head()
# R
df %>% 
  filter(
    Weight > mean(Weight, na.rm = TRUE)
  ) %>% 
  head()

R没有直接not in的函数,需要!加一层判断。

pandas中==!=分别表示元素出现在列表和没有出现在列表,等价于innot in。R又有完全一样的功能。

# Python
df.query('(Grade not in ["Freshman", "Sophomore"]) and (Gender == "Male")').head()

df.query('Grade == ["Junior", "Senior"]').head()
# R
df %>% 
  filter(
    !(Grade %in% c("Freshman", "Sophomore")) & Gender == "Male"
  ) %>% 
  head()

df %>% 
  filter(Grade == c("Junior", "Senior")) %>% 
  head()
# Python
low, high = 70, 80
df.query('Weight.between(@low, @high)').head()
# R
df %>% 
  filter(
    between(Weight, 70, 80)
  ) %>% 
  head()

随机抽样

pandas和R的sample的功能基本一致。

DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None)

# Python
df_sample = pd.DataFrame({'id': list('abcde'), 'value': [1, 2, 3, 4, 90]})

df_sample.sample(3, replace = True, weights = df_sample.value)

sample(x, size, replace = FALSE, prob = NULL)

sample_n(tbl, size, replace = FALSE, weight = NULL, .env = NULL, ...)

sample_frac(tbl, size = 1, replace = FALSE, weight = NULL, .env = NULL, ...)

# R
df_sample <- data.frame(id = c('a', 'b', 'c', 'd', 'e'), value = c(1, 2, 3, 4, 90), stringsAsFactors = FALSE)

sample_n(df_sample, 3)

多级索引

索引的常用方法

索引运算

练习

练习

Ex1:公司员工数据集

df = pd.read_csv('data/company.csv')
df.head(3)
df <- read.csv('data/company.csv', stringsAsFactors = FALSE)
head(df)
  1. 分别只使用queryloc选出年龄不超过四十岁且工作部门为DairyBakery的男性。
df.query("age <= 40 & department in ['Dairy', 'Bakery'] & gender == 'M'")

df.loc[(df.age <= 40) & (df.department.isin(['Dairy', 'Bakery'])) & (df.gender == 'M')]
df %>% 
  filter(age <= 40 & department %in% c('Dairy', 'Bakery') & gender == 'M')

df[df$age <= 40 & df$department %in% c('Dairy', 'Bakery') & df$gender == 'M', ]
  1. 选出员工ID号 为奇数所在行的第1、第3和倒数第2列。
df.query("EmployeeID % 2 == 1").iloc[:, [0, 2, -2]]

df.iloc[(df.EmployeeID % 2 == 1).values, [0, 2, -2]]
df %>% 
  filter(EmployeeID %% 2 == 1) %>% 
  select(1, 3, last_col(1))

df[df$EmployeeID %% 2 == 1, c(1, 3, ncol(df) - 1)]
  1. 按照以下步骤进行索引操作:
  • 把后三列设为索引后交换内外两层
  • 恢复中间一层
  • 修改外层索引名为Gender
  • 用下划线合并两层行索引
  • 把行索引拆分为原状态
  • 修改索引名为原表名称
  • 恢复默认索引并将列保持为原表的相对位置
df2 = df.copy()
df2 = df2.set_index(df2.columns[-3: ].tolist()).swaplevel(0, 2, axis=0)
df2 = df2.reset_index(level = 1)
df2 = df2.rename_axis(index = {'gender': 'Gender'})
df2.index = df2.index.map(lambda x: '_'.join(x))
df2.index = df2.index.map(lambda x: tuple(x.split('_')))
df2 = df2.rename_axis(index = ['gender', 'department'])
df2 = df2.reset_index().reindex(df.columns, axis = 1)
df2.equals(df)

Ex2:巧克力数据集

# Python
df = pd.read_csv('data/chocolate.csv')

R的read.csv会讲\n自动换为.

# R
df <- read.csv('data/chocolate.csv', stringsAsFactors = FALSE, check.names = FALSE)
  1. 把列索引名中的\n替换为空格。
# Python
df.columns = [i.replace("\n", " ") for i in df.columns]
# R
names(df) <- str_replace(names(df), '\n', ' ')
  1. 巧克力Rating评分为1至5,每0.25分一档,请选出2.75分及以下且可可含量Cocoa Percent高于中位数的样本。
# Python
# R
  1. Review DateCompany Location设为索引后,选出Review Date在2012年之后且Company Location不属于France, Canada, Amsterdam, Belgium的样本。

参考资料

总结

上一篇:AD设置过孔盖油


下一篇:盘中打开涨停板