pandas Indexing (2): Multi-level Indexes

Multi-level indexes

Multi-level indexes and the structure of their tables

In [67]: np.random.seed(0)

In [68]: multi_index = pd.MultiIndex.from_product([list('ABCD'),
   ....:               df.Gender.unique()], names=('School', 'Gender'))
   ....: 

In [69]: multi_column = pd.MultiIndex.from_product([['Height', 'Weight'],
   ....:                df.Grade.unique()], names=('Indicator', 'Grade'))
   ....: 

In [70]: df_multi = pd.DataFrame(np.c_[(np.random.randn(8,4)*5 + 163).tolist(),
   ....:                               (np.random.randn(8,4)*5 + 65).tolist()],
   ....:                         index = multi_index,
   ....:                         columns = multi_column).round(1)
   ....: 

In [71]: df_multi
Out[71]: 
Indicator       Height                           Weight                        
Grade         Freshman Senior Sophomore Junior Freshman Senior Sophomore Junior
School Gender                                                                  
A      Female    171.8  165.0     167.9  174.2     60.6   55.1      63.3   65.8
       Male      172.3  158.1     167.8  162.2     71.2   71.0      63.1   63.5
B      Female    162.5  165.1     163.7  170.3     59.8   57.9      56.5   74.8
       Male      166.8  163.6     165.2  164.7     62.5   62.8      58.7   68.9
C      Female    170.5  162.0     164.6  158.7     56.9   63.9      60.5   66.9
       Male      150.2  166.3     167.3  159.3     62.4   59.1      64.9   67.1
D      Female    174.3  155.7     163.2  162.1     65.3   66.5      61.8   63.2
       Male      170.7  170.3     163.8  164.9     61.6   63.2      60.9   56.4

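Like a single-level index, a MultiIndex has a names attribute. In the table above, School and Gender are the names of the first and second levels of the row index, and Indicator and Grade are the names of the first and second levels of the column index.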

The names and the values of an index are available through the names and values attributes, respectively:

df_multi.index.names
df_multi.columns.names
df_multi.index.values
df_multi.columns.values

get_level_values returns the labels of a single level of the index:

df_multi.index.get_level_values(0)
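As a minimal sketch of these attributes, the snippet below builds a small two-level index by hand and reads its names, its values, and the labels of each level. The index mi and its labels are illustrative only; they are not the df_multi table shown above.

```python
import pandas as pd

# Hypothetical two-level index, built only for illustration
mi = pd.MultiIndex.from_product([['A', 'B'], ['Female', 'Male']],
                                names=('School', 'Gender'))

level_names = list(mi.names)    # one name per level
tuples = list(mi.values)        # one tuple per row of the index
schools = list(mi.get_level_values(0))          # first level, addressed by position
genders = list(mi.get_level_values('Gender'))   # second level, addressed by name

print(level_names)
print(tuples)
print(schools)
print(genders)
```

Note that get_level_values returns one label per row, so labels of the outer level are repeated rather than deduplicated.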

The loc indexer with multi-level indexes

# Set School and Grade as the index
df_multi = df.set_index(['School', 'Grade'])

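Each element of a multi-level index is a tuple, so loc takes tuples where it would otherwise take scalars. Sort the index first to avoid a performance warning: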

df_multi = df_multi.sort_index()
df_multi.loc[('Fudan University', 'Junior')].head()
df_multi.loc[[('Fudan University', 'Senior'),
              ('Shanghai Jiao Tong University', 'Freshman')]].head()
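The reason for sorting can be sketched as follows: on an unsorted MultiIndex, label lookups may raise a pandas PerformanceWarning about indexing past the lexsort depth. The three-row table below is hypothetical (it is not the student data); it only shows how to check whether an index is sorted and how sort_index fixes it.

```python
import pandas as pd

# Hypothetical table whose index is deliberately out of order
df_unsorted = pd.DataFrame(
    {'Weight': [60, 70, 65]},
    index=pd.MultiIndex.from_tuples(
        [('B', 'y'), ('A', 'x'), ('A', 'y')], names=('School', 'Grade')),
)

was_sorted = df_unsorted.index.is_monotonic_increasing   # index is not sorted yet

df_sorted = df_unsorted.sort_index()                     # sorts by all levels, outermost first
now_sorted = df_sorted.index.is_monotonic_increasing

print(was_sorted, now_sorted)
print(df_sorted)
```

Checking is_monotonic_increasing before indexing is a cheap way to decide whether a sort_index call is needed.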

# Boolean list (here, a boolean Series)
df_multi.loc[df_multi.Weight > 70].head()

# Callable: the lambda receives the DataFrame and returns the indexer
df_multi.loc[lambda x:('Fudan University','Junior')].head()

# Cross-product selection across levels, in the form [(level_0_list, level_1_list), cols]
# All sophomores and juniors at Peking University and Fudan University
df_multi.loc[(['Peking University', 'Fudan University'],
              ['Sophomore', 'Junior']), :]

# Juniors at Peking University and sophomores at Fudan University
df_multi.loc[[('Peking University', 'Junior'),
              ('Fudan University', 'Sophomore')]]
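The difference between a tuple of lists and a list of tuples is easy to miss, so here is a minimal sketch on a hypothetical six-row stand-in for the student table (the rows and weights are made up for illustration). A tuple of lists selects the cross product of the listed labels, while a list of tuples selects exactly the listed pairs.

```python
import pandas as pd

# Hypothetical stand-in for the student table
df_demo = pd.DataFrame({
    'School': ['Fudan University', 'Fudan University', 'Fudan University',
               'Peking University', 'Peking University',
               'Shanghai Jiao Tong University'],
    'Grade': ['Junior', 'Junior', 'Sophomore', 'Junior', 'Sophomore', 'Freshman'],
    'Weight': [70.0, 58.0, 55.0, 62.0, 80.0, 66.0],
}).set_index(['School', 'Grade']).sort_index()

# A single tuple selects every row carrying that (School, Grade) pair
junior_fudan = df_demo.loc[('Fudan University', 'Junior'), :]

# Tuple of lists: every combination of the listed schools and grades
cross = df_demo.loc[(['Peking University', 'Fudan University'],
                     ['Sophomore', 'Junior']), :]

# List of tuples: only the two listed pairs
pairs = df_demo.loc[[('Peking University', 'Junior'),
                     ('Fudan University', 'Sophomore')]]

print(len(junior_fudan), len(cross), len(pairs))
```

In this sketch the cross-product form returns five rows and the explicit-pairs form returns two, which makes the two spellings easy to tell apart.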

The IndexSlice object

In [90]: np.random.seed(0)

In [91]: L1,L2 = ['A','B','C'],['a','b','c']

In [92]: mul_index1 = pd.MultiIndex.from_product([L1,L2],names=('Upper', 'Lower'))

In [93]: L3,L4 = ['D','E','F'],['d','e','f']

In [94]: mul_index2 = pd.MultiIndex.from_product([L3,L4],names=('Big', 'Small'))

In [95]: df_ex = pd.DataFrame(np.random.randint(-9,10,(9,9)),
   ....:                     index=mul_index1,
   ....:                     columns=mul_index2)
   ....: 

In [96]: df_ex
Out[96]: 
Big          D        E        F      
Small        d  e  f  d  e  f  d  e  f
Upper Lower                           
A     a      3  6 -9 -6 -6 -2  0  9 -5
      b     -3  3 -8 -3 -2  5  8 -4  4
      c     -1  0  7 -4  6  6 -9  9 -6
B     a      8  5 -2 -9 -8  0 -9  1 -6
      b      2  9 -7 -9 -9 -5 -4 -3 -1
      c      8  6 -5  0  1 -8 -8 -2  0
C     a     -6 -3  2  5  9 -9  5 -6  3
      b      1  2 -5 -3 -5  6 -6  3 -5
      c     -1  5  6 -6  6  4  7  8 -4
# Define a shorthand for pd.IndexSlice
idx = pd.IndexSlice

loc[idx[*,*]]

This form cannot slice the individual levels separately. The first * selects rows and the second * selects columns.

In [98]: df_ex.loc[idx['C':, ('D', 'f'):]]
Out[98]: 
Big          D  E        F      
Small        f  d  e  f  d  e  f
Upper Lower                     
C     a      2  5  9 -9  5 -6  3
      b     -5 -3 -5  6 -6  3 -5
      c      6 -6  6  4  7  8 -4

Indexing with a boolean Series:

df_ex.loc[idx[:'A', lambda x:x.sum()>0]] # columns whose sum is greater than 0
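To make the single-idx form checkable by hand, the sketch below repeats both selections on a frame with the same index layout as df_ex but with deterministic values. The frame demo and its arange-based contents are assumptions made for illustration, not the random data above.

```python
import numpy as np
import pandas as pd

rows = pd.MultiIndex.from_product([['A', 'B', 'C'], ['a', 'b', 'c']],
                                  names=('Upper', 'Lower'))
cols = pd.MultiIndex.from_product([['D', 'E', 'F'], ['d', 'e', 'f']],
                                  names=('Big', 'Small'))
# Hypothetical values -40..40, so the cell at row i, column j holds 9*i + j - 40
demo = pd.DataFrame(np.arange(81).reshape(9, 9) - 40, index=rows, columns=cols)

idx = pd.IndexSlice

# Rows from 'C' onward, columns from ('D', 'f') onward
tail = demo.loc[idx['C':, ('D', 'f'):]]

# Rows up to and including 'A', columns whose total over all rows is positive
positive = demo.loc[idx[:'A', lambda x: x.sum() > 0]]

print(tail.shape)
print(list(positive.columns))
```

The callable receives the whole frame, so x.sum() is computed over all nine rows, not only over the rows kept by the row slice.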

loc[idx[*,*],idx[*,*]]

The first idx addresses the row index and the second addresses the column index, so each level can be sliced separately.

df_ex.loc[idx[:'A', 'b':], idx['E':, 'e':]]
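A minimal sketch of the two-idx form, again on a hypothetical frame with deterministic values (demo is an assumption for illustration, with the same index layout as df_ex):

```python
import numpy as np
import pandas as pd

rows = pd.MultiIndex.from_product([['A', 'B', 'C'], ['a', 'b', 'c']],
                                  names=('Upper', 'Lower'))
cols = pd.MultiIndex.from_product([['D', 'E', 'F'], ['d', 'e', 'f']],
                                  names=('Big', 'Small'))
# Hypothetical values: the cell at row i, column j holds 9*i + j - 40
demo = pd.DataFrame(np.arange(81).reshape(9, 9) - 40, index=rows, columns=cols)

idx = pd.IndexSlice

# Rows: Upper up to 'A' and Lower from 'b'; columns: Big from 'E' and Small from 'e'
block = demo.loc[idx[:'A', 'b':], idx['E':, 'e':]]

print(block)
```

Each level is sliced on its own here: the rows kept are (A, b) and (A, c), and the columns kept are (E, e), (E, f), (F, e), and (F, f).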

Constructing multi-level indexes

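The three commonly used constructors are from_tuples, from_arrays, and from_product, all of which are class methods of pd.MultiIndex. from_tuples builds the index from a list of tuples: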

In [101]: my_tuple = [('a','cat'),('a','dog'),('b','cat'),('b','dog')]

In [102]: pd.MultiIndex.from_tuples(my_tuple, names=['First','Second'])
Out[102]: 
MultiIndex([('a', 'cat'),
            ('a', 'dog'),
            ('b', 'cat'),
            ('b', 'dog')],
           names=['First', 'Second'])

from_arrays builds the index from a list of arrays, one array per level:

In [103]: my_array = [list('aabb'), ['cat', 'dog']*2]

In [104]: pd.MultiIndex.from_arrays(my_array, names=['First','Second'])
Out[104]: 
MultiIndex([('a', 'cat'),
            ('a', 'dog'),
            ('b', 'cat'),
            ('b', 'dog')],
           names=['First', 'Second'])

from_product builds the index from the Cartesian product of the given lists:

In [105]: my_list1 = ['a','b']

In [106]: my_list2 = ['cat','dog']

In [107]: pd.MultiIndex.from_product([my_list1,
   .....:                             my_list2],
   .....:                            names=['First','Second'])
   .....: 
Out[107]: 
MultiIndex([('a', 'cat'),
            ('a', 'dog'),
            ('b', 'cat'),
            ('b', 'dog')],
           names=['First', 'Second'])
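Since the three sessions above print the same index, a short sketch can confirm that the constructors are interchangeable for this input. The variable names below are chosen for illustration.

```python
import pandas as pd

names = ['First', 'Second']

by_tuples = pd.MultiIndex.from_tuples(
    [('a', 'cat'), ('a', 'dog'), ('b', 'cat'), ('b', 'dog')], names=names)
by_arrays = pd.MultiIndex.from_arrays(
    [list('aabb'), ['cat', 'dog'] * 2], names=names)
by_product = pd.MultiIndex.from_product(
    [['a', 'b'], ['cat', 'dog']], names=names)

# equals compares the labels of the two indexes element by element
same = by_tuples.equals(by_arrays) and by_arrays.equals(by_product)
print(same)
```

As a rule of thumb, from_product is the shortest spelling when every combination is wanted, while from_tuples and from_arrays also cover index sets that are not a full Cartesian product.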

Reference

Datawhale, 12th session: pandas
