Multi-level indexes
Multi-level indexes and the structure of their tables
In [67]: np.random.seed(0)
In [68]: multi_index = pd.MultiIndex.from_product([list('ABCD'),
....: df.Gender.unique()], names=('School', 'Gender'))
....:
In [69]: multi_column = pd.MultiIndex.from_product([['Height', 'Weight'],
....: df.Grade.unique()], names=('Indicator', 'Grade'))
....:
In [70]: df_multi = pd.DataFrame(np.c_[(np.random.randn(8,4)*5 + 163).tolist(),
....: (np.random.randn(8,4)*5 + 65).tolist()],
....: index = multi_index,
....: columns = multi_column).round(1)
....:
In [71]: df_multi
Out[71]:
Indicator       Height                           Weight
Grade         Freshman Senior Sophomore Junior Freshman Senior Sophomore Junior
School Gender
A      Female    171.8  165.0     167.9  174.2     60.6   55.1      63.3   65.8
       Male      172.3  158.1     167.8  162.2     71.2   71.0      63.1   63.5
B      Female    162.5  165.1     163.7  170.3     59.8   57.9      56.5   74.8
       Male      166.8  163.6     165.2  164.7     62.5   62.8      58.7   68.9
C      Female    170.5  162.0     164.6  158.7     56.9   63.9      60.5   66.9
       Male      150.2  166.3     167.3  159.3     62.4   59.1      64.9   67.1
D      Female    174.3  155.7     163.2  162.1     65.3   66.5      61.8   63.2
       Male      170.7  170.3     163.8  164.9     61.6   63.2      60.9   56.4
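Like a single-level index, a MultiIndex has a names attribute. In the output above, School and Gender are the names of the first and second levels of the row index, and Indicator and Grade are the names of the first and second levels of the column index.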
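To make the structure concrete, here is a minimal sketch on a hypothetical miniature of the table above. The frame `df_small` and its values are assumptions for illustration only, not the tutorial's data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of df_multi: 2 schools x 2 genders on the rows,
# 2 indicators x 2 grades on the columns.
rows = pd.MultiIndex.from_product(
    [["A", "B"], ["Female", "Male"]], names=("School", "Gender")
)
cols = pd.MultiIndex.from_product(
    [["Height", "Weight"], ["Freshman", "Senior"]], names=("Indicator", "Grade")
)
df_small = pd.DataFrame(np.arange(16).reshape(4, 4), index=rows, columns=cols)

# Both axes carry a two-level MultiIndex, each level with its own name.
print(df_small.index.nlevels, list(df_small.index.names))
print(df_small.columns.nlevels, list(df_small.columns.names))
print(df_small.shape)
```

The shape counts one row per (School, Gender) pair and one column per (Indicator, Grade) pair.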
The names and the values of an index are available through the names and values attributes:
df_multi.index.names
df_multi.columns.names
df_multi.index.values
df_multi.columns.values
get_level_values returns the index values of a single level:
df_multi.index.get_level_values(0)
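As a rough illustration of what these attributes return, the sketch below uses a small stand-alone index. The variable name `index` and its contents are assumptions for the example.

```python
import pandas as pd

# Hypothetical two-level index, built only for this illustration.
index = pd.MultiIndex.from_product(
    [["A", "B"], ["Female", "Male"]], names=("School", "Gender")
)

# values is a one-dimensional object array whose elements are tuples.
print(index.values)

# A level can be requested by position or by name.
schools = index.get_level_values(0)
genders = index.get_level_values("Gender")
print(list(schools))
print(list(genders))
```

Note that get_level_values returns one entry per row, so repeated labels appear as many times as they occur.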
The loc indexer on a multi-level index
# Set School and Grade as the index
df_multi = df.set_index(['School', 'Grade'])
# Sort the index first to avoid a performance warning
df_multi = df_multi.sort_index()
df_multi.loc[('Fudan University', 'Junior')].head()
df_multi.loc[[('Fudan University', 'Senior'),
('Shanghai Jiao Tong University', 'Freshman')]].head()
# Boolean list
df_multi.loc[df_multi.Weight > 70].head()
# Function (lambda) that returns a valid key
df_multi.loc[lambda x:('Fudan University','Junior')].head()
# Cross-selection across levels, in the form [(level_0_list, level_1_list), cols]
# All sophomores and juniors at Peking University and Fudan University
df_multi.loc[(['Peking University', 'Fudan University'],
['Sophomore', 'Junior']), :]
# Juniors at Peking University and sophomores at Fudan University
df_multi.loc[[('Peking University', 'Junior'),
('Fudan University', 'Sophomore')]]
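Because the student table `df` is not defined in this section, the following sketch repeats the same selections on a small made-up table. The school names are shortened and every value is an assumption; only the indexing patterns mirror the calls above.

```python
import pandas as pd

# Hypothetical student records standing in for the tutorial's df.
df = pd.DataFrame({
    "School": ["Fudan", "Fudan", "Fudan", "Peking", "Peking", "Peking", "Fudan"],
    "Grade": ["Junior", "Junior", "Sophomore", "Junior", "Sophomore", "Senior", "Senior"],
    "Weight": [60.0, 75.0, 72.0, 55.0, 68.0, 49.0, 80.0],
})
unsorted = df.set_index(["School", "Grade"])
df_small = unsorted.sort_index()

# Sorting makes the index monotonic, which is what avoids the warning.
print(unsorted.index.is_monotonic_increasing, df_small.index.is_monotonic_increasing)

# One tuple: a single (School, Grade) combination.
exact = df_small.loc[("Fudan", "Junior"), :]

# List of tuples: exactly the listed combinations.
pairs = df_small.loc[[("Fudan", "Senior"), ("Peking", "Junior")], :]

# Tuple of lists: every combination of the listed schools and grades.
cross = df_small.loc[(["Peking", "Fudan"], ["Sophomore", "Junior"]), :]

# Boolean mask on a column.
heavy = df_small.loc[df_small.Weight > 70]
```

The difference between a list of tuples and a tuple of lists is the point to watch: the former picks the named pairs only, the latter takes the full cross product.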
The IndexSlice object
In [90]: np.random.seed(0)
In [91]: L1,L2 = ['A','B','C'],['a','b','c']
In [92]: mul_index1 = pd.MultiIndex.from_product([L1,L2],names=('Upper', 'Lower'))
In [93]: L3,L4 = ['D','E','F'],['d','e','f']
In [94]: mul_index2 = pd.MultiIndex.from_product([L3,L4],names=('Big', 'Small'))
In [95]: df_ex = pd.DataFrame(np.random.randint(-9,10,(9,9)),
....: index=mul_index1,
....: columns=mul_index2)
....:
In [96]: df_ex
Out[96]:
Big          D        E        F
Small        d  e  f  d  e  f  d  e  f
Upper Lower
A     a      3  6 -9 -6 -6 -2  0  9 -5
      b     -3  3 -8 -3 -2  5  8 -4  4
      c     -1  0  7 -4  6  6 -9  9 -6
B     a      8  5 -2 -9 -8  0 -9  1 -6
      b      2  9 -7 -9 -9 -5 -4 -3 -1
      c      8  6 -5  0  1 -8 -8 -2  0
C     a     -6 -3  2  5  9 -9  5 -6  3
      b      1  2 -5 -3 -5  6 -6  3 -5
      c     -1  5  6 -6  6  4  7  8 -4
# Define a shorthand for pd.IndexSlice
idx = pd.IndexSlice
The loc[idx[*,*]] form
This form cannot slice the individual levels separately. The first * selects rows and the second * selects columns.
In [98]: df_ex.loc[idx['C':, ('D', 'f'):]]
Out[98]:
Big          D  E        F
Small        f  d  e  f  d  e  f
Upper Lower
C     a      2  5  9 -9  5 -6  3
      b     -5 -3 -5  6 -6  3 -5
      c      6 -6  6  4  7  8 -4
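The same slice can be checked on a frame with predictable values. The sketch below assumes a stand-in frame `df_demo` filled with 0 to 80 instead of random integers, so each cell equals 9 * row position + column position.

```python
import numpy as np
import pandas as pd

idx = pd.IndexSlice

rows = pd.MultiIndex.from_product(
    [["A", "B", "C"], ["a", "b", "c"]], names=("Upper", "Lower")
)
cols = pd.MultiIndex.from_product(
    [["D", "E", "F"], ["d", "e", "f"]], names=("Big", "Small")
)
# Deterministic stand-in for df_ex (an assumption for illustration).
df_demo = pd.DataFrame(np.arange(81).reshape(9, 9), index=rows, columns=cols)

# Rows from 'C' onward, columns from ('D', 'f') onward.
sub = df_demo.loc[idx["C":, ("D", "f"):]]
print(sub.shape)
print(sub.columns[0], sub.iloc[0, 0])
```

Here each slice runs along the whole axis in index order; it does not pick a range inside each level.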
Indexing with a boolean sequence:
df_ex.loc[idx[:'A', lambda x:x.sum()>0]] # keep columns whose sum over all rows of df_ex is greater than 0
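One detail worth checking is which rows the lambda sees. In the sketch below, on a small hypothetical frame `df_demo`, the callable receives the whole DataFrame, so the column sums are taken over every row and not only over the rows selected by :'A'.

```python
import numpy as np
import pandas as pd

idx = pd.IndexSlice

rows = pd.MultiIndex.from_product([["A", "B"], ["a", "b"]], names=("Upper", "Lower"))
cols = pd.MultiIndex.from_product([["D", "E"], ["d", "e"]], names=("Big", "Small"))
# Hypothetical values chosen so that full-frame sums and 'A'-only sums disagree.
data = np.array([
    [1, -5, 2, -1],
    [2, -5, 3, -1],
    [3, -5, -9, -1],
    [4, -5, -9, 5],
])
df_demo = pd.DataFrame(data, index=rows, columns=cols)

# Column sums over the full frame: 10, -20, -13, 2.
picked = df_demo.loc[idx[:"A", lambda x: x.sum() > 0]]
print(picked)
```

Column ('E', 'd') is positive within the 'A' rows but is dropped, while ('E', 'e') is negative within the 'A' rows but is kept, which shows the mask is computed on the full frame.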
The loc[idx[*,*],idx[*,*]] form
This form can slice each level separately. The first idx refers to the row index and the second to the column index.
df_ex.loc[idx[:'A', 'b':], idx['E':, 'e':]]
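To see which cells this form keeps, the sketch below applies the same call to a stand-in frame `df_demo` filled with 0 to 80 (an assumption for illustration), where each cell equals 9 * row position + column position.

```python
import numpy as np
import pandas as pd

idx = pd.IndexSlice

rows = pd.MultiIndex.from_product(
    [["A", "B", "C"], ["a", "b", "c"]], names=("Upper", "Lower")
)
cols = pd.MultiIndex.from_product(
    [["D", "E", "F"], ["d", "e", "f"]], names=("Big", "Small")
)
df_demo = pd.DataFrame(np.arange(81).reshape(9, 9), index=rows, columns=cols)

# Rows: Upper up to 'A' and Lower from 'b'.
# Columns: Big from 'E' and Small from 'e'.
block = df_demo.loc[idx[:"A", "b":], idx["E":, "e":]]
print(block)
```

Each level is sliced on its own here, so the result is the 2 x 4 block of rows (A, b), (A, c) and columns (E, e), (E, f), (F, e), (F, f).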
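Constructing a multi-level index
The three commonly used constructors are from_tuples, from_arrays and from_product, all of which are class methods of pd.MultiIndex. from_tuples builds the index from a list of tuples, one tuple per entry: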
In [101]: my_tuple = [('a','cat'),('a','dog'),('b','cat'),('b','dog')]
In [102]: pd.MultiIndex.from_tuples(my_tuple, names=['First','Second'])
Out[102]:
MultiIndex([('a', 'cat'),
('a', 'dog'),
('b', 'cat'),
('b', 'dog')],
names=['First', 'Second'])
from_arrays builds the index from a list of lists, one list per level:
In [103]: my_array = [list('aabb'), ['cat', 'dog']*2]
In [104]: pd.MultiIndex.from_arrays(my_array, names=['First','Second'])
Out[104]:
MultiIndex([('a', 'cat'),
('a', 'dog'),
('b', 'cat'),
('b', 'dog')],
names=['First', 'Second'])
from_product builds the index from the Cartesian product of the given lists:
In [105]: my_list1 = ['a','b']
In [106]: my_list2 = ['cat','dog']
In [107]: pd.MultiIndex.from_product([my_list1,
.....: my_list2],
.....: names=['First','Second'])
.....:
Out[107]:
MultiIndex([('a', 'cat'),
('a', 'dog'),
('b', 'cat'),
('b', 'dog')],
names=['First', 'Second'])
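The three outputs above are identical, which can be confirmed directly. This is a minimal sketch that reuses the same inputs; the variable names are chosen for the example.

```python
import pandas as pd

names = ["First", "Second"]

by_tuples = pd.MultiIndex.from_tuples(
    [("a", "cat"), ("a", "dog"), ("b", "cat"), ("b", "dog")], names=names
)
by_arrays = pd.MultiIndex.from_arrays(
    [list("aabb"), ["cat", "dog"] * 2], names=names
)
by_product = pd.MultiIndex.from_product(
    [["a", "b"], ["cat", "dog"]], names=names
)

# equals compares the index entries element by element.
print(by_tuples.equals(by_arrays), by_arrays.equals(by_product))
```

from_product is the shortest when every combination is wanted; from_tuples and from_arrays also cover indexes where some combinations are missing.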
References
Datawhale pandas course, 12th session