DataFrame

DataFrame

DataFrame 是一个表格型的数据结构,它含有一组有序的列,每列可以是不同的值类型(数值,字符串,布尔值等)
DataFrame 即有行索引也有列索引,它可以被看作由Series组成的字典(共用一个索引)

创建 DataFrame

from pandas import DataFrame
data = {'state':['Ohio','Ohio','Ohio','Nevada','Nevada'],
        'year':[2000,2001,2002,2001,2002],
        'pop':[1.5,1.7,3.6,2.4,2.9]}
frame = DataFrame(data)
frame
pop state year
0 1.5 Ohio 2000
1 1.7 Ohio 2001
2 3.6 Ohio 2002
3 2.4 Nevada 2001
4 2.9 Nevada 2002

DataFrame的列按照指定顺序进行排序

DataFrame(data,columns=['year','state','pop'])
year state pop
0 2000 Ohio 1.5
1 2001 Ohio 1.7
2 2002 Ohio 3.6
3 2001 Nevada 2.4
4 2002 Nevada 2.9

索引重命名

DataFrame(data,columns=['year','state','pop'],index=['one','two','three','four','five'])
year state pop
one 2000 Ohio 1.5
two 2001 Ohio 1.7
three 2002 Ohio 3.6
four 2001 Nevada 2.4
five 2002 Nevada 2.9

创建空列

传入的列在数据中找不到,会产生NaN值

DataFrame(data,columns=['year','state','pop','debt'],index=['one','two','three','four','five'])
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 NaN
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 NaN
five 2002 Nevada 2.9 NaN

列之间进行对比,创建 布尔值列

frame = DataFrame(data,columns=['year','state','pop','debt'],index=['one','two','three','four','five'])
frame['eastern'] = frame.state == 'Ohio'
frame
year state pop debt eastern
one 2000 Ohio 1.5 NaN True
two 2001 Ohio 1.7 NaN True
three 2002 Ohio 3.6 NaN True
four 2001 Nevada 2.4 NaN False
five 2002 Nevada 2.9 NaN False

通过字典嵌套(字典的字典) 进行创建

外层字典的键作为列,内层键作为行

pop = {'Nevada':{2001:2.4,2002:2.9},
      'Ohio':{2000:1.5,2001:1.7,2002:3.6}}
frame = DataFrame(pop)
frame
Nevada Ohio
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6
pop = {'Nevada':{2001:2.4,2002:2.9},
      'Ohio':{2000:1.5,2001:1.7,2002:3.6}}
frame = DataFrame(pop,index=[2001,2002,2003])
frame
Nevada Ohio
2001 2.4 1.7
2002 2.9 3.6
2003 NaN NaN
pop = {'Nevada':{2001:2.4,2002:2.9},
      'Ohio':{2000:1.5,2001:1.7,2002:3.6}}
frame = DataFrame(pop)
pdata = {'Ohio':frame['Ohio'][:-1],
         'Nevada':frame['Nevada'][:2]}
DataFrame(pdata)
Nevada Ohio
2000 NaN 1.5
2001 2.4 1.7

给索引 赋名

pop = {'Nevada':{2001:2.4,2002:2.9},
      'Ohio':{2000:1.5,2001:1.7,2002:3.6}}
frame = DataFrame(pop)
frame.index.name = 'year'
frame
Nevada Ohio
year
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6

给列 赋名

frame.columns.name = 'state'
frame
state Nevada Ohio
year
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6

转置

pop = {'Nevada':{2001:2.4,2002:2.9},
      'Ohio':{2000:1.5,2001:1.7,2002:3.6}}
frame = DataFrame(pop)
frame.T
2000 2001 2002
Nevada NaN 2.4 2.9
Ohio 1.5 1.7 3.6

.values 属性以二维ndarray形式返回DataFrame中数据

pop = {'Nevada':{2001:2.4,2002:2.9},
      'Ohio':{2000:1.5,2001:1.7,2002:3.6}}
frame = DataFrame(pop)
frame.values
array([[nan, 1.5],
       [2.4, 1.7],
       [2.9, 3.6]])

删除列值 del

pop = {'Nevada':{2001:2.4,2002:2.9},
      'Ohio':{2000:1.5,2001:1.7,2002:3.6}}
frame = DataFrame(pop)
del frame['Ohio']
frame
Nevada
2000 NaN
2001 2.4
2002 2.9

索取

获取列值

frame = DataFrame(data,columns=['year','state','pop','debt'],index=['one','two','three','four','five'])
frame['state']

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object
frame.year
one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64

获取所有列名 .columns

frame.columns 
Index(['year', 'state', 'pop', 'debt'], dtype='object')

获取所有索引名 .index

frame.index
Index(['one', 'two', 'three', 'four', 'five'], dtype='object')

获取行值

frame = DataFrame(data,columns=['year','state','pop','debt'],index=['one','two','three','four','five'])
frame.ix['three']
/Users/*hong/anaconda2/envs/python35/lib/python3.5/site-packages/ipykernel/__main__.py:2: DeprecationWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  from ipykernel import kernelapp as app





year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

赋值

frame = DataFrame(data,columns=['year','state','pop','debt'],index=['one','two','three','four','five'])
frame['debt'] = 16.5
frame
year state pop debt
one 2000 Ohio 1.5 16.5
two 2001 Ohio 1.7 16.5
three 2002 Ohio 3.6 16.5
four 2001 Nevada 2.4 16.5
five 2002 Nevada 2.9 16.5
import numpy as np
frame['debt']=np.arange(5)
frame
year state pop debt
one 2000 Ohio 1.5 0
two 2001 Ohio 1.7 1
three 2002 Ohio 3.6 2
four 2001 Nevada 2.4 3
five 2002 Nevada 2.9 4

将 Series 赋值给 DataFrame

赋值的是一个Series,会精确匹配DataFrame的索引,所有的空位都将被填上缺失值

from pandas import Series 
val = Series([-1.2,-1.5,-1.7],index=['two','four','five'])
frame['debt'] = val
frame
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 -1.2
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 -1.5
five 2002 Nevada 2.9 -1.7

索引对象

Index 对象是不可修改的,这样才能使Index对象在多个数据结构之间安全共享

from pandas import Seriesies
obj = Series(range(3),index=['a','b','c'])
obj
a    0
b    1
c    2
dtype: int64
index = obj.index 
index
Index(['a', 'b', 'c'], dtype='object')
index[1:]
Index(['b', 'c'], dtype='object')

pd.Index()

最泛化的Index对象,将轴标签表示为一个由python对象组成的NumPy数组

import numpy as np
import pandas as pd
pd.Index(np.arange(3))
Int64Index([0, 1, 2], dtype='int64')
index = pd.Index(np.arange(3))
obj = Series([1.5,-2.5,0],index=index)
obj
0    1.5
1   -2.5
2    0.0
dtype: float64
obj.index is index
True

用逻辑变量 返回索引所包含的数据

from pandas import DataFrame,Series
pop = {'Nevada':{2001:2.4,2002:2.9},
      'Ohio':{2000:1.5,2001:1.7,2002:3.6}}
frame = DataFrame(pop)
frame.index.name = 'year'
frame.columns.name = 'state'
frame
state Nevada Ohio
year
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6
'Ohio' in frame.columns
True
2003 in frame.index
False

基本功能

.reindex

其作用是创建一个适应新索引的新对象
参数

index 用作索引的新序列。即可以是index实例,也可以是其他序列型的python数据结构。
index会被完全使用,就像没有任何复制一样
method 插值(填充)方法
fill_value 在重索引的过程中,需要引入缺失值时使用的代替值
limit 前向或后向填充时的最大填充量
level 在MultiIndex的指定级别上匹配简答索引,否则选取其子集
copy 默认为True,无论如何都复制;如果为False,则新旧相等就不复制
obj = Series([4.5,7.2,-5.3,3.6],index=['d','b','a','c'])
obj
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
obj = obj.reindex(['a','b','c','d','e'])
obj
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

fill_value 参数

obj = Series([4.5,7.2,-5.3,3.6],index=['d','b','a','c'])
obj.reindex(['a','b','c','d','e'],fill_value=0)
a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64

method方法

ffill或pad 前向填充(或搬运)值

bfill或backfill 后向填充(或搬运)值

obj = Series(['blue','purple','yellow'],index=[0,2,4])
obj
0      blue
2    purple
4    yellow
dtype: object
obj.reindex(range(6),method='ffill')
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
frame = DataFrame(np.arange(9).reshape((3,3)),
                  columns=['Ohio','Texas','California'],index=['a','c','d'])
frame
Ohio Texas California
a 0 1 2
c 3 4 5
d 6 7 8
frame.reindex(['a','b','c','d'])
Ohio Texas California
a 0.0 1.0 2.0
b NaN NaN NaN
c 3.0 4.0 5.0
d 6.0 7.0 8.0
states = ['Texas','Utah','California']
frame.reindex(columns=states)
Texas Utah California
a 1 NaN 2
c 4 NaN 5
d 7 NaN 8
frame = DataFrame(np.arange(9).reshape((3,3)),
                  columns=['Ohio','Texas','California'],index=['a','c','d'])
states = ['Texas','Utah','California']
frame = frame.reindex(columns=states)
frame.reindex(index=['a','b','c','d'],method='ffill',columns=states)
Texas Utah California
a 1 NaN 2
b 1 NaN 2
c 4 NaN 5
d 7 NaN 8

.ix

frame = DataFrame(np.arange(9).reshape((3,3)),
                  columns=['Ohio','Texas','California'],index=['a','c','d'])
states = ['Texas','Utah','California']
frame = frame.reindex(columns=states)
frame.reindex(index=['a','b','c','d'],columns=states)
Texas Utah California
a 1.0 NaN 2.0
b NaN NaN NaN
c 4.0 NaN 5.0
d 7.0 NaN 8.0

丢弃指定轴上的项

drop 方法返回的是一个在指定轴上删除了指定值的新对象

obj = Series(np.arange(5),index=['a','b','c','d','e'])
obj
a    0
b    1
c    2
d    3
e    4
dtype: int64
obj.drop('c')
a    0
b    1
d    3
e    4
dtype: int64
obj.drop(['d','c'])
a    0
b    1
e    4
dtype: int64
data = DataFrame(np.arange(16).reshape((4,4)),
                index=['Ohio','Colorado','Utah','New York'],columns=['one','two','three','four'])
data
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15

删除行

data.drop(['Colorado','Ohio'])
one two three four
Utah 8 9 10 11
New York 12 13 14 15

删除列

data.drop('two',axis=1)
one three four
Ohio 0 2 3
Colorado 4 6 7
Utah 8 10 11
New York 12 14 15
data.drop(['two','four'],axis=1)
one three
Ohio 0 2
Colorado 4 6
Utah 8 10
New York 12 14
上一篇:pandas中的applymap和apply


下一篇:lesson3 运算符与表达式