Learning Pandas

2022-03-26 23:59:52

Follow the links below to read the original sources

Pandas, Intro to Data Structures
http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dsintro
Pandas Chinese Cheat Sheet, Zhihu
https://zhuanlan.zhihu.com/p/25630700

First of all, don't forget to import the modules

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Data Structures

The assign function

In [69]: iris = pd.read_csv('data/iris.data')

In [70]: iris.head()
Out[70]: 
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
2          4.7         3.2          1.3         0.2  Iris-setosa
3          4.6         3.1          1.5         0.2  Iris-setosa
4          5.0         3.6          1.4         0.2  Iris-setosa

In [71]: (iris.assign(sepal_ratio = iris['SepalWidth'] / iris['SepalLength'])
   ....:      .head())
   ....: 
Out[71]: 
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name  sepal_ratio
0          5.1         3.5          1.4         0.2  Iris-setosa       0.6863
1          4.9         3.0          1.4         0.2  Iris-setosa       0.6122
2          4.7         3.2          1.3         0.2  Iris-setosa       0.6809
3          4.6         3.1          1.5         0.2  Iris-setosa       0.6739
4          5.0         3.6          1.4         0.2  Iris-setosa       0.7200

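Evaluation order in assign

Warning: the following applies to older setups (pandas before 0.23, or Python before 3.6). There, since the function signature of assign is **kwargs, a dictionary, the order of the new columns in the resulting DataFrame cannot be guaranteed to match the order you pass in.
To make things predictable, items are inserted alphabetically (by key) at the end of the DataFrame.
All expressions are computed first, and then assigned, so you can't refer to another column being assigned in the same call to assign. On pandas 0.23 or later with Python 3.6 or later, keyword order is preserved and a later expression may refer to a column created earlier in the same call. For the older behavior, for example: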

In [74]: # Don't do this, bad reference to `C`
        df.assign(C = lambda x: x['A'] + x['B'],
                  D = lambda x: x['A'] + x['C'])

In [2]: # Instead, break it into two assigns
        (df.assign(C = lambda x: x['A'] + x['B'])
           .assign(D = lambda x: x['A'] + x['C']))
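
The two-step pattern can be tried on a small frame. This is only a sketch: the values in columns A and B are made up for illustration, and `result` is a name introduced here.

```python
import pandas as pd

# Hypothetical toy data; the column names A and B follow the snippet above.
df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# Each assign returns a new DataFrame, so the second call can see column C.
result = (df.assign(C=lambda x: x["A"] + x["B"])
            .assign(D=lambda x: x["A"] + x["C"]))

print(result)
# assign does not modify df in place; it still has only A and B.
print(list(df.columns))
```

Chaining like this behaves the same regardless of how a single assign call orders its keywords.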

Slicing and selection

Operation	Syntax	Result
Select column	df[col]	Series
Select row by label	df.loc[label]	Series
Select row by integer location	df.iloc[loc]	Series
Slice rows	df[5:10]	DataFrame
Select rows by boolean vector	df[bool_vec]	DataFrame
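
Each row of the table can be checked on a small frame. A minimal sketch, assuming a made-up DataFrame with string row labels (the data and variable names are not from the original docs):

```python
import pandas as pd

# Hypothetical frame with string row labels, invented for illustration.
df = pd.DataFrame(
    {"A": [1, 2, 3, 4], "B": [5.0, 6.0, 7.0, 8.0]},
    index=["w", "x", "y", "z"],
)

col = df["A"]                 # select column -> Series
row_by_label = df.loc["x"]    # select row by label -> Series
row_by_pos = df.iloc[1]       # select row by integer location -> Series
rows = df[1:3]                # slice rows -> DataFrame (rows x and y)
filtered = df[df["A"] > 2]    # boolean vector -> DataFrame (rows y and z)

print(type(col).__name__, type(rows).__name__)
```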

Transpose (swap the rows and columns of the matrix)

To transpose, access the T attribute (also the transpose function), similar to an ndarray:

# only show the first 5 rows
In [95]: df[:5].T
Out[95]: 
   2000-01-01  2000-01-02  2000-01-03  2000-01-04  2000-01-05
A     -0.0817     -0.5056     -0.0259      0.0492      1.2432
B      1.3905      0.0213      0.8407      0.4879     -0.6222
C     -1.9620     -0.3171      1.4135      0.4263     -0.5386
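
A small sketch of both spellings, assuming a made-up frame shaped like the one above (dates as the index, columns A to C; the numbers are placeholders, not the values printed above):

```python
import numpy as np
import pandas as pd

# Hypothetical data: 5 dates as the index, columns A, B, C.
dates = pd.date_range("2000-01-01", periods=5)
df = pd.DataFrame(np.arange(15).reshape(5, 3), index=dates, columns=list("ABC"))

transposed = df.T          # attribute form
same = df.transpose()      # method form, equivalent

print(transposed.shape)    # columns become rows, dates become columns
```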

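Pandas objects work with NumPy's element-wise functions and support matrix multiplication.
Pandas also lets you configure how data is displayed in the console.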
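
A rough sketch of these three points on a tiny made-up frame. The names `roots` and `gram` are introduced here for illustration, and `display.precision` is just one of several display options.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})  # hypothetical data

# NumPy element-wise functions return a DataFrame with the same labels.
roots = np.sqrt(df)

# Matrix multiplication: dot aligns the columns of df.T with the index of df.
gram = df.T.dot(df)

# Display formatting: limit printed floats to 2 decimal places inside this block only.
with pd.option_context("display.precision", 2):
    print(roots)
print(gram)
```

Using `pd.option_context` keeps the setting temporary; `pd.set_option` would change it for the whole session.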

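Data with three or more dimensions probably won't be needed for now.

The important part is here~

I'll write this up once I actually use it.

Visualization

Original documentation: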
http://pandas.pydata.org/pandas-docs/stable/visualization.html#visualization

Basic plotting function
plot()
If the index consists of dates, it calls gcf().autofmt_xdate()
to try to format the x-axis nicely as per above.

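The example in the docs accumulates its values one after another, and the plot shows that running total.
It corresponds to NumPy's cumsum() function: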

>>> a = np.array([[1,2,3], [4,5,6]])
>>> a
array([[1, 2, 3],
       [4, 5, 6]])
>>> np.cumsum(a)
array([ 1,  3,  6, 10, 15, 21])
>>> np.cumsum(a, dtype=float)     # specifies type of output value(s)
array([  1.,   3.,   6.,  10.,  15.,  21.])
>>> np.cumsum(a,axis=0)      # sum over rows for each of the 3 columns
array([[1, 2, 3],
       [5, 7, 9]])
>>> np.cumsum(a,axis=1)      # sum over columns for each of the 2 rows
array([[ 1,  3,  6],
       [ 4,  9, 15]])
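
The same running totals are available directly on pandas objects. A minimal sketch reusing the array above; the column names x, y, z are made up here.

```python
import numpy as np
import pandas as pd

a = np.array([[1, 2, 3], [4, 5, 6]])
df = pd.DataFrame(a, columns=["x", "y", "z"])  # hypothetical column names

down_rows = df.cumsum()          # default axis=0: running total down each column
across_cols = df.cumsum(axis=1)  # running total across each row

print(down_rows)
print(across_cols)
```

Unlike np.cumsum without an axis argument, DataFrame.cumsum does not flatten the data; it works column by column by default.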

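# kind: 'bar' is a vertical bar chart, 'barh' a horizontal bar chart, 'kde' a kernel density estimate; scatter plots ('scatter') and others also exist
# x, y: names of the columns used for the x and y axes; color: plot color, 'r' for red, 'g' for green, etc.
# stacked=True stacks the series on top of each other
# cumulative=True draws a cumulative plot (for histograms, kind='hist')
df.plot(kind='bar' / 'barh' / 'kde', x=..., y=..., color=..., stacked=True / False)
df.plot(kind='hist', cumulative=True / False)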
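
A hedged sketch of a few of these options together. The data is random and made up, the Agg backend is chosen only so the code runs without a display, and the explicit `ax=` arguments are a choice made here to keep the three plots on separate axes.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: three random walks built with cumsum, indexed by date.
rng = np.random.default_rng(0)
dates = pd.date_range("2000-01-01", periods=100)
df = pd.DataFrame(rng.standard_normal((100, 3)),
                  index=dates, columns=list("ABC")).cumsum()

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 3))

df.plot(ax=ax1)                                           # line plot, one line per column
df.iloc[:5].abs().plot(kind="bar", stacked=True, ax=ax2)  # stacked bars, first 5 rows
df["A"].plot(kind="hist", cumulative=True, color="g", ax=ax3)  # cumulative histogram

# In an interactive session, plt.show() would display the figure.
```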

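Reading and writing data

CSV stands for Comma-Separated Values (sometimes also called character-separated values, because the delimiter does not have to be a comma).
Opening a CSV file in Notepad shows how it is stored. Although it is a plain-text file, it can also be opened by software such as Excel.
*Mind the file name extension*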

df.to_csv('[path/]filename')

Writes the DataFrame to a CSV file

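df.to_excel('[path/]filename' [, sheet_name = ''])

Writes the DataFrame to an Excel file, e.g. in xls format
Writing the legacy xls format (Excel 2003 and earlier) relied on the xlwt module, which has to be installed separately; for xlsx files (Excel 2007 and later) install the openpyxl module. Recent pandas releases have dropped xlwt support, so xlsx with openpyxl is the safer choice. Still, better to stick with CSV where you can (runs away)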

pd.read_csv('[path/]filename')

Reads data from a CSV file

pd.read_excel('[path/]filename')

Reads data from an Excel file
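
A minimal round trip through CSV using the functions above. The data, the temporary directory, and the file name example.csv are all made up for illustration; the Excel functions follow the same pattern but need the extra modules installed.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"name": ["a", "b"], "value": [1.5, 2.5]})  # hypothetical data

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.csv")  # hypothetical file name
    df.to_csv(path, index=False)             # index=False: skip writing the row index
    restored = pd.read_csv(path)

print(restored)
```

Without `index=False`, the row index is written as an extra first column and comes back as an unnamed column on read.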

NLTK, natural language
