Chapter 6 Data Loading, Storage, and File Formats
Reading and Writing Data in Text Format
最常用的是 read_csv
和 read_table
,不过数模竞赛里很多都是用 excel 给数据,不知道今年是个啥情况。
下表是一些常用的数据读取方法:
其中根据试验数据的不同形式,可以选择不同的read_csv
参数进行调整。这部分我觉得根据具体数据具体处理即可,用到再去查文档,现在不必要把所有的参数都记住。
需要注意的是,read_csv
会把数据中的 NA 和 NULL 标记为空值;也可以通过 na_values
参数设置空值:
result = pd.read_csv('examples/ex5.csv', na_values=['NULL'])
# 可以用下面这种形式分别规定每一列中的空值
sentinels = {'message': ['foo', 'NA'], 'something': ['two']}
result = pd.read_csv('examples/ex5.csv', na_values=['NULL'])
Reading Text Files in Pieces
这部分主要是用 chunksize
这个参数。
chunker = pd.read_csv('examples/ex6.csv', chunksize=1000)
tot = pd.Series([])
chunker_list = []
for piece in chunker:
tot = tot.add(piece['key'].value_counts(), fill_value=0)
chunker_list.append(piece)
print(chunker_list)
观察输出,可以发现,chunker 相当于是一个 iterator,每个元素都是 chunksize 的大小。需要注意的是,chunker 迭代之后就不可以再用了,这种特性有点类似于之前提到过的 generator。
Writing Data to Text Format
主要是使用to_csv
这个方法。
Working with Delimited Formats
In some cases, however, some manual processing may be necessary. It’s not uncommon to receive a file with one or more malformed lines that trip up read_table.
read_csv
和read_table
功能已经足够强大,但是为了防止数据中存在格式不正确的数据使得这两个方法失效,所以考虑 python 内置的 csv 读取方法:
import csv
f = open('examples/ex7.csv')
reader = csv.reader(f)
一个简单的读取数据方法:
with import csv
with open('examples/ex7.csv') as f:
lines = list(csv.reader(f))
headers, values = lines[0], lines[1:]
data_dict = {h: v for h, v in zip(headers, zip(*values))}
print(data_dict)
为了个性化的定义数据的分隔符等,可以用csv.Dialect
子类,通过一些属性来定义数据的组织形式:
class my_dialect(csv.Dialect):
lineterminator = '\n'
delimiter = ';'
quotechar = '"'
quoting = csv.QUOTE_MINIMAL
with open('examples/ex7.csv') as f:
reader = csv.reader(f, dialect=my_dialect)
这里是一些 dialect 的属性。
JSON Data
可以使用 python 内置的 json 进行读取,也可以用pd.read_json
,但是read_json
默认情况下认为 json 文件中的每一行就是 DataFrame 的每一行。
XML and HTML: Web Scraping
主要使用read_html
方法,顺便可以进行一些简单的数据分析。
Parsing XML with lxml.objectify
使用 lxml.objectify.parse
来对 XML 数据文件进行处理,调用 getroot()
方法得到处理后数据的 root,调用 XML 最外层的标签,会返回一个可迭代的对象。
具体遇到直接查文档吧。。。
Binary Data Formats
One of the easiest ways to store data (also known as serialization) efficiently in binary format is using Python’s built-in pickle serialization. pandas objects all have a to_pickle method that writes the data to disk in pickle format.
同时,作者指出应该慎用 pickle 这种方式,只推荐用 pickle 保存短期的数据:
pickle is only recommended as a short-term storage format. The problem is that it is hard to guarantee that the format will be stable over time; an object pickled today may not unpickle with a later version of a library.
Using HDF5 Format
HDF5( hierarchical data format )便于存储大型、复杂的数据集结构,pandas 提供了一个高级接口 HDFStore
来存储 DataFrame 和Series。
frame = pd.DataFrame({'a': np.random.randn(100)})
store = pd.HDFStore('mydata.h5')
store['obj1'] = frame
store['obj1_col'] = frame['a']
print(store)
HDFStore 支持两种存储形式,一种是 fixed,一种是 table,详细的文档在这里,可以看到 table 的速度虽然慢,但是支持 searching and selecting subsets 操作。
HDF5 is not a database. It is best suited for write-once, read-many datasets. While data can be added to a file at any time, if multiple writers do so simultaneously, the file can become corrupted.
Read Microsoft Excel Files
pandas also supports reading tabular data stored in Excel 2003 (and higher) files using either the ExcelFile class or pandas.read_excel function.
To write pandas data to Excel format, you must first create an ExcelWriter, then write data to it using pandas objects’ to_excel method:
writer = pd.ExcelWriter('examples/ex2.xlsx')
frame.to_excel(writer, 'Sheet1')
# another method
frame.to_excel('examples/ex2.xlsx')
writer.save()
Interacting with Web APIs
这部分其实与 pandas 关系不大,不细读了。
Interacting with Databases
同上。
Conclusion
Getting access to data is frequently the first step in the data analysis process. We have looked at a number of useful tools in this chapter that should help you get started. In the upcoming chapters we will dig deeper into data wrangling, data visualization, time series analysis, and other topics.