Structured data representation in Python

Structured data

https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html

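Structured data: a schema is defined on top of the data, as in a relational database.

Unstructured data: free-form data with no constraints at all, such as newspaper articles.

Semi-structured data: there is no global schema, but each record carries its own schema definition, as in a document database.

Python applications often need to define structured data in order to manage business data. This post summarizes several ways to represent and hold structured data in Python.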

 

Structured data

Structured data sources define a schema on the data. With this extra bit of information about the underlying data, structured data sources provide efficient storage and performance. For example, columnar formats such as Parquet and ORC make it much easier to extract values from a subset of columns. Reading each record row by row first, then extracting the values from the specific columns of interest can read much more data than what is necessary when a query is only interested in a small fraction of the columns. A row-based storage format such as Avro efficiently serializes and stores data providing storage benefits. However, these advantages often come at the cost of flexibility. For example, because of rigidity in structure, evolving a schema can be challenging.
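The row-versus-column trade-off described above can be sketched in plain Python. This is only an in-memory analogy, not how Parquet, ORC, or Avro are actually implemented, and the records and field names are made up for illustration.

```python
# Row-oriented layout: one entry per record, all fields stored together.
rows = [
    {"name": "alice", "age": 30, "city": "Paris"},
    {"name": "bob", "age": 25, "city": "Rome"},
    {"name": "carol", "age": 35, "city": "Oslo"},
]

# Column-oriented layout: one list per field, values grouped by column.
columns = {
    "name": [r["name"] for r in rows],
    "age": [r["age"] for r in rows],
    "city": [r["city"] for r in rows],
}

# Averaging one field from the row layout visits every whole record...
avg_from_rows = sum(r["age"] for r in rows) / len(rows)

# ...while the columnar layout only touches the single list it needs.
avg_from_columns = sum(columns["age"]) / len(columns["age"])

print(avg_from_rows, avg_from_columns)
```

Both computations give the same answer; the difference is how much unrelated data has to be read to get there.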

Unstructured data

By contrast, unstructured data sources are generally free-form text or binary objects that contain no markup, or metadata (e.g., commas in CSV files), to define the organization of data. Newspaper articles, medical records, image blobs, application logs are often treated as unstructured data. These sorts of sources generally require context around the data to be parseable. That is, you need to know that the file is an image or is a newspaper article. Most sources of data are unstructured. The cost of having unstructured formats is that it becomes cumbersome to extract value out of these data sources as many transformations and feature extraction techniques are required to interpret these datasets.

Semi-structured data

Semi-structured data sources are structured per record but don’t necessarily have a well-defined global schema spanning all records. As a result, each data record is augmented with its schema information. JSON and XML are popular examples. The benefits of semi-structured data formats are that they provide the most flexibility in expressing your data as each record is self-describing. These formats are very common across many applications as many lightweight parsers exist for dealing with these records, and they also have the benefit of being human readable. However, the main drawback for these formats is that they incur extra parsing overheads, and are not particularly built for ad-hoc querying.
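As a small illustration of "self-describing" records, the sketch below parses two JSON records with the standard json module. The records are invented for this example; the point is only that each one carries its own field names and the two do not share a schema.

```python
import json

# Two made-up JSON records with different fields: there is no global schema.
raw_records = [
    '{"id": 1, "name": "alice", "tags": ["admin"]}',
    '{"id": 2, "email": "bob@example.com"}',
]

records = [json.loads(line) for line in raw_records]

# Each record describes itself, so the field set is discovered per record.
for record in records:
    print(sorted(record.keys()))

# Consumers have to cope with fields that some records do not have.
names = [record.get("name", "<unknown>") for record in records]
print(names)
```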

 

Dict

https://docs.python.org/3/tutorial/datastructures.html#dictionaries

A dict has no real schema definition; the developer has to spell out each field as needed at the point of use.

 

>>> tel = {'jack': 4098, 'sape': 4139}
>>> tel['guido'] = 4127
>>> tel
{'jack': 4098, 'sape': 4139, 'guido': 4127}
>>> tel['jack']
4098
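Because a dict has no schema, mistakes in field names are only caught at runtime, if at all. A minimal sketch of what that looks like in practice (the record and its keys are hypothetical):

```python
# A dict-based record; the field names are only a convention.
user = {"name": "jack", "phone": 4098}

# A misspelled key on write is accepted silently and creates a new field.
user["phnoe"] = 4139

# A missing key on read fails only at runtime, with KeyError.
try:
    user["email"]
except KeyError as exc:
    missing = exc.args[0]

# dict.get returns a default instead of raising.
email = user.get("email", "n/a")

print(sorted(user), missing, email)
```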

namedtuple

https://medium.com/swlh/structures-in-python-ed199411b3e1

A named tuple gives a name to each position in the tuple, so elements can be accessed by name as well as by index.

 

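from collections import namedtuple

# Basic definition: every field is required.
Point = namedtuple('Point', ['x', 'y'])

# Alternative definition with default values (Python 3.7+); this rebinds Point.
Point = namedtuple('Point', ['x', 'y'], defaults=[0, 0])

# Positional and keyword arguments can be mixed.
ntpt = Point(3, y=6)

# Access fields by name...
ntpt.x + ntpt.y    # 9

# ...or by position, like a regular tuple.
ntpt[0] + ntpt[1]  # 9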
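A few further properties of named tuples are worth seeing next to the example above: instances are immutable, _replace returns a modified copy, and _asdict converts to a plain mapping. The sketch below reuses the same Point definition.

```python
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"], defaults=[0, 0])

p = Point(3, y=6)

# Fields are read-only: assignment raises AttributeError.
try:
    p.x = 10
except AttributeError:
    mutated = False
else:
    mutated = True

# _replace builds a new tuple with selected fields changed.
q = p._replace(x=10)

# _asdict converts to a mapping, e.g. before serializing.
as_dict = q._asdict()

# Defaults fill in any omitted fields.
origin = Point()

print(mutated, q, as_dict, origin)
```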

 

class

https://docs.python.org/3/tutorial/classes.html#class-objects

Use a class to manage compound data attributes.

>>> class Complex:
...     def __init__(self, realpart, imagpart):
...         self.r = realpart
...         self.i = imagpart
...
>>> x = Complex(3.0, -4.5)
>>> x.r, x.i
(3.0, -4.5)
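A plain class works as a data holder, but it gets no data-oriented behavior for free. The sketch below, reusing the Complex class from the tutorial example, shows the default equality and repr, and that attributes are not fixed by the class definition.

```python
class Complex:
    def __init__(self, realpart, imagpart):
        self.r = realpart
        self.i = imagpart


a = Complex(3.0, -4.5)
b = Complex(3.0, -4.5)

# Without __eq__, equality falls back to object identity.
same_values_equal = (a == b)

# Without __repr__, the default repr shows the type and address, not the fields.
default_repr = repr(a)

# Attributes are not fixed: a mistyped name silently creates a new attribute.
a.imag = 1.0

print(same_values_equal, "Complex object at" in default_repr, hasattr(a, "imag"))
```

Writing __repr__ and __eq__ by hand for every such class is the boilerplate that dataclass removes.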

 

dataclass

https://www.geeksforgeeks.org/understanding-python-dataclasses/

dataclass builds on a regular class and is aimed specifically at storing data: it generates the initialization, printing, and comparison methods.

 

Dataclasses were added in Python 3.7 as a utility for storing data. The dataclasses module provides a decorator and functions for automatically adding generated special methods such as __init__(), __repr__() and __eq__() to user-defined classes.

 

# default field example
from dataclasses import dataclass, field


# A class for holding an employee's data
@dataclass
class Employee:

    # Attribute declarations using type hints
    name: str
    emp_id: str
    age: int

    # Field with a default value; equivalent to: city: str = "patna"
    city: str = field(default="patna")


emp = Employee("Satyam", "ksatyam858", 21)
print(emp)
# Employee(name='Satyam', emp_id='ksatyam858', age=21, city='patna')
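Beyond __init__ and __repr__, a few other dataclass features are commonly useful for data records. The sketch below mirrors the employee example above; the skills field and the frozen=True option are additions made here for illustration.

```python
from dataclasses import FrozenInstanceError, asdict, dataclass, field


@dataclass(frozen=True)
class Employee:
    name: str
    emp_id: str
    age: int
    city: str = "patna"
    # Mutable defaults must go through default_factory, so each
    # instance gets its own list (hypothetical extra field).
    skills: list = field(default_factory=list)


e1 = Employee("Satyam", "ksatyam858", 21)
e2 = Employee("Satyam", "ksatyam858", 21)

# The generated __eq__ compares instances field by field.
print(e1 == e2)

# frozen=True turns attribute assignment into an error.
try:
    e1.age = 22
except FrozenInstanceError:
    print("immutable")

# asdict converts the instance to a plain dict.
print(asdict(e1))
```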

 

pydantic

https://pydantic-docs.helpmanual.io/

On top of schema definition, pydantic adds several capabilities:

Data validation

Type errors reported at runtime

Data validation and settings management using python type annotations.

pydantic enforces type hints at runtime, and provides user friendly errors when data is invalid.

Define how data should be in pure, canonical python; validate it with pydantic.

from datetime import datetime
from typing import List, Optional
from pydantic import BaseModel


class User(BaseModel):
    id: int
    # Annotated explicitly: pydantic v1 infers the type from the default,
    # but pydantic v2 rejects unannotated fields.
    name: str = 'John Doe'
    signup_ts: Optional[datetime] = None
    friends: List[int] = []


external_data = {
    'id': '123',
    'signup_ts': '2019-06-01 12:22',
    'friends': [1, 2, '3'],
}
user = User(**external_data)
print(user.id)
#> 123
print(repr(user.signup_ts))
#> datetime.datetime(2019, 6, 1, 12, 22)
print(user.friends)
#> [1, 2, 3]
# .dict() is the pydantic v1 name; v2 deprecates it in favor of .model_dump()
print(user.dict())
"""
{
    'id': 123,
    'name': 'John Doe',
    'signup_ts': datetime.datetime(2019, 6, 1, 12, 22),
    'friends': [1, 2, 3],
}
"""
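For contrast, a standard-library dataclass does not check its annotations at runtime. The sketch below uses only the standard library to show that gap and a hand-written check that closes it for simple types. It is not how pydantic is implemented, the class names are hypothetical, and unlike pydantic it rejects bad values rather than coercing them (pydantic would turn '123' into 123).

```python
from dataclasses import dataclass, fields


@dataclass
class PlainUser:
    id: int
    name: str = "John Doe"


# A plain dataclass stores whatever it is given; annotations are not enforced.
unchecked = PlainUser(id="not-a-number")


@dataclass
class CheckedUser:
    id: int
    name: str = "John Doe"

    def __post_init__(self):
        # Hand-written check, valid only for simple class annotations
        # such as int and str (not Optional[...] or List[...]).
        for f in fields(self):
            value = getattr(self, f.name)
            if not isinstance(value, f.type):
                raise TypeError(
                    f"{f.name} must be {f.type.__name__}, got {value!r}"
                )


try:
    CheckedUser(id="not-a-number")
except TypeError as exc:
    error = str(exc)

print(type(unchecked.id).__name__, error)
```

Writing and maintaining such checks by hand for nested and generic types quickly becomes tedious, which is the work pydantic takes over.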

 
