Questions
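- What data models are there? How do they differ and how are they related? What problems does each solve, and in which scenarios? How can they be combined?
- What query language goes with each data model, and what are its characteristics?
- What are the differences between the relational and document models?
- What is NoSQL? How have data models and query languages developed historically?
- Is it just a question of whether the schema is explicit (enforced on write) or implicit (handled on read)?
- What are the limitations of relational databases? What causes them, and how can they be overcome?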
Overview
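Most applications are built by layering one data model on top of another, with each layer hiding the complexity of the layers below it. The choice of data model shapes how we think about the problem we are solving. This section covers three data models (the relational model, the document model, and graph-based models) together with their query languages.
The relational model handles many-to-many relationships between data well, but it does not fit every use case, which is why NoSQL datastores emerged. They fall into two main categories: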
- Document databases: Document databases target use cases where data comes in self-contained documents and relationships between one document and another are rare.
- Graph databases: Graph databases go in the opposite direction, targeting use cases where anything is potentially related to everything.
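Data that fits one data model can also be organized with another; for example, graph data can be represented in a relational database. Forcing data into an ill-suited model, however, makes it awkward to work with. That’s why we have different systems for different purposes, not a single one-size-fits-all solution.
The general pattern is to define the data storage format first, and then define a query language on top of it.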
Relational Model Versus Document Model
Relational Model
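In the relational model, data is organized into tables (relations), each of which is a collection of rows (tuples). Its main use cases are transaction processing and batch processing. It hides the implementation details of the underlying storage and exposes SQL, a declarative query language.
There is a mismatch between the object-oriented model and the relational model, so a translation layer such as an ORM (MyBatis, for example) is needed, but even that cannot hide the difference completely.
A key insight of the relational model was this: you only need to build a query optimizer once, and then all applications that use the database can benefit from it.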
Document Model
The document model suits one-to-many relationships; its support for joins is relatively weak.
Driving forces behind NoSQL (Not Only SQL):
- A need for greater scalability than relational databases can easily achieve, including very large datasets or very high write throughput
- The advantages of open source software
- Specialized query operations that are not well supported by the relational model
- Frustration with the restrictiveness of relational schemas, and a desire for a more dynamic and expressive data model
Schema flexibility in the document model:
- schema-on-read (the structure of the data is implicit, and only interpreted when the data is read)
- schema-on-write (the traditional approach of relational databases, where the schema is explicit and the database ensures all written data conforms to it)
The schema-on-read approach is advantageous if the items in the collection don’t all have the same structure for some reason.
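A minimal Python sketch of what schema-on-read can look like in application code. It assumes a hypothetical user document in which older records store a single `name` field and newer ones store `first_name` and `last_name`; the field names and the helper `full_name_parts` are illustrative only and do not come from any particular database.

```python
def full_name_parts(user: dict) -> tuple:
    """Return (first_name, last_name), interpreting the document at read time."""
    if "first_name" in user:
        # Newer documents already store the split fields.
        return user["first_name"], user.get("last_name", "")
    # Older documents only have "name"; derive the parts when reading.
    first, _, last = user.get("name", "").partition(" ")
    return first, last


# Two documents with different implicit structures in the same collection.
old_doc = {"user_id": 1, "name": "Ada Lovelace"}
new_doc = {"user_id": 2, "first_name": "Alan", "last_name": "Turing"}

print(full_name_parts(old_doc))
print(full_name_parts(new_doc))
```

With schema-on-write, the same change would typically be made once through a schema migration, after which every row has the new columns.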
Many-to-One and Many-to-Many Relationships
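When the same human-meaningful information is stored in many places, every copy has to be updated whenever it changes. Removing such duplication is the key idea behind normalization in databases.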
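- The relational model organizes one-to-many relationships with foreign keys and retrieves the data with joins, which costs more than reading a single document. The document model supports one-to-many relationships well: in document databases, joins are not needed for one-to-many tree structures, and support for joins is often weak. If the database does not support joins, however, the join logic has to be implemented in application code.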
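What a join in application code amounts to can be sketched as follows. The two collections, their field names, and the function `join_users_with_regions` are hypothetical; in a real system each list would be the result of a separate query against the document database.

```python
# Hypothetical collections, standing in for the results of two separate queries.
users = [
    {"user_id": 251, "name": "Bill", "region_id": "us:91"},
    {"user_id": 252, "name": "Ann", "region_id": "uk:12"},
]
regions = [
    {"region_id": "us:91", "name": "Greater Seattle Area"},
    {"region_id": "uk:12", "name": "Greater London"},
]


def join_users_with_regions(users, regions):
    # Build a lookup table once (the "build" side of a hash join).
    region_by_id = {r["region_id"]: r for r in regions}
    joined = []
    for user in users:
        # Resolve the document reference by hand.
        region = region_by_id.get(user["region_id"])
        joined.append({
            "name": user["name"],
            "region": region["name"] if region else None,
        })
    return joined


result = join_users_with_regions(users, regions)
print(result)
```

The work a relational database would do inside its query executor moves into the application, along with the responsibility for keeping it efficient.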
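The network model: a generalization of the hierarchical (tree) model in which a record can have multiple parents. Records are connected by links (more like pointers than foreign keys), which supports many-to-one and many-to-many relationships.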
The only way of accessing a record was to follow a path from a root record along these chains of links. This was called an access path.
[Figure: network model, a record with multiple owners (http://www.dlsweb.rmit.edu.au/Toolbox/knowmang/content/models/images/multiple_owner.gif)]
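The relational model, by contrast:
- Data is laid out as tables with no nested structure; rows can be inserted, deleted, updated, and queried directly.
- The access path is chosen automatically by the query optimizer and does not have to be specified by hand.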
[Figure: relational schema (https://www.dlsweb.rmit.edu.au/Toolbox/knowmang/content/models/images/relational_schema.gif)]
For many-to-one and many-to-many relationships the two models are not fundamentally different: the relational model uses foreign keys, and the document model uses document references.
Main advantages of each:
- Document model: schema flexibility, better performance due to locality, and, for some applications, a closer match to the data structures used by the application.
- Relational model: better support for joins, and for many-to-one and many-to-many relationships.
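Data locality for queries
A document is stored as one contiguous unit, so reading all of it is fast; if the same data is split across multiple tables, retrieving it costs more.
Grouping related data for locality is a widely used strategy; column families are one example.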
Convergence of document and relational databases
A hybrid of the relational and document models is a good route for databases to take in the future. Examples include SQL on Elasticsearch and Phoenix.
Query languages for data
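SQL
SQL is based on relational algebra and is a declarative language, which has two consequences:
- SQL decouples the query logic from the database implementation, so the two can change independently. It gives the database much more room for automatic optimizations.
- SQL only specifies the pattern of the results, not the order of execution, so it is easy to parallelize. Imperative code is hard to parallelize because it specifies instructions that must be performed in a particular order. Declarative languages have a better chance of getting faster in parallel execution because they specify only the pattern of the results, not the algorithm that is used to determine the results.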
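The contrast between the two styles can be sketched in Python using the standard-library `sqlite3` module. The `animals` table and both function names are made up for illustration. The first function fixes the iteration order and the accumulation steps, while the second only states which rows are wanted and leaves the execution strategy to the database engine.

```python
import sqlite3

# Hypothetical sample data: (name, family) pairs.
animals = [("Shark", "Sharks"), ("Lion", "Felidae"), ("Hammerhead", "Sharks")]


def get_sharks_imperative(rows):
    # Imperative: we spell out how to walk the data, step by step.
    sharks = []
    for name, family in rows:
        if family == "Sharks":
            sharks.append(name)
    return sharks


def get_sharks_declarative(rows):
    # Declarative: we describe the result; the engine decides how to compute it.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE animals (name TEXT, family TEXT)")
    conn.executemany("INSERT INTO animals VALUES (?, ?)", rows)
    cursor = conn.execute(
        "SELECT name FROM animals WHERE family = ? ORDER BY name", ("Sharks",)
    )
    result = [name for (name,) in cursor]
    conn.close()
    return result


print(get_sharks_imperative(animals))
print(get_sharks_declarative(animals))
```

If an index on `family` were added later, the SQL text would stay the same and the engine could start using the index, whereas the imperative loop would have to be rewritten.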
MapReduce Querying
Some NoSQL databases, such as MongoDB, support MapReduce querying. MapReduce is neither a declarative query language nor a fully imperative query API, but somewhere in between.
The map and reduce functions are somewhat restricted in what they are allowed to do. They must be pure functions: they use only the data passed to them as input, cannot perform additional database queries, and must not have side effects.
Writing a MapReduce program is harder than writing a query. A usability problem with MapReduce is that you have to write two carefully coordinated JavaScript functions, which is often harder than writing a single query.
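The shape of a MapReduce query can be sketched in plain Python. This is not MongoDB's API: the `observations` data, `map_fn`, `reduce_fn`, and the tiny `run_map_reduce` driver are all hypothetical, and only show how two pure functions have to agree on the keys and values they exchange.

```python
from collections import defaultdict

# Hypothetical observation documents.
observations = [
    {"date": "1995-12-25", "family": "Sharks", "num_animals": 3},
    {"date": "1995-12-12", "family": "Sharks", "num_animals": 4},
    {"date": "1996-01-03", "family": "Sharks", "num_animals": 1},
    {"date": "1996-01-05", "family": "Felidae", "num_animals": 2},
]


def map_fn(doc):
    # Pure: reads only its argument and emits (key, value) pairs.
    if doc["family"] == "Sharks":
        yield doc["date"][:7], doc["num_animals"]  # key is "YYYY-MM"


def reduce_fn(key, values):
    # Pure: combines all values emitted for one key.
    return sum(values)


def run_map_reduce(docs, map_fn, reduce_fn):
    # Group emitted values by key, then reduce each group.
    groups = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


totals = run_map_reduce(observations, map_fn, reduce_fn)
print(totals)
```

Because neither function touches anything outside its arguments, the driver is free to run them in any order and on any machine, which is the point of the purity restriction.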
Graph-Like Data Models
The graph model suits many-to-many relationships. A graph consists of vertices and edges, neither of which needs a fixed schema. The vertices and edges do not all have to be of the same kind, all sorts of relationships can exist between them, and queries typically need to follow those relationships. Representing such data in the relational model is unintuitive: queries need a large number of joins, different kinds of vertices must be stored in different tables, and every table has a strict schema.
Graphs are flexible to build and easy to evolve:
- Graphs are not limited to homogeneous data: an equally powerful use of graphs is to provide a consistent way of storing completely different types of objects in a single datastore.
- Graphs are good for evolvability: as you add features to your application, a graph can easily be extended to accommodate changes in your application’s data structures.
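Property graph model
These features give graphs a great deal of flexibility for data modeling:
- No schema restricts which kinds of things can be associated with each other.
- Both the incoming and the outgoing edges of a vertex can be found efficiently.
- Extra information can be stored to describe each edge.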
In the property graph model, each vertex consists of:
- A unique identifier
- A set of outgoing edges
- A set of incoming edges
- A collection of properties (key-value pairs)
Each edge consists of:
- A unique identifier
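- The vertex at which the edge starts (the tail vertex)
- The vertex at which the edge ends (the head vertex)
- A label to describe the kind of relationship between the two vertices
- A collection of properties (key-value pairs)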
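The vertex and edge structure can be sketched as two small Python classes. This is an in-memory illustration, not how any particular graph database stores data. The class names, the `connect` helper, and the sample vertices are hypothetical, and the `head`, `label`, and edge `properties` fields follow the usual definition of the property graph model.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity-based equality avoids recursing through cycles
class Vertex:
    vertex_id: str
    properties: dict = field(default_factory=dict)
    outgoing: list = field(default_factory=list)  # edges that start here
    incoming: list = field(default_factory=list)  # edges that end here


@dataclass(eq=False)
class Edge:
    edge_id: str
    tail: Vertex  # vertex at which the edge starts
    head: Vertex  # vertex at which the edge ends
    label: str    # kind of relationship
    properties: dict = field(default_factory=dict)


def connect(edge_id, tail, head, label, **properties):
    # Register the edge on both endpoints so either direction is cheap to follow.
    edge = Edge(edge_id, tail, head, label, properties)
    tail.outgoing.append(edge)
    head.incoming.append(edge)
    return edge


lucy = Vertex("v1", {"type": "person", "name": "Lucy"})
idaho = Vertex("v2", {"type": "state", "name": "Idaho"})
usa = Vertex("v3", {"type": "country", "name": "United States"})

connect("e1", lucy, idaho, "born_in")
connect("e2", idaho, usa, "within")

print([e.label for e in idaho.incoming], [e.label for e in idaho.outgoing])
```

Nothing in the classes restricts which vertex types an edge may connect, which mirrors the lack of a schema constraint in the model.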
The Cypher Query Language
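Cypher is a declarative query language for property graphs, created for the Neo4j graph database.
If a graph is stored in a relational model and queried with SQL, the queries become extremely complicated, because a traversal may need a number of joins that is not known in advance.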
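To give a feel for that complexity, here is a sketch that stores a tiny graph in two relational tables and follows `within` edges to any depth with a recursive common table expression. It uses Python's standard `sqlite3` module; the table layout, the sample data, and the query are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vertices (vertex_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE edges (tail_vertex INTEGER, head_vertex INTEGER, label TEXT);
    INSERT INTO vertices VALUES
        (1, 'United States'), (2, 'Idaho'), (3, 'Boise'), (4, 'France');
    INSERT INTO edges VALUES (2, 1, 'within'), (3, 2, 'within');
""")

# Find the starting place plus everything that lies within it, at any depth.
query = """
WITH RECURSIVE in_place(vertex_id) AS (
    SELECT vertex_id FROM vertices WHERE name = ?
    UNION
    SELECT edges.tail_vertex
    FROM edges JOIN in_place ON edges.head_vertex = in_place.vertex_id
    WHERE edges.label = 'within'
)
SELECT name FROM vertices
JOIN in_place ON vertices.vertex_id = in_place.vertex_id
ORDER BY name
"""

places_in_usa = [name for (name,) in conn.execute(query, ("United States",))]
print(places_in_usa)
```

Even this single variable-length traversal needs a recursive definition and two joins; combining several such traversals in one query grows quickly in length.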
Triple-store model
The triple-store model is mostly equivalent to the property graph model, using different words to describe the same ideas.
In a triple-store, all information is stored in the form of very simple three-part statements: (subject, predicate, object).
Writing the same subject over and over is repetitive, but fortunately you can use semicolons to say multiple things about the same subject. This makes the Turtle format quite nice and readable.
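The three-part statements can be modeled in Python as plain tuples. The sample triples and the helpers `match` and `describe` are hypothetical; `describe` groups the statements about one subject, which is similar in spirit to what the semicolon shorthand does in Turtle.

```python
# Hypothetical triples: (subject, predicate, object).
triples = [
    ("lucy", "a", "Person"),
    ("lucy", "name", "Lucy"),
    ("lucy", "bornIn", "idaho"),
    ("idaho", "a", "Location"),
    ("idaho", "name", "Idaho"),
    ("idaho", "within", "usa"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


def describe(triples, subject):
    # Collect everything said about one subject into a single mapping.
    # Simplification: a repeated predicate keeps only its last object.
    return {p: o for (_, p, o) in match(triples, subject=subject)}


print(describe(triples, "lucy"))
print(match(triples, predicate="within"))
```

When the object is another subject's identifier (as in `bornIn`), the triple behaves like an edge; when it is a plain value (as in `name`), it behaves like a property.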
The semantic web
The Resource Description Framework (RDF) is designed for internet-wide data exchange: it was intended as a mechanism for different websites to publish data in a consistent format, allowing data from different websites to be automatically combined into a web of data, a kind of internet-wide “database of everything.”
RDF doesn’t distinguish between properties and edges but just uses predicates for both.
Cypher’s pattern matching is borrowed from SPARQL.
Graph Databases Compared to the Network Model
- The network model had a schema that specified which record type could link to which other record type (in Java terms, roughly a field declared as Type link). In a graph database, any vertex can have an edge to any other vertex (roughly a generic T link). This gives much greater flexibility for applications to adapt to changing requirements.
- In the network model, the only way to reach a record was to traverse one of the access paths to it. In a graph database, you can refer directly to any vertex by its unique ID, or you can use an index to find vertices with a particular value.
- In the network model, the children of a record were an ordered set, so the database had to maintain that ordering. In a graph database, vertices and edges are not ordered (you can only sort the results when making a query).
- In the network model, all queries were imperative. Graph databases support high-level, declarative query languages such as Cypher or SPARQL.
The Foundation: Datalog
Datalog is a much older language than SPARQL or Cypher; it provides the foundation that later query languages build upon. Cascalog is a Datalog implementation for querying datasets in Hadoop.
Datalog’s data model writes each fact as predicate(subject, object).
Cypher and SPARQL jump in right away with SELECT, but Datalog takes a small step at a time. It’s a very powerful approach, because rules can be combined and reused in different queries. It’s less convenient for simple one-off queries, but it can cope better if your data is complex.
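The "small step at a time" style can be imitated in Python with a naive fixed-point loop. The facts, the two rules written in the docstring, and the function name `within_recursive` are simplified assumptions for illustration; a real Datalog engine evaluates rules far more efficiently.

```python
# Hypothetical facts in predicate(subject, object) form.
facts = {
    ("within", "boise", "idaho"),
    ("within", "idaho", "usa"),
    ("within", "usa", "north_america"),
}


def within_recursive(facts):
    """Naive fixed-point evaluation of two rules:

    within_recursive(X, Y) :- within(X, Y).
    within_recursive(X, Z) :- within(X, Y), within_recursive(Y, Z).
    """
    direct = {(x, y) for (pred, x, y) in facts if pred == "within"}
    derived = set(direct)  # rule 1: every direct fact is also a derived fact
    while True:
        # Rule 2: extend each direct fact with the facts derived so far.
        new = {(x, z) for (x, y) in direct for (y2, z) in derived if y == y2}
        if new <= derived:
            return derived  # nothing new was derived: fixed point reached
        derived |= new


closure = within_recursive(facts)
print(sorted(closure))
```

Each rule derives a little more from what is already known, and the derived relation can then be reused by other rules, which is where the composability comes from.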
Summary
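Each model suits different scenarios, and the trend is for the models to converge.
Relational model:
- Suits many-to-many and many-to-one relationships
- Good join support; schema-on-write
- Declarative query language: SQL, which decouples the query logic from the database implementation and so makes the query optimizer possible
- One-to-many data is split across multiple tables, and reassembling it with joins makes retrieval less efficient
- Querying a graph with joins is hard to understand and laborious
- There is a mismatch with object-oriented design, which an ORM has to bridge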
Document model
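- Suits one-to-many relationships; join support is weak
- Flexible schema (schema-on-read)
- Query efficiency from data locality
- Query mechanism: MapReduce, which sits between a declarative query language and an imperative API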
Graph model
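- Any type of vertex can be connected to any other by any edge; the flexible schema makes it easy to extend and evolve an application
- Declarative query languages: Cypher, SPARQL
- Supports a wide range of queries over the graph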