Reposted from: https://blog.open-metadata.org/why-openmetadata-is-the-right-choice-for-you-59e329163cac
We’ve had an overwhelming response to the OpenMetadata project. A frequent question from users who are deciding which system to adopt is:
“What generation of metadata systems is OpenMetadata? Are you pull-based, push-based, or hybrid? How is it different from other open-source and commercial solutions?”
These questions have their roots in the excellent blog that captured the evolution of the metadata systems at LinkedIn. But what if my answer is that these categorizations are not that important? There are other important factors to consider when you start your journey to lay a strong foundation of metadata.
What architecture is right for me?
The current terminology used to distinguish generations of metadata system architectures makes sense in the context of how the system evolved at LinkedIn. This evolution could look entirely different in a different company depending on the experience of the engineering teams building these systems, and the availability of time and resources.
Architectural choices in large tech companies like Uber and LinkedIn reflect Conway’s law. System designers seek to minimize friction across organizational boundaries, leverage existing know-how and technology choices, and reuse building blocks that are already available.
As an example, at Uber, there are teams responsible for scalable compute infrastructure, Kafka as a service, ETL as a service, data warehouse services, database services, data lakes, data ingestion as a service, etc. In such an environment, the complexity of the architecture is secondary to reducing team friction with a clear division of responsibilities. As a result, no one worries about having a dependency (often unnecessary) on Kafka, HDFS, or Cassandra, or about splitting a service into a large number of microservices. The organization staffs teams with deep technical expertise for each type of service and can, therefore, operationalize complex architectures supported by a plethora of in-house tools and machinery.
This is not possible for every organization. As a consequence, when a new company is formed to productize open-source software developed in larger companies, the team needs to revisit many architectural choices to make the software broadly usable. These teams have a hard time understanding their real users. They continue to build the solutions as if their target customer is still their previous company (as happened in the case of Hortonworks, a company I co-founded). This is the main reason why we decided to build OpenMetadata from the ground up instead of open-sourcing what we had built at Uber.
A second influence on the architecture is whether the metadata system is built as a product or as a service. The second-generation systems mentioned in the LinkedIn blog provide metadata systems as a product. Instead of optimizing for organizational boundaries and teams, the primary focus is on user personas and use cases. The product team owns all aspects of the product, including what happens when pipelines that crawl and publish metadata break, when the Kafka cluster used for ingesting metadata has problems, or when the database has issues. There is no luxury of a Kafka team, a pipeline team, a database team, or infra/SRE teams to run to. A product must be delivered as a whole instead of as bits and pieces of services that fit into an existing stack. The simplicity of these architectures is by design. The architecture must rely on a small number of proven dependencies that customers can understand and operate.
OpenMetadata takes the product approach to architecting the system. The dependencies are kept to an absolute minimum: Jetty, MySQL, and Elasticsearch are proven and well-known technologies that our users can operationalize. All our schemas are already modeled, strongly typed, and available to our users out of the box, so they don’t need to spend time on modeling. Metadata is extensible: you can define your own tag categories and introduce extensions to existing entities. We provide rich APIs to publish and consume metadata and to receive metadata events, all designed for easy consumption by developers. As a user, you have the flexibility to keep the architecture simple. But if you still want to use Kafka for publishing metadata to services that have interfaces to consume from it, there are integration points to do just that.
Push-based vs. Pull-based confusion
There is a lot of confusion around these terms. There are two directions in which metadata flows — metadata ingestion from metadata sources to metadata store, and metadata consumption from metadata store by applications.
Metadata ingestion
In all metadata systems, including OpenMetadata, crawlers or ETL ingestion jobs pull the metadata from the source and push it into the metadata store using APIs, resulting in a pull-push design.
Perhaps due to its LinkedIn heritage, DataHub turns this into pull-push-pull-push: pull from metadata sources and push into Kafka, then pull from Kafka and push into the metadata store.
This might be a reasonable choice for companies that have a central Kafka team and whose common architectural pattern is to put data onto Kafka to move it from one place to another. However, this extra dependency on Kafka adds no tangible benefit. On the contrary, it adds a system dependency, increases operational complexity, and reduces the overall availability of the system by introducing complex failure modes. The often-cited benefit that ‘streaming metadata keeps metadata fresh and avoids ingestion job failures’ is not valid for most metadata sources (with the exception of Apache Hive) as they don’t produce metadata change events. No metadata system can be purely push-based. Even if such metadata change events become available (don’t hold your breath), other metadata, such as queries and data profiles, can’t be collected in a streaming fashion. So every system must run some kind of batch job against metadata sources and can’t avoid building strategies to keep the data as fresh as possible, handle job failures, and avoid overwhelming the sources with a heavy query load. Incidentally, Uber’s Databook, which is classified as a third-generation architecture, ingests metadata with a pull-push strategy.
So, as far as ingestion of metadata is concerned, all systems are pull-based.
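The pull-push ingestion pattern described above can be sketched in a few lines. This is a minimal illustration, not OpenMetadata's ingestion framework: the source is an in-memory SQLite database, and `InMemoryMetadataStore` with its `put_table` method is a hypothetical stand-in for a metadata store's create-or-update API.

```python
import sqlite3
from typing import Dict, List


def pull_table_metadata(conn: sqlite3.Connection) -> List[dict]:
    """Pull step: read table and column metadata from the source database."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    entities = []
    for (table_name,) in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table_name}")').fetchall()
        entities.append({
            "name": table_name,
            "columns": [{"name": c[1], "dataType": c[2]} for c in columns],
        })
    return entities


class InMemoryMetadataStore:
    """Hypothetical stand-in for a metadata store's create-or-update API."""

    def __init__(self) -> None:
        self.tables: Dict[str, dict] = {}

    def put_table(self, entity: dict) -> None:
        # Upsert keyed by table name, so rerunning the batch job is safe.
        self.tables[entity["name"]] = entity


def run_ingestion(conn: sqlite3.Connection, store: InMemoryMetadataStore) -> int:
    """One batch run: pull from the source, then push each entity to the store."""
    entities = pull_table_metadata(conn)
    for entity in entities:
        store.put_table(entity)
    return len(entities)


source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
source.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")

store = InMemoryMetadataStore()
ingested = run_ingestion(source, store)
```

Because the push step is an upsert, the same job can run on a schedule without duplicating entities, which is one simple way to handle reruns after a failure.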
Metadata consumption
Applications consume metadata using APIs. They can keep in sync with the changing metadata by consuming metadata change events. This can be done in several ways:
- Use APIs to periodically pull the change events. This is the simplest way to keep metadata in sync.
- Use APIs to subscribe for change events and get notifications as webhook callbacks.
- If your deployments need the change events from Kafka (where your backend systems already have Kafka clients), you can set up a change event sink to Kafka.
Here again, the dependency on Kafka is optional to keep the system as simple as possible. Even webhooks, which require a server to receive metadata events as POST requests, are optional; one can just poll the events API to keep the architecture simple.
The architecture needs to support both pull and push-based metadata consumption.
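The simplest of the options above, periodically pulling change events, can be sketched as follows. `EventLog`, `list_events`, and the event fields are hypothetical names for illustration and do not reflect OpenMetadata's actual Events API; the point is only that the consumer keeps a cursor and asks for everything newer than it.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ChangeEvent:
    sequence: int      # monotonically increasing position in the event log
    entity: str
    event_type: str


class EventLog:
    """Hypothetical stand-in for a metadata store's events API."""

    def __init__(self) -> None:
        self._events: List[ChangeEvent] = []

    def append(self, entity: str, event_type: str) -> None:
        self._events.append(ChangeEvent(len(self._events) + 1, entity, event_type))

    def list_events(self, after: int, limit: int = 100) -> List[ChangeEvent]:
        # Return up to `limit` events strictly newer than the caller's cursor.
        return [e for e in self._events if e.sequence > after][:limit]


class PollingConsumer:
    """Keeps a cursor so each poll only fetches events it has not seen yet."""

    def __init__(self, log: EventLog) -> None:
        self.log = log
        self.cursor = 0
        self.seen: List[ChangeEvent] = []

    def poll_once(self) -> int:
        events = self.log.list_events(after=self.cursor)
        for event in events:
            self.seen.append(event)
            self.cursor = event.sequence  # advance only after handling the event
        return len(events)


log = EventLog()
log.append("orders", "created")
log.append("orders", "updated")

consumer = PollingConsumer(log)
first = consumer.poll_once()
second = consumer.poll_once()
log.append("users", "created")
third = consumer.poll_once()
```

In a real deployment `poll_once` would run on a timer. A webhook or Kafka sink delivers the same events without the polling delay, at the cost of running a receiver or a broker.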
Schemas
One of the biggest issues with data is that poorly designed schemas make data hard to use. Undocumented schemas mean that data consumers don’t understand their data or, worse, misunderstand it. As data people, we should take these lessons to heart. We need to create well-modeled schemas that are strongly typed and well documented, with clear vocabularies. Metadata can’t be stored in property bags that no one other than the producer of the metadata can understand. Even approaches where metadata is modeled using an array of facets or aspects are just another way to model data as key-values, with no clear indication of which keys and values can be expected and which are optional. Unlike a metadata-as-a-service solution, which leaves metadata modeling to the users, the metadata-as-a-product approach must understand the user’s needs and model all the entities and types required by most users. For a few sophisticated users, schemas should be extensible with clear extension points.
OpenMetadata considers schemas the most important aspect of the system. We model schemas using JSON Schema, a powerful modeling language that is supported by excellent tools and can generate bindings for the programming language of your choice. Open metadata standards can help make metadata use ubiquitous, taking it beyond catalogs limited to discovery and toward data collaboration and automation. You can look at our schemas here.
Metadata must be well-modeled, strongly-typed, and clearly documented in order to be shareable and ubiquitously used to power innovations.
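To make the contrast with property bags concrete, here is a small sketch. `TABLE_SCHEMA` is a hypothetical schema written for this example, not one of OpenMetadata's published schemas, and `validate` is a hand-written check covering only three JSON Schema keywords (`type`, `required`, `additionalProperties`); a real project would use a full JSON Schema validator.

```python
from typing import Any, List

# Hypothetical schema for illustration; not an OpenMetadata schema.
TABLE_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "description": {"type": "string"},
        "columns": {"type": "array"},
    },
    "required": ["name", "columns"],
    "additionalProperties": False,
}

_JSON_TYPES = {"string": str, "array": list, "object": dict, "boolean": bool}


def validate(instance: Any, schema: dict) -> List[str]:
    """Check a small subset of JSON Schema and return a list of error messages."""
    if not isinstance(instance, dict):
        return ["instance is not an object"]
    errors = []
    properties = schema.get("properties", {})
    for key in schema.get("required", []):
        if key not in instance:
            errors.append(f"missing required property '{key}'")
    for key, value in instance.items():
        if key not in properties:
            if schema.get("additionalProperties", True) is False:
                errors.append(f"unexpected property '{key}'")
            continue
        type_name = properties[key]["type"]
        if not isinstance(value, _JSON_TYPES[type_name]):
            errors.append(f"property '{key}' is not of type {type_name}")
    return errors


typed_entity = {"name": "orders", "columns": [{"name": "id"}]}
property_bag = {"name": "orders", "cols": "id,amount"}

typed_errors = validate(typed_entity, TABLE_SCHEMA)
bag_errors = validate(property_bag, TABLE_SCHEMA)
```

The schema states which keys are required, which are optional (`description`), and which are not allowed at all, so a consumer does not have to guess what a producer meant by `cols`.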
APIs
APIs are driving digital transformation across many industries. We believe metadata APIs are at the heart of transforming data. A metadata-as-a-product approach must understand developer needs and provide well-designed, easy-to-use APIs to unlock innovation. APIs are a key differentiating factor between metadata systems. Most systems lack APIs, and when APIs are available, they are unusable and do not follow the best practices of API design.
The following are critical for ubiquitous metadata use:
- CRUD APIs to create/retrieve/update/delete entities and relationships
- Listing APIs with pagination support
- Events APIs for pulling metadata events or pushing them through webhooks
- Search APIs for both keyword and advanced searches
- Suggest APIs for building user interfaces
Requests and responses should have well-designed, strongly-typed schemas with clear descriptions. A lot of time and effort has gone into building OpenMetadata APIs and all the above APIs are supported. You can see more details about our APIs here.
Well-designed, easy-to-use and comprehensive APIs simplify metadata consumption and enable a new generation of metadata-powered applications and automation.
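As one illustration of the listing APIs with pagination mentioned above, the sketch below uses a cursor: each response carries the position from which the next request should continue. The function names and the `after` parameter are assumptions made for this example and are not taken from OpenMetadata's API.

```python
from typing import List, Optional, Tuple


def list_entities(
    names: List[str], limit: int, after: Optional[str] = None
) -> Tuple[List[str], Optional[str]]:
    """Return one page of names in sorted order plus a cursor for the next page.

    The cursor is the last name on the page, or None when no more pages remain.
    """
    if limit < 1:
        raise ValueError("limit must be at least 1")
    ordered = sorted(names)
    if after is not None:
        ordered = [n for n in ordered if n > after]
    page = ordered[:limit]
    next_cursor = page[-1] if len(ordered) > limit else None
    return page, next_cursor


def list_all(names: List[str], limit: int) -> List[str]:
    """Client-side loop: follow cursors until the server reports no next page."""
    collected: List[str] = []
    cursor: Optional[str] = None
    while True:
        page, cursor = list_entities(names, limit, cursor)
        collected.extend(page)
        if cursor is None:
            return collected


catalog = ["users", "orders", "payments", "events", "carts"]
first_page, first_cursor = list_entities(catalog, limit=2)
everything = list_all(catalog, limit=2)
```

A cursor based on the sort key stays stable when entities are added or removed between requests, which an offset-based page number does not guarantee.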
Putting it all together
The metadata ingestion and consumption interfaces supported by OpenMetadata are shown below. As a user, you have the flexibility to use different ingestion and consumption interfaces depending on the level of complexity you can handle and the support afforded by your technology stack. For most users interested in simplicity, the ingestion and consumption paths shown in yellow are optional. Sophisticated users can set up the more advanced ingestion and consumption features for better metadata freshness when sources can generate change events.
Conclusion
Let’s go back to the questions we started with. What generation of architecture is OpenMetadata? As you can guess, the terms currently used are not relevant beyond LinkedIn’s evolution. OpenMetadata is a ‘metadata as a product’ system with the goal of centralizing all metadata using open standards and making it shareable by all the tools in the data ecosystem through well-modeled schemas and easy-to-use APIs. The focus is on architectural simplicity with minimal dependencies to help our users easily operationalize the system.
Is OpenMetadata pull-based, push-based, or hybrid? Again, all systems are mainly pull-based when integrating with metadata sources. We support push-based ingestion when it is possible. As far as metadata consumption is concerned, we support both pull-based and push-based consumption with the Events API. Our focus is on serving user needs with the flexibility of push and pull through easy-to-use APIs.
OpenMetadata is a fresh start on how to do metadata right from first principles. We are applying what we learned from attempts to build three different metadata systems over a decade. Metadata should be a solved problem by now. However, we still see many in-house systems and other solutions being built in this space. If you are building an in-house system, come join our fast-growing community so we can build it together. If you are stuck with a legacy system, let’s build a migration path together toward a better solution. With metadata standards and APIs, the data ecosystem can go a long way toward eliminating narrow tools, simplifying architecture, and unlocking innovation. It takes an open-source community to get there.