Data Lineage
Journey of data from origin to endpoint.
Data lineage in a data catalog outlines the lifecycle of data, detailing its journey from origin to endpoint across various processes and transformations. It offers a clear, visual map of the data's provenance, the modifications it undergoes, and its final location, capturing every step, transformation, and process the data encounters between source and destination.
Data lineage matters because it helps ensure data integrity, supports compliance with regulations, enables root cause analysis of errors, and strengthens overall data governance and management. By providing a comprehensive view of data's journey, data lineage is an indispensable tool for maintaining high-quality, reliable data in a data catalog.
Pentaho Data Catalog adheres to the OpenLineage standard.
OpenLineage is an open platform for collection and analysis of data lineage. It tracks metadata about datasets, jobs, and runs, giving users the information required to identify the root cause of complex issues and understand the impact of changes. OpenLineage contains an open standard for lineage data collection, a metadata repository reference implementation (Marquez), libraries for common languages, and integrations with data pipeline tools.
At the core of OpenLineage is a standard API for capturing lineage events. Pipeline components, such as schedulers, warehouses, analysis tools, and SQL engines, can use this API to send data about runs, jobs, and datasets to a compatible OpenLineage backend for analysis.
OpenLineage supports both simple deployments with single consumers and complex deployments with multiple consumers.
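To make the API concrete, here is a minimal sketch that posts a hand-built run event to a backend's `/api/v1/lineage` endpoint (the path exposed by Marquez and other OpenLineage consumers). The backend URL, namespace, job name, and dataset names are placeholders; the field names follow the published OpenLineage event schema.

```python
import uuid
import datetime

import requests  # third-party: pip install requests

# Placeholder backend URL; Marquez listens on port 5000 by default.
BACKEND = "http://localhost:5000"

event = {
    "eventType": "COMPLETE",  # one of START, RUNNING, COMPLETE, ABORT, FAIL, OTHER
    "eventTime": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    "producer": "https://example.com/my-etl/v1.0",  # identifies the emitting tool
    "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json",
    "run": {"runId": str(uuid.uuid4())},  # unique per job execution
    "job": {"namespace": "my_pipeline", "name": "daily_orders_load"},
    "inputs": [{"namespace": "postgres://warehouse", "name": "public.orders"}],
    "outputs": [{"namespace": "postgres://warehouse", "name": "analytics.daily_orders"}],
}

# A single consumer receives the event; in a multi-consumer deployment the
# same payload could be posted to each backend in turn.
requests.post(f"{BACKEND}/api/v1/lineage", json=event, timeout=10)
```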
The OpenLineage object model is designed to capture the metadata about data processes and their lineage in a standardized format. The core components of this model include:
Job: Represents a processing activity, such as a SQL query or ETL job. A job can have multiple runs and produce one or more datasets.
Run: An instance of a job execution. Each run can have metadata including start time, end time, and status.
Dataset: Represents a collection of data, such as a database table or a file. It includes metadata about its schema, version, and location.
Facet: Provides additional metadata about jobs, runs, and datasets, like data quality metrics, schema details, or pipeline dependencies.
These components work together to provide a comprehensive view of data lineage, allowing users to track data through various transformations and understand its origins and dependencies.
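As a sketch of how these objects fit together, the snippet below uses the openlineage-python client to emit START and COMPLETE events for one run of a job, attaching a schema facet to the output dataset. Module paths and class names follow the client's documented layout but can vary across client versions, and the backend URL, namespaces, and names are placeholders.

```python
import uuid
from datetime import datetime, timezone

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState
from openlineage.client.facet import SchemaDatasetFacet, SchemaField

PRODUCER = "https://example.com/my-etl/v1.0"  # placeholder producer URI

client = OpenLineageClient(url="http://localhost:5000")  # placeholder backend

# One Run of one Job: the runId ties the START and COMPLETE events together.
run = Run(runId=str(uuid.uuid4()))
job = Job(namespace="my_pipeline", name="daily_orders_load")

# A Facet enriches the output Dataset with schema metadata.
output = Dataset(
    namespace="postgres://warehouse",
    name="analytics.daily_orders",
    facets={
        "schema": SchemaDatasetFacet(
            fields=[
                SchemaField(name="order_date", type="DATE"),
                SchemaField(name="order_count", type="BIGINT"),
            ]
        )
    },
)

# Emit the run lifecycle: START when processing begins, COMPLETE when it ends.
for state in (RunState.START, RunState.COMPLETE):
    client.emit(
        RunEvent(
            eventType=state,
            eventTime=datetime.now(timezone.utc).isoformat(),
            run=run,
            job=job,
            producer=PRODUCER,
            inputs=[Dataset(namespace="postgres://warehouse", name="public.orders")],
            outputs=[output],
        )
    )
```

A backend that consumes these events can reconstruct the lineage graph: the job links its input and output datasets, the run records when and how a particular execution happened, and the facet carries the schema details.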