Worker Server
What is a worker?
Pentaho Data Catalog (PDC) uses worker processes to implement virtually all of its data analytics functions. Most functions run as a single primary worker process that Data Catalog launches in response to a user action or a scheduled action. Some primary worker processes might also start secondary worker processes.
The following table lists the worker processes:
Test Connection
Returns detailed success or failure information for each step of the test. Data Catalog starts this worker process when you configure or update a data source connection. Data Catalog marks the data source “OFFLINE” until a successful test completes.
• Connect to data source
• Authenticate
• Retrieve list of schemas and store in MongoDB
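The step-by-step reporting described above can be sketched as follows. This is only an illustration, not Data Catalog's implementation: `StepResult`, `run_connection_test`, `data_source_status`, and the "ONLINE" status name are hypothetical, and the step actions are stand-in callables rather than real database or MongoDB calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StepResult:
    name: str
    ok: bool
    detail: str


def run_connection_test(steps: List[Tuple[str, Callable[[], str]]]) -> List[StepResult]:
    """Run the steps in order, stopping at the first failure.

    Every attempted step is reported, so the caller sees which step failed.
    """
    results: List[StepResult] = []
    for name, action in steps:
        try:
            results.append(StepResult(name, True, action()))
        except Exception as exc:  # record the failure instead of raising
            results.append(StepResult(name, False, str(exc)))
            break
    return results


def data_source_status(results: List[StepResult], expected_steps: int) -> str:
    """The source stays OFFLINE unless every step ran and succeeded.

    "ONLINE" is an assumed name; the source only names the OFFLINE state.
    """
    complete = len(results) == expected_steps and all(r.ok for r in results)
    return "ONLINE" if complete else "OFFLINE"


def fail_auth() -> str:
    raise PermissionError("invalid credentials")


steps = [
    ("Connect to data source", lambda: "connected"),
    ("Authenticate", fail_auth),
    ("Retrieve list of schemas", lambda: "3 schemas found"),
]
results = run_connection_test(steps)
for r in results:
    print(r.name, "OK" if r.ok else "FAILED", r.detail)
print(data_source_status(results, expected_steps=len(steps)))
```

In this sketch the third step never runs because authentication fails, so the data source remains OFFLINE.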
Metadata Ingest
Ingests the metadata for one or more schemas.
• Read schema from data source and store in MongoDB
Data Profiling
Generates a variety of statistics and intermediate data with a single pass through the source data.
Typically, this is the first process you run on your data.
• Create bitset
• Create HyperLogLogs (HLL) for full data
• Generate statistics (numeric and string related)
• Generate data patterns
• Lucene indexing (optional)
• Extract samples for viewing (fewer than 100)
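To illustrate what "a single pass through the source data" can look like, the sketch below collects numeric and string statistics, data patterns, and a capped sample in one loop over a column. It is a simplified assumption, not the product's algorithm: `profile_column`, `pattern_of`, and the pattern alphabet (9 for digits, A and a for letters) are hypothetical, and the bitset, HyperLogLog, and Lucene steps are left out.

```python
from collections import Counter
from typing import Any, Dict, Iterable

MAX_SAMPLES = 99  # assumption: the source only says fewer than 100 samples


def pattern_of(text: str) -> str:
    """Map digits to 9, uppercase letters to A, other letters to a."""
    out = []
    for ch in text:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A" if ch.isupper() else "a")
        else:
            out.append(ch)  # keep punctuation and spaces as they are
    return "".join(out)


def profile_column(values: Iterable[Any]) -> Dict[str, Any]:
    """Collect statistics, patterns, and samples in one pass over the values."""
    count = nulls = numbers = 0
    total = 0.0
    low = high = None
    min_len = max_len = None
    patterns: Counter = Counter()
    samples = []
    for value in values:
        count += 1
        if value is None:
            nulls += 1
            continue
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            numbers += 1
            total += value
            low = value if low is None else min(low, value)
            high = value if high is None else max(high, value)
        text = str(value)
        min_len = len(text) if min_len is None else min(min_len, len(text))
        max_len = len(text) if max_len is None else max(max_len, len(text))
        patterns[pattern_of(text)] += 1
        if len(samples) < MAX_SAMPLES:
            samples.append(value)
    return {
        "count": count,
        "nulls": nulls,
        "numeric_min": low,
        "numeric_max": high,
        "numeric_mean": total / numbers if numbers else None,
        "min_length": min_len,
        "max_length": max_len,
        "patterns": dict(patterns),
        "samples": samples,
    }


profile = profile_column(["AB-12", "cd-34", None, "EF-56"])
print(profile["patterns"])
```

Because everything is accumulated inside one loop, the input can be a generator that streams rows from the source without holding the whole column in memory.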
Data Identification
Identifies and tags columns and tables using ontology information (dictionaries, aliases), along with underlying data and metadata.
• Tag columns based on dictionaries
• Tag columns based on metadata and aliases
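The two tagging steps above can be pictured with a minimal sketch. The function names, the match threshold, and the sample dictionary and alias data are all hypothetical; the source does not describe how Data Catalog scores a match.

```python
from typing import Dict, Iterable, List, Optional, Set


def tag_by_dictionary(
    values: Iterable[Optional[str]],
    dictionaries: Dict[str, Set[str]],
    threshold: float = 0.7,  # assumed cutoff, not a documented default
) -> List[str]:
    """Tag a column when enough of its non-null values appear in a dictionary."""
    cleaned = [str(v).strip().lower() for v in values if v is not None]
    if not cleaned:
        return []
    tags = []
    for tag, terms in dictionaries.items():
        hits = sum(1 for v in cleaned if v in terms)
        if hits / len(cleaned) >= threshold:
            tags.append(tag)
    return tags


def tag_by_alias(column_name: str, aliases: Dict[str, Set[str]]) -> List[str]:
    """Tag a column when its name (metadata) matches a known alias."""
    normalized = column_name.strip().lower()
    return [tag for tag, names in aliases.items() if normalized in names]


dictionaries = {"country": {"france", "japan", "brazil"}}
aliases = {"country": {"country", "ctry", "nation"}}

print(tag_by_dictionary(["France", "Japan", "Brazil", "Atlantis"], dictionaries))
print(tag_by_alias("CTRY", aliases))
```

The first call inspects the underlying data and the second inspects only the column name, which mirrors the split between dictionary-based and metadata-based tagging.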
Key Discovery
Performs a variety of key discovery actions. Foreign key discovery requires that Data Profiling has already completed for the data sources involved.
• Foreign key discovery
• Superkey identification
• Composite key discovery
• Compound key discovery
• Secondary key discovery
• Natural and surrogate key identification
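As a rough illustration of two of these actions, the sketch below finds minimal unique column combinations (candidate single-column and composite keys) and checks whether one column could be a foreign key into another. It is an assumption about the general technique, not the product's algorithm: the function names are hypothetical, and it works on exact in-memory values rather than on profiling output.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]


def is_unique(rows: Sequence[Row], columns: Tuple[str, ...]) -> bool:
    """True when no two rows share the same values in the given columns."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False
        seen.add(key)
    return True


def minimal_candidate_keys(
    rows: Sequence[Row], columns: Sequence[str], max_width: int = 2
) -> List[Tuple[str, ...]]:
    """Find unique column combinations, skipping supersets of keys already found."""
    found: List[Tuple[str, ...]] = []
    for width in range(1, max_width + 1):
        for combo in combinations(columns, width):
            if any(set(key) <= set(combo) for key in found):
                continue  # a superset of a key is a superkey, not a minimal key
            if is_unique(rows, combo):
                found.append(combo)
    return found


def is_foreign_key_candidate(
    child_rows: Sequence[Row], child_col: str,
    parent_rows: Sequence[Row], parent_col: str,
) -> bool:
    """Child values must all exist in a parent column that is itself unique."""
    parent_values = [r[parent_col] for r in parent_rows]
    if len(set(parent_values)) != len(parent_values):
        return False
    child_values = {r[child_col] for r in child_rows if r[child_col] is not None}
    return bool(child_values) and child_values <= set(parent_values)


orders = [
    {"order_id": 1, "line": 1, "customer_id": 10},
    {"order_id": 1, "line": 2, "customer_id": 10},
    {"order_id": 2, "line": 1, "customer_id": 11},
]
customers = [{"customer_id": 10}, {"customer_id": 11}, {"customer_id": 12}]

print(minimal_candidate_keys(orders, ["order_id", "line", "customer_id"]))
print(is_foreign_key_candidate(orders, "customer_id", customers, "customer_id"))
```

A unique combination found on a small data set is only a candidate: more data can always break it, which is one reason key discovery is treated as a discovery task rather than a proof.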
Data Quality
Performs a full data quality (DQ) analysis on the underlying data, using regular expressions and other configurable business rules.
• RegEx matching
• Data pattern analysis
• Update column statistics
• Evaluate column DQ rules
• Evaluate row-relative DQ rules
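A column rule based on a regular expression and a rule that compares values within one row might look like the sketch below. The helper names, the sample rules, and the reading of "row-relative" as a rule that compares columns in the same row are assumptions made for illustration only.

```python
import re
from typing import Callable, Dict, List, Optional, Sequence

Row = Dict[str, Optional[str]]


def regex_pass_rate(rows: Sequence[Row], column: str, pattern: str) -> float:
    """Fraction of non-null values in a column that fully match the pattern."""
    compiled = re.compile(pattern)
    values = [r[column] for r in rows if r[column] is not None]
    if not values:
        return 1.0  # assumed convention: nothing to check means nothing failed
    matches = sum(1 for v in values if compiled.fullmatch(v))
    return matches / len(values)


def row_rule_failures(rows: Sequence[Row], rule: Callable[[Row], bool]) -> List[int]:
    """Indexes of the rows for which the row-level rule returns False."""
    return [i for i, row in enumerate(rows) if not rule(row)]


rows = [
    {"zip": "12345", "start": "2024-01-01", "end": "2024-02-01"},
    {"zip": "1234", "start": "2024-03-01", "end": "2024-03-15"},
    {"zip": "98765", "start": "2024-05-10", "end": "2024-05-01"},
    {"zip": "54321", "start": "2024-06-01", "end": "2024-06-30"},
]

# Column rule: a ZIP code is exactly five digits.
print(regex_pass_rate(rows, "zip", r"\d{5}"))

# Row rule: the start date must not be after the end date.
# ISO-formatted dates compare correctly as plain strings.
print(row_rule_failures(rows, lambda r: r["start"] <= r["end"]))
```

Keeping each rule as a small, separate check makes it possible to report a score per column and a list of failing rows independently.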
Sensitive Data Discovery (SDD)
Performs the SDD tasks that go beyond data identification. This process uses flows, lineage, foreign keys, and other information to assemble the items that make up PI and PII.
• Generate a separate SDD Lucene index that cross-references data