Worker Server

What is a worker?

Pentaho Data Catalog (PDC) uses worker processes to implement virtually all of its data analytics functions. Most functions run as a single primary worker process that Data Catalog launches in response to a user action or a scheduled action. Some primary worker processes also start secondary worker processes.

The following entries list the worker processes. Each entry gives the process name, its description, and the actions performed:

Test Connection

Returns detailed success or failure information for each step of the test. Data Catalog starts this worker process when you configure or update a data source connection. Data Catalog marks the data source “OFFLINE” until a successful test completes.

• Connect to data source

• Authenticate

• Retrieve list of schemas and store in MongoDB
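The step-by-step reporting described above can be sketched as follows. This is a minimal illustration, not Data Catalog's implementation: `StepResult`, `run_connection_test`, and the step callables are hypothetical names, and the real worker's result format is not documented here.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    name: str
    ok: bool
    detail: str


def run_connection_test(steps):
    """Run (name, callable) steps in order and record the outcome of each.

    Stops at the first failing step, because later steps (for example,
    listing schemas) depend on the earlier ones having succeeded.
    """
    results = []
    for name, action in steps:
        try:
            detail = action()
            results.append(StepResult(name, True, str(detail)))
        except Exception as exc:
            results.append(StepResult(name, False, f"{type(exc).__name__}: {exc}"))
            break
    # The test passes only if every step ran and succeeded.
    passed = bool(steps) and len(results) == len(steps) and all(r.ok for r in results)
    return results, passed
```

A caller would pass one callable per action (connect, authenticate, retrieve schemas) and keep the data source marked as offline whenever `passed` is false.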

Metadata Ingest

Ingests the metadata for one or more schemas.

• Read schema from data source and store in MongoDB
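As an illustration of reading a schema and shaping it for a document store, the sketch below uses an SQLite database as a stand-in data source. The function name `read_schema_documents` and the document layout are assumptions made for this example; the structure that Data Catalog actually stores in MongoDB is not described here.

```python
import sqlite3


def read_schema_documents(conn, schema_name="main"):
    """Return one metadata document (a dict) per table in a SQLite database."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    documents = []
    for (table,) in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        documents.append({
            "schema": schema_name,
            "table": table,
            "columns": [
                {
                    "name": name,
                    "type": col_type,
                    "nullable": not notnull,
                    "primary_key": pk > 0,
                }
                for _cid, name, col_type, notnull, _default, pk in columns
            ],
        })
    return documents
```

Each returned dictionary could then be written to a document database as a single record; that storage step is omitted so the sketch needs only the standard library.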

Data Profiling

Generates a variety of statistics and intermediate data with a single pass through the source data.

Typically, this is the first process you run on your data.

• Create bitset

• Create HyperLogLogs (HLL) for full data

• Generate statistics (numeric and string related)

• Generate data patterns

• Lucene indexing (optional)

• Extract samples for viewing (fewer than 100)
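To show how one pass over a column can feed several of these outputs at once, here is a minimal sketch. It is not Data Catalog's code: the class `ColumnProfile`, the pattern alphabet (`9` for digits, `A`/`a` for letters), the SHA-1 hash, and the register count are all assumptions chosen for illustration. The distinct-count part follows the standard HyperLogLog estimator with the small-range correction; the bitset and Lucene steps are not covered.

```python
import hashlib
import math
from collections import Counter


def value_pattern(text):
    """Map digits to '9' and letters to 'A' or 'a'; keep other characters."""
    out = []
    for ch in text:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A" if ch.isupper() else "a")
        else:
            out.append(ch)
    return "".join(out)


class ColumnProfile:
    """Accumulates statistics for one column in a single pass."""

    def __init__(self, precision=12, max_samples=5):
        self.p = precision
        self.registers = [0] * (1 << precision)  # HyperLogLog registers
        self.count = 0
        self.nulls = 0
        self.min_len = None
        self.max_len = None
        self.patterns = Counter()
        self.samples = []
        self.max_samples = max_samples

    def add(self, value):
        self.count += 1
        if value is None:
            self.nulls += 1
            return
        text = str(value)
        n = len(text)
        self.min_len = n if self.min_len is None else min(self.min_len, n)
        self.max_len = n if self.max_len is None else max(self.max_len, n)
        self.patterns[value_pattern(text)] += 1
        if len(self.samples) < self.max_samples:
            self.samples.append(text)
        # HyperLogLog update: the first p bits of a 64-bit hash pick a register,
        # and the position of the first 1-bit in the remaining bits is the rank.
        h = int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest()[:8], "big")
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def distinct_estimate(self):
        m = len(self.registers)
        alpha = 0.7213 / (1 + 1.079 / m)
        estimate = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * m and zeros:
            estimate = m * math.log(m / zeros)  # small-range correction
        return estimate
```

Calling `add` once per value keeps memory bounded regardless of column size, which is the main reason to prefer a sketch such as HyperLogLog over an exact set of distinct values.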

Data Identification

Identifies and tags columns and tables using ontology information (dictionaries, aliases), along with underlying data and metadata.

• Tag columns based on dictionaries

• Tag columns based on metadata and aliases
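The two tagging actions above can be sketched as a single function. This is an illustrative assumption, not the product's matching logic: `tag_column`, the shape of the `dictionaries` and `aliases` arguments, and the 0.8 match threshold are all hypothetical.

```python
def tag_column(column_name, values, dictionaries, aliases, min_match=0.8):
    """Return the set of tags suggested for one column.

    dictionaries: tag -> set of lowercase terms expected in the data
    aliases:      tag -> set of lowercase column names that imply the tag
    """
    tags = set()

    # Metadata-based tagging: compare the column name with known aliases.
    normalized_name = column_name.strip().lower()
    for tag, names in aliases.items():
        if normalized_name in names:
            tags.add(tag)

    # Data-based tagging: tag when enough non-null values appear in a dictionary.
    non_null = [str(v).strip().lower() for v in values if v is not None]
    if non_null:
        for tag, terms in dictionaries.items():
            hits = sum(1 for v in non_null if v in terms)
            if hits / len(non_null) >= min_match:
                tags.add(tag)
    return tags
```

Using a threshold rather than requiring every value to match lets a column with a few dirty values still receive its tag.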

Key Discovery

Performs a variety of key discovery actions. Foreign key discovery requires that Data Profiling of the data sources has completed.

• Foreign key discovery

• Superkey identification

• Composite key discovery

• Compound key discovery

• Secondary key discovery

• Natural and surrogate key identification
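A few of these checks can be illustrated on small in-memory data. The sketch below is a simplified assumption: the function names are hypothetical, and it works on raw values, whereas the real worker presumably relies on the artifacts produced by Data Profiling. It tests single-column uniqueness, searches for minimal multi-column unique sets, and measures how much of a child column is contained in a parent column (an inclusion check that is commonly used to propose foreign keys).

```python
from itertools import combinations


def is_candidate_key(values):
    """A column is a candidate key if it has no nulls and no duplicates."""
    non_null = [v for v in values if v is not None]
    return len(non_null) == len(values) and len(set(non_null)) == len(non_null)


def minimal_unique_column_sets(rows, columns, max_size=2):
    """Find column combinations whose value tuples are unique across rows."""
    found = []
    for size in range(1, max_size + 1):
        for combo in combinations(columns, size):
            # Skip supersets of a combination that is already unique.
            if any(set(key) <= set(combo) for key in found):
                continue
            keys = [tuple(row[c] for c in combo) for row in rows]
            if len(set(keys)) == len(keys):
                found.append(combo)
    return found


def foreign_key_coverage(child_values, parent_values):
    """Fraction of distinct non-null child values that exist in the parent."""
    child = {v for v in child_values if v is not None}
    if not child:
        return 0.0
    parent = set(parent_values)
    return len(child & parent) / len(child)
```

A coverage of 1.0 against a parent column that is itself a candidate key makes the pair a plausible foreign key; lower values may still be worth reviewing when the data is known to be dirty.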

Data Quality

Performs a full data quality (DQ) analysis on the underlying data, using regular expressions and other configurable business rules.

• RegEx matching

• Data pattern analysis

• Update column statistics

• Evaluate column DQ rules

• Evaluate row-relative DQ rules
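The difference between a column rule and a row-relative rule can be sketched as follows. The function names and result shapes are hypothetical, and "row-relative" is interpreted here as a rule that compares values within the same row, which is an assumption about the product's terminology.

```python
import re


def evaluate_column_rule(values, pattern):
    """Count how many non-null values fully match a regular expression."""
    regex = re.compile(pattern)
    checked = [v for v in values if v is not None]
    passed = sum(1 for v in checked if regex.fullmatch(str(v)))
    return {"checked": len(checked), "passed": passed, "failed": len(checked) - passed}


def evaluate_row_rule(rows, predicate):
    """Return the indexes of rows for which the predicate does not hold."""
    return [i for i, row in enumerate(rows) if not predicate(row)]
```

For example, a five-digit postal code rule would be `evaluate_column_rule(values, r"\d{5}")`, and a rule that a start date must not follow an end date would be `evaluate_row_rule(rows, lambda r: r["start"] <= r["end"])` when both dates are ISO 8601 strings.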

Sensitive Data Discovery (SDD)

Performs the SDD tasks that go beyond data identification. This process uses flows, lineage, foreign keys, and other information to assemble the items that make up PI and PII.

• Generate a separate SDD Lucene index that cross-references data
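One way to picture a cross-referencing index is an inverted index from each value to the columns in which it occurs. The sketch below uses a plain dictionary instead of Lucene, and both the function names and the interpretation of "cross-references" are assumptions made for illustration only.

```python
from collections import defaultdict


def build_cross_reference(tables):
    """Map each normalized value to the (table, column) pairs that contain it.

    tables: table name -> {column name -> list of values}
    """
    index = defaultdict(set)
    for table, columns in tables.items():
        for column, values in columns.items():
            for value in values:
                if value is not None:
                    index[str(value).strip().lower()].add((table, column))
    return index


def shared_values(index):
    """Keep only the values that occur in more than one column."""
    return {value: locations for value, locations in index.items() if len(locations) > 1}
```

Values that appear in several columns, such as the same email address in two tables, are the ones that link otherwise separate pieces of personal data together.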

