Data Identification Methods

Data Dictionaries & Patterns


A Pentaho Data Catalog Data Identification policy is a combination of data dictionaries and data patterns.

Accessing Your Catalog

To access your catalog, follow these steps:

  1. Open the Google Chrome web browser and click the Pentaho Data Catalog bookmark, or navigate to https://pdc.pentaho.example/

  2. Enter the following username and password, then click Sign In.

Username    data_steward@hv.com
Password    Welcome123!

Security Advisory: Handling Login Credentials

For enhanced security, it is strongly recommended that users avoid saving their login details directly in web browsers. Browsers may inadvertently autofill these credentials in unrelated fields, posing a security risk.

Best Practice

• Disable Autofill: To mitigate potential risks, users should disable the autofill functionality for login credentials in their browser settings. This preventive measure ensures that sensitive information is not unintentionally exposed or misused.


Data Dictionary vs. Data Catalog

Data dictionaries contain technical information about data assets, such as data sources, fields, and data types. They are typically used by technical audiences, such as data engineers and data analysts, to understand the data. Data catalogs contain much broader and deeper data intelligence than data dictionaries do.


Data Patterns in Data Identification

Data patterns play a crucial role in identifying and categorizing data within a data catalog. These patterns are essentially recurring characteristics or behaviors in data sets that can be recognized and used to automate data management.

The 'Getting Started' > 'Identify the data' section explained how data patterns are used to profile the data.

Data Pattern Analysis reduces each data item to a simple pattern, essentially performing dimensional reduction on each character position in the input text. The result is a string that indicates where alphabetic characters, numeric characters, symbols, and whitespace appear. The idea is similar to other forms of data reduction: if our data comes from a certain probability distribution, we can reduce its size by estimating the parameters of that distribution.

For example, KT-1734B generates a data pattern of “AA-nnnnA” to indicate two letters, followed by a dash, followed by four digits and another letter.

Case sensitivity could optionally be tracked as well, and the set of “significant” symbols might be user-configurable (e.g., “As a data quality engineer, for this column, a dash and an underscore are significant”).

The base process iterates over every character in the data item and performs a simple character-for-character substitution, resulting in a “data pattern” string for the item.

The pattern consists of the following characters:

Character   Description
a           lower-case alphabetic character
A           upper-case alphabetic character
n           digit 0..9
w           whitespace character (space, tab)
s           symbol character (e.g., -/|!£$%^&*()+=[]{}@#~;:,.?¬¥§¢" )
-           some other character (control, special symbol, etc.)

Any other symbol may be treated as “significant” (such as a dash, underscore, or colon). These are output as-is in the generated data pattern for the entry.
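
To make the substitution pass concrete, here is a minimal sketch in Python. The function name, the default significant-symbol set, and the symbol list are assumptions for illustration, not PDC's actual implementation:

```python
# Minimal sketch of the character-for-character substitution described above.
# data_pattern() and its defaults are illustrative, not PDC's implementation.
def data_pattern(value: str, significant: str = "-_:") -> str:
    out = []
    for ch in value:
        if ch in significant:
            out.append(ch)        # "significant" symbols pass through as-is
        elif ch.islower():
            out.append("a")       # lower-case alphabetic
        elif ch.isupper():
            out.append("A")       # upper-case alphabetic
        elif ch.isdigit():
            out.append("n")       # digit 0..9
        elif ch in " \t":
            out.append("w")       # whitespace
        elif ch in '/|!£$%^&*()+=[]{}@#~;:,.?¬¥§¢"':
            out.append("s")       # symbol character
        else:
            out.append("-")       # control characters, other specials
    return "".join(out)

print(data_pattern("KT-1734B"))   # -> AA-nnnnA
print(data_pattern("KT127-3"))    # -> AAnnn-n
```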

Additional tests could be built into the algorithm to look for certain additional characteristics. For example, date formats can be very tricky. PDC could observe that ‘nn/nn/nnnn’ is a date and could then observe whether it is predominantly ‘mm/dd/yyyy’ or ‘dd/mm/yyyy’; one such follow-up test is sketched below.
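
A sketch of such a follow-up test, with illustrative names and logic (not PDC's actual implementation): it counts only the unambiguous values, where one field exceeds 12 and therefore cannot be a month.

```python
# Sketch: classify nn/nn/nnnn values as mm/dd/yyyy vs. dd/mm/yyyy by counting
# unambiguous cases (a field greater than 12 cannot be a month).
import re

def date_order(values):
    mm_dd = dd_mm = 0
    for v in values:
        m = re.fullmatch(r"(\d{2})/(\d{2})/\d{4}", v)
        if not m:
            continue
        first, second = int(m.group(1)), int(m.group(2))
        if first > 12 >= second:
            dd_mm += 1            # first field cannot be a month
        elif second > 12 >= first:
            mm_dd += 1            # second field cannot be a month
    if mm_dd == dd_mm == 0:
        return None               # every value was ambiguous
    return "mm/dd/yyyy" if mm_dd >= dd_mm else "dd/mm/yyyy"

print(date_order(["03/25/2024", "11/30/2023", "07/04/2024"]))  # mm/dd/yyyy
```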

Another enhancement is detecting credit card numbers.
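
One common way to validate a candidate card number, offered here as a sketch of the general idea rather than PDC's actual method, is the Luhn checksum:

```python
# Sketch: Luhn checksum validation for candidate credit card numbers.
def luhn_valid(number: str) -> bool:
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 13:                  # typical card numbers are 13-19 digits
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:                    # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(luhn_valid("4111 1111 1111 1111"))  # True (standard test number)
```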

Let's look at an example:

Part numbers often begin with two or three designated letters. This observation helps in defining a more precise RegEx rule based on observed patterns.

Additionally, tracking the "largest" and "smallest" values for each character position in these patterns reveals the degree of variability per position. Each time a pattern recurs, a counter tallies its occurrence; upon identifying a new pattern, the system stores the analyzed data as a distinct “sample” for that pattern.

The first step is to generate a substitution string (for the purposes of this example, not all possible characters are shown):

abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789/|"!£$%^&*()+=[]{}@#~;:,.?¬¥§¢

aaaaaaaaaaaaaaaaaaaaaaaaaawAAAAAAAAAAAAAAAAAAAAAAAAAAnnnnnnnnnn/sssssssssssssssssssssssssssss

The top row is the character lookup row, and the bottom row is the substitution to be made for each character position.

For example, “KT127-3” would generate a simple pattern “AAnnn-n”. Additionally, the largest and smallest character seen for each character position is also tracked.
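
In Python terms, these two rows behave exactly like a translation table. A minimal sketch, using the same truncated symbol set as the rows above (the variable names are illustrative):

```python
# Sketch: build the lookup/substitution rows as a translation table.
# Only the subset of symbols from the example is included; unmapped
# characters (such as "-") pass through as-is.
import string

lookup = (string.ascii_lowercase + " " + string.ascii_uppercase
          + string.digits + '/|"!£$%^&*()+=[]{}@#~;:,.?¬¥§¢')
subst = "a" * 26 + "w" + "A" * 26 + "n" * 10 + "/" + "s" * 29

table = str.maketrans(lookup, subst)
print("KT127-3".translate(table))   # -> AAnnn-n
```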

Consider a set of tracking numbers and the associated pattern for each:

Code        Pattern
KT17341     AAnnnnn
KL91632     AAnnnnn
KW81234     AAnnnnn
KW91020     AAnnnnn
KA002021    AAnnnnnn

Additionally, we capture the largest and smallest character seen in each character position. This allows us to potentially determine if there are fixed characters in the pattern, and to generate stricter RegEx recommendations.

  • AAnnnnn – Occurs 4 times
    • KL11020 – Lowest character seen in each position
    • KW97644 – Highest character seen in each position

  • AAnnnnnn – Occurs 1 time
    • KA002021 – Lowest character seen in each position
    • KA002021 – Highest character seen in each position
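
A minimal sketch of this bookkeeping, reusing the hypothetical data_pattern() from earlier (names and structure are illustrative):

```python
# Sketch: tally occurrences per pattern and track the lowest/highest
# character seen at each position (assumes data_pattern() defined above).
from collections import defaultdict

stats = defaultdict(lambda: {"count": 0, "low": None, "high": None})

for code in ["KT17341", "KL91632", "KW81234", "KW91020", "KA002021"]:
    s = stats[data_pattern(code)]
    s["count"] += 1
    s["low"] = code if s["low"] is None else "".join(map(min, s["low"], code))
    s["high"] = code if s["high"] is None else "".join(map(max, s["high"], code))

for pattern, s in sorted(stats.items()):
    print(pattern, s["count"], s["low"], s["high"])
# AAnnnnn 4 KL11020 KW97644
# AAnnnnnn 1 KA002021 KA002021
```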

The top ~20 data patterns will be captured and stored for subsequent consumption by data quality-related and other processes as needed.
