Data Identification Methods

Data Dictionaries & Patterns


A Pentaho Data Catalog Data Identification policy is a combination of data dictionaries and data patterns.

Accessing Your Catalog

To access your catalog, follow these steps:

  1. Open the Google Chrome web browser and click the Pentaho Data Catalog bookmark, or navigate to https://pdc.pentaho.example/

  2. Enter the following username and password, then click Sign In.

Username    data_steward@hv.com
Password    Welcome123!

Security Advisory: Handling Login Credentials

For enhanced security, it is strongly recommended that users avoid saving their login details directly in web browsers. Browsers may inadvertently autofill these credentials in unrelated fields, posing a security risk.

Best Practice

• Disable Autofill: To mitigate potential risks, users should disable the autofill functionality for login credentials in their browser settings. This preventive measure ensures that sensitive information is not unintentionally exposed or misused.


Data Dictionary vs. Data Catalog

Data dictionaries contain technical information about data assets, such as data sources, fields, and data types. They are typically used by technical audiences, such as data engineers and data analysts, to understand the data. Data catalogs contain much broader and deeper data intelligence than data dictionaries do.


Data Patterns in Data Identification

Data patterns play a crucial role in identifying and categorizing data within a data catalog. These patterns are essentially recurring characteristics or behaviors in data sets that can be recognized and used to automate data management.

The 'Getting Started' > 'Identify the data' section explained how data patterns are used to profile the data.

Data Pattern Analysis reduces each data item to a simple pattern, essentially performing dimensional reduction on each character position in the input text. The result is a string that indicates where alphabetic characters, numeric characters, symbols, and whitespace appear. The idea is similar to other forms of data reduction: if our data comes from a certain probability distribution, we can reduce its size by estimating the parameters of that distribution.

For example, KT-1734B generates a data pattern of “AA-nnnnA” to indicate two letters, followed by a dash, followed by four digits and another letter.

Case sensitivity could optionally be tracked as well, and the set of “significant” symbols might be user-configurable (e.g., “As a data quality engineer, for this column, a dash and an underscore are significant”).

The base process iterates over every character in the data item and performs a simple character-for-character substitution, resulting in a “data pattern” string for the item.

The pattern consists of the following characters:

Character   Description
a           lower-case alphabetic character
A           upper-case alphabetic character
n           digit 0..9
w           whitespace character (space, tab)
s           symbol character (e.g., -/|!£$%^&*()+=[]{}@#~;:,.?¬¥§¢" )
-           some other character (control, special symbol, etc.)

Any other symbol may be treated as “significant” (such as a dash, underscore, or colon). These are output as-is in the generated data pattern for the entry.
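
To make the substitution pass concrete, here is a minimal sketch in Python. The function name, the default significant-symbol set, and the symbol list are assumptions for illustration, not PDC's actual implementation:

```python
# Minimal sketch of the character-for-character substitution described above.
# data_pattern() and its defaults are illustrative, not PDC's implementation.
def data_pattern(value: str, significant: str = "-_:") -> str:
    out = []
    for ch in value:
        if ch in significant:
            out.append(ch)        # "significant" symbols pass through as-is
        elif ch.islower():
            out.append("a")       # lower-case alphabetic
        elif ch.isupper():
            out.append("A")       # upper-case alphabetic
        elif ch.isdigit():
            out.append("n")       # digit 0..9
        elif ch in " \t":
            out.append("w")       # whitespace
        elif ch in '/|!£$%^&*()+=[]{}@#~;:,.?¬¥§¢"':
            out.append("s")       # symbol character
        else:
            out.append("-")       # control characters, other specials
    return "".join(out)

print(data_pattern("KT-1734B"))   # -> AA-nnnnA
print(data_pattern("KT127-3"))    # -> AAnnn-n
```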

Additional tests could be built into the algorithm to look for certain additional characteristics. For example, date formats can be very tricky. PDC could observe that ‘nn/nn/nnnn’ is a date and could then observe whether it is predominantly ‘mm/dd/yyyy’ or ‘dd/mm/yyyy’; one such follow-up test is sketched below.
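
A sketch of such a follow-up test, with illustrative names and logic (not PDC's actual implementation): it counts only the unambiguous values, where one field exceeds 12 and therefore cannot be a month.

```python
# Sketch: classify nn/nn/nnnn values as mm/dd/yyyy vs. dd/mm/yyyy by counting
# unambiguous cases (a field greater than 12 cannot be a month).
import re

def date_order(values):
    mm_dd = dd_mm = 0
    for v in values:
        m = re.fullmatch(r"(\d{2})/(\d{2})/\d{4}", v)
        if not m:
            continue
        first, second = int(m.group(1)), int(m.group(2))
        if first > 12 >= second:
            dd_mm += 1            # first field cannot be a month
        elif second > 12 >= first:
            mm_dd += 1            # second field cannot be a month
    if mm_dd == dd_mm == 0:
        return None               # every value was ambiguous
    return "mm/dd/yyyy" if mm_dd >= dd_mm else "dd/mm/yyyy"

print(date_order(["03/25/2024", "11/30/2023", "07/04/2024"]))  # mm/dd/yyyy
```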

Another enhancement is detecting credit card numbers.
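
One common way to validate a candidate card number, offered here as a sketch of the general idea rather than PDC's actual method, is the Luhn checksum:

```python
# Sketch: Luhn checksum validation for candidate credit card numbers.
def luhn_valid(number: str) -> bool:
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 13:                  # typical card numbers are 13-19 digits
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:                    # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(luhn_valid("4111 1111 1111 1111"))  # True (standard test number)
```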

Let's look at an example:

Part numbers often begin with two or three designated letters. This observation helps in defining a more precise RegEx rule based on observed patterns.

Additionally, tracking the "largest" and "smallest" values for each character position in these patterns reveals the degree of variability per position. Each time a pattern recurs, a counter tallies its occurrence; upon identifying a new pattern, the system stores the analyzed data as a distinct “sample” for that pattern.

The first step is to generate a substitution string (for the purposes of this example, not all possible characters are shown):

abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789/|"!£$%^&*()+=[]{}@#~;:,.?¬¥§¢

aaaaaaaaaaaaaaaaaaaaaaaaaawAAAAAAAAAAAAAAAAAAAAAAAAAAnnnnnnnnnn/sssssssssssssssssssssssssssss

The top row is the character lookup row, and the bottom row is the substitution to be made for each character position.

For example, “KT127-3” would generate a simple pattern “AAnnn-n”. Additionally, the largest and smallest character seen for each character position is also tracked.
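
In Python terms, these two rows behave exactly like a translation table. A minimal sketch, using the same truncated symbol set as the rows above (the variable names are illustrative):

```python
# Sketch: build the lookup/substitution rows as a translation table.
# Only the subset of symbols from the example is included; unmapped
# characters (such as "-") pass through as-is.
import string

lookup = (string.ascii_lowercase + " " + string.ascii_uppercase
          + string.digits + '/|"!£$%^&*()+=[]{}@#~;:,.?¬¥§¢')
subst = "a" * 26 + "w" + "A" * 26 + "n" * 10 + "/" + "s" * 29

table = str.maketrans(lookup, subst)
print("KT127-3".translate(table))   # -> AAnnn-n
```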

Consider a set of tracking numbers and the associated pattern for each:

Code        Pattern
KT17341     AAnnnnn
KL91632     AAnnnnn
KW81234     AAnnnnn
KW91020     AAnnnnn
KA002021    AAnnnnnn

Additionally, we capture the largest and smallest character seen in each character position. This allows us to potentially determine if there are fixed characters in the pattern, and to generate stricter RegEx recommendations.

  • AAnnnnn – Occurs 4 times
    • KL11020 – Lowest character seen in each position
    • KW97644 – Highest character seen in each position

  • AAnnnnnn – Occurs 1 time
    • KA002021 – Lowest character seen in each position
    • KA002021 – Highest character seen in each position
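
A minimal sketch of this bookkeeping, reusing the hypothetical data_pattern() from earlier (names and structure are illustrative):

```python
# Sketch: tally occurrences per pattern and track the lowest/highest
# character seen at each position (assumes data_pattern() defined above).
from collections import defaultdict

stats = defaultdict(lambda: {"count": 0, "low": None, "high": None})

for code in ["KT17341", "KL91632", "KW81234", "KW91020", "KA002021"]:
    s = stats[data_pattern(code)]
    s["count"] += 1
    s["low"] = code if s["low"] is None else "".join(map(min, s["low"], code))
    s["high"] = code if s["high"] is None else "".join(map(max, s["high"], code))

for pattern, s in sorted(stats.items()):
    print(pattern, s["count"], s["low"], s["high"])
# AAnnnnn 4 KL11020 KW97644
# AAnnnnnn 1 KA002021 KA002021
```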

The top ~20 data patterns will be captured and stored for subsequent consumption by data quality-related and other processes as needed.
