Process the data
Collect statistical and informative summaries about the data ..
Processing and Profiling the 'Patient' Table
After successfully ingesting the metadata from our database schemas, our next step focuses on the data - 'Patient' table. This phase is essential for evaluating the data's structure, quality, and integrity to ensure our database's efficiency and effectiveness. Although currently limited to database metadata, Data Profiling will offer deeper insights, which we will explore later in this workshop.

Accessing Your Catalog
To access your catalog, please follow these steps:
Open Google Chrome web browser. and click on the bookmark, or
Navigate to: https://pdc.pentaho.example/
Enter the following email and password, then click Sign In.
Username
Password
Welcome123!
Security Advisory: Handling Login Credentials
For enhanced security, it is strongly recommended that users avoid saving their login details directly in web browsers. Browsers may inadvertently autofill these credentials in unrelated fields, posing a security risk.
Best Practice
• Disable Autofill: To mitigate potential risks, users should disable the autofill functionality for login credentials in their browser settings. This preventive measure ensures that sensitive information is not unintentionally exposed or misused.
Process the Data
Select 'Data Canvas' from the left menu option.
Click the checkbox to select the 'synthea' schema.

For optimal performance, keep your selection within a practical limit. Processing a very large number of tables, like 100,000, can drastically reduce speed. The default settings on the Configure Data Profiling page generally fulfill most needs and are recommended.
Click 'Process'.
In the process of managing both structured and unstructured data, two critical steps stand out: Metadata Ingest and Data Profiling. This distinction is essential for ensuring data quality and accessibility.
Ingest Metadata
Metadata ingest is a foundational process in data management within a Data Catalog. It involves the automatic collection of metadata — the data about data — from a database schema / file / object. This step is crucial for understanding and organizing the data, making it easily accessible for further analysis and data profiling.
Navigate to the metadata ingest section of your Data Catalog tool and initiate the process by clicking the
Startbutton.

Users can select specific tables or datasets for metadata ingestion. For example, if you are interested in patient information, you might expand the 'patients' table and opt for relevant fields such as 'passport'.
After starting the ingest process, monitor its progress on the Manage Workers page. This page provides real-time updates on the ingestion task.

Next -> 2. Data Profiling
Data Profiling: An Essential Step in Data Analytics
Before diving into data analytics within a Data Catalog, it's critical to ensure your data has been properly profiled. Data profiling is not just a step but a prerequisite. Without a valid data profile, proceeding to data identification tasks is premature.
What is Data Profiling?
Data profiling involves a detailed examination of your dataset, focusing on specific data objects. During this process, you collect valuable statistics and summaries that provide insights into the nature of your data. The beauty of data profiling is its immediacy. As soon as data from a column, table, or schema is processed, the results are ready for review.
When a data source schema is initially ingested, you're limited to the database metadata. Data profiling enriches this by uncovering additional layers of information, making it an indispensable tool in the data analyst's toolkit.
Click the Data Profiling tile.

Enable the option: 'Extract samples'

Click: 'Start'.
You can view the status of 'Profiling' on the Manage Workers page.

Explore the Data
Highlight the resource: postgresql:synthea
The Summary tab displays the connection properties

Select Details option.
The Details tab provides a overview of the Schema name, the number of Tables & Columns.

Next -> 2.2 Schema
Expand the 'patients' table and select: 'passport'.
Once the profiling has completed, you can review the collected metadata.
For instance, selecting the 'passport' field will present various statistical insights, such as Null Count, Cardinality, and Min/Max Width, aiding in understanding the data's characteristics.

The following table identifies the key details available in the Statistics pane when you select a column in a table to view:
Null Count
Number of entries that are null.
Cardinality
The number of unique values in a field, where a low cardinality number indicates many repeated values.
HLL
An estimate of cardinality of the data, with a roughly ~2% margin of error.
Blank Count
The number of entries that are blank.
Min Width
The minimum number of character count in a value in the column.
Max Width
The maximum number of character count in a value in the column.
Avg Width
The average number of character count in a value in the column.
Scroll down to Data Patterns:

Next -> 3. Data Identification
Viewing Data Patterns
Data Patterns provide insights into the frequency and distribution of data when examining a specific column, including any random values present. Names and values within this view are limited to 200 characters for readability.
Access to this feature requires the appropriate permissions, enabled through the Sample Data Access setting in your user role. Users with administrative rights have the capability to adjust these permissions as needed.
This functionality is essential for understanding sample-profiled resources and accessing various resource-level details efficiently.
This feature is explored in more detail in: Data Patterns
Data Identification - Data Quality
Pentaho Data Catalog uses data identification methods called dictionaries and data patterns to help you identify data.
Dictionaries
Dictionaries are lists of words that may occur in your data that can be used for column data matching. Typically, dictionary words are those that you could not write a pattern to be able to match, such as country codes.There are two kinds of dictionaries, system defined and user defined. System defined, or built-in dictionaries come with Data Catalog and user-defined dictionaries are created by Data Catalog users.You can add a user defined dictionary in three ways:
by importing a CSV file zipped with a JSON file containing the definition of the directory.
by uploading a CSV file and creating the JSON file to define the dictionary by entering information in the UI.
by selecting a profiled column of data from any JDBC table or file.
Data patterns
Data patterns are also used to identify data, by discovering patterns in the data that match a regular expression.
If data profiling is not done, Data Catalog highlights as 'Required'.
Click the Data Identification tile.

Click 'Select Methods'.

Select the following Data Dictionaries & Data Patterns:
USA_SSN
Social Security Number
Country Codes
Country Names
Country Names
Country Names
DoB
Date of Birth
USA States
States in USA

Click Start.
Track the Job in the Workers.

In Data Canvas, check that the sensitive data in the Synthea -> 'patients' table has now been identified - tagged as PII & Sensitive.

Last updated

