Process the data

Collect statistical and informative summaries about the data ..

Processing and Profiling the 'Patient' Table

After successfully ingesting the metadata from our database schemas, our next step focuses on the data - 'Patient' table. This phase is essential for evaluating the data's structure, quality, and integrity to ensure our database's efficiency and effectiveness. Although currently limited to database metadata, Data Profiling will offer deeper insights, which we will explore later in this workshop.

Accessing Your Catalog

To access your catalog, please follow these steps:

Open Google Chrome web browser. and click on the bookmark, or
Navigate to: https://pdc.pentaho.example/
Enter the following email and password, then click Sign In.

Username

[email protected]

Password

Welcome123!

For enhanced security, it is strongly recommended that users avoid saving their login details directly in web browsers. Browsers may inadvertently autofill these credentials in unrelated fields, posing a security risk.

Best Practice

• Disable Autofill: To mitigate potential risks, users should disable the autofill functionality for login credentials in their browser settings. This preventive measure ensures that sensitive information is not unintentionally exposed or misused.

Process the Data

Select 'Data Canvas' from the left menu option.
Click the checkbox to select the 'synthea' schema.

For optimal performance, keep your selection within a practical limit. Processing a very large number of tables, like 100,000, can drastically reduce speed. The default settings on the Configure Data Profiling page generally fulfill most needs and are recommended.

Click 'Process'.

In the process of managing both structured and unstructured data, two critical steps stand out: Metadata Ingest and Data Profiling. This distinction is essential for ensuring data quality and accessibility.

Ingest Metadata

Metadata ingest is a foundational process in data management within a Data Catalog. It involves the automatic collection of metadata — the data about data — from a database schema / file / object. This step is crucial for understanding and organizing the data, making it easily accessible for further analysis and data profiling.

Navigate to the metadata ingest section of your Data Catalog tool and initiate the process by clicking the Start button.

Users can select specific tables or datasets for metadata ingestion. For example, if you are interested in patient information, you might expand the 'patients' table and opt for relevant fields such as 'passport'.
After starting the ingest process, monitor its progress on the Manage Workers page. This page provides real-time updates on the ingestion task.

Next -> 2. Data Profiling

Data Profiling: An Essential Step in Data Analytics

Before diving into data analytics within a Data Catalog, it's critical to ensure your data has been properly profiled. Data profiling is not just a step but a prerequisite. Without a valid data profile, proceeding to data identification tasks is premature.

What is Data Profiling?

Data profiling involves a detailed examination of your dataset, focusing on specific data objects. During this process, you collect valuable statistics and summaries that provide insights into the nature of your data. The beauty of data profiling is its immediacy. As soon as data from a column, table, or schema is processed, the results are ready for review.

When a data source schema is initially ingested, you're limited to the database metadata. Data profiling enriches this by uncovering additional layers of information, making it an indispensable tool in the data analyst's toolkit.

Click the Data Profiling tile.

Enable the option: 'Extract samples'

• Extract Samples: This feature profiles a subset of data to ascertain its characteristics. It employs predefined Dictionaries & Patterns for a comprehensive analysis, streamlining the profiling process.

• Skip Recent (days): Avoids re-profiling tables that were profiled within a specified period, enhancing efficiency by focusing on older data. Setting this to 7 means tables profiled in the last week are not re-examined.

Click: 'Start'.

You can view the status of 'Profiling' on the Manage Workers page.

The dataset has now been scanned which returns the metadata properties along with some additional details.

The next stage is to explore the data to confirm the details.

Explore the Data

The 'top level' is the connection resource: postgresql:synthea

Highlight the resource: postgresql:synthea

The Summary tab displays the connection properties

Select Details option.

The Details tab provides a overview of the Schema name, the number of Tables & Columns.

Next -> 2.2 Schema

Data Identification - Data Quality

Pentaho Data Catalog uses data identification methods called dictionaries and data patterns to help you identify data.

Dictionaries

Dictionaries are lists of words that may occur in your data that can be used for column data matching. Typically, dictionary words are those that you could not write a pattern to be able to match, such as country codes.There are two kinds of dictionaries, system defined and user defined. System defined, or built-in dictionaries come with Data Catalog and user-defined dictionaries are created by Data Catalog users.You can add a user defined dictionary in three ways:

by importing a CSV file zipped with a JSON file containing the definition of the directory.
by uploading a CSV file and creating the JSON file to define the dictionary by entering information in the UI.
by selecting a profiled column of data from any JDBC table or file.

Data patterns

Data patterns are also used to identify data, by discovering patterns in the data that match a regular expression.

If data profiling is not done, Data Catalog highlights as 'Required'.

To illustrate, let's apply a data identification policy to the patients 'table'. This will apply identify and tag various columns in the table, this helps:

to quickly locate relevant information, saving time and effort when exploring vast data repositories.
link related data across different sections of the taxonomy.
derive maximum value from information assets by understanding their context and purpose.

Click the Data Identification tile.

Click 'Select Methods'.

Applying Data Dictionary + Pattern Analysis = Policy

Select the following Data Dictionaries & Data Patterns:

Method

Data Dictionaries

Pattern Analysis

USA_SSN

Social Security Number

Country Codes

Country Names

DoB

Date of Birth

USA States

States in USA

Click Start.
Track the Job in the Workers.

In Data Canvas, check that the sensitive data in the Synthea -> 'patients' table has now been identified - tagged as PII & Sensitive.

PreviousData Sources NextIdentify the data

Last updated 1 year ago

Processing and Profiling the 'Patient' Table

Accessing Your Catalog

Security Advisory: Handling Login Credentials

Process the Data

Ingest Metadata

Data Profiling: An Essential Step in Data Analytics

Explore the Data

Viewing Data Patterns

Data Identification - Data Quality