Metadata Store

Exploring the Pentaho Data Catalog Metadata Store.


Last updated 10 months ago

Metadata Store in Data Catalogs: A Foundation for Effective Data Management

A metadata store within a data catalog serves as a critical repository for storing metadata from various sources, ensuring the preservation of their relationships, historical context, and vital details. Acting as the backbone of a data catalog's architecture, this specialized database is instrumental in streamlining the organization and handling of metadata. This, in turn, significantly aids in the efficient discovery, understanding, and governance of data.

The primary role of the metadata store is to facilitate easy access to the correct data assets for users by encompassing technical, process, and business metadata. This includes comprehensive information on the data's structure, its lineage, utilization patterns, and the business context it operates within. By making data assets readily identifiable, assessable, and actionable, the metadata store plays a pivotal role in bolstering data-driven decision-making processes.

The diagram depicts a simplified representation of the key data model of the PDC Metadata Store (MDS). To clarify the relationships between the 'Collections', it is illustrated in a relational database format.

The model illustrates the relationships between data sources and entities where the customer data assets and discovered metadata properties are persisted as rich JSON graphs.

At the core of the model is the mds.entities collection.

Connect to Metadata Store

  1. Run NoSQLBooster.

cd
cd ~/nosqlbooster4mongo*
./nosqlbooster4mongo*

So what happens when you connect, ingest, and profile a data source?

When you create a data source connection, for example, to the Synthea dataset on PostgreSQL, the details are persisted in the resourceconnections collection located in the fe database.

If you are not familiar with MQL, the easiest way to query the MongoDB database is to use SQL!

For example, to display the connection name in a dashboard:

resourceName where configMethod = 'credentials'

The following table shows how SQL components map to MongoDB document components.

| SQL | MongoDB |
| --- | --- |
| database | database |
| table | collection |
| column | field |
| row (record) | document |
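The mapping above means a SQL WHERE clause becomes a MongoDB filter document and the SELECT list becomes a projection document. A minimal Python sketch of this translation, using the connection-name query from the example above (pymongo-style syntax; the live connection shown in comments is illustrative):

```python
# SQL:   SELECT resourceName FROM resourceconnections
#        WHERE configMethod = 'credentials'
#
# MongoDB equivalent: the WHERE clause becomes a filter document,
# and the SELECT list becomes a projection document.

filter_doc = {"configMethod": "credentials"}   # WHERE clause
projection = {"resourceName": 1, "_id": 0}     # SELECT list

# With a live connection (pymongo), the call would resemble:
#   from pymongo import MongoClient
#   db = MongoClient("mongodb://root:broot@localhost:27017")["fe"]
#   results = db["resourceconnections"].find(filter_doc, projection)

print(filter_doc, projection)
```

NoSQLBooster's `mb.runSQLQuery` performs essentially this translation for you behind the scenes.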

  1. Run NoSQLBooster.

cd
cd ~/nosqlbooster4mongo*
./nosqlbooster4mongo*
  2. In the Query panel, enter the following SQL statement.

//Single quotes are for strings 'string'
mb.runSQLQuery(`
       SELECT resourceName FROM resourceconnections 
       WHERE configMethod = 'credentials'     
`).sort({_id:-1})
  .limit(100)

As expected, the result is postgresql:synthea, as there is currently only one connection defined.

When you ingest the database schema, the connection details are persisted in the datasources collection located in the mds database.

The database schema metadata details are persisted in the entities collection (synthea - 179 documents).

The example below shows the zip column properties from the providers 'table'.

Taking a look at the MDS diagram above, we can see that the recursive mds.entities.type refers to:

  • SCHEMA - synthea

  • TABLE - 'aggregated' documents

  • RESOURCE

  • COLUMN

This is useful for querying the dataset. The mds.entities collection consists of documents (records) that are referenced by knowing the 'type'.
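As the `_id` values in the queries below show, each entity is keyed by a slash-delimited path (for example `1/business_apps_db/synthea/patients/passport`), so the hierarchy is recoverable from the key itself. A small sketch of that idea; the exact depth-to-type mapping is an assumption for illustration, inferred only from the TABLE and COLUMN examples in this page:

```python
# mds.entities _ids are slash-delimited paths. From the examples in this page,
# '1/business_apps_db/synthea/patients' is a TABLE and
# '1/business_apps_db/synthea/patients/passport' is a COLUMN.
# The full depth-to-type mapping below is an illustrative assumption.

DEPTH_TO_TYPE = {2: "RESOURCE", 3: "SCHEMA", 4: "TABLE", 5: "COLUMN"}

def entity_type(entity_id: str) -> str:
    """Guess the entity type from the number of path segments in its _id."""
    depth = len(entity_id.split("/"))
    return DEPTH_TO_TYPE.get(depth, "UNKNOWN")

def parent_id(entity_id: str) -> str:
    """Walk one level up the hierarchy (a COLUMN's parent is its TABLE)."""
    return entity_id.rsplit("/", 1)[0]

print(entity_type("1/business_apps_db/synthea/patients"))            # TABLE
print(parent_id("1/business_apps_db/synthea/patients/passport"))     # the patients TABLE _id
```

This is why the 'Copy path' action in NoSQLBooster is a convenient way to obtain an exact `_id` for a query.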

  1. Edit the query to retrieve the fields on the 'patients' table (you could have ingested more than 1 schema).

db.entities.find({type:'TABLE',_id:'1/business_apps_db/synthea/patients'})
   .projection({})
   .sort({_id:-1})
   .limit(100)

TIP: To find the TABLE _id, under Actions select the option: Copy path.

mb.runSQLQuery(`
       SELECT * FROM entities 
       WHERE type = 'TABLE' AND name = 'patients'       
`).sort({_id:-1})
  .limit(100)

Notice the out-of-the-box predefined collections for:

  • datapatterns

  • dictionaries

  • dictionaryTerms

The data profiling process generates statistical and intermediate data that is required by other data analytic processes. The intermediate data is consumed by downstream processes such as data flow and foreign key detection.

Referencing the MDS model, the results are persisted in the mds.data_profile collection.
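One way to picture how the profiled value bitmaps feed foreign-key detection: a column is an FK candidate for another when the set of its hashed values is contained in the other's. A minimal sketch of that containment test, with plain Python sets standing in for the roaring bitmaps the profiler actually stores (column names and data are illustrative, not PDC's implementation):

```python
# Foreign-key detection sketch: a column is an FK candidate for another
# column when the set of its (hashed) values is contained in the other's.
# Plain Python sets stand in for the profiler's roaring bitmaps.

def hashed_values(column):
    """Hash every non-null entry in a column into a set."""
    return {hash(v) for v in column if v is not None}

patients_id   = ["p1", "p2", "p3", "p4"]    # candidate primary key
encounters_fk = ["p1", "p3", "p1", None]    # candidate foreign key

def is_fk_candidate(child_col, parent_col) -> bool:
    child, parent = hashed_values(child_col), hashed_values(parent_col)
    return bool(child) and child <= parent  # set-containment test

print(is_fk_candidate(encounters_fk, patients_id))   # True
print(is_fk_candidate(patients_id, encounters_fk))   # False
```

Because only hashes are compared, the check never needs the raw data again once profiling has run.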

The intermediate data generated for each column of data includes:

Profile Action: Description

  • Roaring Bitset: A bitmap of the hash values for all entries in the column.

  • HyperLogLog (HLL): Provides an estimate of the cardinality of the data, with a roughly ~2% margin of error.

  • Data Pattern Analysis: Performs a rudimentary data pattern analysis using dimensional reduction, tracking the most frequently occurring patterns.

  • Data Quality Pre-Analysis: Using the Data Pattern Analysis results, Data Catalog performs a statistical estimation of the data quality. This is summarized as an overall percentage as well as a heat map for each data pattern. Additionally, Data Catalog makes RegEx recommendations for the most probable matches.

  • Statistics: Data Catalog gathers the following statistics when examining all the data:
    • Minimum and Maximum values (for numeric columns)
    • Widest and Narrowest (non-null) string widths
    • Null count
    • Total row count

  • Data Sampling: Data Catalog takes a controlled sampling of the data so that the samples are consistently chosen across different columns.
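The "dimensional reduction" behind pattern analysis can be pictured as collapsing every character to a character class and counting the resulting shapes. The sketch below is an illustrative guess at that idea, not PDC's actual algorithm; the class symbols (A for letters, 9 for digits) and the sample zip values are assumptions:

```python
from collections import Counter

def reduce_pattern(value: str) -> str:
    """Collapse each character to a class symbol: A = letter, 9 = digit,
    punctuation kept as-is. Many values reduce to few patterns."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)

# Illustrative zip-code column: two clean 5-digit zips, one typo, one ZIP+4.
zips = ["01821", "02139", "0213A", "90210-1234"]
patterns = Counter(reduce_pattern(z) for z in zips)
print(patterns.most_common())   # '99999' is the most frequent pattern
```

Tracking only the most frequent patterns (rather than raw values) is what keeps the profile small, and the dominant pattern is a natural seed for the RegEx recommendations mentioned above.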


  1. Edit the query to retrieve the fields for the 'patients.passport' column.

db.entities.find({type:'COLUMN',_id:'1/business_apps_db/synthea/patients/passport'})
   .projection({})
   .sort({_id:-1})
   .limit(100)

TIP: To find the COLUMN _id, under Actions select the option: Copy path.

mb.runSQLQuery(`
   SELECT * FROM entities
   WHERE type = 'COLUMN' AND name = 'passport'
`).sort({_id:-1})
.limit(100)

The Statistics, Data Patterns and Properties are displayed in the Data Canvas.

A lot more data is captured, which is stored in the mds.data_profile collection.

  1. Edit the query to retrieve the fields for the 'mds.data_profile.patients.passport' column.

db.data_profile.find({_id:'1/business_apps_db/synthea/patients/passport'})
   .projection({})
   .sort({_id:-1})
   .limit(100)
mb.runSQLQuery(`
       SELECT * FROM data_profile 
       WHERE _id = '1/business_apps_db/synthea/patients/passport'     
`).sort({_id:-1})
  .limit(100)


NoSQLBooster is a cross-platform GUI tool for MongoDB Server 3.6-7.0, which provides a build-in MongoDB script debugger, comprehensive server monitoring tools, chaining fluent query, SQL query, query code generator, task scheduling, ES2020 support, and advanced IntelliSense experience.


To watch the videos, copy and paste the website URL into your host Chrome browser.


Be aware that this exposes the MongoDB port!

  1. Navigate to the PDC deployment folder.

cd
cd /opt/pentaho/pdc-docker-deployment/vendor/
  2. Edit the docker-compose.yml file.

nano docker-compose.yml
  3. Uncomment ports under mongodb in docker-compose.yml.

  mongodb:
    image: ${PDC_IMAGE_PREFIX}pdc-mongodb-ee:${PDC_MONGODB_TAG:-master}
    ports:
      - "27017:27017"
  4. Save changes.

CTRL + O
Enter
CTRL + X
  5. Restart PDC.

cd
cd /opt/pentaho/pdc-docker-deployment
sudo ./pdc.sh up

Ensure all the containers successfully restart.

  1. Download NoSQLBooster.

cd
cd ~/Downloads
wget https://s3.nosqlbooster.com/download/releasesv8/nosqlbooster4mongo-8.1.7.tar.gz
  2. Untar the package.

cd
cd ~/Downloads
tar -xvf nosqlbooster4mongo-8.1.7.tar.gz
  3. Move the NoSQLBooster folder to home.

cd
cd ~/Downloads
mv nosqlbooster4mongo* /home/pdc
  4. Run NoSQLBooster.

cd
cd ~/nosqlbooster4mongo*
ls
./nosqlbooster4mongo*

  1. Run NoSQLBooster.

cd
cd ~/nosqlbooster4mongo*
./nosqlbooster4mongo*
  2. From the Connect option, select Connect From Connection String URI.

  3. Enter the following details.

Connection string: mongodb://root:broot@localhost:27017

  4. Finally, click Connect.


Troubleshooting

When a VM is started up, sometimes not all of the PDC Docker containers start, and as a result some functionality may not work, such as:

  • Login page not showing login panel

  • Register Page not creating root user

  • The Test Connection button not working when adding a data source

In this case, navigate to the PDC docker deployment location (/opt/pentaho/pdc-docker-deployment) and run ./pdc.sh up to ensure all PDC containers are running.

MongoDB Compass is a powerful GUI tool that simplifies the processes of querying, aggregating, and analyzing MongoDB data through a user-friendly visual interface.

Check for the latest package:

  1. Update the Ubuntu system’s repositories.

sudo apt update
  2. Download MongoDB Compass.

You may need to edit the version in the request below.

cd
cd ~/Downloads
wget https://downloads.mongodb.com/compass/mongodb-compass_1.43.4_amd64.deb
  3. Install MongoDB Compass.

cd
cd ~/Downloads
sudo apt install ./mongodb-compass_1.43.4_amd64.deb

  1. Click on New Connection.

Ensure you check the option: Direct Connection.

  2. Click on the Authentication option.

  3. Enter the following details:

Username: root
Password: broot

  4. Click Connect.

Through the MongoDB Atlas SQL Interface, Connectors, and Drivers, you can leverage your existing SQL knowledge to query and analyze live application data directly from your preferred SQL-based tool.


Altair is a feature-rich GraphQL client that enables developers to debug and test their GraphQL queries. It provides an intuitive interface for crafting queries, setting variables, and viewing responses.

Altair supports various features like query autocompletion, response formatting, and even documentation exploration, making it an invaluable tool for working with GraphQL APIs.

  1. Enter your query:

Query: query {userCount(filter:{})}
Description: Number of Users

Query: query {userMany(limit:100){ email date firstName lastName lastLogin modified modifiedBy status _id }}
Description: Returns user information - limit 100
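A GraphQL client like Altair simply POSTs these queries as a JSON body. A minimal Python sketch of building that request payload for the userCount query above (the endpoint path in the comment is an assumption; check your PDC deployment for the actual GraphQL URL and any authentication it requires):

```python
import json

# GraphQL over HTTP: the query string is wrapped in a JSON object
# under the "query" key and POSTed with a JSON content type.
query = "query { userCount(filter: {}) }"
payload = json.dumps({"query": query})

# Against a live endpoint the request would resemble (URL is illustrative):
#   import urllib.request
#   req = urllib.request.Request(
#       "http://localhost/graphql", data=payload.encode(),
#       headers={"Content-Type": "application/json"})
#   print(urllib.request.urlopen(req).read())

print(payload)
```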

The bidb database is structured to allow a JDBC MySQL connection. Useful for creating audit reports.

DbVisualizer lets you visualize, manage and edit data with the power of SQL and simplicity of a spreadsheet.

Install Java

  1. Ensure the apt libraries are updated.

sudo apt update && sudo apt upgrade -y
  2. Check if Java has been installed.

java -version
  3. See all available editions of OpenJDK.

apt-cache search openjdk
  4. Install OpenJDK 21 JDK.

sudo apt install openjdk-21-jdk
  5. Install OpenJDK 21 JRE.

sudo apt install openjdk-21-jre-headless
  6. Check the installation.

java -version
pdc@pdc:~$ java -version
openjdk version "21.0.3" 2024-04-16
OpenJDK Runtime Environment (build 21.0.3+9-Ubuntu-1ubuntu1)
OpenJDK 64-Bit Server VM (build 21.0.3+9-Ubuntu-1ubuntu1, mixed mode, sharing)

Set the Java environment variables

Although Java is already installed, further configuration is required. Setting a few environment variables makes OpenJDK easier to use and allows other applications to find it.

  1. Edit the .bashrc file in your home directory.

cd
sudo nano ~/.bashrc
  2. Add the following lines to the bottom of the file.

export JAVA_HOME=$(dirname $(dirname $(readlink -f $(which java))))
export PATH=$PATH:$JAVA_HOME/bin
  3. Save.

Ctrl + o
Enter
Ctrl + x
  4. Source to apply changes.

source ~/.bashrc
  5. Echo JAVA_HOME and PATH to verify they are set correctly.

echo $JAVA_HOME
echo $PATH
/usr/lib/jvm/java-21-openjdk-amd64
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/snap/bin:/usr/lib/jvm/java-21-openjdk-amd64/bin

To set these values for all system users, add the following changes to /etc/environment instead.


Install DbVisualizer

  1. Download the Linux - without Java version.

  2. Execute the script located in the Downloads directory.

Ensure the file has execute permission.

You will need to edit the version before executing the shell script.

cd
cd ~/Downloads
./dbvis_linux_24_1_3.sh
  3. Accept the default path / settings.

  4. Once finished, accept the 'Free license forever' option.

  5. To start DbVisualizer.

cd
cd ~/DbVisualizer
./dbvis

As we know, MongoDB stores data as 'Collections' of 'Documents' in a flat structure.

The BIDB database contains aggregated documents, useful for reporting.

  1. Start DbVisualizer.

cd
cd ~/DbVisualizer
./dbvis
  2. Click on the blue cross to create a new connection.

  3. From the drop-down list, select MySQL 5.

  4. Enter the following details:

Connection Name: BIDB
Database Server: localhost
Database Port: 3307
Use SSL: Automatic
Database Userid: root
Database Password: groot

  5. Ping the server & Connect.

  6. Using SQL Commander, query the database.

Ensure you have installed and configured the tools described above before commencing the following steps.

Ensure you have enabled the MongoDB ports.

To access Altair: http://localhost/graghql2

MongoDB Compass download: https://www.mongodb.com/try/download/compass