Metadata Store

Exploring the Pentaho Data Catalog Metadata Store.


Last updated 10 months ago

Metadata Store in Data Catalogs: A Foundation for Effective Data Management

A metadata store within a data catalog serves as a critical repository for storing metadata from various sources, ensuring the preservation of their relationships, historical context, and vital details. Acting as the backbone of a data catalog's architecture, this specialized database is instrumental in streamlining the organization and handling of metadata. This, in turn, significantly aids in the efficient discovery, understanding, and governance of data.

The primary role of the metadata store is to facilitate easy access to the correct data assets for users by encompassing technical, process, and business metadata. This includes comprehensive information on the data's structure, its lineage, utilization patterns, and the business context it operates within. By making data assets readily identifiable, assessable, and actionable, the metadata store plays a pivotal role in bolstering data-driven decision-making processes.

The diagram depicts a simplified representation of the key data model of the PDC Metadata Store (MDS). To clarify the relationships between the 'Collections', it is illustrated in a relational database format.

The model illustrates the relationships between data sources and entities where the customer data assets and discovered metadata properties are persisted as rich JSON graphs.

At the core of the model is the mds.entities collection.

Connect to Metadata Store

  1. Run NoSQLBooster.

cd
cd ~/nosqlbooster4mongo*
./nosqlbooster4mongo*

So what happens when you connect, ingest, and profile a data source?

When you create a data source connection, for example, to the Synthea dataset on PostgreSQL, the details are persisted in the resourceconnections collection located in the fe database.

If you are not familiar with MQL, the easiest way to query the MongoDB database is to use SQL!

For example, to display the connection name in a dashboard:

resourceName where configMethod = 'credentials'

The following table shows how SQL components map to MongoDB document components.

| SQL | MongoDB |
| --- | --- |
| database | database |
| table | collection |
| column | field |
| row (record) | document |
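The mapping above means a SQL WHERE clause becomes a MongoDB filter document and the SELECT list becomes a projection document. A minimal Python sketch of this translation, using the connection-name query from the example above (pymongo-style syntax; the live connection shown in comments is illustrative):

```python
# SQL:   SELECT resourceName FROM resourceconnections
#        WHERE configMethod = 'credentials'
#
# MongoDB equivalent: the WHERE clause becomes a filter document,
# and the SELECT list becomes a projection document.

filter_doc = {"configMethod": "credentials"}   # WHERE clause
projection = {"resourceName": 1, "_id": 0}     # SELECT list

# With a live connection (pymongo), the call would resemble:
#   from pymongo import MongoClient
#   db = MongoClient("mongodb://root:broot@localhost:27017")["fe"]
#   results = db["resourceconnections"].find(filter_doc, projection)

print(filter_doc, projection)
```

NoSQLBooster's `mb.runSQLQuery` performs essentially this translation for you behind the scenes.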

  1. Run NoSQLBooster.

cd
cd ~/nosqlbooster4mongo*
./nosqlbooster4mongo*
  2. In the Query panel, enter the following SQL statement.

//Single quotes are for strings 'string'
mb.runSQLQuery(`
       SELECT resourceName FROM resourceconnections 
       WHERE configMethod = 'credentials'     
`).sort({_id:-1})
  .limit(100)

As expected, the result is postgresql:synthea, as there is currently only one connection defined.

When you ingest the database schema, the connection details are persisted in the datasources collection located in the mds database.

The database schema metadata details are persisted in the entities collection (synthea - 179 documents).

The example below shows the zip column properties from the providers 'table'.

Taking a look at the MDS diagram above, we can see that the recursive mds.entities.type refers to:

  • SCHEMA - synthea

  • TABLE - 'aggregated' documents

  • RESOURCE

  • COLUMN

This is useful for querying the dataset. The mds.entities collection consists of documents (records) that are referenced by knowing the 'type'.
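As the `_id` values in the queries below show, each entity is keyed by a slash-delimited path (for example `1/business_apps_db/synthea/patients/passport`), so the hierarchy is recoverable from the key itself. A small sketch of that idea; the exact depth-to-type mapping is an assumption for illustration, inferred only from the TABLE and COLUMN examples in this page:

```python
# mds.entities _ids are slash-delimited paths. From the examples in this page,
# '1/business_apps_db/synthea/patients' is a TABLE and
# '1/business_apps_db/synthea/patients/passport' is a COLUMN.
# The full depth-to-type mapping below is an illustrative assumption.

DEPTH_TO_TYPE = {2: "RESOURCE", 3: "SCHEMA", 4: "TABLE", 5: "COLUMN"}

def entity_type(entity_id: str) -> str:
    """Guess the entity type from the number of path segments in its _id."""
    depth = len(entity_id.split("/"))
    return DEPTH_TO_TYPE.get(depth, "UNKNOWN")

def parent_id(entity_id: str) -> str:
    """Walk one level up the hierarchy (a COLUMN's parent is its TABLE)."""
    return entity_id.rsplit("/", 1)[0]

print(entity_type("1/business_apps_db/synthea/patients"))            # TABLE
print(parent_id("1/business_apps_db/synthea/patients/passport"))     # the patients TABLE _id
```

This is why the 'Copy path' action in NoSQLBooster is a convenient way to obtain an exact `_id` for a query.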

  1. Edit the query to retrieve the fields on the 'patients' table (you could have ingested more than 1 schema).

db.entities.find({type:'TABLE',_id:'1/business_apps_db/synthea/patients'})
   .projection({})
   .sort({_id:-1})
   .limit(100)

TIP: To find the TABLE _id, under Actions select the option: Copy path.

mb.runSQLQuery(`
       SELECT * FROM entities 
       WHERE type = 'TABLE' AND name = 'patients'       
`).sort({_id:-1})
  .limit(100)

Notice the out-of-the-box predefined collections for:

  • datapatterns

  • dictionaries

  • dictionaryTerms

The data profiling process generates statistical and intermediate data that is required by other data analytic processes. The intermediate data is consumed by downstream processes such as data flow and foreign key detection.

Referencing the MDS model, the results are persisted in the mds.data_profile collection.
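One way to picture how the profiled value bitmaps feed foreign-key detection: a column is an FK candidate for another when the set of its hashed values is contained in the other's. A minimal sketch of that containment test, with plain Python sets standing in for the roaring bitmaps the profiler actually stores (column names and data are illustrative, not PDC's implementation):

```python
# Foreign-key detection sketch: a column is an FK candidate for another
# column when the set of its (hashed) values is contained in the other's.
# Plain Python sets stand in for the profiler's roaring bitmaps.

def hashed_values(column):
    """Hash every non-null entry in a column into a set."""
    return {hash(v) for v in column if v is not None}

patients_id   = ["p1", "p2", "p3", "p4"]    # candidate primary key
encounters_fk = ["p1", "p3", "p1", None]    # candidate foreign key

def is_fk_candidate(child_col, parent_col) -> bool:
    child, parent = hashed_values(child_col), hashed_values(parent_col)
    return bool(child) and child <= parent  # set-containment test

print(is_fk_candidate(encounters_fk, patients_id))   # True
print(is_fk_candidate(patients_id, encounters_fk))   # False
```

Because only hashes are compared, the check never needs the raw data again once profiling has run.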

The intermediate data generated for each column of data includes:

Profile Action: Description

  • Roaring Bitset: A bitmap of the hash values for all entries in the column.

  • HyperLogLog (HLL): Provides an estimate of the cardinality of the data, with a roughly ~2% margin of error.

  • Data Pattern Analysis: Performs a rudimentary data pattern analysis using dimensional reduction, tracking the most frequently occurring patterns.

  • Data Quality Pre-Analysis: Using the Data Pattern Analysis results, Data Catalog performs a statistical estimation of the data quality. This is summarized as an overall percentage as well as a heat map for each data pattern. Additionally, Data Catalog makes RegEx recommendations for the most probable matches.

  • Statistics: Data Catalog gathers the following statistics when examining all the data:
    • Minimum and Maximum values (for numeric columns)
    • Widest and Narrowest (non-null) string widths
    • Null count
    • Total row count

  • Data Sampling: Data Catalog takes a controlled sampling of the data so that the samples are consistently chosen across different columns.
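The "dimensional reduction" behind pattern analysis can be pictured as collapsing every character to a character class and counting the resulting shapes. The sketch below is an illustrative guess at that idea, not PDC's actual algorithm; the class symbols (A for letters, 9 for digits) and the sample zip values are assumptions:

```python
from collections import Counter

def reduce_pattern(value: str) -> str:
    """Collapse each character to a class symbol: A = letter, 9 = digit,
    punctuation kept as-is. Many values reduce to few patterns."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)

# Illustrative zip-code column: two clean 5-digit zips, one typo, one ZIP+4.
zips = ["01821", "02139", "0213A", "90210-1234"]
patterns = Counter(reduce_pattern(z) for z in zips)
print(patterns.most_common())   # '99999' is the most frequent pattern
```

Tracking only the most frequent patterns (rather than raw values) is what keeps the profile small, and the dominant pattern is a natural seed for the RegEx recommendations mentioned above.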


  1. Edit the query to retrieve the fields for the 'patients.passport' column.

db.entities.find({type:'COLUMN',_id:'1/business_apps_db/synthea/patients/passport'})
   .projection({})
   .sort({_id:-1})
   .limit(100)

TIP: To find the COLUMN _id, under Actions select the option: Copy path.

mb.runSQLQuery(`
   SELECT * FROM entities
   WHERE type = 'COLUMN' AND name = 'passport'
`).sort({_id:-1})
.limit(100)

The Statistics, Data Patterns and Properties are displayed in the Data Canvas.

A lot more data is captured, which is stored in the mds.data_profile collection.

  1. Edit the query to retrieve the fields for the 'mds.data_profile.patients.passport' column.

db.data_profile.find({_id:'1/business_apps_db/synthea/patients/passport'})
   .projection({})
   .sort({_id:-1})
   .limit(100)
mb.runSQLQuery(`
       SELECT * FROM data_profile 
       WHERE _id = '1/business_apps_db/synthea/patients/passport'     
`).sort({_id:-1})
  .limit(100)


NoSQLBooster is a cross-platform GUI tool for MongoDB Server 3.6-7.0, which provides a build-in MongoDB script debugger, comprehensive server monitoring tools, chaining fluent query, SQL query, query code generator, task scheduling, ES2020 support, and advanced IntelliSense experience.


To watch the videos, copy and paste the website URL into your host Chrome browser.


Be aware that this exposes the MongoDB port!

  1. Navigate to the PDC deployment folder.

cd
cd /opt/pentaho/pdc-docker-deployment/vendor/
  2. Edit the docker-compose.yml file.

nano docker-compose.yml
  3. Uncomment ports under mongodb in docker-compose.yml.

  mongodb:
    image: ${PDC_IMAGE_PREFIX}pdc-mongodb-ee:${PDC_MONGODB_TAG:-master}
    ports:
      - "27017:27017"
  4. Save changes.

CTRL + O
Enter
CTRL + X
  5. Restart PDC.

cd
cd /opt/pentaho/pdc-docker-deployment
sudo ./pdc.sh up

Ensure all the containers successfully restart.

  1. Download NoSQLBooster.

cd
cd ~/Downloads
wget https://s3.nosqlbooster.com/download/releasesv8/nosqlbooster4mongo-8.1.7.tar.gz
  2. Untar the package.

cd
cd ~/Downloads
tar -xvf nosqlbooster4mongo-8.1.7.tar.gz
  3. Move the NoSQLBooster folder to home.

cd
cd ~/Downloads
mv nosqlbooster4mongo* /home/pdc
  4. Run NoSQLBooster.

cd
cd ~/nosqlbooster4mongo*
ls
./nosqlbooster4mongo*

  1. Run NoSQLBooster.

cd
cd ~/nosqlbooster4mongo*
./nosqlbooster4mongo*
  2. From the Connect option, select Connect From Connection String URI.

  3. Enter the following details.

Connection string: mongodb://root:broot@localhost:27017

  4. Finally, click Connect.


Troubleshooting

When a VM is started up, sometimes not all of the PDC Docker containers start, and as a result some functionality may not work, such as:

  • Login page not showing login panel

  • Register Page not creating root user

  • The Test Connection button not working when adding a data source

In this case, navigate to the PDC docker deployment location (/opt/pentaho/pdc-docker-deployment) and run ./pdc.sh up to ensure all PDC containers are running.

MongoDB Compass is a powerful GUI tool that simplifies the processes of querying, aggregating, and analyzing MongoDB data through a user-friendly visual interface.

Check for the latest package:

  1. Update the Ubuntu system’s repositories.

sudo apt update
  2. Download MongoDB Compass.

You may need to edit the version in the request below.

cd
cd ~/Downloads
wget https://downloads.mongodb.com/compass/mongodb-compass_1.43.4_amd64.deb
  3. Install MongoDB Compass.

cd
cd ~/Downloads
sudo apt install ./mongodb-compass_1.43.4_amd64.deb

  1. Click on New Connection.

Ensure you check the option: Direct Connection.

  2. Click on the Authentication option.

  3. Enter the following details:

Username: root
Password: broot

  4. Click Connect.

Through the MongoDB Atlas SQL Interface, Connectors, and Drivers, you can leverage your existing SQL knowledge to query and analyze live application data directly from your preferred SQL-based tool.


Altair is a feature-rich GraphQL client that enables developers to debug and test their GraphQL queries. It provides an intuitive interface for crafting queries, setting variables, and viewing responses.

Altair supports various features like query autocompletion, response formatting, and even documentation exploration, making it an invaluable tool for working with GraphQL APIs.

  1. Enter your query:

Query: query {userCount(filter:{})}
Description: Number of Users

Query: query {userMany(limit:100){ email date firstName lastName lastLogin modified modifiedBy status _id }}
Description: Returns user information - limit 100
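A GraphQL client like Altair simply POSTs these queries as a JSON body. A minimal Python sketch of building that request payload for the userCount query above (the endpoint path in the comment is an assumption; check your PDC deployment for the actual GraphQL URL and any authentication it requires):

```python
import json

# GraphQL over HTTP: the query string is wrapped in a JSON object
# under the "query" key and POSTed with a JSON content type.
query = "query { userCount(filter: {}) }"
payload = json.dumps({"query": query})

# Against a live endpoint the request would resemble (URL is illustrative):
#   import urllib.request
#   req = urllib.request.Request(
#       "http://localhost/graphql", data=payload.encode(),
#       headers={"Content-Type": "application/json"})
#   print(urllib.request.urlopen(req).read())

print(payload)
```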

The bidb database is structured to allow a JDBC MySQL connection. Useful for creating audit reports.

DbVisualizer lets you visualize, manage and edit data with the power of SQL and simplicity of a spreadsheet.

Install Java

  1. Ensure the apt libraries are updated.

sudo apt update && sudo apt upgrade -y
  2. Check if Java has been installed.

java -version
  3. See all available editions of OpenJDK.

apt-cache search openjdk
  4. Install OpenJDK 21 JDK.

sudo apt install openjdk-21-jdk
  5. Install OpenJDK 21 JRE.

sudo apt install openjdk-21-jre-headless
  6. Check the installation.

java -version
pdc@pdc:~$ java -version
openjdk version "21.0.3" 2024-04-16
OpenJDK Runtime Environment (build 21.0.3+9-Ubuntu-1ubuntu1)
OpenJDK 64-Bit Server VM (build 21.0.3+9-Ubuntu-1ubuntu1, mixed mode, sharing)

Set the Java environment variables

Although Java is already installed, further configuration is required. Setting a few environment variables makes OpenJDK easier to use and allows other applications to find it.

  1. Edit the .bashrc file in your home directory.

cd
sudo nano ~/.bashrc
  2. Add the following lines to the bottom of the file.

export JAVA_HOME=$(dirname $(dirname $(readlink -f $(which java))))
export PATH=$PATH:$JAVA_HOME/bin
  3. Save.

Ctrl + o
Enter
Ctrl + x
  4. Source to apply changes.

source ~/.bashrc
  5. Echo JAVA_HOME and PATH to verify they are set correctly.

echo $JAVA_HOME
echo $PATH
/usr/lib/jvm/java-21-openjdk-amd64
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/snap/bin:/usr/lib/jvm/java-21-openjdk-amd64/bin

To set these values for all system users, add the following changes to /etc/environment instead.


Install DbVisualizer

  1. Download the Linux - without Java version.

  2. Execute the script located in the Downloads directory.

Ensure the file has execute permission.

You will need to edit the version before executing the shell script.

cd
cd ~/Downloads
./dbvis_linux_24_1_3.sh
  3. Accept the default path / settings.

  4. Once finished, accept the 'Free license forever' option.

  5. To start DbVisualizer.

cd
cd ~/DbVisualizer
./dbvis

As we know, MongoDB stores data as 'Collections' of 'Documents' in a flat structure.

The BIDB database contains aggregated documents, useful for reporting.

  1. Start DbVisualizer.

cd
cd ~/DbVisualizer
./dbvis
  2. Click on the blue cross to create a new connection.

  3. From the drop-down list, select MySQL 5.

  4. Enter the following details:

Connection Name: BIDB
Database Server: localhost
Database Port: 3307
Use SSL: Automatic
Database Userid: root
Database Password: groot

  5. Ping the server & Connect.

  6. Using SQL Commander, query the database.

Ensure you have installed and configured the tools described above before commencing the following steps.

Ensure you have enabled the MongoDB ports.

To access Altair: http://localhost/graghql2

MongoDB Compass download: https://www.mongodb.com/try/download/compass