Metadata Store
Metadata Store in Data Catalogs: A Foundation for Effective Data Management
A metadata store within a data catalog serves as a critical repository for storing metadata from various sources, ensuring the preservation of their relationships, historical context, and vital details. Acting as the backbone of a data catalog's architecture, this specialized database is instrumental in streamlining the organization and handling of metadata. This, in turn, significantly aids in the efficient discovery, understanding, and governance of data.
The primary role of the metadata store is to facilitate easy access to the correct data assets for users by encompassing technical, process, and business metadata. This includes comprehensive information on the data's structure, its lineage, utilization patterns, and the business context it operates within. By making data assets readily identifiable, assessable, and actionable, the metadata store plays a pivotal role in bolstering data-driven decision-making processes.
The diagram depicts a simplified representation of the key data model of the PDC Metadata Store (MDS). To make the relationships between the collections easier to follow, it is illustrated in a relational database format.
The model illustrates the relationships between data sources and entities where the customer data assets and discovered metadata properties are persisted as rich JSON graphs.
At the core of the model is the mds.entities collection.
So, what happens when you connect, ingest, and profile a data source?
When you create a data source connection, for example, to the Synthea dataset on PostgreSQL, the details are persisted in the resourceconnections collection located in the fe database.
If you're not familiar with MQL, the easiest way to query the MongoDB database is to use SQL!
For example, to display the connection name in a dashboard, you would select the resourceName where configMethod = 'credentials'.
The following table shows how SQL components map to MongoDB document components.
SQL component → MongoDB component
database → database
table → collection
column → field
row (record) → document
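For reference, here is a minimal sketch of the native MQL equivalent of the SQL query used below (NoSQLBooster generates something similar under the hood):
// Native MQL: collection = table, field = column, document = row
db.resourceconnections.find(
  { configMethod: 'credentials' },   // WHERE configMethod = 'credentials'
  { resourceName: 1 }                // SELECT resourceName
)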
Run NoSQLBooster.
cd
cd ~/nosqlbooster4mongo*
./nosqlbooster4mongo*
In the Query panel, enter the following SQL statement.
// Single quotes are for strings: 'string'
mb.runSQLQuery(`
SELECT resourceName FROM resourceconnections
WHERE configMethod = 'credentials'
`).sort({_id:-1})
.limit(100)
As expected, the result is postgresql:synthea, as there's currently only one connection defined.
When you ingest the database schema, the connection details are persisted in the datasources collection located in the mds database.
The database schema metadata details are persisted in the entities collection (synthea - 179 documents).
The example below shows the properties of the zip column from the providers 'table'.
Taking a look at the MDS diagram above, we can see that the recursive mds.entities.type refers to:
SCHEMA - synthea
TABLE - 'aggregated' documents
RESOURCE
COLUMN
This is useful for querying the dataset. The mds.entities collection consists of documents (records) that are referenced by knowing the 'type:'.
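For example, a quick way to see how many documents of each type were ingested is to group on the type field (a minimal sketch using a standard aggregation):
// Count the mds.entities documents per type (SCHEMA, TABLE, COLUMN, ...)
db.entities.aggregate([
  { $group: { _id: '$type', count: { $sum: 1 } } },
  { $sort: { count: -1 } }
])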
Edit the query to retrieve the fields on the 'patients' table (you could have ingested more than 1 schema).
db.entities.find({type:'TABLE',_id:'1/business_apps_db/synthea/patients'})
.projection({})
.sort({_id:-1})
.limit(100)
TIP: To find the TABLE _id, under Actions select the option: Copy path.
mb.runSQLQuery(`
SELECT * FROM entities
WHERE type = 'TABLE' AND name = 'patients'
`).sort({_id:-1})
.limit(100)
Notice the out-of-the-box predefined collections for:
datapatterns
dictionaries
dictionaryTerms
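To peek inside these collections, a minimal sketch (the collection names are as listed above; the document shapes will vary):
// Sample a few documents from each predefined collection
db.datapatterns.find().limit(3)
db.dictionaries.find().limit(3)
db.dictionaryTerms.find().limit(3)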
The data profiling process generates statistical and intermediate data that is required by other data analytic processes. The intermediate data is consumed by downstream processes such as data flow and foreign key detection.
Referencing the MDS model, the profiling results are persisted in the mds.data_profile collection.
The intermediate data generated for each column of data includes:
Roaring Bitset
A bitmap of the hash values for all entries in the column.
HyperLogLog (HLL)
Provides an estimate of the cardinality of the data, with a roughly ~2% margin of error.
Data Pattern Analysis
Performs a rudimentary data pattern analysis using dimensional reduction, tracking the most frequently occurring patterns.
Data Quality Pre-Analysis
Using the Data Pattern Analysis results, Data Catalog performs a statistical estimation of the data quality. This is summarized as an overall percentage as well as a heat map for each data pattern. Additionally, Data Catalog makes RegEx recommendations for the most probable matches.
Statistics
Data Catalog gathers the following statistics when examining all the data:
• Minimum and Maximum values (for numeric columns)
• Widest and Narrowest (non-null) string widths
• Null count
• Total row count
Data Sampling
Data Catalog takes a controlled sampling of the data so that the samples are consistently chosen across different columns.
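To explore what the profiler persisted for these structures, you can pull back the profile documents for a whole table by their _id path prefix. This is a minimal sketch: the _id convention follows the examples in this section, but the fields inside each document will vary.
// List the profile documents recorded for every column of 'patients'
db.data_profile.find({ _id: { $regex: '^1/business_apps_db/synthea/patients/' } })
  .limit(100)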
Edit the query to retrieve the fields for the 'patients.passport' column.
db.entities.find({type:'COLUMN',_id:'1/business_apps_db/synthea/patients/passport'})
.projection({})
.sort({_id:-1})
.limit(100)
TIP: To find the COLUMN _id, under Actions select the option: Copy path.
mb.runSQLQuery(`
SELECT * FROM entities
WHERE type = 'COLUMN' AND name = 'passport'
`).sort({_id:-1})
.limit(100)
The Statistics, Data Patterns and Properties are displayed in the Data Canvas.
Naturally, a lot more data is captured, which is stored in the mds.data_profile collection.
Edit the query to retrieve the profile document for the 'patients.passport' column from mds.data_profile.
db.data_profile.find({_id:'1/business_apps_db/synthea/patients/passport'})
.projection({})
.sort({_id:-1})
.limit(100)
mb.runSQLQuery(`
SELECT * FROM data_profile
WHERE _id = '1/business_apps_db/synthea/patients/passport'
`).sort({_id:-1})
.limit(100)
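Since the exact field layout of a profile document isn't documented here, a minimal sketch to discover its top-level sections:
// Print the top-level field names of a single profile document
var doc = db.data_profile.findOne({ _id: '1/business_apps_db/synthea/patients/passport' })
if (doc) { print(Object.keys(doc).join('\n')) }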
NoSQLBooster is a cross-platform GUI tool for MongoDB Server 3.6-7.0, which provides a built-in MongoDB script debugger, comprehensive server monitoring tools, fluent query chaining, SQL queries, a query code generator, task scheduling, ES2020 support, and an advanced IntelliSense experience.
To watch the videos, copy and paste the website URL into your host Chrome browser.
Be aware that this exposes the MongoDB port!
Navigate to the PDC deployment folder.
cd
cd /opt/pentaho/pdc-docker-deployment/vendor/
Edit the docker-compose.yml file.
nano docker-compose.yml
Uncomment the ports entry under mongodb in docker-compose.yml.
mongodb:
image: ${PDC_IMAGE_PREFIX}pdc-mongodb-ee:${PDC_MONGODB_TAG:-master}
ports:
- "27017:27017"
Save changes.
CTRL + O
Enter
CTRL + X
Restart PDC.
cd
cd /opt/pentaho/pdc-docker-deployment
sudo ./pdc.sh up
Ensure all the containers restart successfully.
Download NoSQLBooster.
cd
cd ~/Downloads
wget https://s3.nosqlbooster.com/download/releasesv8/nosqlbooster4mongo-8.1.7.tar.gz
Untar the package.
cd
cd ~/Downloads
tar -xvf nosqlbooster4mongo-8.1.7.tar.gz
Move the NoSQLBooster folder to home.
cd
cd ~/Downloads
mv nosqlbooster4mongo* /home/pdc
Run NoSQLBooster.
cd
cd ~/nosqlbooster4mongo*
ls
./nosqlbooster4mongo*
From the Connect option, select Connect From Connection String URI.
Enter the following details.
Connection string: mongodb://root:broot@localhost:27017
Finally, click 'Connect'.
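Once connected, a quick sanity check from the shell (standard admin commands; expect to see the fe, mds, and bidb databases among others):
// Verify the connection and list the available databases
db.adminCommand({ ping: 1 })
db.adminCommand({ listDatabases: 1 })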
When the VM starts up, sometimes not all PDC Docker containers start, and as a result some functionality may not work, such as:
The login page not showing the login panel
The Register page not creating the root user
Adding a data source and the Test Connection button not working
In this case, navigate to the PDC docker deployment location (/opt/pentaho/pdc-docker-deployment) and run ./pdc.sh up to ensure all PDC containers are running.
MongoDB Compass is a powerful GUI tool that simplifies the processes of querying, aggregating, and analyzing MongoDB data through a user-friendly visual interface.
Check for the latest package:
Update the Ubuntu system’s repositories.
sudo apt update
Download MongoDB Compass.
You may need to edit the version in the request below.
cd
cd ~/Downloads
wget https://downloads.mongodb.com/compass/mongodb-compass_1.43.4_amd64.deb
Install MongoDB Compass.
cd
cd ~/Downloads
sudo apt install ./mongodb-compass_1.43.4_amd64.deb
Click on New Connection.
Ensure you check the option: Direct Connection
Click on the Authentication option.
Enter the following details:
Username: root
Password: broot
Click Connect.
Through the MongoDB Atlas SQL Interface, Connectors, and Drivers, you can leverage your existing SQL knowledge to query and analyze live application data directly from your preferred SQL-based tool.
Altair is a feature-rich GraphQL client that enables developers to debug and test their GraphQL queries. It provides an intuitive interface for crafting queries, setting variables, and viewing responses.
Altair supports various features like query autocompletion, response formatting, and even documentation exploration, making it an invaluable tool for working with GraphQL APIs.
Enter your query:
query {userCount(filter:{})}
Returns the number of users.
query {
  userMany(limit: 100) {
    email
    date
    firstName
    lastName
    lastLogin
    modified
    modifiedBy
    status
    _id
  }
}
Returns User information - Limit 100
The bidb database is structured to allow a JDBC MySQL connection, which is useful for creating audit reports.
DbVisualizer lets you visualize, manage, and edit data with the power of SQL and the simplicity of a spreadsheet.
Ensure the apt libraries are updated.
sudo apt update && sudo apt upgrade -y
Check if Java has been installed.
java -version
To see all available editions of OpenJDK.
apt-cache search openjdk
Install OpenJDK 21 JDK.
sudo apt install openjdk-21-jdk
Install OpenJDK 21 JRE.
sudo apt install openjdk-21-jre-headless
Check installation.
java -version
pdc@pdc:~$ java -version
openjdk version "21.0.3" 2024-04-16
OpenJDK Runtime Environment (build 21.0.3+9-Ubuntu-1ubuntu1)
OpenJDK 64-Bit Server VM (build 21.0.3+9-Ubuntu-1ubuntu1, mixed mode, sharing)
Although Java is already installed, further configuration is required. Setting a few environment variables makes OpenJDK easier to use and allows other applications to find it.
Edit the .bashrc file in your home directory.
cd
sudo nano ~/.bashrc
Add the following lines to the bottom of the file.
export JAVA_HOME=$(dirname $(dirname $(readlink -f $(which java))))
export PATH=$PATH:$JAVA_HOME/bin
Save.
Ctrl + o
Enter
Ctrl + x
Source the file to apply the changes.
source ~/.bashrc
Echo JAVA_HOME and PATH to verify they are set correctly.
echo $JAVA_HOME
echo $PATH
/usr/lib/jvm/java-21-openjdk-amd64
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/snap/bin:/usr/lib/jvm/java-21-openjdk-amd64/bin
To set these values for all system users, add the following changes to /etc/environment instead.
Download the 'Linux - without Java' version.
Execute the script located in the Downloads directory.
Ensure the file has execute permission.
You will need to edit the version before executing the shell script.
cd
cd ~/Downloads
./dbvis_linux_24_1_3.sh
Accept the default path / settings.
Once 'Finished', accept the 'Free license forever' option.
To start DbVisualizer.
cd
cd ~/DbVisualizer
./dbvis
As we know, MongoDB stores data as 'Collections' of 'Documents' in a flat, non-relational structure.
The BIDB database contains aggregated documents, useful for reporting.
Start DBVisualizer.
cd
cd ~/DbVisualizer
./dbvis
Click on the blue cross to create a new connection.
From the drop-down list, select MySQL 5.
Enter the following details:
Connection Name: BIDB
Database Server: localhost
Database Port: 3307
Use SSL: Automatic
Database Userid: root
Database Password: groot
Ping the server & Connect.
Using SQL Commander, query the database.
Ensure you have installed and configured the tools above before commencing the following steps.
Ensure you have enabled the MongoDB ports.
To access Altair: