Data Sources
Adding Data Sources
Data Source Configuration Guide
Before integrating your data sources, it's essential to collect all the necessary configuration details. This guide outlines the key pieces of information required to establish a connection to your data sources. Your database administrator (DBA) will be a valuable resource in providing this configuration information.
URI (Uniform Resource Identifier): This unique identifier is used to locate your data source. You'll typically need a username and password to authenticate your connection.
Driver: Ensure you have the appropriate driver for your data source. This is crucial for enabling your application to communicate with the database.
For Amazon Web Services (AWS) data source types, a configuration method isn't specified. You must have information such as AWS region, account number, IAM username, access key ID, and secret access key to configure these data source types.
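If you want to verify a set of AWS access keys (and confirm the account number and IAM user they belong to) before entering them in the catalog, a quick check from a terminal with the AWS CLI installed might look like the following sketch; the values in angle brackets are placeholders, not lab credentials.
AWS_ACCESS_KEY_ID=<access-key-id> AWS_SECRET_ACCESS_KEY=<secret-access-key> aws sts get-caller-identity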
Accessing Your Catalog
To access your catalog, please follow these steps:
Open the Google Chrome web browser and click on the bookmark, or
Navigate to: https://pdc.pdc.lab/
Enter the following username and password, then click Sign In.
Username:
Password: Welcome123!
Security Advisory: Handling Login Credentials
For enhanced security, it is strongly recommended that users avoid saving their login details directly in web browsers. Browsers may inadvertently autofill these credentials in unrelated fields, posing a security risk.
Best Practice
• Disable Autofill: To mitigate potential risks, users should disable the autofill functionality for login credentials in their browser settings. This preventive measure ensures that sensitive information is not unintentionally exposed or misused.
Click on: Management -> Resources tile.

Click on: Add Data Source.
Specify the following basic information for the connection to your data source (you'll find the connection details in the table below these descriptions):
Data Source Name
Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize.
Names must start with a letter and contain only letters, digits, and underscores. Whitespace in names is not supported.
Data Source ID (Optional)
Specify a permanent identifier for your data source. If you leave this field blank, Data Catalog generates a permanent identifier for you.
You cannot modify Data Source ID for this data source after you specify or generate it.
Description (Optional)
Specify a description of your data source.
Data Source Type
Select the database type of your source. You are then prompted to specify additional connection information based on the file system or database type you are trying to access.
After you have specified the basic information, specify the following additional connection information based on the file system or database type you are trying to access.
Affinity
This default setting specifies which agents should be associated with the data source in a multi-agent deployment.
Configuration Method: Select Credentials or URI as a configuration method.
Configuration Method: Credentials
• Username/Password: Credentials that provide access to the specified database.
• Host: The address of the machine where the Microsoft SQL database server is running. It can be an IP address or a domain name.
• Port: The port number on which the Microsoft SQL Server is listening for incoming connections. The default port is 1433.
Configuration Method: URI
• Username/Password: Credentials that provide access to the specified database.
• Service URI: For example, a URI would look like Server=myServerAddress;Database=myDatabase;User Id=myUsername;Password=myPassword;Port=1433;Integrated Security=False;Connection Timeout=30;.
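For reference, the same kind of connection details can also be written as a JDBC URL when a JDBC driver is used; a sketch with placeholder values (confirm with your driver's documentation which form the Service URI field expects):
jdbc:sqlserver://myServerAddress:1433;databaseName=myDatabase;user=myUsername;password=myPassword;encrypt=false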
Driver
Select an existing driver or upload a new driver to ensure that the communication between the application and the database is efficient, secure, and follows the required standards.
Database Name
The name of the database within the Microsoft SQL server that you want to connect with.
Connect to Demo Data Sources
Follow the steps below to connect to one of the demo datasets. In this workshop we're going to connect to the Synthea dataset, stored on a PostgreSQL database:
To install the 'Synthea' demo data source, click on the PostgreSQL tab below:

Synthea generates synthetic patient data, so it is free from legal and privacy concerns.

To watch the videos please copy and paste the website URL into your host Chrome browser.
Follow the steps below to connect and ingest the schema metadata:
Test Connection and Ingest Metadata Schema
After you have specified the detailed information according to your data source type, test the connection to the data source and add the data source.
Enter the following details to connect to: PostgreSQL business_apps_db (Synthea) database.
Data Source Name: postgresql:synthea
Data Source ID: Leave blank to autogenerate
Description: Demo dataset of patients' medical records
Data Source Type: PostgreSQL
Affinity: Default
Configuration Method: Credentials
Username: sqlreader
Password: 2Petabytes
*Host: pdc.pdc.lab
Port: 5432
**Driver: postgresql-42.7.1.jar
Database Name: business_apps_db
*Enter server IP address or FQDN.
**PDC does not ship with any database drivers.
To upload JDBC drivers follow the instructions in tab: 1.2 Upload JDBC drivers
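If you want to sanity-check these credentials from a terminal before clicking Test Connection, the psql client (not part of PDC; installed separately) can list the schemas in the database. A minimal sketch:
PGPASSWORD=2Petabytes psql -h pdc.pdc.lab -p 5432 -U sqlreader -d business_apps_db -c '\dn'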
Click Test Connection to test your connection to the specified data source.
Take a look at the 'workers' to check for any issues.

Prior to completing and saving your new data source setup, it's essential to execute the 'Ingest Schemas' process. This step is crucial for importing the database schema and associated metadata into the system.
Click Ingest Schema, select the 'synthea' schema, and then click Ingest Schemas.


(Optional) Enter a Note for any information you need to share with others who might access this data source.
Click: Create Data Source to establish your data source connection.


PDC does not ship with JDBC drivers. You will need to download the required driver from the vendor site.
To upload JDBC drivers
Click on Manage Drivers.

Click on 'Add New'.

Select Database type: POSTGRES

Click Add Driver.
Click Close & return to: Ingest Metadata
Install & Configure pgAdmin4
Ensure all the existing packages are up-to-date.
sudo apt update && sudo apt upgrade -y
Install the public key for the PgAdmin4 repository.
curl -fsS https://www.pgadmin.org/static/packages_pgadmin_org.pub | sudo gpg --dearmor -o /usr/share/keyrings/packages-pgadmin-org.gpg
Create the repository configuration file.
sudo sh -c 'echo "deb [signed-by=/usr/share/keyrings/packages-pgadmin-org.gpg] https://ftp.postgresql.org/pub/pgadmin/pgadmin4/apt/$(lsb_release -cs) pgadmin4 main" > /etc/apt/sources.list.d/pgadmin4.list && apt update'
Choose your preferred mode for PgAdmin4 installation.
• For both desktop and web modes:
sudo apt install pgadmin4
• For desktop mode only:
sudo apt install pgadmin4-desktop
• For web mode only:
sudo apt install pgadmin4-web
Connect to Synthea database
Start pgAdmin desktop.

Click on Add New Server button and enter the information of your remote server.

Name: Synthea
Host name: localhost
Port: 5432
Username: sqlreader
Password: 2Petabytes
View the data in the synthea schema.
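Equivalently, you can peek at the schema from a terminal with psql; a sketch, assuming the Synthea schema contains a patients table (adjust the table name to whatever the schema actually holds):
PGPASSWORD=2Petabytes psql -h localhost -p 5432 -U sqlreader -d business_apps_db -c 'SELECT * FROM synthea.patients LIMIT 5;'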

The easiest way to install DBeaver CE is to use Snap.
sudo snap install dbeaver-ce
To create a connection to the Synthea Postgres database
Select PostgreSQL database & click Next.

Enter the following connection details:
Connect by URL: jdbc:postgresql://pdc.pentaho.example:5432/businessapps_db
Username: sqlreader
Password: 2Petabytes
Click 'Test Connection' and download driver version 42.7.2 when prompted.

Click Finish

Click OK.

Expand Databases -> Schemas


To watch the videos please copy and paste the website URL into your host Chrome browser.
Follow the steps below to connect and ingest the schema metadata:
Test Connection and Ingest Metadata Schema
After you have specified the detailed information according to your data source type, test the connection to the data source and add the data source.
Enter the following details to connect to: MSSQL AdventureWorks2019 database.
Data Source Name: mssql:adventureworks2019
Data Source ID: Leave blank to autogenerate
Description: Demo dataset of a fictitious bicycle manufacturer
Data Source Type: Microsoft SQL Server
Affinity: Default
Configuration Method: Credentials
Username: sqlreader
Password: 2Petabytes
Host: pdc.pdc.lab
Port: 1433
Driver: mssql-jdbc-9.2.1.jre15.jar
Database Name: AdventureWorks2019
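To sanity-check these credentials from a terminal before clicking Test Connection, the sqlcmd client (if installed) can list the database's schemas; a sketch (the -C flag, which trusts the server certificate, is only needed on newer sqlcmd versions):
sqlcmd -S pdc.pdc.lab,1433 -U sqlreader -P 2Petabytes -d AdventureWorks2019 -C -Q "SELECT name FROM sys.schemas;"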
Click Test Connection to test your connection to the specified data source.
Click Ingest Schema, select the following 5 schemas, and then click Ingest Schemas.

(Optional) Enter a Note for any information you need to share with others who might access this data source.
Click: Create Data Source to establish your data source connection.

For Linux folks, you can access the MSSQL AdventureWorks2019 database with Azure Data Studio.
Ensure all the existing packages are up-to-date.
sudo apt update && sudo apt upgrade
Ensure dependencies are up-to-date.
sudo apt install libunwind8
Download the .deb binary available on the official website: Azure Data Studio
Install the .deb file.
cd ~
sudo dpkg -i ./Downloads/azuredatastudio-linux-<version string>.deb
Connect to AdventureWorks2019 database
Start Azure Data Studio.
azuredatastudio
Select: Connections (first icon in the left menu).

Select SQL Login
Enter the following details:
Connection type: Microsoft SQL Server
Input type: Parameters
*Server: localhost,1433
Authentication type: SQL Login
User name: sqlreader
Password: 2Petabytes
Database: AdventureWorks2019
Encrypt: Mandatory (True)
Trust server certificate: True
Server group: <Default>
Name (optional): AdventureWorks2019
*Enter server IP address or FQDN.
Click: Connect.

The Arlojet database is an airline demo dataset. You can query the data based on:
Passengers
Ticketing
Weather
Aircraft
Catering

To watch the videos please copy and paste the website URL into your host Chrome browser.
Follow the steps below to connect and ingest the schema metadata:
Test Connection and Ingest Metadata Schema
After you have specified the detailed information according to your data source type, test the connection to the data source and add the data source.
Enter the following details to connect to: MySQL arlojet database.
Data Source Name: mysql:arlojet
Data Source ID: Leave blank to autogenerate
Description: Demo dataset of airline / passenger data
Data Source Type: MySQL
Affinity: Default
Configuration Method: Credentials
Username: sqlreader
Password: 2Petabytes
Host: pdc.pdc.lab
Port: 3306
Driver: mysql-connector-j-8.2.0.jar
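As with the other databases, you can sanity-check the credentials from a terminal first; with the mysql command-line client installed, a minimal sketch:
mysql -h pdc.pdc.lab -P 3306 -u sqlreader -p2Petabytes -e "SHOW TABLES FROM arlojet;"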
Click Test Connection to test your connection to the specified data source.
Click Ingest Schema, select the 'arlojet' schema, and then click Ingest Schemas.

(Optional) Enter a Note for any information you need to share with others who might access this data source.
Click Create Data Source to establish your data source connection.

Install & Configure MySQL Workbench
MySQL Workbench is a graphical MySQL database management tool.
Ensure all the existing packages are up-to-date.
sudo apt update && sudo apt upgrade
Install MySQL Workbench.
sudo snap install mysql-workbench-community
Connect to MySQL Workbench
Select “Applications” from the menu.
Search for the MySQL workbench application, and then launch it.
Edit the default connection.

Enter the following connection details:
Connection Name: arlojet
Username: sqlreader
Password: 2Petabytes
Default Schema: arlojet
Click 'Test Connection'.

Click Close.
Connect to Arlojet database
Check for arlojet database.

Select the option: Schemas & expand Tables

MinIO is a high-performance, Kubernetes-native object storage service that is designed for cloud-native and containerized applications. It is open-source and allows enterprises to build Amazon S3-compatible data storage solutions on-premises, integrating smoothly with a wide range of cloud-native ecosystems.
Banking - Chat bot data.
Football -
IoT Sensor -

To watch the videos please copy and paste the website URL into your host Chrome browser.
Follow the steps below to connect and ingest the schema metadata:
Test Connection and Ingest Metadata Schema
After you have specified the detailed information according to your data source type, test the connection to the data source and add the data source.
Affinity
This default setting specifies which agents should be associated with the data source in a multi-agent deployment.
Region
Geographical location where AWS maintains a cluster of data centers.
Endpoint
Location of the bucket. For example, s3.<region containing S3 bucket>.amazonaws.com
Access Key
User credential to access data on the bucket.
Secret Key
Password credential to access data on the bucket.
Bucket Name
The name of the S3 bucket in which the data resides. For S3 access from non-EMR file systems, Data Catalog uses the AWS command line interface to access S3 data.
These commands send requests using access keys, which consist of an access key ID and a secret access key.
You must specify the logical name for the cluster root.
This value is defined by dfs.nameservices in the hdfs-site.xml configuration file.
For S3 access from AWS S3 and MapR file systems, you must identify the root of the MapR file system with maprfs:///.
Path
Directory where this data source is included.
Enter the following details to connect to the MinIO object store:
Data Source Name: minIO:sensor
Data Source ID: Leave blank to autogenerate
Description: Demo IoT sensor-data
Data Source Type: AWS S3
Affinity: Default
Region: us-east-1
Endpoint:
Access Key: minioadmin
Secret Key: minioadmin
Bucket Name: iot-sensors-data-lake
Path: /
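Because MinIO is S3-compatible, you can confirm the bucket is reachable with these keys using the AWS CLI before clicking Test Connection; a sketch (the endpoint URL is a placeholder for your lab's MinIO endpoint, which is not listed above):
AWS_ACCESS_KEY_ID=minioadmin AWS_SECRET_ACCESS_KEY=minioadmin aws s3 ls s3://iot-sensors-data-lake/ --endpoint-url <your-MinIO-endpoint>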
Click Test Connection to test your connection to the specified data source.
Click: Scan Files.

(Optional) Enter a Note for any information you need to share with others who might access this data source.
Click Create Data Source to establish your data source connection.

MinIO
The MinIO Console displays a login screen for unauthenticated users. The Console defaults to providing a username and password prompt for a minIO-managed user.
Either click on the bookmark or enter the following URL to log in to MinIO.

Username: minioadmin
Password: minioadmin
Managing Objects
The Object Browser lists the buckets and objects the authenticated user has access to on the deployment.
After logging in or navigating to the tab, the object browser displays a list of the user’s buckets, which the user can filter.
Select 'Buckets' from the left hand menu.
Browse 'banking-data' bucket to show a list of objects in the bucket.

Highlight: banking77.csv
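If you prefer the command line, the same object can be copied down with the AWS CLI pointed at MinIO; a sketch (the endpoint URL is again a placeholder):
AWS_ACCESS_KEY_ID=minioadmin AWS_SECRET_ACCESS_KEY=minioadmin aws s3 cp s3://banking-data/banking77.csv . --endpoint-url <your-MinIO-endpoint>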
