Enter the following email and password, then click Sign In.
Username: data_steward@hv.com
Password: Welcome123!
Security Advisory: Handling Login Credentials
For enhanced security, it is strongly recommended that users avoid saving their login details directly in web browsers. Browsers may inadvertently autofill these credentials in unrelated fields, posing a security risk.
Best Practice
• Disable Autofill: To mitigate potential risks, users should disable the autofill functionality for login credentials in their browser settings. This preventive measure ensures that sensitive information is not unintentionally exposed or misused.
From the Business Rules card, click Add New and select Add Business Rule.
Data dictionaries contain technical information about data assets, such as data sources, fields and data types. They are typically used by technical audiences such as data engineers and data analysts to understand the data. Data catalogs contain much broader and deeper data intelligence than data dictionaries do.
Data Patterns in Data Identification
Data patterns play a crucial role in identifying and categorizing data within a data catalog. These patterns are essentially recurring characteristics or behaviors in data sets that can be recognized and used to automate data management.
The 'Getting Started' -> 'Identify the data' section explained how data patterns are used to profile the data.
Data Pattern Analysis reduces each data item to a simple pattern, essentially applying dimensional reduction to each character position in the input text. The result is a string that indicates where alphabetic characters, numeric characters, symbols, and whitespace appear.
For example, KT-1734B generates a data pattern of “AA-nnnnA”, indicating two letters, followed by a dash, four digits, and another letter.
Case sensitivity could optionally be tracked as well. Also, the set of “significant” symbols might be user-configurable (i.e., “As a data quality engineer, for this column, a dash and an underscore are significant”).
The base process iterates over every character in the data item and performs a simple character-for-character substitution, resulting in a “data pattern” string for the item.
The pattern consists of the following characters:

Character   Description
a           lower case alphabetic character
A           upper case alphabetic character
n           digit 0..9
w           whitespace character (space, tab)
s           symbol character (e.g., -/|!£$%^&*()+=[]{}@#~;:,.?¬¥§¢" )
-           some other character (control, special symbol, etc.)

Any other symbol may be treated as “significant” (such as a dash, underscore, or colon); significant symbols are output as-is in the generated data pattern for the entry.
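The substitution could be sketched in Python along the following lines (the data_pattern function name, the character-class checks, and the configurable set of significant symbols are illustrative, not the actual implementation):

def data_pattern(value, significant="-"):
    # Substitute each character with its pattern class; significant symbols pass through.
    out = []
    for ch in value:
        if ch in significant:
            out.append(ch)        # significant symbol kept as-is
        elif ch.islower():
            out.append("a")       # lower case alphabetic character
        elif ch.isupper():
            out.append("A")       # upper case alphabetic character
        elif ch.isdigit():
            out.append("n")       # digit 0..9
        elif ch.isspace():
            out.append("w")       # whitespace character (space, tab)
        elif ch.isprintable():
            out.append("s")       # other printable symbol
        else:
            out.append("-")       # control or other special character
    return "".join(out)

print(data_pattern("KT-1734B"))   # AA-nnnnA
print(data_pattern("KT127-3"))    # AAnnn-n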
Additional tests could be built into the algorithm to look for further characteristics. For example, date formats can be very tricky: PDC could observe that ‘nn/nn/nnnn’ is a date and could then observe whether it is predominantly ‘mm/dd/yyyy’ or ‘dd/mm/yyyy’.
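A minimal sketch of that date check, assuming the values have already matched the ‘nn/nn/nnnn’ pattern (the infer_date_format name and the greater-than-12 heuristic are assumptions for illustration, not documented PDC behaviour):

def infer_date_format(values):
    # Count how often each of the first two components exceeds 12 and
    # therefore cannot be a month.
    first_gt_12 = second_gt_12 = 0
    for v in values:
        day_or_month, month_or_day = v.split("/")[:2]
        if int(day_or_month) > 12:
            first_gt_12 += 1
        if int(month_or_day) > 12:
            second_gt_12 += 1
    if first_gt_12 > second_gt_12:
        return "dd/mm/yyyy"
    if second_gt_12 > first_gt_12:
        return "mm/dd/yyyy"
    return "ambiguous"

print(infer_date_format(["03/15/2023", "11/28/2022", "07/04/2021"]))  # mm/dd/yyyy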
Another enhancement is detecting credit card numbers.
Let's look at an example:
Part numbers often begin with two or three designated letters. This observation helps define a more precise RegEx rule based on the observed patterns.
Additionally, tracking the "largest" and "smallest" values for each character position in these patterns reveals the degree of variability per position. Each time a pattern recurs, a counter tallies its occurrence; upon identifying a new pattern, the system stores the analyzed data as a distinct “sample” for that pattern.
The first step is to generate a substitution string (for the purposes of the example, not all possible characters are shown). The top row is the character lookup row, and the bottom row is the substitution to be made for each character position.
For example, “KT127-3” would generate the simple pattern “AAnnn-n”. The largest and smallest character seen at each character position is also tracked.
Consider a set of tracking numbers and the associated pattern for each:

Code        Pattern
KT17341     AAnnnnn
KL91632     AAnnnnn
KW81234     AAnnnnn
KW91020     AAnnnnn
KA002021    AAnnnnnn
Additionally, we capture the largest and smallest character seen in each character position. This allows us to potentially determine if there are fixed characters in the pattern, and to generate stricter RegEx recommendations.
AAnnnnn – Occurs 4 times
KL11020 – Lowest character seen in each position
KW97644 – Highest character seen in each position
AAnnnnnn – Occurs 1 time
KA002021 – Lowest character seen in each position
KA002021 – Highest character seen in each position
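These per-position statistics are what make the stricter RegEx recommendations possible. As a hedged sketch (the pattern_to_regex name, the rule for fixing a position, and the character-range construction are assumptions for illustration):

import re

def pattern_to_regex(pattern, lowest, highest):
    # Turn a data pattern plus its lowest/highest observed strings into a
    # stricter regular expression. Only the classes needed here (A, a, n)
    # get a character range; anything else is kept as a literal.
    parts = []
    for p, lo, hi in zip(pattern, lowest, highest):
        if lo == hi:
            parts.append(re.escape(lo))     # fixed character at this position
        elif p in ("A", "a", "n"):
            parts.append("[" + re.escape(lo) + "-" + re.escape(hi) + "]")
        else:
            parts.append(re.escape(p))      # significant symbol kept as-is
    return "^" + "".join(parts) + "$"

print(pattern_to_regex("AAnnnnn", "KL11020", "KW97644"))
# ^K[L-W][1-9][1-7][0-6][2-4][0-4]$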
The top ~20 data patterns will be captured and stored for subsequent consumption by data quality-related and other processes as needed.
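A minimal sketch of how that capture could be accumulated, reusing the data_pattern function from the earlier sketch (the profile_column name, its structure, and the top-20 default are illustrative):

from collections import Counter

def profile_column(values, top_n=20):
    # Accumulate, per data pattern: an occurrence count, the lowest and highest
    # character seen at each position, and a sample value.
    counts = Counter()
    lowest, highest, sample = {}, {}, {}
    for v in values:
        p = data_pattern(v)                 # from the earlier sketch
        counts[p] += 1
        if p not in sample:
            sample[p], lowest[p], highest[p] = v, list(v), list(v)
        else:
            lowest[p] = [min(a, b) for a, b in zip(lowest[p], v)]
            highest[p] = [max(a, b) for a, b in zip(highest[p], v)]
    # Keep only the most frequent patterns (top ~20 by default).
    return [(p, c, "".join(lowest[p]), "".join(highest[p]), sample[p])
            for p, c in counts.most_common(top_n)]

codes = ["KT17341", "KL91632", "KW81234", "KW91020", "KA002021"]
for p, count, lo, hi, s in profile_column(codes):
    print(p, count, lo, hi, s)
# AAnnnnn  4  KL11020   KW97644   KT17341
# AAnnnnnn 1  KA002021  KA002021  KA002021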
Management - Data Identification Methods
Data Dictionary v Data Catalog
If our data comes from a certain probability distribution, we can reduce its size by estimating the parameters of this distribution.
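As a minimal sketch of that idea, assuming the column follows a normal distribution (the sample values are hypothetical):

import numpy as np

# Hypothetical numeric column.
values = np.array([102.3, 98.7, 101.1, 99.4, 100.8, 97.9])

# Estimate the parameters of an assumed normal distribution; the column can
# then be summarised by two numbers instead of every raw value.
mu = values.mean()
sigma = values.std(ddof=1)
print("mean=%.2f, std dev=%.2f" % (mu, sigma))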