Blogs

The first step in blogging is not writing them but reading them. - Jeff Jarvis, journalist

08Jun, 2021
Data Profiling Services

Data Profiling - a “go / no go” decision maker

Data profiling—reviewing source data for content and quality—is required for data processing and analysis. Data profiling is becoming increasingly critical as data grows in size and infrastructure moves to the cloud. With limited time and resources, data profiling must be completed.

In this post, we will cover the following:

  1. What is data profiling?
  2. Why Is Data Profiling Important?
  3. Types of data profiling
  4. Data Profiling: The Different Techniques
  5. Data profiling steps—an efficient process for data profiling
  6. Data profiling and data quality analysis best practices
  7. Data profiling challenges
  8. Our approaches
  9. What is data profiling?

    Reviewing source data, comprehending structure, content, and interrelationships, and finding potential for data projects are all part of data profiling.

    Data profiling is a crucial part of:

    • Data warehouse and business intelligence (DW/BI) projects—Data profiling can reveal data quality concerns in data sources, as well as what has to be fixed in the ETL.
    • Data conversion and migration projects—Data profiling can reveal concerns with data quality, which can be addressed with scripts and data integration tools that replicate data from source to target. It can also help the target system discover new requirements.
    • Source system data quality projects—Data profiling can reveal data with substantial or many quality concerns, as well as the cause of the problems (e.g. user inputs, errors in interfaces, data corruption).

    Data profiling involves:

    • Obtaining descriptive statistics such as the minimum, maximum, count, and sum.
    • Data kinds, lengths, and recurrent patterns are being collected.
    • Keywords, descriptions, and categories are used to categorise and tag data.
    • Executing a data quality evaluation, as well as the danger of performing data joins
    • Identifying metadata and evaluating its accuracy.
    • Identifying distributions, key candidates, foreign-key candidates, functional dependencies, embedding value dependencies, and performing inter-table analysis are all tasks that must be completed.
  10. Why Is Data Profiling Important?

    Using tainted data puts your entire project at jeopardy. The problems and challenges that data integration projects confront are comparable to those that the IT industry faces. They include:

    • Compromising quality to meet deadlines
    • Lack of time
    • Budget overrun
    • Incorrect and insufficient understanding of the data source

    Certain factors could be the source of these difficulties and troubles, including the following

    • Due to the large number of data, unwinding it can be tough.
    • Databases and applications are quite complex.
    • The procedure is difficult and time-consuming.
    • This is also prone to mistakes.

    Before data can be incorporated or used in an application, its quality, structure, and content must be understood.

    The majority of data integration initiatives rely on external sources of information to comprehend the precision and quality of data. This includes relying on staff knowledge and documentation, as well as relying on source programmes and documentation.

    Most of the time, external data may be incorrect, old, or incomplete. This means that fixing these flaws and validating your data will take additional time, effort, and money. If you don't do that, the entire project will be jeopardised.

    Data Profiling is necessary for the following.

    • To comprehend the data
    • to put it together
    • To see if your data matches the data it came from.
    • To guarantee that the data is consistent with the statistical measurement standards
    • To ensure that data is compliant with the company's policies and procedures.

    The following questions can be answered with the help of proper data profiling.

    • Do you have all of the necessary information?
    • Will that information be enough to accomplish your job on time?
    • Is all of your information correct? or are there any values that are blank?
    • What makes your data stand out?
    • Is it compatible with your company's needs?
    • Is it representative of your company's requirements?
    • Is it possible to combine, cross-refer, or consolidate data to make it more usable?
    • What data needs to be cleaned?
    • Is there any data that has been duplicated?
    • Are there any unusual data patterns?
    • What data has to be transformed?
    • Are you confident in its accuracy and consistency?

    Being able to appropriately answer these questions will assure the quality of your data, which is critical for your company's overall growth and success.

  11. Types of data profiling

    There are three main types of data profiling:

    1. Structure discovery

      Validating data for consistency and accurate formatting, as well as performing mathematical checks on the data (e.g. sum, minimum or maximum). Structure discovery is used to determine how well data is structured, such as what proportion of phone numbers are incorrectly formatted.

    2. Content discovery

      Individual data records are examined for errors. Content discovery determines which single rows in a table have difficulties, as well as which systemic issues exist in the data (for example, phone numbers with no area code).

    3. Relationship discovery

      Finding out how different pieces of the data are connected. Key linkages between database tables, for example, or spreadsheet references between cells or tables. Reusing data requires an understanding of relationships; related data sources should be combined into one or imported in a way that preserves significant linkages.

  12. Data Profiling: The Different Techniques

    In general data, profiling is done using 3 different techniques. They are the following.

    1. Column Profiling Technique

      The number of times each value appears within each of the columns in the table is tallied using this profiling technique. This method aids in the discovery of patterns in data as well as the comprehension of the frequency distribution.

    2. Cross-Column Profiling Technique

      Under this data profiling technique, there are two distinct procedures. They are:

      • Key Analysis
      • Dependency Analysis

      A method in which a group of values in a table is searched to find a potential primary key is known as key analysis.

      The structure built/dependent relationships within the data collection are identified via dependency analysis. The process of Dependency Analysis is more complex than Key Analysis.

      Both of these methods are used to find dependencies and relationships between data attributes in a table.

    3. Cross-Table Profiling Technique

      This method of profiling looks through the entire table for possible foreign keys. This method also aids in identifying data and syntactic discrepancies and similarities between tables. This will aid in the removal of data redundancy as well as the identification of data sets that can be charted together.

      Data Rule Validation is an additional step that is often considered the ultimate step in data profiling. Using a set of predetermined rules, this proactive method evaluates the validity and integrity of the data entered.

    The above-mentioned techniques of data profiling may be carried out using automated services or can be done by an analyst manually.

  13. Data profiling steps—an efficient process for data profiling

    Ralph Kimball, a father of data warehouse architecture, suggests a four-step process for data profiling:

    • Use data profiling at project start to discover if data is suitable for analysis—and make a “go / no go” decision on the project.
    • Identify and correct data quality issues in source data, even before starting to move it into target database.
    • Identify data quality issues that can be corrected by Extract-Transform-Load (ETL), while data is moved from source to target. Data profiling can uncover if additional manual processing is needed.
    • Identify unanticipated business rules, hierarchical structures and foreign key / private key relationships, use them to fine-tune the ETL process.
  14. Data profiling and data quality analysis best practices

    Basic data profiling techniques:

    • Distinct count and percent—identifies natural keys, distinct values in each column that can help process inserts and updates. Handy for tables without headers.
    • Percent of zero / blank / null values—identifies missing or unknown data. Helps ETL architects setup appropriate default values.
    • Minimum / maximum / average string length—helps select appropriate data types and sizes in target database. Enables setting column widths just wide enough for the data, to improve performance.

    Advanced data profiling techniques:

    • Key integrity—Using zero/blank/null analysis, assures that keys are always present in the data. It also aids in the detection of orphan keys, which provide a difficulty for ETL and future analysis.
    • Cardinality—One-to-one, one-to-many, and many-to-many linkages between related data sets are checked. This enables BI tools to accurately conduct inner or outer joins.
    • Pattern and frequency distributions—Verifies that data fields are formatted correctly, such as if emails are in a valid format. Data fields utilised in outgoing messages are quite crucial (emails, phone numbers, addresses).
  15. Data profiling challenges
    • The sheer volume of data you'll need to profile can make data profiling tough. This is especially true when dealing with an older system. Years of old data with thousands of inaccuracies could be found in a legacy system. Experts advise segmenting your data as part of your data profiling procedure in order to discern the forest for the trees.
    • If you do your data profiling manually, you'll need an expert to run multiple queries and filter through the results in order to acquire useful insights about your data, which can take up a lot of time and money. Furthermore, you will most likely only be able to check a fraction of your total data because going through the complete data collection is too time-consuming.

    The data profiling procedure aids in determining whether the table's rows are filled with accurate and valid data, as well as determining the data's quality. Once a problem has been identified, you must address it by outlining the measures for data quality in your project. Data profiling aids in the correct management of your data.

  16. Our approaches:

    If you’d like to know more about in2in global automated data profiling solution that can profile your data for a “go / no go” decision and make it usable for its intended purposes, please get in touch with us!

Give us an opportunity to serve you better

Let's talk about your data transformation and insights issues. Our highly skilled data analysts will look at your data challenges and offer cost-effective price plans on both automated platform & consulting approach. We'll show you how our solutions mitigate risk, guarantee compliance and optimize revenues from your transactional data.