What Is Data Profiling?

Data profiling is critical to data migration projects because data quality in legacy systems is usually worse than business users assume. Bad data can interfere with properly functioning businesses.

Imagine invalid contact information preventing customers from receiving invoices or a system showing a quantity on hand of 150 for an item that is completely out of stock and discontinued. By knowing the facts of your data, instead of relying on assumptions and possibly outdated information, steps can be taken to clean up data to more efficiently run your business.

The following are examples of basic types of field level data profiling:

Numeric Field

Amount Paid
  • Invalid: 0
  • Negative: 0
  • Zero: 63,149
  • Positive: 404
  • Highest: 101,085,153.43
  • Lowest: 0
  • Total: 495,831,993.65
  • Average: 7,801.87

Character Field

Unit of Measure
  • Value: Count
  • EA: 63104
  • EACH: 204
  • FT: 19
  • GAL: 3
  • GRAM: 1
  • IN: 1
  • LB: 10

Date Field

Creation Date
  • Invalid: 0
  • Blank: 63,406
  • Non-Blank: 147
  • Highest: 7/12/2019
  • Lowest: 2/5/1991

Pattern

Phone Number
  • Value: Count
  • (999)999-9999 : 145,035
  • 99999 99 99 : 15,013
  • A999-999-9999 : 1,529
  • +99#99#(99)#9999 : 1,062
  • AAAAAAA@AAAAA.AAA : 202
  • AAAA : 29

Why Is Data Profiling Important?

Data profiling is critical to data migration projects because data quality in legacy systems is usually worse than business users assume. Bad data can interfere with properly functioning businesses.

Imagine invalid contact information preventing customers from receiving invoices or a system showing a quantity on hand of 150 for an item that is completely out of stock and discontinued. By knowing the facts of your data, instead of relying on assumptions and possibly outdated information, steps can be taken to clean up data to more efficiently run your business.

When Should Data be Profiled?

Data profiling is used during the data assessment, data mapping, data cleansing, and reconciliation phases.  Performing data profiling during each of these phases will help with mitigating risk at each step of the way related to major data issues that could result in project delays which ultimately impacts the budget.

During the data assessment, data profiling should be performed on all of the legacy data identified to be converted as part of the migration effort early in the project before a functional team even begins documenting the data mapping specifications.  By profiling the data first, the functional and data migration teams can work together to understand the current state of the legacy data and the real data facts can be used to document more accurate and complete data mapping specifications.  Data issues are uncovered early in the project rather than in the middle of a test cycle, or worse, during production cut-over, which can inevitably cause delays and increase costs.  The earlier these issues are identified, more informed decisions can be made regarding data cleansing efforts and how specific scenarios can be handled during the data transformations.

In the character field profiling example above, there appears to be inconsistent use of values in the UOM field.  The data profiling uncovered the values EA vs EACH and IN vs INCH.  By knowing this up front, the mapping specifications can be documented accurately to account for all of the values identified without any being inadvertently missed.

Data profiling results can also be used to assist with data cleansing efforts.  It’s especially helpful in identifying missing data, such as missing pieces of an address that are required for successful delivery or identify strange patterns in specific fields such as a phone number in an email address field.  These types of bad data likely have impacts on the business.  Uncovering these types of issues early on allows time for the data to be cleansed in the legacy system and leaving the data in a better state for the target system.

During the post-conversion reconciliation phase, data profiling can be used to validate record counts, translated values, and more.