Duplicate master data is a common problem across organizations. Identifying and resolving duplicate information on customers, suppliers, items, employees, and other entities is an important activity. Solving the duplicate problem can yield increased buying power, more accurate marketing campaigns, reduced maintenance costs, improved inventory management, and better forecasting: improvements across nearly every aspect of the business.

In a new implementation, if duplicate data is not resolved before or during the migration, it will either load silently into the new ERP application and cause problems once the application is in use, or it will error out during the load and block the downstream transaction conversions that depend on it. In an ongoing data management process, it is important to prevent duplicates by defining the matching criteria and keeping scorecard metrics that confirm the organization is keeping duplicates from adding to overall data degradation.

The basic concept for identifying a duplicate candidate is straightforward: standardize the fields that identify a unique entity (e.g., a single customer, vendor, or item) and compare them. In practice, depending on the type and cleanliness of the data, the standardization process can be complex.
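
As a minimal Python sketch of the idea (the standardize helper and the sample values here are illustrative, not a production rule set): lowercase the identifying field, strip punctuation, collapse whitespace, and compare.

```python
import re

def standardize(value: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace so that
    superficially different values compare as equal."""
    value = value.lower()
    value = re.sub(r"[^a-z0-9 ]", "", value)  # drop punctuation
    return " ".join(value.split())            # collapse whitespace

# Two customer records that differ only in formatting
print(standardize("Acme  Widgets, Inc.") == standardize("ACME WIDGETS INC"))  # True
```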

There are several techniques that are commonly used to identify duplicates within master data.

  • Noise Word Removal – The process of removing words that add no distinguishing value to the data or that are often inconsistent. Common examples of noise words are “the”, “of”, and “inc”. (A combined sketch of this and the next three techniques appears after this list.)
  • Word Substitution – The process of replacing an existing word or phrase with another word or phrase. It is common to substitute names and abbreviations when identifying duplicates; for example, Tim would be replaced with Timothy, OZ would be replaced with Ounce, a single-quote character might be replaced with the word foot, and a double space might be replaced with a single space. Additionally, the context of a word within the current field or other fields can affect which substitution needs to take place.
  • Case Standardization – The process of making everything the same case.
  • Punctuation Removal – The process of removing punctuation and other characters that don’t add any significance to the field value.
  • Phonetic Encoding – The process of encoding words based on how they sound. For example, “donut” and “doughnut” would be phonetically encoded to the same value. Several phonetic encoding methods are in common use, such as Soundex and Metaphone (see the sketch after this list).
  • Address Standardization – The process of standardizing all of the components of an address prior to comparing values. The duplicate-check process usually utilizes address validation techniques, software, or services to make sure the address is valid and to ensure that all of the pieces of the address are formatted uniformly throughout the data (see the sketch after this list).
  • Attribute Standardization – Attribute data can be used to determine a distinct entity, but inconsistencies can make identical entities look different at first blush. For example, a unit of measure could be recorded in pounds on one record and in ounces on another, yet the quantities could be the same once the conversion factor is applied. Or two assemblies could have the same bill of material parts list attached to them. This attribute data can be used in the identification of duplicates (see the sketch after this list).
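
To make the first four techniques concrete, here is a minimal Python sketch; the NOISE_WORDS set, the SUBSTITUTIONS map, and the standardize_name helper are illustrative placeholders, not a production rule set.

```python
import re

NOISE_WORDS = {"the", "of", "inc"}                 # illustrative noise list
SUBSTITUTIONS = {"tim": "timothy", "oz": "ounce"}  # illustrative nickname map

def standardize_name(value: str) -> str:
    value = value.lower()                      # case standardization
    value = re.sub(r"[^a-z0-9 ]", " ", value)  # punctuation removal
    words = value.split()                      # also collapses double spaces
    words = [SUBSTITUTIONS.get(w, w) for w in words]    # word substitution
    words = [w for w in words if w not in NOISE_WORDS]  # noise word removal
    return " ".join(words)

print(standardize_name("Tim's Donuts, Inc."))  # -> "timothy s donuts"
```

The leftover one-character “s” in the output hints at why rules like ignoring one- or two-character words (discussed below) are often layered on top.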
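
For phonetic encoding, the sketch below is a compact version of the classic Soundex algorithm (it omits a few edge cases for brevity); Metaphone and other encoders follow the same idea of reducing words to a sound-based code.

```python
def soundex(word: str) -> str:
    """Compact Soundex: map letters to digit classes; vowels separate
    repeated codes, while 'h' and 'w' do not."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":  # h and w do not reset the previous code
            prev = code
    return (result + "000")[:4]

# The classic example pair encodes identically despite different spellings.
print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```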
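
A simplified sketch of address standardization follows; the abbreviation table is illustrative, and a real implementation would lean on address validation software or services as noted above.

```python
import re

# Illustrative USPS-style abbreviation map; real projects typically rely on
# address validation software or services rather than a hand-built table.
ABBREVIATIONS = {"street": "st", "avenue": "ave", "boulevard": "blvd",
                 "road": "rd", "suite": "ste", "north": "n"}

def standardize_address(address: str) -> str:
    address = re.sub(r"[^a-z0-9 ]", " ", address.lower())
    return " ".join(ABBREVIATIONS.get(w, w) for w in address.split())

a = standardize_address("123 North Main Street, Suite 4")
b = standardize_address("123 N. Main St Ste 4")
print(a == b)  # True -> the components now compare uniformly
```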
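
Finally, a small sketch of attribute standardization via unit conversion; the TO_OUNCES table and in_ounces helper are illustrative names, not part of any particular toolset.

```python
# Convert quantity attributes to a common base unit (ounces) before comparing.
TO_OUNCES = {"oz": 1.0, "ounce": 1.0, "lb": 16.0, "pound": 16.0}

def in_ounces(value: float, unit: str) -> float:
    return value * TO_OUNCES[unit.lower()]

# An item listed as 2 lb and another listed as 32 oz are the same quantity.
print(in_ounces(2, "lb") == in_ounces(32, "oz"))  # True
```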

Beyond these techniques, there are others that might be used: ignoring all one- or two-character words, matching on the first X characters, alphabetizing the words within a field or across fields, executing rules in a variety of orders, not requiring a field when it is not populated, merging the results of different match criteria together, and many more. There are seemingly endless combinations of rules that can be applied to identify duplicate candidates, and it usually takes several iterations of rules and combinations to determine which rule set(s) are appropriate. A few of these ideas are combined in the sketch below.
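
The match_key helper and its parameters below are illustrative choices, not a recommended rule set; the sketch simply layers three of the rules just mentioned into one match key.

```python
def match_key(value: str, min_word_len: int = 3, prefix_len: int = 10) -> str:
    """Build one possible match key: drop short words, alphabetize the
    remaining words, then match on the first N characters."""
    words = [w for w in value.lower().split() if len(w) >= min_word_len]
    return " ".join(sorted(words))[:prefix_len]

# Word order and short filler words no longer block the match.
print(match_key("Widgets Co Acme"))  # 'acme widge'
print(match_key("Acme Widgets"))     # 'acme widge'
```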

Additionally, with large enough datasets, machine learning algorithms coupled with some of the techniques above can further automate the identification and consolidation of duplicate data.
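
As a rough sketch of how that might look, the example below trains a scikit-learn classifier on toy, hand-labeled similarity features for candidate pairs; the feature set and training data are entirely illustrative.

```python
from sklearn.ensemble import RandomForestClassifier  # pip install scikit-learn

# Each row holds similarity features for one candidate pair, e.g.
# [name_similarity, address_similarity, same_phonetic_code].
X_train = [[0.95, 0.90, 1],   # labeled duplicate
           [0.20, 0.10, 0],   # labeled non-duplicate
           [0.88, 0.75, 1],
           [0.35, 0.40, 0]]
y_train = [1, 0, 1, 0]        # 1 = duplicate, 0 = distinct

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Score a new candidate pair produced by the standardization rules above.
print(model.predict([[0.91, 0.80, 1]]))  # e.g. [1] -> likely duplicate
```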

If you have questions on data management, governance, migration, or just love data, let's connect.  My email is steve_novak@premierintl.com.