Understanding and Overcoming the Unavoidable Error and Variation in Identity Data
Published: January 1, 2001
Published in TDAN.com January 2001

Most people take their name for granted. Some of us like our names; others don't - either way we are stuck with them. We respond when our name is spoken (even when pronounced badly), dutifully write it down when requested (often not all that neatly), and recognize it when someone else writes it (even when the spelling is wrong). But how do computer systems cope with such error and variation? Specifically, how do they competently search and match this type of data?

To some extent, back office systems control this problem through the management of the people who enter or use the data. However, these systems are far removed from the data's real owners, such as you and me, and suffer from errors introduced by interpretations and assumptions. Users of front office systems can also be managed, with the added benefit that the data owner is at hand for confirmation. However, such systems often require speed, which is frequently at odds with quality. Web-based systems, on the other hand, have raised the problem to new heights. There is no management of the people who enter or use the data, except for feeble attempts to structure the way the data is entered. In addition, the users are far removed and want responses in real time.

This article exposes the problems that computer systems have with names and other identity data, how traditional systems are ineffective and how intelligent systems may overcome these limitations.

The Problem With 'Names'

In many systems, whether computerized or manual, it is important to find, match or group information that has been filed away using a person's or customer's name, account name, company or location name, address, file title, author's name, book title, etc. All such "names" are collections of words, numbers and codes used to label the original real world item. Such labels are chosen from a much larger and very different vocabulary than that of any meaningful language.
There are no dictionaries, spell checkers or rules governing the names given to addresses, people, places or things. For the sake of simplicity, when the word "name" is used in this article it should be taken to mean all of the above mentioned "labels".

Names, when spoken, written and especially when entered into a computer system, are subject to considerable variation and error. Although this variation and error can be reduced, it can't be entirely eliminated. Even if the data on file is absolutely correct, the search criterion comes from the real world and is subject to natural error and variation.

In addition to the words and codes in names, addresses, titles and descriptions, other data is frequently used to decide whether two reports or records are about the same identity. Data such as dates of birth, dates of contract, ages, phone numbers and identity numbers are all used, and all are subject to error and variation.

Examples of Variation

The variations that occur in names include spelling, typing and phonetic error; synonyms and nicknames; Anglicization and foreign versions of names; initials, truncation and abbreviation; prefix and suffix variations; compound names; account names; missing words, extra words and word sequence variations; as well as format, character and convention variations.

Apart from the natural error and variation that unavoidably occurs in all real world identification data, in many systems the objective is also to overcome fraudulent modification of identity data. This class of error, which does not occur naturally, is more aggressive because it is introduced to defeat or control aspects of matching systems while retaining the defense that it was a mistake rather than fraud.

The frequency distribution of names is also a concern when searching for a match from within a large population.
The vocabulary in use for people's first names includes in excess of 2,500,000 words in the USA alone, yet as much as 80% of the population may have names from as few as 500 words. Family names are just as unevenly distributed, causing searches for common names to take longer and requiring supplementary identification data to make the correct choice.

Name search and matching systems must work well at both ends of this extreme curve. They must perform efficiently for uncommon names as well as for very common names. This is a difficult challenge when a database of 100,000,000 people may contain 100,000 John Smiths or Mike Joneses, in addition to as many occurrences of a common address (such as 1 Main Street).

When people make choices about whether words match or not, they compensate for the error and variation. To confirm that records match requires that systems use the same data in the same manner as the human users. In fact, the system needs to mimic the very best users doing the same job.

Whether the process is an on-line inquiry (such as customer identification), a batch matching process (as might occur during the merging of marketing lists before a mailing), or a criminal record search, we must mimic the human expert in finding all candidate records. It is important that the system make the same matching choices as the human expert for any specific business purpose.

Overcoming error and variation increases the work the system does, and therefore the cost. In addition, the actual process of compensating for the error may introduce errors and false matches. Any solution to this unique data processing problem requires a balance between performance and quality, and between under-matching and over-matching.

Popular Yet Often Ineffective Techniques

Exact Name Searches

The use of exact name keys is very inefficient, leading to much of the duplication of records, accounts and customers in today's systems.
Finding an exact match does not mean the correct record has been located, nor is an exact match necessarily any better than one with some variation.

Searching With Wildcards

Wildcard searches do overcome some of the error and variation in the name, and for that reason are popular with users. Unfortunately, since wildcard searches often return too many irrelevant candidates, users don't realize the large amount of data they are actually missing. In reality, wildcard searches work only if the searcher guesses the correct character sequences to include or exclude - assuming there are no errors in the characters of the database being searched. Not only do these searches miss relevant records, they do not address nicknames and abbreviations, or the fact that different records have different types of errors.

Keying Partial Words To Save Time

Data entry time can be reduced by keying only partial search criteria - e.g. the first three characters of the first word followed by the first three characters of the next word. Performing searches with this type of criteria, however, makes it impossible to use techniques that overcome word variation and errors in a database. Soundex (an algorithm for encoding the last name), and other techniques that handle nicknames, Anglicization, translation or formal abbreviation, cannot be applied to partial words.

Text Retrieval Software and Name Search

The use of text retrieval packages for name search applications also misses data. Even when full text inversion indexes have phonetic algorithms or "expert" rule bases for name searches, the indexing mechanism is an inefficient process. Does it make sense to find all index values for the records containing John and then join them with those that contain Smith to discover the subset John Smith?
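The question above can be made concrete with a small sketch. The code below is only an illustration of a generic word-level inverted index, not of any particular text retrieval product; the function names and the sample records are invented for the example. It shows that the search reads the full list of records for each word before intersecting them, and that a spelling variant is never reached at all.

```python
from collections import defaultdict


def build_inverted_index(records):
    """Map each lower-cased word to the set of record ids that contain it."""
    index = defaultdict(set)
    for record_id, text in records.items():
        for word in text.lower().split():
            index[word].add(record_id)
    return index


def search_all_words(index, query):
    """Return ids of records containing every query word,
    together with the size of each word's list that had to be read."""
    postings = [index.get(word, set()) for word in query.lower().split()]
    sizes = [len(p) for p in postings]
    matches = set.intersection(*postings) if postings else set()
    return matches, sizes


# Hypothetical sample data, invented for illustration only.
records = {
    1: "John Smith",
    2: "John Brown",
    3: "Mary Smith",
    4: "Jon Smith",    # spelling variant of record 1
    5: "Smith John",   # same words, different order
    6: "John Jones",
}

index = build_inverted_index(records)
matches, sizes = search_all_words(index, "John Smith")
print(sorted(matches))  # [1, 5]
print(sizes)            # [4, 4]
```

In this toy data set, eight index entries are read to find two records, and record 4 ("Jon Smith") is missed. On a database where both words are very common, the lists to be joined would be correspondingly large.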
For text retrieval based systems to be successful they must recognize, find or discover the names from within the text and index them with the specialized techniques that are necessary for quality and performance in a name search system (as opposed to indexing the names in the same way that other words are indexed).

Match-Codes

A match-code is a key built from a combination of an identity's attributes. For example, a key built from State code + Surname Soundex code + Birth Date is a match-code. Match-codes require that each attribute first be strictly formatted into its pieces (e.g. that the position of the surname is known in the name); that all pieces used are in a "stable" order (e.g. that the birth date is always yyyy/mm/dd, or whatever); and that there are no errors in the pieces used.

Of course, attributes such as Sex, Birth Date, State, Postal or Zip code have a stable, known set of valid formats and values and can be accurately edited and validated. However, the fact that they are valid does not mean that they are true, accurate or consistent (e.g. a birth date can be a valid date without being the correct date for a certain identity). Mathematically, such data can be precise but not necessarily accurate. Typically, match-codes find correct records but frequently miss other equally good candidates.

Yesterday's Soundexes And Related Tools

In the early 1900s the Russell Soundex technique was developed to provide a stable manual filing code for the USA Census documents. The development of this algorithm for encoding a person's last name was based upon phonetics and certain classes of typical spelling and filing errors. This simple set of rules to convert a last name word into a four, five or six digit number had a high probability of producing the same code for two words that were variations of each other. Since then, many algorithms with similar objectives have been developed and modified.
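As a rough illustration of the two techniques just described, the sketch below encodes a surname with Soundex and then builds a match-code from State code + Surname Soundex code + Birth Date. It assumes the widely used variant of Soundex that keeps the first letter and appends three digits, which may differ from the variant the article has in mind; the function names, the key layout and the sample values are invented for the example.

```python
from datetime import date

# Letter groups of the common Soundex variant (an assumption of this sketch).
_GROUPS = {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R"}
_CODES = {letter: digit for digit, letters in _GROUPS.items() for letter in letters}


def soundex(name, length=4):
    """Encode a surname as its first letter followed by digits, padded with zeros."""
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    encoded = letters[0]
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters that share a code
        code = _CODES.get(ch, "")
        if code and code != previous:
            encoded += code
        previous = code  # a vowel resets this, so a repeated code is kept
    return (encoded + "0" * length)[:length]


def match_code(state, surname, birth_date):
    """Hypothetical match-code: State code + Surname Soundex + birth date as yyyymmdd."""
    return f"{state.upper()}{soundex(surname)}{birth_date.strftime('%Y%m%d')}"


print(soundex("Smith"), soundex("Smyth"))   # S530 S530

a = match_code("NY", "Smith", date(1960, 3, 12))
b = match_code("ny", "Smyth", date(1960, 3, 12))   # spelling variant
c = match_code("NY", "Smith", date(1960, 12, 3))   # day and month transposed
print(a)        # NYS53019600312
print(a == b)   # True
print(a == c)   # False
```

The spelling variant is found because Soundex stabilizes it, but the record with the transposed day and month produces a different key and is missed, even though a human reviewer would probably still consider it a candidate.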
In the 1960s, the New York State Identification Intelligence System (NYSIIS) project evaluated the popular algorithms. These included Soundex and many of its variants; LA County Sheriff Consonant coding; Phonic standard and extended; Michigan Lien; and several Extract list based systems. The end result of this project was a popular algorithm known as NYSIIS, which proved better optimized for the project's data at that point in time.

While such "stabilization" algorithms (similar sounding words are "stabilized" to the same encoding) can be a critical piece of a name search engine, the algorithms of the past are not sufficient for use on their own as database keys with today's data or volumes. In addition, purely "English" based algorithms are not suitable for non-English languages. Typically, these stabilization algorithms either cause too many incorrect records to be found, or miss too many relevant records.

Mimicking the Expert Users

The best solution will overcome the error and variation in the identity data while: a) maintaining acceptable performance and b) not missing candidates or generating too many false matches. Such a solution needs intelligent and scalable algorithms which, through the use of rich keys and search strategies, return all of the candidates an expert user would consider as being the same as the search data.

These algorithms must be able to cope with data from the real world. This includes data from different countries; data which is not formatted or cleaned, or not capable of being formatted or cleaned; data which contains noise characters, noise words, initials, abbreviations, nicknames and concatenations; and data which contains tokens in an unstable order. The algorithms need a customizable rule base to incorporate the knowledge of the expert user, and a default, pre-populated rule base for cases where the user is less experienced.
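A very small sketch can show what some of the requirements listed so far might look like in practice. The rule base, its contents and the `name_key` function below are all hypothetical, and far simpler than anything a production name search product would use; the sketch only illustrates noise removal, nickname and abbreviation rules, tolerance of token order, and a default rule base that a more expert user can extend.

```python
import re

# Hypothetical default rule base; a real one would be far larger
# and maintained by users who know the data.
DEFAULT_RULES = {
    "noise_words": {"mr", "mrs", "ms", "dr", "the"},
    "synonyms": {
        "bill": "william",
        "will": "william",
        "bob": "robert",
        "mike": "michael",
        "inc": "incorporated",
        "co": "company",
    },
}


def name_key(name, rules=DEFAULT_RULES):
    """Return an order-independent tuple of stabilized tokens for a name."""
    # Replace noise characters (punctuation and the like) with spaces.
    cleaned = re.sub(r"[^a-z0-9\s]", " ", name.lower())
    tokens = []
    for token in cleaned.split():
        if token in rules["noise_words"]:
            continue  # drop noise words such as titles
        tokens.append(rules["synonyms"].get(token, token))
    # Sorting makes the key insensitive to the order of the tokens.
    return tuple(sorted(tokens))


print(name_key("Mr. Bill SMITH"))    # ('smith', 'william')
print(name_key("Smith, William"))    # ('smith', 'william')

# An expert user extends the default rules without modifying them.
custom_rules = {
    "noise_words": DEFAULT_RULES["noise_words"],
    "synonyms": {**DEFAULT_RULES["synonyms"], "liam": "william"},
}
print(name_key("Liam Smith", custom_rules) == name_key("William Smith"))  # True
```

Note that this sketch does nothing about spelling or typing errors, so "Wiliam Smith" would still produce a different key.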
The algorithms require phonetic and orthographic correction functionality to address spelling and typing errors.

Intelligent matching routines must be available, and must be tunable so that they mimic the expert user choosing which candidates are the correct matches. Such matching routines need to take into account all of the error and variation in the identities' attributes, and to weight the attributes as the user would.

The algorithms must work well regardless of the country of origin and language of the data, and must insulate the application developer from the differences between countries and languages.

Conclusion

Increasing M&A activity, expanding databases, the need to limit risk and fraud, the current emphasis on the "customer" relationship, the proliferation of data available on or from the web, and the "thirst" for that data are all putting pressure on search and matching systems to come up with these capabilities. There are many general search engines around; hopefully this article has illustrated that identity search and matching is a non-trivial problem requiring a more intelligent and targeted solution.
Copyright 2000 - Search Software America, A division of SPL WorldGroup. All rights reserved.
Mike Dunkerley - Mike Dunkerley has over 20 years' experience in the IT industry serving in many capacities, including operations, programming, design, systems engineering, marketing, sales, and management. After
joining SSA, Mike established and managed SSA's UK office, where he was responsible for sales and support in the UK & European region. Mike returned to Australia in 1995 and until 2000 held
the role of SSA's International Product & Support Manager. In mid-2000, Mike accepted the position of VP, North America with management responsibilities for sales, support and marketing
operations in the USA, Canada and Brazil.
Geoff Holloway - Geoff Holloway is the Founder, President, and Director of Research and Development for Search Software America (SSA). He began his technology career with IBM in 1960 at their research laboratories in
England. In 1971, Geoff began working for Systems Programming Limited (SPL), where his executive experience originated as a result of his appointment to the Board of Directors in 1974. Three years
later Geoff moved to Australia to establish a new company for SPL, which he quickly expanded into New Zealand, Hong Kong, and Singapore. In 1986 Geoff relocated to the United States to market
research his proposal for a 'Name Search' business. The following year Search Software, Inc. was incorporated as a subsidiary of SPL Australia. Since that time Geoff has been concentrating on the
development of the SSA-NAME3 range of products to solve high volume name search, identification, and matching problems. Geoff holds an Honors Degree in Economics from the London School of Economics.