Matching Names for Data Mining

Commercial databases are compiled from multiple sources.  Using me as an example ...

I am listed in the telephone book in Manhattan (in New York City).  So here is what is known about me: my name, address and telephone number; the number of years that I have been listed (since 1991), the number of other listings in the same building (on the order of hundreds), and so on.

I have a driver's license.  So here is what is known about me: my name, address and driver's license number; date of birth, gender, eye color, height and corrective eyeglass requirements (if any).

How do the two pieces of information get linked together into one database record?  It can only be done by matching the name and address, since there is no universal identification number in the United States.  It turns out that matching by name and address is not an exact science.  It is estimated that the error rate (false positives and false negatives combined) may be on the order of 10% to 20%.  The following are hypothetical records from two databases that illustrate some of the problems.  Is it the same person?  It is not an exact match, but if you spend a few seconds thinking about it, you will probably agree with me that it is the same person and that the discrepancies reflect different conventions and requirements.

Telephone directory record:

J Smith
800 Fifth Avenue, Apt 66C
New York, NY    10021
Telephone number: 212-555-9999

Driver's license record:

Smith, John
800 5th Avenue, #66C
New York City, NY   10021-4923
Driver license number: xxxxxxxxxxx
Date of birth: February 30, 1976
Gender: Male
Eye color: Brown
Height: 6'0"
Corrective eyeglasses: None
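
To make the point concrete, here is a minimal Python sketch of how the two hypothetical records above could be reconciled by normalizing each field before comparing.  The abbreviation table, the function names and the rule that an initial is compatible with a full first name are all assumptions made for illustration; they are not taken from any actual matching program, which would carry far more rules.

```python
import re

# Hypothetical abbreviation table (an assumption for this sketch only).
STREET_WORDS = {
    "fifth": "5th",
    "avenue": "ave",
    "apt": "#",
    "apartment": "#",
}

def normalize_address(line):
    """Lowercase, drop punctuation, and map common variants to one form."""
    text = line.lower().replace("#", " # ")
    tokens = re.findall(r"[a-z0-9#]+", text)
    return " ".join(STREET_WORDS.get(t, t) for t in tokens)

def zip5(zipcode):
    """Keep the first five digits so 10021 and 10021-4923 compare equal."""
    return re.sub(r"\D", "", zipcode)[:5]

def parse_name(name):
    """Return (last, first) from 'First Last' or 'Last, First'."""
    if "," in name:
        last, first = [p.strip() for p in name.split(",", 1)]
    else:
        parts = name.split()
        first, last = " ".join(parts[:-1]), parts[-1]
    return last.lower(), first.lower()

def names_compatible(a, b):
    """Same last name, and one first name is a prefix of the other."""
    last_a, first_a = parse_name(a)
    last_b, first_b = parse_name(b)
    if last_a != last_b:
        return False
    # An initial such as 'J' is treated as compatible with 'John'.
    short, long_ = sorted([first_a, first_b], key=len)
    return long_.startswith(short)

def same_person(rec_a, rec_b):
    """Declare a match only if name, street and ZIP all agree after cleanup."""
    return (names_compatible(rec_a["name"], rec_b["name"])
            and normalize_address(rec_a["street"]) == normalize_address(rec_b["street"])
            and zip5(rec_a["zip"]) == zip5(rec_b["zip"]))

phone_book = {"name": "J Smith", "street": "800 Fifth Avenue, Apt 66C", "zip": "10021"}
license_rec = {"name": "Smith, John", "street": "800 5th Avenue, #66C", "zip": "10021-4923"}
print(same_person(phone_book, license_rec))
```

Note that the same rule that lets 'J' match 'John' would also let it match 'James' or 'Jane' at the same address, which is one way false positives creep in.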

Recently, I had occasion to examine two files of names and addresses and verify the matches between them.  I was told that these matches were found by a proprietary name-matching program.  Thus, I was looking only at the problem of false positives (the number of false negatives could not be assessed, since there are millions of records on the files that were being matched).  I found that 95% of the cases were 'exact' matches (i.e. same last name and ZIP code), but 5% showed some discrepancies.
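
As a rough sketch of this kind of audit, the Python below splits a list of already-matched record pairs into 'exact' matches and pairs that need a manual look.  The field names and the two sample pairs are hypothetical, and comparing only the first five ZIP digits is an assumption of this sketch.  Because it only inspects pairs the matcher has already linked, it can say something about false positives but nothing about false negatives.

```python
def is_exact(pair):
    """'Exact' in the loose sense used here: same last name and ZIP code."""
    a, b = pair
    return (a["last_name"].strip().lower() == b["last_name"].strip().lower()
            and a["zip"][:5] == b["zip"][:5])

def split_for_review(pairs):
    """Separate matched pairs into exact matches and ones to check by hand."""
    exact = [p for p in pairs if is_exact(p)]
    review = [p for p in pairs if not is_exact(p)]
    return exact, review

# Hypothetical matched pairs, as a matching program might emit them.
sample_pairs = [
    ({"last_name": "Smith", "zip": "10021"}, {"last_name": "SMITH", "zip": "10021-4923"}),
    ({"last_name": "Smith", "zip": "10021"}, {"last_name": "Smyth", "zip": "10021"}),
]

exact, review = split_for_review(sample_pairs)
print(f"{len(review)} of {len(sample_pairs)} pairs need manual review")
```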

Here is what I found by examining those 5% of the cases in detail:

These examples illustrate the fact that matching is an inexact science.  If you look inside a matching program, you will see thousands and thousands of expert rules.  Some of these rules may make no intuitive sense, but they were empirically derived based upon decades of experience (e.g. data entry workers easily confuse 'n' with 'v', or 'i' with 'l').  But the expert rules are fallible, and there will always be false positives and false negatives.  Unfortunately, there may be bad consequences for the citizens, and the database providers, compilers and resellers don't seem to care much about their mistakes and the subsequent impact.
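
A single rule of this kind can be sketched in a few lines of Python.  The sketch encodes only the two keying confusions mentioned above ('n' with 'v', 'i' with 'l'); the table and function names are assumptions for illustration, and a real program would hold thousands of such rules.  It also shows how a rule that rescues genuine typing errors can link two different names.

```python
# Only the two confusions named in the text; each pair collapses to one letter.
CONFUSABLE = str.maketrans({"v": "n", "l": "i"})

def canonical(name):
    """Lowercase the name and collapse easily confused letters."""
    return name.lower().translate(CONFUSABLE)

def possibly_same(a, b):
    """True if the names are identical once confusable letters are merged."""
    return canonical(a) == canonical(b)

print(possibly_same("Miller", "Miiler"))  # a plausible keying error is caught
print(possibly_same("Smith", "Smyth"))    # not covered by this rule
print(possibly_same("Levin", "Lenin"))    # different names, yet the rule links them
```

The last case is a false positive produced by the rule itself: 'Levin' and 'Lenin' are distinct surnames, but they collapse to the same canonical form.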

In the article "Bad Data Fouls Background Checks" by Kim Zetter in Wired (March 11, 2005), there are a couple of examples of bad data arising from mismatched records:

When Kenneth Schustereit was 18 years old, he tried to swipe a pile of what he thought was scrap metal from a machine shop's parking lot and ended up spending part of his summer vacation in jail for misdemeanor theft.  That was in 1974. Thirty years later, Schustereit is still paying for his crime.  That's because a background check of his criminal record sold to employers by ChoicePoint data brokers erroneously reported that his misdemeanor was a felony. It also stated that he spent seven years in prison when he spent 51 days in county jail.

Schustereit discovered the mistake only after Home Depot turned him down for a job last year and mentioned the report. He thinks the report cost him half a dozen other jobs as well, although he doesn't know for sure, since most employers don't tell job applicants why they've been rejected.  "I have a stellar work record," said Schustereit, who was laid off nine months ago as a quality-assurance inspector at a Texas plant. "But the problem is that I write down a 30-year-old misdemeanor on the application, and when they look it up, it comes up as a felony. It makes me look like a lying convict." 

After being laid off from his job, he applied for work in Home Depot's electrical department. He'd passed a drug test and psychological review and had even discussed salary and working hours with the company. But then Home Depot told him his background didn't check out.

It took several calls to ChoicePoint and Home Depot's headquarters before Schustereit discovered that ChoicePoint had listed him as a felon. The company's report also listed his middle name as Dale instead of Don, which suggested that the company might have confused him with someone else.

Back in 1974, Schustereit was originally charged with third-degree felony theft. But in a deal with authorities, he pleaded guilty to a misdemeanor instead and was sentenced to 60 days in jail. He was released early for good behavior. But the ChoicePoint report failed to note either of these significant details.

ChoicePoint blamed the Texas Department of Public Safety, where it said the incorrect felony information originated. The Texas DPS did admit to misidentifying Schustereit's offense, but not for turning his 60-day sentence into seven years. The department said ChoicePoint was responsible for that error.

Schustereit thinks the mistake is indicative of the sloppy work that data brokers do.  "It was incumbent on both the Texas DPS and ChoicePoint to find out if Kenneth Dale was different from Kenneth Don before ruining someone's life," he said.

Texas DPS spokeswoman Tena Mange said her department has quality-control procedures for information that it creates but has little control over the accuracy of electronic data that comes from courts and arresting authorities. And after information leaves the DPS office, the department has no control over how data brokers manipulate it.  Mange said her department always recommends that people counting on criminal background checks for hiring decisions conduct fingerprint matches instead of name matches, even though they're more expensive and take more time.

ChoicePoint declined to comment for this story and Home Depot did not return calls for comment.  After numerous phone calls and e-mails, ChoicePoint and the Texas DPS did fix Schustereit's record, although the damage was already done. And Schustereit has no idea how many other data brokers still list him as a felon. 


Ron Peterson's problem was even more pronounced than Schustereit's. A report from backgroundchecks.com attributed him with an array of serious criminal offenses he never committed.  "In Florida I'm a female prostitute (named Ronnie); in Texas I'm currently incarcerated for manslaughter," Peterson, a California resident, said. "In New Mexico I'm a dealer of stolen goods. Oregon has me as a witness tamperer. And in Nevada -- this is my favorite -- I'm a registered sex offender."

Peterson had to work hard to get his record cleaned up. He bought reports from ChoicePoint and backgroundchecks.com after State Farm denied him insurance last year. ChoicePoint got his middle name wrong and reported that there was a bench warrant for his arrest in Arizona.

Backgroundchecks.com -- which claims to have 4,000 customers worldwide, including Fortune 500 companies -- included information about all Ronald or Ronnie Petersons in its database, apparently making no attempt to distinguish relevant records from irrelevant ones, even when Peterson inserted different birth dates to see if the information would change. It didn't.

Backgroundchecks.com President Craig Kessler said there was little data brokers could do to distinguish the records of individuals sharing the same name.  "We're not in the business of authenticating the identity of individuals. All we do is report the data that's supplied to us from the courts," said Kessler. He said the problem stems from the fact that courts are doing away with using Social Security numbers that could help distinguish people with similar names.  "Sex-offender registries do not have anything other than a name in many cases," Kessler said. "We encourage companies to ask additional questions to help them confirm that this is the same person."

But Peterson, who believes that background reports contributed to his inability to get a good job offer for the last two years, said it's easier for employers to pass on candidates who have bad information associated with their name than to do the work to determine if the information is correct.  It took Peterson 40 hours and numerous phone calls to clear his identity in Arizona -- the bench warrant was for a different Ron Peterson -- and he was able to do so only after submitting his fingerprints.  "The victim is victimized by the system," Peterson said.