The challenges of, and why you should reconsider using, truth sets for optimizing entity resolution: A case study

Pei Wang, Daniel L. Pullen, Maryam Y. Garza, Meredith N. Zozus

Research output: Contribution to conference › Paper › peer-review

Abstract

The creation of high-quality Entity Resolution (ER) processes depends on the ability to quickly and effectively identify erroneous outcomes (false positives and false negatives) in ER results. In past and current research, truth sets have been used to provide this ability. Unfortunately, keeping the quantity of data given to reviewers for manual annotation manageable during truth set generation often forces researchers to work from sampled data that does not fully represent the variation contained in the original dataset. This often causes the match logic to overfit the truth set. This case study shows the challenges and issues that can arise when using truth sets for creating and analyzing ER matching logic.

Original language: English (US)
State: Published - 2017
Event: 22nd MIT International Conference on Information Quality, ICIQ 2017 - Little Rock, United States
Duration: Oct 6 2017 → Oct 7 2017

Conference

Conference: 22nd MIT International Conference on Information Quality, ICIQ 2017
Country: United States
City: Little Rock
Period: 10/6/17 → 10/7/17

Keywords

  • Boolean Match Rule
  • EHR Data
  • Entity Resolution
  • Truth Set

ASJC Scopus subject areas

  • Safety, Risk, Reliability and Quality
  • Information Systems
