Anomaly-Based Duplicate Detection: A Probabilistic Approach

A Obermeier - Extending the Boundaries of Design Science Theory …, 2019 - Springer
Abstract
The importance of identifying records in databases that refer to the same real-world entity (“duplicate detection”) has been recognized in both research and practice. However, existing supervised approaches for duplicate detection need training data with labeled instances of duplicates and non-duplicates, which is often costly and time-consuming to generate. In contrast, unsupervised approaches can forgo such training data but may rely on limiting assumptions (e.g., monotonicity) and provide less reliable results. To generate high-quality results using only easy-to-acquire duplicate-free training data, we propose a probabilistic approach for anomaly-based duplicate detection. Duplicates exhibit specific characteristics that differ significantly from those of non-duplicates and therefore represent anomalies. Based on the grade of anomaly compared to duplicate-free training data, our approach assigns a probability of being a duplicate to each analyzed pair of records while avoiding the limiting assumptions of existing approaches. We demonstrate the practical applicability and effectiveness of our approach in a real-world setting by analyzing customer master data of a German insurer. The evaluation shows that the results provided by the approach are reliable and useful for decision support and can outperform even fully supervised state-of-the-art approaches for duplicate detection.
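
The core idea of grading a record pair against duplicate-free training data can be illustrated with a minimal sketch. This is not the paper's probabilistic model, which the abstract does not specify; it only assumes that duplicates show up as unusually similar pairs and scores a pair by the share of duplicate-free training pairs that are less similar. The function names (`pair_similarity`, `fit_reference`, `anomaly_grade`), the mean string-similarity feature, and the sample records are all hypothetical.

```python
from bisect import bisect_left
from difflib import SequenceMatcher
from itertools import combinations


def pair_similarity(a, b):
    """Mean string similarity over the fields of two records (assumed feature)."""
    scores = [
        SequenceMatcher(None, x.lower(), y.lower()).ratio()
        for x, y in zip(a, b)
    ]
    return sum(scores) / len(scores)


def fit_reference(clean_records):
    """Sorted similarities of all record pairs in duplicate-free training data."""
    return sorted(
        pair_similarity(a, b) for a, b in combinations(clean_records, 2)
    )


def anomaly_grade(reference, a, b):
    """Share of duplicate-free pairs that are strictly less similar than (a, b).

    Values near 1.0 mean the pair is more similar than almost everything
    seen in the duplicate-free data, i.e. it looks anomalous.
    """
    s = pair_similarity(a, b)
    return bisect_left(reference, s) / len(reference)


# Hypothetical duplicate-free training records (first name, last name, city).
clean = [
    ("Anna", "Schmidt", "Berlin"),
    ("Jonas", "Weber", "Hamburg"),
    ("Lena", "Fischer", "Munich"),
    ("Paul", "Becker", "Cologne"),
]
reference = fit_reference(clean)

# A pair that differs by a single typo versus a pair of unrelated records.
suspect = anomaly_grade(reference, clean[0], ("Ana", "Schmidt", "Berlin"))
unrelated = anomaly_grade(reference, clean[0], clean[3])
```

An empirical rank of this kind is only an anomaly grade, not a calibrated duplicate probability; turning it into one would require the additional modeling that the paper itself proposes.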