Automatic Record Linkage using Seeded Nearest Neighbour and Support Vector Machine Classification

Christen, Peter

dc.contributor.author	Christen, Peter
dc.coverage.spatial	Las Vegas USA
dc.date.accessioned	2015-12-10T21:53:19Z
dc.date.created	August 24-27 2008
dc.date.issued	2008
dc.date.updated	2016-02-24T10:17:06Z
dc.description.abstract	The task of linking databases is an important step in an increasing number of data mining projects, because linked data can contain information that is not available otherwise, or that would require time-consuming and expensive collection of specific data. The aim of linking is to match and aggregate all records that refer to the same entity. One of the major challenges when linking large databases is the efficient and accurate classification of record pairs into matches and non-matches. While traditionally classification was based on manually-set thresholds or on statistical procedures, many of the more recently developed classification methods are based on supervised learning techniques. They therefore require training data, which is often not available in real-world situations or has to be prepared manually, an expensive, cumbersome and time-consuming process. The author has previously presented a novel two-step approach to automatic record pair classification [6, 7]. In the first step of this approach, training examples of high quality are automatically selected from the compared record pairs, and used in the second step to train a support vector machine (SVM) classifier. Initial experiments showed the feasibility of the approach, achieving results that outperformed k-means clustering. In this paper, two variations of this approach are presented. The first is based on a nearest-neighbour classifier, while the second improves an SVM classifier by iteratively adding more examples into the training sets. Experimental results show that this two-step approach can achieve better classification results than other unsupervised approaches.
dc.identifier.uri	http://hdl.handle.net/1885/38481
dc.publisher	Association for Computing Machinery Inc (ACM)
dc.relation.ispartofseries	ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2008)
dc.source	Proceedings of 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
dc.source.uri	http://www.sigkdd.org/kdd2008/
dc.subject	Keywords: Data linkage; Data matching; Deduplication; Entity resolution; Nearest neighbour; Classifiers; Education; Image retrieval; Knowledge management; Learning algorithms; Mining; Security of data; Supervised learning; Vectors; Support vector machines
dc.title	Automatic Record Linkage using Seeded Nearest Neighbour and Support Vector Machine Classification
dc.type	Conference paper
local.bibliographicCitation.lastpage	160
local.bibliographicCitation.startpage	151
local.contributor.affiliation	Christen, Peter, College of Engineering and Computer Science, ANU
local.contributor.authoruid	Christen, Peter, u4021539
local.description.embargo	2037-12-31
local.description.notes	Imported from ARIES
local.description.refereed	Yes
local.identifier.absfor	080109 - Pattern Recognition and Data Mining
local.identifier.ariespublication	U3594520xPUB162
local.identifier.doi	10.1145/1401890.1401913
local.identifier.scopusID	2-s2.0-65449139594
local.type.status	Published Version
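
The two-step idea summarised in the abstract (automatically select high-quality training examples, then classify the remaining record pairs) can be illustrated with a minimal sketch of a seeded nearest-neighbour variant. This is not the paper's implementation. It assumes each compared record pair is represented as a vector of similarity values in [0, 1], which the abstract does not state; the function names `select_seeds` and `classify_nearest_seed`, the `margin` threshold, and the sample vectors are all hypothetical and chosen only for illustration.

```python
import math


def select_seeds(vectors, margin=0.2):
    """Step 1 (assumed rule): treat vectors whose values are all close to 1
    as match seeds and vectors whose values are all close to 0 as
    non-match seeds. Everything else is left unlabelled."""
    match_seeds, non_match_seeds, rest = [], [], []
    for v in vectors:
        if all(x >= 1.0 - margin for x in v):
            match_seeds.append(v)
        elif all(x <= margin for x in v):
            non_match_seeds.append(v)
        else:
            rest.append(v)
    return match_seeds, non_match_seeds, rest


def classify_nearest_seed(vector, match_seeds, non_match_seeds):
    """Step 2 (assumed rule): label a vector by its nearest seed using
    Euclidean distance. Both seed lists must be non-empty."""
    d_match = min(math.dist(vector, s) for s in match_seeds)
    d_non_match = min(math.dist(vector, s) for s in non_match_seeds)
    return "match" if d_match < d_non_match else "non-match"


# Hypothetical comparison vectors (e.g. name, address, date-of-birth similarity).
vectors = [
    (1.0, 0.9, 1.0),  # very similar on every field
    (0.0, 0.1, 0.0),  # very dissimilar on every field
    (0.9, 0.6, 0.8),  # ambiguous
    (0.2, 0.4, 0.1),  # ambiguous
]

match_seeds, non_match_seeds, rest = select_seeds(vectors)
labels = {v: classify_nearest_seed(v, match_seeds, non_match_seeds) for v in rest}
```

In this sketch the seed sets stay fixed. The iterative variant mentioned in the abstract would instead move newly classified pairs into the training sets and retrain, which is not shown here.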

Downloads

Original bundle

Name: 01_Christen_Automatic_Record_Linkage_using_2008.pdf
Size: 53.28 KB
Format: Adobe Portable Document Format

Name: 02_Christen_Automatic_Record_Linkage_using_2008.pdf
Size: 357.23 KB
Format: Adobe Portable Document Format

Name: 03_Christen_Automatic_Record_Linkage_using_2008.pdf
Size: 99.56 KB
Format: Adobe Portable Document Format

Collections

ANU Research Publications