Automatic Record Linkage using Seeded Nearest Neighbour and Support Vector Machine Classification

dc.contributor.author: Christen, Peter
dc.coverage.spatial: Las Vegas USA
dc.date.accessioned: 2015-12-10T21:53:19Z
dc.date.created: August 24-27 2008
dc.date.issued: 2008
dc.date.updated: 2016-02-24T10:17:06Z
dc.description.abstract: The task of linking databases is an important step in an increasing number of data mining projects, because linked data can contain information that is not available otherwise, or that would require time-consuming and expensive collection of specific data. The aim of linking is to match and aggregate all records that refer to the same entity. One of the major challenges when linking large databases is the efficient and accurate classification of record pairs into matches and non-matches. While traditionally classification was based on manually set thresholds or on statistical procedures, many of the more recently developed classification methods are based on supervised learning techniques. They therefore require training data, which is often not available in real-world situations or has to be prepared manually, an expensive, cumbersome and time-consuming process. The author has previously presented a novel two-step approach to automatic record pair classification [6, 7]. In the first step of this approach, training examples of high quality are automatically selected from the compared record pairs, and used in the second step to train a support vector machine (SVM) classifier. Initial experiments showed the feasibility of the approach, achieving results that outperformed k-means clustering. In this paper, two variations of this approach are presented. The first is based on a nearest-neighbour classifier, while the second improves an SVM classifier by iteratively adding more examples into the training sets. Experimental results show that this two-step approach can achieve better classification results than other unsupervised approaches.
dc.identifier.uri: http://hdl.handle.net/1885/38481
dc.publisher: Association for Computing Machinery Inc (ACM)
dc.relation.ispartofseries: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2008)
dc.source: Proceedings of 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
dc.source.uri: http://www.sigkdd.org/kdd2008/
dc.subject: Keywords: Data linkage; Data matching; Deduplication; Entity resolution; Nearest neighbour; Classifiers; Education; Image retrieval; Knowledge management; Learning algorithms; Mining; Security of data; Supervised learning; Vectors; Support vector machines
dc.title: Automatic Record Linkage using Seeded Nearest Neighbour and Support Vector Machine Classification
dc.type: Conference paper
local.bibliographicCitation.lastpage: 160
local.bibliographicCitation.startpage: 151
local.contributor.affiliation: Christen, Peter, College of Engineering and Computer Science, ANU
local.contributor.authoruid: Christen, Peter, u4021539
local.description.embargo: 2037-12-31
local.description.notes: Imported from ARIES
local.description.refereed: Yes
local.identifier.absfor: 080109 - Pattern Recognition and Data Mining
local.identifier.ariespublication: U3594520xPUB162
local.identifier.doi: 10.1145/1401890.1401913
local.identifier.scopusID: 2-s2.0-65449139594
local.type.status: Published Version
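
The two-step idea summarised in the abstract (automatically select high-quality training examples, then classify the remaining record pairs) can be illustrated with a minimal Python sketch of a seeded nearest-neighbour classifier. This is not the author's implementation: the record does not give the paper's seed-selection rule or distance measure, so the Euclidean distance, the `threshold` parameter, the function names, and the toy similarity vectors below are all assumptions made for illustration.

```python
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two equal-length similarity vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_seeds(vectors, threshold=0.3):
    """Step 1 (assumed rule): vectors close to all-ones become match seeds,
    vectors close to all-zeros become non-match seeds."""
    dim = len(vectors[0])
    ones, zeros = (1.0,) * dim, (0.0,) * dim
    match_seeds = [v for v in vectors if euclidean(v, ones) <= threshold]
    non_match_seeds = [v for v in vectors if euclidean(v, zeros) <= threshold]
    return match_seeds, non_match_seeds


def classify_nearest_seed(vectors, match_seeds, non_match_seeds):
    """Step 2: label each vector by whichever seed set holds its nearest seed."""
    labels = []
    for v in vectors:
        d_match = min(euclidean(v, s) for s in match_seeds)
        d_non_match = min(euclidean(v, s) for s in non_match_seeds)
        labels.append("match" if d_match <= d_non_match else "non-match")
    return labels


# Hypothetical comparison vectors: one similarity value in [0, 1] per field.
pairs = [
    (1.0, 0.9, 1.0),
    (0.9, 1.0, 0.95),
    (0.0, 0.1, 0.0),
    (0.1, 0.0, 0.05),
    (0.8, 0.6, 0.7),   # ambiguous pair, not selected as a seed
    (0.2, 0.3, 0.1),   # ambiguous pair, not selected as a seed
]

match_seeds, non_match_seeds = select_seeds(pairs)
labels = classify_nearest_seed(pairs, match_seeds, non_match_seeds)
print(labels)
```

In this toy data the first four vectors are selected as seeds and the last two are labelled by their nearest seed. The second variation mentioned in the abstract would instead train an SVM on the seeds and iteratively add further examples to the training sets, which this sketch does not attempt.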

Downloads

Original bundle

Name: 01_Christen_Automatic_Record_Linkage_using_2008.pdf
Size: 53.28 KB
Format: Adobe Portable Document Format

Name: 02_Christen_Automatic_Record_Linkage_using_2008.pdf
Size: 357.23 KB
Format: Adobe Portable Document Format

Name: 03_Christen_Automatic_Record_Linkage_using_2008.pdf
Size: 99.56 KB
Format: Adobe Portable Document Format