Detecting vandalism on Wikipedia across multiple languages

Tran, Khoi-Nguyen Dao

dc.contributor.author	Tran, Khoi-Nguyen Dao
dc.date.accessioned	2015-07-27T01:47:32Z
dc.date.available	2015-07-27T01:47:32Z
dc.date.issued	2015
dc.description.abstract	Vandalism, the malicious modification or editing of articles, is a serious problem for free and open access online encyclopedias such as Wikipedia. Over the 13-year lifetime of Wikipedia, editors have identified and repaired vandalism in 1.6% of more than 500 million revisions of over 9 million English articles, but smaller manually inspected sets of revisions for research show vandalism may appear in 7% to 11% of all revisions of English Wikipedia articles. The persistent threat of vandalism has led to the development of automated programs (bots) and editing assistance programs to help editors detect and repair vandalism. Research into improving vandalism detection through the application of machine learning techniques has shown significant improvements in detection rates across a wider variety of vandalism. However, the focus of research is often only on the English Wikipedia, which has led us to develop a novel research area of cross-language vandalism detection (CLVD). CLVD provides a solution to detecting vandalism across several languages through the development of language-independent machine learning models. These models can identify undetected vandalism cases across languages that may have insufficient identified cases to build learning models. The two main challenges of CLVD are (1) identifying language-independent features of vandalism that are common to multiple languages, and (2) extensibility of vandalism detection models trained in one language to other languages without significant loss in detection rate. In addition, other important challenges of vandalism detection are (3) high detection rate of a variety of known vandalism types, (4) scalability to the size of Wikipedia in the number of revisions, and (5) ability to incorporate and generate multiple types of data that characterise vandalism. 
In this thesis, we present our research into CLVD on Wikipedia, where we identify gaps and problems in existing vandalism detection techniques. To begin our thesis, we introduce the problem of vandalism on Wikipedia with motivating examples, and then present a review of the literature. From this review, we identify and address the following research gaps. First, we propose techniques for summarising the user activity of articles and comparing the knowledge coverage of articles across languages. Second, we investigate CLVD using the metadata of article revisions together with article views to learn vandalism models and classify incoming revisions. Third, we propose new text features that are more suitable for CLVD than text features from the literature. Fourth, we propose a novel context-aware vandalism detection technique for sneaky types of vandalism that may not be detectable through constructing features. Finally, to show that our techniques of detecting malicious activities are not limited to Wikipedia, we apply our feature sets to detecting malicious attachments and URLs in spam emails. Overall, our ultimate aim is to build the next generation of vandalism detection bots that can learn and detect vandalism from multiple languages and extend their usefulness to other language editions of Wikipedia.	en_AU
dc.identifier.other	b37327884
dc.identifier.uri	http://hdl.handle.net/1885/14453
dc.language.iso	en	en_AU
dc.subject	Wikipedia	en_AU
dc.subject	vandalism	en_AU
dc.subject	sneaky vandalism	en_AU
dc.subject	detection	en_AU
dc.subject	cross-language learning	en_AU
dc.subject	machine learning	en_AU
dc.subject	feature engineering	en_AU
dc.subject	metadata	en_AU
dc.subject	text	en_AU
dc.subject	context-aware	en_AU
dc.subject	bots	en_AU
dc.subject	users	en_AU
dc.subject	editors	en_AU
dc.subject	English	en_AU
dc.subject	German	en_AU
dc.subject	Spanish	en_AU
dc.subject	French	en_AU
dc.subject	Russian	en_AU
dc.subject	spam emails	en_AU
dc.subject	malicious	en_AU
dc.subject	attachments	en_AU
dc.subject	URLs	en_AU
dc.title	Detecting vandalism on Wikipedia across multiple languages	en_AU
dc.type	Thesis (PhD)	en_AU
dcterms.valid	2015	en_AU
local.contributor.affiliation	Research School of Computer Science, The Australian National University	en_AU
local.contributor.supervisor	Christen, Peter
local.identifier.doi	10.25911/5d70eeb78a592
local.mintdoi	mint
local.type.degree	Doctor of Philosophy (PhD)	en_AU

Downloads

Original bundle

Name: Tran Thesis 2015.pdf
Size: 3.05 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 884 B
Format: Item-specific license agreed upon to submission

Collections

Open Access Theses