by Srijan Kumar, Robert West and Jure Leskovec
While information on the web has a tremendous positive effect on the lives of billions of people worldwide, false information can have a dangerous and harmful impact. Hoaxes are deliberately fabricated falsehoods made to masquerade as truth. In this work, we conduct a thorough study of all 20,000+ hoaxes created on Wikipedia throughout its history, and examine their impact, characteristics and detection.
Impact of Wikipedia Hoaxes:
We measure the impact of hoaxes by quantifying (i) how long they last, (ii) how much traffic they receive (shown on left), and (iii) how heavily they are cited on the Web.
We find that most hoaxes have negligible impact along all three of these dimensions, but that 1% of hoaxes survive for over a year, 1% receive significant attention (more than 100 pageviews a day) even before being uncovered, and a small number are heavily referenced within Wikipedia and across the web.
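The first two impact measures can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `HoaxRecord` type, its field names, and the four toy records are hypothetical and are not taken from the released dataset. The thresholds (one year, 100 pageviews a day) are the ones quoted above.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class HoaxRecord:
    # Hypothetical record layout; the real dataset may be organized differently.
    title: str
    created: datetime   # when the article was created
    flagged: datetime   # when it was flagged as a hoax
    pageviews: int      # total views between creation and flagging


def survival_days(h: HoaxRecord) -> float:
    """Time between creation and flagging, in days."""
    return (h.flagged - h.created).total_seconds() / 86400.0


def daily_pageviews(h: HoaxRecord) -> float:
    """Average views per day; very short-lived hoaxes are treated as one day."""
    return h.pageviews / max(survival_days(h), 1.0)


def fraction(records, predicate) -> float:
    """Share of records for which predicate(record) is true."""
    if not records:
        return 0.0
    return sum(1 for r in records if predicate(r)) / len(records)


# Made-up toy records, for illustration only.
records = [
    HoaxRecord("A", datetime(2010, 1, 1), datetime(2010, 1, 2), 5),
    HoaxRecord("B", datetime(2008, 1, 1), datetime(2010, 1, 1), 7310),
    HoaxRecord("C", datetime(2012, 3, 1), datetime(2012, 3, 1, 12), 3),
    HoaxRecord("D", datetime(2015, 6, 1), datetime(2015, 6, 11), 2000),
]

long_lived = fraction(records, lambda h: survival_days(h) > 365)
high_traffic = fraction(records, lambda h: daily_pageviews(h) > 100)
print(f"survive over a year: {long_lived:.0%}")
print(f"over 100 views/day:  {high_traffic:.0%}")
```

The third measure, citations from the wider Web, needs an external link index and is left out of this sketch.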
Characteristics of Wikipedia Hoaxes:
We find typical characteristics of hoaxes by comparing them to non-hoax articles.
We study the characteristics along four dimensions:
Detection of Wikipedia Hoaxes:
We build machine learning classifiers for various tasks, most notably to identify whether a given article is a hoax or not. Our algorithm has very high performance (shown on left). Simply training on appearance features does no better than random, but adding editor properties and link features boosts the performance. This means that faking the content of the article is easy, but faking its relation to other Wikipedia articles is not!
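One simple way to see what "no better than random" means is to score a single feature by its ROC AUC, where 0.5 corresponds to random guessing and 1.0 to perfect separation. The sketch below is not the classifier used in this work, and the model itself is skipped entirely. The feature names (`text_length`, `neg_inlinks`) and all values are synthetic assumptions, chosen only to illustrate the contrast between an uninformative feature and an informative one.

```python
def roc_auc(scores, labels):
    """Probability that a random hoax (label 1) gets a higher score than a
    random non-hoax (label 0); ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def per_feature_auc(features, labels):
    """Map each feature name to the AUC of using that feature alone as a score."""
    return {name: roc_auc(values, labels) for name, values in features.items()}


# Synthetic toy data (not results from the paper): 1 = hoax, 0 = non-hoax.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    # Hypothetical appearance feature: same distribution in both classes.
    "text_length": [100, 400, 200, 300, 100, 400, 200, 300],
    # Hypothetical link feature: negated incoming-link count, so that
    # higher scores mean "more hoax-like" in this toy setup.
    "neg_inlinks": [0, -1, 0, -2, -5, -9, -3, -12],
}

aucs = per_feature_auc(features, labels)
for name, auc in aucs.items():
    print(f"{name}: AUC = {auc:.2f}")
```

In practice the comparison would be made between classifiers trained on whole feature groups and evaluated on held-out articles; the single-feature AUC here only illustrates the evaluation idea.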
Download the PDF
Slides for the conference presentation at WWW 2016.
Abstract
Wikipedia is a major source of information for many people. However, false information on Wikipedia raises concerns about its credibility. One way in which false information may be presented on Wikipedia is in the form of hoax articles, i.e., articles containing fabricated facts about nonexistent entities or events. In this paper we study false information on Wikipedia by focusing on the hoax articles that have been created throughout its history. We make several contributions. First, we assess the real-world impact of hoax articles by measuring how long they survive before being debunked, how many pageviews they receive, and how heavily they are referred to by documents on the Web. We find that, while most hoaxes are detected quickly and have little impact on Wikipedia, a small number of hoaxes survive long and are well cited across the Web. Second, we characterize the nature of successful hoaxes by comparing them to legitimate articles and to failed hoaxes that were discovered shortly after being created. We find characteristic differences in terms of article structure and content, embeddedness into the rest of Wikipedia, and features of the editor who created the hoax. Third, we successfully apply our findings to address a series of classification tasks, most notably to determine whether a given article is a hoax. And finally, we describe and evaluate a task involving humans distinguishing hoaxes from non-hoaxes. We find that humans are not particularly good at the task and that our automated classifier outperforms them by a big margin.
Bibtex
@inproceedings{kumar2016disinformation,
  author = {Kumar, Srijan and West, Robert and Leskovec, Jure},
  title = {Disinformation on the Web: Impact, Characteristics, and Detection of Wikipedia Hoaxes},
  booktitle = {Proceedings of the 25th International World Wide Web Conference},
  year = {2016}
}
The publicly available hoax articles and similar non-hoax articles can be downloaded below.
Download
Newer hoaxes can be found at: Speedy Deletion Wikia and Deletionpedia.