A Machine Learning Analysis of Lexicon in Deceptive and Credible News
Qi Jia Sun
Burnaby Mountain Secondary
Floor Location : S 050 F

Fake news is a pervasive and continual crisis in the media. Wikipedia defines it as "a type of propaganda that spreads disinformation via print, broadcast news media, or online social media". Fake news is published with the intent to "mislead or damage" a person, company or political party, in order to gain financial or political success.

Over the past two years, fake news has attracted renewed attention because of its perceived impact on the 2016 U.S. presidential election. Mass communication via social media has magnified the influence of misinformation, and fake news takes advantage of social media’s extensive reach to manipulate public opinion on various matters.

Classifiers from previous research have not achieved high accuracy. The main gap in that research is that the data were not controlled and specific linguistic features were not examined closely. As a result, the existing work neither provides a deep understanding of fake news nor accounts for lurking variables.

Machine learning has proven successful at classifying spam emails. Moreover, the intentions behind spam emails and fake news are similar: both seek to manipulate the reader in some way. Spam emails and fake news articles also share many features: grammatical errors, little to no factual information, frequent use of adverbs, a limited lexicon, and an emotional tone. Thus, applying a method similar to spam classification to fake news could yield promising results.
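To make the spam-filter analogy concrete, the following is a minimal sketch of a word-count classifier of the kind commonly used for spam filtering: multinomial naive Bayes with add-one smoothing. The choice of naive Bayes is an assumption made for illustration, not a statement of the classifier used in this project, and the four training sentences and the class name `NaiveBayesTextClassifier` are invented for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter(labels)        # label -> number of documents
        self.vocabulary = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def log_scores(self, text):
        """Return the unnormalized log-probability of each label."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            score = math.log(n_docs / total_docs)  # class prior
            for token in tokens:
                # Smoothed likelihood of the word given the label.
                score += math.log((counts[token] + 1) / (total_words + vocab_size))
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy training data, invented for illustration only (not the project dataset).
train_texts = [
    "shocking secret they absolutely do not want you to know",
    "you will never believe this totally outrageous scandal",
    "the city council approved the budget on tuesday",
    "researchers published the study in a peer reviewed journal",
]
train_labels = ["deceptive", "deceptive", "credible", "credible"]

model = NaiveBayesTextClassifier().fit(train_texts, train_labels)
prediction_a = model.predict("an absolutely shocking scandal")
prediction_b = model.predict("the council published the budget")
print(prediction_a, prediction_b)
```

With only four training sentences the output says nothing about real articles; the sketch is meant to show how per-class word counts translate into a classification decision.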

This project seeks a machine learning classifier that can determine the validity of news based on word distributions and on specific linguistic and stylistic differences. The classifier also attempts to reach a decision from as few words of an article as possible. If such a classifier exists, the linguistic features it relies on could help readers manually judge the validity of an article by looking for those features in the first few sentences.
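As a rough illustration of classifying from a limited number of words, the sketch below truncates an article to its first few word tokens and computes two simple features from them: the relative frequency of each word (a word distribution) and the type-token ratio, used here as an assumed proxy for how limited the lexicon is. The cutoff of 8 words, the function names, and the sample sentence are hypothetical choices for this example, not values taken from the project.

```python
import re
from collections import Counter


def first_words(text, limit):
    """Return the first `limit` lowercase word tokens of the text."""
    return re.findall(r"[a-z']+", text.lower())[:limit]


def word_distribution(tokens):
    """Map each word to its relative frequency among the tokens."""
    if not tokens:
        return {}
    counts = Counter(tokens)
    return {word: count / len(tokens) for word, count in counts.items()}


def lexical_diversity(tokens):
    """Type-token ratio: distinct words divided by total words."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# Hypothetical opening sentence, invented for this illustration.
article = "Very very shocking news: experts are very shocked by the news today."
tokens = first_words(article, limit=8)  # arbitrary cutoff, an assumption
distribution = word_distribution(tokens)
diversity = lexical_diversity(tokens)
print(tokens, distribution["very"], diversity)
```

Features like these could be fed to any classifier; repeating the computation for several cutoffs would show how accuracy changes as fewer words are used.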

This project therefore aims not only to find a classifier for automated detection of fake news, but also to establish an understanding of how fake and credible news differ. A dataset of 2,107 articles from 30 different domains was collected and used to examine the differences in the features of these articles in order to improve the accuracy of the classifier. Detecting fake news is a difficult task because fake news attempts to resemble credible news. Nevertheless, a deeper understanding of the differences between deceptive and credible media will advance the collective effort against fake news.