6 million spam tweets: a large ground truth for timely Twitter spam detection


Author(s): Chen, Chao; Zhang, Jun; Chen, Xiao; Xiang, Yang; Zhou, Wanlei
Date(s)

01/01/2015

Abstract

In recent years, Twitter has changed the way people communicate and get news in their daily lives. Owing to its popularity, Twitter has also become a main target for spamming activities. To stop spammers, Twitter uses Google SafeBrowsing to detect and block spam links. Although blacklists can block malicious URLs embedded in tweets, their lag hinders their ability to protect users in real time. Researchers have therefore begun to apply machine learning algorithms to detect Twitter spam. However, because no large ground truth has been available, there has been no comprehensive evaluation of each algorithm's performance for real-time Twitter spam detection. To carry out a thorough evaluation, we collected a large dataset of over 600 million public tweets. We further labelled around 6.5 million spam tweets and extracted 12 lightweight features that can be used for online detection. In addition, we conducted a number of experiments on six machine learning algorithms under various conditions to better understand their effectiveness and weaknesses for timely Twitter spam detection. We will make our labelled dataset available to researchers who are interested in validating or extending our work.

Identifier

http://hdl.handle.net/10536/DRO/DU:30081434

Language(s)

eng

Publisher

IEEE

Relation

http://dro.deakin.edu.au/eserv/DU:30081434/chen-6millionspam-2015.pdf

http://dro.deakin.edu.au/eserv/DU:30081434/chen-6millionspam-evid-2015.pdf

http://www.dx.doi.org/10.1109/ICC.2015.7249453

Rights

2015, IEEE

Type

Conference Paper