Platform-Oblivious Anti-Spam Gateway
This paper presents a novel anti-spam gateway that targets multiple linguistic-based social platforms and uniformly exposes the outlier property of their spam messages for effective detection. Instead of labeling ground-truth datasets and extracting key features, which is labor-intensive and time-consuming, we start by coarsely mining seed corpora of spam and ham messages from the target data (the data to be classified for spam), and then reconstruct them to serve as the reference. To capture each word's rich semantic and syntactic information, we leverage a natural language processing (NLP) model to embed each word into a high-dimensional vector space and use a neural network to train a spam word model. Each message is then encoded using the spam scores this model predicts for all of the stem words it contains. The encoded messages are processed by prominent outlier detection techniques to produce their respective scores, allowing us to rank them so that the outliers become visible. Our solution is unsupervised and does not rely on the specifics of any platform or dataset, which makes it platform-oblivious. Through extensive experiments, our solution is demonstrated to expose spammers' outlier characteristics effectively, to outperform all examined unsupervised methods in almost all metrics, and in some cases even to surpass supervised counterparts.
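The encode-then-rank idea described above can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify the encoding layout or name the outlier techniques, so the fixed-size histogram encoding, the k-nearest-neighbor distance score, the hand-picked `WORD_SPAM_SCORE` table (a stand-in for the trained spam word model), and all function names here are assumptions made for illustration only.

```python
import math

# Hypothetical per-word spam scores in [0, 1]. In the paper these would come
# from the trained spam word model; here they are hand-picked stand-ins.
WORD_SPAM_SCORE = {
    "free": 0.95, "winner": 0.9, "click": 0.85, "prize": 0.9,
    "meeting": 0.05, "lunch": 0.1, "thanks": 0.05, "tomorrow": 0.1,
    "project": 0.05, "report": 0.1,
}
DEFAULT_SCORE = 0.5  # assumption: words missing from the table are neutral


def encode(message, bins=4):
    """Encode a message as a normalized histogram of its words' spam scores."""
    hist = [0.0] * bins
    for word in message.lower().split():
        score = WORD_SPAM_SCORE.get(word, DEFAULT_SCORE)
        # Map the score to a bin; a score of exactly 1.0 goes in the last bin.
        hist[min(int(score * bins), bins - 1)] += 1.0
    total = sum(hist) or 1.0  # avoid dividing by zero for empty messages
    return [count / total for count in hist]


def knn_outlier_scores(vectors, k=2):
    """Score each vector by its mean Euclidean distance to its k nearest neighbors."""
    scores = []
    for i, v in enumerate(vectors):
        dists = sorted(math.dist(v, u) for j, u in enumerate(vectors) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


def rank_by_outlierness(messages, k=2):
    """Return (message, score) pairs sorted from most to least outlying."""
    vectors = [encode(m) for m in messages]
    scores = knn_outlier_scores(vectors, k)
    return sorted(zip(messages, scores), key=lambda pair: pair[1], reverse=True)
```

Calling `rank_by_outlierness` on a small batch in which ham dominates places the message whose words carry high spam scores at the top of the ranking, since its encoding lies far from the others. The sketch relies on spam being the minority in the batch; no labels are used at any step.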