878 results for Spam filtering
Abstract:
Treating e-mail filtering as a binary text classification problem, researchers have applied several statistical learning algorithms to e-mail corpora with promising results. This paper examines the performance of a Naive Bayes classifier using different approaches to feature selection and tokenization on different e-mail corpora.
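To make the setup in this abstract concrete, the following is a minimal sketch of a multinomial Naive Bayes spam classifier with add-one smoothing. It is not the paper's implementation: the `tokenize` function, the class name `NaiveBayes`, and the four toy training messages are all assumptions made for illustration, and the paper's actual tokenization and feature selection variants are not reproduced here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase runs of letters and apostrophes.
    # The paper compares several tokenization schemes; this is only one choice.
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter()
        self.vocab = set()

    def train(self, text, label):
        tokens = tokenize(text)
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def log_score(self, tokens, label):
        # log P(label) + sum of log P(token | label).
        # Assumes each label has been seen at least once in training.
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        denom = sum(self.word_counts[label].values()) + len(self.vocab)
        for token in tokens:
            score += math.log((self.word_counts[label][token] + 1) / denom)
        return score

    def classify(self, text):
        tokens = tokenize(text)
        return max(("spam", "ham"), key=lambda lbl: self.log_score(tokens, lbl))


# Toy training data (invented for this sketch, not from any corpus).
nb = NaiveBayes()
nb.train("cheap pills buy now", "spam")
nb.train("win money now", "spam")
nb.train("meeting agenda for monday", "ham")
nb.train("lunch on monday", "ham")
```

Swapping in a different `tokenize` function is the simplest way to experiment with the kind of tokenization comparison the abstract describes.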
Abstract:
The presence of spam in a document ranking is a major issue for Web search engines. Common approaches to coping with spam remove from the document rankings those pages that are likely to contain spam. These approaches are implemented as post-retrieval processes that filter out spam pages only after documents have been retrieved in response to a user’s query. In this paper we propose removing spam pages at indexing time, thereby obtaining a pruned index that is virtually “spam-free”. We investigate the benefits of this approach from three points of view: indexing time, index size, and retrieval performance. Not surprisingly, we found that the strategy decreases both the time required by the indexing process and the space required to store the index. More surprisingly, we found that a spam-pruned version of a collection’s index yields no difference in retrieval performance compared with traditional post-retrieval spam filtering approaches.
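The contrast between post-retrieval filtering and index-time pruning can be sketched with a toy inverted index. This is an illustrative sketch only: the functions `build_index` and `search`, the `spam_scores` mapping, and the 0.5 threshold are assumptions, not details taken from the paper.

```python
from collections import defaultdict


def build_index(docs, spam_scores=None, threshold=0.5):
    """Build a term -> set(doc_id) inverted index.

    If spam_scores is given, documents scoring at or above the threshold
    are skipped entirely (index-time pruning).
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        if spam_scores is not None and spam_scores.get(doc_id, 0.0) >= threshold:
            continue  # pruned: never enters the index
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query, spam_scores=None, threshold=0.5):
    """Boolean AND retrieval; optionally filter spam after retrieval."""
    terms = query.lower().split()
    if not terms:
        return set()
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    if spam_scores is not None:
        # Post-retrieval filtering: drop likely spam from the result set.
        hits = {d for d in hits if spam_scores.get(d, 0.0) < threshold}
    return hits


# Toy collection and hypothetical spam scores (invented for this sketch).
docs = {
    1: "cheap flights to rome",
    2: "cheap cheap pills cheap",
    3: "rome travel guide",
}
spam_scores = {2: 0.9}

full_index = build_index(docs)
pruned_index = build_index(docs, spam_scores)

post_filtered = search(full_index, "cheap", spam_scores)
index_pruned = search(pruned_index, "cheap")
```

In this toy case both strategies return the same documents, while the pruned index stores fewer postings, which mirrors the trade-off the abstract reports.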
Abstract:
A Billing Mediation Platform (BMP) in the telecommunications industry processes real-time streams of Call Detail Records (CDRs), which can arrive in massive numbers each day. The records generated by a BMP can be used for billing, fraud detection, spam filtering, traffic analysis, and churn forecasting. Several of these applications require real-time, low-latency analysis of CDRs. Testing such a platform involves diverse aspects, such as stress testing of analytics for scalability and what-if scenarios, which require generating CDRs with realistic volumes and appropriate properties. The approach of this project is to build a user-friendly and flexible application that helps the development department test its billing solution occasionally. Such generator projects have been around for a while; they differ only in the portions they cover and the purposes they serve. This paper proposes using a simulator application to test BMPs with simulated CDRs. The simulated CDRs can be modified according to user requirements and represent real-world data.
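As a rough illustration of what a configurable CDR simulator might look like, here is a minimal sketch. Every field name (`caller`, `callee`, `start`, `duration_s`, `type`), the exponential arrival and duration models, and the `generate_cdrs` function itself are assumptions for illustration; the abstract does not specify the record layout or the statistical properties used.

```python
import random
from datetime import datetime, timedelta


def generate_cdrs(n, start, seed=None, mean_gap_s=1.0, mean_duration_s=90.0):
    """Generate n synthetic call detail records (hypothetical schema).

    Inter-arrival gaps and call durations are drawn from exponential
    distributions; both means are user-adjustable parameters.
    """
    rng = random.Random(seed)  # seeded for reproducible test runs
    records = []
    t = start
    for _ in range(n):
        t += timedelta(seconds=rng.expovariate(1.0 / mean_gap_s))
        records.append({
            "caller": f"555{rng.randrange(10**7):07d}",
            "callee": f"555{rng.randrange(10**7):07d}",
            "start": t,
            "duration_s": int(rng.expovariate(1.0 / mean_duration_s)) + 1,
            "type": rng.choice(["voice", "sms", "data"]),
        })
    return records


cdrs = generate_cdrs(1000, datetime(2024, 1, 1), seed=42)
```

Fixing the seed makes a test run repeatable, and raising `n` or lowering `mean_gap_s` gives a crude way to approximate the stress-test volumes mentioned above.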
Abstract:
Consider the following problem: given sets of unlabeled observations, each set with known label proportions, predict the labels of another set of observations, also with known label proportions. This problem appears in areas like e-commerce, spam filtering and improper content detection. We present consistent estimators which can reconstruct the correct labels with high probability in a uniform convergence sense. Experiments show that our method works well in practice. Copyright 2008 by the author(s)/owner(s).
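Abstract:
Electronic mail (E-Mail) is currently the most widely used Internet application. As the Internet grows at an astonishing pace, e-mail has become an important channel for distributing malicious content, and spam has become the most serious blight on the Internet. Because spamming techniques are so varied, spam filtering systems usually need to combine several filtering techniques to remain effective. Among these, digest techniques have become one of the important spam filtering methods: a digest is used to judge how similar a message is to known spam, and the message is classified accordingly. Judging whether a spam filtering algorithm is effective requires weighing its recall, precision, and running time together. The I-Match algorithm decides whether two message texts are similar by exact matching of digest values, and it performs particularly well in terms of efficiency. In practical use, however, I-Match still has many problems, including a lexicon generation step that constrains the algorithm's performance and insufficient robustness under attack. Optimizing lexicon generation and improving robustness are therefore the two key issues in applying the algorithm to real systems. The main contributions of this work are as follows. First, a similarity analysis of spam, covering the causes of spam similarity and the similarity characteristics that spam exhibits in both time and content; these characteristics are one of the necessary conditions for digest-based spam filtering. Second, an improved lexicon generation process for I-Match: the mutual information of features is proposed as the feature selection criterion, and the effects of several feature selection methods on the algorithm's performance are compared. Third, an analysis of the robustness of I-Match and of the robustness gains offered by several improved I-Match variants; the improved algorithms are evaluated on a real e-mail corpus, and the practicality of each is analyzed. Fourth, a prototype of the KSpam system, which integrates multiple e-mail filtering methods as plug-ins, together with an implementation scheme for I-Match within KSpam. The system also implements a new automatic e-mail collection function that effectively reduces the corpus collection workload of e-mail administrators.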
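The core idea of I-Match, as summarized in this abstract, is to reduce a message to a single digest and compare digests by exact match. A minimal sketch follows, under stated assumptions: the function name `imatch_signature`, the regex tokenizer, the use of SHA-1, and the hand-picked lexicon are all illustrative choices. In particular, the lexicon here is invented rather than built by the mutual-information selection the abstract proposes.

```python
import hashlib
import re


def imatch_signature(text, lexicon):
    """Return an I-Match style digest of text, or None if nothing survives.

    Steps: tokenize, keep only unique tokens present in the lexicon,
    sort them, and hash the result. Two messages with the same set of
    lexicon terms get the same digest regardless of order or filler.
    """
    tokens = set(re.findall(r"\w+", text.lower()))
    kept = sorted(tokens & lexicon)
    if not kept:
        return None  # no lexicon terms: no meaningful signature
    return hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()


# Hypothetical lexicon and messages, invented for this sketch.
lexicon = {"viagra", "discount", "pharmacy", "order"}

msg_a = "Order VIAGRA now at our discount pharmacy!!!"
msg_b = "discount pharmacy: order viagra today, friend"
msg_c = "Order from our pharmacy"

sig_a = imatch_signature(msg_a, lexicon)
sig_b = imatch_signature(msg_b, lexicon)
sig_c = imatch_signature(msg_c, lexicon)
```

Here `msg_a` and `msg_b` share a digest because they contain the same lexicon terms, while `msg_c` does not. The same property shows why the choice of lexicon and robustness matter: adding or removing a single lexicon term in a message changes its digest completely.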
Abstract:
Graduate Program in Electrical Engineering - FEIS
Abstract:
[ES] Spam, unwanted e-mail sent in bulk, is a threat that affects e-mail and other means of electronic communication. Its high volume of circulation causes considerable losses of time and money. A solution to this problem is presented: a hybrid intelligent anti-spam filtering system based on unsupervised artificial neural networks (ANNs). It consists of a preprocessing stage and a processing stage, based on different computing models: programmed (with two phases, manual and computational) and neural (using Kohonen self-organizing maps, SOM), respectively. The system was optimized using, as its data corpus, ham from “Enron Email” and spam from two different sources. Its quality and performance are analyzed using several metrics.
Abstract:
In ubiquitous data stream mining applications, different devices often aim to learn concepts that are similar to some extent. In these applications, such as spam filtering or news recommendation, the concept underlying the data stream (e.g., interesting mail/news) is likely to change over time. Therefore, the resulting model must be continuously adapted to such changes. This paper presents a novel Collaborative Data Stream Mining (Coll-Stream) approach that exploits similarities in the knowledge available from other devices to improve local classification accuracy. Coll-Stream integrates the community knowledge using an ensemble method in which the classifiers are selected and weighted based on their local accuracy for different partitions of the feature space. We evaluate Coll-Stream's classification accuracy under varying concept drift, noise, partition granularity, and concept similarity relative to the local underlying concept. The experimental results show that the resulting Coll-Stream model achieves stability and accuracy in a variety of situations using both synthetic and real-world datasets.
Abstract:
E-mail spam remains a scourge for users and for Internet and network service operators and providers, in spite of the anti-spam techniques available. Spammers relentlessly circumvent these techniques, which are embedded or installed as software products on both the client and server sides of fixed and mobile devices. This continuous evasion degrades the capabilities of anti-spam techniques, as none of them provides a comprehensive, reliable solution to the problem posed by spam and spammers. A major problem arises, for instance, when these techniques misclassify legitimate e-mails as spam (false positive) or fail to block spam on the SMTP server (false negative). In the latter case the spam passes on to the receiver, yet the originating server neither notices nor has an automatic alert service to indicate that the spam it was designed to stop has slipped through to the receiver’s SMTP server. The receiver’s SMTP server in turn fails to stop the spam from reaching the user’s device and likewise has no automatic alert mechanism to report this failure. The result is a staggering cost in time, effort, and money. This paper presents a comparative literature overview of some of these anti-spam techniques, especially the filtering technologies designed to prevent spam, together with their merits and demerits, with a view to enhancing their capabilities, and offers evaluative recommendations that will be subject to further research.
Abstract:
We propose an economic mechanism to reduce the incidence of malware that delivers spam. Earlier research proposed attention markets as a solution for unwanted messages, and showed they could provide more net benefit than alternatives such as filtering and taxes. Because it uses a currency system, Attention Bonds faces a challenge. Zombies, botnets, and various forms of malware might steal valuable currency instead of stealing unused CPU cycles. We resolve this problem by taking advantage of the fact that the spam-bot problem has been reduced to financial fraud. As such, the large body of existing work in that realm can be brought to bear. By drawing an analogy between sending and spending, we show how a market mechanism can detect and prevent spam malware. We prove that by using a currency (i) each instance of spam increases the probability of detecting infections, and (ii) the value of eradicating infections can justify insuring users against fraud. This approach attacks spam at the source, a virtue missing from filters that attack spam at the destination. Additionally, the exchange of currency provides signals of interest that can improve the targeting of ads. ISPs benefit from data management services and consumers benefit from the higher average value of messages they receive. We explore these and other secondary effects of attention markets, and find them to offer, on the whole, attractive economic benefits for all – including consumers, advertisers, and the ISPs.