424 results for Blog datasets
Abstract:
Generating discriminative input features is a key requirement for achieving highly accurate classifiers. The process of generating features from raw data is known as feature engineering, and it can take significant manual effort. In this paper we propose automated feature engineering to derive a suite of additional features from a given set of basic features, with the aim of both improving classifier accuracy through discriminative features and assisting data scientists through automation. Our implementation is specific to HTTP computer network traffic. To measure the effectiveness of our proposal, we compare the performance of a supervised machine learning classifier built with automated feature engineering against one built with human-guided features. The classifier addresses a problem in computer network security, namely the detection of HTTP tunnels. We use Bro to process network traffic into base features and then apply automated feature engineering to calculate a larger set of derived features. The derived features are calculated without favour to any base feature and include entropy, length and N-grams for all string features, and counts and averages over time for all numeric features. Feature selection is then used to find the most relevant subset of these features. Testing showed that both classifiers achieved a detection rate above 99.93% at a false positive rate below 0.01%. For our datasets, we conclude that automated feature engineering can increase the speed of classifier development and reduce its technical difficulty by removing manual feature engineering, while maintaining classification accuracy.
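The string-feature derivations named in this abstract (entropy, length and N-grams) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the choice of character-level N-grams, and the feature-naming scheme are assumptions made for illustration only.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def char_ngrams(text: str, n: int = 2) -> Counter:
    """Counts of overlapping character n-grams (character level is an assumption)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def derive_string_features(name: str, value: str, n: int = 2) -> dict:
    """Apply the same derivations to any string feature, whatever its name.

    The feature-naming scheme here is hypothetical.
    """
    features = {
        f"{name}_length": len(value),
        f"{name}_entropy": shannon_entropy(value),
    }
    for gram, count in char_ngrams(value, n).items():
        features[f"{name}_{n}gram_{gram}"] = count
    return features
```

Calling `derive_string_features` on every string-valued base feature, rather than on a hand-picked few, reflects the "without favour to any base feature" idea; a feature-selection step would then prune the result.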
Abstract:
Twitter’s hashtag functionality is now used for a very wide variety of purposes, from covering crises and other breaking news events through gathering an instant community around shared media texts (such as sporting events and TV broadcasts) to signalling emotive states from amusement to despair. These divergent uses of the hashtag are increasingly recognised in the literature, with attention paid especially to the ability of hashtags to facilitate the creation of ad hoc or hashtag publics. A more comprehensive understanding of these different uses of hashtags has yet to be developed, however. Previous research has explored the potential for a systematic analysis of the quantitative metrics that could be generated from processing a series of hashtag datasets. Such research found, for example, that crisis-related hashtags exhibited a significantly larger incidence of retweets and tweets containing URLs than hashtags relating to televised events, and on this basis hypothesised that the information-seeking and -sharing behaviours of Twitter users in such different contexts were substantially divergent. This article updates such studies and their methodology by examining the communicative metrics of a considerably larger and more diverse number of hashtag datasets, compiled over the past five years. This provides an opportunity both to confirm earlier findings and to explore whether hashtag use practices may have shifted subsequently as Twitter’s userbase has developed further; it also enables the identification of further hashtag types beyond the “crisis” and “mainstream media event” types outlined to date. The article also explores the presence of such patterns beyond recognised hashtags, by incorporating an analysis of a number of keyword-based datasets.
This large-scale, comparative approach contributes towards the establishment of a more comprehensive typology of hashtags and their publics, and the metrics it describes can also be used to classify new hashtags emerging in the future. In turn, this may enable researchers to develop systems for automatically sorting newly trending topics into a number of event types, which may be useful, for example, for the automatic detection of acute crises and other breaking news events.
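As a rough illustration of the kind of communicative metrics mentioned in this abstract (the incidence of retweets and of tweets containing URLs), the following Python sketch computes both shares for a list of tweet texts. It is an assumption-laden simplification, not the study's methodology: retweets are detected only by a leading `RT @`, URLs by a simple regular expression, and the function name is hypothetical.

```python
import re

# Simplified URL detector (assumption): anything starting with http:// or https://
URL_PATTERN = re.compile(r"https?://\S+")


def hashtag_metrics(tweets: list[str]) -> dict:
    """Share of retweets and of URL-bearing tweets in one hashtag dataset."""
    total = len(tweets)
    if total == 0:
        return {"tweets": 0, "retweet_share": 0.0, "url_share": 0.0}
    # Assumption: only old-style "RT @user" retweets are counted.
    retweets = sum(1 for text in tweets if text.startswith("RT @"))
    with_urls = sum(1 for text in tweets if URL_PATTERN.search(text))
    return {
        "tweets": total,
        "retweet_share": retweets / total,
        "url_share": with_urls / total,
    }
```

Computing the same dictionary for each hashtag dataset would give comparable figures that could then be contrasted across hashtag types.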
Abstract:
The increased availability of image capturing devices has enabled collections of digital images to rapidly expand in both size and diversity. This has created a constantly growing need for efficient and effective image browsing, searching, and retrieval tools. Pseudo-relevance feedback (PRF) has proven to be an effective mechanism for improving retrieval accuracy. An original, simple yet effective rank-based PRF mechanism (RB-PRF) that takes into account the initial rank order of each image to improve retrieval accuracy is proposed. The RB-PRF mechanism uses binary image signatures to improve retrieval precision by promoting images similar to highly ranked images and demoting images similar to lower ranked images. Empirical evaluations on standard benchmarks, namely the Wang, Oliva & Torralba, and Corel datasets, demonstrate the effectiveness of the proposed RB-PRF mechanism in image retrieval.
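The general idea of rank-aware re-ranking over binary signatures can be sketched in Python as follows. The abstract does not give the RB-PRF scoring formula, so everything specific here is an assumption: signatures are stored as integers, similarity is one minus the normalised Hamming distance, the top and bottom `k` results serve as pseudo-positive and pseudo-negative examples, and the linear rank weights are hypothetical.

```python
def hamming_similarity(a: int, b: int, bits: int) -> float:
    """1.0 for identical signatures, 0.0 when every bit differs."""
    return 1.0 - bin(a ^ b).count("1") / bits


def rerank(ranked_ids: list, signatures: dict, bits: int = 64, k: int = 3) -> list:
    """Re-rank an initial result list (best first) using its own top and bottom.

    Assumes k is at most half the list length so the two groups do not overlap.
    """
    top = ranked_ids[:k]       # treated as pseudo-relevant
    bottom = ranked_ids[-k:]   # treated as pseudo-irrelevant
    scores = {}
    for img in ranked_ids:
        # Higher-ranked top images get larger weights (k, k-1, ..., 1).
        promote = sum(
            (k - r) * hamming_similarity(signatures[img], signatures[t], bits)
            for r, t in enumerate(top)
        )
        # Lower-ranked bottom images get larger weights (1, 2, ..., k).
        demote = sum(
            (r + 1) * hamming_similarity(signatures[img], signatures[b], bits)
            for r, b in enumerate(bottom)
        )
        scores[img] = promote - demote
    # sorted() is stable, so ties keep their initial order.
    return sorted(ranked_ids, key=lambda img: -scores[img])
```

With 8-bit signatures and `k=1`, an image whose signature is close to the top result moves up past one that resembles the last result, which is the promote-and-demote behaviour the abstract describes in words.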
Abstract:
Muscoidea is a significant dipteran clade that includes house flies (Family Muscidae), latrine flies (F. Fanniidae), dung flies (F. Scathophagidae) and root maggot flies (F. Anthomyiidae). It comprises approximately 7000 described species. The monophyly of the Muscoidea and the precise relationships of muscoids to the closest superfamily, the Oestroidea (blow flies, flesh flies, etc.), are both unresolved. Until now, mitochondrial (mt) genomes were available for only two of the four muscoid families, precluding a thorough test of phylogenetic relationships using this data source. Here we present the first two mt genomes for the families Fanniidae (Euryomma sp.) and Anthomyiidae (Delia platura (Meigen, 1826)). We also conducted phylogenetic analyses containing these newly sequenced mt genomes plus 15 other species representative of dipteran diversity to address the internal relationships of Muscoidea and its systematic position. Both maximum-likelihood and Bayesian analyses suggested that Muscoidea was not a monophyletic group, with the relationship (Fanniidae + Muscidae) + ((Anthomyiidae + Scathophagidae) + (Calliphoridae + Sarcophagidae)) supported by the majority of analysed datasets. This also implies that Oestroidea was paraphyletic in the majority of analyses. Divergence time estimation suggested that the earliest split within the Calyptratae, separating (Tachinidae + Oestridae) from the remaining families, occurred in the Early Eocene. The main divergence within the paraphyletic Muscoidea grade was between Fanniidae + Muscidae and the lineage ((Anthomyiidae + Scathophagidae) + (Calliphoridae + Sarcophagidae)), which occurred in the Late Eocene.