340 results for datasets


Abstract:

An application that translates raw thermal melt curve data into more easily assimilated knowledge is described. This program, called ‘Meltdown’, performs a number of data remediation steps before classifying melt curves and estimating melting temperatures. The final output is a report that summarizes the results of a differential scanning fluorimetry experiment. Meltdown uses a Bayesian classification scheme, enabling reproducible identification of various trends commonly found in DSF datasets. The goal of Meltdown is not to replace human analysis of the raw data, but to provide a sensible interpretation of the data to make this useful experimental technique accessible to naïve users, as well as providing a starting point for detailed analyses by more experienced users.

Abstract:

Recommender systems assist users in finding what they want. The challenging issue is how to efficiently acquire user preferences or user information needs for building personalized recommender systems. This research explores the acquisition of user preferences using data taxonomy information to enhance personalized recommendations and alleviate the cold-start problem. A concept hierarchy model is proposed, which provides a two-dimensional hierarchy for acquiring user preferences. The language model is also extended for the proposed hierarchy in order to generate an effective recommender algorithm. Both Amazon.com book and music datasets are used to evaluate the proposed approach, and the experimental results show that the proposed approach is promising.

Abstract:

Quantifying fluxes of nitrous oxide (N₂O), a potent greenhouse gas, from soils is necessary to improve our knowledge of terrestrial N₂O losses. Developing universal sampling frequencies for calculating annual N₂O fluxes is difficult, as fluxes are renowned for their high temporal variability. We demonstrate that daily sampling was largely required to achieve annual N₂O fluxes within 10% of the best estimate for 28 annual datasets collected from three continents: Australia, Europe and Asia. Decreasing the regularity of measurements either under- or overestimated annual N₂O fluxes, with a maximum overestimation of 935%. Measurement frequency was lowered using a sampling strategy based on environmental factors known to affect temporal variability, but still required sampling more than once a week. Consequently, uncertainty in current global terrestrial N₂O budgets associated with the upscaling of field-based datasets can be decreased significantly using adequate sampling frequencies.
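The effect of sampling frequency on an annual flux estimate can be illustrated with a small sketch. It assumes (this is not stated in the abstract) that annual fluxes are obtained by trapezoidal integration between sampling days, and it uses a purely synthetic series with one short emission pulse; the function names and the numbers are hypothetical and are not data from the study.

```python
def annual_flux(days, fluxes):
    """Integrate daily flux rates over time with the trapezoidal rule."""
    total = 0.0
    for i in range(len(days) - 1):
        total += 0.5 * (fluxes[i] + fluxes[i + 1]) * (days[i + 1] - days[i])
    return total


def subsample(days, fluxes, every):
    """Keep every `every`-th measurement, always retaining the last day."""
    idx = list(range(0, len(days), every))
    if idx[-1] != len(days) - 1:
        idx.append(len(days) - 1)  # keep the integration window unchanged
    return [days[i] for i in idx], [fluxes[i] for i in idx]


def percent_deviation(estimate, best):
    """Signed deviation of an estimate from the best estimate, in percent."""
    return 100.0 * (estimate - best) / best


# Synthetic series (illustrative only): constant background flux of 1 unit/day
# with a five-day emission pulse of 50 units/day on days 100 to 104.
days = list(range(365))
fluxes = [1.0] * 365
for d in range(100, 105):
    fluxes[d] = 50.0

best = annual_flux(days, fluxes)                     # daily sampling
weekly = annual_flux(*subsample(days, fluxes, 7))    # one sample per week
deviation = percent_deviation(weekly, best)
```

In this synthetic case the weekly schedule samples days 98 and 105 and misses the pulse entirely, so the weekly estimate falls well short of the daily one. A pulse that happened to coincide with a sampling day would instead be over-weighted, which is one way sparse sampling can overestimate as well as underestimate.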

Abstract:

The problem of unsupervised anomaly detection arises in a wide variety of practical applications. While one-class support vector machines have demonstrated their effectiveness as an anomaly detection technique, their ability to model large datasets is limited due to their memory and time complexity for training. To address this issue for supervised learning of kernel machines, there has been growing interest in random projection methods as an alternative to the computationally expensive problems of kernel matrix construction and support vector optimisation. In this paper we leverage the theory of nonlinear random projections and propose the Randomised One-class SVM (R1SVM), which is an efficient and scalable anomaly detection technique that can be trained on large-scale datasets. Our empirical analysis on several real-life and synthetic datasets shows that our randomised 1SVM algorithm achieves accuracy comparable to or better than deep autoencoder and traditional kernelised approaches for anomaly detection, while being approximately 100 times faster in training and testing.
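The idea of replacing kernel matrix construction with a nonlinear random projection can be sketched as follows. The abstract does not say which projection R1SVM uses, so this example assumes random Fourier features for an RBF kernel, one well-known construction of this kind. It shows only the projection step, after which a linear one-class model could be trained on the projected data; all function names and parameter values are hypothetical.

```python
import math
import random


def make_rff(dim, n_features, gamma, seed=0):
    """Build a random Fourier feature map z(x) such that z(x) . z(y)
    approximates the RBF kernel exp(-gamma * ||x - y||^2)."""
    rng = random.Random(seed)
    sigma = math.sqrt(2.0 * gamma)  # frequencies are drawn from N(0, 2*gamma*I)
    weights = [[rng.gauss(0.0, sigma) for _ in range(dim)]
               for _ in range(n_features)]
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def transform(x):
        return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
                for w, b in zip(weights, offsets)]

    return transform


def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rbf_kernel(x, y, gamma):
    """Exact RBF kernel value, used here only to check the approximation."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


# Compare the explicit feature map against the exact kernel on two points.
z = make_rff(dim=3, n_features=4000, gamma=0.5, seed=0)
x, y = [0.1, 0.2, 0.3], [0.4, 0.1, 0.0]
approx = dot(z(x), z(y))
exact = rbf_kernel(x, y, 0.5)
```

Because the features are explicit vectors, training cost grows with the number of random features rather than with the square of the number of samples, which is the general reason such projections scale to large datasets.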

Abstract:

Identifying unusual or anomalous patterns in an underlying dataset is an important but challenging task in many applications. The focus of the unsupervised anomaly detection literature has mostly been on vectorised data. However, many applications are more naturally described using higher-order tensor representations. Approaches that vectorise tensorial data can destroy the structural information encoded in the high-dimensional space and lead to the curse of dimensionality. In this paper we present the first unsupervised tensorial anomaly detection method, along with a randomised version of our method. Our anomaly detection method, the One-class Support Tensor Machine (1STM), is a generalisation of conventional one-class Support Vector Machines to higher-order spaces. 1STM preserves the multiway structure of tensor data, while achieving significant improvement in accuracy and efficiency over conventional vectorised methods. We then leverage the theory of nonlinear random projections to propose the Randomised 1STM (R1STM). Our empirical analysis on several real and synthetic datasets shows that our R1STM algorithm delivers accuracy comparable to or better than a state-of-the-art deep learning method and traditional kernelised approaches for anomaly detection, while being approximately 100 times faster in training and testing.

Abstract:

Many conventional statistical machine learning algorithms generalise poorly if distribution bias exists in the datasets. For example, distribution bias arises in the context of domain generalisation, where knowledge acquired from multiple source domains needs to be used in previously unseen target domains. We propose Elliptical Summary Randomisation (ESRand), an efficient domain generalisation approach that comprises a randomised kernel and elliptical data summarisation. ESRand learns a domain interdependent projection to a latent subspace that minimises the existing biases in the data while maintaining the functional relationship between domains. In the latent subspace, ellipsoidal summaries replace the samples to enhance the generalisation by further removing bias and noise in the data. Moreover, the summarisation enables large-scale data processing by significantly reducing the size of the data. Through comprehensive analysis, we show that our subspace-based approach outperforms state-of-the-art results on several activity recognition benchmark datasets, while keeping the computational complexity significantly low.

Abstract:

Generating discriminative input features is a key requirement for achieving highly accurate classifiers. The process of generating features from raw data is known as feature engineering and it can take significant manual effort. In this paper we propose automated feature engineering to derive a suite of additional features from a given set of basic features, with the aim of both improving classifier accuracy through discriminative features and assisting data scientists through automation. Our implementation is specific to HTTP computer network traffic. To measure the effectiveness of our proposal, we compare the performance of a supervised machine learning classifier built with automated feature engineering versus one using human-guided features. The classifier addresses a problem in computer network security, namely the detection of HTTP tunnels. We use Bro to process network traffic into base features and then apply automated feature engineering to calculate a larger set of derived features. The derived features are calculated without favour to any base feature and include entropy, length and N-grams for all string features, and counts and averages over time for all numeric features. Feature selection is then used to find the most relevant subset of these features. Testing showed that both classifiers achieved a detection rate above 99.93% at a false positive rate below 0.01%. For our datasets, we conclude that automated feature engineering can provide the advantages of increasing classifier development speed and reducing development technical difficulties through the removal of manual feature engineering. These are achieved while also maintaining classification accuracy.
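The string-feature derivation described above (entropy, length and N-grams applied uniformly to every string field) might look roughly like the sketch below. The field names, the choice of character bigrams, and the output naming scheme are all assumptions made for illustration; the abstract does not give the authors' actual feature definitions, and the numeric features (counts and averages over time) are not covered here.

```python
import math
from collections import Counter


def shannon_entropy(s):
    """Shannon entropy of a string's character distribution, in bits."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


def char_ngrams(s, n=2):
    """Count overlapping character n-grams in a string."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def derive_string_features(record):
    """Apply the same derivations to every string field of a record,
    without favouring any particular base feature."""
    derived = {}
    for name, value in record.items():
        if not isinstance(value, str):
            continue  # numeric fields would be handled by a separate step
        derived[f"{name}_length"] = len(value)
        derived[f"{name}_entropy"] = shannon_entropy(value)
        for gram, count in char_ngrams(value).items():
            derived[f"{name}_2gram_{gram}"] = count
    return derived


# Hypothetical base record; "uri" and "resp_bytes" are illustrative names only.
features = derive_string_features({"uri": "abab", "resp_bytes": 512})
```

Deriving many candidate features mechanically and leaving the choice to a later feature-selection step is what removes the manual effort: nobody has to decide in advance which base field deserves an entropy or N-gram feature.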

Abstract:

Twitter’s hashtag functionality is now used for a very wide variety of purposes, from covering crises and other breaking news events through gathering an instant community around shared media texts (such as sporting events and TV broadcasts) to signalling emotive states from amusement to despair. These divergent uses of the hashtag are increasingly recognised in the literature, with attention paid especially to the ability for hashtags to facilitate the creation of ad hoc or hashtag publics. A more comprehensive understanding of these different uses of hashtags has yet to be developed, however. Previous research has explored the potential for a systematic analysis of the quantitative metrics that could be generated from processing a series of hashtag datasets. Such research found, for example, that crisis-related hashtags exhibited a significantly larger incidence of retweets and tweets containing URLs than hashtags relating to televised events, and on this basis hypothesised that the information-seeking and -sharing behaviours of Twitter users in such different contexts were substantially divergent. This article updates such studies and their methodology by examining the communicative metrics of a considerably larger and more diverse number of hashtag datasets, compiled over the past five years. This provides an opportunity both to confirm earlier findings and to explore whether hashtag use practices may have shifted subsequently as Twitter’s userbase has developed further; it also enables the identification of further hashtag types beyond the “crisis” and “mainstream media event” types outlined to date. The article also explores the presence of such patterns beyond recognised hashtags, by incorporating an analysis of a number of keyword-based datasets.
This large-scale, comparative approach contributes towards the establishment of a more comprehensive typology of hashtags and their publics, and the metrics it describes can also be used to classify new hashtags emerging in the future. In turn, this may enable researchers to develop systems for automatically sorting newly trending topics into a number of event types, which may be useful, for example, for the automatic detection of acute crises and other breaking news events.
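Two of the metrics mentioned above, the share of retweets and the share of tweets containing URLs, can be computed per hashtag dataset along the following lines. This is a minimal sketch under stated assumptions: each tweet is a dictionary with a "text" field, retweets are recognised by a leading "RT @", and URLs by a simple pattern. None of these conventions comes from the article, and the sample tweets are invented.

```python
import re

# Simplified URL matcher; real tweet data would normally carry parsed URL entities.
URL_PATTERN = re.compile(r"https?://\S+")


def hashtag_metrics(tweets):
    """Return the percentage of retweets and of tweets containing a URL."""
    total = len(tweets)
    if total == 0:
        return {"retweet_pct": 0.0, "url_pct": 0.0}
    retweets = sum(1 for t in tweets if t["text"].startswith("RT @"))
    with_urls = sum(1 for t in tweets if URL_PATTERN.search(t["text"]))
    return {
        "retweet_pct": 100.0 * retweets / total,
        "url_pct": 100.0 * with_urls / total,
    }


# Invented sample dataset for one hypothetical hashtag.
sample = [
    {"text": "RT @agency: road closed near the river #flood"},
    {"text": "RT @agency: shelters are open #flood"},
    {"text": "Latest warnings at https://example.org/warnings #flood"},
    {"text": "Stay safe everyone #flood"},
]
metrics = hashtag_metrics(sample)
```

Computing the same small set of percentages for every dataset is what makes hashtags comparable with one another and, in principle, lets a new hashtag be assigned to the type whose profile it most resembles.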

Abstract:

The increased availability of image capturing devices has enabled collections of digital images to rapidly expand in both size and diversity. This has created a constantly growing need for efficient and effective image browsing, searching, and retrieval tools. Pseudo-relevance feedback (PRF) has proven to be an effective mechanism for improving retrieval accuracy. An original, simple yet effective rank-based PRF mechanism (RB-PRF) that takes into account the initial rank order of each image to improve retrieval accuracy is proposed. This RB-PRF mechanism innovates by making use of binary image signatures to improve retrieval precision by promoting images similar to highly ranked images and demoting images similar to lower ranked images. Empirical evaluations based on standard benchmarks, namely the Wang, Oliva & Torralba, and Corel datasets, demonstrate the effectiveness of the proposed RB-PRF mechanism in image retrieval.

Abstract:

Muscoidea is a significant dipteran clade that includes house flies (Family Muscidae), latrine flies (F. Fanniidae), dung flies (F. Scathophagidae) and root maggot flies (F. Anthomyiidae). It comprises approximately 7000 described species. The monophyly of the Muscoidea and the precise relationships of muscoids to the closest superfamily, the Oestroidea (blow flies, flesh flies, etc.), are both unresolved. Until now, mitochondrial (mt) genomes were available for only two of the four muscoid families, precluding a thorough test of phylogenetic relationships using this data source. Here we present the first two mt genomes for the families Fanniidae (Euryomma sp.) and Anthomyiidae (Delia platura (Meigen, 1826)). We also conducted phylogenetic analyses of these newly sequenced mt genomes plus 15 other species representative of dipteran diversity to address the internal relationships of Muscoidea and its systematic position. Both maximum-likelihood and Bayesian analyses suggested that Muscoidea was not a monophyletic group, with the relationship (Fanniidae + Muscidae) + ((Anthomyiidae + Scathophagidae) + (Calliphoridae + Sarcophagidae)) supported by the majority of analysed datasets. This also implies that Oestroidea was paraphyletic in the majority of analyses. Divergence time estimation suggested that the earliest split within the Calyptratae, separating (Tachinidae + Oestridae) from the remaining families, occurred in the Early Eocene. The main divergence within the paraphyletic Muscoidea grade was between Fanniidae + Muscidae and the lineage ((Anthomyiidae + Scathophagidae) + (Calliphoridae + Sarcophagidae)), which occurred in the Late Eocene.