970 results for Datasets


Relevance: 10.00%

Abstract:

Local spatio-temporal features with a Bag-of-visual-words model is a popular approach to human action recognition. Bag-of-features methods face several challenges, such as extracting appropriate appearance and motion features from videos, converting the extracted features into a form suitable for classification, and designing a suitable classification framework. In this paper we address the problem of efficiently representing the extracted features for classification, in order to improve overall performance. We introduce two generative supervised topic models, maximum entropy discrimination LDA (MedLDA) and class-specific simplex LDA (css-LDA), to encode the raw features for discriminative SVM-based classification. Unsupervised LDA models disconnect topic discovery from the classification task and hence yield poor results compared to the baseline Bag-of-words framework. On the other hand, supervised LDA techniques learn the topic structure by considering the class labels and improve recognition accuracy significantly. MedLDA jointly maximizes likelihood and within-class margins using max-margin techniques and yields a sparse, highly discriminative topic structure, while css-LDA learns separate class-specific topics instead of a common set of topics across the entire dataset. In our representation, topics are first learned and each video is then represented as a topic-proportion vector, comparable to a histogram of topics. Finally, SVM classification is performed on the learned topic-proportion vectors. We demonstrate the efficiency of these two representation techniques through experiments on two popular datasets. Experimental results show significantly improved performance compared to the baseline Bag-of-features framework, which uses k-means to construct a histogram of words from the feature vectors.
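
As a rough illustration of the representation pipeline described above, the sketch below substitutes scikit-learn's unsupervised LatentDirichletAllocation for the paper's supervised MedLDA/css-LDA models (no off-the-shelf implementation is assumed here) and classifies topic-proportion vectors with a linear SVM; the data, codebook size and topic count are all invented.

```python
# Illustrative pipeline: unsupervised LDA stands in for MedLDA/css-LDA.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Toy data: each row is a bag-of-visual-words histogram for one video
# (1000 videos, 500-word codebook); labels are hypothetical action classes.
X_bow = rng.poisson(1.0, size=(1000, 500))
y = rng.integers(0, 5, size=1000)

lda = LatentDirichletAllocation(n_components=30, random_state=0)
X_topics = lda.fit_transform(X_bow)   # each video -> topic-proportion vector

clf = SVC(kernel="linear").fit(X_topics, y)
print(clf.score(X_topics, y))
```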

Relevance: 10.00%

Abstract:

The commercialization of aerial image processing is highly dependent on platforms such as UAVs (Unmanned Aerial Vehicles). However, the lack of an automated UAV forced landing site detection system has been identified as one of the main impediments to allowing UAV flight over populated areas in civilian airspace. This article proposes a UAV forced landing site detection system based on machine learning approaches, including the Gaussian Mixture Model and the Support Vector Machine. A range of learning parameters is analysed, including the number of Gaussian mixtures, the support vector kernel (linear, radial basis function (RBF) and polynomial), and the order of the RBF and polynomial kernels. Moreover, a modified footprint operator is employed during feature extraction to better describe the geometric characteristics of the local area surrounding a pixel. The performance of the presented system is compared to a baseline UAV forced landing site detection system which uses edge features and an Artificial Neural Network (ANN) region-type classifier. Experiments conducted on aerial image datasets captured over typical urban environments reveal that improved landing site detection can be achieved with an SVM classifier with an RBF kernel using a combination of colour and texture features. Compared to the baseline system, the proposed system significantly improves the chance of detecting a safe landing area, and its performance is more stable in the presence of changes to the UAV altitude.
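
A minimal sketch of the classifier-selection step the abstract describes, using scikit-learn's SVC and a grid search over kernel type, regularisation and polynomial order; the feature matrix is random stand-in data, not real colour/texture descriptors.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(600, 24))       # 24-D colour/texture features (toy)
y = rng.integers(0, 2, size=600)     # 1 = safe landing region, 0 = unsafe

# Explore the learning parameters the paper varies: kernel type and order.
grid = GridSearchCV(SVC(),
                    {"kernel": ["linear", "rbf", "poly"],
                     "C": [0.1, 1, 10],
                     "degree": [2, 3]},   # order of the polynomial kernel
                    cv=3)
grid.fit(X, y)
print(grid.best_params_)
```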

Relevance: 10.00%

Abstract:

Alignment-free methods, in which shared properties of sub-sequences (e.g. identity or match length) are extracted and used to compute a distance matrix, have recently been explored for phylogenetic inference. However, the scalability and robustness of these methods to key evolutionary processes remain to be investigated. Here, using simulated sequence sets of various sizes, in both nucleotides and amino acids, we systematically assess the accuracy of phylogenetic inference using an alignment-free approach, based on D2 statistics, under different evolutionary scenarios. We find that, compared to a multiple sequence alignment approach, D2 methods are more robust against among-site rate heterogeneity, compositional biases, genetic rearrangements and insertions/deletions, but are more sensitive to recent sequence divergence and sequence truncation. Across diverse empirical datasets, the alignment-free methods perform well for sequences of low divergence, at greater computational speed. Our findings provide strong evidence for the scalability and potential use of alignment-free methods in large-scale phylogenomics.
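
The core D2 statistic is simply the inner product of the k-mer count vectors of two sequences. A minimal sketch in plain Python follows; the sequences and k are toy choices, and published work typically converts normalised variants of the statistic into a distance before tree building.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count all overlapping k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def d2(seq_a, seq_b, k=5):
    """D2 statistic: inner product of the two k-mer count vectors."""
    ca, cb = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    return sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())

print(d2("ACGTACGTGACG", "ACGTTTGACGTA", k=3))
```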

Relevance: 10.00%

Abstract:

In recent years, considerable research effort has been directed at microarray technologies and their role in providing simultaneous information on the expression profiles of thousands of genes. These data, when subjected to clustering and classification procedures, can assist in identifying patterns and providing insight into biological processes. To understand the properties of complex gene expression datasets, graphical representations can be used. Intuitively, the data can be represented as a bipartite graph, with weighted edges corresponding to gene-sample node pairs in the dataset. Biologically meaningful subgraphs can then be sought, but performance is influenced both by the search algorithm and by the graph-weighting scheme, and both merit rigorous investigation. In this paper, we focus on edge-weighting schemes for the bipartite graphical representation of gene expression. Two novel methods are presented: the first is based on empirical evidence; the second on a geometric distribution. The schemes are compared on several real datasets, assessing efficiency of performance based on four essential properties: robustness to noise and missing values, discrimination, parameter influence on scheme efficiency, and reusability. Recommendations and limitations are briefly discussed.

Keywords: edge-weighting; weighted graphs; gene expression; biclustering
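
For concreteness, here is a minimal sketch of the bipartite gene-sample representation using networkx; the thresholded over-expression indicator used as the edge weight is a toy placeholder, not either of the two proposed schemes.

```python
import networkx as nx
import numpy as np

expr = np.array([[2.3, 0.1, 1.8],      # rows: genes, columns: samples
                 [0.0, 3.1, 0.2],
                 [1.1, 1.0, 2.9]])

G = nx.Graph()
genes = [f"g{i}" for i in range(expr.shape[0])]
samples = [f"s{j}" for j in range(expr.shape[1])]
G.add_nodes_from(genes, bipartite=0)
G.add_nodes_from(samples, bipartite=1)

for i, g in enumerate(genes):
    for j, s in enumerate(samples):
        # Toy weighting scheme: binarised over-expression indicator.
        G.add_edge(g, s, weight=float(expr[i, j] > 1.0))

print(G.edges(data=True))
```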

Relevance: 10.00%

Abstract:

One of the main challenges in data analytics is that discovering structures and patterns in complex datasets is a compute-intensive task. Recent advances in high-performance computing provide part of the solution. Multicore systems are now more affordable and more accessible. In this paper, we investigate how this can be used to develop more advanced methods for data analytics. We focus on two specific areas: model-driven analysis and data mining using optimisation techniques.
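
A minimal sketch of the kind of multicore speed-up the paper builds on, using Python's multiprocessing to farm independent analysis runs out to all available cores; fit_model is a hypothetical stand-in for one compute-intensive task.

```python
from multiprocessing import Pool
import os

def fit_model(seed):
    # Stand-in for one compute-intensive analysis run (e.g. one restart
    # of a clustering or optimisation procedure with a different seed).
    return sum(i * seed for i in range(10_000)) % 97

if __name__ == "__main__":
    with Pool(os.cpu_count()) as pool:
        results = pool.map(fit_model, range(32))
    print(results[:5])
```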

Relevance: 10.00%

Abstract:

Most real-life data analysis problems are difficult to solve using exact methods, due to the size of the datasets and the nature of the underlying mechanisms of the system under investigation. As datasets grow even larger, finding the balance between the quality of the approximation and the computing time of the heuristic becomes non-trivial. One solution is to consider parallel methods, and to use the increased computational power to perform a deeper exploration of the solution space in a similar time. It is, however, difficult to estimate a priori whether parallelisation will provide the expected improvement. In this paper we consider a well-known method, genetic algorithms, and evaluate the behaviour of the classic and parallel implementations on two distinct problem types.
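
A sketch of the parallelisation pattern discussed here: the classic genetic algorithm loop with its fitness evaluations (usually the dominant cost) distributed across cores via multiprocessing.Pool. The one-max objective and the operators are deliberately toy choices, not the paper's benchmark problems.

```python
import random
from multiprocessing import Pool

def fitness(bits):                      # toy objective: one-max
    return sum(bits)

def mutate(bits, rate=0.02):
    return [b ^ (random.random() < rate) for b in bits]

if __name__ == "__main__":
    pop = [[random.randint(0, 1) for _ in range(64)] for _ in range(100)]
    with Pool() as pool:
        for gen in range(50):
            scores = pool.map(fitness, pop)          # parallel evaluation
            ranked = [p for _, p in sorted(zip(scores, pop), reverse=True)]
            elite = ranked[:50]                      # truncation selection
            pop = elite + [mutate(random.choice(elite)) for _ in range(50)]
    print(max(map(fitness, pop)))
```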

Relevance: 10.00%

Abstract:

In providing simultaneous information on expression profiles for thousands of genes, microarray technologies have in recent years been widely used to investigate mechanisms of gene expression. Clustering and classification of such data can indeed highlight patterns and provide insight into biological processes. A common approach is to consider the genes and samples of microarray datasets as nodes in a bipartite graph, with edges weighted, e.g., based on expression levels. In this paper, using a previously-evaluated weighting scheme, we focus on search algorithms and evaluate, in the context of biclustering, several variations of Genetic Algorithms. We also introduce a new heuristic, “Propagate”, which consists in recursively evaluating neighbour solutions with one more or one fewer active condition. The results obtained on three well-known datasets show that, for a given weighting scheme, optimal or near-optimal solutions can be identified.
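
The following sketch gives one plausible reading of the “Propagate” neighbourhood, as a hill climb that repeatedly toggles a single condition on or off and keeps improving neighbours; the scoring function is a placeholder, not the paper's weighting scheme.

```python
def propagate(active, score, n_conditions):
    """Hill-climb over condition sets by toggling one condition at a time."""
    best, best_score = active, score(active)
    improved = True
    while improved:
        improved = False
        for c in range(n_conditions):
            cand = best ^ {c}        # one more or one fewer active condition
            if cand and score(cand) > best_score:
                best, best_score = cand, score(cand)
                improved = True
    return best, best_score

toy_score = lambda s: len(s) - abs(len(s) - 3)   # peaks at 3 active conditions
print(propagate({0, 1}, toy_score, n_conditions=6))
```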

Relevance: 10.00%

Abstract:

This paper addresses the development of trust in the use of Open Data through the incorporation of appropriate authentication and integrity parameters, for use by end-user Open Data application developers, in an architecture for trustworthy Open Data Services. The advantage of this architecture is that it is far more scalable: it is not another certificate-based hierarchy, with the attendant problems of certificate revocation management. With the use of a Public File, if a key is compromised, the single responsible entity simply replaces the key pair with a new one and re-performs the data file signing process. Under the proposed architecture, the Open Data environment does not interfere with the internal security schemes that might be employed by the entity. However, the architecture incorporates, when needed, parameters from the entity, e.g. the person who authorized publishing as Open Data, at the time that datasets are created or added.
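
A minimal sketch of the signing-and-verification flow implied above, using Ed25519 from the Python cryptography package purely as an example primitive; the architecture's actual key formats and Public File layout are not specified here.

```python
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

private_key = Ed25519PrivateKey.generate()      # held by the publishing entity
public_key = private_key.public_key()           # published via the Public File

dataset = b"station_id,pm25\n42,17.3\n"         # toy Open Data file contents
signature = private_key.sign(dataset)           # data file signing process

try:
    public_key.verify(signature, dataset)       # end-user application check
    print("dataset verified")
except InvalidSignature:
    print("dataset rejected")
```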

Relevance: 10.00%

Abstract:

This chapter discusses the methodological aspects and empirical findings of a large-scale, funded project investigating public communication through social media in Australia. The project concentrates on Twitter, but we approach it as representative of broader current trends toward the integration of large datasets and computational methods into media and communication studies in general, and social media scholarship in particular. The research discussed in this chapter aims to empirically describe networks of affiliation and interest in the Australian Twittersphere, while reflecting on the methodological implications and imperatives of ‘big data’ in the humanities.

Using custom network crawling technology, we have conducted a snowball crawl of Twitter accounts operated by Australian users to identify more than one million users and their follower/followee relationships, and have mapped their interconnections. In itself, the map provides an overview of the major clusters of densely interlinked users, largely centred on shared topics of interest (from politics through arts to sport) and/or sociodemographic factors (geographic origins, age groups). Our map of the Twittersphere is the first of its kind for the Australian part of the global Twitter network, and also provides a first independent and scholarly estimate of the size of the total Australian Twitter population.

In combination with our investigation of participation patterns in specific thematic hashtags, the map also enables us to examine which areas of the underlying follower/followee network are activated in the discussion of specific current topics, allowing new insights into the extent to which particular topics and issues are of interest to specialised niches or to the Australian public more broadly. Specifically, we examine the Twittersphere footprint of dedicated political discussion, under the #auspol hashtag, and compare it with the heightened, broader interest in Australian politics during election campaigns, using #ausvotes; we explore the different patterns of Twitter activity across the map for major television events (the popular competitive cooking show #masterchef, the British #royalwedding, and the annual #stateoforigin Rugby League sporting contest); and we investigate the circulation of links to articles published by a number of major Australian news organisations across the network.

Such analysis, which combines the ‘big data’-informed map and a close reading of individual communicative phenomena, makes it possible to trace the dynamic formation and dissolution of issue publics against the backdrop of longer-term network connections, and the circulation of information across these follower/followee links. Such research sheds light on the communicative dynamics of Twitter as a space for mediated social interaction. Our work demonstrates the possibilities inherent in the current ‘computational turn’ (Berry, 2010) in the digital humanities, as well as adding to the development and critical examination of methodologies for dealing with ‘big data’ (boyd and Crawford, 2011). Our tools and methods for doing Twitter research, released under Creative Commons licences through our project website, provide the basis for replicable and verifiable digital humanities research on the processes of public communication which take place through this important new social network.
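
For readers unfamiliar with the technique, a snowball crawl is essentially a breadth-first traversal of the follower/followee graph from a set of seed accounts. A schematic sketch follows; fetch_followees is a placeholder for the project's custom crawler, not a real Twitter API client.

```python
from collections import deque

def snowball(seeds, fetch_followees, max_accounts=1_000_000):
    """Breadth-first snowball crawl over a follower/followee graph."""
    seen, edges = set(seeds), []
    queue = deque(seeds)
    while queue and len(seen) < max_accounts:
        user = queue.popleft()
        for followee in fetch_followees(user):
            edges.append((user, followee))       # follower -> followee
            if followee not in seen:
                seen.add(followee)
                queue.append(followee)
    return seen, edges

toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(snowball(["a"], lambda u: toy_graph.get(u, [])))
```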

Relevance: 10.00%

Abstract:

Building on hashtag datasets gathered since January 2011, this paper will compare patterns of Twitter usage during the popular revolution in Egypt and the civil war in Libya. Using custom-made tools for processing ‘big data’ (boyd & Crawford, 2011), we will examine the volume of tweets sent by English-, Arabic-, and mixed-language Twitter users over time, and examine the networks of interaction (variously through @replying, retweeting, or both) between these groups as they developed and shifted over the course of these uprisings. Examining @reply and retweet traffic, we will identify general patterns of information flow between the English- and Arabic-speaking sides of the Twittersphere, and highlight the roles played by key boundary riders connecting both language spheres. Further, we will examine the URLs shared in these hashtags by Twitter participants, to identify the most prominent overall information sources, examine differences in the information diet experienced by English- and Arabic-language users, and investigate whether there are any online sources whose URLs are transcending language boundaries more frequently than others.

Relevance: 10.00%

Abstract:

Discounted Cumulative Gain (DCG) is a well-known ranking evaluation measure for models built with multi-graded relevance data. By treating the tagging data used in recommendation systems as an ordinal relevance set {negative, null, positive}, we propose to build a DCG-based recommendation model. We present an efficient and novel learning-to-rank method that optimizes DCG for a recommendation model using this tagging-data interpretation scheme. Evaluating the proposed method on real-world datasets, we demonstrate that it is scalable and outperforms the benchmark methods by generating a high-quality top-N item recommendation list.
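
As a worked example of the measure being optimised, the snippet below computes DCG (and its ideal-normalised form, nDCG) for a top-N list, with the tagging grades mapped to {negative: 0, null: 1, positive: 2}; the gain/discount form shown is the common 2^rel variant, which may differ from the paper's exact formulation.

```python
import math

def dcg(relevances):
    """DCG with exponential gain and log2 positional discount."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances))

recommended = [2, 0, 1, 2, 0]      # grades of the top-5 recommended items
ideal = sorted(recommended, reverse=True)
print(dcg(recommended), dcg(recommended) / dcg(ideal))   # DCG and nDCG
```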

Relevance: 10.00%

Abstract:

Map-matching algorithms that utilise road segment connectivity along with other data (i.e. position, speed and heading) in the map-matching process are normally suited to high-frequency (1 Hz or higher) positioning data from GPS. When such map-matching algorithms are applied to low-frequency data (such as data from a fleet of private cars, buses, light-duty vehicles or smartphones), their performance drops to around 70% in terms of correct link identification, especially in urban and suburban road networks. This level of performance may be insufficient for some real-time Intelligent Transport System (ITS) applications and services, such as estimating link travel time and speed from low-frequency GPS data. Therefore, this paper develops a new weight-based shortest path and vehicle trajectory aided map-matching (stMM) algorithm that enhances the map-matching of low-frequency positioning data on a road map. The well-known A* search algorithm is employed to derive the shortest path between two points while taking into account both link connectivity and turn restrictions at junctions. In the developed stMM algorithm, two additional weights related to the shortest path and the vehicle trajectory are considered: one shortest-path-based weight relates the distance along the shortest path to the distance along the vehicle trajectory, while the other is associated with the heading difference of the vehicle trajectory. The developed stMM algorithm is tested on a series of real-world datasets of varying frequencies (i.e. 1 s, 5 s, 30 s and 60 s sampling intervals). A high-accuracy integrated navigation system (a high-grade inertial navigation system and a carrier-phase GPS receiver) is used to measure the accuracy of the developed algorithm. The results suggest that the algorithm identifies 98.9% of the links correctly for 30 s GPS data. Omitting the information from the shortest path and vehicle trajectory, the accuracy of the algorithm falls to about 73% in terms of correct link identification. The algorithm can process on average 50 positioning fixes per second, making it suitable for real-time ITS applications and services.
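
A sketch of the shortest-path ingredient only, using networkx's A* with link length as the weight and straight-line distance as the heuristic; the toy coordinates are invented, and the full stMM algorithm additionally scores candidate links with the trajectory-distance and heading-difference weights described above.

```python
import math
import networkx as nx

coords = {"A": (0, 0), "B": (1, 0), "C": (1, 1), "D": (2, 1)}
G = nx.DiGraph()
for u, v in [("A", "B"), ("B", "C"), ("C", "D"), ("A", "C")]:
    G.add_edge(u, v, weight=math.dist(coords[u], coords[v]))

def straight_line(u, v):               # admissible heuristic for A*
    return math.dist(coords[u], coords[v])

print(nx.astar_path(G, "A", "D", heuristic=straight_line, weight="weight"))
```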

Relevance: 10.00%

Abstract:

Advances in neural network language models have demonstrated that these models can effectively learn representations of word meaning. In this paper, we explore a variation of neural language models that learns from concepts taken from structured ontologies and extracted from free text, rather than directly from terms in free text. This model is employed for the task of measuring semantic similarity between medical concepts, a task that is central to a number of techniques in medical informatics and information retrieval. The model is built with two medical corpora (journal abstracts and patient records) and empirically validated on two ground-truth datasets of human-judged concept pairs assessed by medical professionals. Empirically, our approach correlates closely with expert human assessors (≈ 0.9) and outperforms a number of state-of-the-art benchmarks for medical semantic similarity. The demonstrated superiority of this model for providing an effective semantic similarity measure is promising, in that it may translate into effectiveness gains for techniques in medical information retrieval and medical informatics (e.g., query expansion and literature-based discovery).
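
As a rough stand-in for the concept-level model (the paper's own architecture is not reproduced here), one can train gensim's word2vec on sequences of concept identifiers rather than raw terms, and compare concepts by cosine similarity; the corpus and the CUI-style identifiers below are toy placeholders.

```python
from gensim.models import Word2Vec

# Each "sentence" is one document rendered as a sequence of concept IDs
# (e.g. UMLS CUIs extracted from free text); these are invented examples.
concept_corpus = [
    ["C0011849", "C0020538", "C0027051"],
    ["C0011849", "C0027051", "C0018802"],
    ["C0020538", "C0018802", "C0011849"],
] * 50

model = Word2Vec(concept_corpus, vector_size=32, window=3,
                 min_count=1, epochs=20, seed=0)
print(model.wv.similarity("C0011849", "C0027051"))  # cosine similarity
```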

Relevance: 10.00%

Abstract:

In this paper we present a new method for performing Bayesian parameter inference and model choice for low-count time series models with intractable likelihoods. The method incorporates an alive particle filter within a sequential Monte Carlo (SMC) algorithm to create a novel pseudo-marginal algorithm, which we refer to as alive SMC^2. The advantage of this approach over competing approaches is that it is naturally adaptive, does not involve the between-model proposals required in reversible jump Markov chain Monte Carlo, and does not rely on potentially rough approximations. The algorithm is demonstrated on Markov process and integer autoregressive moving-average models applied to real biological datasets of hospital-acquired pathogen incidence, animal health time series and the cumulative number of prion disease cases in mule deer.
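
A minimal sketch of the alive particle filter idea at the heart of the method: for a model with discrete observations, particles are simulated at each time step until a fixed number reproduce the observed count exactly, and the inverse-binomial estimate (N - 1)/(T - 1) of the match probability serves as the likelihood increment. The random-walk state and binomial observation model below is a toy stand-in, not one of the paper's models.

```python
import math
import random

def alive_filter_loglik(obs, n_particles=100, max_trials=100_000):
    """Estimate the log-likelihood of integer observations with an alive PF."""
    particles = [0] * n_particles
    loglik = 0.0
    for y in obs:
        survivors, trials = [], 0
        while len(survivors) < n_particles:
            trials += 1
            if trials > max_trials:
                return float("-inf")            # observation too unlikely
            # Propose: resample a parent, move it, simulate an observation.
            x = random.choice(particles) + random.choice([-1, 0, 1])
            sim_y = sum(random.random() < 0.5 for _ in range(max(x, 0) + 1))
            if sim_y == y:                      # particle is "alive"
                survivors.append(x)
        loglik += math.log((n_particles - 1) / (trials - 1))
        particles = survivors
    return loglik

print(alive_filter_loglik([1, 2, 1, 0, 1]))
```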

Relevance: 10.00%

Abstract:

Pilot- and industrial-scale dilute acid pretreatment data can be difficult to obtain due to the significant infrastructure investment required. Consequently, models of dilute acid pretreatment by necessity use laboratory-scale data to determine kinetic parameters and make predictions about optimal pretreatment conditions at larger scales. In order for these recommendations to be meaningful, the ability of laboratory-scale models to predict pilot- and industrial-scale yields must be investigated. A mathematical model of the dilute acid pretreatment of sugarcane bagasse has previously been developed by the authors. This model was able to successfully reproduce the experimental yields of xylose and short-chain xylooligomers obtained at the laboratory scale. In this paper, the ability of the model to reproduce pilot-scale yield and composition data is examined. It was found that in general the model over-predicted the pilot-scale reactor yields by a significant margin. Models that appear very promising at the laboratory scale may have limitations when predicting yields at a pilot or industrial scale. It is difficult to comment on whether there are any consistent trends in optimal operating conditions between reactor-scale and laboratory-scale hydrolysis, due to the limited reactor datasets available. Further investigation is needed to determine whether the model has some efficacy when the kinetic parameters are re-evaluated by fitting to reactor-scale data; however, this requires the compilation of larger datasets. Alternatively, laboratory-scale mathematical models may have enhanced utility for predicting larger-scale reactor performance if bulk mass transport and fluid flow considerations are incorporated into the fibre-scale equations. This work reinforces the need for appropriate attention to be paid to pilot-scale experimental development when moving from laboratory to pilot and industrial scales for new technologies.
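
Such pretreatment models are typically built on first-order kinetic schemes of the form xylan → xylooligomers → xylose → degradation products. An illustrative integration with scipy follows; the rate constants are invented, not the paper's fitted values.

```python
import numpy as np
from scipy.integrate import solve_ivp

k1, k2, k3 = 0.05, 0.08, 0.01      # min^-1, hypothetical rate constants

def rhs(t, c):
    """First-order kinetics: xylan -> oligomers -> xylose -> degradation."""
    xylan, oligo, xylose = c
    return [-k1 * xylan,
            k1 * xylan - k2 * oligo,
            k2 * oligo - k3 * xylose]

sol = solve_ivp(rhs, (0, 60), [100.0, 0.0, 0.0], t_eval=np.linspace(0, 60, 7))
print(sol.y[2])                    # xylose yield profile over 60 min
```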