203 results for Large Data
Abstract:
This article presents new theoretical and empirical evidence on the forecasting ability of prediction markets. We develop a model predicting that the time until expiration of a prediction market should negatively affect the accuracy of prices as a forecasting tool, in the direction of a 'favourite/longshot bias': high-likelihood events are underpriced, and low-likelihood events are overpriced. We confirm this result using a large data set of prediction market transaction prices. Prediction markets are reasonably well calibrated when time to expiration is relatively short, but prices are significantly biased for events farther in the future. When the time value of money is considered, the miscalibration can be exploited to earn excess returns only when the trader has a relatively low discount rate.
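The predicted bias can be made concrete with a simple calibration check: bin contracts by transaction price and compare each bin's mean price (the implied probability) with the observed event frequency. The sketch below is a minimal illustration of that check under an assumed equal-width binning, not the authors' estimation procedure.

```python
import numpy as np

def calibration_table(prices, outcomes, n_bins=10):
    """Bin contracts by price and compare each bin's mean price
    (implied probability) with the observed event frequency."""
    prices = np.asarray(prices, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)  # 1.0 if the event occurred
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (prices >= lo) & (prices < hi)
        if mask.any():
            rows.append((lo, hi, prices[mask].mean(), outcomes[mask].mean()))
    # Favourite/longshot bias shows up as observed frequency above mean
    # price in the high bins (favourites underpriced) and below it in
    # the low bins (longshots overpriced).
    return rows
```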
Abstract:
MapReduce is a computation model for processing large data sets in parallel on large clusters of machines in a reliable, fault-tolerant manner. A MapReduce computation is broken down into a number of map tasks and reduce tasks, which are performed by so-called mappers and reducers, respectively. The placement of the mappers and reducers on the machines directly affects the performance and cost of the MapReduce computation. From the computational point of view, the mappers/reducers placement problem is a generalization of the classical bin packing problem, which is NP-complete. In this paper we therefore propose a new grouping genetic algorithm for the mappers/reducers placement problem in cloud computing. Compared with the original grouping genetic algorithm, ours uses an innovative coding scheme and eliminates the inversion operator, which is an essential operator in the original algorithm. The new grouping genetic algorithm is evaluated experimentally, and the results show that it is much more efficient than four popular algorithms for the problem, including the original grouping genetic algorithm.
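A grouping genetic algorithm encodes whole groups (here, machines) rather than per-item assignments, so that crossover and mutation act on groups. The sketch below shows a group-oriented representation for a bin-packing-style placement with a first-fit decoder and a group-count fitness; the task sizes, machine capacity and fitness form are illustrative assumptions, not the paper's coding scheme.

```python
import random

def first_fit(tasks, sizes, capacity):
    """Decode a task ordering into groups (machines) by first fit."""
    groups, loads = [], []
    for t in tasks:
        for i, load in enumerate(loads):
            if load + sizes[t] <= capacity:
                groups[i].append(t)
                loads[i] += sizes[t]
                break
        else:  # no open machine fits this task: open a new one
            groups.append([t])
            loads.append(sizes[t])
    return groups

def fitness(groups):
    """Fewer machines is better; a GGA scores whole groups, so the
    chromosome is the list of groups itself, not a permutation."""
    return len(groups)

sizes = {t: random.randint(1, 10) for t in range(20)}    # task -> demand
chromosome = first_fit(list(sizes), sizes, capacity=15)  # initial individual
print(fitness(chromosome), "machines used")
```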
Abstract:
In this paper, a novel data-driven approach to monitoring systems that operate under variable operating conditions is described. The method characterizes the degradation process via a set of operation-specific hidden Markov models (HMMs), whose hidden states represent the unobservable degradation states of the monitored system and whose observable symbols represent the sensor readings. Within the HMM framework, modeling, identification and monitoring methods are detailed that allow one to identify an HMM of degradation for each operation from mixed-operation data and to perform operation-specific monitoring of the system. Using a large data set provided by a major manufacturer, the new methods are applied to a semiconductor manufacturing process running multiple operations in a production environment.
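With discretized sensor symbols, operation-specific monitoring of this kind reduces to scoring the observed symbol sequence under each operation's HMM with the scaled forward algorithm and reading off the filtered degradation state. The sketch below assumes a toy two-state (healthy/degraded) model; the paper identifies its models from mixed-operation data, which is omitted here.

```python
import numpy as np

def forward(pi, A, B, obs):
    """Scaled forward algorithm for a discrete HMM: returns the sequence
    log-likelihood and the filtered state distribution after the last
    symbol. pi: (S,) initial probs; A: (S,S) transitions; B: (S,V) emissions."""
    alpha = pi * B[:, obs[0]]
    log_lik = np.log(alpha.sum())
    alpha = alpha / alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        s = alpha.sum()
        log_lik += np.log(s)
        alpha = alpha / s
    return log_lik, alpha

# Toy model: state 0 = healthy, state 1 = degraded (absorbing).
pi = np.array([0.9, 0.1])
A = np.array([[0.95, 0.05],
              [0.00, 1.00]])
B = np.array([[0.8, 0.2],   # symbol probabilities when healthy
              [0.3, 0.7]])  # symbol probabilities when degraded
obs = [0, 0, 1, 1, 1]       # discretized sensor readings

log_lik, filtered = forward(pi, A, B, obs)
print(log_lik, filtered.argmax())  # likelihood and most likely current state
```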
Abstract:
Species identification based on short sequences of DNA markers, that is, DNA barcoding, has emerged as an integral part of modern taxonomy. However, software for the analysis of large, multilocus barcoding data sets is scarce. The Basic Local Alignment Search Tool (BLAST) is currently the fastest tool capable of handling large databases (e.g. >5000 sequences), but its accuracy is a concern, and it has been criticized for its local optimization. More accurate current software, however, requires sequence alignment or complex calculations, which are time-consuming for large data sets during preprocessing or during the search stage. It is therefore imperative to develop a practical program for accurate and scalable species identification in DNA barcoding. In this context, we present VIP Barcoding: user-friendly software with a graphical user interface for rapid DNA barcoding. It adopts a hybrid, two-stage algorithm. First, an alignment-free composition vector (CV) method is used to reduce the search space by screening a reference database. The alignment-based K2P-distance nearest-neighbour method is then employed to analyse the smaller data set generated in the first stage. In comparison with other software, we demonstrate that VIP Barcoding has (i) higher accuracy than Blastn and several alignment-free methods and (ii) higher scalability than alignment-based distance methods and character-based methods. These results suggest that the platform can handle both large-scale and multilocus barcoding data with accuracy and can contribute to DNA barcoding for modern taxonomy. VIP Barcoding is free and available at http://msl.sls.cuhk.edu.hk/vipbarcoding/.
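The second-stage distance is the standard Kimura two-parameter (K2P) measure, d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q), where P and Q are the observed proportions of transitions and transversions between two aligned sequences. A minimal sketch, assuming pre-aligned sequences of equal length and skipping ambiguous sites:

```python
from math import log, sqrt

PURINES = {"A", "G"}

def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned DNA sequences."""
    pairs = [(a, b) for a, b in zip(seq1.upper(), seq2.upper())
             if a in "ACGT" and b in "ACGT"]
    n = len(pairs)
    transitions = sum(1 for a, b in pairs                 # A<->G, C<->T
                      if a != b and (a in PURINES) == (b in PURINES))
    transversions = sum(1 for a, b in pairs               # purine <-> pyrimidine
                        if a != b and (a in PURINES) != (b in PURINES))
    P, Q = transitions / n, transversions / n
    return -0.5 * log((1 - 2 * P - Q) * sqrt(1 - 2 * Q))

print(k2p_distance("ACGTACGTAC", "ACGTACGTGC"))  # one transition -> ~0.112
```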
Abstract:
Bioacoustic data can be used to monitor animal species diversity, and the deployment of acoustic sensors enables acoustic monitoring at large temporal and spatial scales. We describe a content-based birdcall retrieval algorithm for exploring large databases of acoustic recordings. The algorithm combines an event-based searching scheme with compact features: ridge events are detected in audio files via event detection on spectral ridges, event alignment is used to search through audio files and locate candidate instances, and a similarity measure is then applied to dimension-reduced spectral-ridge feature vectors. Because the event-based search processes a smaller list of instances, retrieval is faster. Experimental results demonstrate that our features achieve a better success rate than existing methods while greatly reducing the feature dimension.
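The final ranking step can be illustrated with cosine similarity over the dimension-reduced ridge-feature vectors; treating the measure as cosine-based is an assumption here, and the feature extraction and event alignment stages are taken as given.

```python
import numpy as np

def rank_candidates(query_vec, candidate_vecs, top_k=5):
    """Rank candidate events by cosine similarity to the query event."""
    q = query_vec / np.linalg.norm(query_vec)
    C = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    sims = C @ q
    order = np.argsort(-sims)[:top_k]
    return list(zip(order.tolist(), sims[order].tolist()))

rng = np.random.default_rng(0)
query = rng.random(16)            # toy dimension-reduced ridge-feature vector
candidates = rng.random((100, 16))
print(rank_candidates(query, candidates))
```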
Abstract:
This chapter draws on a large data set of children's work samples collected as part of a five-year school reform project in a community of high poverty. One component of the data set is a corpus of more than 2000 writing samples collected annually from students across eight grade levels (Prep to Year 7) over four years of the project (2009-2013). The chapter utilises a selection of these texts to consider the insights available to teachers and schools through a simple process of collecting and assessing writing samples produced by children over time. The focus is on what samples of writing might enable us to know and understand about learning and teaching this important dimension of literacy in current classrooms.
Abstract:
Few studies have examined interactive effects between air pollution and temperature on health outcomes. This study examines whether temperature modified the effect of ozone on cardiovascular mortality in 95 large US cities. A nonparametric and a parametric regression model were separately used to explore the interactive effects of temperature and ozone on cardiovascular mortality between May and October, 1987-2000, and a Bayesian meta-analysis was used to pool the estimates. Both models illustrate that temperature enhanced the ozone effects on mortality in the northern region, but not obviously in the southern region. A 10-ppb increment in ozone was associated with 0.41% (95% posterior interval (PI): -0.19%, 0.93%), 0.27% (95% PI: -0.44%, 0.87%) and 1.68% (95% PI: 0.07%, 3.26%) increases in daily cardiovascular mortality at low, moderate and high levels of temperature, respectively. We conclude that temperature modified the effects of ozone, particularly in the northern region.
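The reported effect sizes follow the usual convention for log-linear (Poisson) mortality models: a coefficient beta per ppb of ozone converts to a percent increase per 10-ppb increment as 100 x (e^(10*beta) - 1). A minimal sketch, with beta back-solved purely to reproduce the 0.41% figure as an illustration, not an estimate from the study:

```python
from math import exp, log

def pct_increase_per_10ppb(beta):
    """Percent increase in mortality per 10-ppb ozone increment
    implied by a log-linear regression coefficient beta (per ppb)."""
    return (exp(10 * beta) - 1) * 100

beta = log(1.0041) / 10            # toy value, back-solved for illustration
print(round(pct_increase_per_10ppb(beta), 2))  # -> 0.41
```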
Abstract:
Acoustic sensors play an important role in augmenting the traditional biodiversity monitoring activities carried out by ecologists and conservation biologists. With this ability, however, comes the burden of analysing large volumes of complex acoustic data; given that complexity, fully automated analysis for a wide range of species remains a significant challenge. This research investigates the use of citizen scientists to analyse large volumes of environmental acoustic data in order to identify bird species. Specifically, it investigates ways in which a user's efficiency can be improved through species identification tools, and the use of reputation models to predict the accuracy of users with unknown skill levels. Initial experimental results are reported.
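One simple form such a reputation model can take is a Beta-Binomial estimate of a user's labelling accuracy from previously verified tags; the uniform prior below is an assumption, and this is an illustrative stand-in rather than the reputation model evaluated in the study.

```python
def reputation(correct, total, alpha=1.0, beta=1.0):
    """Posterior mean accuracy under a Beta(alpha, beta) prior:
    new users fall back to the prior mean, experienced users dominate it."""
    return (correct + alpha) / (total + alpha + beta)

print(reputation(0, 0))    # 0.5  -> unknown skill defaults to the prior
print(reputation(45, 50))  # ~0.88 -> accuracy estimate for a proven user
```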
Abstract:
Although the collection of player and ball tracking data is fast becoming the norm in professional sports, large-scale mining of such spatiotemporal data has yet to surface. In this paper, given an entire season's worth of player and ball tracking data from a professional soccer league (approximately 400,000,000 data points), we present a method that supports both individual player and team analysis. Due to the dynamic, continuous and multi-player nature of team sports like soccer, a major issue is aligning player positions over time. We present a "role-based" representation that dynamically updates each player's relative role at each frame, and demonstrate how this captures short-term context to enable both individual player and team analysis. We discover roles directly from data using a minimum entropy data partitioning method, and show how this can be used to accurately detect and visualize formations, as well as analyze individual player behavior.
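Per-frame role alignment of this kind is commonly cast as a one-to-one minimum-cost matching between player positions and role templates. The sketch below uses the Hungarian algorithm via SciPy; the fixed role templates are an illustrative stand-in for the roles the paper discovers with minimum entropy partitioning.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_roles(player_xy, role_xy):
    """player_xy, role_xy: (N, 2) pitch coordinates. Returns the role
    index for each player, minimising total squared distance."""
    cost = ((player_xy[:, None, :] - role_xy[None, :, :]) ** 2).sum(axis=2)
    _, roles = linear_sum_assignment(cost)  # rows come back in player order
    return roles

rng = np.random.default_rng(1)
formation = rng.random((10, 2)) * np.array([105.0, 68.0])  # toy mean role spots
frame = formation + rng.normal(scale=3.0, size=(10, 2))    # one frame of play
print(assign_roles(frame, formation))
```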
Abstract:
The Enhancing NeuroImaging Genetics through Meta-Analysis (ENIGMA) Consortium is a collaborative network of researchers working together on a range of large-scale studies that integrate data from 70 institutions worldwide. Organized into Working Groups that tackle questions in neuroscience, genetics, and medicine, ENIGMA studies have analyzed neuroimaging data from over 12,826 subjects. In addition, data from 12,171 individuals were provided by the CHARGE consortium for replication of findings, in a total of 24,997 subjects. By meta-analyzing results from many sites, ENIGMA has detected factors that affect the brain that no individual site could detect on its own, and that require larger numbers of subjects than any individual neuroimaging study has currently collected. ENIGMA's first project was a genome-wide association study identifying common variants in the genome associated with hippocampal volume or intracranial volume. Continuing work is exploring genetic associations with subcortical volumes (ENIGMA2) and white matter microstructure (ENIGMA-DTI). Working groups also focus on understanding how schizophrenia, bipolar illness, major depression and attention deficit/hyperactivity disorder (ADHD) affect the brain. We review the current progress of the ENIGMA Consortium, along with challenges and unexpected discoveries made on the way.
Abstract:
Rapid advances in sequencing technologies (Next Generation Sequencing, or NGS) have led to a vast increase in the quantity of available bioinformatics data, and this increasing scale presents enormous challenges to researchers seeking to identify complex interactions. This paper is concerned with the domain of transcriptional regulation and the use of visualisation to identify relationships between specific regulatory proteins (transcription factors, or TFs) and their associated target genes (TGs). We present preliminary work from an ongoing study that aims to determine the effectiveness of different visual representations and large-scale displays in supporting discovery. Following an iterative process of implementation and evaluation, representations were tested by potential users in the bioinformatics domain to determine their efficacy and to better understand the range of ad hoc practices among bioinformatics-literate users. Results from two rounds of small-scale user studies are considered; initial findings suggest that bioinformaticians require richly detailed views of TF data, features for quickly comparing TF layouts between organisms, and ways to keep track of interesting data points.