100 results for Large Data Sets
in Queensland University of Technology - ePrints Archive
Abstract:
This paper evaluates the efficiency of a number of popular corpus-based distributional models in performing discovery on very large document sets, including online collections. Literature-based discovery is the process of identifying previously unknown connections from text, often published literature, that could lead to the development of new techniques or technologies. Literature-based discovery has attracted growing research interest ever since Swanson's serendipitous discovery of the therapeutic effects of fish oil on Raynaud's disease in 1986. The successful application of distributional models in automating the identification of indirect associations underpinning literature-based discovery has been demonstrated extensively in the medical domain. However, we wish to investigate the computational complexity of distributional models for literature-based discovery on much larger document collections, as they may provide computationally tractable solutions to tasks such as predicting future disruptive innovations. In this paper we perform a computational complexity analysis on four successful corpus-based distributional models to evaluate their fit for such tasks. Our results indicate that corpus-based distributional models that store their representations in fixed dimensions provide superior efficiency on literature-based discovery tasks.
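As a hedged illustration of what "fixed dimensions" means in practice, the sketch below implements random indexing, one well-known fixed-dimension distributional model; it is not necessarily one of the four models evaluated in the paper, and the vector dimension, window size and toy corpus are assumptions made only for this example.

```python
# A minimal sketch of one fixed-dimension distributional model (random
# indexing); the models and corpora evaluated in the paper are not reproduced
# here. DIM, NONZERO, WINDOW and the toy corpus are assumptions.
import numpy as np
from collections import defaultdict

DIM, NONZERO, WINDOW = 300, 10, 2
rng = np.random.default_rng(42)

def index_vector():
    """Sparse ternary random vector: a fixed-width fingerprint for a context word."""
    v = np.zeros(DIM)
    pos = rng.choice(DIM, NONZERO, replace=False)
    v[pos] = rng.choice([-1.0, 1.0], NONZERO)
    return v

index_vectors = defaultdict(index_vector)          # one fingerprint per context word
term_vectors = defaultdict(lambda: np.zeros(DIM))  # accumulated term representations

corpus = [["fish", "oil", "treats", "raynaud", "disease"],
          ["blood", "viscosity", "affects", "raynaud", "disease"]]

# Accumulate context fingerprints into fixed-width term vectors: memory stays
# O(vocabulary * DIM) regardless of how large the document collection grows.
for sentence in corpus:
    for i, word in enumerate(sentence):
        for j in range(max(0, i - WINDOW), min(len(sentence), i + WINDOW + 1)):
            if j != i:
                term_vectors[word] += index_vectors[sentence[j]]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

print(cosine(term_vectors["fish"], term_vectors["viscosity"]))
```

Because every term vector has the same fixed width, indirect (second-order) associations can be scored with simple vector comparisons whose cost does not grow with corpus size, which is the efficiency property the abstract highlights.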
Abstract:
Rapid advances in sequencing technologies (Next Generation Sequencing or NGS) have led to a vast increase in the quantity of bioinformatics data available, with this increasing scale presenting enormous challenges to researchers seeking to identify complex interactions. This paper is concerned with the domain of transcriptional regulation, and the use of visualisation to identify relationships between specific regulatory proteins (the transcription factors or TFs) and their associated target genes (TGs). We present preliminary work from an ongoing study which aims to determine the effectiveness of different visual representations and large-scale displays in supporting discovery. Following an iterative process of implementation and evaluation, representations were tested by potential users in the bioinformatics domain to determine their efficacy, and to better understand the range of ad hoc practices among bioinformatics-literate users. Results from two rounds of small-scale user studies are considered, with initial findings suggesting that bioinformaticians require richly detailed views of TF data, features to compare TF layouts between organisms quickly, and ways to keep track of interesting data points.
Abstract:
Big data is big news in almost every sector, including crisis communication. However, not everyone has access to big data, and even when we do, we often lack the necessary tools to analyze and cross-reference such a large data set. This paper therefore looks at patterns in the small data sets that we are able to collect with our current tools, to understand whether we can find actionable information in what we already have. We analyzed 164,390 tweets collected during the 2011 earthquake to find out what type of location-specific information people mention in their tweets and when they mention it. Based on our analysis, we find that even a small data set, containing far less data than a big data set, can be useful for quickly identifying priority disaster-specific areas.
Abstract:
In this paper we present large, accurately calibrated and time-synchronized data sets, gathered outdoors in controlled and variable environmental conditions, using an unmanned ground vehicle (UGV) equipped with a wide variety of sensors. These include four 2D laser scanners, a radar scanner, a color camera and an infrared camera. The paper provides a full description of the system used for data collection and of the types of environments and conditions in which these data sets were gathered, which include the presence of airborne dust, smoke and rain.
Abstract:
Data associated with germplasm collections are typically large and multivariate, with a considerable number of descriptors measured on each of many accessions. Pattern analysis methods of clustering and ordination have been identified as techniques for statistically evaluating the available diversity in germplasm data. While used in many studies, the approaches have not dealt explicitly with the computational consequences of large data sets (i.e. greater than 5000 accessions). To consider the application of these techniques to germplasm evaluation data, 11328 accessions of groundnut (Arachis hypogaea L.) from the International Research Institute for the Semi-Arid Tropics, Andhra Pradesh, India were examined. Data for nine quantitative descriptors measured in the rainy and post-rainy growing seasons were used. The ordination technique of principal component analysis was used to reduce the dimensionality of the germplasm data. The identification of phenotypically similar groups of accessions within large-scale data via the computationally intensive hierarchical clustering techniques was not feasible, and non-hierarchical techniques had to be used. Finite mixture models that maximise the likelihood of an accession belonging to a cluster were used to cluster the accessions in this collection. The patterns of response for the different growing seasons were found to be highly correlated. However, in relating the results to passport and other characterisation and evaluation descriptors, the observed patterns did not appear to be related to taxonomy or any other well-known characteristics of groundnut.
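The workflow described above (standardise the quantitative descriptors, reduce dimensionality with principal component analysis, then cluster with a finite mixture model rather than hierarchical clustering) can be sketched as follows; the synthetic data, the scikit-learn implementation and the parameter choices are assumptions standing in for the ICRISAT groundnut descriptors, not the original analysis.

```python
# Illustrative sketch: PCA for dimension reduction followed by a finite
# (Gaussian) mixture model for non-hierarchical clustering of accessions.
# The data below are synthetic stand-ins for ~11,000 accessions x 9 descriptors.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Four synthetic "growth habit" groups of 2832 accessions each (11328 total).
X = np.vstack([rng.normal(loc=m, size=(2832, 9)) for m in (0.0, 2.0, 4.0, 6.0)])

Xs = StandardScaler().fit_transform(X)          # put descriptors on a common scale
scores = PCA(n_components=3).fit_transform(Xs)  # reduce dimensionality

# Mixture maximum likelihood clustering: each accession receives a posterior
# probability of cluster membership; this remains feasible for >5000 accessions
# where all-pairs hierarchical clustering is not.
gmm = GaussianMixture(n_components=4, random_state=0).fit(scores)
labels = gmm.predict(scores)
print(np.bincount(labels))                      # cluster sizes
```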
Abstract:
As a sequel to a paper that dealt with the analysis of two-way quantitative data in large germplasm collections, this paper presents analytical methods appropriate for two-way data matrices consisting of mixed data types, namely ordered multicategory and quantitative data types. While various pattern analysis techniques have been identified as suitable for analysis of the mixed data types which occur in germplasm collections, the clustering and ordination methods used often cannot deal explicitly with the computational consequences of large data sets (i.e. greater than 5000 accessions) with incomplete information. However, it is shown that the ordination technique of principal component analysis and the mixture maximum likelihood method of clustering can be employed to achieve such analyses. Germplasm evaluation data for 11436 accessions of groundnut (Arachis hypogaea L.) from the International Research Institute of the Semi-Arid Tropics, Andhra Pradesh, India were examined. Data for nine quantitative descriptors measured in the post-rainy season and five ordered multicategory descriptors were used. Pattern analysis results generally indicated that the accessions could be distinguished into four regions along the continuum of growth habit (or plant erectness). Interpretation of accession membership in these regions was found to be consistent with taxonomic information, such as subspecies. Each growth habit region contained accessions from three of the most common groundnut botanical varieties. This implies that within each of the habit types there is the full range of expression for the other descriptors used in the analysis. Using these types of insights, the patterns of variability in germplasm collections can provide scientists with valuable information for their plant improvement programs.
Abstract:
Buildings are key mediators between human activity and the environment around them, but details of energy usage and activity in buildings are often poorly communicated and understood. ECOS is an Eco-Visualization project that aims to contextualize the energy generation and consumption of a green building in a variety of different climates. The ECOS project is being developed for a large public interactive space installed in the new Science and Engineering Centre of the Queensland University of Technology that is dedicated to delivering interactive science education content to the public. This paper focuses on how design can develop ICT solutions from large data sets to create meaningful engagement with environmental data.
Abstract:
Traffic congestion has a significant impact on the economy and environment. Encouraging the use of multimodal transport (public transport, bicycle, park’n’ride, etc.) has been identified by traffic operators as a good strategy to tackle congestion issues and their detrimental environmental impacts. A multi-modal and multi-objective trip planner provides users with various multi-modal options optimised on the objectives they prefer (cheapest, fastest, safest, etc.) and has the potential to reduce congestion on both a temporal and spatial scale. The computation of multi-modal and multi-objective trips is a complicated mathematical problem, as it must integrate and utilize a diverse range of large data sets, including both road network information and public transport schedules, as well as optimise for a number of competing objectives, where fully optimising for one objective, such as travel time, can adversely affect other objectives, such as cost. The relationship between these objectives can also be quite subjective, as their priorities will vary from user to user. This paper will first outline the various data requirements and formats that are needed for the multi-modal multi-objective trip planner to operate, including static information about the physical infrastructure within Brisbane as well as real-time and historical data to predict traffic flow on the road network and the status of public transport. It will then present information on the graph data structures representing the road and public transport networks within Brisbane that are used in the trip planner to calculate optimal routes. This will allow for an investigation into the various shortest path algorithms that have been researched over the last few decades, and provide a foundation for the construction of the Multi-modal Multi-objective Trip Planner through the development of innovative new algorithms that can operate on the large, diverse data sets and competing objectives.
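As a simplified illustration of route computation over a multimodal graph, the sketch below runs a weighted-sum Dijkstra search on a toy network; the node names, edge attributes and the scalarised time-plus-cost objective are assumptions made for this example, and the planner described above handles competing objectives in a more general (and not purely weighted-sum) way.

```python
# Toy multimodal graph: each edge carries (neighbour, travel time in minutes,
# fare in dollars, mode). Names and values are illustrative assumptions.
import heapq

graph = {
    "home":     [("bus_stop", 5, 0.0, "walk"), ("cbd", 35, 4.0, "car")],
    "bus_stop": [("cbd", 25, 3.5, "bus")],
    "cbd":      [],
}

def best_route(graph, start, goal, time_weight=1.0, cost_weight=5.0):
    """Dijkstra over a weighted sum of travel time and monetary cost."""
    frontier = [(0.0, start, [])]          # (accumulated score, node, path so far)
    seen = set()
    while frontier:
        score, node, path = heapq.heappop(frontier)
        if node in seen:
            continue
        seen.add(node)
        if node == goal:
            return score, path
        for nbr, minutes, fare, mode in graph[node]:
            if nbr not in seen:
                step = time_weight * minutes + cost_weight * fare
                heapq.heappush(frontier, (score + step, nbr, path + [(mode, nbr)]))
    return None

print(best_route(graph, "home", "cbd"))
```

Changing the weights shifts the recommended option between modes, which is one simple way of exposing the subjective trade-off between objectives that the abstract describes.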
Abstract:
Trees are capable of portraying the semi-structured data which is common in the web domain. Finding similarities between trees is essential for several applications that deal with semi-structured data. Existing similarity methods examine a pair of trees by comparing their nodes and paths, and find the similarity between them. However, these methods give unfavorable results for unordered tree data, and their computation is NP-hard or MAX-SNP hard. In this paper, we present a novel method that first encodes a tree with an optimal traversal approach and then uses this encoding to model the tree with an equivalent matrix representation, allowing the similarity between unordered trees to be found efficiently. Empirical analysis shows that the proposed method is able to achieve high accuracy even on large data sets.
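The abstract does not spell out the paper's encoding or matrix representation, so the following is only a generic, hypothetical illustration of an order-invariant tree comparison (a Jaccard-style score over multisets of parent-child label pairs); it is not the authors' method.

```python
# Generic order-invariant tree comparison: each tree is reduced to a multiset
# of (parent label, child label) edges, so reordering siblings does not change
# the score. Illustrative only; not the method proposed in the paper.
from collections import Counter

def edge_profile(tree, label="label", children="children"):
    """Collect (parent label, child label) pairs; child order is ignored."""
    pairs = Counter()
    stack = [tree]
    while stack:
        node = stack.pop()
        for child in node.get(children, []):
            pairs[(node[label], child[label])] += 1
            stack.append(child)
    return pairs

def similarity(t1, t2):
    p1, p2 = edge_profile(t1), edge_profile(t2)
    inter = sum((p1 & p2).values())        # multiset intersection
    union = sum((p1 | p2).values())        # multiset union
    return inter / union if union else 1.0

a = {"label": "html", "children": [{"label": "body", "children": [{"label": "p"}]}]}
b = {"label": "html", "children": [{"label": "body", "children": [{"label": "div"}]}]}
print(similarity(a, b))
```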
Abstract:
Analytically or computationally intractable likelihood functions can arise in complex statistical inferential problems, making them inaccessible to standard Bayesian inferential methods. Approximate Bayesian computation (ABC) methods address such inferential problems by replacing direct likelihood evaluations with repeated sampling from the model. ABC methods have been predominantly applied to parameter estimation problems and less to model choice problems, due to the added difficulty of handling multiple model spaces. The ABC algorithm proposed here addresses model choice problems by extending Fearnhead and Prangle (2012, Journal of the Royal Statistical Society, Series B 74, 1–28), in which the posterior means of the model parameters, estimated through regression, formed the summary statistics used in the discrepancy measure. An additional stepwise multinomial logistic regression is performed on the model indicator variable in the regression step, and the estimated model probabilities are incorporated into the set of summary statistics for model choice purposes. A reversible jump Markov chain Monte Carlo step is also included in the algorithm to increase model diversity for thorough exploration of the model space. The algorithm was applied to a validation example to demonstrate its robustness across a wide range of true model probabilities. Its subsequent use in three pathogen transmission examples of varying complexity illustrates the utility of the algorithm in inferring a preference for particular transmission models for the pathogens.
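A minimal sketch of the regression-based summary-statistic idea for ABC model choice is given below. It deliberately simplifies the algorithm described above: only two toy models are used, the stepwise multinomial logistic regression is replaced by a plain binary logistic regression, and the reversible jump MCMC step is omitted; the simulators, priors, summary functions and tolerance are all assumptions made for illustration.

```python
# Illustrative ABC model choice with a regression-based summary statistic
# (in the spirit of Fearnhead & Prangle, 2012), NOT the paper's full algorithm.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def simulate(model, n=50):
    """Toy simulators: model 0 = Poisson counts, model 1 = geometric counts."""
    if model == 0:
        lam = rng.gamma(2.0, 2.0)           # prior draw for model 0
        return rng.poisson(lam, n)
    p = rng.beta(2.0, 2.0)                  # prior draw for model 1
    return rng.geometric(p, n) - 1

def raw_summaries(x):
    return np.array([x.mean(), x.var(), (x == 0).mean()])

# Pilot run: regress the model indicator on raw summaries, so that the
# estimated model probability can serve as the summary statistic for ABC.
pilot_models = rng.integers(0, 2, 5000)
pilot_S = np.array([raw_summaries(simulate(m)) for m in pilot_models])
clf = LogisticRegression(max_iter=1000).fit(pilot_S, pilot_models)

def summary(x):
    """Estimated probability of model 1, used as the ABC summary statistic."""
    return clf.predict_proba(raw_summaries(x).reshape(1, -1))[0, 1]

# ABC rejection step using the regression-based summary.
observed = rng.poisson(3.0, 50)             # stand-in for the observed data
s_obs = summary(observed)

draws = rng.integers(0, 2, 20000)           # model indicators drawn from the prior
dist = np.array([abs(summary(simulate(m)) - s_obs) for m in draws])
accepted = draws[dist <= np.quantile(dist, 0.01)]   # keep the closest 1%
print("Approximate posterior P(model 1 | data):", accepted.mean())
```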
Abstract:
Motor vehicles are a major source of gaseous and particulate matter pollution in urban areas, particularly of ultrafine-sized particles (diameters < 0.1 µm). Exposure to particulate matter has been found to be associated with serious health effects, including respiratory and cardiovascular disease, and mortality. Particle emissions generated by motor vehicles span a very broad size range (from around 0.003-10 µm) and are measured as different subsets of particle mass concentrations or particle number count. However, there are scientific challenges in analysing and interpreting the large data sets on motor vehicle emission factors, and little understanding is available of the application of different particle metrics as a basis for air quality regulation. To date, a comprehensive inventory covering the broad size range of particles emitted by motor vehicles, and which includes particle number, does not exist anywhere in the world. This thesis covers research related to four important and interrelated aspects pertaining to particulate matter generated by motor vehicle fleets: the derivation of suitable particle emission factors for use in transport modelling and health impact assessments; the quantification of motor vehicle particle emission inventories; the investigation of modality within particle size distributions as a potential basis for developing air quality regulation; and the review and synthesis of current knowledge on ultrafine particles as it relates to motor vehicles, together with the application of these aspects to the quantification, control and management of motor vehicle particle emissions. In order to quantify emissions in terms of a comprehensive inventory covering the full size range of particles emitted by motor vehicle fleets, it was necessary to derive a suitable set of particle emission factors for different vehicle and road type combinations for particle number, particle volume, PM1, PM2.5 and PM10 (mass concentrations of particles with aerodynamic diameters < 1 µm, < 2.5 µm and < 10 µm respectively). The very large data set of emission factors analysed in this study was sourced from measurement studies conducted in developed countries, and hence the derived set of emission factors is suitable for preparing inventories in other urban regions of the developed world. These emission factors are particularly useful for regions that lack the measurement data needed to derive emission factors, or where experimental data are available but of insufficient scope. The comprehensive particle emissions inventory presented in this thesis is the first published inventory of tailpipe particle emissions prepared for a motor vehicle fleet, and includes the quantification of particle emissions covering the full size range of particles emitted by vehicles, based on measurement data. The inventory quantified particle emissions measured in terms of particle number and different particle mass size fractions. It was developed for the urban South-East Queensland fleet in Australia, and included testing the particle emission implications of future scenarios for different passenger and freight travel demand. The thesis also presents evidence of the usefulness of examining modality within particle size distributions as a basis for developing air quality regulations, and finds evidence to support the relevance of introducing a new PM1 mass ambient air quality standard for the majority of environments worldwide.
The study found that a combination of PM1 and PM10 standards is likely to be a more discerning and suitable set of ambient air quality standards for controlling particles emitted from combustion and mechanically generated sources, such as motor vehicles, than the current mass standards of PM2.5 and PM10. The study also reviewed and synthesized existing knowledge on ultrafine particles, with a specific focus on those originating from motor vehicles. It found that motor vehicles are significant contributors to both air pollution and ultrafine particles in urban areas, and that a standardized measurement procedure is not currently available for ultrafine particles. The review found that discrepancies exist between the outcomes of different instrumentation used to measure ultrafine particles; that few data are available on ultrafine particle chemistry and composition, long-term monitoring, or the characterization of their spatial and temporal distribution in urban areas; and that no inventories for particle number are available for motor vehicle fleets. This knowledge is critical for epidemiological studies and exposure-response assessment. Conclusions from this review included the recommendation that ultrafine particles in populated urban areas be considered a likely target for future air quality regulation based on particle number, due to their potential impacts on the environment. The research in this PhD thesis successfully integrated the elements needed to quantify and manage motor vehicle fleet emissions, and its novelty relates to the combining of expertise from two distinct disciplines: aerosol science and transport modelling. The new knowledge and concepts developed in this PhD research provide previously unavailable data and methods which can be used to develop comprehensive, size-resolved inventories of motor vehicle particle emissions, and air quality regulations to control particle emissions to protect the health and well-being of current and future generations.
Abstract:
In this paper we present pyktree, an implementation of the K-tree algorithm in the Python programming language. The K-tree algorithm provides highly balanced search trees for vector quantization that scale up to very large data sets. Pyktree is highly modular and well suited to rapid prototyping of novel distance measures and centroid representations. It is easy to install and provides a Python package for library use as well as command-line tools.
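Pyktree's actual API is not described in this abstract, so the sketch below only illustrates the general idea behind tree-structured vector quantization (a small tree of k-means codebooks) using scikit-learn; it is not pyktree code, and a real K-tree is built incrementally and kept height-balanced rather than fitted in two fixed levels as here.

```python
# Illustrative two-level tree of k-means codebooks for vector quantization.
# NOT pyktree's API; data shapes and branching factor are assumptions.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 16))            # stand-in vectors to be quantized

# Top level: partition the data into 8 coarse cells.
top = KMeans(n_clusters=8, n_init=4, random_state=0).fit(X)
# Second level: a finer codebook inside each coarse cell.
leaves = {c: KMeans(n_clusters=8, n_init=4, random_state=0).fit(X[top.labels_ == c])
          for c in range(8)}

def quantize(v):
    """Route a vector down the tree: nearest top centroid, then nearest leaf centroid."""
    c = int(top.predict(v.reshape(1, -1))[0])
    leaf = int(leaves[c].predict(v.reshape(1, -1))[0])
    return leaves[c].cluster_centers_[leaf]

print(quantize(X[0])[:4])
```

The point of the tree structure is that each query only compares against the centroids along one root-to-leaf path, rather than against every codebook vector, which is what lets this style of quantizer scale to very large data sets.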
Abstract:
Background: Phylogeographic reconstruction of some bacterial populations is hindered by low diversity coupled with high levels of lateral gene transfer. A comparison of recombination levels and diversity at seven housekeeping genes for eleven bacterial species, most of which are commonly cited as having high levels of lateral gene transfer, shows that the relative contribution of homologous recombination versus mutation for Burkholderia pseudomallei is over two times higher than for Streptococcus pneumoniae, and is thus the highest value yet reported in bacteria. Despite the potential for homologous recombination to increase diversity, B. pseudomallei exhibits a relative lack of diversity at these loci. In these situations, whole genome genotyping of orthologous shared single nucleotide polymorphism loci, discovered using next generation sequencing technologies, can provide very large data sets capable of estimating core phylogenetic relationships. We compared and searched 43 whole genome sequences of B. pseudomallei and its closest relatives for single nucleotide polymorphisms in orthologous shared regions to use in phylogenetic reconstruction.

Results: Bayesian phylogenetic analyses of >14,000 single nucleotide polymorphisms yielded completely resolved trees for these 43 strains with high levels of statistical support. These results enable a better understanding of a separate analysis of population differentiation among >1,700 B. pseudomallei isolates as defined by sequence data from seven housekeeping genes. We analyzed this larger data set for population structure and allele sharing that can be attributed to lateral gene transfer. Our results suggest that, despite an almost panmictic population, we can detect two distinct populations of B. pseudomallei that conform to biogeographic patterns found in many plant and animal species, that is, a separation along Wallace's Line, a biogeographic boundary between Southeast Asia and Australia.

Conclusion: We describe an Australian origin for B. pseudomallei, characterized by a single introduction event into Southeast Asia during a recent glacial period, and variable levels of lateral gene transfer within populations. These patterns provide insights into mechanisms of genetic diversification in B. pseudomallei and its closest relatives, and provide a framework for integrating the traditionally separate fields of population genetics and phylogenetics for other bacterial species with high levels of lateral gene transfer.