203 resultados para Large Data


Relevância:

100.00% 100.00%

Publicador:

Resumo:

This paper evaluates the efficiency of a number of popular corpus-based distributional models in performing discovery on very large document sets, including online collections. Literature-based discovery is the process of identifying previously unknown connections from text, often published literature, that could lead to the development of new techniques or technologies. Literature-based discovery has attracted growing research interest ever since Swanson's serendipitous discovery of the therapeutic effects of fish oil on Raynaud's disease in 1986. The successful application of distributional models in automating the identification of indirect associations underpinning literature-based discovery has been heavily demonstrated in the medical domain. However, we wish to investigate the computational complexity of distributional models for literature-based discovery on much larger document collections, as they may provide computationally tractable solutions to tasks including, predicting future disruptive innovations. In this paper we perform a computational complexity analysis on four successful corpus-based distributional models to evaluate their fit for such tasks. Our results indicate that corpus-based distributional models that store their representations in fixed dimensions provide superior efficiency on literature-based discovery tasks.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Data associated with germplasm collections are typically large and multivariate with a considerable number of descriptors measured on each of many accessions. Pattern analysis methods of clustering and ordination have been identified as techniques for statistically evaluating the available diversity in germplasm data. While used in many studies, the approaches have not dealt explicitly with the computational consequences of large data sets (i.e. greater than 5000 accessions). To consider the application of these techniques to germplasm evaluation data, 11328 accessions of groundnut (Arachis hypogaea L) from the International Research Institute for the Semi-Arid Tropics, Andhra Pradesh, India were examined. Data for nine quantitative descriptors measured in the rainy and post-rainy growing seasons were used. The ordination technique of principal component analysis was used to reduce the dimensionality of the germplasm data. The identification of phenotypically similar groups of accessions within large scale data via the computationally intensive hierarchical clustering techniques was not feasible and non-hierarchical techniques had to be used. Finite mixture models that maximise the likelihood of an accession belonging to a cluster were used to cluster the accessions in this collection. The patterns of response for the different growing seasons were found to be highly correlated. However, in relating the results to passport and other characterisation and evaluation descriptors, the observed patterns did not appear to be related to taxonomy or any other well known characteristics of groundnut.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

As a sequel to a paper that dealt with the analysis of two-way quantitative data in large germplasm collections, this paper presents analytical methods appropriate for two-way data matrices consisting of mixed data types, namely, ordered multicategory and quantitative data types. While various pattern analysis techniques have been identified as suitable for analysis of the mixed data types which occur in germplasm collections, the clustering and ordination methods used often can not deal explicitly with the computational consequences of large data sets (i.e. greater than 5000 accessions) with incomplete information. However, it is shown that the ordination technique of principal component analysis and the mixture maximum likelihood method of clustering can be employed to achieve such analyses. Germplasm evaluation data for 11436 accessions of groundnut (Arachis hypogaea L.) from the International Research Institute of the Semi-Arid Tropics, Andhra Pradesh, India were examined. Data for nine quantitative descriptors measured in the post-rainy season and five ordered multicategory descriptors were used. Pattern analysis results generally indicated that the accessions could be distinguished into four regions along the continuum of growth habit (or plant erectness). Interpretation of accession membership in these regions was found to be consistent with taxonomic information, such as subspecies. Each growth habit region contained accessions from three of the most common groundnut botanical varieties. This implies that within each of the habit types there is the full range of expression for the other descriptors used in the analysis. Using these types of insights, the patterns of variability in germplasm collections can provide scientists with valuable information for their plant improvement programs.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Real-Time Kinematic (RTK) positioning is a technique used to provide precise positioning services at centimetre accuracy level in the context of Global Navigation Satellite Systems (GNSS). While a Network-based RTK (N-RTK) system involves multiple continuously operating reference stations (CORS), the simplest form of a NRTK system is a single-base RTK. In Australia there are several NRTK services operating in different states and over 1000 single-base RTK systems to support precise positioning applications for surveying, mining, agriculture, and civil construction in regional areas. Additionally, future generation GNSS constellations, including modernised GPS, Galileo, GLONASS, and Compass, with multiple frequencies have been either developed or will become fully operational in the next decade. A trend of future development of RTK systems is to make use of various isolated operating network and single-base RTK systems and multiple GNSS constellations for extended service coverage and improved performance. Several computational challenges have been identified for future NRTK services including: • Multiple GNSS constellations and multiple frequencies • Large scale, wide area NRTK services with a network of networks • Complex computation algorithms and processes • Greater part of positioning processes shifting from user end to network centre with the ability to cope with hundreds of simultaneous users’ requests (reverse RTK) There are two major requirements for NRTK data processing based on the four challenges faced by future NRTK systems, expandable computing power and scalable data sharing/transferring capability. This research explores new approaches to address these future NRTK challenges and requirements using the Grid Computing facility, in particular for large data processing burdens and complex computation algorithms. A Grid Computing based NRTK framework is proposed in this research, which is a layered framework consisting of: 1) Client layer with the form of Grid portal; 2) Service layer; 3) Execution layer. The user’s request is passed through these layers, and scheduled to different Grid nodes in the network infrastructure. A proof-of-concept demonstration for the proposed framework is performed in a five-node Grid environment at QUT and also Grid Australia. The Networked Transport of RTCM via Internet Protocol (Ntrip) open source software is adopted to download real-time RTCM data from multiple reference stations through the Internet, followed by job scheduling and simplified RTK computing. The system performance has been analysed and the results have preliminarily demonstrated the concepts and functionality of the new NRTK framework based on Grid Computing, whilst some aspects of the performance of the system are yet to be improved in future work.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Data mining techniques extract repeated and useful patterns from a large data set that in turn are utilized to predict the outcome of future events. The main purpose of the research presented in this paper is to investigate data mining strategies and develop an efficient framework for multi-attribute project information analysis to predict the performance of construction projects. The research team first reviewed existing data mining algorithms, applied them to systematically analyze a large project data set collected by the survey, and finally proposed a data-mining-based decision support framework for project performance prediction. To evaluate the potential of the framework, a case study was conducted using data collected from 139 capital projects and analyzed the relationship between use of information technology and project cost performance. The study results showed that the proposed framework has potential to promote fast, easy to use, interpretable, and accurate project data analysis.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Buildings are key mediators between human activity and the environment around them, but details of energy usage and activity in buildings is often poorly communicated and understood. ECOS is an Eco-Visualization project that aims to contextualize the energy generation and consumption of a green building in a variety of different climates. The ECOS project is being developed for a large public interactive space installed in the new Science and Engineering Centre of the Queensland University of Technology that is dedicated to delivering interactive science education content to the public. This paper focuses on how design can develop ICT solutions from large data sets to create meaningful engagement with environmental data.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Buildings are key mediators between human activity and the environment around them, but details of energy usage and activity in buildings is often poorly communicated and understood. ECOS is an Eco-Visualization project that aims to contextualize the energy generation and consumption of a green building in a variety of different climates. The ECOS project is being developed for a large public interactive space installed in the new Science and Engineering Centre of the Queensland University of Technology that is dedicated to delivering interactive science education content to the public. This paper focuses on how design can develop ICT solutions from large data sets to create meaningful engagement with environmental data.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Traffic congestion has a significant impact on the economy and environment. Encouraging the use of multimodal transport (public transport, bicycle, park’n’ride, etc.) has been identified by traffic operators as a good strategy to tackle congestion issues and its detrimental environmental impacts. A multi-modal and multi-objective trip planner provides users with various multi-modal options optimised on objectives that they prefer (cheapest, fastest, safest, etc) and has a potential to reduce congestion on both a temporal and spatial scale. The computation of multi-modal and multi-objective trips is a complicated mathematical problem, as it must integrate and utilize a diverse range of large data sets, including both road network information and public transport schedules, as well as optimising for a number of competing objectives, where fully optimising for one objective, such as travel time, can adversely affect other objectives, such as cost. The relationship between these objectives can also be quite subjective, as their priorities will vary from user to user. This paper will first outline the various data requirements and formats that are needed for the multi-modal multi-objective trip planner to operate, including static information about the physical infrastructure within Brisbane as well as real-time and historical data to predict traffic flow on the road network and the status of public transport. It will then present information on the graph data structures representing the road and public transport networks within Brisbane that are used in the trip planner to calculate optimal routes. This will allow for an investigation into the various shortest path algorithms that have been researched over the last few decades, and provide a foundation for the construction of the Multi-modal Multi-objective Trip Planner by the development of innovative new algorithms that can operate the large diverse data sets and competing objectives.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Big data is big news in almost every sector including crisis communication. However, not everyone has access to big data and even if we have access to big data, we often do not have necessary tools to analyze and cross reference such a large data set. Therefore this paper looks at patterns in small data sets that we have ability to collect with our current tools to understand if we can find actionable information from what we already have. We have analyzed 164390 tweets collected during 2011 earthquake to find out what type of location specific information people mention in their tweet and when do they talk about that. Based on our analysis we find that even a small data set that has far less data than a big data set can be useful to find priority disaster specific areas quickly.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

As of June 2009, 361 genome-wide association studies (GWAS) had been referenced by the HuGE database. GWAS require DNA from many thousands of individuals, relying on suitable DNA collections. We recently performed a multiple sclerosis (MS) GWAS where a substantial component of the cases (24%) had DNA derived from saliva. Genotyping was done on the Illumina genotyping platform using the Infinium Hap370CNV DUO microarray. Additionally, we genotyped 10 individuals in duplicate using both saliva- and blood-derived DNA. The performance of blood- versus saliva-derived DNA was compared using genotyping call rate, which reflects both the quantity and quality of genotyping per sample and the “GCScore,” an Illumina genotyping quality score, which is a measure of DNA quality. We also compared genotype calls and GCScores for the 10 sample pairs. Call rates were assessed for each sample individually. For the GWAS samples, we compared data according to source of DNA and center of origin. We observed high concordance in genotyping quality and quantity between the paired samples and minimal loss of quality and quantity of DNA in the saliva samples in the large GWAS sample, with the blood samples showing greater variation between centers of origin. This large data set highlights the usefulness of saliva DNA for genotyping, especially in high-density single-nucleotide polymorphism microarray studies such as GWAS.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Trees are capable of portraying the semi-structured data which is common in web domain. Finding similarities between trees is mandatory for several applications that deal with semi-structured data. Existing similarity methods examine a pair of trees by comparing through nodes and paths of two trees, and find the similarity between them. However, these methods provide unfavorable results for unordered tree data and result in yielding NP-hard or MAX-SNP hard complexity. In this paper, we present a novel method that encodes a tree with an optimal traversing approach first, and then, utilizes it to model the tree with its equivalent matrix representation for finding similarity between unordered trees efficiently. Empirical analysis shows that the proposed method is able to achieve high accuracy even on the large data sets.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

The concept of big data has already outperformed traditional data management efforts in almost all industries. Other instances it has succeeded in obtaining promising results that provide value from large-scale integration and analysis of heterogeneous data sources for example Genomic and proteomic information. Big data analytics have become increasingly important in describing the data sets and analytical techniques in software applications that are so large and complex due to its significant advantages including better business decisions, cost reduction and delivery of new product and services [1]. In a similar context, the health community has experienced not only more complex and large data content, but also information systems that contain a large number of data sources with interrelated and interconnected data attributes. That have resulted in challenging, and highly dynamic environments leading to creation of big data with its enumerate complexities, for instant sharing of information with the expected security requirements of stakeholders. When comparing big data analysis with other sectors, the health sector is still in its early stages. Key challenges include accommodating the volume, velocity and variety of healthcare data with the current deluge of exponential growth. Given the complexity of big data, it is understood that while data storage and accessibility are technically manageable, the implementation of Information Accountability measures to healthcare big data might be a practical solution in support of information security, privacy and traceability measures. Transparency is one important measure that can demonstrate integrity which is a vital factor in the healthcare service. Clarity about performance expectations is considered to be another Information Accountability measure which is necessary to avoid data ambiguity and controversy about interpretation and finally, liability [2]. According to current studies [3] Electronic Health Records (EHR) are key information resources for big data analysis and is also composed of varied co-created values [3]. Common healthcare information originates from and is used by different actors and groups that facilitate understanding of the relationship for other data sources. Consequently, healthcare services often serve as an integrated service bundle. Although a critical requirement in healthcare services and analytics, it is difficult to find a comprehensive set of guidelines to adopt EHR to fulfil the big data analysis requirements. Therefore as a remedy, this research work focus on a systematic approach containing comprehensive guidelines with the accurate data that must be provided to apply and evaluate big data analysis until the necessary decision making requirements are fulfilled to improve quality of healthcare services. Hence, we believe that this approach would subsequently improve quality of life.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Queensland University of Technology (QUT) is faced with a rapidly growing research agenda built upon a strategic research capacity-building program. This presentation will outline the results of a project that has recently investigated QUT’s research support requirements and which has developed a model for the support of eResearch across the university. QUT’s research building strategy has produced growth at the faculty level and within its research institutes. This increased research activity is pushing the need for university-wide eResearch platforms capable of providing infrastructure and support in areas such as collaboration, data, networking, authentication and authorisation, workflows and the grid. One of the driving forces behind the investigation is data-centric nature of modern research. It is now critical that researchers have access to supported infrastructure that allows the collection, analysis, aggregation and sharing of large data volumes for exploration and mining in order to gain new insights and to generate new knowledge. However, recent surveys into current research data management practices by the Australian Partnership for Sustainable Repositories (APSR) and by QUT itself, has revealed serious shortcomings in areas such as research data management, especially its long term maintenance for reuse and authoritative evidence of research findings. While these internal university pressures are building, at the same time there are external pressures that are magnifying them. For example, recent compliance guidelines from bodies such as the ARC, and NHMRC and Universities Australia indicate that institutions need to provide facilities for the safe and secure storage of research data along with a surrounding set of policies, on its retention, ownership and accessibility. The newly formed Australian National Data Service (ANDS) is developing strategies and guidelines for research data management and research institutions are a central focus, responsible for managing and storing institutional data on platforms that can be federated nationally and internationally for wider use. For some time QUT has recognised the importance of eResearch and has been active in a number of related areas: ePrints to digitally publish research papers, grid computing portals and workflows, institutional-wide provisioning and authentication systems, and legal protocols for copyright management. QUT also has two widely recognised centres focused on fundamental research into eResearch itself: The OAK LAW project (Open Access to Knowledge) which focuses upon legal issues relating eResearch and the Microsoft QUT eResearch Centre whose goal is to accelerate scientific research discovery, through new smart software. In order to better harness all of these resources and improve research outcomes, the university recently established a project to investigate how it might better organise the support of eResearch. This presentation will outline the project outcomes, which include a flexible and sustainable eResearch support service model addressing short and longer term research needs, identification of resource requirements required to establish and sustain the service, and the development of research data management policies and implementation plans.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

An information filtering (IF) system monitors an incoming document stream to find the documents that match the information needs specified by the user profiles. To learn to use the user profiles effectively is one of the most challenging tasks when developing an IF system. With the document selection criteria better defined based on the users’ needs, filtering large streams of information can be more efficient and effective. To learn the user profiles, term-based approaches have been widely used in the IF community because of their simplicity and directness. Term-based approaches are relatively well established. However, these approaches have problems when dealing with polysemy and synonymy, which often lead to an information overload problem. Recently, pattern-based approaches (or Pattern Taxonomy Models (PTM) [160]) have been proposed for IF by the data mining community. These approaches are better at capturing sematic information and have shown encouraging results for improving the effectiveness of the IF system. On the other hand, pattern discovery from large data streams is not computationally efficient. Also, these approaches had to deal with low frequency pattern issues. The measures used by the data mining technique (for example, “support” and “confidences”) to learn the profile have turned out to be not suitable for filtering. They can lead to a mismatch problem. This thesis uses the rough set-based reasoning (term-based) and pattern mining approach as a unified framework for information filtering to overcome the aforementioned problems. This system consists of two stages - topic filtering and pattern mining stages. The topic filtering stage is intended to minimize information overloading by filtering out the most likely irrelevant information based on the user profiles. A novel user-profiles learning method and a theoretical model of the threshold setting have been developed by using rough set decision theory. The second stage (pattern mining) aims at solving the problem of the information mismatch. This stage is precision-oriented. A new document-ranking function has been derived by exploiting the patterns in the pattern taxonomy. The most likely relevant documents were assigned higher scores by the ranking function. Because there is a relatively small amount of documents left after the first stage, the computational cost is markedly reduced; at the same time, pattern discoveries yield more accurate results. The overall performance of the system was improved significantly. The new two-stage information filtering model has been evaluated by extensive experiments. Tests were based on the well-known IR bench-marking processes, using the latest version of the Reuters dataset, namely, the Reuters Corpus Volume 1 (RCV1). The performance of the new two-stage model was compared with both the term-based and data mining-based IF models. The results demonstrate that the proposed information filtering system outperforms significantly the other IF systems, such as the traditional Rocchio IF model, the state-of-the-art term-based models, including the BM25, Support Vector Machines (SVM), and Pattern Taxonomy Model (PTM).

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Motor vehicles are a major source of gaseous and particulate matter pollution in urban areas, particularly of ultrafine sized particles (diameters < 0.1 µm). Exposure to particulate matter has been found to be associated with serious health effects, including respiratory and cardiovascular disease, and mortality. Particle emissions generated by motor vehicles span a very broad size range (from around 0.003-10 µm) and are measured as different subsets of particle mass concentrations or particle number count. However, there exist scientific challenges in analysing and interpreting the large data sets on motor vehicle emission factors, and no understanding is available of the application of different particle metrics as a basis for air quality regulation. To date a comprehensive inventory covering the broad size range of particles emitted by motor vehicles, and which includes particle number, does not exist anywhere in the world. This thesis covers research related to four important and interrelated aspects pertaining to particulate matter generated by motor vehicle fleets. These include the derivation of suitable particle emission factors for use in transport modelling and health impact assessments; quantification of motor vehicle particle emission inventories; investigation of the particle characteristic modality within particle size distributions as a potential for developing air quality regulation; and review and synthesis of current knowledge on ultrafine particles as it relates to motor vehicles; and the application of these aspects to the quantification, control and management of motor vehicle particle emissions. In order to quantify emissions in terms of a comprehensive inventory, which covers the full size range of particles emitted by motor vehicle fleets, it was necessary to derive a suitable set of particle emission factors for different vehicle and road type combinations for particle number, particle volume, PM1, PM2.5 and PM1 (mass concentration of particles with aerodynamic diameters < 1 µm, < 2.5 µm and < 10 µm respectively). The very large data set of emission factors analysed in this study were sourced from measurement studies conducted in developed countries, and hence the derived set of emission factors are suitable for preparing inventories in other urban regions of the developed world. These emission factors are particularly useful for regions with a lack of measurement data to derive emission factors, or where experimental data are available but are of insufficient scope. The comprehensive particle emissions inventory presented in this thesis is the first published inventory of tailpipe particle emissions prepared for a motor vehicle fleet, and included the quantification of particle emissions covering the full size range of particles emitted by vehicles, based on measurement data. The inventory quantified particle emissions measured in terms of particle number and different particle mass size fractions. It was developed for the urban South-East Queensland fleet in Australia, and included testing the particle emission implications of future scenarios for different passenger and freight travel demand. The thesis also presents evidence of the usefulness of examining modality within particle size distributions as a basis for developing air quality regulations; and finds evidence to support the relevance of introducing a new PM1 mass ambient air quality standard for the majority of environments worldwide. The study found that a combination of PM1 and PM10 standards are likely to be a more discerning and suitable set of ambient air quality standards for controlling particles emitted from combustion and mechanically-generated sources, such as motor vehicles, than the current mass standards of PM2.5 and PM10. The study also reviewed and synthesized existing knowledge on ultrafine particles, with a specific focus on those originating from motor vehicles. It found that motor vehicles are significant contributors to both air pollution and ultrafine particles in urban areas, and that a standardized measurement procedure is not currently available for ultrafine particles. The review found discrepancies exist between outcomes of instrumentation used to measure ultrafine particles; that few data is available on ultrafine particle chemistry and composition, long term monitoring; characterization of their spatial and temporal distribution in urban areas; and that no inventories for particle number are available for motor vehicle fleets. This knowledge is critical for epidemiological studies and exposure-response assessment. Conclusions from this review included the recommendation that ultrafine particles in populated urban areas be considered a likely target for future air quality regulation based on particle number, due to their potential impacts on the environment. The research in this PhD thesis successfully integrated the elements needed to quantify and manage motor vehicle fleet emissions, and its novelty relates to the combining of expertise from two distinctly separate disciplines - from aerosol science and transport modelling. The new knowledge and concepts developed in this PhD research provide never before available data and methods which can be used to develop comprehensive, size-resolved inventories of motor vehicle particle emissions, and air quality regulations to control particle emissions to protect the health and well-being of current and future generations.