980 resultados para Automatic Analysis of Multivariate Categorical Data Sets


Relevância:

100.00% 100.00%

Publicador:

Resumo:

Our approach for knowledge presentation is based on the idea of expert system shell. At first we will build a graph shell of both possible dependencies and possible actions. Then, reasoning by means of Loglinear models, we will activate some nodes and some directed links. In this way a Bayesian network and networks presenting loglinear models are generated.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Energy consumption data are required to perform analysis, modelling, evaluation, and optimisation of energy usage in buildings. While a variety of energy consumption data sets have been examined and reported in the literature, there is a lack of a comprehensive categorisation and analysis of the available data sets. In this study, an overview of energy consumption data of buildings is provided. Three common strategies for generating energy consumption data, i.e., measurement, survey, and simulation, are described. A number of important characteristics pertaining to each strategy and the resulting data sets are discussed. In addition, a directory of energy consumption data sets of buildings is developed. The data sets are collected from either published papers or energy related organisations. The main contributions of this study include establishing a resource pertaining to energy consumption data sets and providing information related to the characteristics and availability of the respective data sets; therefore facilitating and promoting research activities in energy consumption data analysis.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Concerns regarding groundwater contamination with nitrate and the long-term sustainability of groundwater resources have prompted the development of a multi-layered three dimensional (3D) geological model to characterise the aquifer geometry of the Wairau Plain, Marlborough District, New Zealand. The 3D geological model which consists of eight litho-stratigraphic units has been subsequently used to synthesise hydrogeological and hydrogeochemical data for different aquifers in an approach that aims to demonstrate how integration of water chemistry data within the physical framework of a 3D geological model can help to better understand and conceptualise groundwater systems in complex geological settings. Multivariate statistical techniques(e.g. Principal Component Analysis and Hierarchical Cluster Analysis) were applied to groundwater chemistry data to identify hydrochemical facies which are characteristic of distinct evolutionary pathways and a common hydrologic history of groundwaters. Principal Component Analysis on hydrochemical data demonstrated that natural water-rock interactions, redox potential and human agricultural impact are the key controls of groundwater quality in the Wairau Plain. Hierarchical Cluster Analysis revealed distinct hydrochemical water quality groups in the Wairau Plain groundwater system. Visualisation of the results of the multivariate statistical analyses and distribution of groundwater nitrate concentrations in the context of aquifer lithology highlighted the link between groundwater chemistry and the lithology of host aquifers. The methodology followed in this study can be applied in a variety of hydrogeological settings to synthesise geological, hydrogeological and hydrochemical data and present them in a format readily understood by a wide range of stakeholders. This enables a more efficient communication of the results of scientific studies to the wider community.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Data associated with germplasm collections are typically large and multivariate with a considerable number of descriptors measured on each of many accessions. Pattern analysis methods of clustering and ordination have been identified as techniques for statistically evaluating the available diversity in germplasm data. While used in many studies, the approaches have not dealt explicitly with the computational consequences of large data sets (i.e. greater than 5000 accessions). To consider the application of these techniques to germplasm evaluation data, 11328 accessions of groundnut (Arachis hypogaea L) from the International Research Institute for the Semi-Arid Tropics, Andhra Pradesh, India were examined. Data for nine quantitative descriptors measured in the rainy and post-rainy growing seasons were used. The ordination technique of principal component analysis was used to reduce the dimensionality of the germplasm data. The identification of phenotypically similar groups of accessions within large scale data via the computationally intensive hierarchical clustering techniques was not feasible and non-hierarchical techniques had to be used. Finite mixture models that maximise the likelihood of an accession belonging to a cluster were used to cluster the accessions in this collection. The patterns of response for the different growing seasons were found to be highly correlated. However, in relating the results to passport and other characterisation and evaluation descriptors, the observed patterns did not appear to be related to taxonomy or any other well known characteristics of groundnut.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

We examined variability in hierarchical beta diversity across ecosystems, geographical gradients, and organism groups using multivariate spatial mixed modeling analysis of two independent data sets. The larger data set comprised reported ratios of regional species richness (RSR) to local species richness (LSR) and the second data set consisted of RSR: LSR ratios derived from nested species-area relationships. There was a negative, albeit relatively weak, relationship between beta diversity and latitude. We found only relatively subtle differences in beta diversity among the realms, yet beta diversity was lower in marine systems than in terrestrial or freshwater realms. Beta diversity varied significantly among organisms' major characteristics such as body mass, trophic position, and dispersal type in the larger data set. Organisms that disperse via seeds had highest beta diversity, and passively dispersed organisms showed the lowest beta diversity. Furthermore, autotrophs had lower beta diversity than organisms higher up the food web; omnivores and carnivores had consistently higher beta diversity. This is evidence that beta diversity is simultaneously controlled by extrinsic factors related to geography and environment, and by intrinsic factors related to organism characteristics.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

A análise isotópica tem se mostrado uma ferramenta de suma importância ao processo de rastreabilidade, no entanto, existem divergências nas análises estatísticas dos resultados, uma vez que os dados são dependentes e advindos de vários elementos químicos tais como Carbono, Hidrogênio, Oxigênio, Nitrogênio e Enxofre (CHON'S). Com o intuito de estabelecer a análise propícia para os dados de rastreabilidade em aves pela técnica de isótopos estáveis e avaliar a necessidade da análise conjunta das variáveis, foram usados dados de carbono-13 e de nitrogênio-15 de ovos (albúmen + gema) de poedeiras e músculo peitoral de frangos de corte, os quais foram submetidos à análise estatística univariada (Anova e complementada pelo teste de Tukey) e multivariada (Manova e Discriminante). Os dados foram analisados no software Minitab 16, e os resultados, consolidados na teoria, confirmam a necessidade de análise multivariada, mostrando também que a análise discriminante esclarece as dúvidas apresentadas nos resultados de outros métodos de análise comparados nesta pesquisa.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Big data is big news in almost every sector including crisis communication. However, not everyone has access to big data and even if we have access to big data, we often do not have necessary tools to analyze and cross reference such a large data set. Therefore this paper looks at patterns in small data sets that we have ability to collect with our current tools to understand if we can find actionable information from what we already have. We have analyzed 164390 tweets collected during 2011 earthquake to find out what type of location specific information people mention in their tweet and when do they talk about that. Based on our analysis we find that even a small data set that has far less data than a big data set can be useful to find priority disaster specific areas quickly.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

As a sequel to a paper that dealt with the analysis of two-way quantitative data in large germplasm collections, this paper presents analytical methods appropriate for two-way data matrices consisting of mixed data types, namely, ordered multicategory and quantitative data types. While various pattern analysis techniques have been identified as suitable for analysis of the mixed data types which occur in germplasm collections, the clustering and ordination methods used often can not deal explicitly with the computational consequences of large data sets (i.e. greater than 5000 accessions) with incomplete information. However, it is shown that the ordination technique of principal component analysis and the mixture maximum likelihood method of clustering can be employed to achieve such analyses. Germplasm evaluation data for 11436 accessions of groundnut (Arachis hypogaea L.) from the International Research Institute of the Semi-Arid Tropics, Andhra Pradesh, India were examined. Data for nine quantitative descriptors measured in the post-rainy season and five ordered multicategory descriptors were used. Pattern analysis results generally indicated that the accessions could be distinguished into four regions along the continuum of growth habit (or plant erectness). Interpretation of accession membership in these regions was found to be consistent with taxonomic information, such as subspecies. Each growth habit region contained accessions from three of the most common groundnut botanical varieties. This implies that within each of the habit types there is the full range of expression for the other descriptors used in the analysis. Using these types of insights, the patterns of variability in germplasm collections can provide scientists with valuable information for their plant improvement programs.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

This report presents a method for viewing complex programs as built up out of simpler ones. The central idea is that typical programs are built up in a small number of stereotyped ways. The method is designed to make it easier for an automatic system to work with programs. It focuses on how the primitive operations performed by a program are combined together in order to produce the actions of the program as a whole. It does not address the issue of how complex data structures are built up from simpler ones, nor the relationships between data structures and the operations performed on them.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The analysis of seabed structure is important in a wide variety of scientific and industrial applications. In this paper, underwater acoustic data produced by bottom-penetrating sonar (Topas) are analyzed using unsupervised volumetric segmentation, based on a three dimensional Gibbs-Markov model. The result is a concise and accurate description of the seabed, in which key structures are emphasized. This description is also very well suited to further operations, such as the enhancement and automatic recognition of important structures. Experimental results demonstrating the effectiveness of this approach are shown, using Topas data gathered in the North Sea off Horten, Norway.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Computational Biology is the research are that contributes to the analysis of biological data through the development of algorithms which will address significant research problems.The data from molecular biology includes DNA,RNA ,Protein and Gene expression data.Gene Expression Data provides the expression level of genes under different conditions.Gene expression is the process of transcribing the DNA sequence of a gene into mRNA sequences which in turn are later translated into proteins.The number of copies of mRNA produced is called the expression level of a gene.Gene expression data is organized in the form of a matrix. Rows in the matrix represent genes and columns in the matrix represent experimental conditions.Experimental conditions can be different tissue types or time points.Entries in the gene expression matrix are real values.Through the analysis of gene expression data it is possible to determine the behavioral patterns of genes such as similarity of their behavior,nature of their interaction,their respective contribution to the same pathways and so on. Similar expression patterns are exhibited by the genes participating in the same biological process.These patterns have immense relevance and application in bioinformatics and clinical research.Theses patterns are used in the medical domain for aid in more accurate diagnosis,prognosis,treatment planning.drug discovery and protein network analysis.To identify various patterns from gene expression data,data mining techniques are essential.Clustering is an important data mining technique for the analysis of gene expression data.To overcome the problems associated with clustering,biclustering is introduced.Biclustering refers to simultaneous clustering of both rows and columns of a data matrix. Clustering is a global whereas biclustering is a local model.Discovering local expression patterns is essential for identfying many genetic pathways that are not apparent otherwise.It is therefore necessary to move beyond the clustering paradigm towards developing approaches which are capable of discovering local patterns in gene expression data.A biclusters is a submatrix of the gene expression data matrix.The rows and columns in the submatrix need not be contiguous as in the gene expression data matrix.Biclusters are not disjoint.Computation of biclusters is costly because one will have to consider all the combinations of columans and rows in order to find out all the biclusters.The search space for the biclustering problem is 2 m+n where m and n are the number of genes and conditions respectively.Usually m+n is more than 3000.The biclustering problem is NP-hard.Biclustering is a powerful analytical tool for the biologist.The research reported in this thesis addresses the problem of biclustering.Ten algorithms are developed for the identification of coherent biclusters from gene expression data.All these algorithms are making use of a measure called mean squared residue to search for biclusters.The objective here is to identify the biclusters of maximum size with the mean squared residue lower than a given threshold. All these algorithms begin the search from tightly coregulated submatrices called the seeds.These seeds are generated by K-Means clustering algorithm.The algorithms developed can be classified as constraint based,greedy and metaheuristic.Constarint based algorithms uses one or more of the various constaints namely the MSR threshold and the MSR difference threshold.The greedy approach makes a locally optimal choice at each stage with the objective of finding the global optimum.In metaheuristic approaches particle Swarm Optimization(PSO) and variants of Greedy Randomized Adaptive Search Procedure(GRASP) are used for the identification of biclusters.These algorithms are implemented on the Yeast and Lymphoma datasets.Biologically relevant and statistically significant biclusters are identified by all these algorithms which are validated by Gene Ontology database.All these algorithms are compared with some other biclustering algorithms.Algorithms developed in this work overcome some of the problems associated with the already existing algorithms.With the help of some of the algorithms which are developed in this work biclusters with very high row variance,which is higher than the row variance of any other algorithm using mean squared residue, are identified from both Yeast and Lymphoma data sets.Such biclusters which make significant change in the expression level are highly relevant biologically.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

A role for sequential test procedures is emerging in genetic and epidemiological studies using banked biological resources. This stems from the methodology's potential for improved use of information relative to comparable fixed sample designs. Studies in which cost, time and ethics feature prominently are particularly suited to a sequential approach. In this paper sequential procedures for matched case–control studies with binary data will be investigated and assessed. Design issues such as sample size evaluation and error rates are identified and addressed. The methodology is illustrated and evaluated using both real and simulated data sets.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

A quality analysis trial was undertaken at Ford Geelong Stamping Plant on a press line that was fitted with standard press sensors to measure press and binder force over the stamping cycle for each panel. The quality of randomly sampled panels was measured by obtaining the panel thicknesses at five points, for 135 panels in total. These points were chosen such that they exhibited different forming modes. This paper analyses the input force data and the output quality data from the trial to determine any potential relationships. The analysis of the production data was performed using statistical correlation techniques to determine initial potential relationships between input and output variables. An Active Shape Model was used to extract features when identifying the major sources of variation within the input data. However, the initial analysis of the data elicited no direct relationship between the input variables measured and the panel thicknesses. This result is significant as the data collected is from a standard sensor configuration found in many press lines through-out the world. The reason for the lack of a direct relationship is believed to come from the lack of sensitivity in the force measurements which are not able to identify small changes in the process, whereas gross geometric variations have in previous studies shown an obvious relationship with changes in the force press profile. This means that existing force sensors require augmentation by additional sensors if a detailed automatic quality control system for the press lines based on input sensors alone.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Rain acidity may be ascribed to emissions from power station stacks, as well as emissions from other industry, biomass burning, maritime influences, agricultural influences, etc. Rain quality data are available for 30 sites in the South African interior, some from as early as 1985 for up to 14 rainfall seasons, while others only have relatively short records. The article examines trends over time in the raw and volume weighted concentrations of the parameters measured, separately for each of the sites for which sufficient data are available. The main thrust, however, is to examine the inter-relationship structure between the concentrations within each rain event (unweighted data), separately for each site, and to examine whether these inter-relationships have changed over time. The rain events at individual sites can be characterized by approximately eight combinations of rainfall parameters (or rain composition signatures), and these are common to all sites. Some sites will have more events from one signature than another, but there appear to be no signatures unique to a single site. Analysis via factor and cluster analysis, with a correspondence analysis of the results, also aid interpretation of the patterns. This spatio-temporal analysis, performed by pooling all rain event data, irrespective of site or time period, results in nine combinations of rainfall parameters being sufficient to characterize the rain events. The sites and rainfall seasons show patterns in these combinations of parameters, with some combinations appearing more frequently during certain rainfall seasons. In particular, the presence of the combination of low acetate and formate with high magnesium appears to be increasing in the later rainfall seasons, as does this combination together with calcium, sodium, chloride, potassium and fluoride. As expected, sites close together exhibit similar signatures. Copyright © 2002 John Wiley & Sons, Ltd.