924 results for Automatic Analysis of Multivariate Categorical Data Sets
Abstract:
Our approach to knowledge presentation is based on the idea of an expert system shell. First, we build a graph shell of both possible dependencies and possible actions. Then, reasoning by means of log-linear models, we activate some nodes and some directed links. In this way a Bayesian network, together with networks representing the log-linear models, is generated.
Abstract:
In the analysis of multivariate categorical data, typically the analysis of questionnaire data, it is often advantageous, for substantive and technical reasons, to analyse a subset of response categories. In multiple correspondence analysis, where each category is coded as a column of an indicator matrix or as a row and column of the Burt matrix, it is not correct simply to analyse the corresponding submatrix of data, since the whole geometric structure is different for the submatrix. A simple modification of the correspondence analysis algorithm allows the overall geometric structure of the complete data set to be retained while calculating the solution for the selected subset of points. This strategy is useful for analysing patterns of response amongst any subset of categories and relating these patterns to demographic factors, and especially for studying patterns of particular responses such as missing and neutral responses. The methodology is illustrated using data from the 1994 International Social Survey Program on Family and Changing Gender Roles.
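For intuition, here is a minimal numpy sketch of the strategy this abstract describes: keep the row and column masses of the complete table while decomposing only the selected columns. The function name, the toy table and the chosen subset are ours for illustration, not taken from the paper.

```python
import numpy as np

def subset_ca(N, cols, n_dims=2):
    """Correspondence analysis of a column subset while keeping the margins
    (masses) of the complete table, so the overall geometry is retained.
    A minimal sketch of the strategy described in the abstract."""
    P = N / N.sum()                      # correspondence matrix of the FULL table
    r, c = P.sum(axis=1), P.sum(axis=0)  # row and column masses of the FULL table
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardized residuals
    U, sv, Vt = np.linalg.svd(S[:, cols], full_matrices=False)  # subset only
    row_coords = (U * sv) / np.sqrt(r)[:, None]          # principal coordinates
    col_coords = (Vt.T * sv) / np.sqrt(c[cols])[:, None]
    return row_coords[:, :n_dims], col_coords[:, :n_dims], sv ** 2

# toy indicator table: 6 respondents x 5 response categories
N = np.array([[1, 0, 0, 1, 0],
              [0, 1, 0, 0, 1],
              [1, 0, 0, 0, 1],
              [0, 0, 1, 1, 0],
              [0, 1, 0, 1, 0],
              [1, 0, 0, 0, 1]], dtype=float)
rows, cats, inertias = subset_ca(N, cols=[0, 1, 2])   # analyse three categories only
```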
Abstract:
Isotopic analysis has proven to be an extremely important tool for traceability; however, there is disagreement over the statistical analysis of the results, since the data are dependent and come from several chemical elements such as carbon, hydrogen, oxygen, nitrogen and sulphur (CHONS). To establish the appropriate analysis for poultry traceability data obtained by the stable isotope technique, and to assess the need for a joint analysis of the variables, carbon-13 and nitrogen-15 data from eggs (albumen + yolk) of laying hens and from the breast muscle of broilers were subjected to univariate statistical analysis (ANOVA, complemented by Tukey's test) and multivariate analysis (MANOVA and discriminant analysis). The data were analysed in Minitab 16, and the results, consistent with theory, confirm the need for multivariate analysis, also showing that discriminant analysis resolves the doubts raised by the other methods of analysis compared in this study.
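As a sketch of the multivariate route this abstract argues for, the following Python fragment treats the two isotope ratios jointly with linear discriminant analysis (scikit-learn); the delta-13C/delta-15N values and group means are made up for illustration.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# hypothetical delta-13C and delta-15N values (in per mil) for two diets
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-18.0, 2.0], 0.4, size=(30, 2)),   # diet A samples
               rng.normal([-16.5, 3.0], 0.4, size=(30, 2))])  # diet B samples
y = np.repeat(["A", "B"], 30)

# discriminant analysis treats (d13C, d15N) jointly, which is the point the
# abstract makes against element-by-element ANOVA
lda = LinearDiscriminantAnalysis().fit(X, y)
print("within-sample classification accuracy:", lda.score(X, y))
```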
Abstract:
The MAP-i Doctoral Program of the Universities of Minho, Aveiro and Porto
Abstract:
SUMMARY: Large sets of data, such as expression profiles from many samples, require analytic tools to reduce their complexity. The Iterative Signature Algorithm (ISA) is a biclustering algorithm designed to decompose a large set of data into so-called 'modules'. In the context of gene expression data, these modules consist of subsets of genes that exhibit a coherent expression profile only over a subset of microarray experiments. Genes and arrays may be attributed to multiple modules, and the level of required coherence can be varied, resulting in different 'resolutions' of the modular mapping. In this short note, we introduce two Bioconductor software packages written in GNU R: the isa2 package, which includes an optimized implementation of the ISA, and the eisa package, which provides a convenient interface to run the ISA, visualize its output and put the biclusters into biological context. Potential users of these packages are all R and Bioconductor users dealing with tabular (e.g. gene expression) data.
AVAILABILITY: http://www.unil.ch/cbg/ISA
CONTACT: sven.bergmann@unil.ch
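For readers wanting the gist of the algorithm rather than the packages, here is a simplified Python sketch of the ISA iteration (alternating thresholded condition and gene scores on a standardized matrix); the real isa2 implementation handles seeds, signs and thresholds far more carefully, so treat this only as a sketch under those assumptions.

```python
import numpy as np

def zscore(x):
    s = x.std()
    return (x - x.mean()) / (s if s > 0 else 1.0)

def isa_step(E, g, tg=2.0, tc=2.0):
    """One ISA iteration on a (nominally standardized) genes-x-conditions
    matrix E. g: current gene score vector. Simplified sketch, not isa2."""
    c = E.T @ g                                      # condition scores given genes
    c = np.where(np.abs(zscore(c)) > tc, c, 0.0)     # keep only strong conditions
    g2 = E @ c                                       # gene scores given conditions
    g2 = np.where(np.abs(zscore(g2)) > tg, g2, 0.0)  # keep only strong genes
    n = np.linalg.norm(g2)
    return g2 / n if n > 0 else g2

def run_isa(E, seed, n_iter=50):
    """Iterate from a seed until the module (a fixed point) stabilizes."""
    g = seed / np.linalg.norm(seed)
    for _ in range(n_iter):
        g_new = isa_step(E, g)
        if np.allclose(g_new, g, atol=1e-8):
            break
        g = g_new
    return g    # nonzero entries mark the genes of the module

# toy usage: 200 genes x 40 conditions with a planted module
rng = np.random.default_rng(0)
E = rng.normal(size=(200, 40))
E[:30, :10] += 2.0                                   # planted bicluster
module = run_isa(E, seed=rng.normal(size=200))
```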
Abstract:
Standard methods for the analysis of linear latent variable models often rely on the assumption that the vector of observed variables is normally distributed. This normality assumption (NA) plays a crucial role in assessing optimality of estimates, in computing standard errors, and in designing an asymptotic chi-square goodness-of-fit test. The asymptotic validity of NA inferences when the data deviate from normality has been called asymptotic robustness. In the present paper we extend previous work on asymptotic robustness to a general context of multi-sample analysis of linear latent variable models, with a latent component of the model allowed to be fixed across (hypothetical) sample replications, and with the asymptotic covariance matrix of the sample moments not necessarily finite. We will show that, under certain conditions, the matrix $\Gamma$ of asymptotic variances of the analyzed sample moments can be substituted by a matrix $\Omega$ that is a function only of the cross-product moments of the observed variables. The main advantage of this is that inferences based on $\Omega$ are readily available in standard software for covariance structure analysis, and do not require computing sample fourth-order moments. An illustration with simulated data in the context of regression with errors in variables will be presented.
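To make the moment-order trade-off concrete, the following numpy sketch contrasts a distribution-free estimate of $\Gamma$ (which needs fourth-order moments of the data) with the familiar normal-theory matrix built from second moments alone. The paper's $\Omega$ is defined in its own terms, so this is an illustration of the orders of moments involved, not the authors' estimator.

```python
import numpy as np

def vech(A):
    """Half-vectorization: stack the lower triangle of a symmetric matrix."""
    i, j = np.tril_indices(A.shape[0])
    return A[i, j]

def duplication(p):
    """Duplication matrix D with vec(A) = D @ vech(A) for symmetric A."""
    rows, cols = np.tril_indices(p)
    D = np.zeros((p * p, len(rows)))
    for k, (i, j) in enumerate(zip(rows, cols)):
        D[i * p + j, k] = 1.0
        D[j * p + i, k] = 1.0
    return D

def gamma_adf(X):
    """Distribution-free estimate of Gamma, the asymptotic covariance of
    vech(S): the sample covariance of the centred outer products, which
    implicitly uses fourth-order moments of the data."""
    Xc = X - X.mean(axis=0)
    d = np.array([vech(np.outer(x, x)) for x in Xc])
    return np.cov(d, rowvar=False)

def gamma_normal(S):
    """Normal-theory counterpart, a function of second moments only:
    Gamma_N = 2 D+ (S kron S) D+'."""
    Dplus = np.linalg.pinv(duplication(S.shape[0]))
    return 2.0 * Dplus @ np.kron(S, S) @ Dplus.T

# for normal data the two estimates should roughly agree
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
print(np.round(gamma_adf(X), 2))
print(np.round(gamma_normal(np.cov(X, rowvar=False)), 2))
```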
Abstract:
One major methodological problem in analysis of sequence data is the determination of costs from which distances between sequences are derived. Although this problem is currently not optimally dealt with in the social sciences, it has some similarity with problems that have been solved in bioinformatics for three decades. In this article, the authors propose an optimization of substitution and deletion/insertion costs based on computational methods. The authors provide an empirical way of determining costs for cases, frequent in the social sciences, in which theory does not clearly promote one cost scheme over another. Using three distinct data sets, the authors tested the distances and cluster solutions produced by the new cost scheme in comparison with solutions based on cost schemes associated with other research strategies. The proposed method performs well compared with other cost-setting strategies, while it alleviates the justification problem of cost schemes.
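One widely used data-driven scheme of the kind discussed here derives substitution costs from observed transition rates, cost(i, j) = 2 - p(j|i) - p(i|j), as in TraMineR's TRATE option. The sketch below implements that well-known scheme, not the authors' exact optimization; the toy sequences are ours.

```python
import numpy as np

def transition_costs(sequences, states):
    """Substitution costs from observed transition rates:
    cost(i, j) = 2 - p(j|i) - p(i|j), so states that frequently follow one
    another are cheap to substitute. A common data-driven scheme in the
    spirit of the article, not the authors' exact optimizer."""
    k = len(states)
    idx = {s: n for n, s in enumerate(states)}
    counts = np.zeros((k, k))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):            # consecutive state pairs
            counts[idx[a], idx[b]] += 1
    p = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    cost = 2.0 - p - p.T
    np.fill_diagonal(cost, 0.0)
    return cost

# toy employment histories: E = employed, U = unemployed
seqs = ["EEEUUEEE", "UUEEEEEE", "EEUUUEEE"]
print(transition_costs(seqs, states="EU"))
```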
Abstract:
Background and Purpose: The safety and efficacy of thrombolysis in cervical artery dissection (CAD) are controversial. The aim of this meta-analysis was to pool all individual patient data and provide a valid estimate of safety and outcome of thrombolysis in CAD. Methods: We performed a systematic literature search on intravenous and intra-arterial thrombolysis in CAD. We calculated the rates of pooled symptomatic intracranial hemorrhage and mortality and indirectly compared them with matched controls from the Safe Implementation of Thrombolysis in Stroke-International Stroke Thrombolysis Register. We applied multivariate regression models to identify predictors of excellent (modified Rankin Scale = 0 to 1) and favorable (modified Rankin Scale = 0 to 2) outcome. Results: We obtained individual patient data of 180 patients from 14 retrospective series and 22 case reports. Patients were predominantly female (68%), with a mean ± SD age of 46 ± 11 years. Most patients presented with severe stroke (median National Institutes of Health Stroke Scale score = 16). Treatment was intravenous thrombolysis in 67% and intra-arterial thrombolysis in 33%. Median follow-up was 3 months. The pooled symptomatic intracranial hemorrhage rate was 3.1% (95% CI, 1.3 to 7.2). Overall mortality was 8.1% (95% CI, 4.9 to 13.2), and 41.0% (95% CI, 31.4 to 51.4) had an excellent outcome. Stroke severity was a strong predictor of outcome. Overlapping confidence intervals of end points indicated no relevant differences with matched controls from the Safe Implementation of Thrombolysis in Stroke-International Stroke Thrombolysis Register. Conclusions: Safety and outcome of thrombolysis in patients with CAD-related stroke appear similar to those for stroke from all causes. Based on our findings, thrombolysis should not be withheld in patients with CAD. (Stroke. 2011;42:2515-2520.)
Abstract:
Computational biology is the research area that contributes to the analysis of biological data through the development of algorithms which address significant research problems. The data from molecular biology include DNA, RNA, protein and gene expression data. Gene expression data provide the expression level of genes under different conditions. Gene expression is the process of transcribing the DNA sequence of a gene into mRNA sequences, which in turn are later translated into proteins. The number of copies of mRNA produced is called the expression level of a gene. Gene expression data are organized in the form of a matrix: rows represent genes, columns represent experimental conditions (different tissue types or time points), and entries are real values. Through the analysis of gene expression data it is possible to determine behavioural patterns of genes, such as the similarity of their behaviour, the nature of their interaction, their respective contribution to the same pathways, and so on. Genes participating in the same biological process exhibit similar expression patterns. These patterns have immense relevance and application in bioinformatics and clinical research: they are used in the medical domain to aid more accurate diagnosis, prognosis, treatment planning, drug discovery and protein network analysis. To identify such patterns from gene expression data, data mining techniques are essential. Clustering is an important data mining technique for the analysis of gene expression data; to overcome the problems associated with clustering, biclustering is introduced. Biclustering refers to the simultaneous clustering of both rows and columns of a data matrix: clustering is a global model, whereas biclustering is a local one. Discovering local expression patterns is essential for identifying many genetic pathways that are not apparent otherwise, so it is necessary to move beyond the clustering paradigm towards approaches capable of discovering local patterns in gene expression data. A bicluster is a submatrix of the gene expression data matrix; its rows and columns need not be contiguous in the original matrix, and biclusters need not be disjoint. Computing biclusters is costly because all combinations of rows and columns must be considered: the search space for the biclustering problem is 2^(m+n), where m and n are the numbers of genes and conditions respectively, and usually m+n is more than 3000. The biclustering problem is NP-hard. Biclustering is nevertheless a powerful analytical tool for the biologist. The research reported in this thesis addresses the problem of biclustering. Ten algorithms are developed for the identification of coherent biclusters from gene expression data. All these algorithms make use of a measure called the mean squared residue (MSR) to search for biclusters; the objective is to identify biclusters of maximum size with mean squared residue below a given threshold.
All these algorithms begin the search from tightly coregulated submatrices called seeds, generated by the K-Means clustering algorithm. The algorithms developed can be classified as constraint-based, greedy and metaheuristic. Constraint-based algorithms use one or more constraints, namely the MSR threshold and the MSR difference threshold. The greedy approach makes a locally optimal choice at each stage with the objective of finding the global optimum (a sketch of this approach follows below). In the metaheuristic approaches, Particle Swarm Optimization (PSO) and variants of the Greedy Randomized Adaptive Search Procedure (GRASP) are used for the identification of biclusters. These algorithms are implemented on the Yeast and Lymphoma datasets. All of them identify biologically relevant and statistically significant biclusters, validated against the Gene Ontology database, and all are compared with other biclustering algorithms. The algorithms developed in this work overcome some of the problems associated with existing algorithms. Some of them identify, from both the Yeast and Lymphoma data sets, biclusters with very high row variance, higher than that of any other algorithm using the mean squared residue; such biclusters, which reflect significant changes in expression level, are highly relevant biologically.
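As a concrete anchor for the mean squared residue criterion used throughout, here is a minimal Python sketch of the MSR score and of a Cheng-and-Church-style greedy deletion step. The thesis's ten algorithms build on this family with K-Means seeds, constraints and PSO/GRASP metaheuristics, so treat this as background, not as any one of them.

```python
import numpy as np

def msr(A, rows, cols):
    """Mean squared residue of the bicluster A[rows, cols]:
    MSR = mean of (a_ij - rowmean_i - colmean_j + overall_mean)^2."""
    sub = A[np.ix_(rows, cols)]
    ri = sub.mean(axis=1, keepdims=True)
    cj = sub.mean(axis=0, keepdims=True)
    return ((sub - ri - cj + sub.mean()) ** 2).mean()

def greedy_delete(A, rows, cols, delta):
    """Single-node deletion: repeatedly drop the row or column with the
    largest mean residue until MSR <= delta (greedy, locally optimal)."""
    rows, cols = list(rows), list(cols)
    while msr(A, rows, cols) > delta and len(rows) > 2 and len(cols) > 2:
        sub = A[np.ix_(rows, cols)]
        res2 = (sub - sub.mean(axis=1, keepdims=True)
                    - sub.mean(axis=0, keepdims=True) + sub.mean()) ** 2
        r_sc, c_sc = res2.mean(axis=1), res2.mean(axis=0)
        if r_sc.max() >= c_sc.max():
            rows.pop(int(r_sc.argmax()))
        else:
            cols.pop(int(c_sc.argmax()))
    return rows, cols

# toy usage: noisy matrix with a planted additive (coherent) block
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 30))
A[:10, :8] = rng.normal(size=(10, 1)) + rng.normal(size=(1, 8))
rows, cols = greedy_delete(A, range(50), range(30), delta=0.5)
print(len(rows), "x", len(cols), "bicluster, MSR =", round(msr(A, rows, cols), 3))
```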
Abstract:
A role for sequential test procedures is emerging in genetic and epidemiological studies using banked biological resources. This stems from the methodology's potential for improved use of information relative to comparable fixed sample designs. Studies in which cost, time and ethics feature prominently are particularly suited to a sequential approach. In this paper sequential procedures for matched case–control studies with binary data will be investigated and assessed. Design issues such as sample size evaluation and error rates are identified and addressed. The methodology is illustrated and evaluated using both real and simulated data sets.
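For a flavour of the sequential machinery involved, here is a textbook sequential probability ratio test applied to the discordant pairs of a matched design with binary exposure; the boundaries, target odds ratio and error rates are illustrative assumptions, not the specific procedures evaluated in the paper.

```python
import numpy as np

def sprt_discordant(pairs, psi1=2.0, alpha=0.05, beta=0.10):
    """Sequential probability ratio test on the discordant pairs of a matched
    case-control study with binary exposure. Under H0 the probability that
    the case rather than the control is exposed is 1/2; under H1 it is
    psi1 / (1 + psi1) for a target odds ratio psi1. A textbook SPRT sketch,
    not the procedures investigated in the paper."""
    p1 = psi1 / (1.0 + psi1)
    lo, hi = np.log(beta / (1 - alpha)), np.log((1 - beta) / alpha)
    llr = 0.0
    for n, case_exposed in enumerate(pairs, start=1):
        llr += np.log(p1 / 0.5) if case_exposed else np.log((1 - p1) / 0.5)
        if llr >= hi:
            return "stop: reject H0", n
        if llr <= lo:
            return "stop: accept H0", n
    return "continue sampling", len(pairs)

# each entry: True if the case (not the matched control) was exposed
print(sprt_discordant([True, True, False, True, True, True, True, True]))
```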
Abstract:
Rain acidity may be ascribed to emissions from power station stacks, as well as emissions from other industry, biomass burning, maritime influences, agricultural influences, etc. Rain quality data are available for 30 sites in the South African interior, some from as early as 1985 for up to 14 rainfall seasons, while others only have relatively short records. The article examines trends over time in the raw and volume-weighted concentrations of the parameters measured, separately for each of the sites for which sufficient data are available. The main thrust, however, is to examine the inter-relationship structure between the concentrations within each rain event (unweighted data), separately for each site, and to examine whether these inter-relationships have changed over time. The rain events at individual sites can be characterized by approximately eight combinations of rainfall parameters (or rain composition signatures), and these are common to all sites. Some sites will have more events from one signature than another, but there appear to be no signatures unique to a single site. Analysis via factor and cluster analysis, with a correspondence analysis of the results, also aids interpretation of the patterns. This spatio-temporal analysis, performed by pooling all rain event data, irrespective of site or time period, results in nine combinations of rainfall parameters being sufficient to characterize the rain events. The sites and rainfall seasons show patterns in these combinations of parameters, with some combinations appearing more frequently during certain rainfall seasons. In particular, the presence of the combination of low acetate and formate with high magnesium appears to be increasing in the later rainfall seasons, as does this combination together with calcium, sodium, chloride, potassium and fluoride. As expected, sites close together exhibit similar signatures. Copyright © 2002 John Wiley & Sons, Ltd.
Abstract:
Dimensionality reduction is employed for visual data analysis as a way of obtaining reduced spaces for high-dimensional data or of mapping data directly into 2D or 3D spaces. Although techniques have evolved to improve data segregation in reduced or visual spaces, they have limited capabilities for adjusting the results according to the user's knowledge. In this paper, we propose a novel approach to handling both dimensionality reduction and visualization of high-dimensional data that takes the user's input into account. It employs Partial Least Squares (PLS), a statistical tool for retrieving latent spaces that focus on the discriminability of the data. The method uses a training set to build a highly precise model that can then be applied very effectively to a much larger data set, and the reduced data set can be exhibited using various existing visualization techniques. The training data are important for coding the user's knowledge into the loop; however, this work also devises a strategy for calculating PLS reduced spaces when no training data are available. The approach produces increasingly precise visual mappings as the user feeds back his or her knowledge, and is capable of working with small and unbalanced training sets.
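A minimal scikit-learn sketch of the train-small/project-large pattern described above; the variable names and data shapes are hypothetical, and the paper's own PLS formulation may differ from the off-the-shelf PLSRegression used here.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

# hypothetical data: a small user-annotated training set and a large unlabelled set
rng = np.random.default_rng(1)
X_train = rng.normal(size=(60, 50))          # 60 annotated items, 50 features
y_train = np.repeat([0, 1, 2], 20)           # the user's class knowledge
X_all = rng.normal(size=(5000, 50))          # full data set to be visualized

# PLS extracts latent directions that covary with the labels, so the 2D
# projection emphasizes the discrimination encoded by the user
pls = PLSRegression(n_components=2)
pls.fit(X_train, np.eye(3)[y_train])         # one-hot targets (PLS-DA style)
coords = pls.transform(X_all)                # 2D coordinates for any scatter plot
```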
Abstract:
The study describes brain areas involved in medial temporal lobe (mTL) seizures of 12 patients. All patients showed so-called oro-alimentary behavior within the first 20 s of clinical seizure manifestation characteristic of mTL seizures. Single photon emission computed tomography (SPECT) images of regional cerebral blood flow (rCBF) were acquired from the patients in ictal and interictal phases and from normal volunteers. Image analysis employed categorical comparisons with statistical parametric mapping and principal component analysis (PCA) to assess functional connectivity. PCA supplemented the findings of the categorical analysis by decomposing the covariance matrix containing images of patients and healthy subjects into distinct component images of independent variance, including areas not identified by the categorical analysis. Two principal components (PCs) discriminated the subject groups: patients with right or left mTL seizures and normal volunteers, indicating distinct neuronal networks implicated by the seizure. Both PCs were correlated with seizure duration, one positively and the other negatively, confirming their physiological significance. The independence of the two PCs yielded a clear clustering of subject groups. The local pattern within the temporal lobe describes critical relay nodes which are the counterpart of oro-alimentary behavior: (1) right mesial temporal zone and ipsilateral anterior insula in right mTL seizures, and (2) temporal poles on both sides that are densely interconnected by the anterior commissure. Regions remote from the temporal lobe may be related to seizure propagation and include positively and negatively loaded areas. These patterns, the covarying areas of the temporal pole and occipito-basal visual association cortices, for example, are related to known anatomic paths.
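In the spirit of the analysis described (PCA of subject-by-voxel images, with component scores checked against seizure duration for physiological significance), here is a compact Python sketch; all shapes and values are synthetic placeholders, not the study's data.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.decomposition import PCA

# hypothetical rCBF data: one row per scan, one column per voxel
rng = np.random.default_rng(2)
images = rng.normal(size=(36, 10000))        # ictal, interictal and control scans
duration = rng.uniform(20, 120, size=36)     # seizure duration per scan, in seconds

pca = PCA(n_components=2)
scores = pca.fit_transform(images)           # each subject's loading on PC1, PC2
eigenimages = pca.components_                # voxel patterns ('component images')

# correlate component scores with seizure duration, as the abstract does
for k in range(2):
    r, p = pearsonr(scores[:, k], duration)
    print(f"PC{k + 1} vs duration: r={r:.2f}, p={p:.3f}")
```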
Abstract:
This paper describes the application of a new technique, rough clustering, to the problem of market segmentation. Rough clustering produces solutions different from those of k-means analysis because objects may belong to multiple clusters. Traditional clustering methods generate extensional descriptions of groups, which show which objects are members of each cluster. Clustering techniques based on rough sets theory generate intensional descriptions, which outline the main characteristics of each cluster. In this study, a rough cluster analysis was conducted on a sample of 437 responses from a larger study of the relationship between shopping orientation (the general predisposition of consumers toward the act of shopping) and intention to purchase products via the Internet. The cluster analysis was based on five measures of shopping orientation: enjoyment, personalization, convenience, loyalty, and price. The rough clusters obtained provide interpretations of the different shopping orientations present in the data without the restriction of fitting each object into only one segment. Such descriptions can aid marketers attempting to identify potential segments of consumers.
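A minimal sketch of the multiple-membership idea, in the style of Lingras-and-West rough k-means (objects whose two nearest centroids are comparably close join the upper approximations of both clusters); the parameter values and data are illustrative, and this is not the authors' exact procedure.

```python
import numpy as np

def rough_kmeans(X, k, eps=1.3, w_lower=0.7, iters=20, seed=0):
    """Rough k-means sketch: an object whose two nearest centroids are
    comparably close (distance ratio <= eps) joins the upper approximation
    of both clusters; otherwise it joins the lower approximation of its
    nearest cluster. Centroid updates mix both approximations."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
        order = d.argsort(axis=1)
        nearest, second = order[:, 0], order[:, 1]
        ratio = d[np.arange(len(X)), second] / np.maximum(
            d[np.arange(len(X)), nearest], 1e-12)
        boundary = ratio <= eps                       # ambiguous objects
        for j in range(k):
            lower = X[(nearest == j) & ~boundary]
            upper = X[((nearest == j) | (second == j)) & boundary]
            if len(lower) and len(upper):
                C[j] = w_lower * lower.mean(0) + (1 - w_lower) * upper.mean(0)
            elif len(lower):
                C[j] = lower.mean(0)
            elif len(upper):
                C[j] = upper.mean(0)
    return C, nearest, boundary

# toy usage: three overlapping consumer segments in 2D
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.8, size=(50, 2)) for m in ([0, 0], [3, 0], [0, 3])])
C, nearest, boundary = rough_kmeans(X, k=3)
print("objects with multiple cluster membership:", int(boundary.sum()))
```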