924 resultados para Automatic Analysis of Multivariate Categorical Data Sets
Resumo:
A combinatorial protocol (CP) is introduced here to interface it with the multiple linear regression (MLR) for variable selection. The efficiency of CP-MLR is primarily based on the restriction of entry of correlated variables to the model development stage. It has been used for the analysis of Selwood et al data set [16], and the obtained models are compared with those reported from GFA [8] and MUSEUM [9] approaches. For this data set CP-MLR could identify three highly independent models (27, 28 and 31) with Q2 value in the range of 0.632-0.518. Also, these models are divergent and unique. Even though, the present study does not share any models with GFA [8], and MUSEUM [9] results, there are several descriptors common to all these studies, including the present one. Also a simulation is carried out on the same data set to explain the model formation in CP-MLR. The results demonstrate that the proposed method should be able to offer solutions to data sets with 50 to 60 descriptors in reasonable time frame. By carefully selecting the inter-parameter correlation cutoff values in CP-MLR one can identify divergent models and handle data sets larger than the present one without involving excessive computer time.
Resumo:
Complementary to automatic extraction processes, Virtual Reality technologies provide an adequate framework to integrate human perception in the exploration of large data sets. In such multisensory system, thanks to intuitive interactions, a user can take advantage of all his perceptual abilities in the exploration task. In this context the haptic perception, coupled to visual rendering, has been investigated for the last two decades, with significant achievements. In this paper, we present a survey related to exploitation of the haptic feedback in exploration of large data sets. For each haptic technique introduced, we describe its principles and its effectiveness.
Resumo:
Here we present a study of the 11 yr sunspot cycle's imprint on the Northern Hemisphere atmospheric circulation, using three recently developed gridded upper-air data sets that extend back to the early twentieth century. We find a robust response of the tropospheric late-wintertime circulation to the sunspot cycle, independent from the data set. This response is particularly significant over Europe, although results show that it is not directly related to a North Atlantic Oscillation (NAO) modulation; instead, it reveals a significant connection to the more meridional Eurasian pattern (EU). The magnitude of mean seasonal temperature changes over the European land areas locally exceeds 1 K in the lower troposphere over a sunspot cycle. We also analyse surface data to address the question whether the solar signal over Europe is temporally stable for a longer 250 yr period. The results increase our confidence in the existence of an influence of the 11 yr cycle on the European climate, but the signal is much weaker in the first half of the period compared to the second half. The last solar minimum (2005 to 2010), which was not included in our analysis, shows anomalies that are consistent with our statistical results for earlier solar minima.
Resumo:
The COSMIC-2 mission is a follow-on mission of the Constellation Observing System for Meteorology, Ionosphere, and Climate (COSMIC) with an upgraded payload for improved radio occultation (RO) applications. The objective of this paper is to develop a near-real-time (NRT) orbit determination system, called NRT National Chiao Tung University (NCTU) system, to support COSMIC-2 in atmospheric applications and verify the orbit product of COSMIC. The system is capable of automatic determinations of the NRT GPS clocks and LEO orbit and clock. To assess the NRT (NCTU) system, we use eight days of COSMIC data (March 24-31, 2011), which contain a total of 331 GPS observation sessions and 12 393 RO observable files. The parallel scheduling for independent GPS and LEO estimations and automatic time matching improves the computational efficiency by 64% compared to the sequential scheduling. Orbit difference analyses suggest a 10-cm accuracy for the COSMIC orbits from the NRT (NCTU) system, and it is consistent as the NRT University Corporation for Atmospheric Research (URCA) system. The mean velocity accuracy from the NRT orbits of COSMIC is 0.168 mm/s, corresponding to an error of about 0.051 μrad in the bending angle. The rms differences in the NRT COSMIC clock and in GPS clocks between the NRT (NCTU) and the postprocessing products are 3.742 and 1.427 ns. The GPS clocks determined from a partial ground GPS network [from NRT (NCTU)] and a full one [from NRT (UCAR)] result in mean rms frequency stabilities of 6.1E-12 and 2.7E-12, respectively, corresponding to range fluctuations of 5.5 and 2.4 cm and bending angle errors of 3.75 and 1.66 μrad .
Resumo:
Objective: Processes occurring in the course of psychotherapy are characterized by the simple fact that they unfold in time and that the multiple factors engaged in change processes vary highly between individuals (idiographic phenomena). Previous research, however, has neglected the temporal perspective by its traditional focus on static phenomena, which were mainly assessed at the group level (nomothetic phenomena). To support a temporal approach, the authors introduce time-series panel analysis (TSPA), a statistical methodology explicitly focusing on the quantification of temporal, session-to-session aspects of change in psychotherapy. TSPA-models are initially built at the level of individuals and are subsequently aggregated at the group level, thus allowing the exploration of prototypical models. Method: TSPA is based on vector auto-regression (VAR), an extension of univariate auto-regression models to multivariate time-series data. The application of TSPA is demonstrated in a sample of 87 outpatient psychotherapy patients who were monitored by postsession questionnaires. Prototypical mechanisms of change were derived from the aggregation of individual multivariate models of psychotherapy process. In a 2nd step, the associations between mechanisms of change (TSPA) and pre- to postsymptom change were explored. Results: TSPA allowed a prototypical process pattern to be identified, where patient's alliance and self-efficacy were linked by a temporal feedback-loop. Furthermore, therapist's stability over time in both mastery and clarification interventions was positively associated with better outcomes. Conclusions: TSPA is a statistical tool that sheds new light on temporal mechanisms of change. Through this approach, clinicians may gain insight into prototypical patterns of change in psychotherapy.
Resumo:
Recurrent wheezing or asthma is a common problem in children that has increased considerably in prevalence in the past few decades. The causes and underlying mechanisms are poorly understood and it is thought that a numb er of distinct diseases causing similar symptoms are involved. Due to the lack of a biologically founded classification system, children are classified according to their observed disease related features (symptoms, signs, measurements) into phenotypes. The objectives of this PhD project were a) to develop tools for analysing phenotypic variation of a disease, and b) to examine phenotypic variability of wheezing among children by applying these tools to existing epidemiological data. A combination of graphical methods (multivariate co rrespondence analysis) and statistical models (latent variables models) was used. In a first phase, a model for discrete variability (latent class model) was applied to data on symptoms and measurements from an epidemiological study to identify distinct phenotypes of wheezing. In a second phase, the modelling framework was expanded to include continuous variability (e.g. along a severity gradient) and combinations of discrete and continuo us variability (factor models and factor mixture models). The third phase focused on validating the methods using simulation studies. The main body of this thesis consists of 5 articles (3 published, 1 submitted and 1 to be submitted) including applications, methodological contributions and a review. The main findings and contributions were: 1) The application of a latent class model to epidemiological data (symptoms and physiological measurements) yielded plausible pheno types of wheezing with distinguishing characteristics that have previously been used as phenotype defining characteristics. 2) A method was proposed for including responses to conditional questions (e.g. questions on severity or triggers of wheezing are asked only to children with wheeze) in multivariate modelling.ii 3) A panel of clinicians was set up to agree on a plausible model for wheezing diseases. The model can be used to generate datasets for testing the modelling approach. 4) A critical review of methods for defining and validating phenotypes of wheeze in children was conducted. 5) The simulation studies showed that a parsimonious parameterisation of the models is required to identify the true underlying structure of the data. The developed approach can deal with some challenges of real-life cohort data such as variables of mixed mode (continuous and categorical), missing data and conditional questions. If carefully applied, the approach can be used to identify whether the underlying phenotypic variation is discrete (classes), continuous (factors) or a combination of these. These methods could help improve precision of research into causes and mechanisms and contribute to the development of a new classification of wheezing disorders in children and other diseases which are difficult to classify.
Resumo:
The role of clinical chemistry has traditionally been to evaluate acutely ill or hospitalized patients. Traditional statistical methods have serious drawbacks in that they use univariate techniques. To demonstrate alternative methodology, a multivariate analysis of covariance model was developed and applied to the data from the Cooperative Study of Sickle Cell Disease.^ The purpose of developing the model for the laboratory data from the CSSCD was to evaluate the comparability of the results from the different clinics. Several variables were incorporated into the model in order to control for possible differences among the clinics that might confound any real laboratory differences.^ Differences for LDH, alkaline phosphatase and SGOT were identified which will necessitate adjustments by clinic whenever these data are used. In addition, aberrant clinic values for LDH, creatinine and BUN were also identified.^ The use of any statistical technique including multivariate analysis without thoughtful consideration may lead to spurious conclusions that may not be corrected for some time, if ever. However, the advantages of multivariate analysis far outweigh its potential problems. If its use increases as it should, the applicability to the analysis of laboratory data in prospective patient monitoring, quality control programs, and interpretation of data from cooperative studies could well have a major impact on the health and well being of a large number of individuals. ^
Resumo:
Mixture modeling is commonly used to model categorical latent variables that represent subpopulations in which population membership is unknown but can be inferred from the data. In relatively recent years, the potential of finite mixture models has been applied in time-to-event data. However, the commonly used survival mixture model assumes that the effects of the covariates involved in failure times differ across latent classes, but the covariate distribution is homogeneous. The aim of this dissertation is to develop a method to examine time-to-event data in the presence of unobserved heterogeneity under a framework of mixture modeling. A joint model is developed to incorporate the latent survival trajectory along with the observed information for the joint analysis of a time-to-event variable, its discrete and continuous covariates, and a latent class variable. It is assumed that the effects of covariates on survival times and the distribution of covariates vary across different latent classes. The unobservable survival trajectories are identified through estimating the probability that a subject belongs to a particular class based on observed information. We applied this method to a Hodgkin lymphoma study with long-term follow-up and observed four distinct latent classes in terms of long-term survival and distributions of prognostic factors. Our results from simulation studies and from the Hodgkin lymphoma study demonstrated the superiority of our joint model compared with the conventional survival model. This flexible inference method provides more accurate estimation and accommodates unobservable heterogeneity among individuals while taking involved interactions between covariates into consideration.^
Resumo:
In this dissertation, we propose a continuous-time Markov chain model to examine the longitudinal data that have three categories in the outcome variable. The advantage of this model is that it permits a different number of measurements for each subject and the duration between two consecutive time points of measurements can be irregular. Using the maximum likelihood principle, we can estimate the transition probability between two time points. By using the information provided by the independent variables, this model can also estimate the transition probability for each subject. The Monte Carlo simulation method will be used to investigate the goodness of model fitting compared with that obtained from other models. A public health example will be used to demonstrate the application of this method. ^
Resumo:
The analysis of research data plays a key role in data-driven areas of science. Varieties of mixed research data sets exist and scientists aim to derive or validate hypotheses to find undiscovered knowledge. Many analysis techniques identify relations of an entire dataset only. This may level the characteristic behavior of different subgroups in the data. Like automatic subspace clustering, we aim at identifying interesting subgroups and attribute sets. We present a visual-interactive system that supports scientists to explore interesting relations between aggregated bins of multivariate attributes in mixed data sets. The abstraction of data to bins enables the application of statistical dependency tests as the measure of interestingness. An overview matrix view shows all attributes, ranked with respect to the interestingness of bins. Complementary, a node-link view reveals multivariate bin relations by positioning dependent bins close to each other. The system supports information drill-down based on both expert knowledge and algorithmic support. Finally, visual-interactive subset clustering assigns multivariate bin relations to groups. A list-based cluster result representation enables the scientist to communicate multivariate findings at a glance. We demonstrate the applicability of the system with two case studies from the earth observation domain and the prostate cancer research domain. In both cases, the system enabled us to identify the most interesting multivariate bin relations, to validate already published results, and, moreover, to discover unexpected relations.
Resumo:
The objective of this study was to assess the potential of visible and near infrared spectroscopy (VIS+NIRS) combined with multivariate analysis for identifying the geographical origin of cork. The study was carried out on cork planks and natural cork stoppers from the most representative cork-producing areas in the world. Two training sets of international and national cork planks were studied. The first set comprised a total of 479 samples from Morocco, Portugal, and Spain, while the second set comprised a total of 179 samples from the Spanish regions of Andalusia, Catalonia, and Extremadura. A training set of 90 cork stoppers from Andalusia and Catalonia was also studied. Original spectroscopic data were obtained for the transverse sections of the cork planks and for the body and top of the cork stoppers by means of a 6500 Foss-NIRSystems SY II spectrophotometer using a fiber optic probe. Remote reflectance was employed in the wavelength range of 400 to 2500 nm. After analyzing the spectroscopic data, discriminant models were obtained by means of partial least square (PLS) with 70% of the samples. The best models were then validated using 30% of the remaining samples. At least 98% of the international cork plank samples and 95% of the national samples were correctly classified in the calibration and validation stage. The best model for the cork stoppers was obtained for the top of the stoppers, with at least 90% of the samples being correctly classified. The results demonstrate the potential of VIS + NIRS technology as a rapid and accurate method for predicting the geographical origin of cork plank and stoppers
Resumo:
Grapheme-color synesthesia is a neurological phenomenon in which viewing achromatic letters/numbers leads to automatic and involuntary color experiences. In this study, voxel-based morphometry analyses were performed on T1 images and fractional anisotropy measures to examine the whole brain in associator grapheme-color synesthetes. These analyses provide new evidence of variations in emotional areas (both at the cortical and subcortical levels), findings that help understand the emotional component as a relevant aspect of the synesthetic experience. Additionally, this study replicates previous findings in the left intraparietal sulcus and, for the first time, reports the existence of anatomical differences in subcortical gray nuclei of developmental grapheme-color synesthetes, providing a link between acquired and developmental synesthesia. This empirical evidence, which goes beyond modality-specific areas, could lead to a better understanding of grapheme-color synesthesia as well as of other modalities of the phenomenon.
Resumo:
We examine, with recently developed Lagrangian tools, altimeter data and numerical simulations obtained from the HYCOM model in the Gulf of Mexico. Our data correspond to the months just after the Deepwater Horizon oil spill in the year 2010. Our Lagrangian analysis provides a skeleton that allows the interpretation of transport routes over the ocean surface. The transport routes are further verified by the simultaneous study of the evolution of several drifters launched during those months in the Gulf of Mexico. We find that there exist Lagrangian structures that justify the dynamics of the drifters, although the agreement depends on the quality of the data. We discuss the impact of the Lagrangian tools on the assessment of the predictive capacity of these data sets.
Resumo:
Semantic interoperability is essential to facilitate efficient collaboration in heterogeneous multi-site healthcare environments. The deployment of a semantic interoperability solution has the potential to enable a wide range of informatics supported applications in clinical care and research both within as ingle healthcare organization and in a network of organizations. At the same time, building and deploying a semantic interoperability solution may require significant effort to carryout data transformation and to harmonize the semantics of the information in the different systems. Our approach to semantic interoperability leverages existing healthcare standards and ontologies, focusing first on specific clinical domains and key applications, and gradually expanding the solution when needed. An important objective of this work is to create a semantic link between clinical research and care environments to enable applications such as streamlining the execution of multi-centric clinical trials, including the identification of eligible patients for the trials. This paper presents an analysis of the suitability of several widely-used medical ontologies in the clinical domain: SNOMED-CT, LOINC, MedDRA, to capture the semantics of the clinical trial eligibility criteria, of the clinical trial data (e.g., Clinical Report Forms), and of the corresponding patient record data that would enable the automatic identification of eligible patients. Next to the coverage provided by the ontologies we evaluate and compare the sizes of the sets of relevant concepts and their relative frequency to estimate the cost of data transformation, of building the necessary semantic mappings, and of extending the solution to new domains. This analysis shows that our approach is both feasible and scalable.