924 resultados para Automatic Analysis of Multivariate Categorical Data Sets


Relevância:

100.00% 100.00%

Publicador:

Resumo:

Funding The International Primary Care Respiratory Group (IPCRG) provided funding for this research project as an UNLOCK group study for which the funding was obtained through an unrestricted grant by Novartis AG, Basel, Switzerland. The latter funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript. Database access for the OPCRD was provided by the Respiratory Effectiveness Group (REG) and Research in Real Life; the OPCRD statistical analysis was funded by REG. The Bocholtz Study was funded by PICASSO for COPD, an initiative of Boehringer Ingelheim, Pfizer and the Caphri Research Institute, Maastricht University, The Netherlands.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

We introduce a method of functionally classifying genes by using gene expression data from DNA microarray hybridization experiments. The method is based on the theory of support vector machines (SVMs). SVMs are considered a supervised computer learning method because they exploit prior knowledge of gene function to identify unknown genes of similar function from expression data. SVMs avoid several problems associated with unsupervised clustering methods, such as hierarchical clustering and self-organizing maps. SVMs have many mathematical features that make them attractive for gene expression analysis, including their flexibility in choosing a similarity function, sparseness of solution when dealing with large data sets, the ability to handle large feature spaces, and the ability to identify outliers. We test several SVMs that use different similarity metrics, as well as some other supervised learning methods, and find that the SVMs best identify sets of genes with a common function using expression data. Finally, we use SVMs to predict functional roles for uncharacterized yeast ORFs based on their expression data.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Two objects with homologous landmarks are said to be of the same shape if the configurations of landmarks of one object can be exactly matched with that of the other by translation, rotation/reflection, and scaling. The observations on an object are coordinates of its landmarks with reference to a set of orthogonal coordinate axes in an appropriate dimensional space. The origin, choice of units, and orientation of the coordinate axes with respect to an object may be different from object to object. In such a case, how do we quantify the shape of an object, find the mean and variation of shape in a population of objects, compare the mean shapes in two or more different populations, and discriminate between objects belonging to two or more different shape distributions. We develop some methods that are invariant to translation, rotation, and scaling of the observations on each object and thereby provide generalizations of multivariate methods for shape analysis.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

This study examined the genetic and environmental relationships among 5 academic achievement skills of a standardized test of academic achievement, the Queensland Core Skills Test (QCST; Queensland Studies Authority, 2003a). QCST participants included 182 monozygotic pairs and 208 dizygotic pairs (mean 17 years +/- 0.4 standard deviation). IQ data were included in the analysis to correct for ascertainment bias. A genetic general factor explained virtually all genetic variance in the component academic skills scores, and accounted for 32% to 73% of their phenotypic variances. It also explained 56% and 42% of variation in Verbal IQ and Performance IQ respectively, suggesting that this factor is genetic g. Modest specific genetic effects were evident for achievement in mathematical problem solving and written expression. A single common factor adequately explained common environmental effects, which were also modest, and possibly due to assortative mating. The results suggest that general academic ability, derived from genetic influences and to a lesser extent common environmental influences, is the primary source of variation in component skills of the QCST.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

This paper introduces a new technique in the investigation of limited-dependent variable models. This paper illustrates that variable precision rough set theory (VPRS), allied with the use of a modern method of classification, or discretisation of data, can out-perform the more standard approaches that are employed in economics, such as a probit model. These approaches and certain inductive decision tree methods are compared (through a Monte Carlo simulation approach) in the analysis of the decisions reached by the UK Monopolies and Mergers Committee. We show that, particularly in small samples, the VPRS model can improve on more traditional models, both in-sample, and particularly in out-of-sample prediction. A similar improvement in out-of-sample prediction over the decision tree methods is also shown.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

This paper presents the results of a multivariate spatial analysis of 38 vowel formant variables in the language of 402 informants from 236 cities from across the contiguous United States, based on the acoustic data from the Atlas of North American English (Labov, Ash & Boberg, 2006). The results of the analysis both confirm and challenge the results of the Atlas. Most notably, while the analysis identifies similar patterns as the Atlas in the West and the Southeast, the analysis finds that the Midwest and the Northeast are distinct dialect regions that are considerably stronger than the traditional Midland and Northern dialect region indentified in the Atlas. The analysis also finds evidence that a western vowel shift is actively shaping the language of the Western United States.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The microarray technology provides a high-throughput technique to study gene expression. Microarrays can help us diagnose different types of cancers, understand biological processes, assess host responses to drugs and pathogens, find markers for specific diseases, and much more. Microarray experiments generate large amounts of data. Thus, effective data processing and analysis are critical for making reliable inferences from the data. ^ The first part of dissertation addresses the problem of finding an optimal set of genes (biomarkers) to classify a set of samples as diseased or normal. Three statistical gene selection methods (GS, GS-NR, and GS-PCA) were developed to identify a set of genes that best differentiate between samples. A comparative study on different classification tools was performed and the best combinations of gene selection and classifiers for multi-class cancer classification were identified. For most of the benchmarking cancer data sets, the gene selection method proposed in this dissertation, GS, outperformed other gene selection methods. The classifiers based on Random Forests, neural network ensembles, and K-nearest neighbor (KNN) showed consistently god performance. A striking commonality among these classifiers is that they all use a committee-based approach, suggesting that ensemble classification methods are superior. ^ The same biological problem may be studied at different research labs and/or performed using different lab protocols or samples. In such situations, it is important to combine results from these efforts. The second part of the dissertation addresses the problem of pooling the results from different independent experiments to obtain improved results. Four statistical pooling techniques (Fisher inverse chi-square method, Logit method. Stouffer's Z transform method, and Liptak-Stouffer weighted Z-method) were investigated in this dissertation. These pooling techniques were applied to the problem of identifying cell cycle-regulated genes in two different yeast species. As a result, improved sets of cell cycle-regulated genes were identified. The last part of dissertation explores the effectiveness of wavelet data transforms for the task of clustering. Discrete wavelet transforms, with an appropriate choice of wavelet bases, were shown to be effective in producing clusters that were biologically more meaningful. ^

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The purpose of this research was to demonstrate the applicability of reduced-size STR (Miniplex) primer sets to challenging samples and to provide the forensic community with new information regarding the analysis of degraded and inhibited DNA. The Miniplex primer sets were validated in accordance with guidelines set forth by the Scientific Working Group on DNA Analysis Methods (SWGDAM) in order to demonstrate the scientific validity of the kits. The Miniplex sets were also used in the analysis of DNA extracted from human skeletal remains and telogen hair. In addition, a method for evaluating the mechanism of PCR inhibition was developed using qPCR. The Miniplexes were demonstrated to be a robust and sensitive tool for the analysis of DNA with as low as 100 pg of template DNA. They also proved to be better than commercial kits in the analysis of DNA from human skeletal remains, with 64% of samples tested producing full profiles, compared to 16% for a commercial kit. The Miniplexes also produced amplification of nuclear DNA from human telogen hairs, with partial profiles obtained from as low as 60 pg of template DNA. These data suggest smaller PCR amplicons may provide a useful alternative to mitochondrial DNA for forensic analysis of degraded DNA from human skeletal remains, telogen hairs, and other challenging samples. In the evaluation of inhibition by qPCR, the effect of amplicon length and primer melting temperature was evaluated in order to determine the binding mechanisms of different PCR inhibitors. Several mechanisms were indicated by the inhibitors tested, including binding of the polymerase, binding to the DNA, and effects on the processivity of the polymerase during primer extension. The data obtained from qPCR illustrated a method by which the type of inhibitor could be inferred in forensic samples, and some methods of reducing inhibition for specific inhibitors were demonstrated. An understanding of the mechanism of the inhibitors found in forensic samples will allow analysts to select the proper methods for inhibition removal or the type of analysis that can be performed, and will increase the information that can be obtained from inhibited samples.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The primary goal of this dissertation is the study of patterns of viral evolution inferred from serially-sampled sequence data, i.e., sequence data obtained from strains isolated at consecutive time points from a single patient or host. RNA viral populations have an extremely high genetic variability, largely due to their astronomical population sizes within host systems, high replication rate, and short generation time. It is this aspect of their evolution that demands special attention and a different approach when studying the evolutionary relationships of serially-sampled sequence data. New methods that analyze serially-sampled data were developed shortly after a groundbreaking HIV-1 study of several patients from which viruses were isolated at recurring intervals over a period of 10 or more years. These methods assume a tree-like evolutionary model, while many RNA viruses have the capacity to exchange genetic material with one another using a process called recombination. ^ A genealogy involving recombination is best described by a network structure. A more general approach was implemented in a new computational tool, Sliding MinPD, one that is mindful of the sampling times of the input sequences and that reconstructs the viral evolutionary relationships in the form of a network structure with implicit representations of recombination events. The underlying network organization reveals unique patterns of viral evolution and could help explain the emergence of disease-associated mutants and drug-resistant strains, with implications for patient prognosis and treatment strategies. In order to comprehensively test the developed methods and to carry out comparison studies with other methods, synthetic data sets are critical. Therefore, appropriate sequence generators were also developed to simulate the evolution of serially-sampled recombinant viruses, new and more through evaluation criteria for recombination detection methods were established, and three major comparison studies were performed. The newly developed tools were also applied to "real" HIV-1 sequence data and it was shown that the results represented within an evolutionary network structure can be interpreted in biologically meaningful ways. ^

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The exponential growth of studies on the biological response to ocean acidification over the last few decades has generated a large amount of data. To facilitate data comparison, a data compilation hosted at the data publisher PANGAEA was initiated in 2008 and is updated on a regular basis (doi:10.1594/PANGAEA.149999). By January 2015, a total of 581 data sets (over 4 000 000 data points) from 539 papers had been archived. Here we present the developments of this data compilation five years since its first description by Nisumaa et al. (2010). Most of study sites from which data archived are still in the Northern Hemisphere and the number of archived data from studies from the Southern Hemisphere and polar oceans are still relatively low. Data from 60 studies that investigated the response of a mix of organisms or natural communities were all added after 2010, indicating a welcomed shift from the study of individual organisms to communities and ecosystems. The initial imbalance of considerably more data archived on calcification and primary production than on other processes has improved. There is also a clear tendency towards more data archived from multifactorial studies after 2010. For easier and more effective access to ocean acidification data, the ocean acidification community is strongly encouraged to contribute to the data archiving effort, and help develop standard vocabularies describing the variables and define best practices for archiving ocean acidification data.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Visual cluster analysis provides valuable tools that help analysts to understand large data sets in terms of representative clusters and relationships thereof. Often, the found clusters are to be understood in context of belonging categorical, numerical or textual metadata which are given for the data elements. While often not part of the clustering process, such metadata play an important role and need to be considered during the interactive cluster exploration process. Traditionally, linked-views allow to relate (or loosely speaking: correlate) clusters with metadata or other properties of the underlying cluster data. Manually inspecting the distribution of metadata for each cluster in a linked-view approach is tedious, specially for large data sets, where a large search problem arises. Fully interactive search for potentially useful or interesting cluster to metadata relationships may constitute a cumbersome and long process. To remedy this problem, we propose a novel approach for guiding users in discovering interesting relationships between clusters and associated metadata. Its goal is to guide the analyst through the potentially huge search space. We focus in our work on metadata of categorical type, which can be summarized for a cluster in form of a histogram. We start from a given visual cluster representation, and compute certain measures of interestingness defined on the distribution of metadata categories for the clusters. These measures are used to automatically score and rank the clusters for potential interestingness regarding the distribution of categorical metadata. Identified interesting relationships are highlighted in the visual cluster representation for easy inspection by the user. We present a system implementing an encompassing, yet extensible, set of interestingness scores for categorical metadata, which can also be extended to numerical metadata. Appropriate visual representations are provided for showing the visual correlations, as well as the calculated ranking scores. Focusing on clusters of time series data, we test our approach on a large real-world data set of time-oriented scientific research data, demonstrating how specific interesting views are automatically identified, supporting the analyst discovering interesting and visually understandable relationships.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

This paper is part of a special issue of Applied Geochemistry focusing on reliable applications of compositional multivariate statistical methods. This study outlines the application of compositional data analysis (CoDa) to calibration of geochemical data and multivariate statistical modelling of geochemistry and grain-size data from a set of Holocene sedimentary cores from the Ganges-Brahmaputra (G-B) delta. Over the last two decades, understanding near-continuous records of sedimentary sequences has required the use of core-scanning X-ray fluorescence (XRF) spectrometry, for both terrestrial and marine sedimentary sequences. Initial XRF data are generally unusable in ‘raw-format’, requiring data processing in order to remove instrument bias, as well as informed sequence interpretation. The applicability of these conventional calibration equations to core-scanning XRF data are further limited by the constraints posed by unknown measurement geometry and specimen homogeneity, as well as matrix effects. Log-ratio based calibration schemes have been developed and applied to clastic sedimentary sequences focusing mainly on energy dispersive-XRF (ED-XRF) core-scanning. This study has applied high resolution core-scanning XRF to Holocene sedimentary sequences from the tidal-dominated Indian Sundarbans, (Ganges-Brahmaputra delta plain). The Log-Ratio Calibration Equation (LRCE) was applied to a sub-set of core-scan and conventional ED-XRF data to quantify elemental composition. This provides a robust calibration scheme using reduced major axis regression of log-ratio transformed geochemical data. Through partial least squares (PLS) modelling of geochemical and grain-size data, it is possible to derive robust proxy information for the Sundarbans depositional environment. The application of these techniques to Holocene sedimentary data offers an improved methodological framework for unravelling Holocene sedimentation patterns.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Incidental findings on low-dose CT images obtained during hybrid imaging are an increasing phenomenon as CT technology advances. Understanding the diagnostic value of incidental findings along with the technical limitations is important when reporting image results and recommending follow-up, which may result in an additional radiation dose from further diagnostic imaging and an increase in patient anxiety. This study assessed lesions incidentally detected on CT images acquired for attenuation correction on two SPECT/CT systems. Methods: An anthropomorphic chest phantom containing simulated lesions of varying size and density was imaged on an Infinia Hawkeye 4 and a Symbia T6 using the low-dose CT settings applied for attenuation correction acquisitions in myocardial perfusion imaging. Twenty-two interpreters assessed 46 images from each SPECT/CT system (15 normal images and 31 abnormal images; 41 lesions). Data were evaluated using a jackknife alternative free-response receiver-operating-characteristic analysis (JAFROC). Results: JAFROC analysis showed a significant difference (P < 0.0001) in lesion detection, with the figures of merit being 0.599 (95% confidence interval, 0.568, 0.631) and 0.810 (95% confidence interval, 0.781, 0.839) for the Infinia Hawkeye 4 and Symbia T6, respectively. Lesion detection on the Infinia Hawkeye 4 was generally limited to larger, higher-density lesions. The Symbia T6 allowed improved detection rates for midsized lesions and some lower-density lesions. However, interpreters struggled to detect small (5 mm) lesions on both image sets, irrespective of density. Conclusion: Lesion detection is more reliable on low-dose CT images from the Symbia T6 than from the Infinia Hawkeye 4. This phantom-based study gives an indication of potential lesion detection in the clinical context as shown by two commonly used SPECT/CT systems, which may assist the clinician in determining whether further diagnostic imaging is justified.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Crosswell data set contains a range of angles limited only by the geometry of the source and receiver configuration, the separation of the boreholes and the depth to the target. However, the wide angles reflections present in crosswell imaging result in amplitude-versus-angle (AVA) features not usually observed in surface data. These features include reflections from angles that are near critical and beyond critical for many of the interfaces; some of these reflections are visible only for a small range of angles, presumably near their critical angle. High-resolution crosswell seismic surveys were conducted over a Silurian (Niagaran) reef at two fields in northern Michigan, Springdale and Coldspring. The Springdale wells extended to much greater depths than the reef, and imaging was conducted from above and from beneath the reef. Combining the results from images obtained from above with those from beneath provides additional information, by exhibiting ranges of angles that are different for the two images, especially for reflectors at shallow depths, and second, by providing additional constraints on the solutions for Zoeppritz equations. Inversion of seismic data for impedance has become a standard part of the workflow for quantitative reservoir characterization. Inversion of crosswell data using either deterministic or geostatistical methods can lead to poor results with phase change beyond the critical angle, however, the simultaneous pre-stack inversion of partial angle stacks may be best conducted with restrictions to angles less than critical. Deterministic inversion is designed to yield only a single model of elastic properties (best-fit), while the geostatistical inversion produces multiple models (realizations) of elastic properties, lithology and reservoir properties. Geostatistical inversion produces results with far more detail than deterministic inversion. The magnitude of difference in details between both types of inversion becomes increasingly pronounced for thinner reservoirs, particularly those beyond the vertical resolution of the seismic. For any interface imaged from above and from beneath, the results AVA characters must result from identical contrasts in elastic properties in the two sets of images, albeit in reverse order. An inversion approach to handle both datasets simultaneously, at pre-critical angles, is demonstrated in this work. The main exploration problem for carbonate reefs is determining the porosity distribution. Images of elastic properties, obtained from deterministic and geostatistical simultaneous inversion of a high-resolution crosswell seismic survey were used to obtain the internal structure and reservoir properties (porosity) of Niagaran Michigan reef. The images obtained are the best of any Niagaran pinnacle reef to date.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

To investigate the degree of T2 relaxometry changes over time in groups of patients with familial mesial temporal lobe epilepsy (FMTLE) and asymptomatic relatives. We conducted both cross-sectional and longitudinal analyses of T2 relaxometry with Aftervoxel, an in-house software for medical image visualization. The cross-sectional study included 35 subjects (26 with FMTLE and 9 asymptomatic relatives) and 40 controls; the longitudinal study was composed of 30 subjects (21 with FMTLE and 9 asymptomatic relatives; the mean time interval of MRIs was 4.4 ± 1.5 years) and 16 controls. To increase the size of our groups of patients and relatives, we combined data acquired in 2 scanners (2T and 3T) and obtained z-scores using their respective controls. General linear model on SPSS21® was used for statistical analysis. In the cross-sectional analysis, elevated T2 relaxometry was identified for subjects with seizures and intermediate values for asymptomatic relatives compared to controls. Subjects with MRI signs of hippocampal sclerosis presented elevated T2 relaxometry in the ipsilateral hippocampus, while patients and asymptomatic relatives with normal MRI presented elevated T2 values in the right hippocampus. The longitudinal analysis revealed a significant increase in T2 relaxometry for the ipsilateral hippocampus exclusively in patients with seizures. The longitudinal increase of T2 signal in patients with seizures suggests the existence of an interaction between ongoing seizures and the underlying pathology, causing progressive damage to the hippocampus. The identification of elevated T2 relaxometry in asymptomatic relatives and in patients with normal MRI suggests that genetic factors may be involved in the development of some mild hippocampal abnormalities in FMTLE.