991 results for "Dimension reduction"


Relevance: 60.00%

Abstract:

Motivation: Gene Set Enrichment Analysis (GSEA) has been developed recently to capture moderate but coordinated changes in the expression of sets of functionally related genes. We propose a number of extensions to GSEA that use different statistics to describe the association between genes and the phenotype of interest. We make use of dimension reduction procedures, such as principal component analysis, to identify gene sets containing coordinated genes. We also address the problem of overlap among gene sets. Results: We applied our methods to data from a clinical trial in acute lymphoblastic leukemia (ALL) [1]. We identified interesting gene sets using the different statistics and found that gender may affect gene expression in addition to the phenotype effects. Investigating the overlap among interesting gene sets indicates that overlap can alter the interpretation of significant results.
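
A minimal sketch of the dimension-reduction idea described above: summarising a gene set by its first principal component and testing the association of that summary with a two-group phenotype. The toy data, sample sizes, and the t-test used here are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Toy expression matrix: 50 samples x 20 genes in one gene set,
# with a binary phenotype label for each sample.
n_samples, n_genes = 50, 20
phenotype = np.repeat([0, 1], n_samples // 2)
expr = rng.normal(size=(n_samples, n_genes))
expr[phenotype == 1, :5] += 1.0          # coordinated shift in a few genes

# First principal component of the gene set (PCA via SVD of centered data).
centered = expr - expr.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
set_score = centered @ vt[0]             # per-sample gene-set summary

# Test whether the PCA summary differs between phenotype groups.
t_stat, p_val = stats.ttest_ind(set_score[phenotype == 0],
                                set_score[phenotype == 1])
print(f"t = {t_stat:.2f}, p = {p_val:.3g}")
```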

Relevance: 60.00%

Abstract:

The advances in computational biology have made simultaneous monitoring of thousands of features possible. High-throughput technologies not only bring about a much richer information context in which to study various aspects of gene function, but they also present the challenge of analyzing data with a large number of covariates and few samples. As an integral part of machine learning, classification of samples into two or more categories is almost always of interest to scientists. In this paper, we address the question of classification in this setting by extending partial least squares (PLS), a popular dimension reduction tool in chemometrics, to the context of generalized linear regression, building on the Iteratively ReWeighted Partial Least Squares (IRWPLS) approach (Marx, 1996). We compare our results with two-stage PLS (Nguyen and Rocke, 2002A; Nguyen and Rocke, 2002B) and with other classifiers. We show that by casting the problem in a generalized linear model setting and applying a bias correction to the likelihood to avoid (quasi-)separation, we often obtain lower classification error rates.
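
As a rough illustration of the two-stage comparator mentioned above (PLS dimension reduction followed by a classifier), rather than of the IRWPLS extension itself, one might proceed as below; the data shapes and number of components are arbitrary assumptions.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# "Large p, small n" toy data: 40 samples, 500 covariates, binary class.
X = rng.normal(size=(40, 500))
y = rng.integers(0, 2, size=40)
X[y == 1, :10] += 0.8                    # weak signal in a few covariates

# Stage 1: PLS extracts a handful of latent components supervised by y.
pls = PLSRegression(n_components=3)
scores = pls.fit_transform(X, y)[0]      # x-scores of the training samples

# Stage 2: fit a classifier in the low-dimensional score space.
clf = LogisticRegression().fit(scores, y)
print("training accuracy:", clf.score(scores, y))
```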

Relevance: 60.00%

Abstract:

New biotechnologies make it possible to obtain information for characterizing genetic material from multiple markers, whether molecular and/or morphological. The ordination of genetic material through the exploration of multidimensional patterns of variability is addressed with various multivariate analysis techniques. Multivariate dimension-reduction techniques (DRTs) and their graphical representation are of substantial importance for visualizing multivariate data in low-dimensional spaces, since they facilitate the interpretation of the interrelationships among the variables (markers) and among the cases or observations under analysis. Principal Component Analysis, Principal Coordinate Analysis and Generalized Procrustes Analysis are all DRTs applicable to data from molecular and/or morphological markers. Minimum spanning trees and biplots are techniques for obtaining geometric representations of results from DRTs. This work describes these multivariate techniques and illustrates their application on two data sets, molecular and morphological, used to characterize fungal genetic material.
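
A small sketch of one of the techniques described, Principal Coordinate Analysis (classical scaling) applied to a marker-based distance matrix; the toy marker data and mismatch distance are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy binary molecular-marker matrix: 8 accessions x 30 markers.
markers = rng.integers(0, 2, size=(8, 30))

# Simple mismatch (Hamming) distance between accessions.
D = (markers[:, None, :] != markers[None, :, :]).mean(axis=2)

# Principal Coordinate Analysis: double-centre the squared distances
# and take the leading eigenvectors scaled by sqrt(eigenvalue).
n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (D ** 2) @ J
eigvals, eigvecs = np.linalg.eigh(B)
order = np.argsort(eigvals)[::-1]
coords = eigvecs[:, order[:2]] * np.sqrt(np.maximum(eigvals[order[:2]], 0))
print(coords)            # 2-D ordination of the accessions
```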

Relevance: 60.00%

Abstract:

The monkey anterior intraparietal area (AIP) encodes visual information about three-dimensional object shape that is used to shape the hand for grasping. We modeled shape tuning in visual AIP neurons and its relationship with curvature and gradient information from the caudal intraparietal area (CIP). The main goal was to gain insight into the kinds of shape parameterizations that can account for AIP tuning and that are consistent with both the inputs to AIP and the role of AIP in grasping. We first experimented with superquadric shape parameters. We considered superquadrics because they occupy a role in robotics that is similar to AIP, in that superquadric fits are derived from visual input and used for grasp planning. We also experimented with an alternative shape parameterization that was based on an Isomap dimension reduction of spatial derivatives of depth (i.e., distance from the observer to the object surface). We considered an Isomap-based model because its parameters lacked discontinuities between similar shapes. When we matched the dimension of the Isomap to the number of superquadric parameters, the superquadric model fit the AIP data somewhat more closely. However, higher-dimensional Isomaps provided excellent fits. Also, we found that the Isomap parameters could be approximated much more accurately than superquadric parameters by feedforward neural networks with CIP-like inputs. We conclude that Isomaps, or perhaps alternative dimension reductions of visual inputs to AIP, provide a promising model of AIP electrophysiology data. Further work is needed to test whether such shape parameterizations actually provide an effective basis for grasp control.
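
A hedged sketch of the kind of parameterization described: an Isomap embedding of spatial derivatives of synthetic depth maps. The surface family, grid size, and neighbourhood size are placeholders, not the values used in the study.

```python
import numpy as np
from sklearn.manifold import Isomap

rng = np.random.default_rng(3)

# Synthetic "depth maps": 200 smooth surfaces sampled on a 16x16 grid.
grid = np.linspace(-1, 1, 16)
xx, yy = np.meshgrid(grid, grid)
depth = np.stack([
    a * xx**2 + b * yy**2 + c * xx * yy
    for a, b, c in rng.uniform(-1, 1, size=(200, 3))
])

# First spatial derivatives of depth (surface gradient), flattened per shape.
gx, gy = np.gradient(depth, axis=(2, 1))
features = np.concatenate([gx.reshape(200, -1), gy.reshape(200, -1)], axis=1)

# Nonlinear dimension reduction of the derivative features.
emb = Isomap(n_neighbors=10, n_components=8).fit_transform(features)
print(emb.shape)         # (200, 8) low-dimensional shape parameters
```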

Relevance: 60.00%

Abstract:

The monkey anterior intraparietal area (AIP) encodes visual information about three-dimensional object shape that is used to shape the hand for grasping. In robotics a similar role has been played by modules that fit point cloud data to the superquadric family of shapes and its various extensions. We developed a model of shape tuning in AIP based on cosine tuning to superquadric parameters. However, the model did not fit the data well, and we also found that it was difficult to accurately reproduce these parameters using neural networks with the appropriate inputs (modelled on the caudal intraparietal area, CIP). The latter difficulty was related to the fact that there are large discontinuities in the superquadric parameters between very similar shapes. To address these limitations we adopted an alternative shape parameterization based on an Isomap nonlinear dimension reduction. The Isomap was built using gradients and curvatures of object surface depth. This alternative parameterization was low-dimensional (like superquadrics), but data-driven (similar to an alternative clustering approach that is also sometimes used in robotics) and lacked large discontinuities. Isomaps with 16 or more dimensions reproduced the AIP data fairly well. Moreover, we found that the Isomap parameters could be approximated from CIP-like input much more accurately than the superquadric parameters. We conclude that Isomaps, or perhaps alternative dimension reductions of CIP signals, provide a promising model of AIP tuning. We have now started to integrate our model with a robot hand, to explore the efficacy of Isomap shape reductions in grasp planning. Future work will consider dynamics of spike responses and integration with related visual and motor area models.
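
The claim that Isomap coordinates can be approximated from CIP-like inputs by a feedforward network could be prototyped along the following lines; the random stand-in "CIP-like" features and the network size are illustrative assumptions only, not the model used in the study.

```python
import numpy as np
from sklearn.manifold import Isomap
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)

# Stand-in "CIP-like" inputs (e.g., responses tuned to surface gradient
# and curvature) derived from underlying shape features.
shape_features = rng.normal(size=(300, 40))
cip_inputs = np.tanh(shape_features @ rng.normal(size=(40, 60)))

# Target: low-dimensional Isomap parameterization of the shapes.
targets = Isomap(n_neighbors=12, n_components=6).fit_transform(shape_features)

# Train a simple feedforward network to map CIP-like input to Isomap coords.
X_tr, X_te, y_tr, y_te = train_test_split(cip_inputs, targets, random_state=0)
net = MLPRegressor(hidden_layer_sizes=(50,), max_iter=2000, random_state=0)
net.fit(X_tr, y_tr)
print("held-out R^2:", net.score(X_te, y_te))
```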

Relevance: 60.00%

Abstract:

The elastic net and related algorithms, such as generative topographic mapping, are key methods for discretized dimension-reduction problems. At their heart are priors that specify the expected topological and geometric properties of the maps. However, up to now, only a very small subset of possible priors has been considered. Here we study a much more general family originating from discrete, high-order derivative operators. We show theoretically that the form of the discrete approximation to the derivative used has a crucial influence on the resulting map. Using a new and more powerful iterative elastic net algorithm, we confirm these results empirically, and illustrate how different priors affect the form of simulated ocular dominance columns.
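
To make the role of the derivative-based prior concrete, here is a minimal sketch (not the authors' algorithm) of a smoothness penalty built from a discrete second-order derivative operator on a 1-D map, and gradient steps that trade data fit against that prior; the targets, weights, and step size are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

# 1-D map of 50 centres being fitted to 50 noisy target points.
n = 50
targets = np.sin(np.linspace(0, 2 * np.pi, n)) + 0.1 * rng.normal(size=n)
y = rng.normal(size=n)                      # current map positions

# Discrete second-derivative operator D (one choice of approximation;
# the paper's point is that this choice shapes the resulting map).
D = np.zeros((n - 2, n))
for i in range(n - 2):
    D[i, i:i + 3] = [1.0, -2.0, 1.0]

beta, step = 1.0, 0.05
for _ in range(500):
    grad_fit = y - targets                  # gradient of 0.5*||y - targets||^2
    grad_prior = beta * (D.T @ D @ y)       # gradient of 0.5*beta*||D y||^2
    y -= step * (grad_fit + grad_prior)

print("prior energy:", 0.5 * beta * np.sum((D @ y) ** 2))
```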

Relevance: 60.00%

Abstract:

This preliminary report describes work carried out as part of work package 1.2 of the MUCM research project. The report is split into two parts: the first part (Sections 1 and 2) summarises the state of the art in emulation of computer models, while the second presents some initial work on the emulation of dynamic models. In the first part, we describe the basics of emulation, introduce the notation and put together the key results for the emulation of models with single and multiple outputs, with or without the use of a mean function. In the second part, we present preliminary results on the chaotic Lorenz 63 model. We look at emulation of a single time step, and repeated application of the emulator for sequential prediction. After some design considerations, the emulator is compared with the exact simulator on a number of runs to assess its performance. Several general issues related to emulating dynamic models are raised and discussed. Current work on the larger Lorenz 96 model (40 variables) is presented in the context of dimension reduction, with results to be provided in a follow-up report. The notation used in this report is summarised in the appendix.
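
An informal sketch of the single-time-step emulation idea for Lorenz 63, assuming a Gaussian process regressor per output dimension and a simple Euler integrator as the "simulator"; all settings here are placeholders rather than those used in the MUCM work.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def lorenz63_step(state, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """One Euler step of the Lorenz 63 simulator (the model being emulated)."""
    x, y, z = state
    return state + dt * np.array([sigma * (y - x),
                                  x * (rho - z) - y,
                                  x * y - beta * z])

rng = np.random.default_rng(6)

# Design: simulator runs from scattered initial states (training data).
X_train = rng.uniform(-20, 20, size=(200, 3))
Y_train = np.array([lorenz63_step(s) for s in X_train])

# One independent GP emulator per output coordinate of the next state.
gps = [GaussianProcessRegressor(kernel=RBF(length_scale=10.0),
                                alpha=1e-6, normalize_y=True)
       .fit(X_train, Y_train[:, j]) for j in range(3)]

# Sequential prediction: feed the emulator its own output repeatedly.
state = np.array([1.0, 1.0, 1.0])
for _ in range(50):
    state = np.array([gp.predict(state.reshape(1, -1))[0] for gp in gps])
print("emulated state after 50 steps:", state)
```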

Relevance: 60.00%

Abstract:

Secondary pyrolysis in fluidized bed fast pyrolysis of biomass is the focus of this work. A novel computational fluid dynamics (CFD) model coupled with a comprehensive chemistry scheme (134 species and 4169 reactions, in CHEMKIN format) has been developed to investigate this complex phenomenon. Previous results from a transient three-dimensional model of primary pyrolysis were used as the source terms of primary products in this model. A parametric study of reaction atmospheres (H2O, N2, H2, CO2, CO) has been performed. For the N2 and H2O atmospheres, results of the model compared favorably to experimentally obtained yields after the temperature was adjusted to a value higher than that used in the experiments. Notable deviations from the experiments are the pyrolytic water yield and the yield of higher hydrocarbons. The model suggests that the impact of the reaction atmosphere is not overly strong; however, both chemical and physical effects were observed. Most notably, effects could be seen on the yields of various compounds, the temperature profile throughout the reactor system, residence time, radical concentration, and turbulent intensity. At the investigated temperature (873 K), turbulent intensity appeared to have the strongest influence on liquid yield. With the aid of acceleration techniques, most importantly dimension reduction, chemistry agglomeration, and in-situ tabulation, a converged solution could be obtained within a reasonable time (∼30 h). As such, a new, potentially useful method has been suggested for numerical analysis of fast pyrolysis.

Relevance: 60.00%

Abstract:

Homogeneous secondary pyrolysis is a category of reactions that follows primary pyrolysis and is presumed to be important in fast pyrolysis. To handle the comprehensive chemistry together with the fluid dynamics, a probability density function (PDF) approach is used, with a kinetic scheme comprising 134 species and 4169 reactions. With the aid of acceleration techniques, most importantly dimension reduction, chemistry agglomeration and in-situ tabulation (ISAT), a solution was obtained within a reasonable time. More work is required; however, a solution has been obtained for levoglucosan (C6H10O5) fed through the inlet with the fluidizing gas at 500 °C. 88.6% of the levoglucosan remained non-decomposed, and 19 different decomposition product species were found above 0.01% by weight. The proposed homogeneous secondary pyrolysis scheme can thus be implemented in a CFD environment, and acceleration techniques can speed up the calculation for application in engineering settings.

Relevance: 60.00%

Abstract:

Popular dimension reduction and visualisation algorithms, for instance Metric Multidimensional Scaling, t-distributed Stochastic Neighbour Embedding and the Gaussian Process Latent Variable Model, rely on the assumption that the input dissimilarities are Euclidean. It is well known that this assumption does not hold for most datasets, and high-dimensional data often lie on a manifold of unknown global geometry. We present a method for improving the manifold charting process, coupled with Elastic MDS, such that we no longer assume that the manifold is Euclidean, or of any particular structure. We draw on the benefits of different dissimilarity measures, allowing their relative responsibilities under a linear combination to drive the visualisation process.
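
A rough sketch of the general idea of driving an embedding with a linear combination of dissimilarity measures. Here the mixing weight is fixed by hand and fed to standard metric MDS rather than learned as relative responsibilities, and the two measures are assumptions chosen purely for illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.manifold import MDS

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 20))

# Two different dissimilarity measures over the same data.
d_euclid = cdist(X, X, metric="euclidean")
d_cosine = cdist(X, X, metric="cosine")

# Linear combination with a hand-picked mixing weight.
w = 0.7
d_combined = w * d_euclid / d_euclid.max() + (1 - w) * d_cosine / d_cosine.max()

# Embed the combined dissimilarities in 2-D for visualisation.
emb = MDS(n_components=2, dissimilarity="precomputed",
          random_state=0).fit_transform(d_combined)
print(emb.shape)
```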

Relevance: 60.00%

Abstract:

As massive data sets become increasingly available, people are facing the problem of how to effectively process and understand them. Traditional sequential computing models are giving way to parallel and distributed computing models, such as MapReduce, due both to the large size of the data sets and to their high dimensionality. This dissertation, in the same direction as other research based on MapReduce, develops effective techniques and applications using MapReduce that can help people solve large-scale problems. Three different problems are tackled in the dissertation. The first deals with processing terabytes of raster data in a spatial data management system: aerial imagery files are broken into tiles to enable data-parallel computation. The second and third problems deal with dimension reduction techniques that can be used to handle data sets of high dimensionality. Three variants of the nonnegative matrix factorization technique are scaled up to factorize matrices with dimensions in the order of millions in MapReduce, based on different matrix multiplication implementations. Two algorithms, which compute CANDECOMP/PARAFAC and Tucker tensor decompositions respectively, are parallelized in MapReduce by carefully partitioning the data and arranging the computation to maximize data locality and parallelism.
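
For orientation, a single-machine sketch of the core computation that the dissertation scales up with MapReduce: nonnegative matrix factorization by multiplicative updates. Matrix sizes and rank are tiny placeholders, and the distributed matrix-multiplication strategies themselves are not shown.

```python
import numpy as np

rng = np.random.default_rng(8)

# Small nonnegative data matrix (the MapReduce versions handle
# dimensions in the millions by distributing these multiplications).
A = rng.random((200, 80))
rank = 10
W = rng.random((200, rank))
H = rng.random((rank, 80))

eps = 1e-9
for _ in range(200):
    # Lee-Seung multiplicative updates for min ||A - WH||_F^2 with W, H >= 0.
    H *= (W.T @ A) / (W.T @ W @ H + eps)
    W *= (A @ H.T) / (W @ H @ H.T + eps)

print("reconstruction error:", np.linalg.norm(A - W @ H))
```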

Relevance: 60.00%

Abstract:

Constant technology advances have caused a data explosion in recent years. Accordingly, modern statistical and machine learning methods must be adapted to deal with complex and heterogeneous data types. This is particularly true for analyzing biological data. For example, DNA sequence data can be viewed as categorical variables, with each nucleotide taking one of four categories. Gene expression data, depending on the quantification technology, may be continuous measurements or counts. With the advancement of high-throughput technology, such data have become unprecedentedly abundant. Therefore, efficient statistical approaches are crucial in this big data era.

Previous statistical methods for big data often aim to find low-dimensional structures in the observed data. For example, a factor analysis model assumes a latent Gaussian-distributed multivariate vector; with this assumption, the factor model produces a low-rank estimate of the covariance of the observed variables. Another example is the latent Dirichlet allocation model for documents, which assumes Dirichlet-distributed mixture proportions of topics. This dissertation proposes several novel extensions of these statistical methods, developed to address challenges in big data. The novel methods are applied in multiple real-world applications, including construction of condition-specific gene co-expression networks, estimating shared topics among newsgroups, analysis of promoter sequences, analysis of political-economic risk data, and estimating population structure from genotype data.
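
A brief sketch of the low-rank structure mentioned above: a fitted factor analysis model implies an estimated covariance of the form loadings times loadings-transposed plus a diagonal noise term. The simulated data and number of factors are arbitrary assumptions.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(9)

# Data generated from a 3-factor model with 15 observed variables.
loadings_true = rng.normal(size=(15, 3))
latent = rng.normal(size=(500, 3))
X = latent @ loadings_true.T + 0.5 * rng.normal(size=(500, 15))

fa = FactorAnalysis(n_components=3).fit(X)

# Implied covariance: low-rank part plus diagonal noise variances.
cov_model = fa.components_.T @ fa.components_ + np.diag(fa.noise_variance_)
cov_sample = np.cov(X, rowvar=False)
print("max abs difference:", np.abs(cov_model - cov_sample).max())
```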

Relevance: 60.00%

Abstract:

This thesis introduces two related lines of study on classification of hyperspectral images with nonlinear methods. First, it describes a quantitative and systematic evaluation, by the author, of each major component in a pipeline for classifying hyperspectral images (HSI) developed earlier in a joint collaboration [23]. The pipeline, with its novel use of nonlinear classification methods, has reached beyond the state of the art in classification accuracy on commonly used benchmarking HSI data [6], [13]. More importantly, it provides a clutter map, with respect to a predetermined set of classes, oriented toward real application situations in which image pixels do not necessarily fall into a predetermined set of classes to be identified, detected or classified.

The particular components evaluated are a) band selection with band-wise entropy spread, b) feature transformation with spatial filters and spectral expansion with derivatives, c) graph spectral transformation via locally linear embedding for dimension reduction, and d) statistical ensemble for clutter detection. The quantitative evaluation of the pipeline verifies that these components are indispensable to high-accuracy classification.
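
As an informal illustration of component (c), a locally linear embedding of pixel spectra for dimension reduction; the synthetic spectra, neighbourhood size, and output dimension are assumptions and not the pipeline's actual settings.

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

rng = np.random.default_rng(10)

# Synthetic hyperspectral pixels: 500 pixels x 100 spectral bands,
# generated as mixtures of three smooth endmember spectra.
bands = np.linspace(0, 1, 100)
endmembers = np.stack([np.exp(-((bands - c) ** 2) / 0.02)
                       for c in (0.2, 0.5, 0.8)])
abundances = rng.dirichlet(np.ones(3), size=500)
pixels = abundances @ endmembers + 0.01 * rng.normal(size=(500, 100))

# Graph-based nonlinear dimension reduction of the spectra.
lle = LocallyLinearEmbedding(n_neighbors=15, n_components=3)
embedded = lle.fit_transform(pixels)
print(embedded.shape)      # (500, 3) features for the downstream classifier
```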

Secondly, the work extends the HSI classification pipeline from a single HSI data cube to multiple HSI data cubes. Each cube, with its own feature variation, is to be classified into multiple classes. The main challenge is deriving the cube-wise classification from the pixel-wise classification. The thesis presents an initial attempt to circumvent it and discusses the potential for further improvement.

Relevance: 60.00%

Abstract:

Microsecond-long Molecular Dynamics (MD) trajectories of biomolecular processes are now possible due to advances in computer technology. Soon, trajectories long enough to probe dynamics over many milliseconds will become available. Since these timescales match the physiological timescales over which many small proteins fold, all-atom MD simulations of protein folding are now becoming popular. To distill features of such large folding trajectories, we must develop methods that can both compress trajectory data to enable visualization and lend themselves to further analysis, such as the finding of collective coordinates and reduction of the dynamics. Conventionally, clustering has been the most popular MD trajectory analysis technique, followed by principal component analysis (PCA). Simple clustering used in MD trajectory analysis suffers from serious drawbacks: (i) it is not data driven, (ii) it is unstable to noise and to changes in cutoff parameters, and (iii) since it does not take into account interrelationships amongst data points, the separation of data into clusters can often be artificial. Usually, partitions generated by clustering techniques are validated visually, but such validation is not possible for MD trajectories of protein folding, as the underlying structural transitions are not well understood. Rigorous cluster validation techniques may be adapted, but it is more crucial to reduce the dimensions in which MD trajectories reside while still preserving their salient features. PCA has often been used for dimension reduction, and while it is computationally inexpensive, being a linear method it does not achieve good data compression. In this thesis, I propose a different method, a nonmetric multidimensional scaling (nMDS) technique, which achieves superior data compression by virtue of being nonlinear and also provides clear insight into the structural processes underlying MD trajectories.

I illustrate the capabilities of nMDS by analyzing three complete villin headpiece folding trajectories and six norleucine mutant (NLE) folding trajectories simulated by Freddolino and Schulten [1]. Using these trajectories, I make comparisons between nMDS, PCA and clustering to demonstrate the superiority of nMDS. The three villin headpiece trajectories showed great structural heterogeneity. Apart from a few trivial features like early formation of secondary structure, no commonalities between trajectories were found. There were no units of residues or atoms found moving in concert across the trajectories. A flipping transition, corresponding to the flipping of helix 1 relative to the plane formed by helices 2 and 3, was observed towards the end of the folding process in all trajectories, when nearly all native contacts had been formed. However, the transition occurred through a different series of steps in each trajectory, indicating that it may not be a common transition in villin folding. The trajectories showed competition between local structure formation/hydrophobic collapse and global structure formation.

Our analysis of the NLE trajectories confirms the notion that a tight hydrophobic core inhibits correct 3-D rearrangement. Only one of the six NLE trajectories folded, and it showed no flipping transition; all the other trajectories became trapped in hydrophobically collapsed states. The NLE residues were found to be buried deeply in the core, compared to the corresponding lysines in the villin headpiece, thereby making the core tighter and harder to undo for 3-D rearrangement. Our results suggest that the NLE mutant may not be as fast a folder as experiments suggest. The tightness of the hydrophobic core may be a very important factor in the folding of larger proteins. It is likely that chaperones like GroEL act to undo the tight hydrophobic core of proteins, after most secondary structure elements have been formed, so that global rearrangement is easier. I conclude by presenting facts about chaperone-protein complexes and propose further directions for the study of protein folding.
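
A compact sketch of the nonmetric MDS step described in the thesis, applied to a toy pairwise-dissimilarity matrix standing in for frame-to-frame structural distances (e.g., RMSD); the random data and embedding dimension are assumptions.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

rng = np.random.default_rng(11)

# Stand-in for an MD trajectory: 300 "frames" described by 30 coordinates,
# correlated in time via a random walk.
frames = np.cumsum(rng.normal(size=(300, 30)), axis=0)

# Pairwise structural dissimilarities between frames.
D = squareform(pdist(frames))

# Nonmetric MDS: preserve only the rank order of the dissimilarities.
nmds = MDS(n_components=2, metric=False, dissimilarity="precomputed",
           random_state=0)
embedded = nmds.fit_transform(D)
print("stress:", nmds.stress_)
```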

Relevance: 60.00%

Abstract:

Undoubtedly, statistics has become one of the most important subjects in the modern world, where its applications are ubiquitous. The importance of statistics is not limited to statisticians; it also impacts upon non-statisticians who have to use statistics within their own disciplines. Several studies have indicated that most academic departments around the world have realized the importance of statistics to non-specialist students. Therefore, the number of students enrolled in statistics courses has vastly increased, coming from a variety of disciplines. Consequently, research within the scope of statistics education has been able to develop throughout the last few years. One important issue is how statistics is best taught to, and learned by, non-specialist students. This issue is influenced by several factors that affect the learning and teaching of statistics to non-specialist students, such as the use of technology, the role of the English language (especially for those whose first language is not English), the effectiveness of statistics teachers and their approach towards teaching statistics courses, students' motivation to learn statistics and the relevance of statistics courses to the main subjects of non-specialist students. Several studies focused on aspects of learning and teaching statistics have been conducted in different countries around the world, particularly in Western countries. Conversely, the situation in Arab countries, especially in Saudi Arabia, is different; there is very little research in this scope, and what there is does not meet those countries' needs for the development of learning and teaching statistics to non-specialist students. This research was instituted in order to develop the field of statistics education. The purpose of this mixed-methods study was to generate new insights into this subject by investigating how statistics courses are currently taught to non-specialist students in Saudi universities. Hence, this study will contribute towards filling the knowledge gap that exists in Saudi Arabia. The study used multiple data collection approaches, including questionnaire surveys of 1053 non-specialist students who had completed at least one statistics course in different colleges of the universities in Saudi Arabia. These surveys were followed up with qualitative data collected via semi-structured interviews with 16 teachers of statistics from colleges within all six universities where statistics is taught to non-specialist students in Saudi Arabia's Eastern Region. The questionnaire data included several types, so different techniques were used in the analysis. Descriptive statistics were used to identify the demographic characteristics of the participants. The chi-square test was used to determine associations between variables. Based on the main issues raised in the literature review, the questions (item scales) were grouped into five key groups: 1) Effectiveness of Teachers; 2) English Language; 3) Relevance of Course; 4) Student Engagement; 5) Using Technology. Exploratory data analysis was used to explore these issues in more detail. Furthermore, given the clustering in the data (students within departments, within colleges, within universities), multilevel generalized linear models for dichotomous responses were used to clarify the effects of clustering at those levels. Factor analysis was conducted, confirming the dimension reduction of the variables (item scales).

The data from the teachers' interviews were analysed on an individual basis. The responses were assigned to one of eight themes that emerged from the data: 1) the lack of students' motivation to learn statistics; 2) students' participation; 3) students' assessment; 4) the effective use of technology; 5) the level of previous mathematical and statistical skills of non-specialist students; 6) the English language ability of non-specialist students; 7) the need for extra time for teaching and learning statistics; and 8) the role of administrators. All the data from students and teachers indicated that the situation of learning and teaching statistics to non-specialist students in Saudi universities needs to be improved in order to meet the needs of those students. The findings suggested a weakness in the use of statistical software applications in these courses: there is a lack of application of technology, such as statistical software programs, that would allow non-specialist students to consolidate their knowledge. The results also indicated that the English language is considered one of the main challenges in learning and teaching statistics, particularly in institutions where English is not used as the main language. Moreover, the weakness of students' mathematical skills is considered another major challenge. Additionally, the results indicated that there is a need to tailor statistics courses to the needs of non-specialist students based on their main subjects. The findings indicate that statistics teachers need to choose appropriate methods when teaching statistics courses.