966 results for Positive Matrix Factorization
Abstract:
In this paper, we introduce an application of matrix factorization to produce corpus-derived, distributional
models of semantics that demonstrate cognitive plausibility. We find that word representations
learned by Non-Negative Sparse Embedding (NNSE), a variant of matrix factorization, are sparse,
effective, and highly interpretable. To the best of our knowledge, this is the first approach which
yields semantic representations of words satisfying these three desirable properties. Through extensive
experimental evaluations on multiple real-world tasks and datasets, we demonstrate the superiority
of semantic models learned by NNSE over other state-of-the-art baselines.
Abstract:
In the study of complex genetic diseases, identifying subgroups of patients who share similar genetic characteristics is a challenging task that can, for example, improve treatment decisions. One type of genetic lesion frequently investigated in such disorders is the change of the DNA copy number (CN) at specific genomic traits. Non-negative Matrix Factorization (NMF) is a standard technique to reduce the dimensionality of a data set and to cluster data samples, while keeping its most relevant information in meaningful components. Thus, it can be used to discover subgroups of patients from CN profiles. It is, however, computationally impractical for very high-dimensional data, such as CN microarray data. Deciding the most suitable number of subgroups is also a challenging problem. The aim of this work is to derive a procedure to compact high-dimensional data, in order to improve NMF applicability without compromising the quality of the clustering. This is particularly important for analyzing high-resolution microarray data. Many commonly used quality measures, as well as our own measures, are employed to decide the number of subgroups and to assess the quality of the results. Our measures are based on the idea of identifying robust subgroups, inspired by biological/clinical relevance instead of simply aiming at well-separated clusters. We evaluate our procedure using four real independent data sets. In these data sets, our method was able to find accurate subgroups with individual molecular and clinical features and outperformed the standard NMF in terms of accuracy in the factorization fitness function. Hence, it can be useful for the discovery of subgroups of patients with similar CN profiles in the study of heterogeneous diseases.
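As a rough illustration of how NMF can be used to cluster samples, the sketch below factors a small non-negative matrix with the standard Lee-Seung multiplicative updates and assigns each sample to its dominant component. It is not the compaction procedure of the abstract above; the toy matrix, the function names `nmf` and `assign_subgroups`, and all parameter values are assumptions made for illustration.

```python
import numpy as np

def nmf(X, rank, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix X (features x samples) as W @ H."""
    rng = np.random.default_rng(seed)
    n_features, n_samples = X.shape
    W = rng.random((n_features, rank)) + eps
    H = rng.random((rank, n_samples)) + eps
    for _ in range(n_iter):
        # Lee-Seung multiplicative updates for the squared Frobenius error
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

def assign_subgroups(H):
    """Assign each sample (column of H) to its largest component."""
    return np.argmax(H, axis=0)

# Hypothetical toy data: 6 genomic regions x 8 samples, two planted groups
rng = np.random.default_rng(1)
group_a = np.array([5.0, 5.0, 5.0, 0.0, 0.0, 0.0])
group_b = np.array([0.0, 0.0, 0.0, 5.0, 5.0, 5.0])
X = np.column_stack([group_a] * 4 + [group_b] * 4) + 0.1 * rng.random((6, 8))

W, H = nmf(X, rank=2)
labels = assign_subgroups(H)
rel_error = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
```

Here `labels` holds one subgroup index per sample and `rel_error` is the relative reconstruction error. On real CN data the rank would have to be chosen with quality measures such as those the abstract mentions.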
Abstract:
The factorization method is applied to the initial data of a previously solved quantum mechanics problem. Almost all of the solutions (eigenstates and eigenfunctions) are recovered.
Abstract:
This thesis studies models of high-dimensional sequences based on recurrent neural networks (RNNs) and their application to music and speech. Although in principle RNNs can represent the long-term dependencies and complex temporal dynamics of sequences of interest such as video, audio and natural language, they have not been used to their full potential since their introduction by Rumelhart et al. (1986a) because of the difficulty of training them efficiently by gradient descent. Recently, the successful application of Hessian-free optimization and other advanced training techniques has led to a resurgence of their use in several state-of-the-art systems. The work of this thesis is part of this development. The central idea is to exploit the flexibility of RNNs to learn a probabilistic description of sequences of symbols, that is, high-level information associated with the observed signals, which in turn can serve as a prior to improve the accuracy of information retrieval. For example, by modeling the evolution of groups of notes in polyphonic music, of chords in a harmonic progression, of phonemes in a spoken utterance or of individual sources in an audio mixture, we can significantly improve methods for polyphonic transcription, chord recognition, speech recognition and audio source separation, respectively. The practical application of our models to these tasks is detailed in the last four articles presented in this thesis. In the first article, we replace the output layer of an RNN with conditional restricted Boltzmann machines to describe much richer multimodal output distributions. In the second article, we evaluate and propose advanced methods for training RNNs.
In the last four articles, we examine different ways of combining our symbolic models with deep networks and non-negative matrix factorization, notably through products of experts, input/output architectures and generative frameworks that generalize hidden Markov models. We also propose and analyze efficient inference methods for these models, such as chronological greedy search, high-dimensional beam search, pruned beam search and gradient descent. Finally, we address the issues of label bias, teacher forcing, temporal smoothing, regularization and pre-training.
Abstract:
Biological systems exhibit rich and complex behavior through the orchestrated interplay of a large array of components. It is hypothesized that separable subsystems with some degree of functional autonomy exist; deciphering their independent behavior and functionality would greatly facilitate understanding the system as a whole. Discovering and analyzing such subsystems are hence pivotal problems in the quest to gain a quantitative understanding of complex biological systems. In this work, using approaches from machine learning, physics and graph theory, methods for the identification and analysis of such subsystems were developed. A novel methodology, based on a recent machine learning algorithm known as non-negative matrix factorization (NMF), was developed to discover such subsystems in a set of large-scale gene expression data. This set of subsystems was then used to predict functional relationships between genes, and this approach was shown to score significantly higher than conventional methods when benchmarking them against existing databases. Moreover, a mathematical treatment was developed to treat simple network subsystems based only on their topology (independent of particular parameter values). Application to a problem of experimental interest demonstrated the need for extensions to the conventional model to fully explain the experimental data. Finally, the notion of a subsystem was evaluated from a topological perspective. A number of different protein networks were examined to analyze their topological properties with respect to separability, seeking to find separable subsystems. These networks were shown to exhibit separability in a nonintuitive fashion, while the separable subsystems were of strong biological significance. It was demonstrated that the separability property found was not due to incomplete or biased data, but is likely to reflect biological structure.
Abstract:
This paper is concerned with tensor clustering assisted by dimensionality reduction. A class of formulations for tensor clustering is introduced based on Tucker decomposition models. In this formulation, an extra tensor mode is formed by a collection of tensors of the same dimensions and then used to assist a Tucker decomposition in order to achieve dimensionality reduction. We design two types of clustering models for the tensors, a PCA Tensor Clustering model and a Non-negative Tensor Clustering model, by using different regularizations. The tensor clustering problem can thus be solved by an optimization method based on an alternating coordinate scheme. Interestingly, our experiments show that the proposed models yield comparable or even better performance than most recent clustering algorithms based on matrix factorization.
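To make the idea of stacking same-sized tensors along an extra mode and then applying a Tucker decomposition more concrete, here is a minimal sketch that uses the truncated higher-order SVD (HOSVD), a standard way to compute a Tucker model. It is a stand-in, not the regularized PCA or non-negative clustering models of the abstract above; the helper names, the random data and the chosen ranks are all assumptions.

```python
import numpy as np

def unfold(T, mode):
    """Matricize tensor T so that rows index the given mode."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_product(T, M, mode):
    """Multiply tensor T by matrix M along the given mode."""
    moved = np.moveaxis(T, mode, 0)
    out = np.tensordot(M, moved, axes=(1, 0))
    return np.moveaxis(out, 0, mode)

def hosvd(T, ranks):
    """Truncated HOSVD: one orthonormal factor per mode plus a core tensor."""
    factors = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        factors.append(U[:, :r])
    core = T
    for mode, U in enumerate(factors):
        core = mode_product(core, U.T, mode)
    return core, factors

def reconstruct(core, factors):
    """Map a core tensor back to the original space."""
    T = core
    for mode, U in enumerate(factors):
        T = mode_product(T, U, mode)
    return T

# Hypothetical data: 10 matrices of size 4 x 5 stacked along an extra (third) mode
rng = np.random.default_rng(0)
samples = [rng.random((4, 5)) for _ in range(10)]
T = np.stack(samples, axis=-1)           # shape (4, 5, 10)

core, factors = hosvd(T, ranks=(2, 2, 3))
embeddings = factors[2]                  # one 3-dimensional row per sample
```

Each row of `embeddings` is a low-dimensional representation of one stacked tensor, which could then be passed to any clustering routine; the paper instead builds the clustering into the decomposition through regularization.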
Abstract:
We establish sufficient conditions for a matrix to be almost totally positive, thus extending a result of Craven and Csordas who proved that the corresponding conditions guarantee that a matrix is strictly totally positive. Then we apply our main result in order to obtain a new criterion for a real algebraic polynomial to be a Hurwitz one. The properties of the corresponding extremal Hurwitz polynomials are discussed. (C) 2004 Elsevier B.V. All rights reserved.
Abstract:
The growing use of high-throughput analysis systems to study the physiological and metabolic state of the body has shown that proper nutrition and good physical fitness are key factors for health. The rise in the average age of the population highlights the importance of strategies to counter aging-related diseases. A healthy diet is the first means of prevention for many diseases, so understanding how food affects the human body is of fundamental importance. In this thesis we addressed the characterization of Dual-energy X-ray Absorptiometry (DXA) radiographic imaging systems. After establishing a suitable methodology for processing DXA data on a group of healthy, non-obese subjects, PCA revealed some properties that emerge from interpreting the principal components in terms of the body composition variables returned by DXA. The first components can be associated with macroscopic indices of body description (such as BMI and WHR). These components are surprisingly stable as the subjects' age, sex and nationality vary. Metabolic analysis data, obtained by Magnetic Resonance Spectroscopy (MRS) on urine samples, are available for about one thousand elderly people (from five European countries) aged between 65 and 79 and not affected by serious diseases. Body composition data are also available for these subjects. The Non-negative Matrix Factorization (NMF) algorithm was used to express the MRS spectra as a combination of basis factors interpretable as individual metabolites. The factors found are stable, so the subjects' metabolic spectra are composed of the same pattern of metabolites regardless of nationality. A single-blind analysis found high correlation values between the body composition variables and the metabolic state of the subjects.
This suggests the possibility of deriving the subjects' body composition from their metabolic state.
Abstract:
Thesis (Ph.D.)--University of Washington, 2016-04
Abstract:
In recent years, the boundaries between e-commerce and social networking have become increasingly blurred. Many e-commerce websites support social login, where users can sign in to the websites using their social network identities such as their Facebook or Twitter accounts. Users can also post their newly purchased products on microblogs with links to the e-commerce product web pages. In this paper, we propose a novel solution for cross-site cold-start product recommendation, which aims to recommend products from e-commerce websites to users at social networking sites in 'cold-start' situations, a problem which has rarely been explored before. A major challenge is how to leverage knowledge extracted from social networking sites for cross-site cold-start product recommendation. We propose to use the linked users across social networking sites and e-commerce websites (users who have social networking accounts and have made purchases on e-commerce websites) as a bridge to map users' social networking features to another feature representation for product recommendation. Specifically, we propose learning both users' and products' feature representations (called user embeddings and product embeddings, respectively) from data collected from e-commerce websites using recurrent neural networks and then applying a modified gradient boosting trees method to transform users' social networking features into user embeddings. We then develop a feature-based matrix factorization approach which can leverage the learnt user embeddings for cold-start product recommendation. Experimental results on a large dataset constructed from the largest Chinese microblogging service Sina Weibo and the largest Chinese B2C e-commerce website JingDong have shown the effectiveness of our proposed framework.
Abstract:
We present in this article an automated framework that extracts product adopter information from online reviews and incorporates the extracted information into feature-based matrix factorization for more effective product recommendation. Specifically, we propose a bootstrapping approach for the extraction of product adopters from review text and categorize them into a number of different demographic categories. The aggregated demographic information of many product adopters can be used to characterize both products and users in the form of distributions over different demographic categories. We further propose a graph-based method to iteratively update user- and product-related distributions more reliably in a heterogeneous user-product graph and incorporate them as features into the matrix factorization approach for product recommendation. Our experimental results on a large dataset crawled from JINGDONG, the largest B2C e-commerce website in China, show that our proposed framework outperforms a number of competitive baselines for product recommendation.
Abstract:
As massive data sets become increasingly available, people are facing the problem of how to effectively process and understand these data. Traditional sequential computing models are giving way to parallel and distributed computing models, such as MapReduce, both due to the large size of the data sets and their high dimensionality. This dissertation, in line with other MapReduce-based research, aims to develop effective techniques and applications using MapReduce that can help people solve large-scale problems. Three different problems are tackled in the dissertation. The first one deals with processing terabytes of raster data in a spatial data management system. Aerial imagery files are broken into tiles to enable data-parallel computation. The second and third problems deal with dimension reduction techniques that can be used to handle data sets of high dimensionality. Three variants of the nonnegative matrix factorization technique are scaled up to factorize matrices with dimensions on the order of millions in MapReduce, based on different matrix multiplication implementations. Two algorithms, which compute CANDECOMP/PARAFAC and Tucker tensor decompositions respectively, are parallelized in MapReduce by carefully partitioning the data and arranging the computation to maximize data locality and parallelism.
Abstract:
In recent years, the scientific community's interest in Non-negative Matrix Factorization (NMF) has grown. This method transforms a high-dimensional data set into a small collection of elements that carry their own meaning in the context of the analysis. In bioinformatics, NMF is often used as the basis of some data clustering methods, which employ a statistical model to determine the most favorable number of classes. This model requires a large number of NMF runs with different input parameters, which represents an enormous computational workload. Most NMF implementations have become obsolete in the face of the constant growth of the data that the scientific community seeks to analyze, either because computation times become so long as to be infeasible or because the size of the data exceeds the system's resources. This doctoral thesis therefore focuses on the optimization and parallelization of NMF, not only at a theoretical level but also with the aim of providing the scientific community with a new tool for the analysis of data of biological origin. NMF exposes a high degree of data-level parallelism, of variable granularity, whereas the clustering methods mentioned above exhibit task-level parallelism, since the various NMF instances that are executed are independent. From a global point of view, therefore, a layered optimization model is proposed in which different high-performance technologies are employed...
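The independence of the repeated NMF runs can be sketched as follows: each run produces a clustering, the runs are dispatched concurrently, and a consensus matrix summarizes how often each pair of samples lands in the same cluster. The abstract does not name the statistical model it uses, so the dispersion score below (the mean of 4(C - 1/2)^2 over the consensus matrix C) is only one assumed choice; the thread pool, the function names and the random data are likewise illustrative and are not the thesis's high-performance implementation.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def nmf_labels(X, rank, seed, n_iter=200, eps=1e-9):
    """One NMF run (multiplicative updates); returns a cluster label per column of X."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], rank)) + eps
    H = rng.random((rank, X.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return np.argmax(H, axis=0)

def consensus_matrix(X, rank, seeds):
    """Fraction of runs in which each pair of samples shares a cluster."""
    # The runs share no state, so they can be dispatched concurrently.
    with ThreadPoolExecutor() as pool:
        runs = list(pool.map(lambda s: nmf_labels(X, rank, s), seeds))
    connectivity = [(lab[:, None] == lab[None, :]).astype(float) for lab in runs]
    return np.mean(connectivity, axis=0)

def dispersion(C):
    """Equals 1.0 when every entry of C is 0 or 1 (fully stable clustering)."""
    return float(np.mean(4.0 * (C - 0.5) ** 2))

# Hypothetical data: 20 features x 12 samples
rng = np.random.default_rng(0)
X = rng.random((20, 12))

# Score several candidate numbers of classes; higher means more stable
scores = {k: dispersion(consensus_matrix(X, k, seeds=range(10))) for k in (2, 3, 4)}
```

The inner matrix products are where the data-level parallelism lives, while the loop over seeds and candidate ranks is the task-level layer, which is why the two can be optimized separately.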
Abstract:
Single-particle mixing state information can be a powerful tool for assessing the relative impact of local and regional sources of ambient particulate matter in urban environments. However, quantitative mixing state data are challenging to obtain using single-particle mass spectrometers. In this study, the quantitative chemical composition of carbonaceous single particles has been determined using an aerosol time-of-flight mass spectrometer (ATOFMS) as part of the MEGAPOLI 2010 winter campaign in Paris, France. Relative peak areas of marker ions for elemental carbon (EC), organic aerosol (OA), ammonium, nitrate, sulfate and potassium were compared with concurrent measurements from an Aerodyne high-resolution time-of-flight aerosol mass spectrometer (HR-ToF-AMS), a thermal-optical OCEC analyser and a particle into liquid sampler coupled with ion chromatography (PILS-IC). ATOFMS-derived estimated mass concentrations reproduced the variability of these species well (R² = 0.67-0.78), and 10 discrete mixing states for carbonaceous particles were identified and quantified. The chemical mixing state of HR-ToF-AMS organic aerosol factors, resolved using positive matrix factorisation, was also investigated through comparison with the ATOFMS dataset. The results indicate that hydrocarbon-like OA (HOA) detected in Paris is associated with two EC-rich mixing states which differ in their relative sulfate content, while fresh biomass burning OA (BBOA) is associated with two mixing states which differ significantly in their OA/EC ratios. Aged biomass burning OA (OOA(2)-BBOA) was found to be significantly internally mixed with nitrate, while secondary, oxidised OA (OOA) was associated with five particle mixing states, each exhibiting different relative secondary inorganic ion content. Externally mixed secondary organic aerosol was not observed. These findings demonstrate the range of primary and secondary organic aerosol mixing states in Paris.
Examination of the temporal behaviour and chemical composition of the ATOFMS classes also enabled estimation of the relative contribution of transported emissions of each chemical species and total particle mass in the size range investigated. Only 22% of the total ATOFMS-derived particle mass was apportioned to fresh, local emissions, with 78% apportioned to regional/continental-scale emissions.
Abstract:
An aerosol time-of-flight mass spectrometer (ATOFMS) was deployed for the measurement of the size-resolved chemical composition of single particles at a site in Cork Harbour, Ireland for three weeks in August 2008. The ATOFMS was co-located with a suite of semi-continuous instrumentation for the measurement of particle number, elemental carbon (EC), organic carbon (OC), sulfate and particulate matter smaller than 2.5 μm in diameter (PM2.5). The temporality of the ambient ATOFMS particle classes was subsequently used in conjunction with the semi-continuous measurements to apportion PM2.5 mass using positive matrix factorisation. The synergy of the single particle classification procedure and positive matrix factorisation allowed for the identification of six factors, corresponding to vehicular traffic, marine, long-range transport, various combustion, domestic solid fuel combustion and shipping traffic with estimated contributions to the measured PM2.5 mass of 23%, 14%, 13%, 11%, 5% and 1.5% respectively. Shipping traffic was found to contribute 18% of the measured particle number (20–600 nm mobility diameter), and thus may have important implications for human health considering the size and composition of ship exhaust particles. The positive matrix factorisation procedure enabled a more refined interpretation of the single particle results by providing source contributions to PM2.5 mass, while the single particle data enabled the identification of additional factors not possible with typical semi-continuous measurements, including local shipping traffic.
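Positive matrix factorisation fits non-negative factor contributions and profiles by minimising residuals scaled by the measurement uncertainties. The sketch below imitates that objective with weighted multiplicative updates, a swapped-in technique rather than the solver used in the study above, and then reports each factor's share of the reconstructed signal. The synthetic data, the uncertainty model `sigma` and the function names are assumptions for illustration only.

```python
import numpy as np

def weighted_factorization(X, sigma, n_factors, n_iter=300, seed=0, eps=1e-12):
    """Approximate X (samples x species) by G @ F with G, F >= 0,
    reducing Q = sum(((X - G @ F) / sigma) ** 2)."""
    rng = np.random.default_rng(seed)
    n_samples, n_species = X.shape
    G = rng.random((n_samples, n_factors)) + 0.1   # factor contributions over time
    F = rng.random((n_factors, n_species)) + 0.1   # factor profiles
    weights = 1.0 / sigma**2
    q_history = []
    for _ in range(n_iter):
        # Weighted multiplicative updates keep G and F non-negative
        G *= ((weights * X) @ F.T) / ((weights * (G @ F)) @ F.T + eps)
        F *= (G.T @ (weights * X)) / (G.T @ (weights * (G @ F)) + eps)
        q_history.append(float(np.sum(weights * (X - G @ F) ** 2)))
    return G, F, q_history

def factor_shares(G, F):
    """Fraction of the total reconstructed signal attributed to each factor."""
    per_factor = G.sum(axis=0) * F.sum(axis=1)
    return per_factor / per_factor.sum()

# Synthetic example: 50 time steps, 6 measured species, 3 hypothetical sources
rng = np.random.default_rng(42)
G_true = rng.random((50, 3))
F_true = rng.random((3, 6))
X = np.clip(G_true @ F_true + 0.01 * rng.standard_normal((50, 6)), 0.0, None)
sigma = 0.05 * X + 0.01                            # assumed uncertainty model

G, F, q_history = weighted_factorization(X, sigma, n_factors=3)
shares = factor_shares(G, F)
```

The entries of `shares` sum to one and play the role of the percentage contributions quoted in the abstract, although a real apportionment would regress the factors against measured PM2.5 mass and inspect the residuals.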