972 results for Matrix factorization
Abstract:
This thesis studies high-dimensional sequence models based on recurrent neural networks (RNNs) and their application to music and speech. Although RNNs can in principle represent the long-term dependencies and complex temporal dynamics of sequences of interest such as video, audio and natural language, they have not been used to their full potential since their introduction by Rumelhart et al. (1986a) because of the difficulty of training them efficiently by gradient descent. Recently, the successful application of Hessian-free optimization and other advanced training techniques has led to a resurgence of their use in several state-of-the-art systems. The work in this thesis is part of that development. The central idea is to exploit the flexibility of RNNs to learn a probabilistic description of sequences of symbols, that is, high-level information associated with the observed signals, which in turn can serve as a prior to improve the accuracy of information retrieval. For example, by modeling the evolution of groups of notes in polyphonic music, of chords in a harmonic progression, of phonemes in a spoken utterance or of individual sources in an audio mixture, we can significantly improve methods for polyphonic transcription, chord recognition, speech recognition and audio source separation, respectively. The practical application of our models to these tasks is detailed in the last four articles presented in this thesis. In the first article, we replace the output layer of an RNN with conditional restricted Boltzmann machines to describe much richer multimodal output distributions. In the second article, we evaluate and propose advanced methods for training RNNs.
In the last four articles, we examine different ways of combining our symbolic models with deep networks and non-negative matrix factorization, notably through products of experts, input/output architectures and generative frameworks that generalize hidden Markov models. We also propose and analyze efficient inference methods for these models, such as chronological greedy search, high-dimensional beam search, pruned beam search and gradient descent. Finally, we address the issues of label bias, teacher forcing, temporal smoothing, regularization and pre-training.
Abstract:
Biological systems exhibit rich and complex behavior through the orchestrated interplay of a large array of components. It is hypothesized that separable subsystems with some degree of functional autonomy exist; deciphering their independent behavior and functionality would greatly facilitate understanding the system as a whole. Discovering and analyzing such subsystems are hence pivotal problems in the quest to gain a quantitative understanding of complex biological systems. In this work, using approaches from machine learning, physics and graph theory, methods for the identification and analysis of such subsystems were developed. A novel methodology, based on a recent machine learning algorithm known as non-negative matrix factorization (NMF), was developed to discover such subsystems in a set of large-scale gene expression data. This set of subsystems was then used to predict functional relationships between genes, and this approach was shown to score significantly higher than conventional methods when benchmarked against existing databases. Moreover, a mathematical treatment was developed to treat simple network subsystems based only on their topology (independent of particular parameter values). Application to a problem of experimental interest demonstrated the need for extensions to the conventional model to fully explain the experimental data. Finally, the notion of a subsystem was evaluated from a topological perspective. A number of different protein networks were examined to analyze their topological properties with respect to separability, seeking to find separable subsystems. These networks were shown to exhibit separability in a nonintuitive fashion, while the separable subsystems were of strong biological significance. It was demonstrated that the separability property found was not due to incomplete or biased data, but is likely to reflect biological structure.
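As a rough illustration of the NMF step this abstract builds on, the sketch below factorizes a non-negative, expression-like matrix V into W @ H using the standard Lee and Seung multiplicative updates. The function name `nmf`, the toy data and every parameter value are assumptions made for illustration; the abstract does not say which NMF variant or settings were used.

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Approximate a non-negative matrix V (e.g. genes x samples) as W @ H."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative by construction.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy matrix generated from two hidden non-negative "subsystems" (assumed data).
rng = np.random.default_rng(1)
V = rng.random((30, 2)) @ rng.random((2, 12))

W, H = nmf(V, rank=2)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(f"relative reconstruction error: {rel_err:.4f}")
```

Each column of W can be read as one candidate subsystem (a weighted set of genes) and each row of H as its activity across samples; that interpretation is the usual one for NMF, not a detail confirmed by the abstract.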
Abstract:
This paper is concerned with tensor clustering with the assistance of dimensionality reduction approaches. A class of formulations for tensor clustering is introduced based on tensor Tucker decomposition models. In this formulation, an extra tensor mode is formed by a collection of tensors of the same dimensions and then used to assist a Tucker decomposition in order to achieve data dimensionality reduction. We design two types of clustering models for the tensors, a PCA Tensor Clustering model and a Non-negative Tensor Clustering model, by utilizing different regularizations. The tensor clustering can thus be solved by an optimization method based on the alternating coordinate scheme. Interestingly, our experiments show that the proposed models yield comparable or even better performance compared to most recent clustering algorithms based on matrix factorization.
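To make the "extra tensor mode" idea concrete, the sketch below stacks same-sized samples along a new last mode and computes a truncated Tucker decomposition. It uses a plain truncated higher-order SVD as a stand-in; the regularized models and the alternating optimization described in the abstract are not reproduced here, and the names `unfold`, `mode_product` and `hosvd` as well as the toy data are assumptions.

```python
import numpy as np

def unfold(T, mode):
    """Matricize tensor T so that `mode` indexes the rows."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_product(T, M, mode):
    """Multiply tensor T by matrix M along the given mode."""
    return np.moveaxis(np.tensordot(M, T, axes=(1, mode)), 0, mode)

def hosvd(T, ranks):
    """Truncated higher-order SVD: T is approximated by core x_n factors[n]."""
    factors = [np.linalg.svd(unfold(T, m), full_matrices=False)[0][:, :r]
               for m, r in enumerate(ranks)]
    core = T
    for m, U in enumerate(factors):
        core = mode_product(core, U.T, m)
    return core, factors

# Stack 6 samples, each a 4x5 matrix, along an extra last mode (assumed data).
rng = np.random.default_rng(0)
samples = [rng.random((4, 5)) for _ in range(6)]
T = np.stack(samples, axis=-1)              # shape (4, 5, 6)

core, factors = hosvd(T, ranks=(2, 3, 2))
embedding = factors[2]                      # one 2-dimensional row per sample
```

The rows of `embedding` give a low-dimensional description of each sample, which could then be passed to a standard clustering routine such as k-means.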
Abstract:
Trace element measurements in PM10–2.5, PM2.5–1.0 and PM1.0–0.3 aerosol were performed with 2 h time resolution at kerbside, urban background and rural sites during the ClearfLo winter 2012 campaign in London. The environment-dependent variability of emissions was characterized using the Multilinear Engine implementation of the positive matrix factorization model, conducted on data sets comprising all three sites but segregated by size. Combining the sites enabled separation of sources with high temporal covariance but significant spatial variability. Separation of sizes improved source resolution by preventing sources occurring in only a single size fraction from having too small a contribution for the model to resolve. Anchor profiles were retrieved internally by analysing data subsets, and these profiles were used in the analyses of the complete data sets of all sites for enhanced source apportionment. A total of nine different factors were resolved (notable elements in brackets): in PM10–2.5, brake wear (Cu, Zr, Sb, Ba), other traffic-related (Fe), resuspended dust (Si, Ca), sea/road salt (Cl), aged sea salt (Na, Mg) and industrial (Cr, Ni); in PM2.5–1.0, brake wear, other traffic-related, resuspended dust, sea/road salt, aged sea salt and S-rich (S); and in PM1.0–0.3, traffic-related (Fe, Cu, Zr, Sb, Ba), resuspended dust, sea/road salt, aged sea salt, reacted Cl (Cl), S-rich and solid fuel (K, Pb). Human activities enhance the kerb-to-rural concentration gradients of coarse aged sea salt, typically considered to have a natural source, by 1.7–2.2. These site-dependent concentration differences reflect the effect of local resuspension processes in London. The anthropogenically influenced factors traffic (brake wear and other traffic-related processes), dust and sea/road salt provide further kerb-to-rural concentration enhancements by direct source emissions by a factor of 3.5–12.7. 
The traffic and dust factors are mainly emitted in PM10–2.5 and show strong diurnal variations with concentrations up to 4 times higher during rush hour than during night-time. Regionally influenced S-rich and solid fuel factors, occurring primarily in PM1.0–0.3, have negligible resuspension influences, and concentrations are similar throughout the day and across the regions.
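For readers unfamiliar with PMF, the sketch below shows the quantity it minimizes: the sum of squared residuals between the data X and the product of factor contributions G and factor profiles F, each residual scaled by its measurement uncertainty sigma, with G and F constrained to be non-negative. The solver here is a simple weighted multiplicative-update scheme used as a stand-in; it is not the Multilinear Engine, and the toy data, the uncertainty model and the function names are assumptions.

```python
import numpy as np

def pmf_q(X, sigma, G, F):
    """PMF objective Q: squared residuals scaled by measurement uncertainty."""
    return float((((X - G @ F) / sigma) ** 2).sum())

def weighted_nmf(X, sigma, n_factors, n_iter=300, eps=1e-12, seed=0):
    """Reduce Q subject to G, F >= 0 with weighted multiplicative updates."""
    rng = np.random.default_rng(seed)
    wts = 1.0 / sigma ** 2
    G = rng.random((X.shape[0], n_factors)) + 0.1   # time series of each factor
    F = rng.random((n_factors, X.shape[1])) + 0.1   # elemental profile of each factor
    for _ in range(n_iter):
        F *= (G.T @ (wts * X)) / (G.T @ (wts * (G @ F)) + eps)
        G *= ((wts * X) @ F.T) / ((wts * (G @ F)) @ F.T + eps)
    return G, F

# Toy data: 50 time steps x 8 elements generated from 3 hidden sources.
rng = np.random.default_rng(2)
X = rng.random((50, 3)) @ rng.random((3, 8))
sigma = 0.05 + 0.1 * X                  # assumed per-sample uncertainties

G, F = weighted_nmf(X, sigma, n_factors=3)
Q = pmf_q(X, sigma, G, F)
```

Scaling by sigma is what lets noisy or near-detection-limit measurements count for less in the fit, which is the main practical difference from unweighted NMF.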
Abstract:
In this work, new tools for atmospheric pollutant sampling and analysis were applied to deepen a source apportionment study. The project mainly studied atmospheric emission sources in a suburban area influenced by a municipal solid waste incinerator (MSWI), a medium-sized coastal tourist town and a motorway. Two main research lines were followed. In the first, the potential of PM samplers coupled with a wind select sensor was assessed. Results showed that they may be a valid support in source apportionment studies, although meteorological and territorial conditions could strongly affect the results. Moreover, new markers were investigated, focusing particularly on biomass burning processes. OC proved to be a good indicator of biomass combustion, as did all the organic compounds determined. Among metals, lead and aluminium were closely related to biomass combustion. Surprisingly, PM was not enriched in potassium during the bonfire event. The second research line consisted of the application of Positive Matrix Factorization (PMF), a new statistical tool for data analysis. This technique was applied to datasets with different time resolutions. PMF applied to atmospheric deposition fluxes identified six main sources affecting the area. The incinerator's relative contribution seemed to be negligible. PMF analysis was then applied to PM2.5 collected with samplers coupled with a wind select sensor. The larger number of environmental indicators determined allowed more detailed results to be obtained on the sources affecting the area. Vehicular traffic proved to be the source of greatest concern for the study area. In this case too, the incinerator's relative contribution seemed to be negligible. Finally, the application of PMF analysis to hourly aerosol data demonstrated that the higher the temporal resolution of the data, the closer the source profiles were to the real ones.
Abstract:
The growing use of high-throughput analysis systems to study the physiological and metabolic state of the body has shown that proper nutrition and good physical fitness are key factors for health. The increase in the average age of the population highlights the importance of strategies to counter ageing-related diseases. A healthy diet is the first means of prevention for many diseases, so understanding how food affects the human body is of fundamental importance. In this thesis we addressed the characterization of Dual-energy X-ray Absorptiometry (DXA) radiographic imaging systems. After establishing a suitable methodology for processing DXA data on a group of healthy, non-obese subjects, PCA revealed some properties that emerge from interpreting the principal components in terms of the body composition variables returned by DXA. The first components can be associated with macroscopic indices of body description (such as BMI and WHR). These components are surprisingly stable across variations in the subjects' age, sex and nationality. Metabolic analysis data, obtained by Magnetic Resonance Spectroscopy (MRS) on urine samples, are available for about one thousand elderly people (from five European countries) aged between 65 and 79 and not affected by serious diseases. Body composition data are also available for these subjects. The Non-negative Matrix Factorization (NMF) algorithm was used to express the MRS spectra as a combination of basis factors interpretable as individual metabolites. The factors found are stable, so the subjects' metabolic spectra are composed of the same pattern of metabolites regardless of nationality. A single-blind analysis found high correlation values between the body composition variables and the metabolic state of the subjects.
This suggests the possibility of deriving the subjects' body composition from their metabolic state.
Abstract:
Although previous studies report on the effect of street washing on ambient particulate matter levels, there is a lack of studies investigating the results of street washing on the emission strength of road dust. A sampling campaign was conducted in Madrid urban area during July 2009 where road dust samples were collected in two sites, namely Reference site (where the road surface was not washed) and Pelayo site (where street washing was performed daily during night). Following the chemical characterization of the road dust particles the emission sources were resolved by means of Positive Matrix Factorization, PMF (Multilinear Engine scripting) and the mass contribution of each source was calculated for the two sites. Mineral dust, brake wear, tire wear, carbonaceous emissions and construction dust were the main sources of road dust with mineral and construction dust being the major contributors to inhalable road dust load. To evaluate the effectiveness of street washing on the emission sources, the sources mass contributions between the two sites were compared. Although brake wear and tire wear had lower concentrations at the site where street washing was performed, these mass differences were not statistically significant and the temporal variation did not show the expected build-up after dust removal. It was concluded that the washing activities resulted merely in a road dust moistening, without effective removal and that mobilization of particles took place in a few hours between washing and sampling. The results also indicated that it is worth paying attention to the dust dispersed from the construction sites as they affect the emission strength in nearby streets.
Abstract:
In early spring the Baltic region is frequently affected by high-pollution events due to biomass burning in that area. Here we present a comprehensive study to investigate the impact of biomass/grass burning (BB) on the evolution and composition of aerosol in Preila, Lithuania, during springtime open fires. Non-refractory submicron particulate matter (NR-PM1) was measured by an Aerodyne aerosol chemical speciation monitor (ACSM) and a source apportionment with the multilinear engine (ME-2) running the positive matrix factorization (PMF) model was applied to the organic aerosol fraction to investigate the impact of biomass/grass burning. Satellite observations over regions of biomass burning activity supported the results and identification of air mass transport to the area of investigation. Sharp increases in biomass burning tracers, such as levoglucosan up to 683 ng m⁻³ and black carbon (BC) up to 17 μg m⁻³, were observed during this period. A further separation between fossil and non-fossil primary and secondary contributions was obtained by coupling ACSM PMF results and radiocarbon (14C) measurements of the elemental (EC) and organic (OC) carbon fractions. Non-fossil organic carbon (OCnf) was the dominant fraction of PM1, with the primary (POCnf) and secondary (SOCnf) fractions contributing 26–44 % and 13–23 % to the total carbon (TC), respectively. 5–8 % of the TC had a primary fossil origin (POCf), whereas the contribution of fossil secondary organic carbon (SOCf) was 4–13 %. Non-fossil EC (ECnf) and fossil EC (ECf) ranged from 13–24 % and 7–13 %, respectively. Isotope ratios of stable carbon and nitrogen isotopes were used to distinguish aerosol particles associated with solid and liquid fossil fuel burning.
Abstract:
Thesis (Ph.D.)--University of Washington, 2016-04
Abstract:
In recent years, the boundaries between e-commerce and social networking have become increasingly blurred. Many e-commerce websites support the mechanism of social login, where users can sign on to the websites using their social network identities such as their Facebook or Twitter accounts. Users can also post their newly purchased products on microblogs with links to the e-commerce product web pages. In this paper, we propose a novel solution for cross-site cold-start product recommendation, which aims to recommend products from e-commerce websites to users at social networking sites in 'cold-start' situations, a problem which has rarely been explored before. A major challenge is how to leverage knowledge extracted from social networking sites for cross-site cold-start product recommendation. We propose to use the linked users across social networking sites and e-commerce websites (users who have social networking accounts and have made purchases on e-commerce websites) as a bridge to map users' social networking features to another feature representation for product recommendation. Specifically, we propose learning both users' and products' feature representations (called user embeddings and product embeddings, respectively) from data collected from e-commerce websites using recurrent neural networks, and then applying a modified gradient boosting trees method to transform users' social networking features into user embeddings. We then develop a feature-based matrix factorization approach which can leverage the learnt user embeddings for cold-start product recommendation. Experimental results on a large dataset constructed from the largest Chinese microblogging service, Sina Weibo, and the largest Chinese B2C e-commerce website, JingDong, have shown the effectiveness of our proposed framework.
Abstract:
We present in this article an automated framework that extracts product adopter information from online reviews and incorporates the extracted information into feature-based matrix factorization for more effective product recommendation. Specifically, we propose a bootstrapping approach for the extraction of product adopters from review text and categorize them into a number of different demographic categories. The aggregated demographic information of many product adopters can be used to characterize both products and users in the form of distributions over different demographic categories. We further propose a graph-based method to iteratively update user- and product-related distributions more reliably in a heterogeneous user-product graph and incorporate them as features into the matrix factorization approach for product recommendation. Our experimental results on a large dataset crawled from JINGDONG, the largest B2C e-commerce website in China, show that our proposed framework outperforms a number of competitive baselines for product recommendation.
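One common way to fold side information into matrix factorization is to add feature-driven latent vectors to the user's own latent vector before taking the dot product with the item vector. The sketch below shows only that prediction rule, with demographic-category distributions as the user features. The exact model in the article is not given in the abstract, so the form of `predict`, the variable names and all numbers are assumptions.

```python
import numpy as np

def predict(mu, b_u, b_i, p_u, x_u, A, q_i):
    """Score = global mean + biases + (user factors + feature factors) . item factors."""
    user_vec = p_u + x_u @ A      # shift the user's latent vector by their features
    return mu + b_u + b_i + float(user_vec @ q_i)

mu, b_u, b_i = 3.0, 0.1, -0.2     # global mean, user bias, item bias
p_u = np.array([0.5, 0.0])        # user-specific latent factors
x_u = np.array([0.7, 0.3])        # distribution over two demographic categories
A = np.eye(2)                     # one latent vector per demographic category
q_i = np.array([1.0, 2.0])        # item latent factors

rating = predict(mu, b_u, b_i, p_u, x_u, A, q_i)

# A user with no purchase history has no learned bias or factors,
# so the score is driven by the demographic features alone.
cold_rating = predict(mu, 0.0, b_i, np.zeros(2), x_u, A, q_i)
```

In a full system the biases, `p_u`, `A` and `q_i` would be learned from observed interactions, for example by stochastic gradient descent on a squared-error or ranking loss.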
Abstract:
As massive data sets become increasingly available, people are facing the problem of how to effectively process and understand these data. Traditional sequential computing models are giving way to parallel and distributed computing models, such as MapReduce, due both to the large size of the data sets and to their high dimensionality. This dissertation, in line with other MapReduce-based research, aims to develop effective techniques and applications using MapReduce that can help people solve large-scale problems. Three different problems are tackled in the dissertation. The first one deals with processing terabytes of raster data in a spatial data management system. Aerial imagery files are broken into tiles to enable data-parallel computation. The second and third problems deal with dimension reduction techniques that can be used to handle data sets of high dimensionality. Three variants of the nonnegative matrix factorization technique are scaled up to factorize matrices with dimensions on the order of millions in MapReduce, based on different matrix multiplication implementations. Two algorithms, which compute CANDECOMP/PARAFAC and Tucker tensor decompositions respectively, are parallelized in MapReduce by carefully partitioning the data and arranging the computation to maximize data locality and parallelism.
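Because NMF updates are dominated by matrix products, the way a product is expressed as map and reduce steps largely determines how the factorization scales. The sketch below emulates one simple formulation in a single Python process: the map phase emits partial products keyed by output cell, and the reduce phase sums them. It is an illustrative toy, not one of the implementations from the dissertation, and the function names are assumptions.

```python
from collections import defaultdict

def map_phase(A, B):
    """Emit ((i, j), a_ik * b_kj) for every term contributing to C = A @ B."""
    for i, row in enumerate(A):
        for k, a_ik in enumerate(row):
            for j, b_kj in enumerate(B[k]):
                yield (i, j), a_ik * b_kj

def reduce_phase(pairs):
    """Sum all partial products that share the same output key (i, j)."""
    out = defaultdict(float)
    for key, value in pairs:
        out[key] += value
    return dict(out)

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = reduce_phase(map_phase(A, B))
print(C)
```

On a real cluster the emitted pairs would be shuffled across machines by key, so the number of pairs and how they are grouped become the main costs to control.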
Abstract:
In recent years, the scientific community's interest in Non-negative Matrix Factorization (NMF) has grown. This method transforms a high-dimensional data set into a small collection of elements that carry their own meaning in the context of the analysis. In bioinformatics, NMF is often used as the basis of clustering methods that employ a statistical model to determine the most favourable number of classes. This model requires a large number of NMF runs with different input parameters, which represents an enormous computational workload. Most NMF implementations have become obsolete in the face of the constant growth of the data that the scientific community seeks to analyse, either because computation times become unfeasibly long or because the size of the data exceeds the system's resources. This doctoral thesis therefore focuses on the optimization and parallelization of NMF, not only at the theoretical level but with the aim of providing the scientific community with a new tool for the analysis of biological data. NMF exposes a high degree of data-level parallelism, of variable granularity, while the clustering methods mentioned above exhibit compute-level parallelism, since the various NMF instances that are run are independent. From a global point of view, therefore, a layered optimization model is proposed in which different high-performance technologies are employed...
Abstract:
Receptor modelling was performed on quadrupole unit mass resolution aerosol mass spectrometer (Q-AMS) sub-micron particulate matter (PM) chemical speciation measurements from Windsor, Ontario, an industrial city situated across the Detroit River from Detroit, Michigan. Aerosol and trace gas measurements were collected on board Environment Canada’s CRUISER mobile laboratory. Positive matrix factorization (PMF) was performed on the AMS full particle-phase mass spectrum (PMFFull MS) encompassing both organic and inorganic components. This approach was compared to the more common method of analysing only the organic mass spectra (PMFOrg MS). PMF of the full mass spectrum revealed that variability in the non-refractory sub-micron aerosol concentration and composition was best explained by six factors: an amine-containing factor (Amine); an ammonium sulphate and oxygenated organic aerosol containing factor (Sulphate-OA); an ammonium nitrate and oxygenated organic aerosol containing factor (Nitrate-OA); an ammonium chloride containing factor (Chloride); a hydrocarbon like organic aerosol (HOA) factor; and a moderately oxygenated organic aerosol factor (OOA). PMF of the organic mass spectrum revealed three factors of similar composition to some of those revealed through PMFFull MS: Amine, HOA and OOA. Including both the inorganic and organic mass proved to be a beneficial approach to analysing the unit mass resolution AMS data for several reasons. First, it provided a method for potentially calculating more accurate sub-micron PM mass concentrations, particularly when unusual factors are present, in this case, an Amine factor. As this method does not rely on a priori knowledge of chemical species, it circumvents the need for any adjustments to the traditional AMS species fragmentation patterns to account for atypical species, and can thus lead to more complete factor profiles. 
It is expected that this method would be even more useful for HR-ToF-AMS data, due to the ability to better understand the chemical nature of atypical factors from high resolution mass spectra. Second, utilizing PMF to extract factors containing inorganic species allowed for the determination of extent of neutralization, which could have implications for aerosol parameterization. Third, subtler differences in organic aerosol components were resolved through the incorporation of inorganic mass into the PMF matrix. The additional temporal features provided by the inorganic aerosol components allowed for the resolution of more types of oxygenated organic aerosol than could be reliably resolved from PMF of organics alone. Comparison of findings from the PMFFull MS and PMFOrg MS methods showed that for the Windsor airshed, the PMFFull MS method enabled additional conclusions to be drawn in terms of aerosol sources and chemical processes. While performing PMFOrg MS can provide important distinctions between types of organic aerosol, it is shown that including inorganic species in the PMF analysis can permit further apportionment of organics for unit mass resolution AMS mass spectra.
Abstract:
Ambient wintertime background urban aerosol in Cork city, Ireland, was characterized using aerosol mass spectrometry. During the three-week measurement study in 2009, 93% of the ca. 1 350 000 single particles characterized by an Aerosol Time-of-Flight Mass Spectrometer (TSI ATOFMS) were classified into five organic-rich particle types, internally mixed to different proportions with elemental carbon (EC), sulphate and nitrate, while the remaining 7% was predominantly inorganic in nature. Non-refractory PM1 aerosol was characterized using a High Resolution Time-of-Flight Aerosol Mass Spectrometer (Aerodyne HR-ToF-AMS) and was also found to comprise organic aerosol as the most abundant species (62 %), followed by nitrate (15 %), sulphate (9 %) and ammonium (9 %), and chloride (5 %). Positive matrix factorization (PMF) was applied to the HR-ToF-AMS organic matrix, and a five-factor solution was found to describe the variance in the data well. Specifically, "hydrocarbon-like" organic aerosol (HOA) comprised 20% of the mass, "low-volatility" oxygenated organic aerosol (LV-OOA) comprised 18 %, "biomass burning" organic aerosol (BBOA) comprised 23 %, non-wood solid-fuel combustion "peat and coal" organic aerosol (PCOA) comprised 21 %, and finally a species type characterized by primary m/z peaks at 41 and 55, similar to previously reported "cooking" organic aerosol (COA), but possessing different diurnal variations to what would be expected for cooking activities, contributed 18 %. Correlations between the different particle types obtained by the two aerosol mass spectrometers are also discussed. 
Despite wood, coal and peat being minor fuel types used for domestic space heating in urban areas, their relatively low combustion efficiencies result in a significant contribution to PM1 aerosol mass (44% and 28% of the total organic aerosol mass and non-refractory total PM1, respectively).