931 results for large data sets
Abstract:
This Thesis describes the application of automatic learning methods to a) the classification of organic and metabolic reactions, and b) the mapping of Potential Energy Surfaces (PES). The classification of reactions was approached with two distinct methodologies: a representation of chemical reactions based on NMR data, and a representation of chemical reactions from the reaction equation based on the physico-chemical and topological features of chemical bonds.

NMR-based classification of photochemical and enzymatic reactions. Photochemical and metabolic reactions were classified by Kohonen Self-Organizing Maps (Kohonen SOMs) and Random Forests (RFs) taking as input the difference between the 1H NMR spectra of the products and the reactants. Such a representation can be applied to the automatic analysis of changes in the 1H NMR spectrum of a mixture and their interpretation in terms of the chemical reactions taking place. Examples of possible applications are the monitoring of reaction processes, the evaluation of the stability of chemicals, or even the interpretation of metabonomic data. A Kohonen SOM trained with a data set of metabolic reactions catalysed by transferases was able to correctly classify 75% of an independent test set in terms of the EC number subclass. Random Forests improved the correct predictions to 79%. With photochemical reactions classified into 7 groups, an independent test set was classified with 86-93% accuracy. The data set of photochemical reactions was also used to simulate mixtures with two reactions occurring simultaneously. Kohonen SOMs and Feed-Forward Neural Networks (FFNNs) were trained to classify the reactions occurring in a mixture based on the 1H NMR spectra of the products and reactants. Kohonen SOMs allowed the correct assignment of 53-63% of the mixtures (in a test set), and Counter-Propagation Neural Networks (CPNNs) gave similar results. Supervised learning techniques improved the results to 77% of correct assignments when an ensemble of ten FFNNs was used, and to 80% when Random Forests were used. This study was performed with NMR data simulated from the molecular structure by the SPINUS program. In the design of one test set, simulated data were combined with experimental data. The results support the proposal of linking databases of chemical reactions to experimental or simulated NMR data for the automatic classification of reactions and mixtures of reactions.

Genome-scale classification of enzymatic reactions from their reaction equation. The MOLMAP descriptor relies on a Kohonen SOM that defines types of bonds on the basis of their physico-chemical and topological properties. The MOLMAP descriptor of a molecule represents the types of bonds available in that molecule. The MOLMAP descriptor of a reaction is defined as the difference between the MOLMAPs of the products and the reactants, and numerically encodes the pattern of bonds that are broken, changed, and made during a chemical reaction. The automatic perception of chemical similarities between metabolic reactions is required for a variety of applications, ranging from the computer validation of classification systems and the genome-scale reconstruction (or comparison) of metabolic pathways to the classification of enzymatic mechanisms.
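A minimal sketch of how such a reaction descriptor can be assembled, assuming the bonds of each molecule have already been assigned to the winning neurons of a trained Kohonen SOM; the grid size, function names, and bond coordinates below are illustrative, not taken from the Thesis:

```python
import numpy as np

def molecule_molmap(bond_neurons, grid_size=(15, 15)):
    """Toy MOLMAP: count how many of a molecule's bonds map onto each neuron
    of a previously trained Kohonen SOM of bond types. `bond_neurons` is a
    list of (row, col) winning-neuron coordinates, one per bond (assumed to
    have been computed elsewhere by the trained SOM)."""
    molmap = np.zeros(grid_size)
    for row, col in bond_neurons:
        molmap[row, col] += 1
    return molmap

def reaction_molmap(product_bonds, reactant_bonds, grid_size=(15, 15)):
    """Reaction descriptor: difference between the summed product MOLMAPs and
    the summed reactant MOLMAPs, encoding bonds broken, changed, and made."""
    products = sum(molecule_molmap(b, grid_size) for b in product_bonds)
    reactants = sum(molecule_molmap(b, grid_size) for b in reactant_bonds)
    return (products - reactants).ravel()   # flat vector usable as SOM/RF input

# Hypothetical example: one reactant molecule and one product molecule.
descriptor = reaction_molmap(product_bonds=[[(0, 1), (3, 7)]],
                             reactant_bonds=[[(0, 1), (9, 2)]])
print(descriptor.shape, descriptor.sum())   # (225,) and 0.0 for balanced bond counts
```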
Catalytic functions of proteins are generally described by EC numbers, which are simultaneously employed as identifiers of reactions, enzymes, and enzyme genes, thus linking metabolic and genomic information. Different methods should be available to automatically compare metabolic reactions and to automatically assign EC numbers to reactions not yet officially classified. In this study, the genome-scale data set of enzymatic reactions available in the KEGG database was encoded by the MOLMAP descriptors and submitted to Kohonen SOMs to compare the resulting map with the official EC number classification, to explore the possibility of predicting EC numbers from the reaction equation, and to assess the internal consistency of the EC classification at the class level. A general agreement with the EC classification was observed, i.e., a relationship between the similarity of MOLMAPs and the similarity of EC numbers. At the same time, MOLMAPs were able to discriminate between EC sub-subclasses. EC numbers could be assigned at the class, subclass, and sub-subclass levels with accuracies up to 92%, 80%, and 70% for independent test sets. The correspondence between the chemical similarity of metabolic reactions and their MOLMAP descriptors was applied to the identification of a number of reactions mapped into the same neuron but belonging to different EC classes, which demonstrated the ability of the MOLMAP/SOM approach to verify the internal consistency of classifications in databases of metabolic reactions. RFs were also used to assign the four levels of the EC hierarchy from the reaction equation. EC numbers were correctly assigned in 95%, 90%, 85%, and 86% of the cases (for independent test sets) at the class, subclass, sub-subclass, and full EC number level, respectively. Experiments for the classification of reactions from the main reactants and products were performed with RFs: EC numbers were assigned at the class, subclass, and sub-subclass levels with accuracies of 78%, 74%, and 63%, respectively. In the course of the experiments with metabolic reactions, we suggested that the MOLMAP/SOM concept could be extended to the representation of other levels of metabolic information, such as metabolic pathways. Following the MOLMAP idea, the pattern of neurons activated by the reactions of a metabolic pathway is a representation of the reactions involved in that pathway, i.e., a descriptor of the metabolic pathway. This reasoning enabled the comparison of different pathways, the automatic classification of pathways, and a classification of organisms based on their biochemical machinery. The three levels of classification (from bonds to metabolic pathways) made it possible to map and perceive chemical similarities between metabolic pathways, even for pathways of different types of metabolism and pathways that do not share similarities in terms of EC numbers.

Mapping of PES by neural networks (NNs). In a first series of experiments, ensembles of Feed-Forward NNs (EnsFFNNs) and Associative Neural Networks (ASNNs) were trained to reproduce PES represented by the Lennard-Jones (LJ) analytical potential function. The accuracy of the method was assessed by comparing the results of molecular dynamics simulations (thermal, structural, and dynamic properties) obtained from the NN-PES and from the LJ function. The results indicated that, for LJ-type potentials, NNs can be trained to generate accurate PES to be used in molecular simulations. EnsFFNNs and ASNNs gave better results than single FFNNs.
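A minimal sketch of the idea of fitting feed-forward networks to energies generated from the Lennard-Jones function; the small scikit-learn MLPRegressor ensemble below is only an illustrative stand-in for the EnsFFNN/ASNN models of the Thesis, and the grid and hyperparameters are assumptions:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def lennard_jones(r, epsilon=1.0, sigma=1.0):
    """LJ pair potential V(r) = 4*eps*((sigma/r)**12 - (sigma/r)**6)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 ** 2 - sr6)

# Training data: LJ energies on a grid of interatomic distances (reduced units).
r_train = np.linspace(0.9, 3.0, 200).reshape(-1, 1)
e_train = lennard_jones(r_train).ravel()

# A small ensemble of feed-forward networks; averaging the outputs loosely
# mimics the ensemble (EnsFFNN) idea, though not the actual thesis setup.
ensemble = [MLPRegressor(hidden_layer_sizes=(20, 20), max_iter=5000,
                         random_state=seed).fit(r_train, e_train)
            for seed in range(5)]

r_test = np.array([[1.12], [1.5], [2.2]])
e_pred = np.mean([net.predict(r_test) for net in ensemble], axis=0)
print(e_pred)                         # NN interpolation ...
print(lennard_jones(r_test).ravel())  # ... versus the analytical LJ values
```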
The NN models showed a remarkable ability to interpolate between distant curves and to accurately reproduce potentials for use in molecular simulations. The purpose of the first study was to systematically analyse the accuracy of different NNs. Our main motivation, however, is reflected in the next study: the mapping of multidimensional PES by NNs to simulate, by Molecular Dynamics or Monte Carlo, the adsorption and self-assembly of solvated organic molecules on noble-metal electrodes. Indeed, for such complex and heterogeneous systems, the development of suitable analytical functions that fit quantum mechanical interaction energies is a non-trivial or even impossible task. The data consisted of energy values, from Density Functional Theory (DFT) calculations, at different distances, for several molecular orientations and three electrode adsorption sites. The results indicate that NNs require a data set large enough to adequately cover the diversity of possible interaction sites, distances, and orientations. NNs trained with such data sets can perform equally well or even better than analytical functions. Therefore, they can be used in molecular simulations, particularly for the ethanol/Au(111) interface, which is the case studied in the present Thesis. Once properly trained, the networks are able to produce, as output, any required number of energy points for accurate interpolations.
Abstract:
With the advent of wearable sensing and mobile technologies, biosignals have seen an ever-growing number of application areas, leading to the collection of large volumes of data. One of the difficulties in dealing with these data sets, and in the development of automated machine learning systems which use them as input, is the lack of reliable ground truth information. In this paper we present a new web-based platform for the visualization, retrieval, and annotation of biosignals by non-technical users, aimed at improving the process of ground truth collection for biomedical applications. Moreover, a novel extensible and scalable data representation model and persistency framework is presented. The results of the experimental evaluation with prospective users have further confirmed the potential of the presented framework.
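The paper's actual data representation model is not detailed in this abstract; purely as an illustration of what a ground-truth annotation record for a biosignal segment might contain, a toy sketch (all field names are assumptions):

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class Annotation:
    """Toy ground-truth annotation record for a biosignal segment
    (illustrative only; not the paper's actual data model)."""
    signal_id: str    # which recording the annotation refers to
    channel: str      # e.g. "ECG" or "EDA"
    start_s: float    # segment start, in seconds
    end_s: float      # segment end, in seconds
    label: str        # label assigned by the (non-technical) annotator
    annotator: str    # who provided the ground truth

record = Annotation("rec-001", "ECG", 12.0, 14.5, "artifact", "user42")
print(json.dumps(asdict(record)))   # e.g. persisted by the platform's backend
```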
Abstract:
In global scientific experiments with collaborative scenarios involving multinational teams, there are big challenges related to data access: data movements to other regions or Clouds are precluded by constraints on latency, cost, data privacy, and data ownership. Furthermore, each site processes local data sets using specialized algorithms and produces intermediate results that are helpful as inputs to applications running on remote sites. This paper shows how to model such collaborative scenarios as a scientific workflow implemented with AWARD (Autonomic Workflow Activities Reconfigurable and Dynamic), a decentralized framework offering a feasible solution to run the workflow activities on distributed data centers in different regions without the need for large data movements. The AWARD workflow activities are independently monitored and dynamically reconfigured and steered by different users, namely by hot-swapping the algorithms to enhance the computation results or by changing the workflow structure to support feedback dependencies, where an activity receives feedback output from a successor activity. A real implementation of one practical scenario and its execution on multiple data centers of the Amazon Cloud is presented, including experimental results with steering by multiple users.
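A minimal Python sketch of the two reconfiguration ideas mentioned above, hot-swapping an activity's algorithm and accepting feedback output from a successor activity; this illustrates the concept only and is not the AWARD framework's API:

```python
from typing import Callable, List

class Activity:
    """Minimal sketch of a reconfigurable workflow activity (not the AWARD API):
    the processing algorithm can be hot-swapped at run time, and the activity
    can consume feedback produced by a successor activity."""
    def __init__(self, name: str, algorithm: Callable[[List[float]], List[float]]):
        self.name = name
        self.algorithm = algorithm
        self.feedback: List[float] = []

    def swap_algorithm(self, new_algorithm: Callable[[List[float]], List[float]]) -> None:
        self.algorithm = new_algorithm   # hot-swap to enhance the computation

    def receive_feedback(self, values: List[float]) -> None:
        self.feedback = values           # feedback dependency from a successor

    def run(self, local_data: List[float]) -> List[float]:
        return self.algorithm(local_data + self.feedback)

# Example: switch from a coarse to a refined algorithm between iterations.
activity = Activity("site-A", lambda xs: [round(sum(xs), 1)])
print(activity.run([1.0, 2.0, 3.0]))
activity.swap_algorithm(lambda xs: [sum(xs) / len(xs)])
activity.receive_feedback([4.0])
print(activity.run([1.0, 2.0, 3.0]))
```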
Abstract:
Dissertation presented as a partial requirement for obtaining the Master's degree in Geographic Information Systems and Science
Abstract:
The quantity and variety of multimedia content currently available pose a challenge to users, since the space for searching and choosing sources and content exceeds the users' time and processing capacity. This problem of selecting information from large, heterogeneous data sets according to the user's profile is complex and requires specific tools. Recommender Systems arise in this context and are able to suggest to the user items that match their tastes, interests, or needs, i.e., their profile, using artificial intelligence methodologies. The main goal of this thesis is to demonstrate that it is possible to recommend multimedia content in useful time from the user's personal and social profile, relying exclusively on public and heterogeneous data sources. To this end, a content-based multimedia Recommender System was designed and developed, i.e., one based on the characteristics of the items, on the user's history and personal preferences, and on the user's social interactions. The recommended multimedia content, i.e., the items suggested to the user, comes from the British broadcaster, the British Broadcasting Corporation (BBC), and is classified according to the BBC programme categories. The user profile is built taking into account history, context, personal preferences, and social activities. YouTube is the source of personal history used, making it possible to simulate the main source of this type of data, the Set-Top Box (STB). The user's history consists of the set of YouTube videos and BBC programmes watched by the user. The content of the YouTube videos is classified according to YouTube's own video categories, which are then mapped to the BBC programme categories. The social information, which comes from the social networks Facebook and Twitter, is collected through the Beancounter platform. The user's social activities are filtered to extract films and series, which are in turn semantically enriched using open linked data repositories. In this case, films and series are classified using IMDb genres and subsequently mapped to the BBC programme categories. Finally, the user's context information and explicit preferences, expressed through the rating of recommended items, are also taken into account. The developed system makes recommendations in real time based on Facebook and Twitter social network activities, on the history of YouTube videos and BBC programmes watched, and on explicit preferences. Tests were carried out with five users, and the system's average response time to create the initial set of recommendations was 30 s. The personalised recommendations are generated and updated upon the user's express request.
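A minimal sketch of the content-based matching step described above, representing the user profile and candidate BBC items as vectors over programme categories and ranking items by cosine similarity; the category names, data structures, and functions are illustrative assumptions, not the system's actual implementation:

```python
import numpy as np

# Illustrative programme categories (not the actual BBC taxonomy).
CATEGORIES = ["drama", "comedy", "news", "music", "factual"]

def profile_from_history(watched_items):
    """User profile = normalized counts of categories over watched YouTube
    videos / BBC programmes (already mapped to the BBC categories)."""
    profile = np.zeros(len(CATEGORIES))
    for item in watched_items:
        for cat in item["categories"]:
            profile[CATEGORIES.index(cat)] += 1
    return profile / (np.linalg.norm(profile) or 1.0)

def recommend(profile, candidates, top_n=3):
    """Rank candidate BBC items by cosine similarity to the user profile."""
    def item_vector(item):
        v = np.zeros(len(CATEGORIES))
        for cat in item["categories"]:
            v[CATEGORIES.index(cat)] = 1.0
        return v / (np.linalg.norm(v) or 1.0)
    scored = [(float(profile @ item_vector(c)), c["title"]) for c in candidates]
    return sorted(scored, reverse=True)[:top_n]

history = [{"categories": ["drama"]}, {"categories": ["drama", "music"]}]
catalogue = [{"title": "A", "categories": ["drama"]},
             {"title": "B", "categories": ["news"]},
             {"title": "C", "categories": ["music", "drama"]}]
print(recommend(profile_from_history(history), catalogue))
```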
Abstract:
The conjugate margins system of the Gulf of Lion and West Sardinia (GLWS) represents a unique natural laboratory for addressing fundamental questions about rifting due to its landlocked situation, its youth, its thick sedimentary layers, including prominent palaeo-markers such as the MSC event, and the amount of available data and multidisciplinary studies. The main goals of the SARDINIA experiment were to (i) investigate the deep structure of the entire system within the two conjugate margins: the Gulf of Lion and West Sardinia, (ii) characterize the nature of the crust, and (iii) define the geometry of the basin and provide important constraints on its genesis. This paper presents the results of P-wave velocity modelling on three coincident near-vertical reflection multi-channel seismic (MCS) and wide-angle seismic profiles acquired in the Gulf of Lion, to a depth of 35 km. A companion paper [part II, Afilhado et al., 2015] addresses the results of two other SARDINIA profiles located on the eastern conjugate West Sardinian margin. Forward wide-angle modelling of both data sets confirms that the margin is characterised by three distinct domains following the onshore unthinned, 33 km-thick continental crust domain: Domain I is bounded by two necking zones, where the crust thins respectively from 30 to 20 and from 20 to 7 km over a width of about 170 km; the outermost necking is imprinted by the well-known T-reflector at its crustal base; Domain II is characterised by a 7 km-thick crust with anomalous velocities ranging from 6 to 7.5 km/s; it represents the transition between the thinned continental crust (Domain I) and a very thin (only 4-5 km) "atypical" oceanic crust (Domain III). In Domain II, the hypothesis of the presence of exhumed mantle is falsified by our results: this domain may likely consist of a thin exhumed lower continental crust overlying a heterogeneous, intruded lower layer. Moreover, despite the difference in their magnetic signatures, Domains II and III present very similar seismic velocity profiles, and we discuss the possibility of a connection between these two different domains.
Abstract:
Geophysical data acquired on the conjugate margins system of the Gulf of Lion and West Sardinia (GLWS) is unique in its ability to address fundamental questions about rifting (i.e. crustal thinning, the nature of the continent-ocean transition zone, the style of rifting and subsequent evolution, and the connection between deep and surface processes). While the Gulf of Lion (GoL) was the site of several deep seismic experiments that preceded the SARDINIA Experiment (the ESP and ECORS Experiments in 1981 and 1988, respectively), the crustal structure of the West Sardinia margin remains unknown. This paper describes the first modeling of wide-angle and near-vertical reflection multi-channel seismic (MCS) profiles crossing the West Sardinia margin, in the Mediterranean Sea. The profiles were acquired, together with the exact conjugate of the profiles crossing the GoL, during the SARDINIA experiment in December 2006 with the French R/V L'Atalante. Forward wide-angle modeling of both data sets (wide-angle and multi-channel seismic) confirms that the margin is characterized by three distinct domains following the onshore unthinned, 26 km-thick continental crust: Domain V, where the crust thins from 26 to 6 km over a width of about 75 km; Domain IV, where the basement is characterized by high velocity gradients and lower crustal seismic velocities from 6.8 to 7.25 km/s, which are atypical for either crustal or upper mantle material; and Domain III, composed of "atypical" oceanic crust. The structure observed on the West Sardinian margin presents a distribution of seismic velocities that is symmetrical with that observed on the Gulf of Lion side, except for the dimension of each domain and with respect to the initiation of seafloor spreading. This result does not support the hypothesis of a simple shear mechanism operating along a lithospheric detachment during the formation of the Liguro-Provencal basin.
Abstract:
Hyperspectral imaging has become one of the main topics in remote sensing. Hyperspectral images comprise hundreds of spectral bands at different (almost contiguous) wavelength channels over the same area, generating large data volumes of several GBs per flight. This high spectral resolution can be used for object detection and for discriminating between different objects based on their spectral characteristics. One of the main problems involved in hyperspectral analysis is the presence of mixed pixels, which arise when the spatial resolution of the sensor is not able to separate spectrally distinct materials. Spectral unmixing is one of the most important tasks for hyperspectral data exploitation. However, unmixing algorithms can be computationally very expensive and power-hungry, which compromises their use in applications under on-board constraints. In recent years, graphics processing units (GPUs) have evolved into highly parallel and programmable systems. Specifically, several hyperspectral imaging algorithms have been shown to benefit from this hardware, taking advantage of the extremely high floating-point processing performance, compact size, huge memory bandwidth, and relatively low cost of these units, which make them appealing for onboard data processing. In this paper, we propose a parallel implementation of an augmented Lagrangian based method for unsupervised hyperspectral linear unmixing on GPUs using CUDA. The method, called simplex identification via split augmented Lagrangian (SISAL), aims to identify the endmembers of a scene, i.e., it is able to unmix hyperspectral data sets in which the pure pixel assumption is violated. The efficient implementation of the SISAL method presented in this work exploits the GPU architecture at a low level, using shared memory and coalesced accesses to memory.
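SISAL itself involves a non-convex optimisation that is beyond the scope of this abstract; purely to illustrate what linear unmixing computes, a toy numpy sketch of the linear mixing model with a plain least-squares abundance estimate (not the SISAL algorithm or its GPU kernels; all sizes are assumptions):

```python
import numpy as np

# Linear mixing model: each pixel spectrum y = M a + noise, where the columns
# of M are endmember spectra and a are the abundances (non-negative, sum to one).
rng = np.random.default_rng(0)
bands, endmembers, pixels = 50, 3, 1000
M = rng.random((bands, endmembers))                    # toy endmember signatures
A = rng.dirichlet(np.ones(endmembers), size=pixels).T  # true abundances (sum to 1)
Y = M @ A + 0.001 * rng.standard_normal((bands, pixels))

# With M known, a plain least-squares estimate of the abundances; SISAL's
# actual job is harder (estimating M itself without pure pixels).
A_hat, *_ = np.linalg.lstsq(M, Y, rcond=None)
print(np.abs(A_hat - A).max())   # small residual error on this toy problem
```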
Abstract:
The application of compressive sensing (CS) to hyperspectral images has been an active area of research over the past few years, both in terms of hardware and of signal processing algorithms. However, CS algorithms can be computationally very expensive due to the extremely large volumes of data collected by imaging spectrometers, a fact that compromises their use in applications under real-time constraints. This paper proposes four efficient implementations of hyperspectral coded aperture (HYCA) for CS on commodity graphics processing units (GPUs): two of them, termed P-HYCA and P-HYCA-FAST, and two additional implementations of its constrained version (CHYCA), termed P-CHYCA and P-CHYCA-FAST. The HYCA algorithm exploits the high correlation existing among the spectral bands of hyperspectral data sets and the generally low number of endmembers needed to explain the data, which largely reduces the number of measurements necessary to correctly reconstruct the original data. The proposed P-HYCA and P-CHYCA implementations have been developed using the compute unified device architecture (CUDA) and the cuFFT library. Moreover, this library has been replaced by a fast iterative method in the P-HYCA-FAST and P-CHYCA-FAST implementations, which leads to very significant speedup factors and makes it possible to meet real-time requirements. The proposed algorithms are evaluated not only in terms of reconstruction error for different compression ratios but also in terms of computational performance using two different GPU architectures by NVIDIA: 1) GeForce GTX 590; and 2) GeForce GTX TITAN. Experiments conducted using both simulated and real data reveal considerable acceleration factors and good results in the task of compressing remotely sensed hyperspectral data sets.
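A toy numpy sketch of the property HYCA relies on, namely that hyperspectral pixel spectra lie close to a low-dimensional subspace spanned by a few endmembers, so far fewer random measurements than spectral bands suffice to recover a spectrum; this illustrates the CS measurement model only, not the HYCA algorithm, and all dimensions are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
bands, endmembers, measurements = 100, 5, 20    # measurements << bands

E = rng.random((bands, endmembers))             # subspace basis (e.g. endmembers)
x = E @ rng.dirichlet(np.ones(endmembers))      # one pixel spectrum in that subspace
Phi = rng.standard_normal((measurements, bands)) / np.sqrt(measurements)
y = Phi @ x                                     # compressive measurements

# Knowing the subspace E, solve y = (Phi E) a for the coefficients a,
# then reconstruct the full spectrum (the core idea behind HYCA-style CS).
a_hat, *_ = np.linalg.lstsq(Phi @ E, y, rcond=None)
x_hat = E @ a_hat
print(np.linalg.norm(x - x_hat) / np.linalg.norm(x))  # near-zero relative error
```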
Abstract:
Nowadays, data centers are large energy consumers, and this trend is expected to increase further in the coming years, considering the growth of cloud services. A large portion of this power consumption is due to the control of the physical parameters of the data center (such as temperature and humidity). However, these physical parameters are tightly coupled with computations, and even more so in upcoming data centers, where the location of workloads can vary substantially due, for example, to workloads being moved within the cloud infrastructure hosted in the data center. Therefore, managing the physical and compute infrastructure of a large data center is an embodiment of a Cyber-Physical System (CPS). In this paper, we describe a data collection and distribution architecture that enables gathering the physical parameters of a large data center at a very high temporal and spatial resolution of the sensor measurements. We believe this is an important characteristic for enabling more accurate heat-flow models of the data center and, with them, finding opportunities to optimize energy consumption. Having a high-resolution picture of the data center conditions also enables minimizing local hot spots, performing more accurate predictive maintenance (failures in all infrastructure equipment can be detected more promptly), and more accurate billing. We detail this architecture and define the structure of the underlying messaging system that is used to collect and distribute the data. Finally, we show the results of a preliminary study of a typical data center radio environment.
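The abstract does not give the message structure itself; purely as an illustration of the kind of sensor sample such a messaging system might carry, a toy sketch with assumed field names:

```python
import json
import time

def sensor_message(sensor_id, rack, position, quantity, value, unit):
    """Toy message for one physical-parameter sample (illustrative only;
    the paper's messaging system defines its own structure)."""
    return {
        "sensor_id": sensor_id,
        "location": {"rack": rack, "position": position},  # spatial resolution
        "quantity": quantity,                              # e.g. temperature, humidity
        "value": value,
        "unit": unit,
        "timestamp": time.time(),                          # temporal resolution
    }

msg = sensor_message("t-017", "R12", "inlet-top", "temperature", 24.3, "C")
print(json.dumps(msg))   # published to the collection/distribution layer
```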
Abstract:
Demo at the Workshop on ns-3 (WNS3 2015), 13-14 May 2015, Castelldefels, Spain.
Abstract:
Dissertation for obtaining the Master's degree in Informatics Engineering
Abstract:
Grasslands in semi-arid regions, like the Mongolian steppes, are facing desertification and degradation processes due to climate change. Mongolia's main economic activity consists of extensive livestock production, so this is a matter of concern for decision makers. Remote sensing and Geographic Information Systems provide the tools for advanced ecosystem management and have been widely used for the monitoring and management of pasture resources. This study investigates the highest thematic detail that can be achieved through remote sensing to map the steppe vegetation, using medium-resolution earth observation imagery in three districts (soums) of Mongolia: Dzag, Buutsagaan and Khureemaral. After considering different thematic levels of detail for classifying the steppe vegetation, the existing pasture types within the steppe were chosen to be mapped. In order to investigate which combination of data sets yields the best results and which classification algorithm is more suitable for incorporating these data sets, different classification methods were compared for the study area. Sixteen classifications were performed using different combinations of predictors, Landsat-8 data (spectral bands and Landsat-8-derived NDVI) and geophysical data (elevation, mean annual precipitation and mean annual temperature), with two classification algorithms, maximum likelihood and decision tree. Results showed that the best performing model was the one that incorporated Landsat-8 bands with mean annual precipitation and mean annual temperature (Model 13), using the decision tree. For maximum likelihood, the model that incorporated Landsat-8 bands with mean annual precipitation (Model 5) and the one that incorporated Landsat-8 bands with mean annual precipitation and mean annual temperature (Model 13) achieved the highest accuracies for this algorithm. The decision tree models consistently outperformed the maximum likelihood ones.
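A minimal scikit-learn sketch of this kind of comparison, stacking spectral bands with precipitation and temperature as per-pixel features and training a decision tree alongside quadratic discriminant analysis (used here as a stand-in for Gaussian maximum likelihood classification); the arrays below are random placeholders, so the printed accuracies are meaningless:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Illustrative feature table: 7 Landsat-8 bands plus mean annual precipitation
# and mean annual temperature per training pixel (random toy values).
rng = np.random.default_rng(0)
n_pixels = 600
X = np.hstack([rng.random((n_pixels, 7)),             # Landsat-8 bands
               rng.normal(180, 30, (n_pixels, 1)),    # precipitation (mm)
               rng.normal(1.0, 2.0, (n_pixels, 1))])  # temperature (deg C)
y = rng.integers(0, 4, n_pixels)                      # pasture-type labels (toy)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(max_depth=8, random_state=0).fit(X_tr, y_tr)
# QDA fits one Gaussian per class, i.e. a maximum-likelihood-style classifier.
mlc = QuadraticDiscriminantAnalysis().fit(X_tr, y_tr)

print("decision tree accuracy:", tree.score(X_te, y_te))
print("maximum likelihood accuracy:", mlc.score(X_te, y_te))
```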
Abstract:
We present a study on human mobility at small spatial scales. Differently from large-scale mobility, recently studied through dollar-bill tracking and mobile phone data sets within one big country or continent, we report Brownian features of human mobility at smaller scales. In particular, the scaling exponent found at the smallest scales is typically close to one-half, differently from the larger values of the exponent characterizing mobility at larger scales. We carefully analyze 12 months of data from the Eduroam database within the Portuguese University of Minho. A full procedure is introduced with the aim of properly characterizing human mobility within the network of access points composing the wireless system of the university. In particular, measures of flux are introduced for estimating a distance between access points. This distance is typically non-Euclidean, since the spatial constraints at such small scales distort the continuum space on which human mobility occurs. Since two different exponents are found depending on the scale at which human motion takes place, we raise the question of at which scale the transition from Brownian to non-Brownian motion takes place. In this context, we discuss how the numerical approach can be extended to larger scales, using the full Eduroam in Europe and in Asia, to uncover the transition between the two dynamical regimes.
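A minimal sketch of how such a scaling exponent can be estimated from a trace, fitting the slope of the root-mean-square displacement versus time lag on a log-log scale (a Brownian walk gives a slope close to one-half); the random walk below is a toy stand-in for the access-point trajectories, not the Eduroam data:

```python
import numpy as np

# Toy 2-D random walk standing in for a sequence of visited positions.
rng = np.random.default_rng(2)
steps = rng.standard_normal((10_000, 2))
path = np.cumsum(steps, axis=0)

# Root-mean-square displacement as a function of the time lag.
lags = np.unique(np.logspace(0, 3, 30).astype(int))
rms = [np.sqrt(np.mean(np.sum((path[lag:] - path[:-lag]) ** 2, axis=1)))
       for lag in lags]

# Scaling exponent = slope of log(RMS displacement) versus log(lag);
# a value near 0.5 signals Brownian-like motion, as reported at small scales.
slope, _ = np.polyfit(np.log(lags), np.log(rms), 1)
print(round(float(slope), 2))
```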
Abstract:
We construct estimates of educational attainment for a sample of OECD countries using previously unexploited sources. We follow a heuristic approach to obtain plausible time profiles for attainment levels by removing sharp breaks in the data that seem to reflect changes in classification criteria. We then construct indicators of the information content of our series and a number of previously available data sets and examine their performance in several growth specifications. We find a clear positive correlation between data quality and the size and significance of human capital coefficients in growth regressions. Using an extension of the classical errors in variables model, we construct a set of meta-estimates of the coefficient of years of schooling in an aggregate Cobb-Douglas production function. Our results suggest that, after correcting for measurement error bias, the value of this parameter is well above 0.50.
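A minimal simulation of the classical errors-in-variables logic invoked above: measurement error in schooling attenuates the OLS coefficient towards zero by the reliability ratio var(x)/(var(x)+var(u)), and dividing by that ratio undoes the bias; all numbers here are toy values, not the paper's estimates:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
beta_true = 0.6                                  # "true" schooling coefficient (toy)

schooling = rng.normal(10.0, 2.0, n)             # true years of schooling
measured = schooling + rng.normal(0.0, 1.5, n)   # measured with classical error
log_output = beta_true * schooling + rng.normal(0.0, 0.5, n)  # toy DGP

# OLS slope of log output on measured schooling is attenuated towards zero.
beta_ols = np.cov(measured, log_output)[0, 1] / np.var(measured)

# Classical EIV: plim(beta_ols) = beta_true * var(x) / (var(x) + var(u)).
reliability = 2.0**2 / (2.0**2 + 1.5**2)
beta_corrected = beta_ols / reliability
print(round(beta_ols, 3), round(beta_corrected, 3))  # roughly 0.38 versus 0.6
```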