903 results for Data-driven Methods
Abstract:
Following the workshop on new developments in daily licensing practice in November 2011, fourteen representatives from national consortia (from Denmark, Germany, the Netherlands and the UK) and publishers (Elsevier, SAGE and Springer) met in Copenhagen on 9 March 2012 to discuss provisions in licences to accommodate new developments. The one-day workshop aimed to: present background and ideas regarding the provisions the KE Licensing Expert Group developed; introduce and explain the provisions the invited publishers currently use; ascertain agreement on the wording for long-term preservation, continuous access and course packs; give insight and more clarity about the use of open access provisions in licences; discuss a roadmap for inclusion of the provisions in the publishers' licences; and result in a report to disseminate the outcome of the meeting. Participants of the workshop were: United Kingdom: Lorraine Estelle (Jisc Collections); Denmark: Lotte Eivor Jørgensen (DEFF), Lone Madsen (Southern University of Denmark), Anne Sandfær (DEFF/Knowledge Exchange); Germany: Hildegard Schaeffler (Bavarian State Library), Markus Brammer (TIB); The Netherlands: Wilma Mossink (SURF), Nol Verhagen (University of Amsterdam), Marc Dupuis (SURF/Knowledge Exchange); Publishers: Alicia Wise (Elsevier), Yvonne Campfens (Springer), Bettina Goerner (Springer), Leo Walford (SAGE); Knowledge Exchange: Keith Russell. The main outcome of the workshop was that it would be valuable to have a standard set of clauses which could be used in negotiations; this would make concluding licences much easier and more efficient. The comments on the model provisions the Licensing Expert Group had drafted will be taken into account and the provisions will be reformulated. Data and text mining is a new development, and demand for access to allow for this is growing. It would be easier if there were a simpler way to access materials so they could be more easily mined.
However, there are still outstanding questions on how authors of articles that have been mined can be properly attributed.
Abstract:
Discovery Driven Analysis (DDA) is a common feature of OLAP technology for analyzing structured data. In essence, DDA helps analysts discover anomalous data by highlighting 'unexpected' values in the OLAP cube. By indicating to the analyst which dimensions to explore, DDA speeds up the process of discovering anomalies and their causes. However, Discovery Driven Analysis (and OLAP in general) is only applicable to structured data, such as records in databases. We propose a system to extend DDA technology to semi-structured text documents, that is, text documents that carry only a small amount of structured data. Our system pipeline consists of two stages: first, the text part of each document is structured around user-specified dimensions using the semi-PLSA algorithm; then, we adapt DDA to these fully structured documents, thus enabling DDA on text documents. We present some applications of this system in OLAP analysis and show how scalability issues are solved. Results show that our system can handle reasonably sized document datasets in real time, without any need for pre-computation.
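The idea of highlighting 'unexpected' cube values can be illustrated with a minimal sketch. It assumes a two-dimensional slice of the cube and a simple additive expectation model (row effect plus column effect). The function name `surprising_cells`, the threshold and the sample data are hypothetical and are not taken from the system described in the abstract.

```python
from statistics import mean, pstdev


def surprising_cells(table, threshold=2.0):
    """Return (row, col, score) for cells that deviate from an additive model.

    Assumption: the expected value of a cell is
    row_mean + col_mean - grand_mean, and a cell counts as 'unexpected'
    when its residual exceeds `threshold` standard deviations of all residuals.
    """
    n_rows, n_cols = len(table), len(table[0])
    grand = mean([v for row in table for v in row])
    row_means = [mean(row) for row in table]
    col_means = [mean([table[i][j] for i in range(n_rows)]) for j in range(n_cols)]

    # Residual = observed value minus the value the additive model expects.
    residuals = [
        [table[i][j] - (row_means[i] + col_means[j] - grand) for j in range(n_cols)]
        for i in range(n_rows)
    ]
    sigma = pstdev([r for row in residuals for r in row])
    if sigma == 0:
        return []  # perfectly additive slice: nothing stands out

    return [
        (i, j, residuals[i][j] / sigma)
        for i in range(n_rows)
        for j in range(n_cols)
        if abs(residuals[i][j]) / sigma > threshold
    ]


# Hypothetical 4x4 slice that is additive except for one inflated cell.
cube_slice = [[10 * i + j for j in range(4)] for i in range(4)]
cube_slice[2][3] += 50
flagged = surprising_cells(cube_slice)
```

In this toy slice only the inflated cell is flagged. A real cube would need a model over more than two dimensions and several aggregation levels, which this sketch leaves out.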
Abstract:
The recent advent of new technologies has led to huge amounts of genomic data. With these data come new opportunities to understand the biological cellular processes underlying hidden regulation mechanisms and to identify disease-related biomarkers for informative diagnostics. However, extracting biological insights from the immense amounts of genomic data is a challenging task. Therefore, effective and efficient computational techniques are needed to analyze and interpret genomic data. In this thesis, novel computational methods are proposed to address such challenges: a Bayesian mixture model, an extended Bayesian mixture model, and an Eigen-brain approach. The Bayesian mixture framework integrates the Bayesian network and the Gaussian mixture model. Based on the proposed framework, in conjunction with K-means clustering and principal component analysis (PCA), biological insights are derived, such as context-specific/dependent relationships and nested structures within microarray data where biological replicates are encapsulated. The Bayesian mixture framework is then extended to explore posterior distributions of network space by incorporating a Markov chain Monte Carlo (MCMC) model. The extended Bayesian mixture model summarizes the sampled network structures by extracting biologically meaningful features. Finally, an Eigen-brain approach is proposed to analyze in situ hybridization data for the identification of cell-type-specific genes, which can be useful for informative blood diagnostics. Computational results with region-based clustering reveal critical evidence of consistency with brain anatomical structure.
Abstract:
Aim: The spread of non-indigenous species in marine ecosystems world-wide is one of today's most serious environmental concerns. Using mechanistic modelling, we investigated how global change relates to the invasion of European coasts by a non-native marine invertebrate, the Pacific oyster Crassostrea gigas. Location: Bourgneuf Bay on the French Atlantic coast was considered as the northern boundary of C. gigas expansion at the time of its introduction to Europe in the 1970s. From this latitudinal reference, variations in the spatial distribution of the C. gigas reproductive niche were analysed along the north-western European coast from Gibraltar to Norway. Methods: The effects of environmental variations on C. gigas physiology and phenology were studied using a bioenergetics model based on Dynamic Energy Budget theory. The model was forced with environmental time series including in situ phytoplankton data, and satellite data of sea surface temperature and suspended particulate matter concentration. Results: Simulation outputs were successfully validated against in situ oyster growth data. In Bourgneuf Bay, the rise in seawater temperature and phytoplankton concentration has increased C. gigas reproductive effort and led to precocious spawning periods since the 1960s. At the European scale, seawater temperature increase caused a drastic northward shift (1400 km within 30 years) in the C. gigas reproductive niche and optimal thermal conditions for early life stage development. Main conclusions: We demonstrated that the poleward expansion of the invasive species C. gigas is related to global warming and increase in phytoplankton abundance. The combination of mechanistic bioenergetics modelling with in situ and satellite environmental data is a valuable framework for ecosystem studies. It offers a generic approach to analyse historical geographical shifts and to predict the biogeographical changes expected to occur in a climate-changing world.
Abstract:
Background: Understanding transcriptional regulation through genome-wide microarray studies can help to unravel complex relationships between genes. Attempts to standardize the annotation of microarray data include the Minimum Information About a Microarray Experiment (MIAME) recommendations, the MAGE-ML format for data interchange, and the use of controlled vocabularies or ontologies. The existing software systems for microarray data analysis implement these standards only partially and are often hard to use and extend. Integration of genomic annotation data and other sources of external knowledge using open standards is therefore a key requirement for future integrated analysis systems. Results: The EMMA 2 software has been designed to resolve shortcomings with respect to full MAGE-ML and ontology support and makes use of modern data integration techniques. We present a software system that features comprehensive data analysis functions for spotted arrays and for the most common synthesized oligo arrays, such as Agilent, Affymetrix and NimbleGen. The system is based on the full MAGE object model. Analysis functionality is based on R and Bioconductor packages and can make use of a compute cluster for distributed services. Conclusion: Our model-driven approach for automatically implementing a full MAGE object model provides high flexibility and compatibility. Data integration via SOAP-based web services is advantageous in a distributed client-server environment, as the collaborative analysis of microarray data is gaining more and more relevance in international research consortia. The adequacy of the EMMA 2 software design and implementation has been demonstrated by its application in many distributed functional genomics projects. Its scalability makes the current architecture suited for extensions towards future transcriptomics methods based on high-throughput sequencing approaches, which have much higher computational requirements than microarrays.
Abstract:
Mass spectrometry (MS)-based proteomics has seen significant technical advances during the past two decades, and mass spectrometry has become a central tool in many biosciences. Despite the popularity of MS-based methods, the handling of systematic non-biological variation in the data remains a common problem. This biasing variation can result from several sources, ranging from sample handling to differences caused by the instrumentation. Normalization is the procedure which aims to account for this biasing variation and make samples comparable. Many normalization methods commonly used in proteomics have been adapted from the DNA-microarray world. Studies comparing normalization methods with proteomics data sets using some variability measures exist. However, a more thorough comparison, looking at the quantitative and qualitative differences in the performance of the different normalization methods and at their ability to preserve the true differential expression signal of proteins, is lacking. In this thesis, several popular and widely used normalization methods (linear regression normalization, local regression normalization, variance stabilizing normalization, quantile normalization, median central tendency normalization, and variants of some of these methods), representing different normalization strategies, are compared and evaluated with a benchmark spike-in proteomics data set. The normalization methods are evaluated in several ways. Their performance is evaluated qualitatively and quantitatively on a global scale and in pairwise comparisons of sample groups. In addition, it is investigated whether performing the normalization globally on the whole data set or pairwise for the comparison pairs examined affects the performance of the normalization method in normalizing the data and preserving the true differential expression signal.
In this thesis, both major and minor differences in the performance of the different normalization methods were found. Also, the way in which the normalization was performed (global normalization of the whole data or pairwise normalization of the comparison pair) affected the performance of some of the methods in pairwise comparisons. Differences among variants of the same methods were also observed.
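Two of the strategies named above, median central tendency normalization and quantile normalization, can be sketched in a few lines. This is an illustrative sketch only, not the implementation evaluated in the thesis: it assumes each sample is a list of log-intensities for the same proteins, with equal lengths, no missing values, and ties broken by position. The function names and the toy data are hypothetical.

```python
from statistics import mean, median


def median_center(samples):
    """Subtract each sample's median so that all samples share a median of 0."""
    return [[v - median(sample) for v in sample] for sample in samples]


def quantile_normalize(samples):
    """Give every sample the same empirical distribution.

    The k-th smallest value of each sample is replaced by the mean of the
    k-th smallest values across all samples.
    """
    n = len(samples[0])
    # Mean of the k-th smallest value across samples, for k = 0..n-1.
    rank_means = [mean(col) for col in zip(*[sorted(s) for s in samples])]
    normalized = []
    for sample in samples:
        # Indices of this sample's values, ordered from smallest to largest.
        order = sorted(range(n), key=lambda idx: sample[idx])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized


# Hypothetical log-intensities: two samples, three proteins each.
runs = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
centered = median_center(runs)
qnorm = quantile_normalize(runs)
```

Applying either function to the whole data set corresponds to the global variant. Passing only the two sample groups being compared corresponds to the pairwise variant.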
Abstract:
The work presented herein focused on the automation of coordination-driven self assembly, exploring methods that allow syntheses to be followed more closely while forming new ligands, as part of the fundamental study of the digitization of chemical synthesis and discovery. Whilst the control and understanding of the principle of pre-organization and self-sorting under non-equilibrium conditions remains a key goal, a clear gap has been identified in the absence of approaches that can permit fast screening and real-time observation of the reaction process under different conditions. A firm emphasis was thus placed on the realization of an autonomous chemical robot, which can not only monitor and manipulate coordination chemistry in real-time, but can also allow the exploration of a large chemical parameter space defined by the ligand building blocks and the metal to coordinate. The self-assembly of imine ligands with copper and nickel cations has been studied in a multi-step approach using a self-built flow system capable of automatically controlling the liquid-handling and collecting data in real-time using a benchtop MS and NMR spectrometer. This study led to the identification of a transient Cu(I) species in situ which allows for the formation of dimeric and trimeric carbonato bridged Cu(II) assemblies. Furthermore, new Ni(II) complexes and more remarkably also a new binuclear Cu(I) complex, which usually requires long and laborious inert conditions, could be isolated. The study was then expanded to the autonomous optimization of the ligand synthesis by enabling feedback control on the chemical system via benchtop NMR. The synthesis of new polydentate ligands has emerged as a result of the study aiming to enhance the complexity of the chemical system to accelerate the discovery of new complexes. This type of ligand consists of 1-pyridinyl-4-imino-1,2,3-triazole units, which can coordinate with different metal salts. 
The studies testing the microwave-assisted CuAAC synthesis led to the discovery of four new Cu complexes, one of them a coordination polymer obtained from a solvent-dependent crystallization technique. With the goal of easier integration into an automated system, copper tubing has been exploited as the chemical reactor for the synthesis of this ligand, as it efficiently enhances the rate of triazole formation and consequently promotes the formation of the full ligand in high yields within two hours. Lastly, the digitization of coordination-driven self-assembly has been realized for the first time using an in-house autonomous chemical robot, herein named the ‘Finder’. The chemical parameter space to explore was defined by the selection of six variables, which consist of the ligand precursors necessary to form complex ligands (aldehydes, alkyne amines and azides), the metal salt solutions, and other reaction parameters: duration, temperature and reagent volumes. The platform was assembled using round-bottom flasks, flow syringe pumps, copper tubing as an active reactor, and in-line analytics: a pH meter probe, a UV-vis flow cell and a benchtop MS. Control over the system was then obtained with an algorithm capable of autonomously focusing the experiments on the most reactive region of the chemical parameter space (by avoiding areas of low interest). This study led to interesting observations, such as metal exchange phenomena, and also to the autonomous discovery of self-assembled structures in solution and in the solid state, such as 1-pyridinyl-4-imino-1,2,3-triazole-based Fe complexes and two helicates based on the same ligand coordination motif.
Abstract:
The accuracy of a map is dependent on the reference dataset used in its construction. Classification analyses used in thematic mapping can, for example, be sensitive to a range of sampling and data quality concerns. With particular focus on the latter, the effects of reference data quality on land cover classifications from airborne thematic mapper data are explored. Variations in sampling intensity and effort are highlighted in a dataset that is widely used in mapping and modelling studies; these may need accounting for in analyses. The quality of the labelling in the reference dataset was also a key variable influencing mapping accuracy. Accuracy varied with the amount and nature of mislabelled training cases with the nature of the effects varying between classifiers. The largest impacts on accuracy occurred when mislabelling involved confusion between similar classes. Accuracy was also typically negatively related to the magnitude of mislabelled cases and the support vector machine (SVM), which has been claimed to be relatively insensitive to training data error, was the most sensitive of the set of classifiers investigated, with overall classification accuracy declining by 8% (significant at 95% level of confidence) with the use of a training set containing 20% mislabelled cases.
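The kind of experiment described above, training on a reference set that contains a known share of mislabelled cases, starts with controlled label corruption. The following is a minimal sketch of that step only. The function name `mislabel`, the class names and the seed are hypothetical, and wrong labels are drawn uniformly from the other classes, which is a simplifying assumption.

```python
import random


def mislabel(labels, fraction, classes, seed=0):
    """Return a copy of `labels` with a given fraction switched to a wrong class.

    Assumption: the wrong class is picked uniformly at random from the other
    classes; the study's own corruption scheme may differ.
    """
    rng = random.Random(seed)
    noisy = list(labels)
    n_flip = round(fraction * len(noisy))
    for idx in rng.sample(range(len(noisy)), n_flip):
        wrong = [c for c in classes if c != noisy[idx]]
        noisy[idx] = rng.choice(wrong)
    return noisy


# Hypothetical training labels: 100 cases over three land cover classes.
classes = ["water", "forest", "crop"]
clean = [classes[i % 3] for i in range(100)]
noisy = mislabel(clean, 0.20, classes)
n_changed = sum(a != b for a, b in zip(clean, noisy))
```

To imitate confusion between similar classes, the uniform choice could be replaced by a lookup of each class's most similar neighbour. A classifier would then be trained on `noisy` and scored against clean test labels.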
Abstract:
Recent marine long-offset transient electromagnetic (LOTEM) measurements yielded the offshore delineation of a fresh groundwater body beneath the seafloor in the region of Bat Yam, Israel. The LOTEM application was effective in detecting this freshwater body underneath the Mediterranean Sea and allowed an estimation of its seaward extent. However, the measured data set was insufficient to understand the hydrogeological configuration and the mechanism controlling the occurrence of this fresh groundwater discovery. In particular, the lateral geometry of the freshwater boundary, which is important for hydrogeological modelling, could not be resolved. Without such an understanding, rational management of this unexploited groundwater reservoir is not possible. Two new high-resolution marine time-domain electromagnetic methods are theoretically developed to derive the hydrogeological structure of the western aquifer boundary. The first is called Circular Electric Dipole (CED). It is the land-based analogue of the Vertical Electric Dipole (VED), which is commonly applied to detect resistive structures in the subsurface. Although the CED shows exceptional detectability characteristics in the step-off signal towards the sub-seafloor freshwater body, an actual application was not carried out within the scope of this study. It was found that the method suffers from insufficient signal strength to adequately delineate the resistive aquifer under realistic noise conditions. Moreover, modelling studies demonstrated that severe signal distortions are caused by the slightest geometrical inaccuracies. As a result, a successful application of CED in Israel proved to be rather doubtful. A second method, called Differential Electric Dipole (DED), is developed as an alternative to the intended CED method.
Compared to the conventional marine time-domain electromagnetic system, which commonly applies a horizontal electric dipole transmitter, the DED is composed of two horizontal electric dipoles in an in-line configuration that share a common central electrode. Theoretically, DED has detectability/resolution characteristics similar to those of the conventional LOTEM system. However, its superior lateral resolution towards multi-dimensional resistivity structures makes an application desirable. Furthermore, the method is less susceptible to geometrical errors, making an application in Israel feasible. Within the scope of this thesis, the novel marine DED method is substantiated using several one-dimensional (1D) and multi-dimensional (2D/3D) modelling studies. The main emphasis lies on the application in Israel. Preliminary resistivity models are derived from the previous marine LOTEM measurement and tested for a DED application. The DED method is effective in locating the two-dimensional resistivity structure at the western aquifer boundary. Moreover, a prediction regarding the hydrogeological boundary conditions is feasible, provided a brackish water zone exists at the head of the interface. A seafloor-based DED transmitter/receiver system is designed and built at the Institute of Geophysics and Meteorology at the University of Cologne. The first DED measurements were carried out in Israel in April 2016. The acquired data set is the first of its kind. The measured data are processed and subsequently interpreted using 1D inversion. The intended aim of interpreting both step-on and step-off signals could not be achieved owing to the insufficient data quality of the latter. Yet, the 1D inversion models of the DED step-on signals clearly detect the freshwater body for receivers located close to the Israeli coast. Additionally, a lateral resistivity contrast is observable in the 1D inversion models, which makes it possible to constrain the seaward extent of this freshwater body.
A large-scale 2D modelling study followed the 1D interpretation. In total, 425 600 forward calculations are conducted to find a sub-seafloor resistivity distribution that adequately explains the measured data. The results indicate that the western aquifer boundary is located 3600 m to 3700 m from the coast. Moreover, a brackish water zone of 3 Ω·m to 5 Ω·m with a lateral extent of less than 300 m is likely located at the head of the freshwater aquifer. Based on these results, it is predicted that the sub-seafloor freshwater body is indeed open to the sea and may be vulnerable to seawater intrusion.
Abstract:
Nowadays robotic applications are widespread and most manipulation tasks are solved efficiently. However, Deformable-Objects (DOs) still represent a major limitation for robots. The main difficulty in DO manipulation is dealing with the uncertainties in shape and dynamics, which prevent the use of model-based approaches (since they are excessively computationally complex) and make sensory data difficult to interpret. This thesis reports the research activities aimed at addressing some applications in robotic manipulation and sensing of Deformable-Linear-Objects (DLOs), with particular focus on electric wires. In all the works, a significant effort was made to study an effective strategy for analyzing sensory signals with various machine learning algorithms. In the first part of the document, the main focus is on the wire terminals, i.e. detection, grasping, and insertion. First, a pipeline that integrates vision and tactile sensing is developed; then further improvements are proposed for each module. A novel procedure is proposed to gather and label massive amounts of training images for object detection with minimal human intervention. Together with this strategy, we extend a generic object detector based on Convolutional-Neural-Networks to orientation prediction. The insertion task is also extended by developing a closed-loop control capable of guiding the insertion of a longer and curved segment of wire through a hole, where the contact forces are estimated by means of a Recurrent-Neural-Network. In the second part of the thesis, the interest shifts to the DLO shape. Robotic reshaping of a DLO is addressed by means of a sequence of pick-and-place primitives, while a decision-making process driven by visual data learns the optimal grasping locations using Deep Q-learning and finds the best releasing point. The success of the solution relies on a reliable interpretation of the DLO shape.
For this reason, further developments are made on the visual segmentation.
Abstract:
Noise is a constant presence in measurements. Its origin is related to the microscopic properties of matter. Since the seminal work of Brown in 1828, the study of stochastic processes has attracted increasing interest with the development of new mathematical and analytical tools. In recent decades, the central role that noise plays in chemical and physiological processes has become recognized. The dual role of noise as nuisance and resource pushes towards the development of new decomposition techniques that divide a signal into its deterministic and stochastic components. In this thesis I show how methods based on Singular Spectrum Analysis (SSA) have the right properties to fulfil this requirement. During my work I applied SSA to different signals of interest in chemistry: I developed a novel iterative procedure for the denoising of powder X-ray diffractograms, and I “denoised” two-dimensional images from electrochemiluminescence (ECL) imaging experiments on micro-beads, obtaining new insight into the ECL mechanism. I also used Principal Component Analysis to investigate the relationship between brain electrophysiological signals and voice emission.
Abstract:
Following the approval of the 2030 Agenda for Sustainable Development in 2015, sustainability became a hotly debated topic. In order to build a better and more sustainable future by 2030, this agenda addressed several global issues, including inequality, climate change, peace, and justice, in the form of 17 Sustainable Development Goals (SDGs) that should be understood and pursued by nations, corporations, institutions, and individuals. In this thesis, we researched how to exploit and integrate Human-Computer Interaction (HCI) and Data Visualization to promote knowledge and awareness about SDG 8, which aims to encourage lasting, inclusive, and sustainable economic growth, full and productive employment, and decent work for all. In particular, we focused on three targets: green economy; sustainable tourism; and employment, decent work for all, and social protection. The primary goal of this research is to determine whether HCI approaches may be used to create and validate interactive data visualizations that can serve as helpful decision-making aids for specific groups and raise their knowledge of public-interest issues. To accomplish this goal, we analyzed four case studies. In the first two, we wanted to promote knowledge and awareness about green economy issues: we investigated Human-Building Interaction inside a Smart Campus and the dematerialization process inside a University. In the third, we focused on smart tourism, investigating the relationship between locals and tourists to create meaningful connections and promote more sustainable tourism. In the fourth, we explored the industry context to highlight sustainability policies inside well-known companies. This research is based on the hypothesis that interactive data visualization tools can make communities aware of sustainability aspects related to SDG 8 and its targets. Two research questions are addressed: "how to promote awareness about SDG 8 and its targets through interactive data visualizations?" and "to what extent are these interactive data visualizations effective?".
Abstract:
Machine Learning makes computers capable of performing tasks that typically require human intelligence. A domain where it is having a considerable impact is the life sciences, where it makes it possible to devise new biological analysis protocols, develop patients’ treatments faster and more efficiently, and reduce healthcare costs. This thesis presents new Machine Learning methods and pipelines for the life sciences, focusing on the unsupervised field. At a methodological level, two methods are presented. The first is an “Ab Initio Local Principal Path”, a revised and improved version of a pre-existing algorithm in the manifold learning realm. The second contribution is an improvement over the Import Vector Domain Description (one-class learning) through the Kullback-Leibler divergence. It hybridizes kernel methods with Deep Learning, obtaining a scalable solution, an improved probabilistic model, and state-of-the-art performance. Both methods are tested through several experiments, with a central focus on their relevance in the life sciences. Results show that they improve on the performance achieved by their previous versions. At the applicative level, two pipelines are presented. The first one is for the analysis of RNA-Seq datasets, both transcriptomic and single-cell data, and is aimed at identifying genes that may be involved in biological processes (e.g., the transition of tissues from normal to cancer). In this project, an R package is released on CRAN to make the pipeline accessible to the bioinformatics community through high-level APIs. The second pipeline is in the drug discovery domain and is useful for identifying druggable pockets, namely regions of a protein with a high probability of accepting a small molecule (a drug). Both these pipelines achieve remarkable results. Lastly, a detour application is developed to identify the strengths and limitations of the “Principal Path” algorithm by analyzing vector spaces induced by Convolutional Neural Networks.
This application is conducted in the music and visual arts domains.
Abstract:
Negative-ion mode electrospray ionization, ESI(-), with Fourier transform ion cyclotron resonance mass spectrometry (FT-ICR MS) was coupled to Partial Least Squares (PLS) regression and variable selection methods to estimate the total acid number (TAN) of Brazilian crude oil samples. Generally, ESI(-)-FT-ICR mass spectra present a resolving power of ca. 500,000 and a mass accuracy better than 1 ppm, producing a data matrix containing over 5700 variables per sample. These variables correspond to heteroatom-containing species detected as deprotonated molecules, [M − H]⁻ ions, which are identified primarily as naphthenic acids, phenols and carbazole analog species. The TAN values for all samples ranged from 0.06 to 3.61 mg of KOH g⁻¹. To facilitate the spectral interpretation, three methods of variable selection were studied: variable importance in the projection (VIP), interval partial least squares (iPLS) and elimination of uninformative variables (UVE). The UVE method seems to be more appropriate for selecting important variables, reducing the number of variables to 183 and producing a root mean square error of prediction of 0.32 mg of KOH g⁻¹. By reducing the size of the data, it was possible to relate the selected variables to their corresponding molecular formulas, thus identifying the main chemical species responsible for the TAN values.
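The root mean square error of prediction quoted above is the square root of the mean squared difference between measured and predicted values on a prediction set. A minimal sketch of that calculation follows. The function name `rmsep` and the TAN values are hypothetical and do not come from the study.

```python
from math import sqrt


def rmsep(measured, predicted):
    """Root mean square error of prediction over paired values."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(m - p) ** 2 for m, p in zip(measured, predicted)]
    return sqrt(sum(squared_errors) / len(measured))


# Hypothetical TAN values in mg of KOH per g for three prediction samples.
tan_measured = [0.06, 1.20, 3.61]
tan_predicted = [0.30, 1.00, 3.40]
error = rmsep(tan_measured, tan_predicted)
```

The result is in the same unit as TAN, so it can be read directly against the 0.06 to 3.61 range of the samples.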
Abstract:
The enamel microabrasion technique consists of selectively abrading discolored areas or causing superficial structural changes in a selective way. In the microabrasion technique, abrasive products associated with acids are used, and the evaluation of enamel roughness after this treatment, as well as after surface polishing, is necessary. This in-vitro study evaluated enamel roughness after microabrasion followed by different polishing techniques. Roughness analyses were performed before microabrasion (L1), after microabrasion (L2), and after polishing (L3). For this purpose, 60 bovine incisors were selected and divided into two groups (n=30): G1 - phosphoric acid (37%) (Dentsply) and pumice; G2 - hydrochloric acid (6.6%) associated with silicon carbide (Opalustre - Ultradent). Thereafter, the groups were divided into three sub-groups (n=10) according to the polishing system: A - fine and superfine granulation aluminum oxide discs (SofLex 3M); B - diamond paste (FGM) associated with felt discs (FGM); C - silicone tips (Enhance - Dentsply). A PROC MIXED procedure was applied after exploratory data analysis, as well as the Tukey-Kramer test (5%). No statistical differences were found between the G1 and G2 groups. L2 differed statistically from L1 and showed greater roughness. Post-polishing roughness differed for specific groups (1A, 2B, and 1C), which showed less roughness in L3 and differed statistically from L2. All products increased enamel roughness, and the effectiveness of the polishing systems was dependent upon the abrasive used.