16 results for decentralised data fusion framework
in Helda - Digital Repository of University of Helsinki
Abstract:
The core aim of machine learning is to make a computer program learn from experience. Learning from data is usually defined as the task of learning regularities or patterns in data in order to extract useful information, or to learn the underlying concept. An important sub-field of machine learning is multi-view learning, where the task is to learn from multiple data sets or views describing the same underlying concept. A typical example of such a scenario would be to study a biological concept using several biological measurements, such as gene expression, protein expression and metabolic profiles, or to classify web pages based on their content and the contents of their hyperlinks. In this thesis, novel problem formulations and methods for multi-view learning are presented. The contributions include a linear data fusion approach for exploratory data analysis, a new measure to evaluate different kinds of representations for textual data, and an extension of multi-view learning to novel scenarios where the correspondence of samples in the different views or data sets is not known in advance. In order to infer the one-to-one correspondence of samples between two views, a novel concept of multi-view matching is proposed. The matching algorithm is completely data-driven and is demonstrated in several applications, such as matching metabolites between humans and mice, and matching sentences between documents in two languages.
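The abstract stops short of detailing the matching algorithm. As a purely illustrative sketch, one-to-one correspondence between two views can be posed as a linear assignment problem over a cross-view distance matrix; the thesis's actual data-driven method may differ:

```python
# Minimal sketch: one-to-one matching of samples across two views,
# posed as a linear assignment problem. The thesis's own matching
# algorithm is not specified in the abstract; this only illustrates
# the general idea of data-driven correspondence inference.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_views(X, Y):
    """Pair each row of X (view 1) with a row of Y (view 2)
    by minimising the total squared Euclidean distance."""
    # Cost matrix: distance between every cross-view pair.
    cost = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
    return list(zip(rows, cols))

# Toy example: Y is a shuffled, noisy copy of X.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
perm = rng.permutation(5)
Y = X[perm] + 0.01 * rng.normal(size=(5, 3))
print(match_views(X, Y))  # recovers the sample correspondence
```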
Abstract:
This work belongs to the field of computational high-energy physics (HEP). The key methods used in this thesis to meet the challenges raised by the Large Hadron Collider (LHC) era experiments are object orientation with software engineering, Monte Carlo simulation, cluster computing technology, and artificial neural networks. The first aspect discussed is the development of hadronic cascade models, used for the accurate simulation of medium-energy hadron-nucleus reactions up to 10 GeV. These models are typically needed in hadronic calorimeter studies and in the estimation of radiation backgrounds. Various applications outside HEP include the medical field (such as hadron treatment simulations), space science (satellite shielding), and nuclear physics (spallation studies). Validation results are presented for several significant improvements released in the Geant4 simulation tool, and the significance of the new models for computing in the LHC era is estimated. In particular, we estimate the ability of the Bertini cascade to simulate the Compact Muon Solenoid (CMS) hadron calorimeter (HCAL). LHC test beam activity has a tightly coupled simulation-to-data-analysis cycle; typically, a Geant4 computer experiment is used to understand test beam measurements. Thus, another aspect of this thesis is a description of studies related to developing new CMS H2 test beam data analysis tools and performing data analysis on the basis of CMS Monte Carlo events. These events have been simulated in detail using Geant4 physics models, a full CMS detector description, and event reconstruction. Using the ROOT data analysis framework, we have developed an offline ANN-based approach to tag b-jets associated with heavy neutral Higgs particles, and we show that this kind of NN methodology can be successfully used to separate the Higgs signal from the background in the CMS experiment.
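As a rough illustration of the kind of NN-based signal/background separation described above, the following sketch trains a small classifier on invented toy jet features; the thesis's actual analysis used an ANN within the ROOT framework on real CMS Monte Carlo events:

```python
# Minimal sketch of NN-based signal/background separation on toy jet
# features. A generic scikit-learn MLP stands in for the thesis's ANN,
# and the two discriminating variables are invented (imagine, e.g.,
# impact-parameter significance and secondary-vertex mass).
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 2000
background = rng.normal(loc=0.0, scale=1.0, size=(n, 2))  # class 0
signal = rng.normal(loc=1.0, scale=1.0, size=(n, 2))      # class 1, shifted
X = np.vstack([background, signal])
y = np.array([0] * n + [1] * n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=500, random_state=0)
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```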
Abstract:
Numerical weather prediction (NWP) models provide the basis for weather forecasting by simulating the evolution of the atmospheric state. A good forecast requires that the initial state of the atmosphere is known accurately, and that the NWP model is a realistic representation of the atmosphere. Data assimilation methods are used to produce initial conditions for NWP models: the NWP model background field, typically a short-range forecast, is updated with observations in a statistically optimal way. The objective of this thesis has been to develop methods that allow data assimilation of Doppler radar radial wind observations. The work has been carried out in the High Resolution Limited Area Model (HIRLAM) 3-dimensional variational data assimilation framework. Observation modelling is a key element in exploiting indirect observations of the model variables. In the radar radial wind observation modelling, the vertical model wind profile is interpolated to the observation location, and the projection of the model wind vector on the radar pulse path is calculated. The vertical broadening of the radar pulse volume and the bending of the radar pulse path due to atmospheric conditions are taken into account. Radar radial wind observations are subject to observation errors, which consist of instrumental, modelling, and representativeness errors. Systematic and random modelling errors can be minimized by accurate observation modelling. The impact of the random part of the instrumental and representativeness errors can be decreased by calculating spatial averages from the raw observations. Model experiments indicate that spatial averaging clearly improves the fit of the radial wind observations to the model in terms of the observation minus model background (OmB) standard deviation. Monitoring the quality of the observations is an important aspect, especially when a new observation type is introduced into a data assimilation system. Calculating the bias for radial wind observations in the conventional way can yield zero even when there are systematic differences in wind speed and/or direction. A bias estimation method designed for this observation type is introduced in the thesis. Doppler radar radial wind observation modelling, together with the bias estimation method, enables the exploitation of radial wind observations also for NWP model validation. One-month model experiments performed with HIRLAM model versions differing only in a surface stress parameterization detail indicate that the use of radar wind observations in NWP model validation is very beneficial.
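The projection step of the observation operator can be illustrated with the standard geometric formula; this minimal sketch omits the beam broadening and bending effects that the thesis's operator models:

```python
# Minimal sketch of a radial-wind observation operator: project a model
# wind vector (u, v, w) onto the radar beam direction. HIRLAM's actual
# operator also models pulse-volume broadening and beam bending, which
# are omitted here.
import numpy as np

def radial_wind(u, v, w, azimuth_deg, elevation_deg):
    """Radial velocity seen by the radar for model wind (u, v, w).
    Azimuth is measured clockwise from north; elevation from horizontal.
    Positive values point away from the radar."""
    az = np.radians(azimuth_deg)
    el = np.radians(elevation_deg)
    return (u * np.sin(az) + v * np.cos(az)) * np.cos(el) + w * np.sin(el)

# Example: a pure westerly wind of 10 m/s, radar beam pointing due east
# at a 0.5 degree elevation angle.
print(radial_wind(10.0, 0.0, 0.0, azimuth_deg=90.0, elevation_deg=0.5))
```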
Abstract:
This thesis is an exploratory case study that aims to understand the attitudes affecting the adoption of mobile self-services. The study used a demo mobile self-service with which consumers could make address changes. The service was branded with a large and trusted Finnish brand. The theoretical framework consisted of adoption theories of technology, adoption theories of self-service, and literature concerning mobile services. The reviewed adoption theories of both technology and self-service had their foundations in IDT or TRA/TPB. Based on the reviewed theories, an initial framework was created. The empirical data collection was done through three computer-aided group interview sessions with a total of 32 respondents. The data analysis started from the premises of the initial framework; based on the empirical data, the framework was continually reviewed and altered and the data recoded accordingly. The result of this thesis is a list of attitudinal factors that affect the adoption of a mobile self-service either positively or negatively. The factors found to affect attitudes towards adoption positively were that the service was time- and place-independent and saved time. Most respondents, but not all, also had a positive attitude towards adoption due to ease of use and mental compatibility with the service. Factors that affected adoption negatively were lack of technical compatibility, perceived risk of high costs, and risk of malicious software. The identified factors were triangulated with respect to existing literature and general attitudes towards mobile services.
Abstract:
A thunderstorm is a dangerous electrical phenomenon in the atmosphere. A thundercloud is formed when thermal energy is transported rapidly upwards in convective updraughts. Electrification occurs in collisions of cloud particles in the strong updraught. When the amount of charge in the cloud is large enough, an electrical breakdown, better known as a flash, occurs. Lightning location is nowadays an essential tool for the detection of severe weather. Located flashes indicate in real time the movement of hazardous areas and the intensity of lightning activity. Also, an estimate of the flash peak current can be determined. The observations can be used in damage surveys. The simplest way to represent lightning data is to plot the locations on a map, but the data can also be processed into more complex end-products and exploited in data fusion. Lightning data also serves as an important tool in the research of lightning-related phenomena, such as Transient Luminous Events. Most global thunderstorms occur in areas with plenty of heat, moisture and tropospheric instability, for example in tropical land areas. At higher latitudes, as in Finland, the thunderstorm season is practically restricted to the summer. A particular feature of high-latitude climatology is the large annual variation, which applies also to thunderstorms. Knowing the performance of any measuring device is important because it affects the accuracy of the end-products. In lightning location systems, the detection efficiency is the ratio between located and actually occurred flashes. Because in practice it is impossible to know the true number of flashes that actually occurred, the detection efficiency has to be estimated with theoretical methods.
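As a purely illustrative note on the theoretical estimation mentioned above: if each sensor is assigned an invented, independent detection probability, the network-level detection efficiency for a flash requiring at least a minimum number of detecting sensors can be computed as follows:

```python
# Purely illustrative sketch: network detection efficiency when a flash
# must be detected by at least `min_sensors` sensors, assuming invented,
# independent per-sensor detection probabilities. Real estimates derive
# these probabilities from the peak-current distribution and the
# sensor-to-flash distances.
from itertools import combinations

def detection_efficiency(p_sensors, min_sensors=2):
    """P(at least min_sensors of the sensors detect a given flash)."""
    n = len(p_sensors)
    total = 0.0
    for k in range(min_sensors, n + 1):
        for detecting in combinations(range(n), k):
            prob = 1.0
            for i in range(n):
                prob *= p_sensors[i] if i in detecting else 1 - p_sensors[i]
            total += prob
    return total

print(detection_efficiency([0.9, 0.8, 0.7, 0.6], min_sensors=2))
```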
Abstract:
Although previous research has recognised adaptation as a central aspect of relationships, the adaptation of the sales process to the buying process has not been studied. Furthermore, linking relationship orientation as a mindset with adaptation as a strategy and means has not been elaborated upon in previous research. Adaptation in the context of relationships has mostly been studied in relationship marketing. In sales and sales management research, adaptation has been studied with reference to personal selling. This study focuses on adaptation of the sales process to strategically match it to the buyer’s mindset and buying process. The purpose of this study is to develop a framework for strategic adaptation of the seller’s sales process to match the buyer’s buying process in a business-to-business context, in order to make sales processes more relationship oriented. To arrive at a holistic view of adaptation of the sales process during relationship initiation, both the seller and the buyer are included in an extensive case analysed in the study. However, the selected perspective is primarily that of the seller, and the level focused on is that of the sales process. The epistemological perspective adopted is constructivism. The study is a qualitative one applying a retrospective case study, where the main sources of information are in-depth semi-structured interviews with key informants representing the counterparts at the seller and the buyer in the software development and telecommunications industries. The main theoretical contributions of this research involve targeting a new area at the crossroads of relationship marketing, sales and sales management, and buying and purchasing, by studying adaptation in a business-to-business context from a new perspective. Primarily, this study contributes to research in sales and sales management with reference to relationship orientation and strategic sales process adaptation. The research fills three gaps: firstly, it links the relationship orientation mindset with adaptation as a strategy; secondly, it extends adaptation in sales from adaptation in selling to strategic adaptation of the sales process; thirdly, it extends adaptation to include the facilitation of adaptation. The approach applied in the study, systematic combining, is characterised by continuously moving back and forth between theory and empirical data. The framework that emerges, in which linking mindset with strategy and means forms a central aspect, includes three layers: purchasing portfolio, seller-buyer relationship orientation, and strategic sales process adaptation. Linking the three layers enables an analysis of where sales process adaptation can make a contribution. Furthermore, implications for managerial use are demonstrated, for example how sellers can avoid the ‘trap’ of ad hoc adaptation. This includes involving the company, embracing the buyer’s purchasing portfolio, understanding the seller’s current position in this portfolio, and possibly educating the buyer about the advantages of adopting a relationship-oriented approach.
Abstract:
During the last few decades there has been a global shift in forest management from a focus solely on timber management to ecosystem management that endorses all aspects of forest functions: ecological, economic and social. This has resulted in a paradigm shift from sustained yield to the sustained diversity of values, goods and benefits obtained at the same time, introducing new temporal and spatial scales into forest resource management. The purpose of the present dissertation was to develop methods that enable spatial and temporal scales to be introduced into the storage, processing, access and utilization of forest resource data. The methods developed are based on a conceptual view of a forest as a hierarchically nested collection of objects that can have a dynamically changing set of attributes. The temporal aspect of the methods consists of lifetime management for the objects and their attributes, and of a temporal succession linking the objects together. Development of the forest resource data processing method concentrated on the extensibility and configurability of the data content and model calculations, allowing a diverse set of processing operations to be executed within the same framework. The contribution of this dissertation to the utilisation of multi-scale forest resource data lies in the development of a reference data generation method to support forest inventory methods in approaching single-tree resolution.
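The object model described above can be illustrated with a minimal sketch; all class, attribute and field names here are invented for illustration, not taken from the dissertation:

```python
# Minimal sketch of the data model the abstract describes: hierarchically
# nested forest objects whose attributes carry lifetimes (valid-time
# intervals), plus a succession link between objects over time.
from dataclasses import dataclass, field

@dataclass
class Attribute:
    name: str
    value: float
    valid_from: int              # e.g. year the value became valid
    valid_to: int | None = None  # None = still valid

@dataclass
class ForestObject:
    name: str
    attributes: list[Attribute] = field(default_factory=list)
    children: list["ForestObject"] = field(default_factory=list)
    successor: "ForestObject | None" = None  # temporal succession link

    def value_at(self, attr, year):
        """Attribute value in force in a given year, if any."""
        for a in self.attributes:
            if a.name == attr and a.valid_from <= year and (
                    a.valid_to is None or year < a.valid_to):
                return a.value
        return None

stand = ForestObject("stand-1", [Attribute("volume_m3_ha", 180.0, 2005, 2010),
                                 Attribute("volume_m3_ha", 210.0, 2010)])
print(stand.value_at("volume_m3_ha", 2012))  # -> 210.0
```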
Abstract:
Whether a statistician wants to complement a probability model for observed data with a prior distribution and carry out fully probabilistic inference, or base the inference only on the likelihood function, may be a fundamental question in theory, but in practice it may well be of less importance if the likelihood contains much more information than the prior. Maximum likelihood inference can be justified as a Gaussian approximation at the posterior mode, using flat priors. However, in situations where the parametric assumptions of standard statistical models would be too rigid, a more flexible model formulation, combined with fully probabilistic inference, can be achieved using a hierarchical Bayesian parametrization. This work includes five articles, all of which apply probability modeling to various problems involving incomplete observation. Three of the papers apply maximum likelihood estimation and two of them hierarchical Bayesian modeling. Because maximum likelihood may be presented as a special case of Bayesian inference, but not the other way round, the introductory part of this work presents a framework for probability-based inference using only Bayesian concepts. We also re-derive some results presented in the original articles using the toolbox provided herein, to show that they are also justifiable under this more general framework. The assumption of exchangeability and de Finetti's representation theorem are applied repeatedly to justify the use of standard parametric probability models with conditionally independent likelihood contributions. It is argued that the same reasoning applies also under sampling from a finite population. The main emphasis here is on probability-based inference under incomplete observation due to study design, illustrated using a generic two-phase cohort sampling design as an example. The alternative approaches presented for the analysis of such a design are full likelihood, which utilizes all observed information, and conditional likelihood, which is restricted to a completely observed set, conditioning on the rule that generated that set. Conditional likelihood inference is also applied to a joint analysis of prevalence and incidence data, a situation subject to both left censoring and left truncation. Other topics covered are model uncertainty and causal inference using posterior predictive distributions. We formulate a non-parametric monotonic regression model for one or more covariates together with a Bayesian estimation procedure, and apply the model in the context of optimal sequential treatment regimes, demonstrating that inference based on posterior predictive distributions is feasible also in this case.
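In generic notation (ours, not the thesis's), the contrast between the two approaches for a design with a completely observed subset S can be written as:

```latex
% Full likelihood: all observed information, with unobserved data
% y^{mis} integrated out. Conditional likelihood: only the completely
% observed set S, conditioning on the rule by which S was selected.
L_{\mathrm{full}}(\theta)
  = \prod_{i=1}^{n} \int p\bigl(y_i^{\mathrm{obs}}, y_i^{\mathrm{mis}} \mid \theta\bigr)\,\mathrm{d}y_i^{\mathrm{mis}},
\qquad
L_{\mathrm{cond}}(\theta)
  = \prod_{i \in S} p\bigl(y_i \mid i \in S,\; \theta\bigr).
```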
Abstract:
Advances in analysis techniques have led to a rapid accumulation of biological data in databases. Such data often take the form of sequences of observations, examples including DNA sequences and the amino acid sequences of proteins. The scale and quality of the data promise answers to various biologically relevant questions in more detail than has been possible before. For example, one may wish to identify areas in an amino acid sequence which are important for the function of the corresponding protein, or investigate how characteristics at the level of the DNA sequence affect the adaptation of a bacterial species to its environment. Many of the interesting questions are intimately associated with understanding the evolutionary relationships among the items under consideration. The aim of this work is to develop novel statistical models and computational techniques to meet the challenge of deriving meaning from the increasing amounts of data. Our main concern is modeling the evolutionary relationships based on the observed molecular data. We operate within a Bayesian statistical framework, which allows a probabilistic quantification of the uncertainties related to a particular solution. As the basis of our modeling approach we utilize a partition model, which describes the structure of the data by appropriately dividing the data items into clusters of related items. Generalizations and modifications of the partition model are developed and applied to various problems. Large-scale data sets also pose a computational challenge. The models used to describe the data must be realistic enough to capture the essential features of the current modeling task but, at the same time, simple enough to make it possible to carry out the inference in practice. The partition model fulfills these two requirements. Problem-specific features can be taken into account by modifying the prior probability distributions of the model parameters. The computational efficiency stems from the ability to integrate out the parameters of the partition model analytically, which enables the use of efficient stochastic search algorithms.
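The analytic integration mentioned above can be illustrated with a standard stand-in: a Dirichlet-multinomial marginal likelihood, in which the cluster parameters integrate out in closed form. This is only a sketch of the general mechanism, not the thesis's model:

```python
# Minimal sketch: Dirichlet-multinomial marginal likelihood of a
# partition of items with discrete counts, the cluster parameters being
# integrated out analytically -- the property that makes stochastic
# search over partitions cheap. A stand-in for the thesis's model.
import numpy as np
from scipy.special import gammaln

def log_marginal(counts, alpha=1.0):
    """Log marginal likelihood of one cluster's symbol counts under a
    symmetric Dirichlet(alpha) prior on the category probabilities."""
    counts = np.asarray(counts, dtype=float)
    K = counts.size
    return (gammaln(K * alpha) - gammaln(K * alpha + counts.sum())
            + np.sum(gammaln(alpha + counts) - gammaln(alpha)))

def partition_score(data, labels, alpha=1.0):
    """Sum of per-cluster marginal likelihoods for a given clustering.
    `data` holds category counts per item; `labels` assigns clusters."""
    data, labels = np.asarray(data), np.asarray(labels)
    return sum(log_marginal(data[labels == c].sum(axis=0), alpha)
               for c in np.unique(labels))

# Toy data: items with counts over 4 categories; compare two partitions.
data = [[5, 0, 0, 1], [4, 1, 0, 0], [0, 0, 6, 1], [1, 0, 5, 0]]
print(partition_score(data, [0, 0, 1, 1]))  # homogeneous grouping: higher
print(partition_score(data, [0, 1, 0, 1]))  # mixed grouping: lower
```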
Abstract:
In this Thesis, we develop theory and methods for computational data analysis. The problems in data analysis are approached from three perspectives: statistical learning theory, the Bayesian framework, and the information-theoretic minimum description length (MDL) principle. Contributions in statistical learning theory address the possibility of generalization to unseen cases, and regression analysis with partially observed data, with an application to mobile device positioning. In the second part of the Thesis, we discuss so-called Bayesian network classifiers, and show that they are closely related to logistic regression models. In the final part, we apply the MDL principle to tracing the history of old manuscripts, and to noise reduction in digital signals.
Abstract:
Segmentation is a data mining technique yielding simplified representations of sequences of ordered points. A sequence is divided into a number of homogeneous segments, and all points within a segment are described by a single value. The focus in this thesis is on piecewise-constant segments, where the most likely description for each segment and the most likely segmentation into a given number of segments can be computed efficiently. Representing sequences as segmentations is useful in, e.g., storage and indexing tasks in sequence databases, and segmentation can be used as a tool for learning about the structure of a given sequence. The discussion in this thesis begins with basic questions related to segmentation analysis, such as choosing the number of segments and evaluating the obtained segmentations. Standard model selection techniques are shown to perform well for the sequence segmentation task. Segmentation evaluation is proposed with respect to a known segmentation structure. Applying segmentation to certain features of a sequence is shown to yield segmentations that are significantly close to the known underlying structure. Two extensions to the basic segmentation framework are introduced: unimodal segmentation and basis segmentation. The former is concerned with segmentations where the segment descriptions first increase and then decrease, and the latter with the interplay between different dimensions and segments in the sequence. These problems are formally defined, and algorithms for solving them are provided and analyzed. Practical applications for segmentation techniques include time series and data stream analysis, text analysis, and biological sequence analysis. In this thesis, segmentation applications are demonstrated in the analysis of genomic sequences.
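The efficient computation of the most likely piecewise-constant segmentation is classically done by dynamic programming; a minimal sketch (not the thesis's code) follows:

```python
# Minimal sketch (not the thesis's code): optimal piecewise-constant
# segmentation into k segments by dynamic programming, minimising the
# total squared error; each segment is described by its mean.
import numpy as np

def segment(x, k):
    """Return the k segment end points minimising the sum of squared errors."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Prefix sums give the squared error of any interval in O(1).
    s = np.concatenate([[0.0], x.cumsum()])
    s2 = np.concatenate([[0.0], (x ** 2).cumsum()])

    def sse(i, j):  # cost of one segment covering x[i:j]
        return s2[j] - s2[i] - (s[j] - s[i]) ** 2 / (j - i)

    cost = np.full((k + 1, n + 1), np.inf)
    back = np.zeros((k + 1, n + 1), dtype=int)
    cost[0, 0] = 0.0
    for seg in range(1, k + 1):
        for j in range(seg, n + 1):
            for i in range(seg - 1, j):
                c = cost[seg - 1, i] + sse(i, j)
                if c < cost[seg, j]:
                    cost[seg, j], back[seg, j] = c, i
    bounds, j = [], n
    for seg in range(k, 0, -1):  # backtrack the optimal cut points
        bounds.append(int(j))
        j = back[seg, j]
    return bounds[::-1]

print(segment([1, 1, 1, 5, 5, 5, 2, 2], k=3))  # -> [3, 6, 8]
```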
Abstract:
The tackling of coastal eutrophication requires water protection measures based on status assessments of water quality. The main purpose of this thesis was to evaluate whether it is possible, both scientifically and within the terms of the European Union Water Framework Directive (WFD), to assess the status of coastal marine waters reliably by using phytoplankton biomass (ww) and chlorophyll a (Chl) as indicators of eutrophication in Finnish coastal waters. Empirical approaches were used to study whether the criteria established for determining an indicator are fulfilled. The first criterion (i) was that an indicator should respond to anthropogenic stresses in a predictable manner and have low variability in its response. Summertime Chl could be predicted accurately from nutrient concentrations, but not from the external annual loads alone, because of the rapid effect of primary production and sedimentation close to the loading sources in summer. The most accurate predictions were achieved in the Archipelago Sea, where total phosphorus (TP) and total nitrogen (TN) alone accounted for 87% and 78% of the variation in Chl, respectively. In river estuaries, the TP mass-balance regression model predicted Chl most accurately when nutrients originated from point sources, whereas land-use regression models were most accurate when nutrients originated mainly from diffuse sources. The inclusion of morphometry (e.g. mean depth) in the nutrient models improved the accuracy of the predictions. The second criterion (ii) was associated with the WFD. It requires that an indicator have type-specific reference conditions, which are defined as "conditions where the values of the biological quality elements are at high ecological status". In establishing reference conditions, the empirical approach could only be used in the outer coastal water types, where historical observations of Secchi depth from the early 1900s are available. The most accurate prediction was achieved in the Quark. In the inner coastal water types, the reference Chl values, estimated from present monitoring data, are imprecise - not only because of the less accurate estimation method but also because the intrinsic characteristics, described for instance by morphometry, vary considerably within these extensive inner coastal types. As for phytoplankton biomass, the reference values were less accurate than those for Chl, because reference conditions for biomass could only be estimated using the reconstructed Chl values, not the historical Secchi observations. A paleoecological approach was also applied to estimate annual average reference conditions for Chl. In Laajalahti, an urban embayment off Helsinki that was strongly loaded by municipal waste waters in the 1960s and 1970s, reference conditions prevailed in the mid- and late 1800s. The recovery of the bay from pollution has been delayed as a consequence of the benthic release of nutrients, and Laajalahti will probably not achieve the good quality objectives of the WFD on time. The third criterion (iii) was associated with coastal management, including the resources it has available. Analyses of Chl are cheap and fast to carry out compared with analyses of phytoplankton biomass and species composition, a fact which affects the number of samples that can be taken and thereby the reliability of the assessments. However, analyses of phytoplankton biomass and species composition provide more metrics for ecological classification, metrics which reveal aspects of eutrophication that Chl alone does not.
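As an illustration of the regression approach behind the first criterion, a log-log fit of Chl on TP is a common model form; the data and coefficients below are invented, not the thesis's:

```python
# Minimal sketch of a nutrient-to-chlorophyll regression of the kind the
# abstract evaluates. The thesis's exact model forms are not given here;
# a log-log linear fit of Chl on total phosphorus (TP) is a common
# choice, and the numbers below are invented toy data.
import numpy as np

tp = np.array([15., 25., 40., 60., 90.])    # total phosphorus, ug/l
chl = np.array([2.1, 4.0, 7.5, 11., 19.])   # chlorophyll a, ug/l

slope, intercept = np.polyfit(np.log10(tp), np.log10(chl), 1)
pred = 10 ** (intercept + slope * np.log10(tp))
r2 = 1 - ((chl - pred) ** 2).sum() / ((chl - chl.mean()) ** 2).sum()
print(f"log10(Chl) = {intercept:.2f} + {slope:.2f} * log10(TP), R^2 = {r2:.2f}")
```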
Abstract:
Fusion power is an appealing source of clean and abundant energy. The radiation resistance of reactor materials is one of the greatest obstacles on the path towards commercial fusion power. These materials are subject to a harsh radiation environment, and must neither fail mechanically nor contaminate the fusion plasma. Moreover, for a power plant to be economically viable, the reactor materials must withstand long operation times with little maintenance. Fusion reactor materials will contain hydrogen and helium, due to deposition from the plasma and to nuclear reactions caused by energetic neutron irradiation. The first-wall and divertor materials, carbon and tungsten in existing and planned test reactors, will be subject to intense bombardment by low-energy deuterium and helium, which erodes and modifies the surface. All reactor materials, including the structural steel, will suffer irradiation by high-energy neutrons, causing displacement cascade damage. Molecular dynamics simulation is a valuable tool for studying irradiation phenomena, such as surface bombardment and the onset of primary damage due to displacement cascades. The governing mechanisms are at the atomic level, and hence not easily studied experimentally. In order to model materials, interatomic potentials are needed to describe the interactions between the atoms. In this thesis, new interatomic potentials were developed for the tungsten-carbon-hydrogen system and for iron-helium and chromium-helium. This made possible the study of previously inaccessible systems, in particular the effect of H and He on radiation damage. The potentials were based on experimental and ab initio data from the literature, as well as on density-functional theory calculations performed in this work. As a model for ferritic steel, iron-chromium with 10% Cr was studied. The difference between Fe and FeCr was shown to be negligible for threshold displacement energies. The properties of small He and He-vacancy clusters in Fe and FeCr were also investigated. The clusters were found to be more mobile and to dissociate more rapidly than previously assumed, and the effect of Cr was small. The primary damage formed by displacement cascades was found to be heavily influenced by the presence of He, both in FeCr and in W. Many important issues with fusion reactor materials remain poorly understood, and resolving them will require a huge effort by the international community. The potential models developed and the simulations performed in this thesis reveal many interesting features, but also serve as a platform for further studies.
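As a minimal illustration of what an interatomic potential supplies to a molecular dynamics code, the sketch below evaluates a simple pairwise Lennard-Jones potential; the analytic bond-order potentials developed in the thesis (for W-C-H, Fe-He and Cr-He) are far more elaborate:

```python
# Minimal sketch: energy and forces from a pairwise Lennard-Jones
# potential, illustrating the role of an interatomic potential in MD:
# mapping atomic positions to energies and forces. The thesis's
# bond-order potentials are much more complex; this is not them.
import numpy as np

def lj_energy_forces(positions, epsilon=1.0, sigma=1.0):
    """Total energy and per-atom forces for a Lennard-Jones system."""
    n = len(positions)
    energy = 0.0
    forces = np.zeros_like(positions)
    for i in range(n):
        for j in range(i + 1, n):
            rij = positions[i] - positions[j]
            r = np.linalg.norm(rij)
            sr6 = (sigma / r) ** 6
            energy += 4 * epsilon * (sr6 ** 2 - sr6)
            # F_i = -dU/dr * r_hat = 24*eps*(2*sr12 - sr6)/r^2 * rij
            f = 24 * epsilon * (2 * sr6 ** 2 - sr6) / r ** 2 * rij
            forces[i] += f
            forces[j] -= f
    return energy, forces

pos = np.array([[0.0, 0.0, 0.0], [1.12, 0.0, 0.0]])  # near the LJ minimum
print(lj_energy_forces(pos))  # energy ~ -1, forces ~ 0
```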
Abstract:
Neuroblastoma has successfully served as a model system for the identification of neuroectoderm-derived oncogenes. However, in spite of various efforts, only a few clinically useful prognostic markers have been found. Here, we present a framework, which integrates DNA, RNA and tissue data to identify and prioritize genetic events that represent clinically relevant new therapeutic targets and prognostic biomarkers for neuroblastoma.
Abstract:
Thermonuclear fusion is a sustainable energy solution in which energy is produced using processes similar to those in the sun. In this technology, hydrogen isotopes are fused to gain energy and consequently to produce electricity. In a fusion reactor, hydrogen isotopes are confined by magnetic fields as ionized gas, the plasma. Since the core plasma is millions of degrees hot, there are special requirements for the plasma-facing materials. Moreover, the fusion of hydrogen isotopes in the plasma produces high-energy neutrons, which places demanding requirements on the structural materials of the reactor. This thesis investigates the irradiation response of materials to be used in future fusion reactors. Interactions of the plasma with the reactor wall lead to the removal of surface atoms, their migration, and the formation of co-deposited layers such as tungsten carbide. Sputtering of tungsten carbide and deuterium trapping in tungsten carbide were investigated in this thesis. As a second topic, the primary interaction of neutrons with the structural material, steel, was examined; iron-chromium and iron-nickel were used as model materials for steel. The study was performed theoretically, by means of computer simulations at the atomic level. In contrast to previous studies in the field, in which simulations were limited to pure elements, this work used more complex, multi-elemental materials containing two or more atomic species. The results of this thesis are on the microscale. One result is a catalogue of the atom species removed from tungsten carbide by the plasma; another is the atomic distribution of defects in iron-chromium caused by energetic neutrons. These microscopic results are used in databases for multiscale modelling of fusion reactor materials, which aims to explain the macroscopic degradation of the materials. This thesis is therefore a relevant contribution to the investigation of the connection between microscopic and macroscopic radiation effects, one of the objectives of fusion reactor materials research.