26 results for cluster algorithms
in Helda - Digital Repository of University of Helsinki
Abstract:
The analysis of sequential data is required in many diverse areas such as telecommunications, stock market analysis, and bioinformatics. A basic problem related to the analysis of sequential data is the sequence segmentation problem. A sequence segmentation is a partition of the sequence into a number of non-overlapping segments that cover all data points, such that each segment is as homogeneous as possible. This problem can be solved optimally using a standard dynamic programming algorithm. In the first part of the thesis, we present a new approximation algorithm for the sequence segmentation problem. This algorithm has a smaller running time than the optimal dynamic programming algorithm, while it has a bounded approximation ratio. The basic idea is to divide the input sequence into subsequences, solve the problem optimally in each subsequence, and then appropriately combine the solutions to the subproblems into one final solution. In the second part of the thesis, we study alternative segmentation models that are devised to better fit the data. More specifically, we focus on clustered segmentations and segmentations with rearrangements. While in the standard segmentation of a multidimensional sequence all dimensions share the same segment boundaries, in a clustered segmentation the multidimensional sequence is segmented in such a way that dimensions are allowed to form clusters. Each cluster of dimensions is then segmented separately. We formally define the problem of clustered segmentations and we experimentally show that segmenting sequences using this segmentation model leads to solutions with smaller error for the same model cost. Segmentation with rearrangements is a novel variation of the segmentation problem: in addition to partitioning the sequence we also seek to apply a limited amount of reordering, so that the overall representation error is minimized.
We formulate the problem of segmentation with rearrangements and we show that it is an NP-hard problem to solve or even to approximate. We devise effective algorithms for the proposed problem, combining ideas from dynamic programming and outlier detection algorithms in sequences. In the final part of the thesis, we discuss the problem of aggregating results of segmentation algorithms on the same set of data points. In this case, we are interested in producing a partitioning of the data that agrees as much as possible with the input partitions. We show that this problem can be solved optimally in polynomial time using dynamic programming. Furthermore, we show that not all data points are candidates for segment boundaries in the optimal solution.
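The standard dynamic program mentioned in this abstract can be sketched as follows. This is only a minimal illustration, assuming a one-dimensional sequence, piecewise-constant segments, and squared error as the measure of homogeneity; the function name `segment` and these modelling choices are assumptions made for the example, not details taken from the thesis.

```python
from itertools import accumulate


def segment(seq, k):
    """Optimal split of seq into k contiguous segments under squared error.

    Returns (total_error, boundaries), where boundaries are the exclusive
    end indices of the segments.
    """
    n = len(seq)
    # Prefix sums of values and squared values give O(1) segment costs.
    s = [0.0] + list(accumulate(seq))
    s2 = [0.0] + list(accumulate(x * x for x in seq))

    def cost(i, j):
        # Squared error of representing seq[i:j] by its mean.
        total = s[j] - s[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # err[m][j]: minimal error of splitting seq[:j] into m segments.
    err = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    err[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                candidate = err[m - 1][i] + cost(i, j)
                if candidate < err[m][j]:
                    err[m][j] = candidate
                    back[m][j] = i

    # Walk the back-pointers to recover the segment boundaries.
    bounds, j = [], n
    for m in range(k, 0, -1):
        bounds.append(j)
        j = back[m][j]
    return err[k][n], sorted(bounds)


error, bounds = segment([1, 1, 1, 5, 5, 5, 9, 9], 3)
```

With three nested loops the sketch runs in time quadratic in the sequence length for each segment count, which is the cost that a faster approximation algorithm would aim to reduce.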
Abstract:
This study examined the efficacy of a participatory ergonomics intervention in preventing musculoskeletal disorders (MSDs) and changing unsatisfactory psychosocial working conditions among municipal kitchen workers. The occurrence of multiple-site musculoskeletal pain (MSP) and associations between MSP and psychosocial factors at work over time were studied secondarily. A cluster randomized controlled trial was conducted during 2002-2005 in 119 municipal kitchens with 504 workers. The kitchens were randomized to an intervention (n = 59) and control (n = 60) group. The intervention lasted 11 to 14 months. The workers identified strenuous work tasks and sought solutions for decreasing physical and mental workload. The main outcomes were the occurrence of and trouble caused by musculoskeletal pain in seven anatomical sites, local musculoskeletal fatigue after work, and musculoskeletal sick leave. Psychosocial factors at work (job control, skill discretion, co-worker relationships, supervisor support, mental strenuousness of work, hurry, job satisfaction) and mental stress were studied as intermediate outcomes of the intervention. Questionnaire data were collected at three-month intervals during the intervention and the one-year post-intervention follow-up. Response rates varied between 92 % and 99 %. In total, 402 ergonomic changes were implemented. In the control group, 80 changes were spontaneously implemented within normal activity. The intervention did not reduce perceived physical workload and no systematic differences in any health outcomes were found between the intervention and control groups during the intervention or during the one-year follow-up. The results suggest that the intervention as studied in the present trial was no more effective in reducing perceived physical workload or preventing MSDs than no such intervention. Little previous evidence of the effectiveness of ergonomics interventions in preventing MSDs exists.
The effects on psychosocial factors at work were adverse, especially in the two participating cities where a reorganization of foodservices coincided with the intervention. If organizational reforms at the workplace are expected to occur, the execution of other workplace interventions at the same time should be avoided. The co-occurrence of musculoskeletal pain at several sites is observed to be more common than pain at single anatomical sites. However, the risk factors of MSP are largely unknown. This study showed that at baseline, 73 % of the women reported pain in at least two, 36 % in four or more, and 10 % in six to seven sites. The seven pain symptoms occurred in over 80 different combinations. When co-occurrence of pain was studied in three larger anatomical areas (neck/low back, upper limbs, lower limbs), concurrent pain in all three areas was the most common combination (36 %). The 3-month prevalence of MSP (≥ 3 of seven sites) varied between 50 % and 61 % during the two-year follow-up period. Psychosocial factors at work and mental stress were strong predictors for MSP over time and, vice versa, MSP predicted psychosocial factors at work and mental stress. The reciprocity of the relationships implies either two mutually dependent processes in time, or some shared common underlying factor(s).
Abstract:
The aim of this thesis is to develop a fully automatic lameness detection system that operates in a milking robot. The instrumentation, measurement software, algorithms for data analysis and a neural network model for lameness detection were developed. Automatic milking has become a common practice in dairy husbandry, and in the year 2006 about 4000 farms worldwide used over 6000 milking robots. There is a worldwide movement with the objective of fully automating every process from feeding to milking. Increase in automation is a consequence of increasing farm sizes, the demand for more efficient production and the growth of labour costs. As the level of automation increases, the time that the cattle keeper uses for monitoring animals often decreases. This has created a need for systems for automatically monitoring the health of farm animals. The popularity of milking robots also offers a new and unique possibility to monitor animals in a single confined space up to four times daily. Lameness is a crucial welfare issue in the modern dairy industry. Limb disorders cause serious welfare, health and economic problems especially in loose housing of cattle. Lameness causes losses in milk production and leads to early culling of animals. These costs could be reduced with early identification and treatment. At present, only a few methods for automatically detecting lameness have been developed, and the most common methods used for lameness detection and assessment are various visual locomotion scoring systems. The problem with locomotion scoring is that it requires experience to be conducted properly, it is labour intensive as an on-farm method and the results are subjective. A four-balance system for measuring the leg load distribution of dairy cows during milking in order to detect lameness was developed and set up at the University of Helsinki research farm Suitia.
The leg weights of 73 cows were successfully recorded during almost 10,000 robotic milkings over a period of 5 months. The cows were locomotion scored weekly, and the lame cows were inspected clinically for hoof lesions. Unsuccessful measurements, caused by cows standing outside the balances, were removed from the data with a special algorithm, and the mean leg loads and the number of kicks during milking were calculated. In order to develop an expert system to automatically detect lameness cases, a model was needed. A probabilistic neural network (PNN) classifier model was chosen for the task. The data was divided into two parts and 5,074 measurements from 37 cows were used to train the model. The operation of the model was evaluated for its ability to detect lameness in the validation dataset, which had 4,868 measurements from 36 cows. The model was able to classify 96% of the measurements correctly as sound or lame cows, and 100% of the lameness cases in the validation data were identified. The number of measurements causing false alarms was 1.1%. The developed model has the potential to be used for on-farm decision support and can be used in a real-time lameness monitoring system.
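For readers unfamiliar with probabilistic neural networks, the general idea can be sketched as a Parzen-window classifier with Gaussian kernels. The code below is a minimal illustration under stated assumptions: the function name `pnn_predict`, the two toy features, the kernel width `sigma`, and the equal class priors are all hypothetical and are not taken from the thesis or its data.

```python
import math


def pnn_predict(train, x, sigma=1.0):
    """Classify x with a Parzen-window (PNN-style) classifier.

    train maps each class label to a list of feature vectors.
    """
    scores = {}
    for label, samples in train.items():
        # Pattern layer: one Gaussian kernel per training sample.
        kernels = [
            math.exp(-sum((a - b) ** 2 for a, b in zip(s, x)) / (2 * sigma ** 2))
            for s in samples
        ]
        # Summation layer: average kernel response for the class.
        scores[label] = sum(kernels) / len(kernels)
    # Decision layer: choose the class with the largest density estimate.
    return max(scores, key=scores.get)


# Hypothetical toy features: (load share on one leg, kicks per milking).
TOY = {
    "sound": [(0.25, 1.0), (0.26, 0.0), (0.24, 2.0)],
    "lame": [(0.15, 6.0), (0.12, 8.0), (0.17, 5.0)],
}

label = pnn_predict(TOY, (0.14, 7.0))
```

In practice the features would usually be scaled to comparable ranges before choosing `sigma`, since the kernel treats all dimensions alike.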
Abstract:
Bacteria play an important role in many ecological systems. The molecular characterization of bacteria using either cultivation-dependent or cultivation-independent methods reveals the large scale of bacterial diversity in natural communities, and the vastness of subpopulations within a species or genus. Understanding how bacterial diversity varies across different environments and also within populations should provide insights into many important questions of bacterial evolution and population dynamics. This thesis presents novel statistical methods for analyzing bacterial diversity using widely employed molecular fingerprinting techniques. The first objective of this thesis was to develop Bayesian clustering models to identify bacterial population structures. Bacterial isolates were identified using multilocus sequence typing (MLST), and Bayesian clustering models were used to explore the evolutionary relationships among isolates. Our method involves the inference of genetic population structures via an unsupervised clustering framework where the dependence between loci is represented using graphical models. The population dynamics that generate such a population stratification were investigated using a stochastic model, in which homologous recombination between subpopulations can be quantified within a gene flow network. The second part of the thesis focuses on cluster analysis of community compositional data produced by two different cultivation-independent analyses: terminal restriction fragment length polymorphism (T-RFLP) analysis, and fatty acid methyl ester (FAME) analysis. The cluster analysis aims to group bacterial communities that are similar in composition, which is an important step for understanding the overall influences of environmental and ecological perturbations on bacterial diversity.
A common feature of T-RFLP and FAME data is zero-inflation, which indicates that the observation of a zero value is much more frequent than would be expected, for example, from a Poisson distribution in the discrete case, or a Gaussian distribution in the continuous case. We provided two strategies for modeling zero-inflation in the clustering framework, which were validated by both synthetic and empirical complex data sets. We show in the thesis that our model that takes into account dependencies between loci in MLST data can produce better clustering results than those methods which assume independent loci. Furthermore, computer algorithms that are efficient in analyzing large-scale data were adopted to meet the increasing computational demands. Our method that detects homologous recombination in subpopulations may provide a theoretical criterion for defining bacterial species. The clustering of bacterial community data, including T-RFLP and FAME data, is an initial effort towards discovering the evolutionary dynamics that structure and maintain bacterial diversity in the natural environment.
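To make the notion of zero-inflation concrete, the sketch below compares a plain Poisson distribution with a zero-inflated Poisson mixture. This is one common textbook formulation and is only an assumption here: the abstract does not say which two strategies the thesis uses, and the names `zip_pmf`, `lam`, and `pi` are illustrative.

```python
import math


def poisson_pmf(k, lam):
    """Probability of observing count k under a Poisson(lam) model."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def zip_pmf(k, lam, pi):
    """Zero-inflated Poisson: with probability pi the value is a structural
    zero; otherwise it is drawn from Poisson(lam)."""
    base = (1.0 - pi) * poisson_pmf(k, lam)
    return pi + base if k == 0 else base


# With lam = 3, a plain Poisson model rarely produces zeros, whereas the
# mixture with pi = 0.4 puts much more mass on zero.
plain_zero = poisson_pmf(0, 3.0)
inflated_zero = zip_pmf(0, 3.0, 0.4)
```

The mixture weight `pi` absorbs the excess zeros, so the Poisson rate `lam` is estimated from the remaining counts rather than being dragged towards zero.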
Abstract:
The ever expanding growth of the wireless access to the Internet in recent years has led to the proliferation of wireless and mobile devices to connect to the Internet. This has created the possibility of mobile devices equipped with multiple radio interfaces to connect to the Internet using any of several wireless access network technologies such as GPRS, WLAN and WiMAX in order to get the connectivity best suited for the application. These access networks are highly heterogeneous and they vary widely in their characteristics such as bandwidth, propagation delay and geographical coverage. The mechanism by which a mobile device switches between these access networks during an ongoing connection is referred to as vertical handoff and it often results in an abrupt and significant change in the access link characteristics. The most common Internet applications such as Web browsing and e-mail make use of the Transmission Control Protocol (TCP) as their transport protocol and the behaviour of TCP depends on the end-to-end path characteristics such as bandwidth and round-trip time (RTT). As the wireless access link is most likely the bottleneck of a TCP end-to-end path, the abrupt changes in the link characteristics due to a vertical handoff may affect TCP behaviour adversely degrading the performance of the application. The focus of this thesis is to study the effect of a vertical handoff on TCP behaviour and to propose algorithms that improve the handoff behaviour of TCP using cross-layer information about the changes in the access link characteristics. We begin this study by identifying the various problems of TCP due to a vertical handoff based on extensive simulation experiments. We use this study as a basis to develop cross-layer assisted TCP algorithms in handoff scenarios involving GPRS and WLAN access networks. 
We then extend the scope of the study by developing cross-layer assisted TCP algorithms in a broader context applicable to a wide range of bandwidth and delay changes during a handoff. And finally, the algorithms developed here are shown to be easily extendable to the multiple-TCP flow scenario. We evaluate the proposed algorithms by comparison with standard TCP (TCP SACK) and show that the proposed algorithms are effective in improving TCP behavior in vertical handoff involving a wide range of bandwidth and delay of the access networks. Our algorithms are easy to implement in real systems and they involve modifications to the TCP sender algorithm only. The proposed algorithms are conservative in nature and they do not adversely affect the performance of TCP in the absence of cross-layer information.
Abstract:
Matrix decompositions, where a given matrix is represented as a product of two other matrices, are regularly used in data mining. Most matrix decompositions have their roots in linear algebra, but the needs of data mining are not always those of linear algebra. In data mining one needs to have results that are interpretable, and what is considered interpretable in data mining can be very different from what is considered interpretable in linear algebra. The purpose of this thesis is to study matrix decompositions that directly address the issue of interpretability. An example is a decomposition of binary matrices where the factor matrices are assumed to be binary and the matrix multiplication is Boolean. The restriction to binary factor matrices increases interpretability (the factor matrices are of the same type as the original matrix) and allows the use of Boolean matrix multiplication, which is often more intuitive than normal matrix multiplication with binary matrices. Several other decomposition methods are also described, and the computational complexity of computing them is studied together with the hardness of approximating the related optimization problems. Based on these studies, algorithms for constructing the decompositions are proposed. Constructing the decompositions turns out to be computationally hard, and the proposed algorithms are mostly based on various heuristics. Nevertheless, the algorithms are shown to be capable of finding good results in empirical experiments conducted with both synthetic and real-world data.
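The difference between Boolean and ordinary multiplication of binary matrices can be seen in a small example. The sketch below is illustrative only; the function names and the example matrices are assumptions, not material from the thesis.

```python
def boolean_product(b, c):
    """Boolean matrix product: entry (i, j) is the OR over k of
    (b[i][k] AND c[k][j])."""
    inner, cols = len(c), len(c[0])
    return [
        [int(any(row[k] and c[k][j] for k in range(inner))) for j in range(cols)]
        for row in b
    ]


def ordinary_product(b, c):
    """Ordinary matrix product over the integers, for comparison."""
    inner, cols = len(c), len(c[0])
    return [
        [sum(row[k] * c[k][j] for k in range(inner)) for j in range(cols)]
        for row in b
    ]


B = [[1, 1], [0, 1]]
C = [[1, 0], [1, 1]]

bool_result = boolean_product(B, C)   # stays binary
int_result = ordinary_product(B, C)   # can contain entries larger than 1
```

Because 1 OR 1 is 1, the Boolean product of binary factors is again a binary matrix, whereas the ordinary product can produce counts that have no direct reading in terms of the original binary data.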
Abstract:
The metabolism of an organism consists of a network of biochemical reactions that transform small molecules, or metabolites, into others in order to produce energy and building blocks for essential macromolecules. The goal of metabolic flux analysis is to uncover the rates, or the fluxes, of those biochemical reactions. In a steady state, the sum of the fluxes that produce an internal metabolite is equal to the sum of the fluxes that consume the same molecule. Thus the steady state imposes linear balance constraints to the fluxes. In general, the balance constraints imposed by the steady state are not sufficient to uncover all the fluxes of a metabolic network. The fluxes through cycles and alternative pathways between the same source and target metabolites remain unknown. More information about the fluxes can be obtained from isotopic labelling experiments, where a cell population is fed with labelled nutrients, such as glucose that contains 13C atoms. Labels are then transferred by biochemical reactions to other metabolites. The relative abundances of different labelling patterns in internal metabolites depend on the fluxes of pathways producing them. Thus, the relative abundances of different labelling patterns contain information about the fluxes that cannot be uncovered from the balance constraints derived from the steady state. The field of research that estimates the fluxes utilizing the measured constraints to the relative abundances of different labelling patterns induced by 13C labelled nutrients is called 13C metabolic flux analysis. There exist two approaches of 13C metabolic flux analysis. In the optimization approach, a non-linear optimization task, where candidate fluxes are iteratively generated until they fit to the measured abundances of different labelling patterns, is constructed. 
In the direct approach, linear balance constraints given by the steady state are augmented with linear constraints derived from the abundances of different labelling patterns of metabolites. Thus, mathematically involved non-linear optimization methods that can get stuck in local optima can be avoided. On the other hand, the direct approach may require more measurement data than the optimization approach to obtain the same flux information. Furthermore, the optimization framework can easily be applied regardless of the labelling measurement technology and with all network topologies. In this thesis we present a formal computational framework for direct 13C metabolic flux analysis. The aim of our study is to construct as many linear constraints to the fluxes as possible from the 13C labelling measurements, using only computational methods that avoid non-linear techniques and are independent of the type of measurement data, the labelling of external nutrients and the topology of the metabolic network. The presented framework is the first representative of the direct approach for 13C metabolic flux analysis that is free from restricting assumptions made about these parameters. In our framework, measurement data is first propagated from the measured metabolites to other metabolites. The propagation is facilitated by the flow analysis of metabolite fragments in the network. Then new linear constraints to the fluxes are derived from the propagated data by applying the techniques of linear algebra. Based on the results of the fragment flow analysis, we also present an experiment planning method that selects sets of metabolites whose relative abundances of different labelling patterns are most useful for 13C metabolic flux analysis. Furthermore, we give computational tools to process raw 13C labelling data produced by tandem mass spectrometry to a form suitable for 13C metabolic flux analysis.
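The steady-state balance constraints described in this abstract, and the reason they cannot resolve alternative pathways, can be illustrated with a toy network. Everything in the sketch is hypothetical: the four reactions, the two internal metabolites, and the names `STOICH` and `balance_residuals` are assumptions chosen for the example.

```python
def balance_residuals(stoich, fluxes):
    """Net production rate of each internal metabolite (the product S·v).

    In a steady state every entry must be zero.
    """
    return [sum(s * v for s, v in zip(row, fluxes)) for row in stoich]


# Hypothetical network: uptake -> A, two alternative routes A -> B, B -> out.
# Columns: v_uptake, v_route1, v_route2, v_out
STOICH = [
    [1, -1, -1, 0],   # metabolite A: produced by uptake, consumed by both routes
    [0, 1, 1, -1],    # metabolite B: produced by both routes, consumed by output
]

# Two different splits between the alternative routes.
flux_a = [10, 7, 3, 10]
flux_b = [10, 2, 8, 10]

residual_a = balance_residuals(STOICH, flux_a)
residual_b = balance_residuals(STOICH, flux_b)
```

Both flux vectors satisfy the balance equations, so the balance alone cannot tell how the flow divides between the two routes; this is the gap that additional constraints from labelling measurements are meant to close.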
Abstract:
This thesis studies optimisation problems related to modern large-scale distributed systems, such as wireless sensor networks and wireless ad-hoc networks. The concrete tasks that we use as motivating examples are the following: (i) maximising the lifetime of a battery-powered wireless sensor network, (ii) maximising the capacity of a wireless communication network, and (iii) minimising the number of sensors in a surveillance application. A sensor node consumes energy both when it is transmitting or forwarding data, and when it is performing measurements. Hence task (i), lifetime maximisation, can be approached from two different perspectives. First, we can seek for optimal data flows that make the most out of the energy resources available in the network; such optimisation problems are examples of so-called max-min linear programs. Second, we can conserve energy by putting redundant sensors into sleep mode; we arrive at the sleep scheduling problem, in which the objective is to find an optimal schedule that determines when each sensor node is asleep and when it is awake. In a wireless network simultaneous radio transmissions may interfere with each other. Task (ii), capacity maximisation, therefore gives rise to another scheduling problem, the activity scheduling problem, in which the objective is to find a minimum-length conflict-free schedule that satisfies the data transmission requirements of all wireless communication links. Task (iii), minimising the number of sensors, is related to the classical graph problem of finding a minimum dominating set. However, if we are not only interested in detecting an intruder but also locating the intruder, it is not sufficient to solve the dominating set problem; formulations such as minimum-size identifying codes and locating dominating codes are more appropriate. 
This thesis presents approximation algorithms for each of these optimisation problems, i.e., for max-min linear programs, sleep scheduling, activity scheduling, identifying codes, and locating dominating codes. Two complementary approaches are taken. The main focus is on local algorithms, which are constant-time distributed algorithms. The contributions include local approximation algorithms for max-min linear programs, sleep scheduling, and activity scheduling. In the case of max-min linear programs, tight upper and lower bounds are proved for the best possible approximation ratio that can be achieved by any local algorithm. The second approach is the study of centralised polynomial-time algorithms in local graphs; these are geometric graphs whose structure exhibits spatial locality. Among other contributions, it is shown that while identifying codes and locating dominating codes are hard to approximate in general graphs, they admit a polynomial-time approximation scheme in local graphs.
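The distinction between detecting and locating an intruder can be made concrete by checking the two definitions on a small graph. The sketch below only verifies candidate sensor sets against the standard definitions of a dominating set and an identifying code; it is not one of the approximation algorithms from the thesis, and the graph `PATH` and the function names are assumptions for illustration.

```python
def closed_neighbourhood(graph, v):
    """The vertex v together with its neighbours."""
    return {v} | set(graph[v])


def is_dominating(graph, code):
    """Every vertex is seen by at least one sensor in code."""
    return all(closed_neighbourhood(graph, v) & code for v in graph)


def is_identifying(graph, code):
    """Every vertex is seen by a non-empty set of sensors, and no two
    vertices are seen by the same set."""
    signatures = [frozenset(closed_neighbourhood(graph, v) & code) for v in graph]
    return all(signatures) and len(set(signatures)) == len(signatures)


# Hypothetical three-node path 0 - 1 - 2, given as adjacency lists.
PATH = {0: [1], 1: [0, 2], 2: [1]}

detects = is_dominating(PATH, {1})        # a single middle sensor sees everything
locates = is_identifying(PATH, {1})       # but cannot tell the nodes apart
locates_ends = is_identifying(PATH, {0, 2})
```

A single sensor in the middle raises an alarm wherever the intruder is, but every node triggers the same alarm pattern; placing sensors at both ends gives each node a distinct pattern.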
Abstract:
Diffuse large B-cell lymphoma (DLBCL) is the most common of the non-Hodgkin lymphomas. As DLBCL is characterized by heterogeneous clinical and biological features, its prognosis varies. To date, the International Prognostic Index has been the strongest predictor of outcome for DLBCL patients. However, no biological characters of the disease are taken into account. Gene expression profiling studies have identified two major cell-of-origin phenotypes in DLBCL with different prognoses, the favourable germinal centre B-cell-like (GCB) and the unfavourable activated B-cell-like (ABC) phenotypes. However, results of the prognostic impact of the immunohistochemically defined GCB and non-GCB distinction are controversial. Furthermore, since the addition of the CD20 antibody rituximab to chemotherapy has been established as the standard treatment of DLBCL, all molecular markers need to be evaluated in the post-rituximab era. In this study, we aimed to evaluate the predictive value of immunohistochemically defined cell-of-origin classification in DLBCL patients. The GCB and non-GCB phenotypes were defined according to the Hans algorithm (CD10, BCL6 and MUM1/IRF4) among 90 immunochemotherapy- and 104 chemotherapy-treated DLBCL patients. In the chemotherapy group, we observed a significant difference in survival between GCB and non-GCB patients, with a good and a poor prognosis, respectively. However, in the rituximab group, no prognostic value of the GCB phenotype was observed. Likewise, among 29 high-risk de novo DLBCL patients receiving high-dose chemotherapy and autologous stem cell transplantation, the survival of non-GCB patients was improved, but no difference in outcome was seen between GCB and non-GCB subgroups. Since the results suggested that the Hans algorithm was not applicable in immunochemotherapy-treated DLBCL patients, we aimed to further focus on algorithms based on ABC markers. 
We examined a modified activated B-cell-like algorithm (based on MUM1/IRF4 and FOXP1), as well as the previously reported Muris algorithm (BCL2, CD10 and MUM1/IRF4), among 88 DLBCL patients uniformly treated with immunochemotherapy. Both algorithms distinguished the unfavourable ABC-like subgroup with a significantly inferior failure-free survival relative to the GCB-like DLBCL patients. Similarly, the results of the individual predictive molecular markers transcription factor FOXP1 and anti-apoptotic protein BCL2 have been inconsistent and should be assessed in immunochemotherapy-treated DLBCL patients. The markers were evaluated in a cohort of 117 patients treated with rituximab and chemotherapy. FOXP1 expression could not distinguish between patients with favourable outcomes and those with poor outcomes. In contrast, BCL2-negative DLBCL patients had significantly superior survival relative to BCL2-positive patients. Our results indicate that the immunohistochemically defined cell-of-origin classification in DLBCL has a prognostic impact in the immunochemotherapy era, when the identifying algorithms are based on ABC-associated markers. We also propose that BCL2 negativity is predictive of a favourable outcome. Further investigational efforts are, however, warranted to identify the molecular features of DLBCL that could enable individualized cancer therapy in routine patient care.
Abstract:
Thin film applications have become increasingly important in our search for multifunctional and economically viable technological solutions of the future. Thin film coatings can be used for a multitude of purposes, ranging from a basic enhancement of aesthetic attributes to the addition of a complex surface functionality. Anything from electronic or optical properties, to an increased catalytic or biological activity, can be added or enhanced by the deposition of a thin film, with a thickness of only a few atomic layers at the best, on an already existing surface. Thin films offer both a means of saving in materials and the possibility for improving properties without a critical enlargement of devices. Nanocluster deposition is a promising new method for the growth of structured thin films. Nanoclusters are small aggregates of atoms or molecules, ranging in sizes from only a few nanometers up to several hundreds of nanometers in diameter. Due to their large surface to volume ratio, and the confinement of atoms and electrons in all three dimensions, nanoclusters exhibit a wide variety of exotic properties that differ notably from those of both single atoms and bulk materials. Nanoclusters are a completely new type of building block for thin film deposition. As preformed entities, clusters provide a new means of tailoring the properties of thin films before their growth, simply by changing the size or composition of the clusters that are to be deposited. Contrary to contemporary methods of thin film growth, which mainly rely on the deposition of single atoms, cluster deposition also allows for a more precise assembly of thin films, as the configuration of single atoms with respect to each other is already predetermined in clusters. 
Nanocluster deposition offers a possibility for the coating of virtually any material with a nanostructured thin film, and therein the enhancement of already existing physical or chemical properties, or the addition of some exciting new feature. A clearer understanding of cluster-surface interactions, and the growth of thin films by cluster deposition, must, however, be achieved, if clusters are to be successfully used in thin film technologies. Using a combination of experimental techniques and molecular dynamics simulations, both the deposition of nanoclusters, and the growth and modification of cluster-assembled thin films, are studied in this thesis. Emphasis is laid on an understanding of the interaction between metal clusters and surfaces, and therein the behaviour of these clusters during deposition and thin film growth. The behaviour of single metal clusters, as they impact on clean metal surfaces, is analysed in detail, from which it is shown that there exists a cluster size and deposition energy dependent limit, below which epitaxial alignment occurs. If larger clusters are deposited at low energies, or cluster-surface interactions are weaker, non-epitaxial deposition will take place, resulting in the formation of nanocrystalline structures. The effect of cluster size and deposition energy on the morphology of cluster-assembled thin films is also determined, from which it is shown that nanocrystalline cluster-assembled films will be porous. Modification of these thin films, with the purpose of enhancing their mechanical properties and durability, without destroying their nanostructure, is presented. Irradiation with heavy ions is introduced as a feasible method for increasing the density, and therein the mechanical stability, of cluster-assembled thin films, without critically destroying their nanocrystalline properties. The results of this thesis demonstrate that nanocluster deposition is a suitable technique for the growth of nanostructured thin films. 
The interactions between nanoclusters and their supporting surfaces must, however, be carefully considered, if a controlled growth of cluster-assembled thin films, with precisely tailored properties, is to be achieved.
Abstract:
In this thesis I examine one commonly used class of methods for the analytic approximation of cellular automata, the so-called local cluster approximations. This class subsumes the well known mean-field and pair approximations, as well as higher order generalizations of these. While a straightforward method known as Bayesian extension exists for constructing cluster approximations of arbitrary order on one-dimensional lattices (and certain other cases), for higher-dimensional systems the construction of approximations beyond the pair level becomes more complicated due to the presence of loops. In this thesis I describe the one-dimensional construction as well as a number of approximations suggested for higher-dimensional lattices, comparing them against a number of consistency criteria that such approximations could be expected to satisfy. I also outline a general variational principle for constructing consistent cluster approximations of arbitrary order with minimal bias, and show that the one-dimensional construction indeed satisfies this principle. Finally, I apply this variational principle to derive a novel consistent expression for symmetric three cell cluster frequencies as estimated from pair frequencies, and use this expression to construct a quantitatively improved pair approximation of the well-known lattice contact process on a hexagonal lattice.
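As a rough illustration of the one-dimensional construction, the Bayesian extension is commonly stated as estimating a three-cell block frequency from overlapping pair frequencies, P(abc) ≈ P(ab) P(bc) / P(b). The sketch below assumes this form, a binary state space, and a translation-invariant pair distribution; the name `bayesian_extension` and the numbers in `PAIR` are hypothetical.

```python
from itertools import product


def bayesian_extension(pair):
    """Estimate three-cell block frequencies from pair frequencies using
    P(abc) = P(ab) * P(bc) / P(b)."""
    states = sorted({a for a, _ in pair})
    # Single-cell frequencies obtained by marginalising the pair frequencies.
    single = {b: sum(pair[(a, b)] for a in states) for b in states}
    return {
        (a, b, c): pair[(a, b)] * pair[(b, c)] / single[b] if single[b] > 0 else 0.0
        for a, b, c in product(states, repeat=3)
    }


# Hypothetical pair frequencies whose left and right marginals agree.
PAIR = {(0, 0): 0.4, (0, 1): 0.2, (1, 0): 0.2, (1, 1): 0.2}

triple = bayesian_extension(PAIR)
```

Summing the estimated triple frequencies over the third cell returns the original pair frequencies, which is an example of the kind of consistency criterion the abstract refers to.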