12 results for Data mining models
in Helda - Digital Repository of University of Helsinki
Abstract:
Telecommunications network management is based on huge amounts of data that are continuously collected from elements and devices all around the network. The data is monitored and analysed to provide information for decision making in all operation functions. Knowledge discovery and data mining methods can support fast-paced decision making in network operations. In this thesis, I analyse decision making on different levels of network operations. I identify the requirements that decision making sets for knowledge discovery and data mining tools and methods, and I study the resources that are available to them. I then propose two methods for augmenting and applying frequent sets to support everyday decision making. The proposed methods are Comprehensive Log Compression for log data summarisation and Queryable Log Compression for semantic compression of log data. Finally, I suggest a model for a continuous knowledge discovery process and outline how it can be implemented and integrated into the existing network operations infrastructure.
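As a rough illustration of the frequent sets that the proposed methods build on, the sketch below counts sets of field=value items that occur in at least `min_support` log entries. It is not the Comprehensive Log Compression or Queryable Log Compression algorithm from the thesis; the log entries, the field names, and the `frequent_sets` helper are hypothetical, and the enumeration is deliberately naive.

```python
from collections import Counter
from itertools import combinations


def frequent_sets(entries, min_support, max_size=3):
    """Return every item set (up to max_size items) found in >= min_support entries.

    Naive enumeration for illustration only; real miners prune the search space.
    """
    counts = Counter()
    for entry in entries:
        items = sorted(set(entry))  # canonical order, duplicates within an entry ignored
        for size in range(1, min(max_size, len(items)) + 1):
            counts.update(combinations(items, size))
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


# Hypothetical firewall-style log entries, each a tuple of field=value items.
log = [
    ("src=fw1", "action=deny", "port=23"),
    ("src=fw1", "action=deny", "port=23"),
    ("src=fw1", "action=deny", "port=80"),
    ("src=fw2", "action=allow", "port=443"),
]

summary = frequent_sets(log, min_support=3)
for itemset, n in sorted(summary.items()):
    print(n, itemset)
```

A frequent set such as ("action=deny", "src=fw1") describes three of the four entries at once, which gives an intuition for how frequent sets can summarise repetitive log data.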
Abstract:
Matrix decompositions, where a given matrix is represented as a product of two other matrices, are regularly used in data mining. Most matrix decompositions have their roots in linear algebra, but the needs of data mining are not always those of linear algebra. In data mining one needs results that are interpretable, and what is considered interpretable in data mining can be very different from what is considered interpretable in linear algebra. The purpose of this thesis is to study matrix decompositions that directly address the issue of interpretability. An example is a decomposition of binary matrices where the factor matrices are assumed to be binary and the matrix multiplication is Boolean. The restriction to binary factor matrices increases interpretability, since the factor matrices are of the same type as the original matrix, and allows the use of Boolean matrix multiplication, which is often more intuitive than normal matrix multiplication with binary matrices. Several other decomposition methods are also described, and the computational complexity of computing them is studied together with the hardness of approximating the related optimization problems. Based on these studies, algorithms for constructing the decompositions are proposed. Constructing the decompositions turns out to be computationally hard, and the proposed algorithms are mostly based on various heuristics. Nevertheless, the algorithms are shown to be capable of finding good results in empirical experiments conducted with both synthetic and real-world data.
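The difference between Boolean and normal matrix multiplication on binary matrices can be made concrete with a small sketch. The matrices `A`, `B`, and `C` and the helper names below are illustrative assumptions, not taken from the thesis, and the code only evaluates a given decomposition; it does not search for one.

```python
def boolean_product(B, C):
    """Boolean matrix product: entry (i, j) is 1 if B[i][k] and C[k][j] are both 1 for some k."""
    inner, cols = len(C), len(C[0])
    return [
        [int(any(row[k] and C[k][j] for k in range(inner))) for j in range(cols)]
        for row in B
    ]


def normal_product(B, C):
    """Ordinary matrix product over the integers, for comparison."""
    inner, cols = len(C), len(C[0])
    return [
        [sum(row[k] * C[k][j] for k in range(inner)) for j in range(cols)]
        for row in B
    ]


def reconstruction_error(A, B, C):
    """Number of entries where A differs from the Boolean product of B and C."""
    P = boolean_product(B, C)
    return sum(a != p for row_a, row_p in zip(A, P) for a, p in zip(row_a, row_p))


# Illustrative binary matrix and a rank-2 Boolean decomposition of it.
A = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]
B = [[1, 0],
     [1, 1],
     [0, 1]]
C = [[1, 1, 0],
     [0, 1, 1]]

print(boolean_product(B, C))   # stays binary, like A
print(normal_product(B, C))    # contains a 2 where both factors overlap
print(reconstruction_error(A, B, C))
```

In the Boolean product each row of `A` is the union (logical OR) of the rows of `C` selected by `B`, so overlapping factors do not produce entries larger than 1.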
Abstract:
Segmentation is a data mining technique yielding simplified representations of sequences of ordered points. A sequence is divided into some number of homogeneous blocks, and all points within a segment are described by a single value. The focus in this thesis is on piecewise-constant segments, where the most likely description for each segment and the most likely segmentation into some number of blocks can be computed efficiently. Representing sequences as segmentations is useful in, e.g., storage and indexing tasks in sequence databases, and segmentation can be used as a tool in learning about the structure of a given sequence. The discussion in this thesis begins with basic questions related to segmentation analysis, such as choosing the number of segments, and evaluating the obtained segmentations. Standard model selection techniques are shown to perform well for the sequence segmentation task. Segmentation evaluation is proposed with respect to a known segmentation structure. Applying segmentation on certain features of a sequence is shown to yield segmentations that are significantly close to the known underlying structure. Two extensions to the basic segmentation framework are introduced: unimodal segmentation and basis segmentation. The former is concerned with segmentations where the segment descriptions first increase and then decrease, and the latter with the interplay between different dimensions and segments in the sequence. These problems are formally defined and algorithms for solving them are provided and analyzed. Practical applications for segmentation techniques include time series and data stream analysis, text analysis, and biological sequence analysis. In this thesis segmentation applications are demonstrated in analyzing genomic sequences.
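A minimal sketch of piecewise-constant segmentation, assuming a one-dimensional sequence and a squared-error cost in which each segment is described by its mean. This is the standard dynamic programming formulation, not necessarily the exact procedure used in the thesis; the function name `segment` and the sample sequence are hypothetical.

```python
def segment(points, k):
    """Split points into k contiguous segments minimising total squared error.

    Returns (bounds, cost), where bounds is a list of half-open (start, end)
    index pairs. Requires 1 <= k <= len(points).
    """
    n = len(points)
    # Prefix sums give the squared error of any segment in constant time.
    s, sq = [0.0], [0.0]
    for x in points:
        s.append(s[-1] + x)
        sq.append(sq[-1] + x * x)

    def cost(i, j):
        """Squared error of points[i:j] around its mean."""
        total = s[j] - s[i]
        return (sq[j] - sq[i]) - total * total / (j - i)

    inf = float("inf")
    best = [[inf] * (n + 1) for _ in range(k + 1)]  # best[b][j]: first j points, b segments
    back = [[0] * (n + 1) for _ in range(k + 1)]    # start index of the last segment
    best[0][0] = 0.0
    for b in range(1, k + 1):
        for j in range(b, n + 1):
            for i in range(b - 1, j):
                c = best[b - 1][i] + cost(i, j)
                if c < best[b][j]:
                    best[b][j], back[b][j] = c, i

    # Walk the back pointers from the end of the sequence to recover boundaries.
    bounds, j = [], n
    for b in range(k, 0, -1):
        i = back[b][j]
        bounds.append((i, j))
        j = i
    bounds.reverse()
    return bounds, best[k][n]


sequence = [1, 1, 1, 5, 5, 5, 2, 2]
bounds, error = segment(sequence, 3)
means = [sum(sequence[i:j]) / (j - i) for i, j in bounds]
print(bounds, means, error)
```

The triple loop runs in O(k n^2) time. Choosing the number of segments `k` is a separate model selection question, as the abstract notes.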
Abstract:
Cell transition data is obtained from a cellular phone that switches its current serving cell tower. The data consists of a sequence of transition events, which are pairs of cell identifiers and transition times. The focus of this thesis is applying data mining methods to such data, developing new algorithms, and extracting knowledge that will be a solid foundation on which to build location-aware applications. In addition to a thorough exploration of the features of the data, the tools and methods developed in this thesis provide solutions to three distinct research problems. First, we develop clustering algorithms that produce a reliable mapping between cell transitions and physical locations observed by users of mobile devices. The main clustering algorithm operates in an online fashion, and we also consider a number of offline clustering methods for comparison. Second, we define the concept of significant locations, known as bases, and give an online algorithm for determining them. Finally, we consider the task of predicting the movement of the user, based on historical data. We develop a prediction algorithm that considers paths of movement in their entirety, instead of just the most recent movement history. All of the presented methods are evaluated with a significant body of real cell transition data, collected from about one hundred different individuals. The algorithms developed in this thesis are designed to be implemented on a mobile device, and require no extra hardware sensors or network infrastructure. By not relying on external services and keeping the user information as much as possible on the user's own personal device, we avoid privacy issues and let the users control the disclosure of their location information.
Abstract:
We propose to compress weighted graphs (networks), motivated by the observation that large networks of social, biological, or other relations can be complex to handle and visualize. In this process, also known as graph simplification, nodes and (unweighted) edges are grouped into supernodes and superedges, respectively, to obtain a smaller graph. We propose models and algorithms for weighted graphs. The interpretation (i.e., decompression) of a compressed, weighted graph is that a pair of original nodes is connected by an edge if their supernodes are connected by one, and that the weight of an edge is approximated to be the weight of the superedge. The compression problem now consists of choosing supernodes, superedges, and superedge weights so that the approximation error is minimized while the amount of compression is maximized. In this paper, we formulate this task as the 'simple weighted graph compression problem'. We then propose a much wider class of tasks under the name of 'generalized weighted graph compression problem'. The generalized task extends the optimization to preserve longer-range connectivities between nodes, not just individual edge weights. We study the properties of these problems and propose a range of algorithms to solve them, with different balances between complexity and quality of the result. We evaluate the problems and algorithms experimentally on real networks. The results indicate that weighted graphs can be compressed efficiently with relatively little compression error.
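The decompression rule described above can be sketched for a fixed grouping of nodes. The example assumes an undirected graph, a squared-error measure over all node pairs (absent edges count as weight 0), and superedge weights set to the mean weight of the pairs they cover. These choices, the function names, and the toy graph are assumptions for illustration; the paper's own error measure and algorithms for choosing the grouping are not reproduced here.

```python
from itertools import combinations


def compress(weights, groups):
    """Average original pair weights into superedge weights.

    weights: dict mapping frozenset({u, v}) to an edge weight.
    groups:  dict mapping each node to its supernode.
    A one-element key denotes a superedge inside a single supernode.
    """
    totals, counts = {}, {}
    for u, v in combinations(sorted(groups), 2):
        key = frozenset((groups[u], groups[v]))
        totals[key] = totals.get(key, 0.0) + weights.get(frozenset((u, v)), 0.0)
        counts[key] = counts.get(key, 0) + 1
    # Drop superedges that would cover no original edge weight at all.
    return {key: totals[key] / counts[key] for key in totals if totals[key] > 0}


def decompress(superweights, groups):
    """Connect two nodes if their supernodes are connected, using the superedge weight."""
    approx = {}
    for u, v in combinations(sorted(groups), 2):
        w = superweights.get(frozenset((groups[u], groups[v])))
        if w is not None:
            approx[frozenset((u, v))] = w
    return approx


def squared_error(weights, approx, groups):
    """Sum of squared weight differences over all node pairs."""
    return sum(
        (weights.get(frozenset(p), 0.0) - approx.get(frozenset(p), 0.0)) ** 2
        for p in combinations(sorted(groups), 2)
    )


# Toy graph: nodes a, b form supernode X; nodes c, d form supernode Y.
weights = {
    frozenset(("a", "c")): 1.0,
    frozenset(("a", "d")): 1.0,
    frozenset(("b", "c")): 1.0,
    frozenset(("b", "d")): 0.6,
}
groups = {"a": "X", "b": "X", "c": "Y", "d": "Y"}

superweights = compress(weights, groups)
approx = decompress(superweights, groups)
print(superweights)
print(squared_error(weights, approx, groups))
```

Here four weighted edges are replaced by a single superedge, at the cost of a small approximation error. The compression problem in the abstract is about choosing the grouping so that this trade-off is favourable.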
Abstract:
Helicobacter pylori infection is a risk factor for gastric cancer, which is a major health issue worldwide. Gastric cancer has a poor prognosis due to the unnoticeable progression of the disease, and surgery is the only available treatment. Therefore, gastric cancer patients would greatly benefit from identifying biomarker genes that would improve diagnostic and prognostic prediction and provide targets for molecular therapies. DNA copy number amplifications are the hallmarks of cancers in various anatomical locations. Mechanisms of amplification predict that DNA double-strand breaks occur at the margins of the amplified region. The first objective of this thesis was to identify the genes that were differentially expressed in H. pylori infection as well as the transcription factors and signal transduction pathways that were associated with the gene expression changes. The second objective was to identify putative biomarker genes in gastric cancer with correlated expression and copy number, and the last objective was to characterize cancers based on DNA copy number amplifications. DNA microarrays, an in vitro model and real-time polymerase chain reaction were used to measure gene expression changes in H. pylori-infected AGS cells. In order to identify the transcription factors and signal transduction pathways that were activated after H. pylori infection, gene expression profiling data from the H. pylori experiments and a bioinformatics approach accompanied by experimental validation were used. Genome-wide expression and copy number microarray analysis of clinical gastric cancer samples and immunohistochemistry on tissue microarray were used to identify putative gastric cancer genes. Data mining and machine learning techniques were applied to study amplifications in a cross-section of cancers. FOS and various stress response genes were regulated by H. pylori infection. H. pylori-regulated genes were enriched in the chromosomal regions that are frequently changed in gastric cancer, suggesting that molecular pathways of gastric cancer and premalignant H. pylori infection that induces gastritis are interconnected. Sixteen transcription factors were identified as being associated with H. pylori infection-induced changes in gene expression. The NF-κB transcription factor and its p50 and p65 subunits were verified using electrophoretic mobility shift assays. ERBB2 and other genes located in 17q12-q21 were found to be up-regulated in association with copy number amplification in gastric cancer. Cancers with similar cell type and origin clustered together based on the genomic localization of the amplifications. Cancer genes and large genes were co-localized with amplified regions; fragile sites, telomeres, centromeres and light chromosome bands were enriched at the amplification boundaries. H. pylori-activated transcription factors and signal transduction pathways function in cellular mechanisms that might be capable of promoting carcinogenesis of the stomach. Intestinal and diffuse type gastric cancers showed distinct molecular genetic profiles. Integration of gene expression and copy number microarray data allowed the identification of genes that might be involved in gastric carcinogenesis and have clinical relevance. Gene amplifications were demonstrated to be non-random genomic instabilities. Cell lineage, properties of precursor stem cells, tissue microenvironment and genomic map localization of specific oncogenes define the site specificity of DNA amplifications, whereas labile genomic features define the structures of amplicons. These conclusions suggest that the definition of genomic changes in cancer is based on the interplay between the cancer cell and the tumor microenvironment.
Abstract:
We study how probabilistic reasoning and inductive querying can be combined within ProbLog, a recent probabilistic extension of Prolog. ProbLog can be regarded as a database system that supports both probabilistic and inductive reasoning through a variety of querying mechanisms. After a short introduction to ProbLog, we provide a survey of the different types of inductive queries that ProbLog supports, and show how it can be applied to the mining of large biological networks.
Abstract:
Uveal melanoma (UM) is the second most common primary intraocular cancer worldwide. It is a relatively rare cancer, but still the second most common type of primary malignant melanoma in humans. UM is a slowly growing tumor, and gives rise to distant metastasis mainly to the liver via the bloodstream. About 40% of patients with UM die of metastatic disease within 10 years of diagnosis, irrespective of the type of treatment. During the last decade, two main lines of research have aimed to achieve enhanced understanding of the metastasis process and accurate prognosis of patients with UM. One emphasizes the characteristics of tumor cells, particularly their nucleoli, and markers of proliferation, and the other the characteristics of tumor blood vessels. Of several morphometric measurements, the mean diameter of the ten largest nucleoli (MLN) has become the most widely applied. A large MLN has consistently been associated with high likelihood of dying from UM. Blood vessels are of paramount importance in metastasis of UM. Different extravascular matrix patterns, such as loops and networks, can be seen in UM. Their presence is associated with death from metastatic melanoma. However, the density of microvessels is also of prognostic importance. This study was undertaken to help understand some histopathological factors which might contribute to developing metastasis in UM patients. Factors which could be related to tumor progression to metastatic disease, namely nucleolar size, MLN, microvascular density (MVD), cell proliferation, and the insulin-like growth factor 1 receptor (IGF-1R), were investigated. The primary aim of this thesis was to study the relationship between prognostic factors such as tumor cell nucleolar size, proliferation, extravascular matrix patterns, and dissemination of UM, and to assess to what extent there is a relationship to metastasis. 
The secondary goal was to develop a multivariate model which includes MLN and cell proliferation in addition to MVD, and which would fit better with population-based, melanoma-related survival data than previous models. I studied 167 patients with UM who developed metastasis, even a very long time after removal of the eye. Metastatic disease was the main cause of death, as documented in the Finnish Cancer Registry and on death certificates. Using an independent population-based data set, it was confirmed that MLN and extravascular matrix loops and networks were unrelated, independent predictors of survival in UM. It was also found that multivariate models including MVD in addition to MLN fitted significantly better with survival data than models which excluded MVD. This supports the idea that both the characteristics of the blood vessels and the cells are important, and the future direction would be to look for the gene expression profile, whether it is associated more with MVD or MLN. The former relates to the host response to the tumor and may not be as tightly associated with the gene expression profile, yet is most likely involved in the process of hematogenous metastasis. Because fresh tumor material is needed for reliable genetic analysis, such analysis could not be performed. Although noninvasive detection of certain extravascular matrix patterns is now technically possible in managing patients with UM, this study and tumor genetics suggest that such noninvasive methods will not fully capture the process of clinical metastasis. Progress in resection and biopsy techniques is likely in the near future to result in fresh material for the ophthalmic pathologist to correlate angiographic data, histopathological characteristics such as MLN, and genetic data. This study supported the theory that tumors containing epithelioid cells grow faster and have poorer prognosis when studied by cell proliferation in UM based on Ki-67 immunoreactivity. 
Cell proliferation index fitted best with the survival data when combined with MVD, MLN, and presence of epithelioid cells. Analogous with the finding that high MVD in primary UM is associated with shorter time to metastasis than low MVD, high MVD in hepatic metastasis tends to be associated with shorter survival after diagnosis of metastasis. Because the liver is the main organ for metastasis from UM, growth factors largely produced in the liver (hepatocyte growth factor, epidermal growth factor and insulin-like growth factor-1 (IGF-1)), together with their receptors, may have a role in the homing and survival of metastatic cells. Therefore the association between immunoreactivity for IGF-1R in primary UM and metastatic death was studied. It was found that immunoreactivity for IGF-1R did not independently predict metastasis from primary UM in my series.