14 resultados para MULTI-RELATIONAL DATA MINING
em Helda - Digital Repository of University of Helsinki
Resumo:
Telecommunications network management is based on huge amounts of data that are continuously collected from elements and devices from all around the network. The data is monitored and analysed to provide information for decision making in all operation functions. Knowledge discovery and data mining methods can support fast-pace decision making in network operations. In this thesis, I analyse decision making on different levels of network operations. I identify the requirements decision-making sets for knowledge discovery and data mining tools and methods, and I study resources that are available to them. I then propose two methods for augmenting and applying frequent sets to support everyday decision making. The proposed methods are Comprehensive Log Compression for log data summarisation and Queryable Log Compression for semantic compression of log data. Finally I suggest a model for a continuous knowledge discovery process and outline how it can be implemented and integrated to the existing network operations infrastructure.
Resumo:
Matrix decompositions, where a given matrix is represented as a product of two other matrices, are regularly used in data mining. Most matrix decompositions have their roots in linear algebra, but the needs of data mining are not always those of linear algebra. In data mining one needs to have results that are interpretable -- and what is considered interpretable in data mining can be very different to what is considered interpretable in linear algebra. --- The purpose of this thesis is to study matrix decompositions that directly address the issue of interpretability. An example is a decomposition of binary matrices where the factor matrices are assumed to be binary and the matrix multiplication is Boolean. The restriction to binary factor matrices increases interpretability -- factor matrices are of the same type as the original matrix -- and allows the use of Boolean matrix multiplication, which is often more intuitive than normal matrix multiplication with binary matrices. Also several other decomposition methods are described, and the computational complexity of computing them is studied together with the hardness of approximating the related optimization problems. Based on these studies, algorithms for constructing the decompositions are proposed. Constructing the decompositions turns out to be computationally hard, and the proposed algorithms are mostly based on various heuristics. Nevertheless, the algorithms are shown to be capable of finding good results in empirical experiments conducted with both synthetic and real-world data.
Resumo:
Solar flares were first observed by plain eye in white light by William Carrington in England in 1859. Since then these eruptions in the solar corona have intrigued scientists. It is known that flares influence the space weather experienced by the planets in a multitude of ways, for example by causing aurora borealis. Understanding flares is at the epicentre of human survival in space, as astronauts cannot survive the highly energetic particles associated with large flares in high doses without contracting serious radiation disease symptoms, unless they shield themselves effectively during space missions. Flares may be at the epicentre of man s survival in the past as well: it has been suggested that giant flares might have played a role in exterminating many of the large species on Earth, including dinosaurs. Having said that prebiotic synthesis studies have shown lightning to be a decisive requirement for amino acid synthesis on the primordial Earth. Increased lightning activity could be attributed to space weather, and flares. This thesis studies flares in two ways: in the spectral and the spatial domain. We have extracted solar spectra using three different instruments, namely GOES (Geostationary Operational Environmental Satellite), RHESSI (Reuven Ramaty High Energy Solar Spectroscopic Imager) and XSM (X-ray Solar Monitor) for the same flares. The GOES spectra are low resolution obtained with a gas proportional counter, the RHESSI spectra are higher resolution obtained with Germanium detectors and the XSM spectra are very high resolution observed with a silicon detector. It turns out that the detector technology and response influence the spectra we see substantially, and are important to understanding what conclusions to draw from the data. With imaging data, there was not such a luxury of choice available. We used RHESSI imaging data to observe the spatial size of solar flares. In the present work the focus was primarily on current solar flares. However, we did make use of our improved understanding of solar flares to observe young suns in NGC 2547. The same techniques used with solar monitors were applied with XMM-Newton, a stellar X-ray monitor, and coupled with ground based Halpha observations these techniques yielded estimates for flare parameters in young suns. The material in this thesis is therefore structured from technology to application, covering the full processing path from raw data and detector responses to concrete physical parameter results, such as the first measurement of the length of plasma flare loops in young suns.
Resumo:
Segmentation is a data mining technique yielding simplified representations of sequences of ordered points. A sequence is divided into some number of homogeneous blocks, and all points within a segment are described by a single value. The focus in this thesis is on piecewise-constant segments, where the most likely description for each segment and the most likely segmentation into some number of blocks can be computed efficiently. Representing sequences as segmentations is useful in, e.g., storage and indexing tasks in sequence databases, and segmentation can be used as a tool in learning about the structure of a given sequence. The discussion in this thesis begins with basic questions related to segmentation analysis, such as choosing the number of segments, and evaluating the obtained segmentations. Standard model selection techniques are shown to perform well for the sequence segmentation task. Segmentation evaluation is proposed with respect to a known segmentation structure. Applying segmentation on certain features of a sequence is shown to yield segmentations that are significantly close to the known underlying structure. Two extensions to the basic segmentation framework are introduced: unimodal segmentation and basis segmentation. The former is concerned with segmentations where the segment descriptions first increase and then decrease, and the latter with the interplay between different dimensions and segments in the sequence. These problems are formally defined and algorithms for solving them are provided and analyzed. Practical applications for segmentation techniques include time series and data stream analysis, text analysis, and biological sequence analysis. In this thesis segmentation applications are demonstrated in analyzing genomic sequences.
Resumo:
Cell transition data is obtained from a cellular phone that switches its current serving cell tower. The data consists of a sequence of transition events, which are pairs of cell identifiers and transition times. The focus of this thesis is applying data mining methods to such data, developing new algorithms, and extracting knowledge that will be a solid foundation on which to build location-aware applications. In addition to a thorough exploration of the features of the data, the tools and methods developed in this thesis provide solutions to three distinct research problems. First, we develop clustering algorithms that produce a reliable mapping between cell transitions and physical locations observed by users of mobile devices. The main clustering algorithm operates in online fashion, and we consider also a number of offline clustering methods for comparison. Second, we define the concept of significant locations, known as bases, and give an online algorithm for determining them. Finally, we consider the task of predicting the movement of the user, based on historical data. We develop a prediction algorithm that considers paths of movement in their entirety, instead of just the most recent movement history. All of the presented methods are evaluated with a significant body of real cell transition data, collected from about one hundred different individuals. The algorithms developed in this thesis are designed to be implemented on a mobile device, and require no extra hardware sensors or network infrastructure. By not relying on external services and keeping the user information as much as possible on the user s own personal device, we avoid privacy issues and let the users control the disclosure of their location information.
Resumo:
The Minimum Description Length (MDL) principle is a general, well-founded theoretical formalization of statistical modeling. The most important notion of MDL is the stochastic complexity, which can be interpreted as the shortest description length of a given sample of data relative to a model class. The exact definition of the stochastic complexity has gone through several evolutionary steps. The latest instantation is based on the so-called Normalized Maximum Likelihood (NML) distribution which has been shown to possess several important theoretical properties. However, the applications of this modern version of the MDL have been quite rare because of computational complexity problems, i.e., for discrete data, the definition of NML involves an exponential sum, and in the case of continuous data, a multi-dimensional integral usually infeasible to evaluate or even approximate accurately. In this doctoral dissertation, we present mathematical techniques for computing NML efficiently for some model families involving discrete data. We also show how these techniques can be used to apply MDL in two practical applications: histogram density estimation and clustering of multi-dimensional data.
Resumo:
Helicobacter pylori infection is a risk factor for gastric cancer, which is a major health issue worldwide. Gastric cancer has a poor prognosis due to the unnoticeable progression of the disease and surgery is the only available treatment in gastric cancer. Therefore, gastric cancer patients would greatly benefit from identifying biomarker genes that would improve diagnostic and prognostic prediction and provide targets for molecular therapies. DNA copy number amplifications are the hallmarks of cancers in various anatomical locations. Mechanisms of amplification predict that DNA double-strand breaks occur at the margins of the amplified region. The first objective of this thesis was to identify the genes that were differentially expressed in H. pylori infection as well as the transcription factors and signal transduction pathways that were associated with the gene expression changes. The second objective was to identify putative biomarker genes in gastric cancer with correlated expression and copy number, and the last objective was to characterize cancers based on DNA copy number amplifications. DNA microarrays, an in vitro model and real-time polymerase chain reaction were used to measure gene expression changes in H. pylori infected AGS cells. In order to identify the transcription factors and signal transduction pathways that were activated after H. pylori infection, gene expression profiling data from the H. pylori experiments and a bioinformatics approach accompanied by experimental validation were used. Genome-wide expression and copy number microarray analysis of clinical gastric cancer samples and immunohistochemistry on tissue microarray were used to identify putative gastric cancer genes. Data mining and machine learning techniques were applied to study amplifications in a cross-section of cancers. FOS and various stress response genes were regulated by H. pylori infection. H. pylori regulated genes were enriched in the chromosomal regions that are frequently changed in gastric cancer, suggesting that molecular pathways of gastric cancer and premalignant H. pylori infection that induces gastritis are interconnected. 16 transcription factors were identified as being associated with H. pylori infection induced changes in gene expression. NF-κB transcription factor and p50 and p65 subunits were verified using elecrophoretic mobility shift assays. ERBB2 and other genes located in 17q12- q21 were found to be up-regulated in association with copy number amplification in gastric cancer. Cancers with similar cell type and origin clustered together based on the genomic localization of the amplifications. Cancer genes and large genes were co-localized with amplified regions and fragile sites, telomeres, centromeres and light chromosome bands were enriched at the amplification boundaries. H. pylori activated transcription factors and signal transduction pathways function in cellular mechanisms that might be capable of promoting carcinogenesis of the stomach. Intestinal and diffuse type gastric cancers showed distinct molecular genetic profiles. Integration of gene expression and copy number microarray data allowed the identification of genes that might be involved in gastric carcinogenesis and have clinical relevance. Gene amplifications were demonstrated to be non-random genomic instabilities. Cell lineage, properties of precursor stem cells, tissue microenvironment and genomic map localization of specific oncogenes define the site specificity of DNA amplifications, whereas labile genomic features define the structures of amplicons. These conclusions suggest that the definition of genomic changes in cancer is based on the interplay between the cancer cell and the tumor microenvironment.
Resumo:
We study how probabilistic reasoning and inductive querying can be combined within ProbLog, a recent probabilistic extension of Prolog. ProbLog can be regarded as a database system that supports both probabilistic and inductive reasoning through a variety of querying mechanisms. After a short introduction to ProbLog, we provide a survey of the different types of inductive queries that ProbLog supports, and show how it can be applied to the mining of large biological networks.
Resumo:
The new paradigm of connectedness and empowerment brought by the interactivity feature of the Web 2.0 has been challenging the traditional centralized performance of mainstream media. The corporation has been able to survive the strong winds by transforming itself into a global multimedia business network embedded in the network society. By establishing networks, e.g. networks of production and distribution, the global multimedia business network has been able to sight potential solutions by opening the doors to innovation in a decentralized and flexible manner. Under this emerging context of re-organization, traditional practices like sourcing need to be re- explained and that is precisely what this thesis attempts to tackle. Based on ICT and on the network society, the study seeks to explain within the Finnish context the particular case of Helsingin Sanomat (HS) and its relations with the youth news agency, Youth Voice Editorial Board (NÄT). In that sense, the study can be regarded as an explanatory embedded single case study, where HS is the principal unit of analysis and NÄT its embedded unit of analysis. The thesis was able to reach explanations through interrelated steps. First, it determined the role of ICT in HS’s sourcing practices. Then it mapped an overview of the HS’s sourcing relations and provided a context in which NÄT was located. And finally, it established conceptualized institutional relational data between HS and NÄT for their posterior measurement through social network analysis. The data set was collected via qualitative interviews addressed to online and offline editors of HS as well as interviews addressed to NÄT’s personnel. The study concluded that ICT’s interactivity and User Generated Content (UGC) are not sourcing tools as such but mechanism used by HS for getting ideas that could turn into potential news stories. However, when it comes to visual communication, some exemptions were found. The lack of official sources amidst the immediacy leads HS to rely on ICT’s interaction and UGC. More than meets the eye, ICT’s input into the sourcing practice may be more noticeable if the interaction and UGC is well organized and coordinated into proper and innovative networks of alternative content collaboration. Currently, HS performs this sourcing practice via two projects that differ, precisely, by the mode they are coordinated. The first project found, Omakaupunki, is coordinated internally by Sanoma Group’s owned media houses HS, Vartti and Metro. The second project found is coordinated externally. The external alternative sourcing network, as it was labeled, consists of three actors, namely HS, NÄT (professionals in charge) and the youth. This network is a balanced and complete triad in which the actors connect themselves in relations of feedback, recognition, creativity and filtering. However, as innovation is approached very reluctantly, this content collaboration is a laboratory of experiments; a ‘COLLABORATORY’.
Resumo:
We propose to compress weighted graphs (networks), motivated by the observation that large networks of social, biological, or other relations can be complex to handle and visualize. In the process also known as graph simplication, nodes and (unweighted) edges are grouped to supernodes and superedges, respectively, to obtain a smaller graph. We propose models and algorithms for weighted graphs. The interpretation (i.e. decompression) of a compressed, weighted graph is that a pair of original nodes is connected by an edge if their supernodes are connected by one, and that the weight of an edge is approximated to be the weight of the superedge. The compression problem now consists of choosing supernodes, superedges, and superedge weights so that the approximation error is minimized while the amount of compression is maximized. In this paper, we formulate this task as the 'simple weighted graph compression problem'. We then propose a much wider class of tasks under the name of 'generalized weighted graph compression problem'. The generalized task extends the optimization to preserve longer-range connectivities between nodes, not just individual edge weights. We study the properties of these problems and propose a range of algorithms to solve them, with dierent balances between complexity and quality of the result. We evaluate the problems and algorithms experimentally on real networks. The results indicate that weighted graphs can be compressed efficiently with relatively little compression error.