13 resultados para Spatial data mining
em Helda - Digital Repository of University of Helsinki
Resumo:
Telecommunications network management is based on huge amounts of data that are continuously collected from elements and devices from all around the network. The data is monitored and analysed to provide information for decision making in all operation functions. Knowledge discovery and data mining methods can support fast-pace decision making in network operations. In this thesis, I analyse decision making on different levels of network operations. I identify the requirements decision-making sets for knowledge discovery and data mining tools and methods, and I study resources that are available to them. I then propose two methods for augmenting and applying frequent sets to support everyday decision making. The proposed methods are Comprehensive Log Compression for log data summarisation and Queryable Log Compression for semantic compression of log data. Finally I suggest a model for a continuous knowledge discovery process and outline how it can be implemented and integrated to the existing network operations infrastructure.
Resumo:
Matrix decompositions, where a given matrix is represented as a product of two other matrices, are regularly used in data mining. Most matrix decompositions have their roots in linear algebra, but the needs of data mining are not always those of linear algebra. In data mining one needs to have results that are interpretable -- and what is considered interpretable in data mining can be very different to what is considered interpretable in linear algebra. --- The purpose of this thesis is to study matrix decompositions that directly address the issue of interpretability. An example is a decomposition of binary matrices where the factor matrices are assumed to be binary and the matrix multiplication is Boolean. The restriction to binary factor matrices increases interpretability -- factor matrices are of the same type as the original matrix -- and allows the use of Boolean matrix multiplication, which is often more intuitive than normal matrix multiplication with binary matrices. Also several other decomposition methods are described, and the computational complexity of computing them is studied together with the hardness of approximating the related optimization problems. Based on these studies, algorithms for constructing the decompositions are proposed. Constructing the decompositions turns out to be computationally hard, and the proposed algorithms are mostly based on various heuristics. Nevertheless, the algorithms are shown to be capable of finding good results in empirical experiments conducted with both synthetic and real-world data.
Resumo:
Segmentation is a data mining technique yielding simplified representations of sequences of ordered points. A sequence is divided into some number of homogeneous blocks, and all points within a segment are described by a single value. The focus in this thesis is on piecewise-constant segments, where the most likely description for each segment and the most likely segmentation into some number of blocks can be computed efficiently. Representing sequences as segmentations is useful in, e.g., storage and indexing tasks in sequence databases, and segmentation can be used as a tool in learning about the structure of a given sequence. The discussion in this thesis begins with basic questions related to segmentation analysis, such as choosing the number of segments, and evaluating the obtained segmentations. Standard model selection techniques are shown to perform well for the sequence segmentation task. Segmentation evaluation is proposed with respect to a known segmentation structure. Applying segmentation on certain features of a sequence is shown to yield segmentations that are significantly close to the known underlying structure. Two extensions to the basic segmentation framework are introduced: unimodal segmentation and basis segmentation. The former is concerned with segmentations where the segment descriptions first increase and then decrease, and the latter with the interplay between different dimensions and segments in the sequence. These problems are formally defined and algorithms for solving them are provided and analyzed. Practical applications for segmentation techniques include time series and data stream analysis, text analysis, and biological sequence analysis. In this thesis segmentation applications are demonstrated in analyzing genomic sequences.
Resumo:
Cell transition data is obtained from a cellular phone that switches its current serving cell tower. The data consists of a sequence of transition events, which are pairs of cell identifiers and transition times. The focus of this thesis is applying data mining methods to such data, developing new algorithms, and extracting knowledge that will be a solid foundation on which to build location-aware applications. In addition to a thorough exploration of the features of the data, the tools and methods developed in this thesis provide solutions to three distinct research problems. First, we develop clustering algorithms that produce a reliable mapping between cell transitions and physical locations observed by users of mobile devices. The main clustering algorithm operates in online fashion, and we consider also a number of offline clustering methods for comparison. Second, we define the concept of significant locations, known as bases, and give an online algorithm for determining them. Finally, we consider the task of predicting the movement of the user, based on historical data. We develop a prediction algorithm that considers paths of movement in their entirety, instead of just the most recent movement history. All of the presented methods are evaluated with a significant body of real cell transition data, collected from about one hundred different individuals. The algorithms developed in this thesis are designed to be implemented on a mobile device, and require no extra hardware sensors or network infrastructure. By not relying on external services and keeping the user information as much as possible on the user s own personal device, we avoid privacy issues and let the users control the disclosure of their location information.
Resumo:
Helicobacter pylori infection is a risk factor for gastric cancer, which is a major health issue worldwide. Gastric cancer has a poor prognosis due to the unnoticeable progression of the disease and surgery is the only available treatment in gastric cancer. Therefore, gastric cancer patients would greatly benefit from identifying biomarker genes that would improve diagnostic and prognostic prediction and provide targets for molecular therapies. DNA copy number amplifications are the hallmarks of cancers in various anatomical locations. Mechanisms of amplification predict that DNA double-strand breaks occur at the margins of the amplified region. The first objective of this thesis was to identify the genes that were differentially expressed in H. pylori infection as well as the transcription factors and signal transduction pathways that were associated with the gene expression changes. The second objective was to identify putative biomarker genes in gastric cancer with correlated expression and copy number, and the last objective was to characterize cancers based on DNA copy number amplifications. DNA microarrays, an in vitro model and real-time polymerase chain reaction were used to measure gene expression changes in H. pylori infected AGS cells. In order to identify the transcription factors and signal transduction pathways that were activated after H. pylori infection, gene expression profiling data from the H. pylori experiments and a bioinformatics approach accompanied by experimental validation were used. Genome-wide expression and copy number microarray analysis of clinical gastric cancer samples and immunohistochemistry on tissue microarray were used to identify putative gastric cancer genes. Data mining and machine learning techniques were applied to study amplifications in a cross-section of cancers. FOS and various stress response genes were regulated by H. pylori infection. H. pylori regulated genes were enriched in the chromosomal regions that are frequently changed in gastric cancer, suggesting that molecular pathways of gastric cancer and premalignant H. pylori infection that induces gastritis are interconnected. 16 transcription factors were identified as being associated with H. pylori infection induced changes in gene expression. NF-κB transcription factor and p50 and p65 subunits were verified using elecrophoretic mobility shift assays. ERBB2 and other genes located in 17q12- q21 were found to be up-regulated in association with copy number amplification in gastric cancer. Cancers with similar cell type and origin clustered together based on the genomic localization of the amplifications. Cancer genes and large genes were co-localized with amplified regions and fragile sites, telomeres, centromeres and light chromosome bands were enriched at the amplification boundaries. H. pylori activated transcription factors and signal transduction pathways function in cellular mechanisms that might be capable of promoting carcinogenesis of the stomach. Intestinal and diffuse type gastric cancers showed distinct molecular genetic profiles. Integration of gene expression and copy number microarray data allowed the identification of genes that might be involved in gastric carcinogenesis and have clinical relevance. Gene amplifications were demonstrated to be non-random genomic instabilities. Cell lineage, properties of precursor stem cells, tissue microenvironment and genomic map localization of specific oncogenes define the site specificity of DNA amplifications, whereas labile genomic features define the structures of amplicons. These conclusions suggest that the definition of genomic changes in cancer is based on the interplay between the cancer cell and the tumor microenvironment.
Resumo:
Habitat requirements of fish are most strict during the early life stages, and the quality and quantity of reproduction habitats lays the basis for fish production. A considerable number of fish species in the northern Baltic Sea reproduce in the shallow coastal areas, which are also the most heavily exploited parts of the brackish marine area. However, the coastal fish reproduction habitats in the northern Baltic Sea are poorly known. The studies presented in this thesis focused on the influence of environmental conditions on the distribution of coastal reproduction habitats of freshwater fish. They were conducted in vegetated littoral zone along an exposure and salinity gradient extending from the innermost bays to the outer archipelago on the south-western and southern coasts of Finland, in the northern Baltic Sea. Special emphasis was placed on reed-covered Phragmites australis shores, which form a dominant vegetation type in several coastal archipelago areas. The main aims of this research were to (1) develop and test new survey and mapping methods, (2) investigate the environmental requirements that govern the reproduction of freshwater fish in the coastal area and (3) survey, map and model the distribution of the reproduction habitats of pike (Esox lucius) and roach (Rutilus rutilus). The white plate and scoop method with a standardized sampling time and effort was demonstrated to be a functional method for sampling the early life stages of fish in dense vegetation and shallow water. Reed-covered shores were shown to form especially important reproduction habitats for several freshwater fish species, such as pike, roach, other cyprinids and burbot, in the northern Baltic Sea. The reproduction habitats of pike were limited to sheltered reed- and moss-covered shores of the inner and middle archipelago, where suitable zooplankton prey were available and the influence of the open sea was low. The reproduction habitats of roach were even more limited and roach reproduction was successful only in the very sheltered reed-covered shores of the innermost bay areas, where salinity remained low (< 4‰) during the spawning season due to freshwater inflow. After identifying the critical factors restricting the reproduction of pike and roach, the spatial distribution of their reproduction habitats was successfully mapped and modelled along the environmental gradients using only a few environmental predictor variables. Reproduction habitat maps are a valuable tool promoting the sustainable use and management of exploited coastal areas and helping to maintain the sustainability of fish populations. However, the large environmental gradients and the extensiveness of the archipelago zone in the northern Baltic Sea demand an especially high spatial resolution of the coastal predictor variables. Therefore, the current lack of accurate large-scale, high-resolution spatial data gathered at exactly the right time is a considerable limitation for predictive modelling of shallow coastal waters.
Resumo:
We study how probabilistic reasoning and inductive querying can be combined within ProbLog, a recent probabilistic extension of Prolog. ProbLog can be regarded as a database system that supports both probabilistic and inductive reasoning through a variety of querying mechanisms. After a short introduction to ProbLog, we provide a survey of the different types of inductive queries that ProbLog supports, and show how it can be applied to the mining of large biological networks.
Resumo:
The world of mapping has changed. Earlier, only professional experts were responsible for map production, but today ordinary people without any training or experience can become map-makers. The number of online mapping sites, and the number of volunteer mappers has increased significantly. The development of the technology, such as satellite navigation systems, Web 2.0, broadband Internet connections, and smartphones, have had one of the key roles in enabling the rise of volunteered geographic information (VGI). As opening governmental data to public is a current topic in many countries, the opening of high quality geographical data has a central role in this study. The aim of this study is to investigate how is the quality of spatial data produced by volunteers by comparing it with the map data produced by public authorities, to follow what occurs when spatial data are opened for users, and to get acquainted with the user profile of these volunteer mappers. A central part of this study is OpenStreetMap project (OSM), which aim is to create a map of the entire world by volunteers. Anyone can become an OpenStreetMap contributor, and the data created by the volunteers are free to use for anyone without restricting copyrights or license charges. In this study OpenStreetMap is investigated from two viewpoints. In the first part of the study, the aim was to investigate the quality of volunteered geographic information. A pilot project was implemented by following what occurs when a high-resolution aerial imagery is released freely to the OpenStreetMap contributors. The quality of VGI was investigated by comparing the OSM datasets with the map data of The National Land Survey of Finland (NLS). The quality of OpenStreetMap data was investigated by inspecting the positional accuracy and the completeness of the road datasets, as well as the differences in the attribute datasets between the studied datasets. Also the OSM community was under analysis and the development of the map data of OpenStreetMap was investigated by visual analysis. The aim of the second part of the study was to analyse the user profile of OpenStreetMap contributors, and to investigate how the contributors act when collecting data and editing OpenStreetMap. The aim was also to investigate what motivates users to map and how is the quality of volunteered geographic information envisaged. The second part of the study was implemented by conducting a web inquiry to the OpenStreetMap contributors. The results of the study show that the quality of OpenStreetMap data compared with the data of National Land Survey of Finland can be defined as good. OpenStreetMap differs from the map of National Land Survey especially because of the amount of uncertainty, for example because of the completeness and uniformity of the map are not known. The results of the study reveal that opening spatial data increased notably the amount of the data in the study area, and both the positional accuracy and completeness improved significantly. The study confirms the earlier arguments that only few contributors have created the majority of the data in OpenStreetMap. The inquiry made for the OpenStreetMap users revealed that the data are most often collected by foot or by bicycle using GPS device, or by editing the map with the help of aerial imageries. According to the responses, the users take part to the OpenStreetMap project because they want to make maps better, and want to produce maps, which have information that is up-to-date and cannot be found from any other maps. Almost all of the users exploit the maps by themselves, most popular methods being downloading the map into a navigator or into a mobile device. The users regard the quality of OpenStreetMap as good, especially because of the up-to-dateness and the accuracy of the map.
Resumo:
We propose to compress weighted graphs (networks), motivated by the observation that large networks of social, biological, or other relations can be complex to handle and visualize. In the process also known as graph simplication, nodes and (unweighted) edges are grouped to supernodes and superedges, respectively, to obtain a smaller graph. We propose models and algorithms for weighted graphs. The interpretation (i.e. decompression) of a compressed, weighted graph is that a pair of original nodes is connected by an edge if their supernodes are connected by one, and that the weight of an edge is approximated to be the weight of the superedge. The compression problem now consists of choosing supernodes, superedges, and superedge weights so that the approximation error is minimized while the amount of compression is maximized. In this paper, we formulate this task as the 'simple weighted graph compression problem'. We then propose a much wider class of tasks under the name of 'generalized weighted graph compression problem'. The generalized task extends the optimization to preserve longer-range connectivities between nodes, not just individual edge weights. We study the properties of these problems and propose a range of algorithms to solve them, with dierent balances between complexity and quality of the result. We evaluate the problems and algorithms experimentally on real networks. The results indicate that weighted graphs can be compressed efficiently with relatively little compression error.