11 results for Data mining and knowledge discovery

in Helda - Digital Repository of University of Helsinki


Relevance: 100.00%

Abstract:

The research question of this thesis was how knowledge can be managed with information systems. Information systems can support, but not replace, knowledge management. Systems can mainly store epistemic organisational knowledge embedded in content, and process data and information. Additional value can be achieved by adding communication technology to the systems; not all communication, however, can be managed. A new layer between communication and manageable information was named knowformation. The knowledge management literature was surveyed, together with the varieties of information treated in philosophy, physics, communication theory and information systems science. Positivism, post-positivism and critical theory were considered, but knowformation in extended organisational memory appeared to be socially constructed.

A memory management model of the extended enterprise (M3.exe) and the knowformation concept emerged from iterative case studies covering data, information and knowledge management systems. The cases ranged from groups to the extended organisation. Systems were investigated, and administrators, users (knowledge workers) and managers were interviewed. Building the model required alternative notions of data, information and knowledge instead of the traditional pyramid, and the explicit-tacit dichotomy was also reconsidered. As human knowledge is the ultimate aim of all data and information in the systems, the distinction between managing information and managing people was harmonised. Information systems were classified as the core of organisational memory, and in practice the content of the systems lies between communication and presentation. Firstly, the epistemic criterion of knowledge is required neither by the knowledge management literature nor of the content of the systems. Secondly, systems deal mostly with containers, whereas the knowledge management literature deals with applied knowledge. The construction of reality from system content and communication also supports the knowformation concept.

Knowformation belongs to the memory management model of the extended enterprise (M3.exe), which is divided along horizontal and vertical key dimensions. Vertically, processes deal with content that can be managed, whereas communication can only be supported, mainly by infrastructure. Horizontally, the right-hand side of the model contains systems and the left-hand side content, which should be independent of each other. A strategy based on the model was defined.

Relevance: 100.00%

Abstract:

Telecommunications network management is based on huge amounts of data that are continuously collected from elements and devices all around the network. The data are monitored and analysed to provide information for decision making in all operation functions. Knowledge discovery and data mining methods can support fast-paced decision making in network operations. In this thesis, I analyse decision making on different levels of network operations. I identify the requirements that decision making sets for knowledge discovery and data mining tools and methods, and I study the resources that are available to them. I then propose two methods for augmenting and applying frequent sets to support everyday decision making: Comprehensive Log Compression for log data summarisation and Queryable Log Compression for semantic compression of log data. Finally, I suggest a model for a continuous knowledge discovery process and outline how it can be implemented and integrated into the existing network operations infrastructure.
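To illustrate the idea of summarising log data with frequent sets, here is a minimal, self-contained Python sketch in the spirit of Comprehensive Log Compression: it mines frequent field-value combinations from log rows and reports each row as its largest frequent pattern, so recurring alarms collapse to a single pattern with a count. The field names, threshold and brute-force pattern enumeration are illustrative only; the thesis's CLC and QLC methods are more elaborate.

```python
# A minimal frequent-set log summariser; field names and min_support are
# hypothetical, and the enumeration is exponential in the number of fields
# (acceptable for a sketch, not for production log volumes).
from collections import Counter
from itertools import combinations

def frequent_patterns(log_rows, min_support):
    """Count (field, value) combinations occurring in at least min_support rows."""
    counts = Counter()
    for row in log_rows:                       # row: dict of field -> value
        items = sorted(row.items())
        for r in range(1, len(items) + 1):
            for combo in combinations(items, r):
                counts[combo] += 1
    return {p: c for p, c in counts.items() if c >= min_support}

def summarise(log_rows, min_support=3):
    """Represent each row by the largest frequent pattern it contains."""
    freq = frequent_patterns(log_rows, min_support)
    summary = Counter()
    for row in log_rows:
        items = set(row.items())
        # the empty pattern () stands for rows matching no frequent set
        best = max((p for p in freq if set(p) <= items), key=len, default=())
        summary[best] += 1
    return summary

logs = [{"node": "bsc01", "alarm": "LINK_DOWN", "sev": "major"}] * 5 + \
       [{"node": "bsc02", "alarm": "POWER", "sev": "minor"}]
for pattern, n in summarise(logs).items():
    print(n, "x", dict(pattern))
```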

Relevance: 100.00%

Abstract:

Matrix decompositions, in which a given matrix is represented as a product of two other matrices, are regularly used in data mining. Most matrix decompositions have their roots in linear algebra, but the needs of data mining are not always those of linear algebra. In data mining one needs results that are interpretable, and what counts as interpretable in data mining can be very different from what counts as interpretable in linear algebra.

The purpose of this thesis is to study matrix decompositions that directly address the issue of interpretability. An example is a decomposition of binary matrices in which the factor matrices are constrained to be binary and the matrix multiplication is Boolean. The restriction to binary factor matrices increases interpretability, since the factor matrices are of the same type as the original matrix, and allows the use of Boolean matrix multiplication, which is often more intuitive than normal matrix multiplication with binary matrices. Several other decomposition methods are also described, and the computational complexity of computing them is studied together with the hardness of approximating the related optimisation problems. Based on these studies, algorithms for constructing the decompositions are proposed. Constructing the decompositions turns out to be computationally hard, and the proposed algorithms are mostly based on various heuristics. Nevertheless, the algorithms are shown to find good results in empirical experiments on both synthetic and real-world data.
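As an illustration of Boolean matrix decomposition, the following sketch defines the Boolean matrix product and a naive greedy heuristic that repeatedly picks a rank-1 binary factor covering as many uncovered 1s as possible without covering any 0s. Using rows of the input matrix as candidate factors is a simplification for the example; it is a stand-in, not one of the algorithms proposed in the thesis.

```python
# Greedy Boolean matrix factorisation sketch (illustrative heuristic only).
import numpy as np

def boolean_product(U, V):
    """Boolean matrix product: (U o V)[i, j] = OR_k (U[i, k] AND V[k, j])."""
    return (U.astype(int) @ V.astype(int)) > 0

def greedy_bmf(A, k):
    """Greedy rank-k Boolean factorisation using rows of A as candidate factors."""
    A = A.astype(bool)
    n, m = A.shape
    U = np.zeros((n, k), dtype=bool)
    V = np.zeros((k, m), dtype=bool)
    covered = np.zeros_like(A)
    for t in range(k):
        best, best_gain = None, 0
        for r in range(n):
            v = A[r]
            fits = ~((~A) & v).any(axis=1)        # rows where v covers no 0s
            gain_rows = ((A & v) & ~covered).sum(axis=1)
            u = fits & (gain_rows > 0)
            gain = int(gain_rows[u].sum())
            if gain > best_gain:
                best, best_gain = (u, v), gain
        if best is None:                          # nothing left to cover
            break
        U[:, t], V[t] = best
        covered |= np.outer(U[:, t], V[t])
    return U, V

A = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 0, 1]])
U, V = greedy_bmf(A, 2)
print(boolean_product(U, V).astype(int))          # reconstructs A exactly
```

Note how interpretability shows up here: each column of U names a set of rows and the matching row of V a set of columns, so every factor reads as "these objects share these attributes".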

Relevance: 100.00%

Abstract:

The study of soil microbiota and their activities is central to the understanding of many ecosystem processes such as decomposition and nutrient cycling. The collection of microbiological data from soils generally involves several sequential steps of sampling, pretreatment and laboratory measurements, and the reliability of the results depends on reliable methods at every step. The aim of this thesis was to critically evaluate some central methods and procedures used in soil microbiological studies, in order to increase our understanding of the factors that affect measurement results and to provide guidance and new approaches for the design of experiments. The thesis focuses on four major themes: 1) soil microbiological heterogeneity and sampling, 2) storage of soil samples, 3) DNA extraction from soil, and 4) quantification of specific microbial groups by the most-probable-number (MPN) procedure.

Soil heterogeneity and sampling are discussed as a single theme because knowledge of spatial (horizontal and vertical) and temporal variation is crucial when designing sampling procedures. Comparison of adjacent forest, meadow and cropped field plots showed that land use has a strong impact on the degree of horizontal variation in soil enzyme activities and bacterial community structure. Regardless of the land use, however, the variation in microbiological characteristics appeared to have no predictable spatial structure at the 0.5-10 m scale. Temporal and depth-related patterns were studied in relation to plant growth in cropped soil. The results showed that most enzyme activities and microbial biomass have a clear decreasing trend in the top 40 cm of the soil profile and a temporal pattern during the growing season. A new procedure for sampling soil microbiological characteristics, based on stratified sampling and pre-characterisation of samples, was developed; a practical example demonstrated its potential to reduce the effort involved in laborious microbiological measurements without loss of precision.

The investigation of sample storage revealed that freezing (-20 °C) of small sample aliquots retains the activity of hydrolytic enzymes and the structure of the bacterial community in different soil matrices relatively well, whereas air-drying cannot be recommended as a storage method for soil microbiological properties due to large reductions in activity. Freezing below -70 °C was the preferred method of storage for samples with a high organic matter content. Comparison of different direct DNA extraction methods showed that the cell lysis treatment has a strong impact on the molecular size of the DNA obtained and on the bacterial community structure detected. Finally, an improved MPN method for enumerating soil naphthalene degraders was introduced as an alternative to more complex MPN protocols or DNA-based quantification; its main advantages are a simple protocol and the possibility of analysing a large number of samples and replicates simultaneously.
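Since the MPN procedure is ultimately a statistical estimate, a compact example may help: under the usual Poisson assumption, the most probable number is the cell density that maximises the likelihood of the observed pattern of growth-positive tubes across dilutions. The sketch below shows a generic maximum-likelihood MPN calculation with made-up tube counts and inoculum volumes; it is a textbook-style illustration, not the improved naphthalene-degrader protocol developed in the thesis.

```python
# Generic maximum-likelihood MPN estimate for a dilution series (illustrative).
import numpy as np
from scipy.optimize import minimize_scalar

def mpn(positive, tubes, volume):
    """MPN per unit volume: density maximising the dilution-series likelihood.

    positive[i] of tubes[i] tubes were growth-positive at inoculum volume[i];
    under a Poisson model a tube is positive with prob. 1 - exp(-lam * v).
    """
    p, n, v = map(np.asarray, (positive, tubes, volume))

    def neg_log_lik(lam):
        prob = np.clip(1.0 - np.exp(-lam * v), 1e-12, 1 - 1e-12)
        return -np.sum(p * np.log(prob) - (n - p) * lam * v)

    res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1e4), method="bounded")
    return res.x

# classic 3-dilution, 5-tube series: 10, 1 and 0.1 ml of sample per tube;
# a 5-3-0 positive pattern gives ~0.79 per ml (79 MPN per 100 ml)
print(round(mpn([5, 3, 0], [5, 5, 5], [10.0, 1.0, 0.1]), 3), "per ml")
```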

Relevance: 100.00%

Abstract:

Helicobacter pylori infection is a risk factor for gastric cancer, a major health issue worldwide. Gastric cancer has a poor prognosis because the disease progresses unnoticed, and surgery is the only available treatment. Gastric cancer patients would therefore benefit greatly from biomarker genes that improve diagnostic and prognostic prediction and provide targets for molecular therapies. DNA copy number amplifications are hallmarks of cancers in various anatomical locations, and mechanisms of amplification predict that DNA double-strand breaks occur at the margins of the amplified region. The first objective of this thesis was to identify the genes that are differentially expressed in H. pylori infection, as well as the transcription factors and signal transduction pathways associated with these gene expression changes. The second objective was to identify putative biomarker genes in gastric cancer with correlated expression and copy number, and the last objective was to characterise cancers based on DNA copy number amplifications.

DNA microarrays, an in vitro model and real-time polymerase chain reaction were used to measure gene expression changes in H. pylori-infected AGS cells. To identify the transcription factors and signal transduction pathways activated after H. pylori infection, gene expression profiling data from the H. pylori experiments were combined with a bioinformatics approach and experimental validation. Genome-wide expression and copy number microarray analysis of clinical gastric cancer samples, together with immunohistochemistry on tissue microarrays, was used to identify putative gastric cancer genes. Data mining and machine learning techniques were applied to study amplifications in a cross-section of cancers.

FOS and various stress response genes were regulated by H. pylori infection. H. pylori-regulated genes were enriched in the chromosomal regions that are frequently altered in gastric cancer, suggesting that the molecular pathways of gastric cancer and of premalignant, gastritis-inducing H. pylori infection are interconnected. Sixteen transcription factors were identified as being associated with H. pylori infection-induced changes in gene expression; the NF-κB transcription factor and its p50 and p65 subunits were verified using electrophoretic mobility shift assays. ERBB2 and other genes located in 17q12-q21 were found to be up-regulated in association with copy number amplification in gastric cancer. Cancers of similar cell type and origin clustered together based on the genomic localisation of their amplifications. Cancer genes and large genes co-localised with the amplified regions, while fragile sites, telomeres, centromeres and light chromosome bands were enriched at the amplification boundaries.

H. pylori-activated transcription factors and signal transduction pathways function in cellular mechanisms that might be capable of promoting carcinogenesis of the stomach. Intestinal and diffuse type gastric cancers showed distinct molecular genetic profiles. Integration of gene expression and copy number microarray data allowed the identification of genes that might be involved in gastric carcinogenesis and have clinical relevance. Gene amplifications were demonstrated to be non-random genomic instabilities: cell lineage, properties of precursor stem cells, tissue microenvironment and the genomic map localisation of specific oncogenes define the site specificity of DNA amplifications, whereas labile genomic features define the structures of amplicons. These conclusions suggest that the definition of genomic changes in cancer is based on the interplay between the cancer cell and the tumour microenvironment.
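The integration step lends itself to a small illustration: for each gene measured on the same tumours by both platforms, correlate expression with copy number and flag genes whose expression follows DNA dosage, as ERBB2 does at 17q12-q21. The data, gene labels and correlation cut-off below are invented for the example; the thesis's analysis of clinical samples is more involved.

```python
# Sketch of expression/copy-number integration via per-gene Pearson correlation.
import numpy as np

def expression_copy_number_candidates(expr, cn, genes, min_r=0.6):
    """Return genes where per-gene Pearson r(expression, copy number) >= min_r.

    expr, cn: (n_genes, n_samples) arrays measured on the same samples.
    """
    e = expr - expr.mean(axis=1, keepdims=True)
    c = cn - cn.mean(axis=1, keepdims=True)
    r = (e * c).sum(axis=1) / np.sqrt((e**2).sum(axis=1) * (c**2).sum(axis=1))
    return [(g, round(float(ri), 2)) for g, ri in zip(genes, r) if ri >= min_r]

rng = np.random.default_rng(0)
cn = rng.normal(0, 1, size=(3, 20))             # log-ratio copy number
expr = 0.9 * cn + rng.normal(0, 0.5, (3, 20))   # expression follows dosage
expr[2] = rng.normal(0, 1, 20)                  # one gene: no dosage effect
print(expression_copy_number_candidates(expr, cn, ["ERBB2", "GRB7", "TP53"]))
```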

Relevance: 100.00%

Abstract:

The aim of this thesis was to develop a fully automatic lameness detection system that operates in a milking robot. The instrumentation, measurement software, algorithms for data analysis and a neural network model for lameness detection were developed.

Automatic milking has become common practice in dairy husbandry; in 2006 about 4,000 farms worldwide used over 6,000 milking robots, and there is a worldwide movement towards fully automating every process from feeding to milking. The increase in automation is a consequence of growing farm sizes, the demand for more efficient production and rising labour costs. As the level of automation increases, the time the cattle keeper spends monitoring animals often decreases, which has created a need for systems that automatically monitor the health of farm animals. The popularity of milking robots also offers a new and unique possibility to monitor animals in a single confined space up to four times daily.

Lameness is a crucial welfare issue in the modern dairy industry. Limb disorders cause serious welfare, health and economic problems, especially in loose housing of cattle; lameness reduces milk production and leads to early culling of animals. These costs could be reduced with early identification and treatment. At present only a few methods for automatically detecting lameness have been developed, and the most common approach is one of various visual locomotion scoring systems. The problem with locomotion scoring is that it requires experience to be conducted properly, it is labour intensive as an on-farm method, and its results are subjective.

A four-balance system for measuring the leg load distribution of dairy cows during milking, in order to detect lameness, was developed and set up at the University of Helsinki research farm Suitia. The leg weights of 73 cows were successfully recorded during almost 10,000 robotic milkings over a period of 5 months. The cows were locomotion scored weekly, and the lame cows were inspected clinically for hoof lesions. Unsuccessful measurements, caused by cows standing outside the balances, were removed from the data with a special algorithm, and the mean leg loads and the number of kicks during milking were calculated.

To build an expert system that automatically detects lameness, a probabilistic neural network (PNN) classifier was chosen. The data were divided into two parts: 5,074 measurements from 37 cows were used to train the model, and its ability to detect lameness was evaluated on a validation dataset of 4,868 measurements from 36 cows. The model classified 96% of the measurements correctly as sound or lame, identified 100% of the lameness cases in the validation data, and produced false alarms for 1.1% of the measurements. The developed model has the potential to support on-farm decision making and can be used in a real-time lameness monitoring system.
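For readers unfamiliar with the classifier, a probabilistic neural network is essentially a Parzen-window classifier: each class is scored by the average Gaussian-kernel density its training examples place on the query point, and the query is assigned to the highest-scoring class. The minimal sketch below uses invented leg-load features and an arbitrary smoothing parameter; it is not the calibrated model trained on the Suitia data.

```python
# Minimal probabilistic neural network (Parzen-window classifier) sketch.
import numpy as np

class PNN:
    """Classify by the class whose training points give the highest
    average Gaussian-kernel density at the query point."""

    def __init__(self, sigma=1.0):
        self.sigma = sigma                       # kernel smoothing parameter

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.groups = [X[y == c] for c in self.classes]
        return self

    def predict(self, X):
        scores = np.stack([
            np.exp(-((X[:, None, :] - g[None, :, :]) ** 2).sum(-1)
                   / (2 * self.sigma ** 2)).mean(axis=1)
            for g in self.groups
        ])                                       # (n_classes, n_queries)
        return self.classes[np.argmax(scores, axis=0)]

# toy data: [rear-leg load asymmetry, kicks per milking] (hypothetical features)
X = np.array([[0.05, 0.0], [0.08, 1.0], [0.45, 3.0], [0.50, 5.0]])
y = np.array(["sound", "sound", "lame", "lame"])
model = PNN(sigma=0.5).fit(X, y)
print(model.predict(np.array([[0.07, 0.0], [0.48, 4.0]])))  # ['sound' 'lame']
```

One appeal of a PNN in this setting is that "training" is just storing the measurements, so the model can be updated as new milkings arrive.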

Relevance: 100.00%

Abstract:

The core aim of machine learning is to make a computer program learn from experience. Learning from data is usually defined as the task of learning regularities or patterns in data in order to extract useful information, or to learn the underlying concept. An important sub-field of machine learning, called multi-view learning, addresses the task of learning from multiple data sets, or views, describing the same underlying concept. A typical example of such a scenario is studying a biological concept through several kinds of biological measurements, such as gene expression, protein expression and metabolic profiles, or classifying web pages based both on their own content and on the contents of the pages they link to.

In this thesis, novel problem formulations and methods for multi-view learning are presented. The contributions include a linear data fusion approach for exploratory data analysis, a new measure for evaluating different kinds of representations of textual data, and an extension of multi-view learning to scenarios where the correspondence of samples between the different views or data sets is not known in advance. To infer the one-to-one correspondence of samples between two views, a novel concept of multi-view matching is proposed. The matching algorithm is completely data-driven and is demonstrated in several applications, such as matching metabolites between humans and mice and matching sentences between documents in two languages.
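As a small illustration of the matching idea: if the two views can be embedded in a comparable space, a one-to-one correspondence of samples can be recovered by solving a minimum-cost assignment problem on cross-view distances. The thesis's matching algorithm is fully data-driven and does not assume such a shared space; the sketch below is a simplified stand-in under that assumption.

```python
# Multi-view matching sketch: assignment on a cross-view distance matrix.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def match_views(X, Y):
    """Return index pairs (i, j) pairing each row of view X to a row of Y."""
    cost = cdist(X, Y, metric="euclidean")    # cross-view distances
    rows, cols = linear_sum_assignment(cost)  # minimum-cost perfect matching
    return list(zip(rows, cols))

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))                   # view 1 (e.g., human metabolites)
perm = rng.permutation(5)
Y = X[perm] + rng.normal(0, 0.05, (5, 3))     # view 2: permuted, noisy copy
pairs = match_views(X, Y)
print(all(perm[j] == i for i, j in pairs))    # True: permutation recovered
```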