31 results for Knowledge discovery in databases
Abstract:
Despite its importance, choosing the structural form of the kernel in nonparametric regression remains a black art. We define a space of kernel structures which are built compositionally by adding and multiplying a small number of base kernels. We present a method for searching over this space of structures which mirrors the scientific discovery process. The learned structures can often decompose functions into interpretable components and enable long-range extrapolation on time-series datasets. Our structure search method outperforms many widely used kernels and kernel combination methods on a variety of prediction tasks.
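The compositional idea in this abstract (building kernels by adding and multiplying a few base kernels) can be illustrated with a minimal sketch. The particular base kernels chosen here (squared-exponential, periodic, linear), their parameter names, and the helper names `add` and `mul` are assumptions made for illustration. The abstract does not list them, and the structure search itself is not shown.

```python
import math

# Base kernels (an illustrative choice; the abstract does not name them).
def rbf(lengthscale):
    """Squared-exponential kernel on scalars."""
    def k(x, y):
        return math.exp(-0.5 * ((x - y) / lengthscale) ** 2)
    return k

def periodic(period, lengthscale):
    """Standard periodic kernel on scalars."""
    def k(x, y):
        s = math.sin(math.pi * (x - y) / period)
        return math.exp(-2.0 * s ** 2 / lengthscale ** 2)
    return k

def linear():
    """Linear kernel on scalars."""
    def k(x, y):
        return x * y
    return k

# Composition operators: sums and products of kernels are again kernels.
def add(k1, k2):
    return lambda x, y: k1(x, y) + k2(x, y)

def mul(k1, k2):
    return lambda x, y: k1(x, y) * k2(x, y)

# Hypothetical structure: a linear trend plus a locally periodic component.
composite = add(linear(), mul(rbf(10.0), periodic(1.0, 0.5)))

def gram(kernel, xs):
    """Gram matrix of a kernel over a list of scalar inputs."""
    return [[kernel(a, b) for b in xs] for a in xs]
```

A structure such as `composite` reads as "linear trend + (smooth envelope x periodic pattern)", which is the kind of interpretable decomposition the abstract refers to.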
Abstract:
Design knowledge can be acquired from various sources and generally requires an integrated representation for its effective and efficient re-use. Though knowledge about products and processes can illustrate the solutions created (know-what) and the courses of actions (know-how) involved in their creation, the reasoning process (know-why) underlying the solutions and actions is still needed for an integrated representation of design knowledge. Design rationale is an effective way of capturing that missing part, since it records the issues addressed, the options considered, and the arguments used when specific design solutions are created and evaluated. Apart from the need for an integrated representation, effective retrieval methods are also of great importance for the re-use of design knowledge, as the knowledge involved in designing complex products can be huge. Developing methods for the retrieval of design rationale is very useful as part of the effective management of design knowledge, for the following reasons. Firstly, design engineers tend to want to consider issues and solutions before looking at solid models or process specifications in detail. Secondly, design rationale is mainly described using text, which often embodies much relevant design knowledge. Last but not least, design rationale is generally captured by identifying elements and their dependencies, i.e. in a structured way which opens the opportunity for going beyond simple keyword-based searching. In this paper, the management of design rationale for the re-use of design knowledge is presented. The retrieval of design rationale records in particular is discussed in detail. As evidenced in the development and evaluation, the methods proposed are useful for the re-use of design knowledge and can be generalised to be used for the retrieval of other kinds of structured design knowledge. © 2012 Elsevier Ltd. All rights reserved.
Abstract:
The Internet has enabled the creation of a growing number of large-scale knowledge bases in a variety of domains containing complementary information. Tools for automatically aligning these knowledge bases would make it possible to unify many sources of structured knowledge and answer complex queries. However, the efficient alignment of large-scale knowledge bases still poses a considerable challenge. Here, we present Simple Greedy Matching (SiGMa), a simple algorithm for aligning knowledge bases with millions of entities and facts. SiGMa is an iterative propagation algorithm that leverages both the structural information from the relationship graph and flexible similarity measures between entity properties in a greedy local search, which makes it scalable. Despite its greedy nature, our experiments indicate that SiGMa can efficiently match some of the world's largest knowledge bases with high precision. We provide additional experiments on benchmark datasets which demonstrate that SiGMa can outperform state-of-the-art approaches in both accuracy and efficiency.
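The following is a rough sketch of a greedy, propagation-style matcher in the spirit of the description above. It is not the authors' SiGMa implementation: the scoring formula, the Jaccard token similarity, the `alpha` and `threshold` parameters, and all function and variable names are assumptions chosen for illustration.

```python
import heapq

def jaccard(a, b):
    """Token-level Jaccard similarity between two name strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def greedy_align(names1, names2, nbrs1, nbrs2, seeds, threshold=0.3, alpha=0.5):
    """Greedily grow an alignment outward from seed matches (hypothetical scheme)."""
    matched1 = dict(seeds)                       # entity in KB1 -> entity in KB2
    matched2 = {v: k for k, v in seeds.items()}  # reverse map

    def score(a, b):
        # Graph part: neighbours of a already matched to neighbours of b.
        na, nb = nbrs1.get(a, ()), nbrs2.get(b, ())
        shared = sum(1 for n in na if n in matched1 and matched1[n] in nb)
        denom = max(len(na), len(nb), 1)
        # Property part: similarity of the entity names.
        return alpha * jaccard(names1[a], names2[b]) + (1 - alpha) * shared / denom

    heap = []

    def push_neighbours(a, b):
        # Propose pairs formed from the neighbourhoods of a matched pair.
        for na in nbrs1.get(a, ()):
            for nb in nbrs2.get(b, ()):
                if na not in matched1 and nb not in matched2:
                    heapq.heappush(heap, (-score(na, nb), na, nb))

    for a, b in seeds.items():
        push_neighbours(a, b)

    while heap:
        neg_score, a, b = heapq.heappop(heap)
        if a in matched1 or b in matched2:
            continue  # one side was matched since this pair was queued
        if -neg_score < threshold:
            continue  # scores are taken at push time and not refreshed
        matched1[a], matched2[b] = b, a
        push_neighbours(a, b)
    return matched1

# Toy data, invented for this example.
names1 = {"a1": "Albert Einstein", "a2": "Ulm", "a3": "Physics"}
names2 = {"b1": "Albert Einstein", "b2": "Ulm Germany", "b3": "Physics"}
nbrs1 = {"a1": ["a2", "a3"], "a2": ["a1"], "a3": ["a1"]}
nbrs2 = {"b1": ["b2", "b3"], "b2": ["b1"], "b3": ["b1"]}
alignment = greedy_align(names1, names2, nbrs1, nbrs2, seeds={"a1": "b1"})
```

Only pairs adjacent to an existing match are ever scored, so the candidate set stays small. That locality is one plausible reading of why a greedy local search scales to large knowledge bases.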
Abstract:
Design rationale is an effective way of capturing knowledge, since it records the issues addressed, the options considered, and the arguments used when specific decisions are made during the design process. Design rationale is generally captured by identifying elements and their dependencies, i.e. in a structured way. Current retrieval methods focus mainly on either the classification of rationale or keyword-based searches of records. Keyword-based retrieval is reasonably effective, as the information in design rationale records is mainly described using text. However, most current keyword-based retrieval methods discard the implicit structures of these records, resulting either in poor retrieval precision or in isolated pieces of information that are difficult to understand. This ongoing research aims to go beyond keyword-based retrieval by developing methods and tools to facilitate the provision of useful design knowledge in new design projects. Our first step is to understand the structured information derived from the relationships between lumps of text held in different nodes of the design rationale captured via a software tool currently used in industry, and to study how this information can be utilised to improve retrieval performance. Specifically, methods for utilising various kinds of structured information are developed and implemented on a prototype keyword-based retrieval system developed in our earlier work. The implementation and evaluation of these methods show that the structured information can be utilised in a number of ways, such as filtering the results and providing more complete information. This allows the retrieval system to present results that are easy to understand and that closely match designers' queries. Like design rationale, other methods for representing design knowledge also in essence involve structured information, and thus the methods proposed can be generalised and applied to the retrieval of other kinds of design knowledge. Copyright © 2002-2012 The Design Society. All rights reserved.
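As a small illustration of how structure might complement keyword matching, the sketch below filters keyword hits by node type and attaches each hit's linked nodes so that results are not isolated fragments. The data model (a dictionary of typed nodes plus an adjacency list), the function name `search`, and the sample records are all hypothetical and are not taken from the prototype system described in the abstract.

```python
def search(nodes, links, query, kind=None):
    """Keyword search over rationale nodes, using structure in two ways:
    filtering hits by node type and returning each hit's linked nodes."""
    terms = set(query.lower().split())
    results = {}
    for node_id, (node_kind, text) in nodes.items():
        if kind is not None and node_kind != kind:
            continue  # structural filter on node type
        if terms & set(text.lower().split()):
            # Attach linked nodes so the hit can be read in context.
            results[node_id] = sorted(links.get(node_id, ()))
    return results

# Invented sample record: one issue, two options, one argument.
nodes = {
    "i1": ("issue", "How to reduce bracket weight"),
    "o1": ("option", "Use aluminium alloy"),
    "o2": ("option", "Add lightening holes"),
    "g1": ("argument", "Aluminium lowers mass but raises cost"),
}
links = {"i1": ["o1", "o2"], "o1": ["i1", "g1"], "o2": ["i1"], "g1": ["o1"]}

all_hits = search(nodes, links, "aluminium")
option_hits = search(nodes, links, "aluminium", kind="option")
```

Here the plain query matches both an option and an argument, while the filtered query returns only the option together with the issue it answers and the argument attached to it.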
Abstract:
Compared with construction data sources that are usually stored and analyzed in spreadsheets and single data tables, data sources with more complicated structures, such as text documents, site images, web pages, and project schedules, have been less intensively studied due to additional challenges in data preparation, representation, and analysis. In this paper, our definition of and vision for advanced data analysis addressing such challenges are presented, together with related research results from previous work and our recent developments in data analysis on text-based, image-based, web-based, and network-based construction sources. It is shown that particular data preparation, representation, and analysis operations should be identified and integrated with careful problem investigations and scientific validation measures in order to provide general frameworks in support of information search and knowledge discovery from such information-abundant data sources.
Identifying cancer subtypes in glioblastoma by combining genomic, transcriptomic and epigenomic data
Abstract:
We present a nonparametric Bayesian method for disease subtype discovery in multi-dimensional cancer data. Our method can simultaneously analyse a wide range of data types, allowing for both agreement and disagreement between their underlying clustering structure. It includes feature selection and infers the most likely number of disease subtypes, given the data. We apply the method to 277 glioblastoma samples from The Cancer Genome Atlas, for which there are gene expression, copy number variation, methylation and microRNA data. We identify 8 distinct consensus subtypes and study their prognostic value for death, new tumour events, progression and recurrence. The consensus subtypes are prognostic of tumour recurrence (log-rank p-value of $3.6 \times 10^{-4}$ after correction for multiple hypothesis tests). This is driven principally by the methylation data (log-rank p-value of $2.0 \times 10^{-3}$) but the effect is strengthened by the other 3 data types, demonstrating the value of integrating multiple data types. Of particular note is a subtype of 47 patients characterised by very low levels of methylation. This subtype has very low rates of tumour recurrence and no new events in 10 years of follow-up. We also identify a small gene expression subtype of 6 patients that shows particularly poor survival outcomes. Additionally, we note a consensus subtype that shows a highly distinctive data signature and suggest that it is therefore a biologically distinct subtype of glioblastoma. The code is available from https://sites.google.com/site/multipledatafusion/