788 results for data mining applications
Abstract:
The paper introduces a method for discovering dependencies during human-machine interaction. It is based on the analysis of numerical data sets in knowledge-poor environments. The driven procedures are independent and interact on a competitive principle. The research focuses on seven of them. The application area is number theory.
Abstract:
Formal grammars can be used for describing complex repeatable structures such as DNA sequences. In this paper, we describe the structural composition of DNA sequences using a context-free stochastic L-grammar. L-grammars are a special class of parallel grammars that can model the growth of living organisms, e.g. plant development, and the morphology of a variety of organisms. We believe that parallel grammars can also be used for modeling genetic mechanisms and sequences such as promoters. Promoters are short regulatory DNA sequences located upstream of a gene. Detection of promoters in DNA sequences is important for successful gene prediction. Promoters can be recognized by certain patterns that are conserved within a species, but there are many exceptions, which makes promoter recognition a complex problem. We replace the problem of promoter recognition with the induction of context-free stochastic L-grammar rules, which are later used for the structural analysis of promoter sequences. L-grammar rules are derived automatically from the Drosophila and vertebrate promoter datasets using a genetic programming technique, and their fitness is evaluated using a Support Vector Machine (SVM) classifier. The artificial promoter sequences generated using the derived L-grammar rules are analyzed and compared with natural promoter sequences.
Abstract:
The paper considers the problem of the semantic gap between multimedia content and its manually assigned textual description. A combined approach to representing multimedia semantics is proposed, based on grouping multimedia items that are similar in content and textual description into classes containing generalized descriptions of objects, the relations between them, and keywords of the textual metadata drawn from a thesaurus. Hierarchical clustering and machine learning operations are used to form these classes. This approach broadens the scope of multimedia search and navigation by bringing in media data with similar content and textual descriptions.
Abstract:
In the nonparametric framework of Data Envelopment Analysis (DEA), the statistical properties of the estimators have been investigated, but only asymptotic results are available. For DEA estimators, results of practical use have been proved only for the case of one input and one output. However, in real-world problems the production process is usually described by many variables. In this paper, a machine learning approach to variable aggregation based on Canonical Correlation Analysis is presented. This approach is applied to efficiency estimation for all the farms on Terceira Island in the Azorean archipelago.
Abstract:
The purpose of the optimal valid partitioning (OVP) methods discussed here is to uncover the effect of ordinal or continuous explanatory variables on outcome variables of different types. The OVP approach searches for partitions of the explanatory variable space that best separate observations with different outcome levels. Partitions of the ranges of single variables, or of two-dimensional admissible areas for pairs of variables, are searched within corresponding families. The statistical validity of the revealed regularities is estimated with a permutation test that repeats the search for an optimal partition on each permuted dataset. A method for selecting output regularities is discussed, based on evaluating validity with two types of permutation tests.
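The permutation test described above can be illustrated with a minimal sketch. It assumes a single numeric explanatory variable, a numeric outcome, and a two-part partition scored by the absolute difference in mean outcome. The function names `best_split` and `permutation_p_value`, the scoring rule, and the `min_size` parameter are illustrative assumptions, not the OVP families or quality functionals used by the authors.

```python
import random

def best_split(x, y, min_size=2):
    """Return (threshold, score) for the cut on x that maximises the
    absolute difference between mean outcomes on the two sides.
    (Assumed scoring rule, chosen for illustration only.)"""
    pairs = sorted(zip(x, y))
    n = len(pairs)
    total = sum(v for _, v in pairs)
    best = (None, 0.0)
    left_sum = 0.0
    for i in range(1, n):
        left_sum += pairs[i - 1][1]
        # Skip cuts between equal x values and cuts that leave a tiny group.
        if pairs[i - 1][0] == pairs[i][0] or i < min_size or n - i < min_size:
            continue
        score = abs(left_sum / i - (total - left_sum) / (n - i))
        if score > best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, score)
    return best

def permutation_p_value(x, y, n_perm=200, seed=0):
    """Estimate how often a random relabelling of outcomes yields an
    optimal partition at least as good as the observed one.
    The optimal-partition search is repeated on every permuted dataset."""
    rng = random.Random(seed)
    _, observed = best_split(x, y)
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        _, score = best_split(x, y_perm)
        if score >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)
```

Repeating the full search inside each permutation is what corrects for the optimism of picking the best cut: the observed score is compared against the best score achievable by chance, not against the score of a fixed cut.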
Abstract:
Sequential pattern mining is an important subject in data mining with broad applications in many different areas. However, previous sequential mining algorithms mostly aimed to calculate the number of occurrences (the support) without regard to the degree of importance of different data items. In this paper, we propose to explore the search space of subsequences with normalized weights. We are interested not only in the number of occurrences of the sequences (their supports) but also in their importance (their weights). When generating subsequence candidates, we use both the support and the weight of the candidates while maintaining the downward closure property of these patterns, which accelerates candidate generation.
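As a rough illustration of combining support and weight without losing downward closure, the sketch below prunes candidates with the bound support × maximum item weight, which can only shrink as a pattern grows. This is one common way to keep pruning anti-monotone and is an assumption here, not necessarily the authors' scheme. The names `mine` and `min_wsup`, and the use of the mean item weight as the pattern weight, are likewise hypothetical.

```python
def is_subsequence(pattern, sequence):
    """True if pattern occurs in sequence in order (gaps allowed)."""
    it = iter(sequence)
    return all(item in it for item in pattern)

def support(pattern, database):
    """Fraction of sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, s) for s in database) / len(database)

def mine(database, weights, min_wsup):
    """Level-wise search for patterns whose support * mean item weight
    reaches min_wsup (must be > 0 so that the search terminates)."""
    items = sorted({i for s in database for i in s})
    max_w = max(weights[i] for i in items)
    results = {}
    frontier = [()]
    while frontier:
        next_frontier = []
        for prefix in frontier:
            for item in items:
                cand = prefix + (item,)
                sup = support(cand, database)
                # Anti-monotone upper bound: no extension of cand can
                # score higher than sup * max_w, so pruning here is safe.
                if sup * max_w < min_wsup:
                    continue
                next_frontier.append(cand)
                mean_w = sum(weights[i] for i in cand) / len(cand)
                if sup * mean_w >= min_wsup:
                    results[cand] = sup * mean_w
        frontier = next_frontier
    return results
```

Note that a candidate can survive pruning without being reported: it is kept for extension whenever the upper bound passes, but it enters the result only if its own weighted support passes.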
Abstract:
The Semantic Web has come a long way since its inception in 2001, especially in terms of technical development and research progress. However, adoption by non-technical practitioners is still an ongoing process, and in some areas this process is just now starting. Emergency response is an area where reliability and timeliness of information and technologies are of the essence. Therefore it is quite natural that more widespread adoption in this area has not been seen until now, when Semantic Web technologies are mature enough to support the high requirements of the application area. Nevertheless, to leverage the full potential of Semantic Web research results for this application area, there is a need for an arena where practitioners and researchers can meet and exchange ideas and results. Our intention is for this workshop, and hopefully coming workshops in the same series, to be such an arena for discussion. The Extended Semantic Web Conference (ESWC - formerly the European Semantic Web Conference) is one of the major research conferences in the Semantic Web field, which makes it a suitable location for discussing the application of Semantic Web technology to our specific area of applications. Hence, we chose to arrange our first SMILE workshop at ESWC 2013. However, this workshop does not focus solely on semantic technologies for emergency response, but rather on Semantic Web technologies in combination with technologies and principles for what is sometimes called the "social web". Social media has already been used successfully in many cases as a tool for supporting emergency response. The aim of this workshop is therefore to take this to the next level and answer questions like: "how can we make sense of, and furthermore make use of, all the data that is produced by different kinds of social media platforms in an emergency situation?"
For the first edition of this workshop the chairs collected the following main topics of interest:
• Semantic Annotation for understanding the content and context of social media streams.
• Integration of Social Media with Linked Data.
• Interactive Interfaces and visual analytics methodologies for managing multiple large-scale, dynamic, evolving datasets.
• Stream reasoning and event detection.
• Social Data Mining.
• Collaborative tools and services for Citizens, Organisations, Communities.
• Privacy, ethics, trustworthiness and legal issues in the Social Semantic Web.
• Use case analysis, with specific interest for use cases that involve the application of Social Media and Linked Data methodologies in real-life scenarios.
All of these, applied in the context of:
• Crisis and Disaster Management
• Emergency Response
• Security and Citizen Journalism
The workshop received 6 high-quality paper submissions and, based on a thorough review process carried out by our program committee, the decision was made to accept four of these papers for the workshop (67% acceptance rate). These four papers can be found later in this proceedings volume. Three of the four papers discuss the integration and analysis of social media data using Semantic Web technologies, e.g. for detecting complex events in social media streams, for visualizing and analysing sentiments with respect to certain topics in social media, or for detecting small-scale incidents entirely through the use of social media information. The fourth paper presents an architecture for using Semantic Web technologies in resource management during a disaster. Additionally, the workshop featured an invited keynote speech by Dr. Tomi Kauppinen from Aalto University. Dr. Kauppinen shared experiences from his work on applying Semantic Web technologies to application fields such as geoinformatics and scientific research, i.e. so-called Linked Science, but also recent ideas and applications in the emergency response field. His input was also highly valuable for the roadmapping discussion, which was held at the end of the workshop. A separate summary of the roadmapping session can be found at the end of these proceedings. Finally, we would like to thank our invited speaker Dr. Tomi Kauppinen, all our program committee members, as well as the workshop chair of ESWC 2013, Johanna Völker (University of Mannheim), for helping us to make this first SMILE workshop a highly interesting and successful event!
Abstract:
Today, databases have become an integral part of information systems. In the past two decades, we have seen different database systems being developed independently and used in different application domains. Today's interconnected networks and advanced applications, such as data warehousing, data mining and knowledge discovery, and intelligent data access to information on the Web, have created a need for integrated access to such heterogeneous, autonomous, distributed database systems. Heterogeneous/multidatabase research has focused on this issue, resulting in many different approaches. However, no single, generally accepted methodology has emerged in academia or industry that provides ubiquitous intelligent data access from heterogeneous, autonomous, distributed information sources.
This thesis describes a heterogeneous database system being developed at the High-performance Database Research Center (HPDRC). A major impediment to ubiquitous deployment of multidatabase technology is the difficulty in resolving semantic heterogeneity, that is, identifying related information sources for integration and querying purposes. Our approach considers the semantics of the meta-data constructs in resolving this issue. The major contributions of the thesis work include: (i) providing a scalable, easy-to-implement architecture for developing a heterogeneous multidatabase system, utilizing the Semantic Binary Object-oriented Data Model (Sem-ODM) and the Semantic SQL query language to capture the semantics of the data sources being integrated and to provide an easy-to-use query facility; (ii) a methodology for semantic heterogeneity resolution by investigating the extents of the meta-data constructs of component schemas. This methodology is shown to be correct, complete and unambiguous; (iii) a semi-automated technique for identifying semantic relations, which is the basis of semantic knowledge for integration and querying, using shared ontologies for context-mediation; (iv) resolutions for schematic conflicts and a language for defining global views from a set of component Sem-ODM schemas; (v) design of a knowledge base for storing and manipulating meta-data and knowledge acquired during the integration process. This knowledge base acts as the interface between integration and query processing modules; (vi) techniques for Semantic SQL query processing and optimization based on semantic knowledge in a heterogeneous database environment; and (vii) a framework for intelligent computing and communication on the Internet applying the concepts of our work.
Abstract:
The main challenges of multimedia data retrieval lie in the effective mapping between low-level features and high-level concepts, and in the individual users' subjective perceptions of multimedia content.
The objective of this dissertation is to develop an integrated multimedia indexing and retrieval framework that bridges the gap between semantic concepts and low-level features. To achieve this goal, a set of core techniques has been developed, including image segmentation, content-based image retrieval, object tracking, video indexing, and video event detection. These core techniques are integrated in a systematic way to enable semantic search for images and videos, and can be tailored to solve problems in other multimedia-related domains. In image retrieval, two new methods of bridging the semantic gap are proposed: (1) for general content-based image retrieval, a stochastic mechanism is utilized to enable the long-term learning of high-level concepts from a set of training data, such as user access frequencies and access patterns of images; (2) in addition to whole-image retrieval, a novel multiple instance learning framework is proposed for object-based image retrieval, which allows a user to search more effectively for images that contain multiple objects of interest. An enhanced image segmentation algorithm is developed to extract the object information from images. This segmentation algorithm is further used in video indexing and retrieval, where a robust video shot/scene segmentation method is developed based on low-level visual feature comparison, object tracking, and audio analysis. Based on shot boundaries, a novel data mining framework is further proposed to detect events in soccer videos, while fully utilizing the multi-modality features and object information obtained through video shot/scene detection.
Another contribution of this dissertation is the potential of the above techniques to be tailored and applied to other multimedia applications. This is demonstrated by their utilization in traffic video surveillance applications. The enhanced image segmentation algorithm, coupled with an adaptive background learning algorithm, improves the performance of vehicle identification. A sophisticated object tracking algorithm is proposed to track individual vehicles, while the spatial and temporal relationships of vehicle objects are modeled by an abstract semantic model.
Abstract:
The primary aim of this dissertation is to develop data mining tools for knowledge discovery in biomedical data when multiple (homogeneous or heterogeneous) sources of data are available. The central hypothesis is that, when information from multiple sources of data is used appropriately and effectively, knowledge discovery can be better achieved than what is possible from only a single source.
Recent advances in high-throughput technology have enabled biomedical researchers to generate large volumes of diverse types of data on a genome-wide scale. These data include DNA sequences, gene expression measurements, and much more; they provide the motivation for building analysis tools to elucidate the modular organization of the cell. The challenges include efficiently and accurately extracting information from the multiple data sources, representing the information effectively, developing analytical tools, and interpreting the results in the context of the domain.
The first part considers the application of feature-level integration to design classifiers that discriminate between soil types. The machine learning tools SVM and KNN were used to successfully distinguish between several soil samples.
The second part considers clustering using multiple heterogeneous data sources. The resulting Multi-Source Clustering (MSC) algorithm was shown to perform better than clustering methods that use only a single data source or a simple feature-level integration of heterogeneous data sources.
The third part proposes a new approach to effectively incorporate incomplete data into clustering analysis. Adapted from the K-means algorithm, the Generalized Constrained Clustering (GCC) algorithm makes use of incomplete data in the form of constraints to perform exploratory analysis. Novel approaches for extracting constraints were proposed. For sufficiently large constraint sets, the GCC algorithm outperformed the MSC algorithm.
The last part considers the problem of providing a theme-specific environment for mining multi-source biomedical data. The database called PlasmoTFBM, focusing on gene regulation of Plasmodium falciparum, contains diverse information and has a simple interface to allow biologists to explore the data. It provided a framework for comparing different analytical tools for predicting regulatory elements and for designing useful data mining tools.
The conclusion is that the experiments reported in this dissertation strongly support the central hypothesis.
Abstract:
With increasing competition and more demanding members, clubs need a tool to help them better attract and retain members and predict their behavior. Data mining is such a tool. This article presents an overview of how data warehousing, data marting, and data mining can provide the foundation on which clubs can build strategies to outsmart competitors, build loyalty, identify new members, and lower costs.
Abstract:
Large read-only or read-write transactions with a large read set and a small write set constitute an important class of transactions used in such applications as data mining, data warehousing, statistical applications, and report generators. Such transactions are best supported with optimistic concurrency, because locking large amounts of data for extended periods of time is not an acceptable solution. The abort rate in regular optimistic concurrency algorithms increases exponentially with the size of the transaction. The algorithm proposed in this dissertation solves this problem by using a new transaction scheduling technique that allows a large transaction to commit safely with a significantly greater probability, which can exceed that of regular optimistic concurrency algorithms by several orders of magnitude. A performance simulation study and a formal proof of serializability and external consistency of the proposed algorithm are also presented.
This dissertation also proposes a new query optimization technique (lazy queries). Lazy Queries is an adaptive query execution scheme which optimizes itself as the query runs. Lazy queries can be used to find an intersection of sub-queries in a very efficient way, which requires neither full execution of large sub-queries nor any statistical knowledge about the data.
An efficient optimistic concurrency control algorithm used in a massively parallel B-tree with variable-length keys is introduced. B-trees with variable-length keys can be effectively used in a variety of database types. In particular, we show how such a B-tree was used in our implementation of a semantic object-oriented DBMS. The concurrency control algorithm uses semantically safe optimistic virtual "locks" that achieve very fine granularity in conflict detection. This algorithm ensures serializability and external consistency by using logical clocks and backward validation of transactional queries. A formal proof of correctness of the proposed algorithm is also presented.
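For readers unfamiliar with backward validation, the following single-threaded sketch shows the textbook form of optimistic concurrency control with a logical clock. It is not the scheduling technique or the virtual "locks" proposed in the dissertation. The classes `Store` and `Txn` and all of their methods are hypothetical names introduced only for illustration.

```python
class Store:
    """Toy key-value store with a commit log for backward validation."""
    def __init__(self):
        self.data = {}
        self.clock = 0   # logical clock: number of commits so far
        self.log = []    # (commit_time, frozenset of keys written)

    def begin(self):
        return Txn(self, self.clock)

class Txn:
    def __init__(self, store, start):
        self.store = store
        self.start = start   # logical time at which the transaction began
        self.reads = set()   # keys read from the store
        self.writes = {}     # buffered writes, applied only on commit

    def read(self, key):
        if key in self.writes:   # read-your-own-writes
            return self.writes[key]
        self.reads.add(key)
        return self.store.data.get(key)

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        # Backward validation: abort if any transaction that committed
        # after this one started wrote a key that this one has read.
        for commit_time, keys in self.store.log:
            if commit_time > self.start and keys & self.reads:
                return False
        self.store.clock += 1
        self.store.data.update(self.writes)
        self.store.log.append((self.store.clock, frozenset(self.writes)))
        return True
```

The sketch also makes the problem stated in the abstract visible: the larger a transaction's read set and the longer it runs, the more committed writes it must be validated against, so its chance of aborting grows.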
Abstract:
The thesis presents a study of D3, a web graphics library developed in JavaScript, and catalogues the charts implemented with it and available on the web. The aim is to evaluate the library and study its strengths and weaknesses in order to determine whether it is suitable for use in a European project. To this end, the chart classification methods found in the literature are studied, and the state of the art in data visualization is presented and described. The classification method proposed by the design team is then described, and the chart gallery on the D3 library's website is catalogued. Finally, an algorithm for selecting a chart according to the user's needs is presented and studied formally.
Abstract:
Peer reviewed