774 results for outlier detection, data mining, gpgpu, gpu computing, supercomputing
Abstract:
This paper describes a method (the BIDIMS algorithm) for mapping multivariate objects onto a two-dimensional structure in which the sum of differences between the properties of objects and those of their nearest neighbors is minimal. Under this ordering, the basic regularities of the object set become evident. Moreover, such structures (tables) have high inductive power: many latent properties of objects can be predicted from their coordinates in the table. The capabilities of the method are illustrated by a two-dimensional ordering of the chemical elements; the resulting table practically coincides with Mendeleev's periodic table.
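The objective in the abstract (arrange objects so that the total property difference between grid neighbors is minimal) can be illustrated with a greedy swap descent. A minimal Python sketch, assuming a single scalar property per object for brevity (a multivariate version would sum per-component differences); this illustrates the objective only and is not the BIDIMS algorithm itself:

```python
import random

def neighbor_cost(grid, props):
    """Sum of absolute property differences between horizontally and
    vertically adjacent cells (the quantity the ordering minimizes)."""
    rows, cols = len(grid), len(grid[0])
    total = 0.0
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((0, 1), (1, 0)):
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols:
                    total += abs(props[grid[r][c]] - props[grid[rr][cc]])
    return total

def order_2d(objects, props, rows, cols, iters=20000):
    """Greedy pairwise-swap descent toward a locally minimal arrangement."""
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    grid = [[objects[r * cols + c] for c in range(cols)] for r in range(rows)]
    cost = neighbor_cost(grid, props)
    for _ in range(iters):
        (r1, c1), (r2, c2) = random.sample(cells, 2)
        grid[r1][c1], grid[r2][c2] = grid[r2][c2], grid[r1][c1]
        new_cost = neighbor_cost(grid, props)
        if new_cost < cost:
            cost = new_cost
        else:  # revert the swap if it did not help
            grid[r1][c1], grid[r2][c2] = grid[r2][c2], grid[r1][c1]
    return grid, cost

# Toy usage with an invented scalar property.
elems = list(range(12))
props = {i: float(i % 7) for i in elems}
grid, cost = order_2d(elems, props, rows=3, cols=4)
```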
Abstract:
Formal grammars can be used to describe complex repeatable structures such as DNA sequences. In this paper, we describe the structural composition of DNA sequences using a context-free stochastic L-grammar. L-grammars are a special class of parallel grammars that can model the growth of living organisms, e.g. plant development, and the morphology of a variety of organisms. We believe that parallel grammars can also be used for modeling genetic mechanisms and sequences such as promoters. Promoters are short regulatory DNA sequences located upstream of a gene. Detection of promoters in DNA sequences is important for successful gene prediction. Promoters can be recognized by certain patterns that are conserved within a species, but there are many exceptions, which makes promoter recognition a complex problem. We replace the problem of promoter recognition by the induction of context-free stochastic L-grammar rules, which are later used for the structural analysis of promoter sequences. L-grammar rules are derived automatically from Drosophila and vertebrate promoter datasets using a genetic programming technique, and their fitness is evaluated using a Support Vector Machine (SVM) classifier. The artificial promoter sequences generated using the derived L-grammar rules are analyzed and compared with natural promoter sequences.
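As an illustration of the parallel rewriting that stochastic L-grammars perform, here is a minimal derivation sketch in Python. The rule set and probabilities are invented for illustration and are not derived from the Drosophila or vertebrate promoter datasets:

```python
import random

# Hypothetical stochastic L-grammar over a DNA-like alphabet: each nonterminal
# maps to (replacement, probability) alternatives whose probabilities sum to 1.
RULES = {
    "S": [("TAS", 0.5), ("GCS", 0.3), ("A", 0.2)],
}

def rewrite(string, rules):
    """One parallel derivation step: every symbol is rewritten simultaneously;
    symbols without productions are terminals and are copied unchanged."""
    out = []
    for sym in string:
        if sym in rules:
            alts = [alt for alt, _ in rules[sym]]
            probs = [p for _, p in rules[sym]]
            out.append(random.choices(alts, weights=probs)[0])
        else:
            out.append(sym)
    return "".join(out)

def derive(axiom, rules, steps):
    s = axiom
    for _ in range(steps):
        s = rewrite(s, rules)
    return s

print(derive("S", RULES, steps=6))  # prints a randomly derived DNA-like string
```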
Abstract:
A novel association rule mining algorithm is constructed using the unit cube chain decomposition structures introduced in [HAN, 1966; TON, 1976]. [HAN, 1966] established the chain split theory; [TON, 1976] invented an excellent chain computation framework which brings chain split into the practical domain. We integrate these technologies into the rule mining procedure. Effectiveness here refers to the low complexity of the rules mined. The complexity of the resulting procedure is complementary to that of the well-known Apriori algorithm, the de facto standard in the rule mining area.
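Since the abstract positions its procedure against Apriori, a minimal Python sketch of the Apriori baseline (level-wise candidate generation with support pruning) may help fix ideas; it is the comparison baseline, not the chain-decomposition procedure described above:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Minimal Apriori frequent-itemset miner: level-wise candidate
    generation with subset-based pruning and support verification."""
    transactions = [frozenset(t) for t in transactions]
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    level = {c for c in items
             if sum(c <= t for t in transactions) >= min_support}
    k = 1
    while level:
        for c in level:
            frequent[c] = sum(c <= t for t in transactions)
        # Join step: union pairs of frequent k-itemsets into (k+1)-candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune step: keep candidates whose every k-subset is frequent,
        # then verify support against the database.
        level = {c for c in candidates
                 if all(frozenset(s) in frequent for s in combinations(c, k))
                 and sum(c <= t for t in transactions) >= min_support}
        k += 1
    return frequent

print(apriori([{"a", "b", "c"}, {"a", "b"}, {"a", "c"}], min_support=2))
# {'a'}: 3, {'b'}: 2, {'c'}: 2, {'a','b'}: 2, {'a','c'}: 2
```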
Abstract:
This paper surveys data mining research related to discovering dependencies between attributes in databases. We consider a number of approaches: finding the distribution intervals of association rules, discovering branching dependencies between a given set of attributes and a given attribute in a database relation, finding fractional dependencies between a given set of attributes and a given attribute in a database relation, and collaborative filtering.
Abstract:
This article presents some of the results of the Ph.D. thesis "Class Association Rule Mining Using MultiDimensional Numbered Information Spaces" by Iliya Mitov (Institute of Mathematics and Informatics, BAS), successfully defended at Hasselt University, Faculty of Science, Belgium, on 15 November 2011.
Abstract:
The "trial and error" method is fundamental for Master Mind decision algorithms. On the basis of Master Mind games and strategies, we consider some data mining methods for tests using students as teachers. Voting, twins, opposite, simulate, and observer methods are investigated. For a pure database these combinatorial algorithms are faster than many AI and Master Mind methods. The complexities of these algorithms are compared with basic combinatorial methods in AI. ACM Computing Classification System (1998): F.3.2, G.2.1, H.2.1, H.2.8, I.2.6.
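The "trial and error" loop underlying all such strategies rests on the classic Master Mind feedback computation, sketched below in Python; the voting, twins, opposite, simulate, and observer methods themselves are not reproduced here:

```python
from collections import Counter

def score(secret, guess):
    """Classic Master Mind feedback as (black, white) pegs:
    black = right symbol in the right position;
    white = right symbol in the wrong position."""
    black = sum(s == g for s, g in zip(secret, guess))
    common = sum((Counter(secret) & Counter(guess)).values())
    return black, common - black

print(score("ABCD", "ABDC"))  # (2, 2): A and B exact, C and D misplaced
```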
Abstract:
Rough set approaches to attribute reduction are an important research subject in data mining and machine learning. However, most attribute reduction methods operate on a complete decision table. In this paper, we propose methods for attribute reduction in static incomplete decision systems and in dynamic incomplete decision systems whose conditional attributes increase or decrease over time. Our methods use a generalized discernibility matrix and discernibility function in tolerance-based rough sets.
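A minimal Python sketch of a discernibility matrix under a tolerance relation for incomplete tables (a missing value, None here, discerns nothing); the toy table is an invented example, and the handling of dynamically changing attributes in the paper's generalized construction is not reproduced:

```python
def discernibility_matrix(table, attrs):
    """For each pair of objects, collect the attributes that certainly
    discern them; under the tolerance relation a missing value (None)
    discerns nothing. Attribute reducts are then read off the matrix."""
    n = len(table)
    matrix = {}
    for i in range(n):
        for j in range(i + 1, n):
            matrix[(i, j)] = {a for a in attrs
                              if table[i][a] is not None
                              and table[j][a] is not None
                              and table[i][a] != table[j][a]}
    return matrix

# Illustrative incomplete decision table: None = unknown conditional value.
rows = [{"a": 1, "b": None}, {"a": 1, "b": 2}, {"a": 0, "b": 2}]
print(discernibility_matrix(rows, attrs=["a", "b"]))
# {(0, 1): set(), (0, 2): {'a'}, (1, 2): {'a'}}
```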
Abstract:
Sequential pattern mining is an important subject in data mining with broad applications in many different areas. However, previous sequential mining algorithms mostly aimed to count occurrences (the support) without regard to the importance of different data items. In this paper, we propose to explore the search space of subsequences with normalized weights: we are interested not only in the number of occurrences of the sequences (their supports) but also in their importance (weights). When generating subsequence candidates we use both the support and the weight of the candidates while maintaining the downward closure property of these patterns, which accelerates candidate generation.
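The downward-closure idea can be sketched as follows: using the maximum item weight as an upper bound keeps the pruning test anti-monotone. A minimal Python sketch under assumed definitions (weighted support as support times mean item weight), not the paper's exact scheme:

```python
def contains(seq, pattern):
    """True if pattern occurs in seq as a (not necessarily contiguous) subsequence."""
    it = iter(seq)
    return all(item in it for item in pattern)

def weighted_support(pattern, sequences, weights):
    """Support scaled by the pattern's mean item weight (assumed definition)."""
    support = sum(contains(seq, pattern) for seq in sequences)
    return support * sum(weights[i] for i in pattern) / len(pattern)

def prunable(pattern, sequences, weights, min_wsup):
    """Support times the global maximum item weight upper-bounds the weighted
    support of every super-pattern, so a candidate failing this bound can be
    discarded together with all of its extensions (downward closure)."""
    support = sum(contains(seq, pattern) for seq in sequences)
    return support * max(weights.values()) < min_wsup

seqs = [list("abca"), list("abc"), list("acb")]
w = {"a": 0.2, "b": 0.9, "c": 0.5}
print(weighted_support(("a", "b"), seqs, w))        # 3 occurrences * mean weight 0.55
print(prunable(("a", "b"), seqs, w, min_wsup=3.0))  # True: 3 * 0.9 = 2.7 < 3.0
```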
Abstract:
The Semantic Web has come a long way since its inception in 2001, especially in terms of technical development and research progress. However, adoption by non-technical practitioners is still an ongoing process, and in some areas this process is only now starting. Emergency response is an area where the reliability and timeliness of information and technologies are of the essence. It is therefore quite natural that more widespread adoption in this area has not been seen until now, when Semantic Web technologies are mature enough to support the high requirements of the application area. Nevertheless, to leverage the full potential of Semantic Web research results for this application area, there is a need for an arena where practitioners and researchers can meet and exchange ideas and results. Our intention is for this workshop, and hopefully coming workshops in the same series, to be such an arena for discussion. The Extended Semantic Web Conference (ESWC, formerly the European Semantic Web Conference) is one of the major research conferences in the Semantic Web field and is thus a suitable venue for discussing the application of Semantic Web technology to our specific area. Hence, we chose to arrange our first SMILE workshop at ESWC 2013.

However, this workshop does not focus solely on semantic technologies for emergency response, but rather on Semantic Web technologies in combination with technologies and principles for what is sometimes called the "social web". Social media has already been used successfully in many cases as a tool for supporting emergency response. The aim of this workshop is therefore to take this to the next level and answer questions like: "How can we make sense of, and furthermore make use of, all the data that is produced by different kinds of social media platforms in an emergency situation?"

For the first edition of this workshop the chairs collected the following main topics of interest:
• Semantic Annotation for understanding the content and context of social media streams.
• Integration of Social Media with Linked Data.
• Interactive Interfaces and visual analytics methodologies for managing multiple large-scale, dynamic, evolving datasets.
• Stream reasoning and event detection.
• Social Data Mining.
• Collaborative tools and services for Citizens, Organisations, Communities.
• Privacy, ethics, trustworthiness and legal issues in the Social Semantic Web.
• Use case analysis, with specific interest in use cases that involve the application of Social Media and Linked Data methodologies in real-life scenarios.

All of these, applied in the context of:
• Crisis and Disaster Management
• Emergency Response
• Security and Citizen Journalism

The workshop received 6 high-quality paper submissions and, based on a thorough review process, thanks to our program committee, the decision was made to accept four of these papers for the workshop (67% acceptance rate). These four papers can be found later in this proceedings volume. Three of the four papers discuss the integration and analysis of social media data using Semantic Web technologies, e.g. for detecting complex events in social media streams, for visualizing and analysing sentiments with respect to certain topics in social media, or for detecting small-scale incidents entirely through the use of social media information. Finally, the fourth paper presents an architecture for using Semantic Web technologies in resource management during a disaster.
Additionally, the workshop featured an invited keynote speech by Dr. Tomi Kauppinen from Aalto University. Dr. Kauppinen shared experiences from his work on applying Semantic Web technologies to application fields such as geoinformatics and scientific research, i.e. so-called Linked Science, as well as recent ideas and applications in the emergency response field. His input was also highly valuable for the roadmapping discussion held at the end of the workshop. A separate summary of the roadmapping session can be found at the end of these proceedings. Finally, we would like to thank our invited speaker Dr. Tomi Kauppinen, all our program committee members, as well as the workshop chair of ESWC 2013, Johanna Völker (University of Mannheim), for helping us make this first SMILE workshop a highly interesting and successful event!
Abstract:
This paper proposes a new thermography-based maximum power point tracking (MPPT) scheme to address photovoltaic (PV) partial shading faults. Solar power generation utilizes a large number of PV cells connected in series and in parallel in an array and physically distributed across a large field. When a PV module is faulted or partial shading occurs, the PV system exhibits a nonuniform distribution of generated electrical power and of the thermal profile, giving rise to multiple maximum power points (MPPs). Left untreated, this reduces overall power generation, and severe faults may propagate, resulting in damage to the system. In this paper, a thermal camera is employed for fault detection, and a new MPPT scheme is developed to move the operating point to an optimized MPP. Extensive data mining is conducted on the thermal camera images in order to locate global MPPs, and on this basis a virtual MPPT is set out to find the global MPP. This reduces MPPT time and can be used to calculate the MPP reference voltage. Finally, the proposed methodology is experimentally implemented and validated by tests on a 600-W PV array.
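The final selection step, choosing the global MPP among multiple local peaks and deriving the reference voltage, can be sketched as below. The synthetic two-peak P-V curve and the full voltage sweep are illustrative assumptions; the paper mines thermal images rather than sweeping the curve:

```python
import numpy as np

def global_mpp_reference(voltage, current):
    """Pick the global maximum power point from a sampled P-V sweep of a
    partially shaded array (which exhibits multiple local MPPs) and return
    the reference voltage an MPPT controller would track."""
    power = voltage * current
    k = int(np.argmax(power))
    return voltage[k], power[k]

# Illustrative two-peak P-V curve of a partially shaded string:
# a linear region plus a Gaussian bump creates a local and a global MPP.
v = np.linspace(0.0, 120.0, 500)
i = np.clip(8.0 - 0.09 * v, 0.0, None) + 4.0 * np.exp(-((v - 95.0) / 8.0) ** 2)
v_ref, p_max = global_mpp_reference(v, i)
print(f"track V = {v_ref:.1f} V, P = {p_max:.0f} W")
```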
Abstract:
Today, databases have become an integral part of information systems. In the past two decades, we have seen different database systems being developed independently and used in different application domains. Today's interconnected networks and advanced applications, such as data warehousing, data mining & knowledge discovery, and intelligent data access to information on the Web, have created a need for integrated access to such heterogeneous, autonomous, distributed database systems. Heterogeneous/multidatabase research has focused on this issue, resulting in many different approaches. However, no single, generally accepted methodology has emerged in academia or industry that provides ubiquitous intelligent data access from heterogeneous, autonomous, distributed information sources. This thesis describes a heterogeneous database system being developed at the High-performance Database Research Center (HPDRC). A major impediment to the ubiquitous deployment of multidatabase technology is the difficulty of resolving semantic heterogeneity, that is, identifying related information sources for integration and querying purposes. Our approach considers the semantics of the meta-data constructs in resolving this issue. The major contributions of the thesis work include: (i) a scalable, easy-to-implement architecture for developing a heterogeneous multidatabase system, utilizing the Semantic Binary Object-oriented Data Model (Sem-ODM) and the Semantic SQL query language to capture the semantics of the data sources being integrated and to provide an easy-to-use query facility; (ii) a methodology for semantic heterogeneity resolution that investigates the extents of the meta-data constructs of component schemas; this methodology is shown to be correct, complete, and unambiguous; (iii) a semi-automated technique for identifying semantic relations, the basis of semantic knowledge for integration and querying, using shared ontologies for context mediation; (iv) resolutions for schematic conflicts and a language for defining global views from a set of component Sem-ODM schemas; (v) the design of a knowledge base for storing and manipulating meta-data and knowledge acquired during the integration process, which acts as the interface between the integration and query processing modules; (vi) techniques for Semantic SQL query processing and optimization based on semantic knowledge in a heterogeneous database environment; and (vii) a framework for intelligent computing and communication on the Internet applying the concepts of our work.
Abstract:
The main challenges of multimedia data retrieval lie in the effective mapping between low-level features and high-level concepts, and in individual users' subjective perceptions of multimedia content. The objective of this dissertation is to develop an integrated multimedia indexing and retrieval framework that bridges the gap between semantic concepts and low-level features. To achieve this goal, a set of core techniques has been developed, including image segmentation, content-based image retrieval, object tracking, video indexing, and video event detection. These core techniques are integrated in a systematic way to enable semantic search for images and videos, and can be tailored to solve problems in other multimedia-related domains. In image retrieval, two new methods of bridging the semantic gap are proposed: (1) for general content-based image retrieval, a stochastic mechanism is utilized to enable the long-term learning of high-level concepts from a set of training data, such as user access frequencies and access patterns of images; (2) in addition to whole-image retrieval, a novel multiple instance learning framework is proposed for object-based image retrieval, by which a user can more effectively search for images that contain multiple objects of interest. An enhanced image segmentation algorithm is developed to extract object information from images. This segmentation algorithm is further used in video indexing and retrieval, where a robust video shot/scene segmentation method is developed based on low-level visual feature comparison, object tracking, and audio analysis. Based on shot boundaries, a novel data mining framework is proposed to detect events in soccer videos, fully utilizing the multi-modality features and object information obtained through video shot/scene detection. Another contribution of this dissertation is the potential of the above techniques to be tailored and applied to other multimedia applications, demonstrated by their utilization in traffic video surveillance. The enhanced image segmentation algorithm, coupled with an adaptive background learning algorithm, improves the performance of vehicle identification. A sophisticated object tracking algorithm is proposed to track individual vehicles, while the spatial and temporal relationships of vehicle objects are modeled by an abstract semantic model.
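As an illustration of the low-level visual feature comparison mentioned for shot/scene segmentation, a minimal Python sketch of histogram-difference cut detection follows; the bin count and threshold are assumed values, and the dissertation additionally fuses object tracking and audio analysis:

```python
import numpy as np

def shot_boundaries(frames, bins=16, threshold=0.4):
    """Flag a shot cut wherever the L1 distance between successive frames'
    normalized gray-level histograms exceeds a threshold."""
    cuts = []
    prev = None
    for idx, frame in enumerate(frames):
        hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
        hist = hist / max(hist.sum(), 1)  # normalize to a distribution
        if prev is not None and np.abs(hist - prev).sum() > threshold:
            cuts.append(idx)
        prev = hist
    return cuts
```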
Abstract:
The primary aim of this dissertation is to develop data mining tools for knowledge discovery in biomedical data when multiple (homogeneous or heterogeneous) sources of data are available. The central hypothesis is that, when information from multiple sources of data is used appropriately and effectively, knowledge discovery can be achieved better than is possible from a single source alone. Recent advances in high-throughput technology have enabled biomedical researchers to generate large volumes of diverse types of data on a genome-wide scale. These data include DNA sequences, gene expression measurements, and much more; they provide the motivation for building analysis tools to elucidate the modular organization of the cell. The challenges include efficiently and accurately extracting information from the multiple data sources, representing the information effectively, developing analytical tools, and interpreting the results in the context of the domain. The first part considers the application of feature-level integration to design classifiers that discriminate between soil types. The machine learning tools SVM and KNN were used to successfully distinguish between several soil samples. The second part considers clustering using multiple heterogeneous data sources. The resulting Multi-Source Clustering (MSC) algorithm was shown to perform better than clustering methods that use only a single data source or a simple feature-level integration of heterogeneous data sources. The third part proposes a new approach to effectively incorporate incomplete data into clustering analysis. Adapted from the K-means algorithm, the Generalized Constrained Clustering (GCC) algorithm makes use of incomplete data in the form of constraints to perform exploratory analysis, and novel approaches for extracting constraints are proposed. For sufficiently large constraint sets, the GCC algorithm outperformed the MSC algorithm. The last part considers the problem of providing a theme-specific environment for mining multi-source biomedical data. The database PlasmoTFBM, focusing on gene regulation of Plasmodium falciparum, contains diverse information and has a simple interface that allows biologists to explore the data. It provides a framework for comparing different analytical tools for predicting regulatory elements and for designing useful data mining tools. The conclusion is that the experiments reported in this dissertation strongly support the central hypothesis.
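The constraint mechanism that clustering with incomplete data builds on can be illustrated with a constrained assignment step in the style of COP-KMeans; this is a generic sketch, not the GCC algorithm itself, and all names are illustrative:

```python
import numpy as np

def assign_with_constraints(X, centers, must_link, cannot_link):
    """Assign each point to the nearest center that violates no constraint.
    must_link/cannot_link map a point index to indices it must / must not
    share a cluster with; points with no feasible center keep label -1."""
    labels = np.full(len(X), -1)
    for i, x in enumerate(X):
        # Try centers from nearest to farthest.
        for c in np.argsort(((centers - x) ** 2).sum(axis=1)):
            ok = all(labels[j] == c for j in must_link.get(i, []) if labels[j] != -1) \
                 and all(labels[j] != c for j in cannot_link.get(i, []) if labels[j] != -1)
            if ok:
                labels[i] = c
                break
    return labels

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
print(assign_with_constraints(X, centers, must_link={}, cannot_link={1: [0]}))
# [0 1 1]: point 1 is pushed to the other cluster by its cannot-link with point 0
```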
Abstract:
With increasing competition and more demanding members, clubs need a tool to help them better attract and retain members and predict their behavior. Data mining is such a tool. This article presents an overview of how data warehousing, data marting, and data mining can provide the foundation on which clubs can build strategies to outsmart competitors, build loyalty, identify new members, and lower costs.
Abstract:
Large read-only or read-write transactions with a large read set and a small write set constitute an important class of transactions used in applications such as data mining, data warehousing, statistical applications, and report generators. Such transactions are best supported with optimistic concurrency, because locking large amounts of data for extended periods of time is not an acceptable solution. The abort rate in regular optimistic concurrency algorithms increases exponentially with the size of the transaction. The algorithm proposed in this dissertation solves this problem by using a new transaction scheduling technique that allows a large transaction to commit safely with a probability that can exceed that of regular optimistic concurrency algorithms by several orders of magnitude. A performance simulation study and a formal proof of serializability and external consistency of the proposed algorithm are also presented. This dissertation also proposes a new query optimization technique (lazy queries). Lazy Queries is an adaptive query execution scheme that optimizes itself as the query runs. Lazy queries can be used to find an intersection of sub-queries very efficiently, without requiring full execution of large sub-queries or any statistical knowledge about the data. An efficient optimistic concurrency control algorithm used in a massively parallel B-tree with variable-length keys is introduced. B-trees with variable-length keys can be used effectively in a variety of database types. In particular, we show how such a B-tree was used in our implementation of a semantic object-oriented DBMS. The concurrency control algorithm uses semantically safe optimistic virtual "locks" that achieve very fine granularity in conflict detection. This algorithm ensures serializability and external consistency by using logical clocks and backward validation of transactional queries. A formal proof of correctness of the proposed algorithm is also presented.
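The backward validation mentioned at the end can be sketched in Python as follows; the transaction record is an assumed structure, and the dissertation's scheduling technique for large transactions is not reproduced:

```python
from dataclasses import dataclass

@dataclass
class Txn:
    """Illustrative transaction record for optimistic concurrency control."""
    start_ts: int
    commit_ts: int = -1          # -1 while still active
    read_set: frozenset = frozenset()
    write_set: frozenset = frozenset()

def backward_validate(txn, committed):
    """Backward validation: txn must abort if any transaction that committed
    during its lifetime wrote an item txn read. Large read-mostly transactions
    fail this test often, which is the abort-rate problem described above."""
    for other in committed:
        if other.commit_ts > txn.start_ts and txn.read_set & other.write_set:
            return False  # conflict: txn aborts and restarts
    return True  # serializable: safe to commit

t1 = Txn(start_ts=1, commit_ts=5, write_set=frozenset({"x"}))
t2 = Txn(start_ts=2, read_set=frozenset({"x", "y"}))
print(backward_validate(t2, [t1]))  # False: t1 committed 'x' during t2's lifetime
```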