866 resultados para mining closure


Relevância:

20.00% 20.00%

Publicador:

Resumo:

In data mining, an important goal is to generate an abstraction of the data. Such an abstraction helps in reducing the space and search time requirements of the overall decision making process. Further, it is important that the abstraction is generated from the data with a small number of disk scans. We propose a novel data structure, pattern count tree (PC-tree), that can be built by scanning the database only once. PC-tree is a minimal size complete representation of the data and it can be used to represent dynamic databases with the help of knowledge that is either static or changing. We show that further compactness can be achieved by constructing the PC-tree on segmented patterns. We exploit the flexibility offered by rough sets to realize a rough PC-tree and use it for efficient and effective rough classification. To be consistent with the sizes of the branches of the PC-tree, we use upper and lower approximations of feature sets in a manner different from the conventional rough set theory. We conducted experiments using the proposed classification scheme on a large-scale hand-written digit data set. We use the experimental results to establish the efficacy of the proposed approach. (C) 2002 Elsevier Science B.V. All rights reserved.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Rapid urbanisation in India has posed serious challenges to the decision makers in regional planning involving plethora of issues including provision of basic amenities (like electricity, water, sanitation, transport, etc.). Urban planning entails an understanding of landscape and urban dynamics with causal factors. Identifying, delineating and mapping landscapes on temporal scale provide an opportunity to monitor the changes, which is important for natural resource management and sustainable planning activities. Multi-source, multi-sensor, multi-temporal, multi-frequency or multi-polarization remote sensing data with efficient classification algorithms and pattern recognition techniques aid in capturing these dynamics. This paper analyses the landscape dynamics of Greater Bangalore by: (i) characterisation of direct impervious surface, (ii) computation of forest fragmentation indices and (iii) modeling to quantify and categorise urban changes. Linear unmixing is used for solving the mixed pixel problem of coarse resolution super spectral MODIS data for impervious surface characterisation. Fragmentation indices were used to classify forests – interior, perforated, edge, transitional, patch and undetermined. Based on this, urban growth model was developed to determine the type of urban growth – Infill, Expansion and Outlying growth. This helped in visualising urban growth poles and consequence of earlier policy decisions that can help in evolving strategies for effective land use policies.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Mining association rules from a large collection of databases is based on two main tasks. One is generation of large itemsets; and the other is finding associations between the discovered large itemsets. Existing formalism for association rules are based on a single transaction database which is not sufficient to describe the association rules based on multiple database environment. In this paper, we give a general characterization of association rules and also give a framework for knowledge-based mining of multiple databases for association rules.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Data mining is concerned with analysing large volumes of (often unstructured) data to automatically discover interesting regularities or relationships which in turn lead to better understanding of the underlying processes. The field of temporal data mining is concerned with such analysis in the case of ordered data streams with temporal interdependencies. Over the last decade many interesting techniques of temporal data mining were proposed and shown to be useful in many applications. Since temporal data mining brings together techniques from different fields such as statistics, machine learning and databases, the literature is scattered among many different sources. In this article, we present an overview of techniques of temporal data mining.We mainly concentrate on algorithms for pattern discovery in sequential data streams.We also describe some recent results regarding statistical analysis of pattern discovery methods.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

A method, system, and computer program product for fault data correlation in a diagnostic system are provided. The method includes receiving the fault data including a plurality of faults collected over a period of time, and identifying a plurality of episodes within the fault data, where each episode includes a sequence of the faults. The method further includes calculating a frequency of the episodes within the fault data, calculating a correlation confidence of the faults relative to the episodes as a function of the frequency of the episodes, and outputting a report of the faults with the correlation confidence.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

A system for temporal data mining includes a computer readable medium having an application configured to receive at an input module a temporal data series having events with start times and end times, a set of allowed dwelling times and a threshold frequency. The system is further configured to identify, using a candidate identification and tracking module, one or more occurrences in the temporal data series of a candidate episode and increment a count for each identified occurrence. The system is also configured to produce at an output module an output for those episodes whose count of occurrences results in a frequency exceeding the threshold frequency.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Chemical reactions inside cells are typically subject to the effects both of the cell's confining surfaces and of the viscoelastic behavior of its contents. In this paper, we show how the outcome of one particular reaction of relevance to cellular biochemistry - the diffusion-limited cyclization of long chain polymers - is influenced by such confinement and crowding effects. More specifically, starting from the Rouse model of polymer dynamics, and invoking the Wilemski-Fixman approximation, we determine the scaling relationship between the mean closure time t(c) of a flexible chain (no excluded volume or hydrodynamic interactions) and the length N of its contour under the following separate conditions: (a) confinement of the chain to a sphere of radius d and (b) modulation of its dynamics by colored Gaussian noise. Among other results, we find that in case (a) when d is much smaller than the size of the chain, t(c) similar to Nd-2, and that in case (b), t(c) similar to N-2/(2 (2H)), H being a number between 1/2 and 1 that characterizes the decay of the noise correlations. H is not known a priori, but values of about 0.7 have been used in the successful characterization of protein conformational dynamics. At this value of H (selected for purposes of illustration), t(c) similar to N-3.4, the high scaling exponent reflecting the slow relaxation of the chain in a viscoelastic medium. (C) 2012 American Institute of Physics. http://dx.doi.org/10.1063/1.4729041]

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Genome sequences contain a number of patterns that have biomedical significance. Repetitive sequences of various kinds are a primary component of most of the genomic sequence patterns. We extended the suffix-array based Biological Language Modeling Toolkit to compute n-gram frequencies as well as n-gram language-model based perplexity in windows over the whole genome sequence to find biologically relevant patterns. We present the suite of tools and their application for analysis on whole human genome sequence.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Song-selection and mood are interdependent. If we capture a song’s sentiment, we can determine the mood of the listener, which can serve as a basis for recommendation systems. Songs are generally classified according to genres, which don’t entirely reflect sentiments. Thus, we require an unsupervised scheme to mine them. Sentiments are classified into either two (positive/negative) or multiple (happy/angry/sad/...) classes, depending on the application. We are interested in analyzing the feelings invoked by a song, involving multi-class sentiments. To mine the hidden sentimental structure behind a song, in terms of “topics”, we consider its lyrics and use Latent Dirichlet Allocation (LDA). Each song is a mixture of moods. Topics mined by LDA can represent moods. Thus we get a scheme of collecting similar-mood songs. For validation, we use a dataset of songs containing 6 moods annotated by users of a particular website.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

We address the problem of mining targeted association rules over multidimensional market-basket data. Here, each transaction has, in addition to the set of purchased items, ancillary dimension attributes associated with it. Based on these dimensions, transactions can be visualized as distributed over cells of an n-dimensional cube. In this framework, a targeted association rule is of the form {X -> Y} R, where R is a convex region in the cube and X. Y is a traditional association rule within region R. We first describe the TOARM algorithm, based on classical techniques, for identifying targeted association rules. Then, we discuss the concepts of bottom-up aggregation and cubing, leading to the CellUnion technique. This approach is further extended, using notions of cube-count interleaving and credit-based pruning, to derive the IceCube algorithm. Our experiments demonstrate that IceCube consistently provides the best execution time performance, especially for large and complex data cubes.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

The rapid growth in the field of data mining has lead to the development of various methods for outlier detection. Though detection of outliers has been well explored in the context of numerical data, dealing with categorical data is still evolving. In this paper, we propose a two-phase algorithm for detecting outliers in categorical data based on a novel definition of outliers. In the first phase, this algorithm explores a clustering of the given data, followed by the ranking phase for determining the set of most likely outliers. The proposed algorithm is expected to perform better as it can identify different types of outliers, employing two independent ranking schemes based on the attribute value frequencies and the inherent clustering structure in the given data. Unlike some existing methods, the computational complexity of this algorithm is not affected by the number of outliers to be detected. The efficacy of this algorithm is demonstrated through experiments on various public domain categorical data sets.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

This paper primarily intends to develop a GIS (geographical information system)-based data mining approach for optimally selecting the locations and determining installed capacities for setting up distributed biomass power generation systems in the context of decentralized energy planning for rural regions. The optimal locations within a cluster of villages are obtained by matching the installed capacity needed with the demand for power, minimizing the cost of transportation of biomass from dispersed sources to power generation system, and cost of distribution of electricity from the power generation system to demand centers or villages. The methodology was validated by using it for developing an optimal plan for implementing distributed biomass-based power systems for meeting the rural electricity needs of Tumkur district in India consisting of 2700 villages. The approach uses a k-medoid clustering algorithm to divide the total region into clusters of villages and locate biomass power generation systems at the medoids. The optimal value of k is determined iteratively by running the algorithm for the entire search space for different values of k along with demand-supply matching constraints. The optimal value of the k is chosen such that it minimizes the total cost of system installation, costs of transportation of biomass, and transmission and distribution. A smaller region, consisting of 293 villages was selected to study the sensitivity of the results to varying demand and supply parameters. The results of clustering are represented on a GIS map for the region.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Mycobacterium tuberculosis owes its high pathogenic potential to its ability to evade host immune responses and thrive inside the macrophage. The outcome of infection is largely determined by the cellular response comprising a multitude of molecular events. The complexity and inter-relatedness in the processes makes it essential to adopt systems approaches to study them. In this work, we construct a comprehensive network of infection-related processes in a human macrophage comprising 1888 proteins and 14,016 interactions. We then compute response networks based on available gene expression profiles corresponding to states of health, disease and drug treatment. We use a novel formulation for mining response networks that has led to identifying highest activities in the cell. Highest activity paths provide mechanistic insights into pathogenesis and response to treatment. The approach used here serves as a generic framework for mining dynamic changes in genome-scale protein interaction networks.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Automatic and accurate detection of the closure-burst transition events of stops and affricates serves many applications in speech processing. A temporal measure named the plosion index is proposed to detect such events, which are characterized by an abrupt increase in energy. Using the maxima of the pitch-synchronous normalized cross correlation as an additional temporal feature, a rule-based algorithm is designed that aims at selecting only those events associated with the closure-burst transitions of stops and affricates. The performance of the algorithm, characterized by receiver operating characteristic curves and temporal accuracy, is evaluated using the labeled closure-burst transitions of stops and affricates of the entire TIMIT test and training databases. The robustness of the algorithm is studied with respect to global white and babble noise as well as local noise using the TIMIT test set and on telephone quality speech using the NTIMIT test set. For these experiments, the proposed algorithm, which does not require explicit statistical training and is based on two one-dimensional temporal measures, gives a performance comparable to or better than the state-of-the-art methods. In addition, to test the scalability, the algorithm is applied on the Buckeye conversational speech corpus and databases of two Indian languages. (C) 2014 Acoustical Society of America.