96 results for Pattern Mining
Abstract:
In this paper, we approach the classical problem of clustering using solution concepts from cooperative game theory such as the Nucleolus and the Shapley value. We formulate the problem of clustering as a characteristic form game and develop a novel algorithm, DRAC (Density-Restricted Agglomerative Clustering), for clustering. With extensive experimentation on standard data sets, we compare the performance of DRAC with that of well-known algorithms. We show an interesting result that four prominent solution concepts, the Nucleolus, Shapley value, Gately point and τ-value, coincide for the defined characteristic form game. This vindicates the choice of the characteristic function of the clustering game and also provides a strong intuitive foundation for our approach.
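As an illustration of the solution concept mentioned above, the following minimal sketch computes exact Shapley values for a tiny characteristic form game. The characteristic function `v` (sum of pairwise similarities between 1-D points) and the names `points` and `shapley_values` are assumptions made for this example only; this is not the DRAC algorithm, and the abstract does not specify the paper's characteristic function.

```python
from itertools import permutations
from math import factorial


def shapley_values(players, v):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += v(with_p) - v(coalition)
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: total / n_orders for p, total in totals.items()}


# Hypothetical data: three points on a line; "a" and "b" are close, "c" is far away.
points = {"a": 0.0, "b": 1.0, "c": 5.0}


def v(coalition):
    """Assumed characteristic function: sum of pairwise similarities in the coalition."""
    members = sorted(coalition)
    return sum(
        1.0 / (1.0 + abs(points[i] - points[j]))
        for idx, i in enumerate(members)
        for j in members[idx + 1:]
    )


phi = shapley_values(list(points), v)
```

Under this assumed game, points that sit close to many others receive a larger Shapley value, which is the kind of signal a game-theoretic clustering method could use to pick cluster centers.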
Abstract:
Network Intrusion Detection Systems (NIDS) intercept the traffic at an organization's network periphery to thwart intrusion attempts. Signature-based NIDS compares the intercepted packets against its database of known vulnerabilities and malware signatures to detect such cyber attacks. These signatures are represented using Regular Expressions (REs) and strings. Regular Expressions, because of their higher expressive power, are preferred over simple strings to write these signatures. We present Cascaded Automata Architecture to perform memory-efficient Regular Expression pattern matching using existing string matching solutions. The proposed architecture performs two-stage Regular Expression pattern matching. We replace the substring and character class components of the Regular Expression with new symbols. We address the challenges involved in this approach. We augment the Word-based Automata, obtained from the re-written Regular Expressions, with counter-based states and length-bound transitions to perform Regular Expression pattern matching. We evaluated our architecture on Regular Expressions taken from Snort rulesets. We were able to reduce the number of automata states by 50% to 85%. Additionally, we could reduce the number of transitions by a factor of 3, leading to a further reduction in the memory requirements.
Abstract:
Background: A better understanding of the quality of cellular immune responses directed against molecularly defined targets will guide the development of TB diagnostics and identification of molecularly defined, clinically relevant M.tb vaccine candidates. Methods: Recombinant proteins (n = 8) and peptide pools (n = 14) from M. tuberculosis (M.tb) targets were used to compare cellular immune responses defined by IFN-gamma and IL-17 production using a Whole Blood Assay (WBA) in a cohort of 148 individuals, i.e. patients with TB + (n = 38), TB- individuals with other pulmonary diseases (n = 81) and individuals exposed to TB without evidence of clinical TB (health care workers, n = 29). Results: M.tb antigens Rv2958c (glycosyltransferase), Rv2962c (mycolyltransferase), Rv1886c (Ag85B), Rv3804c (Ag85A), and the PPE family member Rv3347c were frequently recognized, defined by IFN-gamma production, in blood from healthy individuals exposed to M.tb (health care workers). A different recognition pattern was found for IL-17 production in blood from M.tb exposed individuals responding to TB10.4 (Rv0288), Ag85B (Rv1886c) and the PPE family members Rv0978c and Rv1917c. Conclusions: The pattern of immune target recognition is different in regard to IFN-gamma and IL-17 production to defined molecular M.tb targets in PBMCs from individuals frequently exposed to M.tb. The data represent the first mapping of cellular immune responses against M.tb targets in TB patients from Honduras.
Abstract:
We address the problem of mining targeted association rules over multidimensional market-basket data. Here, each transaction has, in addition to the set of purchased items, ancillary dimension attributes associated with it. Based on these dimensions, transactions can be visualized as distributed over cells of an n-dimensional cube. In this framework, a targeted association rule is of the form {X -> Y} R, where R is a convex region in the cube and X -> Y is a traditional association rule within region R. We first describe the TOARM algorithm, based on classical techniques, for identifying targeted association rules. Then, we discuss the concepts of bottom-up aggregation and cubing, leading to the CellUnion technique. This approach is further extended, using notions of cube-count interleaving and credit-based pruning, to derive the IceCube algorithm. Our experiments demonstrate that IceCube consistently provides the best execution time performance, especially for large and complex data cubes.
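The notion of a rule X -> Y holding within a region R can be made concrete with a small sketch that computes the support and confidence of a rule over only those transactions whose cell lies in R. The data layout (a cell tuple paired with an item set), the helper names and the sample transactions are assumptions for illustration; this is plain counting, not TOARM, CellUnion or IceCube.

```python
def rule_stats(transactions, region, X, Y):
    """Support and confidence of X -> Y over transactions whose cell lies in `region`."""
    in_region = [items for cell, items in transactions if region(cell)]
    n = len(in_region)
    n_x = sum(1 for items in in_region if X <= items)
    n_xy = sum(1 for items in in_region if (X | Y) <= items)
    support = n_xy / n if n else 0.0
    confidence = n_xy / n_x if n_x else 0.0
    return support, confidence


# Hypothetical 2-D cube: each transaction is (cell coordinates, purchased items).
transactions = [
    ((0, 0), {"bread", "butter"}),
    ((0, 1), {"bread", "butter", "milk"}),
    ((1, 0), {"bread"}),
    ((3, 3), {"bread", "milk"}),
    ((3, 4), {"milk"}),
]


def in_box(cell):
    """Assumed convex region R: the axis-aligned box [0, 1] x [0, 1]."""
    return 0 <= cell[0] <= 1 and 0 <= cell[1] <= 1


support, confidence = rule_stats(transactions, in_box, {"bread"}, {"butter"})
```

Here three transactions fall inside the box, all contain bread and two also contain butter, so both the support and the confidence of the rule within R are 2/3.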
Abstract:
The rapid growth in the field of data mining has led to the development of various methods for outlier detection. Though detection of outliers has been well explored in the context of numerical data, dealing with categorical data is still evolving. In this paper, we propose a two-phase algorithm for detecting outliers in categorical data based on a novel definition of outliers. In the first phase, this algorithm explores a clustering of the given data, followed by the ranking phase for determining the set of most likely outliers. The proposed algorithm is expected to perform better as it can identify different types of outliers, employing two independent ranking schemes based on the attribute value frequencies and the inherent clustering structure in the given data. Unlike some existing methods, the computational complexity of this algorithm is not affected by the number of outliers to be detected. The efficacy of this algorithm is demonstrated through experiments on various public domain categorical data sets.
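One way to picture a ranking based on attribute value frequencies is the simple score below: each record receives the average frequency of its attribute values, and records with the lowest scores are the most likely outliers. This is a generic frequency-based sketch with hypothetical data and names; it is not the two-phase algorithm proposed in the paper and it omits the clustering-based ranking entirely.

```python
from collections import Counter


def frequency_scores(rows):
    """Average frequency of each record's attribute values (lower = more unusual)."""
    n_attrs = len(rows[0])
    counts = [Counter(row[j] for row in rows) for j in range(n_attrs)]
    return [
        sum(counts[j][row[j]] for j in range(n_attrs)) / n_attrs
        for row in rows
    ]


# Hypothetical categorical records: (colour, size).
rows = [
    ("red", "small"),
    ("red", "small"),
    ("red", "large"),
    ("blue", "small"),
    ("green", "huge"),
]

scores = frequency_scores(rows)
# Indices ordered from most to least outlier-like.
ranked = sorted(range(len(rows)), key=lambda i: scores[i])
```

The record ("green", "huge") has two values that each occur only once, so it ranks first as the most likely outlier.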
Abstract:
This paper primarily intends to develop a GIS (geographical information system)-based data mining approach for optimally selecting the locations and determining installed capacities for setting up distributed biomass power generation systems in the context of decentralized energy planning for rural regions. The optimal locations within a cluster of villages are obtained by matching the installed capacity needed with the demand for power, minimizing the cost of transportation of biomass from dispersed sources to power generation system, and cost of distribution of electricity from the power generation system to demand centers or villages. The methodology was validated by using it for developing an optimal plan for implementing distributed biomass-based power systems for meeting the rural electricity needs of Tumkur district in India consisting of 2700 villages. The approach uses a k-medoid clustering algorithm to divide the total region into clusters of villages and locate biomass power generation systems at the medoids. The optimal value of k is determined iteratively by running the algorithm for the entire search space for different values of k along with demand-supply matching constraints. The optimal value of the k is chosen such that it minimizes the total cost of system installation, costs of transportation of biomass, and transmission and distribution. A smaller region, consisting of 293 villages was selected to study the sensitivity of the results to varying demand and supply parameters. The results of clustering are represented on a GIS map for the region.
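The idea of choosing k by total cost can be sketched on a toy example. The code below exhaustively picks the k medoids that minimise total Manhattan distance and then selects the k with the lowest combined cost. The village coordinates, the fixed `INSTALL_COST` per plant and the exhaustive search are all assumptions for illustration; the paper's actual cost model, demand-supply constraints and iterative k-medoid procedure are not reproduced here.

```python
from itertools import combinations


def manhattan(a, b):
    """Distance used as a stand-in for transport and distribution cost."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def best_medoids(points, k):
    """Exhaustively choose k medoids minimising total distance to the nearest medoid."""
    best = None
    for medoids in combinations(points, k):
        cost = sum(min(manhattan(p, m) for m in medoids) for p in points)
        if best is None or cost < best[0]:
            best = (cost, medoids)
    return best


# Hypothetical villages forming two well-separated groups.
villages = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
INSTALL_COST = 5.0  # assumed fixed cost of one power generation system

totals = {}
for k in range(1, 4):
    transport_cost, medoids = best_medoids(villages, k)
    totals[k] = transport_cost + INSTALL_COST * k

best_k = min(totals, key=totals.get)
```

With these made-up numbers, two plants (one per village group) give the lowest total cost; exhaustive search is only practical for very small inputs such as this one.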
Abstract:
Mycobacterium tuberculosis owes its high pathogenic potential to its ability to evade host immune responses and thrive inside the macrophage. The outcome of infection is largely determined by the cellular response comprising a multitude of molecular events. The complexity and inter-relatedness in the processes makes it essential to adopt systems approaches to study them. In this work, we construct a comprehensive network of infection-related processes in a human macrophage comprising 1888 proteins and 14,016 interactions. We then compute response networks based on available gene expression profiles corresponding to states of health, disease and drug treatment. We use a novel formulation for mining response networks that has led to identifying highest activities in the cell. Highest activity paths provide mechanistic insights into pathogenesis and response to treatment. The approach used here serves as a generic framework for mining dynamic changes in genome-scale protein interaction networks.
Abstract:
The design and development of a piezoelectric polyvinylidene fluoride (PVDF) thin film based nasal sensor to monitor the human respiration pattern (RP) from each nostril simultaneously is presented in this paper. The thin film based PVDF nasal sensor is designed in a cantilever beam configuration. Two cantilevers are mounted on a spectacle frame in such a way that the air flow from each nostril impinges on this sensor, causing bending of the cantilever beams. The voltage signal produced by the air-flow-induced dynamic piezoelectric effect yields the respective RP. A group of 23 healthy awake human subjects was studied. The RP in terms of respiratory rate (RR) and respiratory air-flow changes/alterations obtained from the developed PVDF nasal sensor is compared with the RP obtained from a respiratory inductance plethysmograph (RIP) device. The mean RR of the developed nasal sensor (19.65 ± 4.1) and the RIP (19.57 ± 4.1) are found to be almost the same (difference not significant, p > 0.05), with a correlation coefficient of 0.96, p < 0.0001. It was observed that any change/alteration in the pattern of the RIP is followed by the same amount of change/alteration in the pattern of the PVDF nasal sensor, with k = 0.815 indicating strong agreement between the PVDF nasal sensor and RIP respiratory air-flow patterns. The developed sensor is simple in design, non-invasive and patient friendly, and hence shows promise for routine clinical usage. The preliminary result shows that this new method can have various applications in respiratory monitoring and diagnosis.
Abstract:
Frequent episode discovery is a popular framework for pattern discovery from sequential data. It has found many applications in domains like alarm management in telecommunication networks, fault analysis in manufacturing plants, predicting user behavior in web click streams and so on. In this paper, we address the discovery of serial episodes. In the episodes context, there have been multiple ways to quantify the frequency of an episode. Most of the current algorithms for episode discovery under various frequencies are apriori-based level-wise methods. These methods essentially perform a breadth-first search of the pattern space. However, there are currently no depth-first based methods of pattern discovery in the frequent episode framework under many of the frequency definitions. In this paper, we try to bridge this gap. We provide new depth-first based algorithms for serial episode discovery under non-overlapped and total frequencies. Under non-overlapped frequency, we present algorithms that can take care of span and gap constraints on episode occurrences. Under total frequency, we present an algorithm that can handle the span constraint. We provide proofs of correctness for the proposed algorithms. We demonstrate the effectiveness of the proposed algorithms by extensive simulations. We also give detailed run-time comparisons with the existing apriori-based methods and illustrate scenarios under which the proposed pattern-growth algorithms perform better than their apriori counterparts. (C) 2013 Elsevier B.V. All rights reserved.
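To make the non-overlapped frequency of a serial episode concrete, the sketch below counts occurrences with a single left-to-right automaton that resets each time the episode completes. This is a simple greedy counter for one fixed episode without span or gap constraints; it is not one of the depth-first discovery algorithms proposed in the paper, and the event stream is hypothetical.

```python
def count_non_overlapped(events, episode):
    """Count non-overlapped occurrences of a serial episode by a greedy scan."""
    count = 0
    pos = 0  # index of the next episode event we are waiting for
    for event in events:
        if event == episode[pos]:
            pos += 1
            if pos == len(episode):
                # One full occurrence seen; start looking for the next one.
                count += 1
                pos = 0
    return count


# Hypothetical event stream and the serial episode A -> B -> C.
events = list("ABACBCABC")
occurrences = count_non_overlapped(events, "ABC")
```

In this stream the scan completes A -> B -> C once at the fourth event and once more at the last event, giving two non-overlapped occurrences.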
Abstract:
In this paper, we consider the setting of the pattern maximum likelihood (PML) problem studied by Orlitsky et al. We present a well-motivated heuristic algorithm for deciding the question of when the PML distribution of a given pattern is uniform. The algorithm is based on the concept of a "uniform threshold". This is a threshold at which the uniform distribution exhibits an interesting phase transition in the PML problem, going from being a local maximum to being a local minimum.
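For readers unfamiliar with the term, the pattern of a sequence replaces each symbol by the order in which it first appeared, discarding the symbols' identities. The short sketch below computes it; the function name is chosen for this example, and the code illustrates only the definition of a pattern, not the heuristic or the uniform threshold.

```python
def pattern(sequence):
    """Replace each symbol by the index (from 1) of its first appearance."""
    first_seen = {}
    result = []
    for symbol in sequence:
        if symbol not in first_seen:
            first_seen[symbol] = len(first_seen) + 1
        result.append(first_seen[symbol])
    return result


abracadabra = pattern("abracadabra")
```

For "abracadabra" this gives 1 2 3 1 4 1 5 1 2 3 1, and any sequence with the same repetition structure yields the same pattern.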
Abstract:
The classification of time series data is an interesting problem in the field of data mining. Although several algorithms have been proposed for time series classification, we have developed an innovative algorithm that is computationally fast and, in several cases, accurate when compared with the 1NN classifier. Our method calculates the fuzzy membership of each test pattern in each class. We experimented with 6 benchmark datasets and compared our method with the 1NN classifier.
Abstract:
Group VB and VIB M-Si systems are considered to show an interesting pattern in the diffusion of components with the change in atomic number in a particular group (M = V, Nb, Ta or M = Mo, W, respectively). Mainly two phases, MSi2 and M5Si3, are considered for this discussion. Except for Ta-silicides, the activation energy for the integrated diffusion of MSi2 is always lower than that of M5Si3. In both phases, the relative mobilities, measured by the ratio of the tracer diffusion coefficients, decrease with increasing atomic number in the given group. If determined at the same homologous temperature, the interdiffusion coefficients increase with the atomic number of the refractory metal in the MSi2 phases and decrease in the M5Si3 ones. This behaviour reflects the basic changes in the defect concentrations on different sublattices with a change in the atomic number of the refractory components.
Abstract:
In today's API-rich world, programmer productivity depends heavily on the programmer's ability to discover the required APIs. In this paper, we present a technique and tool, called MATHFINDER, to discover APIs for mathematical computations by mining unit tests of API methods. Given a math expression, MATHFINDER synthesizes pseudo-code to compute the expression by mapping its subexpressions to API method calls. For each subexpression, MATHFINDER searches for a method such that there is a mapping between method inputs and variables of the subexpression. The subexpression, when evaluated on the test inputs of the method under this mapping, should produce results that match the method output on a large number of tests. We implemented MATHFINDER as an Eclipse plugin for discovery of third-party Java APIs and performed a user study to evaluate its effectiveness. In the study, the use of MATHFINDER resulted in a 2x improvement in programmer productivity. In 96% of the subexpressions queried for in the study, MATHFINDER retrieved the desired API methods as the top-most result. The top-most pseudo-code snippet to implement the entire expression was correct in 93% of the cases. Since the number of methods and unit tests to mine could be large in practice, we also implement MATHFINDER in a MapReduce framework and evaluate its scalability and response time.
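The matching step described above, where a subexpression is evaluated on a method's test inputs under some mapping of inputs to variables, can be sketched as follows. Everything here is hypothetical: the recorded tests stand in for the unit tests of some library method, and `match_method`, its tolerance and its `min_fraction` threshold are names and parameters invented for this example rather than MATHFINDER's actual interface.

```python
from itertools import permutations
import math


def match_method(expr, arity, tests, tol=1e-9, min_fraction=0.9):
    """Return input-to-variable mappings under which `expr` reproduces the test outputs.

    A mapping is a tuple `perm` meaning: variable k of `expr` takes method input perm[k].
    """
    matches = []
    for perm in permutations(range(arity)):
        hits = 0
        for inputs, expected in tests:
            value = expr(*[inputs[i] for i in perm])
            if math.isclose(value, expected, rel_tol=tol, abs_tol=tol):
                hits += 1
        if hits >= min_fraction * len(tests):
            matches.append(perm)
    return matches


# Hypothetical recorded tests of a method power(base, exponent): (inputs, output).
tests = [((2.0, 3.0), 8.0), ((3.0, 2.0), 9.0), ((5.0, 1.0), 5.0)]


def subexpression(e, b):
    """Queried subexpression b^e, with variables in the order (e, b)."""
    return b ** e


matches = match_method(subexpression, 2, tests)
```

Only the mapping that swaps the two arguments reproduces every recorded output, which suggests that variable `e` corresponds to the method's second input and `b` to its first.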
Abstract:
Today's programming languages are supported by powerful third-party APIs. For a given application domain, it is common to have many competing APIs that provide similar functionality. Programmer productivity therefore depends heavily on the programmer's ability to discover suitable APIs both during an initial coding phase, as well as during software maintenance. The aim of this work is to support the discovery and migration of math APIs. Math APIs are at the heart of many application domains ranging from machine learning to scientific computations. Our approach, called MATHFINDER, combines executable specifications of mathematical computations with unit tests (operational specifications) of API methods. Given a math expression, MATHFINDER synthesizes pseudo-code comprised of API methods to compute the expression by mining unit tests of the API methods. We present a sequential version of our unit test mining algorithm and also design a more scalable data-parallel version. We perform extensive evaluation of MATHFINDER (1) for API discovery, where math algorithms are to be implemented from scratch and (2) for API migration, where client programs utilizing a math API are to be migrated to another API. We evaluated the precision and recall of MATHFINDER on a diverse collection of math expressions, culled from algorithms used in a wide range of application areas such as control systems and structural dynamics. In a user study to evaluate the productivity gains obtained by using MATHFINDER for API discovery, the programmers who used MATHFINDER finished their programming tasks twice as fast as their counterparts who used the usual techniques like web and code search, IDE code completion, and manual inspection of library documentation. For the problem of API migration, as a case study, we used MATHFINDER to migrate Weka, a popular machine learning library. Overall, our evaluation shows that MATHFINDER is easy to use, provides highly precise results across several math APIs and application domains even with a small number of unit tests per method, and scales to large collections of unit tests.
Abstract:
We demonstrate a new technique to generate multiple light-sheets for fluorescence microscopy. This is possible by illuminating the cylindrical lens using multiple copies of Gaussian beams. A diffraction grating placed just before the cylindrical lens splits the incident Gaussian beam into multiple beams traveling at different angles. Subsequently, this gives rise to diffraction-limited light-sheets after the Gaussian beams pass through the combined cylindrical lens-objective sub-system. Direct measurement of the field at and around the focus of the objective lens shows a multi-sheet pattern with an average thickness of 7.5 μm and inter-sheet separation of 380 μm. Employing an independent orthogonal detection sub-system, we successfully imaged fluorescently-coated yeast cells (≈ 4 μm) encaged in an agarose gel-matrix. Such a diffraction-limited sheet-pattern equipped with a dedicated detection system may find immediate applications in the field of optical microscopy and fluorescence imaging. (C) 2015 Optical Society of America