45 results for mining sector
Suite of tools for statistical N-gram language modeling for pattern mining in whole genome sequences
Abstract:
Genome sequences contain a number of patterns that have biomedical significance. Repetitive sequences of various kinds are a primary component of most of the genomic sequence patterns. We extended the suffix-array based Biological Language Modeling Toolkit to compute n-gram frequencies as well as n-gram language-model based perplexity in windows over the whole genome sequence to find biologically relevant patterns. We present the suite of tools and their application for analysis on whole human genome sequence.
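The windowed n-gram analysis described above can be illustrated with a minimal sketch. It is not the Biological Language Modeling Toolkit itself: it uses a plain counter rather than a suffix array, and the function names, add-one smoothing, and window sizes are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngram_counts(seq, n):
    """Count every overlapping n-gram in seq."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def window_perplexity(window, n, ngrams, contexts, alphabet_size=4):
    """Perplexity of one window under an add-one-smoothed n-gram model.

    Assumes len(window) >= n so that at least one n-gram is scored.
    """
    log_prob = 0.0
    steps = 0
    for i in range(n - 1, len(window)):
        gram = window[i - n + 1:i + 1]
        # P(last symbol | previous n-1 symbols), with add-one smoothing.
        p = (ngrams[gram] + 1) / (contexts[gram[:-1]] + alphabet_size)
        log_prob += math.log2(p)
        steps += 1
    return 2 ** (-log_prob / steps)


def scan(seq, n=3, window=12, step=6):
    """Return (start, perplexity) for each window sliding over seq."""
    ngrams = ngram_counts(seq, n)
    contexts = Counter()
    for gram, count in ngrams.items():
        contexts[gram[:-1]] += count
    return [
        (start, window_perplexity(seq[start:start + window], n, ngrams, contexts))
        for start in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    for start, ppl in scan("ACGT" * 10 + "TTGACCGATAGG"):
        print(start, round(ppl, 3))
```

Windows whose perplexity is far below that of their neighbours are dominated by n-grams that are frequent across the whole sequence, which is one simple way to flag repetitive regions.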
Abstract:
RATIONALE: The ratio of the measured abundance of CO2 containing 13C-18O bonds to its stochastic abundance, prescribed by the delta 13C and delta 18O values from a carbonate mineral, is sensitive to the mineral's growth temperature. Recently, clumped-isotope thermometry, which uses this ratio, has been adopted as a new tool to elucidate paleotemperatures quantitatively. METHODS: Clumped isotopes in CO2 were measured with a small-sector isotope ratio mass spectrometer. CO2 samples digested from several kinds of calcium carbonates by phosphoric acid at 25 degrees C were purified using both cryogenic and gas-chromatographic separations, and their isotopic compositions (delta 13C, delta 18O, Delta 47, Delta 48 and Delta 49 values) were then determined using a dual-inlet Delta XP mass spectrometer. RESULTS: The internal precisions of the single gas Delta 47 measurements were 0.005 and 0.02 parts per thousand (1 SE) for the optimum and the routine analytical conditions, respectively, which are comparable with those obtained using a MAT 253 mass spectrometer. The long-term variations in the Delta 47 values for the in-house working standard and the heated CO2 gases since 2007 were close to the routine, single gas uncertainty while showing seasonal-like periodicities with a decreasing trend. Unlike the MAT 253, the Delta XP did not show any significant relationship between the Delta 47 and delta 47 values. CONCLUSIONS: The Delta XP gave results that were approximately as precise as those of the MAT 253 for clumped-isotope analysis. The temporal stability of the Delta XP seemed to be lower, although an advantage of the Delta XP was that no dependency of delta 47 on Delta 47 was found. Copyright (c) 2012 John Wiley & Sons, Ltd.
Abstract:
Song-selection and mood are interdependent. If we capture a song’s sentiment, we can determine the mood of the listener, which can serve as a basis for recommendation systems. Songs are generally classified according to genres, which don’t entirely reflect sentiments. Thus, we require an unsupervised scheme to mine them. Sentiments are classified into either two (positive/negative) or multiple (happy/angry/sad/...) classes, depending on the application. We are interested in analyzing the feelings invoked by a song, involving multi-class sentiments. To mine the hidden sentimental structure behind a song, in terms of “topics”, we consider its lyrics and use Latent Dirichlet Allocation (LDA). Each song is a mixture of moods. Topics mined by LDA can represent moods. Thus we get a scheme of collecting similar-mood songs. For validation, we use a dataset of songs containing 6 moods annotated by users of a particular website.
Abstract:
We address the problem of mining targeted association rules over multidimensional market-basket data. Here, each transaction has, in addition to the set of purchased items, ancillary dimension attributes associated with it. Based on these dimensions, transactions can be visualized as distributed over cells of an n-dimensional cube. In this framework, a targeted association rule is of the form {X -> Y} R, where R is a convex region in the cube and X -> Y is a traditional association rule within region R. We first describe the TOARM algorithm, based on classical techniques, for identifying targeted association rules. Then, we discuss the concepts of bottom-up aggregation and cubing, leading to the CellUnion technique. This approach is further extended, using notions of cube-count interleaving and credit-based pruning, to derive the IceCube algorithm. Our experiments demonstrate that IceCube consistently provides the best execution time performance, especially for large and complex data cubes.
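To make the notion of a targeted rule {X -> Y} R concrete, the sketch below evaluates the support and confidence of X -> Y using only the transactions whose cube cell falls inside a region R. It is a brute-force illustration, not TOARM, CellUnion, or IceCube; the `Transaction` type, the axis-aligned box used as the convex region, and the sample data are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    items: frozenset   # purchased items
    cell: tuple        # coordinates of the transaction in the n-dimensional cube


def in_region(cell, region):
    """region is one (low, high) pair per dimension, i.e. an axis-aligned box."""
    return all(lo <= c <= hi for c, (lo, hi) in zip(cell, region))


def rule_stats(transactions, x, y, region):
    """Return (support, confidence) of X -> Y restricted to region R."""
    local = [t for t in transactions if in_region(t.cell, region)]
    if not local:
        return 0.0, 0.0
    n_x = sum(1 for t in local if x <= t.items)
    n_xy = sum(1 for t in local if (x | y) <= t.items)
    support = n_xy / len(local)
    confidence = n_xy / n_x if n_x else 0.0
    return support, confidence


# Hypothetical data: cell = (store index, month index).
DATA = [
    Transaction(frozenset({"bread", "butter"}), (0, 0)),
    Transaction(frozenset({"bread", "butter", "milk"}), (0, 1)),
    Transaction(frozenset({"bread"}), (1, 0)),
    Transaction(frozenset({"bread", "milk"}), (3, 3)),
    Transaction(frozenset({"butter"}), (3, 3)),
]

if __name__ == "__main__":
    x, y = frozenset({"bread"}), frozenset({"butter"})
    print(rule_stats(DATA, x, y, ((0, 1), (0, 1))))  # rule inside a sub-region
    print(rule_stats(DATA, x, y, ((0, 3), (0, 3))))  # rule over the whole cube
```

The same rule can clear a confidence threshold inside one region and fail it over the whole cube, which is why the region is part of the rule.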
Abstract:
The rapid growth in the field of data mining has led to the development of various methods for outlier detection. Though detection of outliers has been well explored in the context of numerical data, dealing with categorical data is still evolving. In this paper, we propose a two-phase algorithm for detecting outliers in categorical data based on a novel definition of outliers. In the first phase, this algorithm explores a clustering of the given data, followed by the ranking phase for determining the set of most likely outliers. The proposed algorithm is expected to perform better as it can identify different types of outliers, employing two independent ranking schemes based on the attribute value frequencies and the inherent clustering structure in the given data. Unlike some existing methods, the computational complexity of this algorithm is not affected by the number of outliers to be detected. The efficacy of this algorithm is demonstrated through experiments on various public domain categorical data sets.
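One of the two ranking ideas mentioned above, ranking by attribute value frequencies, can be sketched as follows. This is a generic frequency score and not the paper's two-phase algorithm; the clustering phase is omitted, and the function names and sample records are assumptions.

```python
from collections import Counter


def avf_scores(records):
    """Mean frequency of each record's attribute values; low means unusual."""
    n_attrs = len(records[0])
    freq = [Counter(r[j] for r in records) for j in range(n_attrs)]
    return [sum(freq[j][r[j]] for j in range(n_attrs)) / n_attrs for r in records]


def rank_outliers(records, k):
    """Indices of the k records with the lowest frequency score."""
    scores = avf_scores(records)
    return sorted(range(len(records)), key=lambda i: scores[i])[:k]


# Hypothetical categorical records: (colour, size).
RECORDS = [
    ("red", "small"),
    ("red", "small"),
    ("red", "large"),
    ("blue", "small"),
    ("green", "huge"),
]

if __name__ == "__main__":
    print(avf_scores(RECORDS))
    print(rank_outliers(RECORDS, 1))
```

Scoring every record costs the same regardless of k, which is in line with the abstract's remark that the cost should not depend on the number of outliers requested.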
Abstract:
This paper primarily intends to develop a GIS (geographical information system)-based data mining approach for optimally selecting the locations and determining installed capacities for setting up distributed biomass power generation systems in the context of decentralized energy planning for rural regions. The optimal locations within a cluster of villages are obtained by matching the installed capacity needed with the demand for power, minimizing the cost of transportation of biomass from dispersed sources to power generation system, and cost of distribution of electricity from the power generation system to demand centers or villages. The methodology was validated by using it for developing an optimal plan for implementing distributed biomass-based power systems for meeting the rural electricity needs of Tumkur district in India consisting of 2700 villages. The approach uses a k-medoid clustering algorithm to divide the total region into clusters of villages and locate biomass power generation systems at the medoids. The optimal value of k is determined iteratively by running the algorithm for the entire search space for different values of k along with demand-supply matching constraints. The optimal value of the k is chosen such that it minimizes the total cost of system installation, costs of transportation of biomass, and transmission and distribution. A smaller region, consisting of 293 villages was selected to study the sensitivity of the results to varying demand and supply parameters. The results of clustering are represented on a GIS map for the region.
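The clustering step described above can be sketched with a small k-medoids routine plus a loop over k that trades plant cost against transport distance. This is only an illustration of the idea: the alternating update, the naive initialisation, the Euclidean distance, and the `plant_cost` parameter are assumptions, and the demand-supply matching constraints from the abstract are not modelled.

```python
import math


def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])


def k_medoids(points, k, max_iter=100):
    """Alternating k-medoids; returns (sorted medoid indices, total distance).

    Assumes all points are distinct, so no cluster ever becomes empty.
    """
    medoids = list(range(k))  # naive deterministic initialisation
    for _ in range(max_iter):
        clusters = {m: [] for m in medoids}
        for i, p in enumerate(points):
            nearest = min(medoids, key=lambda m: dist(p, points[m]))
            clusters[nearest].append(i)
        # Move each medoid to the member with the smallest summed distance.
        new_medoids = [
            min(members, key=lambda c: sum(dist(points[c], points[j]) for j in members))
            for members in clusters.values()
        ]
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    cost = sum(min(dist(p, points[m]) for m in medoids) for p in points)
    return sorted(medoids), cost


def choose_k(points, candidates, plant_cost):
    """Pick the k minimising plant_cost * k + total village-to-plant distance."""
    return min(candidates, key=lambda k: plant_cost * k + k_medoids(points, k)[1])


# Hypothetical village coordinates.
VILLAGES = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

if __name__ == "__main__":
    print(k_medoids(VILLAGES, 2))
    print(choose_k(VILLAGES, [1, 2], plant_cost=10.0))
```

Because a medoid is always one of the input points, each plant is placed at an actual village rather than at an arbitrary centroid.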
Abstract:
The high level of public accountability attached to Public Sector Enterprises as a result of public ownership makes them socially responsible. The Committee of Public Undertakings in 1992 examined the issue relating to social obligations of Central Public Sector Enterprises and observed that "being part of the 'State', every Public Sector enterprise has a moral responsibility to play an active role in discharging the social obligations endowed on a welfare state, subject to the financial health of the enterprise". It issued the Corporate Social Responsibility Guidelines in 2010 where all Central Public Enterprises, through a Board Resolution, are mandated to create a CSR budget as a specified percentage of net profit of the previous year. This paper examines the CSR activities of the biggest engineering public sector organization in India, Bharat Heavy Electricals Limited. The objectives are twofold: one, to develop a case study of the organization about the funds allocated and utilized for various CSR activities, and two, to examine its status with regard to other organizations, the 2010 guidelines, and the local socio-economic development. Secondary data analysis results show three interesting trends. One, organizational social orientation is increasing with the formal guidelines in place. Two, firms can no longer continue to exploit environmental resources and escape from their responsibilities by acting as separate entities regardless of the interest of the society. Three, the thrust of CSR in the public sector is on inclusive growth, sustainable development and capacity building, with due attention to the socio-economic needs of the neglected and marginalized sections of the society.
Abstract:
Mycobacterium tuberculosis owes its high pathogenic potential to its ability to evade host immune responses and thrive inside the macrophage. The outcome of infection is largely determined by the cellular response comprising a multitude of molecular events. The complexity and inter-relatedness of these processes make it essential to adopt systems approaches to study them. In this work, we construct a comprehensive network of infection-related processes in a human macrophage comprising 1888 proteins and 14,016 interactions. We then compute response networks based on available gene expression profiles corresponding to states of health, disease and drug treatment. We use a novel formulation for mining response networks that has led to the identification of the highest activities in the cell. Highest activity paths provide mechanistic insights into pathogenesis and response to treatment. The approach used here serves as a generic framework for mining dynamic changes in genome-scale protein interaction networks.
Abstract:
In several systems, the physical parameters of the system vary over time or operating points. A popular way of representing such plants with structured or parametric uncertainties is by means of interval polynomials. However, ensuring the stability of such systems is a robust control problem. Fortunately, Kharitonov's theorem enables the analysis of such interval plants and also provides tools for design of robust controllers in such cases. The present paper considers one such case, where the interval plant is connected with a time-invariant, static, odd, sector-type nonlinearity in its feedback path. This paper provides necessary conditions for the existence of self-sustaining periodic oscillations in such interval plants, and indicates a possible design algorithm to avoid such periodic solutions or limit cycles. The describing function technique is used to approximate the nonlinearity and subsequently arrive at the results. Furthermore, the value set approach, along with the Mikhailov conditions, is used to provide graphical techniques for the derivation of the conditions and the subsequent design algorithm of the controller.
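As background to the abstract above, Kharitonov's theorem reduces the stability of a whole interval polynomial to that of four corner polynomials. The sketch below builds those four polynomials and checks each with a Routh test. It covers only the linear interval-stability step, not the describing-function or value-set analysis of the paper; the function names and the example intervals are assumptions, and all coefficient bounds are assumed positive.

```python
# Coefficient patterns in ascending powers of s: l = lower bound, h = upper bound.
PATTERNS = ("llhh", "hhll", "lhhl", "hllh")


def kharitonov_polynomials(lower, upper):
    """Four Kharitonov polynomials; coefficients in ascending powers of s."""
    return [
        [lower[i] if pat[i % 4] == "l" else upper[i] for i in range(len(lower))]
        for pat in PATTERNS
    ]


def is_hurwitz(coeffs_desc):
    """Routh test for a polynomial given in descending powers of s.

    A zero in the first column is treated as "not strictly stable".
    """
    n = len(coeffs_desc) - 1
    prev = list(coeffs_desc[0::2])
    width = len(prev)
    cur = list(coeffs_desc[1::2])
    cur += [0.0] * (width - len(cur))
    first_col = [prev[0]]
    for _ in range(n):
        if cur[0] == 0:
            return False
        first_col.append(cur[0])
        nxt = [
            (cur[0] * prev[j + 1] - prev[0] * cur[j + 1]) / cur[0]
            for j in range(width - 1)
        ] + [0.0]
        prev, cur = cur, nxt
    return all(x > 0 for x in first_col)


def robustly_stable(lower, upper):
    """True if all four Kharitonov polynomials are Hurwitz."""
    return all(is_hurwitz(p[::-1]) for p in kharitonov_polynomials(lower, upper))


if __name__ == "__main__":
    # Hypothetical interval cubic a0 + a1 s + a2 s^2 + a3 s^3.
    print(robustly_stable([5, 10, 5, 1], [7, 12, 7, 1]))
    print(robustly_stable([5, 10, 5, 1], [60, 12, 7, 1]))
```

Widening the interval on the constant term in the second call is enough to make one corner polynomial unstable, so the family is no longer robustly stable.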
Abstract:
The agriculture, forestry and other land use (AFOLU) sector is responsible for approximately 25% of anthropogenic GHG emissions mainly from deforestation and agricultural emissions from livestock, soil and nutrient management. Mitigation from the sector is thus extremely important in meeting emission reduction targets. The sector offers a variety of cost-competitive mitigation options with most analyses indicating a decline in emissions largely due to decreasing deforestation rates. Sustainability criteria are needed to guide development and implementation of AFOLU mitigation measures with particular focus on multifunctional systems that allow the delivery of multiple services from land. It is striking that almost all of the positive and negative impacts, opportunities and barriers are context specific, precluding generic statements about which AFOLU mitigation measures have the greatest promise at a global scale. This finding underlines the importance of considering each mitigation strategy on a case-by-case basis and of considering systemic effects when implementing mitigation options on the national scale, and suggests that policies need to be flexible enough to allow such assessments. National and international agricultural and forest (climate) policies have the potential to alter the opportunity costs of specific land uses in ways that increase opportunities or barriers for attaining climate change mitigation goals. Policies governing practices in agriculture and in forest conservation and management need to account for both effective mitigation and adaptation and can help to orient practices in agriculture and in forestry towards global sharing of innovative technologies for the efficient use of land resources. Different policy instruments, especially economic incentives and regulatory approaches, are currently being applied; however, for their successful implementation it is critical to understand how land-use decisions are made and how new social, political and economic forces in the future will influence this process.
Abstract:
In today's API-rich world, programmer productivity depends heavily on the programmer's ability to discover the required APIs. In this paper, we present a technique and tool, called MATHFINDER, to discover APIs for mathematical computations by mining unit tests of API methods. Given a math expression, MATHFINDER synthesizes pseudo-code to compute the expression by mapping its subexpressions to API method calls. For each subexpression, MATHFINDER searches for a method such that there is a mapping between method inputs and variables of the subexpression. The subexpression, when evaluated on the test inputs of the method under this mapping, should produce results that match the method output on a large number of tests. We implemented MATHFINDER as an Eclipse plugin for discovery of third-party Java APIs and performed a user study to evaluate its effectiveness. In the study, the use of MATHFINDER resulted in a 2x improvement in programmer productivity. In 96% of the subexpressions queried for in the study, MATHFINDER retrieved the desired API methods as the top-most result. The top-most pseudo-code snippet to implement the entire expression was correct in 93% of the cases. Since the number of methods and unit tests to mine could be large in practice, we also implement MATHFINDER in a MapReduce framework and evaluate its scalability and response time.
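The matching step described above, where a subexpression is evaluated on a method's unit-test inputs under some mapping of inputs to variables, can be sketched in a few lines. This is not the MATHFINDER implementation: the library names (`Lib.hypot`, `Lib.pow`, `Lib.rdiv`), the recorded tests, and the scoring by fraction of matching tests are all hypothetical.

```python
import itertools
import math


def match_score(expr, arity, tests, tol=1e-9):
    """Best (score, mapping) over all assignments of method inputs to variables.

    score is the fraction of unit tests whose recorded output the expression
    reproduces; mapping[i] is the method input bound to the i-th variable.
    """
    best = (0.0, None)
    for perm in itertools.permutations(range(arity)):
        hits = 0
        for inputs, output in tests:
            args = [inputs[i] for i in perm]
            try:
                if math.isclose(expr(*args), output, rel_tol=tol, abs_tol=tol):
                    hits += 1
            except (ArithmeticError, ValueError):
                pass  # expression undefined on this test input
        score = hits / len(tests)
        if score > best[0]:
            best = (score, perm)
    return best


def rank_methods(expr, arity, library):
    """Rank methods of matching arity by how well they agree with expr."""
    ranked = []
    for name, tests in library.items():
        if any(len(inputs) != arity for inputs, _ in tests):
            continue
        score, perm = match_score(expr, arity, tests)
        ranked.append((score, name, perm))
    ranked.sort(key=lambda r: -r[0])
    return ranked


# Hypothetical unit tests mined from an API: method -> [(inputs, output), ...].
LIBRARY = {
    "Lib.hypot": [((3.0, 4.0), 5.0), ((5.0, 12.0), 13.0), ((1.0, 0.0), 1.0)],
    "Lib.pow": [((2.0, 3.0), 8.0), ((3.0, 2.0), 9.0), ((2.0, 0.0), 1.0)],
    "Lib.rdiv": [((2.0, 8.0), 4.0), ((5.0, 10.0), 2.0)],  # second / first
}

if __name__ == "__main__":
    print(rank_methods(lambda x, y: math.sqrt(x * x + y * y), 2, LIBRARY)[0])
    print(rank_methods(lambda x, y: x / y, 2, LIBRARY)[0])
```

The second query shows why the mapping matters: `x / y` only matches the hypothetical `Lib.rdiv` once its two inputs are swapped.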
Abstract:
Today's programming languages are supported by powerful third-party APIs. For a given application domain, it is common to have many competing APIs that provide similar functionality. Programmer productivity therefore depends heavily on the programmer's ability to discover suitable APIs both during an initial coding phase, as well as during software maintenance. The aim of this work is to support the discovery and migration of math APIs. Math APIs are at the heart of many application domains ranging from machine learning to scientific computations. Our approach, called MATHFINDER, combines executable specifications of mathematical computations with unit tests (operational specifications) of API methods. Given a math expression, MATHFINDER synthesizes pseudo-code comprised of API methods to compute the expression by mining unit tests of the API methods. We present a sequential version of our unit test mining algorithm and also design a more scalable data-parallel version. We perform extensive evaluation of MATHFINDER (1) for API discovery, where math algorithms are to be implemented from scratch and (2) for API migration, where client programs utilizing a math API are to be migrated to another API. We evaluated the precision and recall of MATHFINDER on a diverse collection of math expressions, culled from algorithms used in a wide range of application areas such as control systems and structural dynamics. In a user study to evaluate the productivity gains obtained by using MATHFINDER for API discovery, the programmers who used MATHFINDER finished their programming tasks twice as fast as their counterparts who used the usual techniques like web and code search, IDE code completion, and manual inspection of library documentation. For the problem of API migration, as a case study, we used MATHFINDER to migrate Weka, a popular machine learning library. Overall, our evaluation shows that MATHFINDER is easy to use, provides highly precise results across several math APIs and application domains even with a small number of unit tests per method, and scales to large collections of unit tests.
Abstract:
The disclosure of information and its misuse in Privacy Preserving Data Mining (PPDM) systems is a concern to the parties involved. In PPDM systems data is available amongst multiple parties collaborating to achieve cumulative mining accuracy. The vertically partitioned data available with the parties involved cannot provide accurate mining results when compared to the collaborative mining results. To overcome the privacy issue in data disclosure this paper describes a Key Distribution-Less Privacy Preserving Data Mining (KDLPPDM) system in which the local association rules generated by the parties are published. The association rules are securely combined to form the combined rule set using the Commutative RSA algorithm. The combined rule sets established are used to classify or mine the data. The results discussed in this paper compare the accuracy of the rules generated using the C4.5 based KDLPPDM system and the C5.0 based KDLPPDM system using receiver operating characteristic (ROC) curves.
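The property that makes a commutative cipher useful for combining rule sets is that layers of encryption can be applied in any order with the same result. The toy sketch below shows that property for RSA-style exponentiation over a shared modulus. It is not the KDLPPDM protocol: the tiny primes, the exponents, and the hashing of a rule to an integer are assumptions for illustration only and are far too small to be secure.

```python
import hashlib
from math import gcd

# Toy shared parameters (assumed; real deployments need large primes).
P, Q = 1009, 1013
N = P * Q
PHI = (P - 1) * (Q - 1)


def make_key(e):
    """Return (e, d) with e * d = 1 mod PHI; each party picks its own e."""
    if gcd(e, PHI) != 1:
        raise ValueError("exponent must be coprime to PHI")
    return e, pow(e, -1, PHI)


def crypt(m, exponent):
    """Modular exponentiation, used for both encryption and decryption."""
    return pow(m, exponent, N)


def rule_to_int(rule):
    """Map a rule string to an integer below N (hypothetical encoding)."""
    return int.from_bytes(hashlib.sha256(rule.encode()).digest(), "big") % N


if __name__ == "__main__":
    e_a, d_a = make_key(5)
    e_b, d_b = make_key(13)
    m = rule_to_int("{bread} -> {butter}")
    ab = crypt(crypt(m, e_a), e_b)  # party A encrypts, then party B
    ba = crypt(crypt(m, e_b), e_a)  # party B encrypts, then party A
    print(ab == ba)
```

Since both orders give the same ciphertext, identical rules held by different parties collide after every party has added its layer, so a combined rule set can be formed on ciphertexts.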
Abstract:
In this paper, we search for the regions of the phenomenological minimal supersymmetric standard model (pMSSM) parameter space where one can expect to have a moderate Higgs mixing angle ($\alpha$) with relatively light (up to 600 GeV) additional Higgses after satisfying the current LHC data. We perform a global fit analysis using the most recent data (up to December 2014) from the LHC and Tevatron experiments. The constraints coming from the precision measurements of the rare b-decays $B_s \to \mu^+\mu^-$ and $b \to s\gamma$ are also considered. We find that low $M_A$ ($\lesssim 350$) and high $\tan\beta$ ($\gtrsim 25$) regions are disfavored by the combined effect of the global analysis and flavor data. However, regions with Higgs mixing angle $\alpha \sim 0.1$-$0.8$ are still allowed by the current data. We then study the existing direct search bounds on the heavy scalar/pseudoscalar ($H/A$) and charged Higgs boson ($H^\pm$) masses and branchings at the LHC. It has been found that regions with low to moderate values of $\tan\beta$ with light additional Higgses (mass $\le 600$ GeV) are unconstrained by the data, while the regions with $\tan\beta > 20$ are excluded considering the direct search bounds by the LHC-8 data. The possibility to probe the region with $\tan\beta \le 20$ at the high luminosity run of the LHC is also discussed, giving special attention to the $H \to hh$, $H/A \to t\bar{t}$ and $H/A \to \tau^+\tau^-$ decay modes.
Abstract:
Online Social Networks (OSNs) make it easy to create and spread information rapidly, influencing others to participate and propagandize. This work proposes a novel method of profiling an Influential Blogger (IB), a blogger who influences various other bloggers in a Social Blog Network (SBN), based on the activities performed on that blogger's blog documents. After constructing a social blogging site, an SBN is analyzed with appropriate parameters to obtain the Influential Blog Power (IBP) of each blogger in the network and to demonstrate that the profiling of IBs is adequate and accurate. With the proposed Profiling Influential Blogger (PIB) algorithm, the survival rate of IBs is high and stable. (C) 2015 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).