68 results for Association mining
Suite of tools for statistical N-gram language modeling for pattern mining in whole genome sequences
Abstract:
Genome sequences contain a number of patterns that have biomedical significance. Repetitive sequences of various kinds are a primary component of most of the genomic sequence patterns. We extended the suffix-array based Biological Language Modeling Toolkit to compute n-gram frequencies as well as n-gram language-model based perplexity in windows over the whole genome sequence to find biologically relevant patterns. We present the suite of tools and their application to the analysis of the whole human genome sequence.
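The windowed n-gram counting described in this abstract can be illustrated with a minimal sketch. It uses plain dictionary counting rather than the toolkit's suffix arrays, and the function names, the toy sequence, and the window and step sizes are all assumptions made for illustration.

```python
from collections import Counter


def ngram_counts(sequence, n):
    """Count every overlapping n-gram in a sequence."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


def windowed_ngram_counts(sequence, n, window, step):
    """Yield (start, counts) for each full window slid over the sequence."""
    for start in range(0, len(sequence) - window + 1, step):
        yield start, ngram_counts(sequence[start:start + window], n)


if __name__ == "__main__":
    genome = "ACGTACGTTTTTACGT"  # toy sequence, not real genomic data
    for start, counts in windowed_ngram_counts(genome, n=2, window=8, step=4):
        print(start, counts.most_common(2))
```

A window whose counts are dominated by a few n-grams is a candidate repetitive region; a real whole-genome run would need the suffix-array machinery for memory efficiency.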
Abstract:
Competition theory predicts that local communities should consist of species that are more dissimilar than expected by chance. We find a strikingly different pattern in a multicontinent data set (55 presence-absence matrices from 24 locations) on the composition of mixed-species bird flocks, which are important sub-units of local bird communities the world over. By using null models and randomization tests followed by meta-analysis, we find the association strengths of species in flocks to be strongly related to similarity in body size and foraging behavior and higher for congeneric compared with noncongeneric species pairs. Given the local spatial scales of our individual analyses, differences in the habitat preferences of species are unlikely to have caused these association patterns; the patterns observed are most likely the outcome of species interactions. Extending group-living and social-information-use theory to a heterospecific context, we discuss potential behavioral mechanisms that lead to positive interactions among similar species in flocks, as well as ways in which competition costs are reduced. Our findings highlight the need to consider positive interactions along with competition when seeking to explain community assembly.
Abstract:
We study the question of determining locations of base stations (BSs) that may belong to the same or to competing service providers. We take into account the impact of these decisions on the behavior of intelligent mobile terminals that can connect to the base station that offers the best utility. The signal-to-interference-plus-noise ratio (SINR) is used as the quantity that determines the association. We first study the SINR association-game: We determine the cells corresponding to each base station, i.e., the locations at which mobile terminals prefer to connect to a given base station rather than to others. We make some surprising observations: 1) displacing a base station a little in one direction may result in a displacement of the boundary of the corresponding cell in the opposite direction; 2) a cell corresponding to a BS may be the union of disconnected subcells. We then study the hierarchical equilibrium in the combined BS location and mobile association problem: We determine where to locate the BSs so as to maximize the revenues obtained in the induced SINR mobile association game. We consider the cases of single frequency band and two frequency bands of operation. Finally, we also consider hierarchical equilibria in two frequency systems with successive interference cancellation.
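As a rough illustration of SINR-based association, the sketch below lets a terminal pick the base station with the highest SINR on a single shared frequency band. The power-law path-loss model, the noise level, and all function names are assumptions for illustration and are not taken from the paper.

```python
import math


def received_power(tx_power, bs_pos, user_pos, path_loss_exp=2.0):
    """Received power under an assumed power-law path-loss model."""
    distance = max(math.dist(bs_pos, user_pos), 1e-9)  # avoid division by zero
    return tx_power / distance ** path_loss_exp


def sinr(i, stations, user_pos, noise=1e-3):
    """SINR of station i at user_pos; every other station counts as interference."""
    powers = [received_power(p, pos, user_pos) for pos, p in stations]
    signal = powers[i]
    interference = sum(powers) - signal
    return signal / (interference + noise)


def best_station(stations, user_pos, noise=1e-3):
    """Index of the station that offers the highest SINR at user_pos."""
    return max(range(len(stations)), key=lambda i: sinr(i, stations, user_pos, noise))


if __name__ == "__main__":
    # Each station is ((x, y), transmit_power); the values are illustrative only.
    stations = [((0.0, 0.0), 1.0), ((10.0, 0.0), 1.0)]
    print(best_station(stations, (2.0, 0.0)))
```

Evaluating `best_station` over a grid of user positions traces out the cells the abstract refers to.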
Abstract:
Song-selection and mood are interdependent. If we capture a song’s sentiment, we can determine the mood of the listener, which can serve as a basis for recommendation systems. Songs are generally classified according to genres, which don’t entirely reflect sentiments. Thus, we require an unsupervised scheme to mine them. Sentiments are classified into either two (positive/negative) or multiple (happy/angry/sad/...) classes, depending on the application. We are interested in analyzing the feelings invoked by a song, involving multi-class sentiments. To mine the hidden sentimental structure behind a song, in terms of “topics”, we consider its lyrics and use Latent Dirichlet Allocation (LDA). Each song is a mixture of moods. Topics mined by LDA can represent moods. Thus we get a scheme of collecting similar-mood songs. For validation, we use a dataset of songs containing 6 moods annotated by users of a particular website.
Abstract:
Fast content addressable data access mechanisms have compelling applications in today's systems. Many of these exploit the powerful wildcard matching capabilities provided by ternary content addressable memories. For example, TCAM based implementations of important algorithms in data mining have been developed in recent years; these achieve an order of magnitude speedup over prevalent techniques. However, large hardware TCAMs are still prohibitively expensive in terms of power consumption and cost per bit. This has been a barrier to extending their exploitation beyond niche and special purpose systems. We propose an approach to overcome this barrier by extending the traditional virtual memory hierarchy to scale up the user visible capacity of TCAMs while mitigating the power consumption overhead. By exploiting the notion of content locality (as opposed to spatial locality), we devise a novel combination of software and hardware techniques to provide an abstraction of a large virtual ternary content addressable space. In the long run, such abstractions enable applications to disassociate considerations of spatial locality and contiguity from the way data is referenced. If successful, ideas for making content addressability a first class abstraction in computing systems can open up a radical shift in the way applications are optimized for memory locality, just as storage class memories are soon expected to shift away from the way in which applications are typically optimized for disk access locality.
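The wildcard matching that ternary content addressable memories provide can be emulated in software, as in the sketch below. Each stored pattern uses "0", "1", or "*" (don't care) per position. The sequential loop only imitates the lookup semantics; hardware compares all entries in parallel, and the function names and table contents here are hypothetical.

```python
def ternary_match(pattern, key):
    """True if key matches pattern, where '*' in the pattern matches any bit."""
    return len(pattern) == len(key) and all(
        p in ("*", k) for p, k in zip(pattern, key)
    )


def tcam_lookup(table, key):
    """Return the index of the first (highest-priority) matching entry, or None."""
    for index, pattern in enumerate(table):
        if ternary_match(pattern, key):
            return index
    return None


if __name__ == "__main__":
    table = ["10*1", "1***", "0000"]  # illustrative entries, highest priority first
    print(tcam_lookup(table, "1100"))
```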
Abstract:
Users can rarely reveal their information need in full detail to a search engine within 1-2 words, so search engines need to "hedge their bets" and present diverse results within the precious 10 response slots. Diversity in ranking is of much recent interest. Most existing solutions estimate the marginal utility of an item given a set of items already in the response, and then use variants of greedy set cover. Others design graphs with the items as nodes and choose diverse items based on visit rates (PageRank). Here we introduce a radically new and natural formulation of diversity as finding centers in resistive graphs. Unlike in PageRank, we do not specify the edge resistances (equivalently, conductances) and ask for node visit rates. Instead, we look for a sparse set of center nodes so that the effective conductance from the center to the rest of the graph has maximum entropy. We give a cogent semantic justification for turning PageRank thus on its head. In marked deviation from prior work, our edge resistances are learnt from training data. Inference and learning are NP-hard, but we give practical solutions. In extensive experiments with subtopic retrieval, social network search, and document summarization, our approach convincingly surpasses recently-published diversity algorithms like subtopic cover, max-marginal relevance (MMR), Grasshopper, DivRank, and SVMdiv.
Abstract:
The rapid growth in the field of data mining has led to the development of various methods for outlier detection. Though detection of outliers has been well explored in the context of numerical data, dealing with categorical data is still evolving. In this paper, we propose a two-phase algorithm for detecting outliers in categorical data based on a novel definition of outliers. In the first phase, this algorithm explores a clustering of the given data, followed by the ranking phase for determining the set of most likely outliers. The proposed algorithm is expected to perform better as it can identify different types of outliers, employing two independent ranking schemes based on the attribute value frequencies and the inherent clustering structure in the given data. Unlike some existing methods, the computational complexity of this algorithm is not affected by the number of outliers to be detected. The efficacy of this algorithm is demonstrated through experiments on various public domain categorical data sets.
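One of the two ranking ideas mentioned in this abstract, scoring records by attribute value frequencies, might look like the sketch below. It is not the paper's two-phase algorithm: the clustering phase is omitted, and the scoring rule (mean frequency of a record's values, lowest score first) and the function names are assumptions.

```python
from collections import Counter


def value_frequency_scores(records):
    """Mean frequency of each record's attribute values within its column."""
    columns = list(zip(*records))
    freq = [Counter(column) for column in columns]
    return [
        sum(freq[j][value] for j, value in enumerate(record)) / len(record)
        for record in records
    ]


def rank_outliers(records, top=1):
    """Indices of the `top` records with the rarest attribute values."""
    scores = value_frequency_scores(records)
    order = sorted(range(len(records)), key=lambda i: scores[i])
    return order[:top]


if __name__ == "__main__":
    # Toy categorical data set, for illustration only.
    records = [
        ("red", "small"),
        ("red", "small"),
        ("red", "large"),
        ("blue", "large"),
        ("red", "small"),
    ]
    print(rank_outliers(records, top=2))
```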
Abstract:
This paper primarily intends to develop a GIS (geographical information system)-based data mining approach for optimally selecting the locations and determining installed capacities for setting up distributed biomass power generation systems in the context of decentralized energy planning for rural regions. The optimal locations within a cluster of villages are obtained by matching the installed capacity needed with the demand for power, minimizing the cost of transportation of biomass from dispersed sources to the power generation system, and the cost of distribution of electricity from the power generation system to demand centers or villages. The methodology was validated by using it for developing an optimal plan for implementing distributed biomass-based power systems for meeting the rural electricity needs of Tumkur district in India consisting of 2700 villages. The approach uses a k-medoid clustering algorithm to divide the total region into clusters of villages and locate biomass power generation systems at the medoids. The optimal value of k is determined iteratively by running the algorithm for the entire search space for different values of k along with demand-supply matching constraints. The optimal value of k is chosen such that it minimizes the total cost of system installation, costs of transportation of biomass, and transmission and distribution. A smaller region, consisting of 293 villages, was selected to study the sensitivity of the results to varying demand and supply parameters. The results of clustering are represented on a GIS map for the region.
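The k-medoid clustering step can be sketched as follows, using a simple alternating scheme (assign points to the nearest medoid, then re-pick each cluster's medoid). Euclidean distance stands in for the transport and distribution costs, the initialisation is deliberately naive, and the demand-supply constraints and the search over k are left out; all of these are assumptions rather than the paper's procedure.

```python
import math


def assign(points, medoids):
    """Map each medoid index to the indices of the points closest to it."""
    clusters = {m: [] for m in medoids}
    for i, p in enumerate(points):
        nearest = min(medoids, key=lambda m: math.dist(p, points[m]))
        clusters[nearest].append(i)
    return clusters


def total_cost(points, clusters):
    """Sum of distances from every point to the medoid of its cluster."""
    return sum(
        math.dist(points[i], points[m])
        for m, members in clusters.items()
        for i in members
    )


def k_medoids(points, k, max_iter=100):
    """Alternate assignment and medoid update until the medoids stop changing."""
    medoids = list(range(k))  # naive initialisation: the first k points
    for _ in range(max_iter):
        clusters = assign(points, medoids)
        new_medoids = [
            min(members, key=lambda c: sum(math.dist(points[c], points[i]) for i in members))
            if members else m  # keep the old medoid if its cluster is empty
            for m, members in clusters.items()
        ]
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    clusters = assign(points, medoids)
    return medoids, clusters, total_cost(points, clusters)


if __name__ == "__main__":
    # Toy village coordinates, for illustration only.
    villages = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
    print(k_medoids(villages, k=2))
```

Running `k_medoids` for several values of k and comparing the returned cost, plus a per-site installation cost, is one way to approximate the selection of k that the abstract describes.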
Abstract:
Background: Recent research on glioblastoma (GBM) has focused on deducing gene signatures predicting prognosis. The present study evaluated the mRNA expression of selected genes and correlated it with outcome to arrive at a prognostic gene signature. Methods: Patients with GBM (n = 123) were prospectively recruited, treated with a uniform protocol and followed up. Expression of 175 genes in GBM tissue was determined using qRT-PCR. A supervised principal component analysis followed by derivation of gene signature was performed. Independent validation of the signature was done using TCGA data. Gene Ontology and KEGG pathway analysis was carried out among patients from the TCGA cohort. Results: A 14-gene signature was identified that predicted outcome in GBM. A weighted gene (WG) score was found to be an independent predictor of survival in multivariate analysis in the present cohort (HR = 2.507; B = 0.919; p < 0.001) and in the TCGA cohort. Risk stratification by standardized WG score classified patients into low and high risk groups, predicting survival both in our cohort (p < 0.001) and the TCGA cohort (p = 0.001). Pathway analysis using the most differentially regulated genes (n = 76) between the low and high risk groups revealed association of activated inflammatory/immune response pathways and mesenchymal subtype in the high risk group. Conclusion: We have identified a 14-gene expression signature that can predict survival in GBM patients. A network analysis revealed activation of the inflammatory response pathway specifically in the high risk group. These findings may have implications in understanding of gliomagenesis, development of targeted therapies and selection of high risk cancer patients for alternate adjuvant therapies.
Abstract:
Background: Insulin like growth factor binding proteins modulate the mitogenic and pro survival effects of IGF. Elevated expression of IGFBP2 is associated with progression of tumors that include prostate, ovarian, glioma among others. Though implicated in the progression of breast cancer, the molecular mechanisms involved in IGFBP2 actions are not well defined. This study investigates the molecular targets and biological pathways targeted by IGFBP2 in breast cancer. Methods: Transcriptome analysis of breast tumor cells (BT474) with stable knockdown of IGFBP2 and breast tumors having differential expression of IGFBP2 by immunohistochemistry was performed using microarray. Differential gene expression was established using R-Bioconductor package. For validation, gene expression was determined by qPCR. Inhibitors of IGF1R and integrin pathway were utilized to study the mechanism of regulation of beta-catenin. Immunohistochemical and immunocytochemical staining was performed on breast tumors and experimental cells, respectively for beta-catenin and IGFBP2 expression. Results: Knockdown of IGFBP2 resulted in differential expression of 2067 up regulated and 2002 down regulated genes in breast cancer cells. Down regulated genes principally belong to cell cycle, DNA replication, repair, p53 signaling, oxidative phosphorylation, Wnt signaling. Whole genome expression analysis of breast tumors with or without IGFBP2 expression indicated changes in genes belonging to Focal adhesion, Map kinase and Wnt signaling pathways. Interestingly, IGFBP2 knockdown clones showed reduced expression of beta-catenin compared to control cells which was restored upon IGFBP2 re-expression. The regulation of beta-catenin by IGFBP2 was found to be IGF1R and integrin pathway dependent. Furthermore, IGFBP2 and beta-catenin are co-ordinately overexpressed in breast tumors and correlate with lymph node metastasis. 
Conclusion: This study highlights regulation of beta-catenin by IGFBP2 in breast cancer cells and most importantly, combined expression of IGFBP2 and beta-catenin is associated with lymph node metastasis of breast tumors.
Abstract:
A causative agent in approximately 40% of diarrhea cases still remains unidentified. Though many enteroviruses (EVs) are transmitted through the fecal-oral route and replicate in the intestinal cells, their association with acute diarrhea has not so far been recognized due to lack of detailed epidemiological investigations. This long-term, detailed molecular epidemiological study aims to conclusively determine the association of non-polio enteroviruses (NPEVs) with acute diarrhea in comparison with rotavirus (RV) in children. Diarrheal stool specimens from 2161 children aged 0-2 years and 169 children between 2 and 9 years, and 1800 normal stool samples from age-matched healthy children between 0 and 9 years were examined during 2008-2012 for enterovirus (oral polio vaccine strains (OPVs) and NPEVs). Enterovirus serotypes were identified by complete VP1 gene sequence analysis. Enterovirus and rotavirus were detected in 19.01% (380/2330) and 13.82% (322/2330) diarrheal stools. During the study period, annual prevalence of EV- and RV-associated diarrhea ranged between 8% and 22%, but with contrasting seasonal prevalence with RV predominating during winter months and NPEV prevailing in other seasons. NPEVs are associated with epidemics-like outbreaks during which they are detected in up to 50% of diarrheic children, and in non-epidemic seasons in 0-10% of the patients. After subtraction of OPV-positive diarrheal cases (1.81%), while NPEVs are associated with about 17% of acute diarrhea, about 6% of healthy children showed asymptomatic NPEV excretion. Of 37 NPEV serotypes detected in diarrheal children, seven echovirus types 1, 7, 11, 13, 14, 30 and 33 are frequently observed, with E11 being more prevalent followed by E30. In conclusion, NPEVs are significantly associated with acute diarrhea, and NPEVs and rotavirus exhibit contrasting seasonal predominance.
This study signifies the need for a new direction of research on enteroviruses involving systematic analysis of their contribution to diarrheal burden. (C) 2013 Elsevier B.V. All rights reserved.
Abstract:
Mycobacterium tuberculosis owes its high pathogenic potential to its ability to evade host immune responses and thrive inside the macrophage. The outcome of infection is largely determined by the cellular response comprising a multitude of molecular events. The complexity and inter-relatedness in the processes makes it essential to adopt systems approaches to study them. In this work, we construct a comprehensive network of infection-related processes in a human macrophage comprising 1888 proteins and 14,016 interactions. We then compute response networks based on available gene expression profiles corresponding to states of health, disease and drug treatment. We use a novel formulation for mining response networks that has led to identifying highest activities in the cell. Highest activity paths provide mechanistic insights into pathogenesis and response to treatment. The approach used here serves as a generic framework for mining dynamic changes in genome-scale protein interaction networks.
Abstract:
There is a growing recognition of the need to integrate non-trophic interactions into ecological networks for a better understanding of whole-community organization. To achieve this, the first step is to build networks of individual non-trophic interactions. In this study, we analyzed a network of interdependencies among bird species that participated in heterospecific foraging associations (flocks) in an evergreen forest site in the Western Ghats, India. We found the flock network to contain a small core of highly important species that other species are strongly dependent on, a pattern seen in many other biological networks. Further, we found that structural importance of species in the network was strongly correlated to functional importance of species at the individual flock level. Finally, comparisons with flock networks from other Asian forests showed that the same taxonomic groups were important in general, suggesting that species importance was an intrinsic trait and not dependent on local ecological conditions. Hence, given a list of species in an area, it may be possible to predict which ones are likely to be important. Our study provides a framework for the investigation of other heterospecific foraging associations and associations among species in other non-trophic contexts.
Abstract:
In a typical enterprise WLAN, a station has a choice of multiple access points to associate with. The default association policy is based on metrics such as Received Signal Strength (RSS) and "link quality" to choose a particular access point among many. Such an approach can lead to unequal load sharing and diminished system performance. We consider the RAT (Rate And Throughput) policy [1], which leads to better system performance. The RAT policy has been implemented on a home-grown centralized WLAN controller, ADWISER [2], and we demonstrate that the RAT policy indeed provides better system performance.
Abstract:
Bird species are hypothesized to join mixed-species flocks (flocks hereon) either for direct foraging or anti-predation-related benefits. In this study, conducted in a tropical evergreen forest in the Western Ghats of India, we used intra-flock association patterns to generate a community-wide assessment of flocking benefits for different species. We assumed that individuals needed to be physically proximate to particular heterospecific individuals within flocks to obtain any direct foraging benefit (flushed prey, kleptoparasitism, copying foraging locations). Alternatively, for anti-predation benefits, physical proximity to particular heterospecifics is not required, i.e. just being in the flock vicinity can suffice. Therefore, we used choice of locations within flocks to infer whether individual species are obtaining direct foraging or anti-predation benefits. A small subset of the bird community (5/29 species), composed of all members of the sallying guild, showed non-random physical proximity to heterospecifics within flocks. All preferred associates were from non-sallying guilds, suggesting that the sallying species were likely obtaining direct foraging benefits either in the form of flushed or kleptoparasitized prey. The majority of the species (24/29) chose locations randomly with respect to heterospecifics within flocks and, thus, were likely obtaining anti-predation benefits. In summary, our study indicates that direct foraging benefits are important for only a small proportion of species in flocks and that predation is likely to be the main driver of flocking for most participants. Beyond these findings, our study provides methodological advances that might be useful in understanding asymmetric interactions in social groups of single and multiple species.