959 results for Association rule mining


Relevance:

30.00% 30.00%

Publisher:

Abstract:

BACKGROUND: The annotation of protein post-translational modifications (PTMs) is an important task of UniProtKB curators and, with continuing improvements in experimental methodology, an ever greater number of articles are being published on this topic. To help curators cope with this growing body of information we have developed a system which extracts information from the scientific literature for the most frequently annotated PTMs in UniProtKB. RESULTS: The procedure uses a pattern-matching and rule-based approach to extract sentences with information on the type and site of modification. A ranked list of protein candidates for the modification is also provided. For PTM extraction, precision varies from 57% to 94%, and recall from 75% to 95%, according to the type of modification. The procedure was used to track new publications on PTMs and to recover potential supporting evidence for phosphorylation sites annotated based on the results of large-scale proteomics experiments. CONCLUSIONS: The information retrieval and extraction method we have developed in this study forms the basis of a simple tool for the manual curation of protein post-translational modifications in UniProtKB/Swiss-Prot. Our work demonstrates that even simple text-mining tools can be effectively adapted for database curation tasks, provided that a thorough understanding of the working process and requirements is first obtained. This system can be accessed at http://eagl.unige.ch/PTM/.
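The pattern-matching step described in this abstract can be illustrated with a minimal sketch. The regular expression, the short lists of modification stems and residues, and the function name `extract_ptm_mentions` are all assumptions made for illustration; they are not the rules used by the system described above.

```python
import re

# Hypothetical pattern: a modification word followed, within a short window,
# by a residue and position such as "Ser473" or "Lys-382".
PTM_PATTERN = re.compile(
    r"\b(?P<type>phosphorylat|acetylat|methylat|ubiquitinat)(?:ion|ed)\b"
    r".{0,40}?"
    r"\b(?P<residue>Ser|Thr|Tyr|Lys|Arg)-?(?P<position>\d+)\b",
    re.IGNORECASE,
)


def extract_ptm_mentions(sentence):
    """Return (modification type, residue, position) tuples found in a sentence."""
    mentions = []
    for match in PTM_PATTERN.finditer(sentence):
        # Normalise the stem to the noun form, e.g. "phosphorylat" -> "phosphorylation".
        mod_type = match.group("type").lower() + "ion"
        mentions.append((mod_type, match.group("residue"), int(match.group("position"))))
    return mentions


example = "Akt is phosphorylated at Ser473 in response to insulin."
mentions = extract_ptm_mentions(example)
```

A real curation tool would need far broader vocabularies and a way to rank candidate proteins, which this sketch does not attempt.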

Relevance:

30.00% 30.00%

Publisher:

Abstract:

It is common practice in genome-wide association studies (GWAS) to focus on the relationship between disease risk and genetic variants one marker at a time. When relevant genes are identified it is often possible to implicate biological intermediates and pathways likely to be involved in disease aetiology. However, single genetic variants typically explain small amounts of disease risk. Our idea is to construct allelic scores that explain greater proportions of the variance in biological intermediates, and subsequently use these scores to data mine GWAS. To investigate the approach's properties, we indexed three biological intermediates where the results of large GWAS meta-analyses were available: body mass index, C-reactive protein and low density lipoprotein levels. We generated allelic scores in the Avon Longitudinal Study of Parents and Children, and in publicly available data from the first Wellcome Trust Case Control Consortium. We compared the explanatory ability of allelic scores in terms of their capacity to proxy for the intermediate of interest, and the extent to which they associated with disease. We found that allelic scores derived from known variants and allelic scores derived from hundreds of thousands of genetic markers explained significant portions of the variance in biological intermediates of interest, and many of these scores showed expected correlations with disease. Genome-wide allelic scores, however, tended to lack specificity, suggesting that they should be used with caution and perhaps only to proxy biological intermediates for which there are no known individual variants. Power calculations confirm the feasibility of extending our strategy to the analysis of tens of thousands of molecular phenotypes in large genome-wide meta-analyses.
We conclude that our method represents a simple way in which potentially tens of thousands of molecular phenotypes could be screened for causal relationships with disease without having to expensively measure these variables in individual disease collections.
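As a rough illustration of what an allelic score is, the sketch below computes a weighted sum of effect-allele counts for one individual. The variant identifiers, the weights and the function name `allelic_score` are hypothetical, and the abstract does not state how the authors weighted or selected variants.

```python
def allelic_score(genotypes, weights):
    """Weighted sum of effect-allele counts for one individual.

    genotypes: dict mapping variant id -> number of effect alleles carried (0, 1 or 2)
    weights:   dict mapping variant id -> per-allele weight (for example, an effect
               size taken from a published meta-analysis)
    Variants that are missing from the genotype data are skipped.
    """
    return sum(w * genotypes[v] for v, w in weights.items() if v in genotypes)


# Hypothetical weights and genotype, chosen only to make the arithmetic easy to follow.
weights = {"rs_a": 0.5, "rs_b": 0.25, "rs_c": -0.125}
person = {"rs_a": 2, "rs_b": 1, "rs_c": 0}
score = allelic_score(person, weights)  # 0.5*2 + 0.25*1 + (-0.125)*0
```

The score can then stand in for the biological intermediate when testing for association with disease.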

Relevance:

30.00% 30.00%

Publisher:

Abstract:

Airborne particles can come from a variety of sources and contain variable chemical constituents. Some particles are formed by natural processes, such as volcanoes, erosion, sea spray, and forest fires, while others are formed by anthropogenic processes, such as industrial- and motor vehicle-related combustion, road-related wear, and mining. In general, larger particles (those greater than 2.5 μm) are formed by mechanical processes, while those less than 2.5 μm are formed by combustion processes. The chemical composition of particles is highly influenced by the source: for combustion-related particles, factors such as temperature of combustion, fuel type, and presence of oxygen or other gases can also have a large impact on PM composition. These differences can often be observed at a regional level, such as the greater sulphate-composition of PM in regions that burn coal for electricity production (which contains sulphur) versus regions that do not. Most countries maintain air monitoring networks, and studies based on the resulting data are the most common basis for epidemiology studies on the health effects of PM. Data from these monitoring stations can be used to evaluate the relationship between community-level exposure to ambient particles and health outcomes (i.e., morbidity or mortality from various causes). Respiratory and cardiovascular outcomes are the most commonly assessed, although studies have also considered other related specific outcomes such as diabetes and congenital heart disease. The data on particle characteristics is usually not very detailed and most often includes some combination of PM2.5, PM10, sulphate, and NO2. Other descriptors that are less commonly found include particle number (ultrafine particles), metal components of PM, local traffic intensity, and elemental and organic carbon (EC/OC). Measures of association are usually reported per 10 μg/m³ or interquartile range increase in pollutant concentration.
As the exposure data are taken from regional monitoring stations, the measurements are not representative of an individual's exposure. Particle size is an important descriptor for understanding where in the human respiratory system the particles will deposit: as a general rule, smaller particles penetrate to deeper regions of the lungs. Initial studies on the health effects of particulate matter focused on mass of the particles, including either all particles (often termed total suspended particulate or TSP) or PM10 (all particles with an aerodynamic diameter less than 10 μm). More recently, studies have considered both PM10 and PM2.5, with the latter corresponding more directly to combustion-related processes. Ultrafine particles (UFPs) are a dominant source of particles in terms of particle number concentration (PNC), yet are negligible in terms of mass. Very few epidemiology studies have measured the effect of UFPs on health; however, the number of studies on this topic is increasing. In addition to size, chemical composition is of importance when understanding the toxicity of particles. Some studies consider the composition of particles in addition to mass; however, this is not common, in part due to the cost and labour involved in such analyses.

Relevance:

30.00% 30.00%

Publisher:

Abstract:

This note corrects a previous treatment of algorithms for the metric DTR, Depth by the Rule.

Relevance:

30.00% 30.00%

Publisher:

Abstract:

The Genetic Analysis Workshop 15 (GAW15) problem 1 contained baseline expression levels of 8793 genes in immortalised B cells from 194 individuals in 14 Centre d’Etude du Polymorphisme Humain (CEPH) Utah pedigrees. Previous analysis of the data showed linkage and association and evidence of substantial individual variations. In particular, correlation was examined on expression levels of 31 genes and 25 target genes corresponding to two master regulatory regions. In this analysis, we apply Bayesian network analysis to gain further insight into these findings. We identify strong dependences and therefore provide additional insight into the underlying relationships between the genes involved. More generally, the approach is expected to be applicable for integrated analysis of genes on biological pathways.

Relevance:

30.00% 30.00%

Publisher:

Abstract:

This is a report on the data-mining of two chess databases, the objective being to compare their sub-7-man content with perfect play as documented in Nalimov endgame tables. Van der Heijden’s ENDGAME STUDY DATABASE IV is a definitive collection of 76,132 studies in which White should have an essentially unique route to the stipulated goal. Chessbase’s BIG DATABASE 2010 holds some 4.5 million games. Insight gained into both database content and data-mining has led to some delightful surprises and created a further agenda.

Relevance:

30.00% 30.00%

Publisher:

Abstract:

In a world where massive amounts of data are recorded on a large scale, we need data mining technologies to gain knowledge from the data in a reasonable time. The Top Down Induction of Decision Trees (TDIDT) algorithm is a very widely used technology to predict the classification of newly recorded data. However, alternative technologies have been derived that often produce better rules but do not scale well on large datasets. One such alternative to TDIDT is the PrismTCS algorithm. PrismTCS performs particularly well on noisy data but does not scale well on large datasets. In this paper we introduce Prism and investigate its scaling behaviour. We describe how we improved the scalability of the serial version of Prism and investigate its limitations. We then describe our work to overcome these limitations by developing a framework to parallelise algorithms of the Prism family and similar algorithms. We also present the scale-up results of a first prototype implementation.

Relevance:

30.00% 30.00%

Publisher:

Abstract:

In a world where data is captured on a large scale, the major challenge for data mining algorithms is to be able to scale up to large datasets. There are two main approaches to inducing classification rules: one is the divide and conquer approach, also known as the top down induction of decision trees; the other is called the separate and conquer approach. A considerable amount of work has been done on scaling up the divide and conquer approach. However, very little work has been conducted on scaling up the separate and conquer approach. In this work we describe a parallel framework that allows the parallelisation of a certain family of separate and conquer algorithms, the Prism family. Parallelisation helps the Prism family of algorithms to harvest additional computer resources in a network of computers in order to make the induction of classification rules scale better on large datasets. Our framework also incorporates a pre-pruning facility for parallel Prism algorithms.
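The separate and conquer idea can be sketched in a few lines. The following is a simplified, serial illustration in the spirit of Prism for categorical attributes only; the function name `induce_rules`, the tie-breaking order and the handling of inconsistent data are assumptions, and nothing here reflects the parallel framework or the pre-pruning facility described in the abstract.

```python
def induce_rules(rows, attributes, target):
    """Induce (conditions, label) rules by separate and conquer.

    rows: list of dicts mapping attribute name -> categorical value
    attributes: names of the attributes that may appear in rule conditions
    target: name of the class attribute
    """
    rules = []
    for label in sorted({r[target] for r in rows}):
        remaining = list(rows)  # each class starts again from the full training set
        while any(r[target] == label for r in remaining):
            covered, conditions = remaining, {}
            # Conquer: add attribute-value terms until only `label` is covered.
            while any(r[target] != label for r in covered):
                best = None
                for attr in attributes:
                    if attr in conditions:
                        continue
                    for value in sorted({r[attr] for r in covered}):
                        subset = [r for r in covered if r[attr] == value]
                        hits = sum(r[target] == label for r in subset)
                        # Prefer the highest precision, then the highest coverage.
                        key = (hits / len(subset), hits)
                        if best is None or key > best[0]:
                            best = (key, attr, value, subset)
                if best is None:
                    break  # no attributes left; inconsistent data leaves an impure rule
                _, attr, value, covered = best
                conditions[attr] = value
            rules.append((dict(conditions), label))
            # Separate: drop every instance that the new rule covers.
            remaining = [
                r for r in remaining
                if not all(r[a] == v for a, v in conditions.items())
            ]
    return rules


# Hypothetical toy data set.
data = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]
rules = induce_rules(data, ["outlook", "windy"], "play")
```

Each rule is built independently of the others, which is what makes the resulting rule set modular rather than tree-shaped.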

Relevance:

30.00% 30.00%

Publisher:

Abstract:

Induction of classification rules is one of the most important technologies in data mining. Most of the work in this field has concentrated on the Top Down Induction of Decision Trees (TDIDT) approach. However, alternative approaches have been developed such as the Prism algorithm for inducing modular rules. Prism often produces qualitatively better rules than TDIDT but suffers from higher computational requirements. We investigate approaches that have been developed to minimize the computational requirements of TDIDT, in order to find analogous approaches that could reduce the computational requirements of Prism.

Relevance:

30.00% 30.00%

Publisher:

Abstract:

The fast increase in the size and number of databases demands data mining approaches that are scalable to large amounts of data. This has led to the exploration of parallel computing technologies in order to perform data mining tasks concurrently using several processors. Parallelization seems to be a natural and cost-effective way to scale up data mining technologies. One of the most important of these data mining technologies is the classification of newly recorded data. This paper surveys advances in parallelization in the field of classification rule induction.

Relevance:

30.00% 30.00%

Publisher:

Abstract:

Advances in hardware and software in the past decade make it possible to capture, record and process fast data streams at a large scale. The research area of data stream mining has emerged as a consequence of these advances in order to cope with the real-time analysis of potentially large and changing data streams. Examples of data streams include Google searches, credit card transactions, telemetric data and data from continuous chemical production processes. In some cases the data can be processed in batches by traditional data mining approaches. However, some applications require the data to be analysed in real time, as soon as it is captured, for example when the data stream is infinite, fast changing, or simply too large to be stored. One of the most important data mining techniques on data streams is classification. This involves training the classifier on the data stream in real time and adapting it to concept drifts. Most data stream classifiers are based on decision trees. However, it is well known in the data mining community that there is no single optimal algorithm. An algorithm may work well on one or several datasets but badly on others. This paper introduces eRules, a new rule-based adaptive classifier for data streams, based on an evolving set of Rules. eRules induces a set of rules that is constantly evaluated and adapted to changes in the data stream by adding new rules and removing old ones. It differs from the more popular decision-tree-based classifiers in that it tends to leave data instances unclassified rather than forcing a classification that could be wrong. The ongoing development of eRules aims to improve its accuracy further through dynamic parameter setting, which will also address the problem of changing feature domain values.

Relevance:

30.00% 30.00%

Publisher:

Abstract:

Advances in hardware and software technologies make it possible to capture streaming data. The area of Data Stream Mining (DSM) is concerned with the analysis of these vast amounts of data as they are generated in real time. Data stream classification is one of the most important DSM techniques, allowing previously unseen data instances to be classified. Unlike traditional classifiers for static data, data stream classifiers need to adapt to concept changes (concept drift) in the stream in real time in order to reflect the most recent concept in the data as accurately as possible. A recent addition to the data stream classifier toolbox is eRules, which induces and updates a set of expressive rules that can easily be interpreted by humans. However, like most rule-based data stream classifiers, eRules exhibits poor computational performance when confronted with continuous attributes. In this work, we propose an approach to deal with continuous data effectively and accurately in rule-based classifiers by using the Gaussian distribution as a heuristic for building rule terms on continuous attributes. Using eRules as an example, we show that incorporating our method for continuous attributes indeed speeds up the real-time rule induction process while maintaining a level of accuracy similar to that of the original eRules classifier. We call this new version of eRules, which incorporates our approach, G-eRules.
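The abstract does not spell out how the Gaussian distribution is turned into a rule term, so the sketch below shows only one plausible reading: summarise the values of a continuous attribute seen for a class by their mean and standard deviation, and use an interval around the mean as the term. The function names and the `width` parameter are assumptions, not the G-eRules procedure.

```python
from statistics import mean, stdev


def gaussian_rule_term(values, width=1.0):
    """Return (low, high) bounds for a rule term of the form low <= x <= high.

    values: attribute values observed for the target class (at least two values)
    width:  hypothetical parameter giving the half-width in standard deviations
    """
    mu = mean(values)
    sigma = stdev(values)
    return (mu - width * sigma, mu + width * sigma)


def term_covers(term, x):
    """Check whether a value satisfies the interval rule term."""
    low, high = term
    return low <= x <= high


term = gaussian_rule_term([1.0, 2.0, 3.0])
```

The appeal of such a heuristic is that one summary per class replaces a search over many candidate split points.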

Relevance:

30.00% 30.00%

Publisher:

Abstract:

The most popular endgame tables (EGTs) documenting ‘DTM’ Depth to Mate in chess endgames are those of Eugene Nalimov but these do not recognise the FIDE 50-move rule ‘50mr’. This paper marks the creation by the first author of EGTs for sub-6-man (s6m) chess and beyond which give DTM as affected by the ply count pc. The results are put into the context of previous work recognising the 50mr and are compared with the original unmoderated DTM results. The work is also notable for being the first EGT generation work to use the functional programming language HASKELL.

Relevance:

30.00% 30.00%

Publisher:

Abstract:

One way to organize knowledge and make its search and retrieval easier is to create a structural representation divided by hierarchically related topics. Once this structure is built, it is necessary to find labels for each of the obtained clusters. In many cases the labels have to be built using only the terms in the documents of the collection. This paper presents the SeCLAR (Selecting Candidate Labels using Association Rules) method, which explores the use of association rules for the selection of good candidates for labels of hierarchical document clusters. The candidates are processed by a classical method to generate the labels. The idea of the proposed method is to process each parent-child relationship of the nodes as an antecedent-consequent relationship of association rules. The experimental results show that the proposed method can improve the precision and recall of labels obtained by classical methods. © 2010 Springer-Verlag.
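To make the antecedent-consequent framing concrete, the sketch below computes the support and confidence of a rule between terms over a small collection of documents, each treated as a set of terms. This is the generic association-rule calculation, not the SeCLAR method itself; the documents and function names are hypothetical.

```python
def support(docs, items):
    """Fraction of documents that contain every term in `items`."""
    items = set(items)
    return sum(items <= doc for doc in docs) / len(docs)


def confidence(docs, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent over the documents."""
    base = support(docs, antecedent)
    if base == 0:
        return 0.0
    return support(docs, set(antecedent) | set(consequent)) / base


# Hypothetical documents, each reduced to its set of terms.
docs = [
    {"mining", "rules", "data"},
    {"mining", "rules"},
    {"mining", "streams"},
    {"chess", "endgame"},
]
# A term from a parent cluster as antecedent, a term from a child cluster as consequent.
rule_confidence = confidence(docs, {"mining"}, {"rules"})
```

Terms that appear in high-confidence rules between a parent and its child would be natural candidates to pass on to the labelling step.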

Relevance:

30.00% 30.00%

Publisher:

Abstract:

This study aimed to evaluate the sediment quality in the estuarine protected area known as Cananéia-Iguape-Peruíbe (CIP-PA), located on the southeastern coast of Brazil. The study was designed considering possible negative effects induced by the city of Cananéia on the sediment quality of surrounding areas. This evaluation was performed using chemical and ecotoxicological analyses. Sediments were predominantly sandy, with low CaCO3 contents. Amounts of organic matter varied, but higher contents occurred closer to the city, as did Fe and Total Recoverable Oils and Greases (TROGs) concentrations. Contamination by Cd and Cu was revealed in some samples, while concentrations of Zn were considered low. Chronic toxicity was detected in all tested sediments and acute toxicity occurred only in sediments collected near the city. The principal component analysis (PCA) revealed an association among Cd, Cu, Fe, TROG, fines, organic matter, CaCO3, and chronic toxicity, whereas acute toxicity was found to be associated with Zn and mud. However, because Zn levels were low, acute toxicity was likely due to a contaminant that was not measured. Results show that there is a broad area within the CIP-PA that is under the influence of mining activities (chronic toxicity, moderate contamination by metals), whereas poorer conditions occur closer to Cananéia (acute toxicity); thus, the urban area seems to constitute a relevant source of contaminants for the estuarine complex. These results show that contamination is already capable of producing risks for the local aquatic biota, which suggests that the CIP-PA effectiveness in protecting estuarine biota may be threatened.