13 results for sequence data mining

in QUB Research Portal - Research Directory and Institutional Repository for Queen's University Belfast


Relevance:

100.00%

Publisher:

Abstract:

In the last decade, data mining has emerged as one of the most dynamic and lively areas in information technology. Although many algorithms and techniques for data mining have been proposed, they either focus on domain independent techniques or on very specific domain problems. A general requirement in bridging the gap between academia and business is to cater to general domain-related issues surrounding real-life applications, such as constraints, organizational factors, domain expert knowledge, domain adaption, and operational knowledge. Unfortunately, these either have not been addressed, or have not been sufficiently addressed, in current data mining research and development. Domain-Driven Data Mining (D3M) aims to develop general principles, methodologies, and techniques for modeling and merging comprehensive domain-related factors and synthesized ubiquitous intelligence surrounding problem domains with the data mining process, and discovering knowledge to support business decision-making. This paper aims to report original, cutting-edge, and state-of-the-art progress in D3M. It covers theoretical and applied contributions aiming to: 1) propose next-generation data mining frameworks and processes for actionable knowledge discovery, 2) investigate effective (automated, human and machine-centered and/or human-machine-cooperated) principles and approaches for acquiring, representing, modelling, and engaging ubiquitous intelligence in real-world data mining, and 3) develop workable and operational systems balancing technical significance and application concerns, and converting and delivering actionable knowledge into operational application rules to seamlessly engage application processes and systems.

Relevance:

100.00%

Publisher:

Abstract:

Background. The assembly of the tree of life has seen significant progress in recent years but algae and protists have been largely overlooked in this effort. Many groups of algae and protists have ancient roots and it is unclear how much data will be required to resolve their phylogenetic relationships for incorporation in the tree of life. The red algae, a group of primary photosynthetic eukaryotes more than a billion years old, provide the earliest fossil evidence for eukaryotic multicellularity and sexual reproduction. Despite this evolutionary significance, their phylogenetic relationships are understudied. This study aims to infer a comprehensive red algal tree of life at the family level from a supermatrix containing data mined from GenBank. We aim to locate remaining regions of low support in the topology, evaluate their causes and estimate the amount of data required to resolve them.

Results. Phylogenetic analysis of a supermatrix of 14 loci and 98 red algal families yielded the most complete red algal tree of life to date. Visualization of statistical support showed the presence of five poorly supported regions. Causes for low support were identified with statistics about the age of the region, data availability and node density, showing that poor support has different origins in different parts of the tree. Parametric simulation experiments yielded optimistic estimates of how much data will be needed to resolve the poorly supported regions (ca. 10³ to ca. 10⁴ nucleotides for the different regions). Nonparametric simulations gave a markedly more pessimistic image, some regions requiring more than 2.8 × 10⁵ nucleotides or not achieving the desired level of support at all. The discrepancies between parametric and nonparametric simulations are discussed in light of our dataset and known attributes of both approaches.

Conclusions. Our study takes the red algae one step closer to meaningful inclusion in the tree of life. In addition to the recovery of stable relationships, the recognition of five regions in need of further study is a significant outcome of this work. Based on our analyses of current availability and future requirements of data, we make clear recommendations for forthcoming research.

Relevance:

100.00%

Publisher:

Abstract:

We conducted data-mining analyses of genome wide association (GWA) studies of the CATIE and MGS-GAIN datasets, and found 13 markers in the two physically linked genes, PTPN21 and EML5, showing nominally significant association with schizophrenia. Linkage disequilibrium (LD) analysis indicated that all 7 markers from PTPN21 shared high LD (r² > 0.8), including rs2274736 and rs2401751, the two non-synonymous markers with the most significant association signals (rs2401751, P = 1.10 × 10⁻³ and rs2274736, P = 1.21 × 10⁻³). In a meta-analysis of all 13 replication datasets with a total of 13,940 subjects, we found that the two non-synonymous markers are significantly associated with schizophrenia (rs2274736, OR = 0.92, 95% CI: 0.86-0.97, P = 5.45 × 10⁻³ and rs2401751, OR = 0.92, 95% CI: 0.86-0.97, P = 5.29 × 10⁻³). One SNP (rs7147796) in EML5 is also significantly associated with the disease (OR = 1.08, 95% CI: 1.02-1.14, P = 6.43 × 10⁻³). These 3 markers remain significant after Bonferroni correction. Furthermore, haplotype conditioned analyses indicated that the association signals observed between rs2274736/rs2401751 and rs7147796 are statistically independent. Given the results that 2 non-synonymous markers in PTPN21 are associated with schizophrenia, further investigation of this locus is warranted.

Relevance:

100.00%

Publisher:

Abstract:

The skin of fish is the first line of defense against pathogens and parasites. The skin transcriptome of the Atlantic salmon is poorly characterized, and currently only 2,089 expressed sequence tags (ESTs) out of a total of half a million sequences have been generated from skin-derived cDNA libraries. The primary aim of this study was to enhance the transcriptomic knowledge of salmon skin by using next-generation sequencing (NGS) technology, namely the Roche-454 platform. An equimolar mixture of high-quality RNA from skin and epidermal samples of salmon reared in either freshwater or seawater was used for 454-sequencing. This technique yielded over 600,000 reads, which were assembled into 34,696 isotigs using Newbler. Of these isotigs, 12% had not been sequenced in Atlantic salmon, hence representing previously unreported salmon mRNAs that can potentially be skin-specific. Many full-length genes have been acquired, representing numerous biological processes. Mucin proteins are the main structural component of mucus and we examined in greater detail the sequences we obtained for these genes. Several isotigs exhibited homology to mammalian mucins (MUC2, MUC5AC and MUC5B). Mucin mRNAs are generally > 10 kbp and contain large repetitive units, which pose a challenge for full-length sequence discovery. To date, we have not unearthed any full-length salmon mucin genes with this dataset, but have both N- and C-terminal regions of a mucin type 5. This highlights the fact that, while NGS is indeed a formidable tool for sequence data mining of non-model species, it must be complemented with additional experimental and bioinformatic work to characterize some mRNA sequences with complex features.

Relevance:

100.00%

Publisher:

Abstract:

Promoter hypermethylation is central in deregulating gene expression in cancer. Identification of novel methylation targets in specific cancers provides a basis for their use as biomarkers of disease occurrence and progression. We developed an in silico strategy to globally identify potential targets of promoter hypermethylation in prostate cancer by screening for 5' CpG islands in 631 genes that were reported as downregulated in prostate cancer. A virtual archive of 338 potential targets of methylation was produced. One candidate, IGFBP3, was selected for investigation, along with glutathione-S-transferase pi (GSTP1), a well-known methylation target in prostate cancer. Methylation of IGFBP3 was detected by quantitative methylation-specific PCR in 49/79 primary prostate adenocarcinoma and 7/14 adjacent preinvasive high-grade prostatic intraepithelial neoplasia, but in only 5/37 benign prostatic hyperplasia (P < 0.0001) and in 0/39 histologically normal adjacent prostate tissue, which implies that methylation of IGFBP3 may be involved in the early stages of prostate cancer development. Hypermethylation of IGFBP3 was only detected in samples that also demonstrated methylation of GSTP1 and was also correlated with Gleason score ≥ 7 (P = 0.01), indicating that it has potential as a prognostic marker. In addition, pharmacological demethylation induced strong expression of IGFBP3 in LNCaP prostate cancer cells. Our concept of a methylation candidate gene bank was successful in identifying a novel target of frequent hypermethylation in early-stage prostate cancer. Evaluation of further relevant genes could contribute towards a methylation signature of this disease.
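
The abstract does not say which statistical test or which pair of groups produced P < 0.0001. As a rough illustration only, the sketch below assumes a two-sided Fisher's exact test comparing the adenocarcinoma counts (49 of 79 methylated) with the benign hyperplasia counts (5 of 37 methylated). The function name `fisher_exact_two_sided` and the choice of test are assumptions, not the authors' stated method.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table (with the same
    margins) that is no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))

# Assumed comparison: methylated vs. unmethylated samples in each group.
p_value = fisher_exact_two_sided(49, 79 - 49, 5, 37 - 5)
print(f"two-sided p = {p_value:.2e}")
```

Under these assumptions the computed p-value falls well below 0.0001, which is consistent with the reported threshold but does not confirm how the authors obtained it.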

Relevance:

90.00%

Publisher:

Abstract:

Cyathostomins comprise a group of 50 species of parasitic nematodes that infect equids. Ribosomal DNA sequences, in particular the intergenic spacer (IGS) region, have been utilized via several methodologies to identify pre-parasitic stages of the commonest species that affect horses. These methods rely on the availability of accurate sequence information for each species, as well as detailed knowledge of the levels of intra- and inter-specific variation. Here, the IGS DNA region was amplified and sequenced from 10 cyathostomin species for which sequence was not previously available. Also, additional IGS DNA sequences were generated from individual worms of 8 species already studied. Comparative analysis of these sequences revealed a greater range of intra-specific variation than previously reported (up to 23%); whilst the level of inter-specific variation (3-62%) was similar to that identified in earlier studies. The reverse line blot (RLB) method has been used to exploit the cyathostomin IGS DNA region for species identification. Here, we report validation of novel and existing DNA probes for identification of cyathostomins using this method and highlight their application in differentiating life-cycle stages such as third-stage larvae that cannot be identified to species by morphological means.

Relevance:

90.00%

Publisher:

Abstract:

The last decade has witnessed an unprecedented growth in the availability of data having spatio-temporal characteristics. Given the scale and richness of such data, finding spatio-temporal patterns that demonstrate significantly different behavior from their neighbors could be of interest for various application scenarios such as weather modeling, analyzing the spread of disease outbreaks, and monitoring traffic congestion. In this paper, we propose an automated approach of exploring and discovering such anomalous patterns irrespective of the underlying domain from which the data is recovered. Our approach differs significantly from traditional methods of spatial outlier detection, and employs two phases: i) discovering homogeneous regions, and ii) evaluating these regions as anomalies based on their statistical difference from a generalized neighborhood. We evaluate the quality of our approach and distinguish it from existing techniques via an extensive experimental evaluation.
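
The two phases named in this abstract can be pictured with a toy grid. The sketch below is not the paper's algorithm, whose details the abstract does not give. It assumes a flood fill with a hypothetical tolerance `tol` for phase i and a simple z-score against bordering cells for phase ii. All function names are illustrative.

```python
from statistics import mean, pstdev

def neighbours(y, x, rows, cols):
    """Yield the 4-connected neighbours of (y, x) that lie inside the grid."""
    for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
        if 0 <= ny < rows and 0 <= nx < cols:
            yield ny, nx

def homogeneous_regions(grid, tol):
    """Phase i (assumed): flood-fill cells whose value is within tol of the seed cell."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c]:
                continue
            seed = grid[r][c]
            seen[r][c] = True
            stack, cells = [(r, c)], []
            while stack:
                y, x = stack.pop()
                cells.append((y, x))
                for ny, nx in neighbours(y, x, rows, cols):
                    if not seen[ny][nx] and abs(grid[ny][nx] - seed) <= tol:
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            regions.append(cells)
    return regions

def anomaly_score(grid, cells):
    """Phase ii (assumed): distance of the region mean from its bordering cells, in std units."""
    rows, cols = len(grid), len(grid[0])
    inside = set(cells)
    border = {n for y, x in cells for n in neighbours(y, x, rows, cols)
              if n not in inside}
    if not border:
        return 0.0  # a region covering the whole grid has no neighbourhood
    outside = [grid[y][x] for y, x in border]
    spread = pstdev(outside) or 1.0  # fall back to 1.0 when the border is constant
    return abs(mean(grid[y][x] for y, x in cells) - mean(outside)) / spread

grid = [
    [1.0, 1.1, 1.0, 0.9],
    [1.0, 9.0, 9.2, 1.0],
    [1.1, 9.1, 9.0, 1.1],
    [0.9, 1.0, 1.0, 1.0],
]
regions = homogeneous_regions(grid, tol=0.5)
scores = [anomaly_score(grid, cells) for cells in regions]
ranked = sorted(zip(scores, (len(cells) for cells in regions)), reverse=True)
print(ranked)  # (score, region size) pairs, most deviant region first
```

On this grid the small block of high values ranks first. With only two regions each necessarily differs from the other, so a realistic method would need a proper significance test rather than a bare ranking.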

Relevance:

90.00%

Publisher:

Abstract:

The problem of detecting spatially-coherent groups of data that exhibit anomalous behavior has started to attract attention due to applications across areas such as epidemic analysis and weather forecasting. Earlier efforts from the data mining community have largely focused on finding outliers, individual data objects that display deviant behavior. Such point-based methods are not easy to extend to find groups of data that exhibit anomalous behavior. Scan statistics are methods from the statistics community that have considered the problem of identifying regions where data objects exhibit a behavior that is atypical of the general dataset. The spatial scan statistic and methods that build upon it mostly adopt the framework of defining a character for regions (e.g., circular or elliptical) of objects and repeatedly sampling regions of such character, followed by applying a statistical test for anomaly detection. In the past decade, there have been efforts from the statistics community to enhance the efficiency of scan statistics as well as to enable discovery of arbitrarily shaped anomalous regions. On the other hand, the data mining community has started to look at determining anomalous regions that have behavior divergent from their neighborhood. In this chapter, we survey the space of techniques for detecting anomalous regions on spatial data from across the data mining and statistics communities while outlining connections to well-studied problems in clustering and image segmentation. We analyze the techniques systematically by categorizing them appropriately to provide a structured bird's-eye view of the work on anomalous region detection; we hope that this would encourage better cross-pollination of ideas across communities to help advance the frontier in anomaly detection.

Relevance:

90.00%

Publisher:

Abstract:

Association rule mining is an indispensable tool for discovering insights from large databases and data warehouses. The data in a warehouse being multi-dimensional, it is often useful to mine rules over subsets of data defined by selections over the dimensions. Such interactive rule mining over multi-dimensional query windows is difficult since rule mining is computationally expensive. Current methods using pre-computation of frequent itemsets require counting of some itemsets by revisiting the transaction database at query time, which is very expensive. We develop a method (RMW) that identifies the minimal set of itemsets to compute and store for each cell, so that rule mining over any query window may be performed without going back to the transaction database. We give formal proofs that the set of itemsets chosen by RMW is sufficient to answer any query and also prove that it is the optimal set to be computed for one-dimensional queries. We demonstrate through an extensive empirical evaluation that RMW achieves extremely fast query response time compared to existing methods, with only moderate overhead in pre-computation and storage.
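
To make the idea of answering a query window from stored per-cell counts concrete, here is a minimal sketch. It is not RMW: it naively stores every itemset up to a fixed size in each cell, whereas the abstract describes choosing a minimal sufficient set. The cell keys, the `max_size` parameter, and all function names are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

def precompute_cell(transactions, max_size=2):
    """Count every itemset up to max_size in one cell (naive stand-in for RMW's selection)."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for k in range(1, max_size + 1):
            for combo in combinations(items, k):
                counts[frozenset(combo)] += 1
    return {"n": len(transactions), "counts": counts}

def query_window(cells, window):
    """Merge the stored counts of all cells whose key satisfies the window predicate."""
    total, merged = 0, Counter()
    for key, cell in cells.items():
        if window(key):
            total += cell["n"]
            merged.update(cell["counts"])  # adds counts; no transaction is re-read
    return total, merged

def rule_stats(total, merged, antecedent, consequent):
    """Return (support, confidence) of antecedent -> consequent from merged counts."""
    both = merged[frozenset(antecedent) | frozenset(consequent)]
    ante = merged[frozenset(antecedent)]
    support = both / total if total else 0.0
    confidence = both / ante if ante else 0.0
    return support, confidence

# Hypothetical two-dimensional warehouse: cells keyed by (region, month).
cells = {
    ("north", "jan"): precompute_cell([["bread", "milk"], ["bread"], ["milk", "eggs"]]),
    ("north", "feb"): precompute_cell([["bread", "milk"], ["bread", "milk", "eggs"]]),
    ("south", "jan"): precompute_cell([["eggs"], ["bread", "eggs"]]),
}
total, merged = query_window(cells, lambda key: key[0] == "north")
support, confidence = rule_stats(total, merged, {"bread"}, {"milk"})
print(total, support, confidence)
```

Because counts are additive across cells, the window query only touches the stored summaries. The cost of this naive version is storage: an itemset that is frequent in the window may be rare in individual cells, so nothing can be pruned per cell without an argument like the one the abstract attributes to RMW.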

Relevance:

90.00%

Publisher:

Abstract:

BACKGROUND: Proteins belonging to the serine protease inhibitor (serpin) superfamily play essential physiological roles in many organisms. In pathogens, serpins are thought to have evolved specifically to limit host immune responses by interfering with the host immune-stimulatory signals. Serpins are less well characterised in parasitic helminths, although some are thought to be involved in mechanisms associated with host immune modulation. In this study, we cloned and partially characterised a secretory serpin from Schistosoma japonicum termed SjB6; these findings provide the basis for possible functional roles.

METHODS: The SjB6 gene was identified through database mining of our previously published microarray data and cloned, and detailed sequence and structural analysis and comparative modelling were carried out using various bioinformatics and proteomics tools. Gene transcriptional profiling was determined by real-time PCR and the expression of native protein determined by immunoblotting. An immunological profile of the recombinant protein produced in insect cells was determined by ELISA.

RESULTS: SjB6 contains an open reading frame of 1160 base pairs that encodes a protein of 387 amino acid residues. Detailed sequence analysis, comparative modelling and structural-based alignment revealed that SjB6 contains the essential structural motifs and consensus secondary structures typical of inhibitory serpins. The presence of an N-terminal signal sequence indicated that SjB6 is a secretory protein. Real-time data indicated that SjB6 is expressed exclusively in the intra-mammalian stage of the parasite life cycle with its highest expression levels in the egg stage (p < 0.0001). The native protein is approximately 60 kDa in size and recombinant SjB6 (rSjB6) was recognised strongly by sera from rats experimentally infected with S. japonicum.

CONCLUSIONS: The significantly high expression of SjB6 in schistosome eggs, when compared to other life cycle stages, suggests a possible association with disease pathology, while the strong reactivity of sera from experimentally infected rats against rSjB6 suggests that native SjB6 is released into host tissue and induces an immune response. This study presents a comprehensive demonstration of sequence and structural-based analysis of a secretory serpin from a trematode and suggests SjB6 may be associated with important functional roles in S. japonicum, particularly in parasite modulation of the host microenvironment.