766 resultados para Data Mining, Big Data, Consumi energetici, Weka Data Cleaning


Relevância:

80.00% 80.00%

Publicador:

Resumo:

Most research on technology roadmapping has focused on its practical applications and the development of methods to enhance its operational process. Thus, despite a demand for well-supported, systematic information, little attention has been paid to how/which information can be utilised in technology roadmapping. Therefore, this paper aims at proposing a methodology to structure technological information in order to facilitate the process. To this end, eight methods are suggested to provide useful information for technology roadmapping: summary, information extraction, clustering, mapping, navigation, linking, indicators and comparison. This research identifies the characteristics of significant data that can potentially be used in roadmapping, and presents an approach to extracting important information from such raw data through various data mining techniques including text mining, multi-dimensional scaling and K-means clustering. In addition, this paper explains how this approach can be applied in each step of roadmapping. The proposed approach is applied to develop a roadmap of radio-frequency identification (RFID) technology to illustrate the process practically. © 2013 © 2013 Taylor & Francis.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

We present Random Partition Kernels, a new class of kernels derived by demonstrating a natural connection between random partitions of objects and kernels between those objects. We show how the construction can be used to create kernels from methods that would not normally be viewed as random partitions, such as Random Forest. To demonstrate the potential of this method, we propose two new kernels, the Random Forest Kernel and the Fast Cluster Kernel, and show that these kernels consistently outperform standard kernels on problems involving real-world datasets. Finally, we show how the form of these kernels lend themselves to a natural approximation that is appropriate for certain big data problems, allowing $O(N)$ inference in methods such as Gaussian Processes, Support Vector Machines and Kernel PCA.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Expressed sequence tags (ESTs) are a source for microsatellite development. In the present study, EST-derived microsatelltes (EST-SSRs) were generated and characterized in the common carp (Cyprinus carpio) by data mining from updated public EST databases and by subsequent testing for polymorphism. About 5.5% (555) of 10,088 ESTs contain repeat motifs of various types and lengths with CA being the most abundant dinucleotide one. Out of the 60 EST-SSRs for which PCR primers were designed, 25 loci showed polymorphism in a common carp population with the alleles per locus ranging from 3 to 17 (mean 7). The observed (H-O) and expected (HE) heterozygosities of these EST-SSRs were 0.13-1.00 and 0.12-0.91, respectively. Six EST-SSR loci significantly deviated from the Hardy-Weinberg equilibrium (HWE) expectation, and the remaining 19 loci were in HWE. Of the 60 primer sets, the rates of polymorphic EST-SSRs were 42% in common carp, 17% in crucian carp (Carassius auratus), and 5% in silver carp (Hypophthalmichthys molitrix), respectively. These new EST-SSR markers would provide sufficient polymorphism for population genetic studies and genome mapping of the common carp and its closely related fishes. (c) 2007 Published by Elsevier B.V.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

King, R. D. and Wise, P. H. and Clare, A. (2004) Confirmation of Data Mining Based Predictions of Protein Function. Bioinformatics 20(7), 1110-1118

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Clare, A. and King R.D. (2003) Data mining the yeast genome in a lazy functional language. In Practical Aspects of Declarative Languages (PADL'03) (won Best/Most Practical Paper award).

Relevância:

80.00% 80.00%

Publicador:

Resumo:

BACKGROUND: Over the past two decades more than fifty thousand unique clinical and biological samples have been assayed using the Affymetrix HG-U133 and HG-U95 GeneChip microarray platforms. This substantial repository has been used extensively to characterize changes in gene expression between biological samples, but has not been previously mined en masse for changes in mRNA processing. We explored the possibility of using HG-U133 microarray data to identify changes in alternative mRNA processing in several available archival datasets. RESULTS: Data from these and other gene expression microarrays can now be mined for changes in transcript isoform abundance using a program described here, SplicerAV. Using in vivo and in vitro breast cancer microarray datasets, SplicerAV was able to perform both gene and isoform specific expression profiling within the same microarray dataset. Our reanalysis of Affymetrix U133 plus 2.0 data generated by in vitro over-expression of HRAS, E2F3, beta-catenin (CTNNB1), SRC, and MYC identified several hundred oncogene-induced mRNA isoform changes, one of which recognized a previously unknown mechanism of EGFR family activation. Using clinical data, SplicerAV predicted 241 isoform changes between low and high grade breast tumors; with changes enriched among genes coding for guanyl-nucleotide exchange factors, metalloprotease inhibitors, and mRNA processing factors. Isoform changes in 15 genes were associated with aggressive cancer across the three breast cancer datasets. CONCLUSIONS: Using SplicerAV, we identified several hundred previously uncharacterized isoform changes induced by in vitro oncogene over-expression and revealed a previously unknown mechanism of EGFR activation in human mammary epithelial cells. We analyzed Affymetrix GeneChip data from over 400 human breast tumors in three independent studies, making this the largest clinical dataset analyzed for en masse changes in alternative mRNA processing. The capacity to detect RNA isoform changes in archival microarray data using SplicerAV allowed us to carry out the first analysis of isoform specific mRNA changes directly associated with cancer survival.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Time-series and sequences are important patterns in data mining. Based on an ontology of time-elements, this paper presents a formal characterization of time-series and state-sequences, where a state denotes a collection of data whose validation is dependent on time. While a time-series is formalized as a vector of time-elements temporally ordered one after another, a state-sequence is denoted as a list of states correspondingly ordered by a time-series. In general, a time-series and a state-sequence can be incomplete in various ways. This leads to the distinction between complete and incomplete time-series, and between complete and incomplete state-sequences, which allows the expression of both absolute and relative temporal knowledge in data mining.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

In the last decade, data mining has emerged as one of the most dynamic and lively areas in information technology. Although many algorithms and techniques for data mining have been proposed, they either focus on domain independent techniques or on very specific domain problems. A general requirement in bridging the gap between academia and business is to cater to general domain-related issues surrounding real-life applications, such as constraints, organizational factors, domain expert knowledge, domain adaption, and operational knowledge. Unfortunately, these either have not been addressed, or have not been sufficiently addressed, in current data mining research and development.Domain-Driven Data Mining (D3M) aims to develop general principles, methodologies, and techniques for modeling and merging comprehensive domain-related factors and synthesized ubiquitous intelligence surrounding problem domains with the data mining process, and discovering knowledge to support business decision-making. This paper aims to report original, cutting-edge, and state-of-the-art progress in D3M. It covers theoretical and applied contributions aiming to: 1) propose next-generation data mining frameworks and processes for actionable knowledge discovery, 2) investigate effective (automated, human and machine-centered and/or human-machined-co-operated) principles and approaches for acquiring, representing, modelling, and engaging ubiquitous intelligence in real-world data mining, and 3) develop workable and operational systems balancing technical significance and applications concerns, and converting and delivering actionable knowledge into operational applications rules to seamlessly engage application processes and systems.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Background. The assembly of the tree of life has seen significant progress in recent years but algae and protists have been largely overlooked in this effort. Many groups of algae and protists have ancient roots and it is unclear how much data will be required to resolve their phylogenetic relationships for incorporation in the tree of life. The red algae, a group of primary photosynthetic eukaryotes of more than a billion years old, provide the earliest fossil evidence for eukaryotic multicellularity and sexual reproduction. Despite this evolutionary significance, their phylogenetic relationships are understudied. This study aims to infer a comprehensive red algal tree of life at the family level from a supermatrix containing data mined from GenBank. We aim to locate remaining regions of low support in the topology, evaluate their causes and estimate the amount of data required to resolve them. Results. Phylogenetic analysis of a supermatrix of 14 loci and 98 red algal families yielded the most complete red algal tree of life to date. Visualization of statistical support showed the presence of five poorly supported regions. Causes for low support were identified with statistics about the age of the region, data availability and node density, showing that poor support has different origins in different parts of the tree. Parametric simulation experiments yielded optimistic estimates of how much data will be needed to resolve the poorly supported regions (ca. 103 to ca. 104 nucleotides for the different regions). Nonparametric simulations gave a markedly more pessimistic image, some regions requiring more than 2.8 105 nucleotides or not achieving the desired level of support at all. The discrepancies between parametric and nonparametric simulations are discussed in light of our dataset and known attributes of both approaches. Conclusions. Our study takes the red algae one step closer to meaningful inclusion in the tree of life. In addition to the recovery of stable relationships, the recognition of five regions in need of further study is a significant outcome of this work. Based on our analyses of current availability and future requirements of data, we make clear recommendations for forthcoming research.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

We conducted data-mining analyses of genome wide association (GWA) studies of the CATIE and MGS-GAIN datasets, and found 13 markers in the two physically linked genes, PTPN21 and EML5, showing nominally significant association with schizophrenia. Linkage disequilibrium (LD) analysis indicated that all 7 markers from PTPN21 shared high LD (r(2)>0.8), including rs2274736 and rs2401751, the two non-synonymous markers with the most significant association signals (rs2401751, P=1.10 × 10(-3) and rs2274736, P=1.21 × 10(-3)). In a meta-analysis of all 13 replication datasets with a total of 13,940 subjects, we found that the two non-synonymous markers are significantly associated with schizophrenia (rs2274736, OR=0.92, 95% CI: 0.86-0.97, P=5.45 × 10(-3) and rs2401751, OR=0.92, 95% CI: 0.86-0.97, P=5.29 × 10(-3)). One SNP (rs7147796) in EML5 is also significantly associated with the disease (OR=1.08, 95% CI: 1.02-1.14, P=6.43 × 10(-3)). These 3 markers remain significant after Bonferroni correction. Furthermore, haplotype conditioned analyses indicated that the association signals observed between rs2274736/rs2401751 and rs7147796 are statistically independent. Given the results that 2 non-synonymous markers in PTPN21 are associated with schizophrenia, further investigation of this locus is warranted.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

This article proposes that a complementary relationship exists between the formalised nature of digital loyalty card data, and the informal nature of small business market orientation. A longitudinal, case-based research approach analysed this relationship in small firms given access to Tesco Clubcard data. The findings reveal a new-found structure and precision in small firm marketing planning from data exposure; this complemented rather than conflicted with an intuitive feel for markets. In addition, small firm owners were encouraged to include employees in marketing planning.

Relevância:

80.00% 80.00%

Publicador: