868 results for Data Mining
Abstract:
Discrete Conditional Phase-type (DC-Ph) models are a family of models which represent skewed survival data conditioned on specific inter-related discrete variables. The survival data is modeled using a Coxian phase-type distribution which is associated with the inter-related variables using a range of possible data mining approaches such as Bayesian networks (BNs), the Naïve Bayes Classification method and classification regression trees. This paper utilizes the Discrete Conditional Phase-type model (DC-Ph) to explore the modeling of patient waiting times in an Accident and Emergency Department of a UK hospital. The resulting DC-Ph model takes on the form of the Coxian phase-type distribution conditioned on the outcome of a logistic regression model.
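For readers unfamiliar with the Coxian phase-type distribution named above, its usual textbook form is sketched below. The notation (n phases, transition rates \lambda_i, absorption rates \mu_i) is chosen here for illustration and is not taken from the paper. In a DC-Ph model, these rates would presumably be estimated separately for each outcome of the conditioning model.

```latex
% Density and survival function of an n-phase Coxian phase-type distribution
f(t) = \mathbf{p}\,\exp(\mathbf{Q}t)\,\mathbf{q}, \qquad
S(t) = \mathbf{p}\,\exp(\mathbf{Q}t)\,\mathbf{1}, \qquad t \ge 0,

% The process always starts in phase 1
\mathbf{p} = (1, 0, \dots, 0), \qquad
\mathbf{q} = (\mu_1, \mu_2, \dots, \mu_n)^{\mathsf{T}},

% Phase i moves on to phase i+1 at rate \lambda_i or is absorbed at rate \mu_i
\mathbf{Q} =
\begin{pmatrix}
-(\lambda_1 + \mu_1) & \lambda_1 & 0 & \cdots & 0 \\
0 & -(\lambda_2 + \mu_2) & \lambda_2 & \cdots & 0 \\
\vdots & & \ddots & \ddots & \vdots \\
0 & 0 & \cdots & -(\lambda_{n-1} + \mu_{n-1}) & \lambda_{n-1} \\
0 & 0 & \cdots & 0 & -\mu_n
\end{pmatrix}
```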
Abstract:
The skin of fish is the first line of defense against pathogens and parasites. The skin transcriptome of the Atlantic salmon is poorly characterized, and currently only 2,089 expressed sequence tags (ESTs) out of a total of half a million sequences are generated from skin-derived cDNA libraries. The primary aim of this study was to enhance the transcriptomic knowledge of salmon skin by using next-generation sequencing (NGS) technology, namely the Roche-454 platform. An equimolar mixture of high-quality RNA from skin and epidermal samples of salmon reared in either freshwater or seawater was used for 454-sequencing. This technique yielded over 600,000 reads, which were assembled into 34,696 isotigs using Newbler. Of these isotigs, 12% had not been sequenced in Atlantic salmon, hence representing previously unreported salmon mRNAs that can potentially be skin-specific. Many full-length genes have been acquired, representing numerous biological processes. Mucin proteins are the main structural component of mucus, and we examined in greater detail the sequences we obtained for these genes. Several isotigs exhibited homology to mammalian mucins (MUC2, MUC5AC and MUC5B). Mucin mRNAs are generally > 10 kbp and contain large repetitive units, which pose a challenge to full-length sequence discovery. To date, we have not unearthed any full-length salmon mucin genes with this dataset, but have both N- and C-terminal regions of a mucin type 5. This highlights the fact that, while NGS is indeed a formidable tool for sequence data mining of non-model species, it must be complemented with additional experimental and bioinformatic work to characterize some mRNA sequences with complex features.
Abstract:
Secretory factors that drive cancer progression are attractive immunotherapeutic targets. We used a whole-genome data-mining approach on multiple cohorts of breast tumours annotated for clinical outcomes to discover such factors. We identified Serine protease inhibitor Kazal-type 1 (SPINK1) to be associated with poor survival in estrogen receptor-positive (ER+) cases. Immunohistochemistry showed that SPINK1 was absent in normal breast, present in early and advanced tumours, and its expression correlated with poor survival in ER+ tumours. In ER- cases, the prognostic effect did not reach statistical significance. Forced expression and/or exposure to recombinant SPINK1 induced invasiveness without affecting cell proliferation. However, down-regulation of SPINK1 resulted in cell death. Further, SPINK1 overexpressing cells were resistant to drug-induced apoptosis due to reduced caspase-3 levels and high expression of Bcl2 and phospho-Bcl2 proteins. Intriguingly, these anti-apoptotic effects of SPINK1 were abrogated by mutations of its protease inhibition domain. Thus, SPINK1 affects multiple aggressive properties in breast cancer: survival, invasiveness and chemoresistance. Because SPINK1 effects are abrogated by neutralizing antibodies, we suggest that SPINK1 is a viable potential therapeutic target in breast cancer.
Abstract:
Achieving a clearer picture of categorial distinctions in the brain is essential for our understanding of the conceptual lexicon, but much more fine-grained investigations are required in order for this evidence to contribute to lexical research. Here we present a collection of advanced data-mining techniques that allows the category of individual concepts to be decoded from single trials of EEG data. Neural activity was recorded while participants silently named images of mammals and tools, and category could be detected in single trials with an accuracy well above chance, both when considering data from single participants, and when group-training across participants. By aggregating across all trials, single concepts could be correctly assigned to their category with an accuracy of 98%. The pattern of classifications made by the algorithm confirmed that the neural patterns identified are due to conceptual category, and not any of a series of processing-related confounds. The time intervals, frequency bands and scalp locations that proved most informative for prediction permit physiological interpretation: the widespread activation shortly after appearance of the stimulus (from 100 ms) is consistent both with accounts of multi-pass processing, and distributed representations of categories. These methods provide an alternative to fMRI for fine-grained, large-scale investigations of the conceptual lexicon. © 2010 Elsevier Inc.
Abstract:
Mobile malware has continued to grow at an alarming rate despite on-going mitigation efforts. This has been much more prevalent on Android due to being an open platform that is rapidly overtaking other competing platforms in the mobile smart devices market. Recently, a new generation of Android malware families has emerged with advanced evasion capabilities which make them much more difficult to detect using conventional methods. This paper proposes and investigates a parallel machine learning based classification approach for early detection of Android malware. Using real malware samples and benign applications, a composite classification model is developed from parallel combination of heterogeneous classifiers. The empirical evaluation of the model under different combination schemes demonstrates its efficacy and potential to improve detection accuracy. More importantly, by utilizing several classifiers with diverse characteristics, their strengths can be harnessed not only for enhanced Android malware detection but also quicker white box analysis by means of the more interpretable constituent classifiers.
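The idea of combining heterogeneous classifiers under different schemes can be illustrated with a minimal sketch. The abstract does not name the schemes or the classifiers used, so the two schemes below (majority vote and average of probabilities) are assumptions, and the three rule-based "classifiers" and their feature names (`num_permissions`, `sends_sms`, `apk_kb`) are hypothetical stand-ins for trained models. Each classifier scores the app independently; they are called one after another here purely for simplicity.

```python
from collections import Counter
from statistics import mean

MALWARE, BENIGN = "malware", "benign"


def majority_vote(labels):
    """Return the label predicted by the largest number of classifiers."""
    return Counter(labels).most_common(1)[0][0]


def average_probability(probabilities, threshold=0.5):
    """Average the malware probabilities and compare against a threshold."""
    return MALWARE if mean(probabilities) >= threshold else BENIGN


def classify_app(features, classifiers, scheme="majority"):
    """Score one app with every classifier, then combine the outputs."""
    probabilities = [clf(features) for clf in classifiers]
    if scheme == "majority":
        votes = [MALWARE if p >= 0.5 else BENIGN for p in probabilities]
        return majority_vote(votes)
    if scheme == "average":
        return average_probability(probabilities)
    raise ValueError(f"unknown combination scheme: {scheme}")


# Hypothetical toy classifiers: each returns an estimated P(malware).
def permission_rule(features):
    return 0.9 if features["num_permissions"] > 20 else 0.1


def sms_rule(features):
    return 0.8 if features["sends_sms"] else 0.2


def size_rule(features):
    return 0.6 if features["apk_kb"] < 500 else 0.3


CLASSIFIERS = [permission_rule, sms_rule, size_rule]

if __name__ == "__main__":
    app = {"num_permissions": 25, "sends_sms": False, "apk_kb": 300}
    print(classify_app(app, CLASSIFIERS, scheme="majority"))
    print(classify_app(app, CLASSIFIERS, scheme="average"))
```

Keeping the constituent classifiers separate is what allows the more interpretable ones to be inspected individually, which is the white box benefit the abstract mentions.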
Abstract:
Biodiversity, a multidimensional property of natural systems, is difficult to quantify partly because of the multitude of indices proposed for this purpose. Indices aim to describe general properties of communities that allow us to compare different regions, taxa, and trophic levels. Therefore, they are of fundamental importance for environmental monitoring and conservation, although there is no consensus about which indices are more appropriate and informative. We tested several common diversity indices in a range of simple to complex statistical analyses in order to determine whether some were better suited for certain analyses than others. We used data collected around the focal plant Plantago lanceolata on 60 temperate grassland plots embedded in an agricultural landscape to explore relationships between the common diversity indices of species richness (S), Shannon's diversity (H'), Simpson's diversity (D1), Simpson's dominance (D2), Simpson's evenness (E), and Berger–Parker dominance (BP). We calculated each of these indices for herbaceous plants, arbuscular mycorrhizal fungi, aboveground arthropods, belowground insect larvae, and P. lanceolata molecular and chemical diversity. Including these trait-based measures of diversity allowed us to test whether or not they behaved similarly to the better studied species diversity. We used path analysis to determine whether compound indices detected more relationships between diversities of different organisms and traits than more basic indices. In the path models, more paths were significant when using H', even though all models except that with E were equally reliable. This demonstrates that while common diversity indices may appear interchangeable in simple analyses, when considering complex interactions, the choice of index can profoundly alter the interpretation of results. 
Data mining in order to identify the index producing the most significant results should be avoided, but simultaneously considering analyses using multiple indices can provide greater insight into the interactions in a system.
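The six indices compared above can be computed from a single vector of abundances. The sketch below assumes the commonly used definitions (H' = -Σ p_i ln p_i, D1 = 1 - Σ p_i², D2 = 1 / Σ p_i², E = D2 / S, BP = n_max / N); the abstract itself does not state the formulas, so these should be checked against the paper before reuse. The function name and the example counts are illustrative only.

```python
import math


def diversity_indices(abundances):
    """Compute common diversity indices from per-species abundance counts."""
    counts = [n for n in abundances if n > 0]  # ignore absent species
    if not counts:
        raise ValueError("at least one species must have a positive abundance")

    total = sum(counts)
    proportions = [n / total for n in counts]
    sum_squares = sum(p * p for p in proportions)

    richness = len(counts)                                  # S
    shannon = -sum(p * math.log(p) for p in proportions)    # H'
    simpson_diversity = 1.0 - sum_squares                   # D1
    simpson_dominance = 1.0 / sum_squares                   # D2
    simpson_evenness = simpson_dominance / richness         # E
    berger_parker = max(counts) / total                     # BP

    return {
        "S": richness,
        "H": shannon,
        "D1": simpson_diversity,
        "D2": simpson_dominance,
        "E": simpson_evenness,
        "BP": berger_parker,
    }


if __name__ == "__main__":
    # Hypothetical plot with one dominant species and three rarer ones.
    for name, value in diversity_indices([40, 5, 3, 2]).items():
        print(f"{name}: {value:.3f}")
```

For a perfectly even community the indices agree on the picture (E equals 1 and BP equals 1/S), whereas skewed communities are where they start to diverge.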
Abstract:
How can we correlate the neural activity in the human brain as it responds to typed words, with properties of these terms (like ‘edible’, ‘fits in hand’)? In short, we want to find latent variables, that jointly explain both the brain activity, as well as the behavioral responses. This is one of many settings of the Coupled Matrix-Tensor Factorization (CMTF) problem.
Can we accelerate any CMTF solver, so that it runs within a few minutes instead of tens of hours to a day, while maintaining good accuracy? We introduce Turbo-SMT, a meta-method capable of doing exactly that: it boosts the performance of any CMTF algorithm, by up to 200x, along with an up to 65 fold increase in sparsity, with comparable accuracy to the baseline.
We apply Turbo-SMT to BrainQ, a dataset consisting of a (nouns, brain voxels, human subjects) tensor and a (nouns, properties) matrix, with coupling along the nouns dimension. Turbo-SMT is able to find meaningful latent variables, as well as to predict brain activity with competitive accuracy.
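For orientation, the CMTF problem that Turbo-SMT accelerates is commonly written as the joint least-squares objective below. The abstract does not state the objective, so this is the generic formulation under assumed notation: \mathcal{X} is the (nouns, brain voxels, human subjects) tensor, Y is the (nouns, properties) matrix, R is the number of latent components, and the factor matrix A is shared along the nouns dimension. Turbo-SMT's own procedure is not shown.

```latex
% Generic coupled matrix-tensor factorization with a shared first mode
\min_{A,\,B,\,C,\,D}\;
\Bigl\| \mathcal{X} - \sum_{r=1}^{R} a_r \circ b_r \circ c_r \Bigr\|_F^2
\;+\;
\bigl\| Y - A D^{\mathsf{T}} \bigr\|_F^2

% a_r, b_r, c_r are the r-th columns of A, B, C; \circ is the outer product
\mathcal{X} \in \mathbb{R}^{I \times J \times K}, \quad
Y \in \mathbb{R}^{I \times M}, \quad
A \in \mathbb{R}^{I \times R}, \;
B \in \mathbb{R}^{J \times R}, \;
C \in \mathbb{R}^{K \times R}, \;
D \in \mathbb{R}^{M \times R}
```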
Abstract:
This paper proposes a new thermography-based maximum power point tracking (MPPT) scheme to address photovoltaic (PV) partial shading faults. Solar power generation utilizes a large number of PV cells connected in series and in parallel in an array, and that are physically distributed across a large field. When a PV module is faulted or partial shading occurs, the PV system sees a nonuniform distribution of generated electrical power and thermal profile, and the generation of multiple maximum power points (MPPs). If left untreated, this reduces the overall power generation and severe faults may propagate, resulting in damage to the system. In this paper, a thermal camera is employed for fault detection and a new MPPT scheme is developed to alter the operating point to match an optimized MPP. Extensive data mining is conducted on the images from the thermal camera in order to locate global MPPs. Based on this, a virtual MPPT is set out to find the global MPP. This can reduce MPPT time and be used to calculate the MPP reference voltage. Finally, the proposed methodology is experimentally implemented and validated by tests on a 600-W PV array.
Abstract:
In this paper we propose a graph stream clustering algorithm with a unified similarity measure on both structural and attribute properties of vertices, with each attribute being treated as a vertex. Unlike others, our approach does not require an input parameter for the number of clusters; instead, it dynamically creates new sketch-based clusters and periodically merges existing similar clusters. Experiments on two publicly available datasets reveal the advantages of our approach in detecting vertex clusters in the graph stream. We provide a detailed investigation into how parameters affect the algorithm performance. We also provide a quantitative evaluation and comparison with a well-known offline community detection algorithm which shows that our streaming algorithm can achieve comparable or better average cluster purity.
Abstract:
Embedded memories account for a large fraction of the overall silicon area and power consumption in modern SoCs. While embedded memories are typically realized with SRAM, alternative solutions, such as embedded dynamic memories (eDRAM), can provide higher density and/or reduced power consumption. One major challenge that impedes the widespread adoption of eDRAM is that they require frequent refreshes, potentially reducing the availability of the memory in periods of high activity and also consuming a significant amount of power. While reducing the refresh rate can reduce the power overhead, refreshes that are not performed in a timely manner can cause some cells to lose their content, potentially resulting in memory errors. In this paper, we consider extending the refresh period of gain-cell based dynamic memories beyond the worst-case point of failure, assuming that the resulting errors can be tolerated when the use-cases are in the domain of inherently error-resilient applications. For example, we observe that for various data mining applications, a large number of memory failures can be accepted with tolerable imprecision in output quality. In particular, our results indicate that by allowing as many as 177 errors in a 16 kB memory, the maximum loss in output quality is 11%. We use this failure limit to study the impact of relaxing reliability constraints on memory availability and retention power for different technologies.
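To get a feel for the scale of the failure limit quoted above, the following sketch injects a fixed number of errors into a 16 kB buffer. It assumes that each error is a single bit flipped at a uniformly random position, which the abstract does not specify; real retention failures in gain cells are likely to be data- and cell-dependent. The function names are hypothetical.

```python
import random


def inject_bit_errors(memory, n_errors, seed=None):
    """Return a copy of `memory` with `n_errors` distinct bits flipped."""
    corrupted = bytearray(memory)
    total_bits = len(corrupted) * 8
    if n_errors > total_bits:
        raise ValueError("more errors requested than bits available")

    rng = random.Random(seed)
    # Sampling without replacement guarantees no bit is flipped twice.
    for position in rng.sample(range(total_bits), n_errors):
        corrupted[position // 8] ^= 1 << (position % 8)
    return corrupted


def count_bit_differences(a, b):
    """Count the bits that differ between two equal-length buffers."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


if __name__ == "__main__":
    memory = bytes(16 * 1024)  # 16 kB of zeros as placeholder content
    corrupted = inject_bit_errors(memory, n_errors=177, seed=0)
    flipped = count_bit_differences(memory, corrupted)
    print(f"{flipped} of {len(memory) * 8} bits flipped")
```

Running an error-resilient workload on the corrupted buffer and comparing its output with the clean run is one way such a quality-loss figure could be estimated, under the stated assumption about the error model.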
Abstract:
With over 50 billion downloads and more than 1.3 million apps in Google’s official market, Android has continued to gain popularity amongst smartphone users worldwide. At the same time there has been a rise in malware targeting the platform, with more recent strains employing highly sophisticated detection avoidance techniques. As traditional signature based methods become less potent in detecting unknown malware, alternatives are needed for timely zero-day discovery. Thus this paper proposes an approach that utilizes ensemble learning for Android malware detection. It combines advantages of static analysis with the efficiency and performance of ensemble machine learning to improve Android malware detection accuracy. The machine learning models are built using a large repository of malware samples and benign apps from a leading antivirus vendor. The experimental results and analysis presented show that the proposed method, which uses a large feature space to leverage the power of ensemble learning, is capable of 97.3% to 99% detection accuracy with very low false positive rates.
Abstract:
The battle to mitigate Android malware has become more critical with the emergence of new strains incorporating increasingly sophisticated evasion techniques, in turn necessitating more advanced detection capabilities. Hence, in this paper we propose and evaluate a machine learning approach based on eigenspace analysis for Android malware detection, using features derived from static analysis characterization of Android applications. Empirical evaluation with a dataset of real malware and benign samples shows that a detection rate of over 96% with a very low false positive rate is achievable using the proposed method.