966 results for data mining


Relevance:

60.00% 60.00%

Publisher:

Abstract:

Biodiversity, a multidimensional property of natural systems, is difficult to quantify partly because of the multitude of indices proposed for this purpose. Indices aim to describe general properties of communities that allow us to compare different regions, taxa, and trophic levels. Therefore, they are of fundamental importance for environmental monitoring and conservation, although there is no consensus about which indices are more appropriate and informative. We tested several common diversity indices in a range of simple to complex statistical analyses in order to determine whether some were better suited for certain analyses than others. We used data collected around the focal plant Plantago lanceolata on 60 temperate grassland plots embedded in an agricultural landscape to explore relationships between the common diversity indices of species richness (S), Shannon's diversity (H'), Simpson's diversity (D1), Simpson's dominance (D2), Simpson's evenness (E), and Berger–Parker dominance (BP). We calculated each of these indices for herbaceous plants, arbuscular mycorrhizal fungi, aboveground arthropods, belowground insect larvae, and P. lanceolata molecular and chemical diversity. Including these trait-based measures of diversity allowed us to test whether or not they behaved similarly to the better studied species diversity. We used path analysis to determine whether compound indices detected more relationships between diversities of different organisms and traits than more basic indices. In the path models, more paths were significant when using H', even though all models except that with E were equally reliable. This demonstrates that while common diversity indices may appear interchangeable in simple analyses, when considering complex interactions, the choice of index can profoundly alter the interpretation of results. 
Data mining in order to identify the index producing the most significant results should be avoided, but simultaneously considering analyses using multiple indices can provide greater insight into the interactions in a system.
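
The six indices named above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the common textbook definitions (D1 = 1 − Σp², D2 = 1/Σp², E = D2/S, BP = largest relative abundance), which may differ from the exact forms used in the study. The function name `diversity_indices` is hypothetical.

```python
import math


def diversity_indices(counts):
    """Return common diversity indices for a list of per-species counts.

    Assumed (textbook) definitions; the study's exact forms may differ.
    """
    counts = [c for c in counts if c > 0]  # absent species do not contribute
    if not counts:
        raise ValueError("at least one species must have a positive count")
    total = sum(counts)
    p = [c / total for c in counts]        # relative abundances
    richness = len(p)                      # S
    sum_sq = sum(pi * pi for pi in p)      # Simpson's concentration
    return {
        "S": richness,
        "H": -sum(pi * math.log(pi) for pi in p),  # Shannon's H'
        "D1": 1.0 - sum_sq,                        # Simpson's diversity
        "D2": 1.0 / sum_sq,                        # inverse Simpson
        "E": (1.0 / sum_sq) / richness,            # Simpson's evenness
        "BP": max(p),                              # Berger-Parker dominance
    }


if __name__ == "__main__":
    print(diversity_indices([10, 10, 10, 10]))
    print(diversity_indices([37, 1, 1, 1]))
```

Both example communities have the same richness but differ sharply in every other index, which illustrates why the choice of index can change the interpretation of an analysis.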

Relevance:

60.00% 60.00%

Publisher:

Abstract:

How can we correlate the neural activity in the human brain, as it responds to typed words, with properties of these terms (like ‘edible’, ‘fits in hand’)? In short, we want to find latent variables that jointly explain both the brain activity and the behavioral responses. This is one of many settings of the Coupled Matrix-Tensor Factorization (CMTF) problem.

Can we accelerate any CMTF solver so that it runs within a few minutes instead of tens of hours to a day, while maintaining good accuracy? We introduce Turbo-SMT, a meta-method capable of doing exactly that: it boosts the performance of any CMTF algorithm by up to 200x, along with an up to 65-fold increase in sparsity, with comparable accuracy to the baseline.

We apply Turbo-SMT to BrainQ, a dataset consisting of a (nouns, brain voxels, human subjects) tensor and a (nouns, properties) matrix, with coupling along the nouns dimension. Turbo-SMT is able to find meaningful latent variables, as well as to predict brain activity with competitive accuracy.
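
For orientation, the CMTF problem is commonly written as a joint least-squares fit in which the tensor and the matrix share one factor along the coupled mode. The formulation below is the standard one from the CMTF literature, not taken from this abstract; the rank $R$ and the symbols $A$, $B$, $C$, $D$ are assumptions introduced for illustration, and the paper's objective may include additional terms.

```latex
% X: (nouns x voxels x subjects) tensor, Y: (nouns x properties) matrix.
% A is the shared factor along the coupled "nouns" mode.
\min_{A,B,C,D}\;
  \Bigl\lVert \mathcal{X} - \sum_{r=1}^{R} a_r \circ b_r \circ c_r \Bigr\rVert_F^2
  \;+\;
  \bigl\lVert Y - A D^{\top} \bigr\rVert_F^2,
\qquad
A \in \mathbb{R}^{I \times R},\;
B \in \mathbb{R}^{J \times R},\;
C \in \mathbb{R}^{K \times R},\;
D \in \mathbb{R}^{P \times R}
```

Here $a_r$, $b_r$, $c_r$ are the $r$-th columns of $A$, $B$, $C$, and $\circ$ is the outer product, so each column of $A$ is a latent variable over nouns that must explain both data sets at once.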




Relevance:

60.00% 60.00%

Publisher:

Abstract:

This paper proposes a new thermography-based maximum power point tracking (MPPT) scheme to address photovoltaic (PV) partial shading faults. Solar power generation utilizes a large number of PV cells connected in series and in parallel in an array and physically distributed across a large field. When a PV module is faulted or partial shading occurs, the PV system sees a nonuniform distribution of generated electrical power and thermal profile, and the generation of multiple maximum power points (MPPs). If left untreated, this reduces the overall power generation, and severe faults may propagate, resulting in damage to the system. In this paper, a thermal camera is employed for fault detection and a new MPPT scheme is developed to alter the operating point to match an optimized MPP. Extensive data mining is conducted on the images from the thermal camera in order to locate global MPPs. Based on this, a virtual MPPT is set out to find the global MPP. This can reduce MPPT time and be used to calculate the MPP reference voltage. Finally, the proposed methodology is experimentally implemented and validated by tests on a 600-W PV array.

Relevance:

60.00% 60.00%

Publisher:

Abstract:

In this paper we propose a graph stream clustering algorithm with a unified similarity measure on both structural and attribute properties of vertices, with each attribute being treated as a vertex. Unlike others, our approach does not require an input parameter for the number of clusters; instead, it dynamically creates new sketch-based clusters and periodically merges existing similar clusters. Experiments on two publicly available datasets reveal the advantages of our approach in detecting vertex clusters in the graph stream. We provide a detailed investigation into how parameters affect the algorithm performance. We also provide a quantitative evaluation and comparison with a well-known offline community detection algorithm, which shows that our streaming algorithm can achieve comparable or better average cluster purity.

Relevance:

60.00% 60.00%

Publisher:

Abstract:

Embedded memories account for a large fraction of the overall silicon area and power consumption in modern SoCs. While embedded memories are typically realized with SRAM, alternative solutions, such as embedded dynamic memories (eDRAM), can provide higher density and/or reduced power consumption. One major challenge that impedes the widespread adoption of eDRAM is its need for frequent refreshes, which can reduce the availability of the memory in periods of high activity and consume a significant amount of power. Reducing the refresh rate lowers the power overhead, but if refreshes are not performed in a timely manner, some cells can lose their content, potentially resulting in memory errors. In this paper, we consider extending the refresh period of gain-cell based dynamic memories beyond the worst-case point of failure, assuming that the resulting errors can be tolerated when the use-cases are in the domain of inherently error-resilient applications. For example, we observe that for various data mining applications, a large number of memory failures can be accepted with tolerable imprecision in output quality. In particular, our results indicate that by allowing as many as 177 errors in a 16 kB memory, the maximum loss in output quality is 11%. We use this failure limit to study the impact of relaxing reliability constraints on memory availability and retention power for different technologies.

Relevance:

60.00% 60.00%

Publisher:

Abstract:

With over 50 billion downloads and more than 1.3 million apps in Google’s official market, Android has continued to gain popularity amongst smartphone users worldwide. At the same time, there has been a rise in malware targeting the platform, with more recent strains employing highly sophisticated detection avoidance techniques. As traditional signature-based methods become less potent in detecting unknown malware, alternatives are needed for timely zero-day discovery. This paper therefore proposes an approach that utilizes ensemble learning for Android malware detection. It combines the advantages of static analysis with the efficiency and performance of ensemble machine learning to improve Android malware detection accuracy. The machine learning models are built using a large repository of malware samples and benign apps from a leading antivirus vendor. The experimental results and analysis presented show that the proposed method, which uses a large feature space to leverage the power of ensemble learning, is capable of 97.3% to 99% detection accuracy with very low false positive rates.

Relevance:

60.00% 60.00%

Publisher:

Abstract:

The battle to mitigate Android malware has become more critical with the emergence of new strains incorporating increasingly sophisticated evasion techniques, in turn necessitating more advanced detection capabilities. Hence, in this paper we propose and evaluate a machine learning approach based on eigenspace analysis for Android malware detection, using features derived from static analysis characterization of Android applications. Empirical evaluation with a dataset of real malware and benign samples shows that a detection rate of over 96% with a very low false positive rate is achievable using the proposed method.

Relevance:

60.00% 60.00%

Publisher:

Abstract:

Inherently error-resilient applications in areas such as signal processing, machine learning and data analytics provide opportunities for relaxing reliability requirements, and thereby reducing the overhead incurred by conventional error correction schemes. In this paper, we exploit the tolerable imprecision of such applications by designing an energy-efficient fault-mitigation scheme for unreliable data memories to meet a target yield. The proposed approach uses a bit-shuffling mechanism to isolate faults into bit locations with lower significance. This skews the bit-error distribution towards the low-order bits, substantially limiting the output error magnitude. By controlling the granularity of the shuffling, the proposed technique enables trading off quality for power, area, and timing overhead. Compared to error-correction codes, this can reduce the overhead by as much as 83% in read power, 77% in read access time, and 89% in area, when applied to various data mining applications in a 28 nm process technology.
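
The idea of steering a fault into a low-significance bit can be illustrated with a toy model. The sketch below is an assumption-laden illustration, not the paper's circuit: it models a single stuck-at fault in an 8-bit word and uses a full-word rotation so that the data LSB is the bit stored in the faulty cell. The names `store`, `load`, `rotl`, and `rotr` are hypothetical, and the granularity control mentioned in the abstract is not modeled.

```python
WIDTH = 8
MASK = (1 << WIDTH) - 1


def rotl(x, k):
    """Rotate an unsigned WIDTH-bit word left by k positions."""
    k %= WIDTH
    return ((x << k) | (x >> (WIDTH - k))) & MASK


def rotr(x, k):
    """Rotate an unsigned WIDTH-bit word right by k positions."""
    k %= WIDTH
    return ((x >> k) | (x << (WIDTH - k))) & MASK


def store(value, fault_bit, stuck_at, shuffle=True):
    """Write a word into a memory row whose cell `fault_bit` is stuck.

    With shuffling, the word is rotated so data bit 0 lands on the faulty cell.
    """
    cell = rotl(value, fault_bit) if shuffle else value
    if stuck_at:
        cell |= 1 << fault_bit             # cell always reads 1
    else:
        cell &= ~(1 << fault_bit) & MASK   # cell always reads 0
    return cell


def load(cell, fault_bit, shuffle=True):
    """Read the word back, undoing the rotation if shuffling was used."""
    return rotr(cell, fault_bit) if shuffle else cell


if __name__ == "__main__":
    fault_bit, stuck_at = 7, 0  # fault in the most significant cell
    for shuffle in (False, True):
        worst = max(
            abs(v - load(store(v, fault_bit, stuck_at, shuffle), fault_bit, shuffle))
            for v in range(1 << WIDTH)
        )
        print(f"shuffle={shuffle}: worst-case error = {worst}")
```

In this toy model the worst-case error drops from the weight of the faulty cell to the weight of the least significant bit, which is the skew towards low-order bits that the abstract describes.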

Relevance:

60.00% 60.00%

Publisher:

Abstract:

Most traditional data mining algorithms struggle to cope efficiently with the sheer scale of data. In this paper, we propose a general framework to accelerate existing clustering algorithms to cluster large-scale datasets which contain large numbers of attributes, items, and clusters. Our framework makes use of locality sensitive hashing (LSH) to significantly reduce the cluster search space. We also theoretically prove that our framework has a guaranteed error bound in terms of the clustering quality. This framework can be applied to a set of centroid-based clustering algorithms that assign an object to the most similar cluster, and we adopt the popular K-Modes categorical clustering algorithm to present how the framework can be applied. We validated our framework with five synthetic datasets and a real-world Yahoo! Answers dataset. The experimental results demonstrate that our framework is able to speed up the existing clustering algorithm by a factor of 2 to 6, while maintaining comparable cluster purity.
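
One way to picture the reduced cluster search space is an index over cluster modes that returns only a few candidate clusters per item. The sketch below is illustrative only: the abstract does not say which hash family the framework uses, so this assumes attribute-sampling LSH (each hash key is the item's values at a few randomly chosen attributes), which suits the mismatch-count dissimilarity of K-Modes. The class `ModeIndex` and its parameters are hypothetical, and a full K-Modes loop with mode updates is omitted.

```python
import random
from collections import defaultdict


def mismatch(a, b):
    """K-Modes dissimilarity: number of attributes on which a and b differ."""
    return sum(x != y for x, y in zip(a, b))


class ModeIndex:
    """LSH index over cluster modes (attribute sampling; an assumed hash family)."""

    def __init__(self, n_attrs, n_tables=4, key_len=2, seed=0):
        rng = random.Random(seed)
        # Each table hashes an item by its values at `key_len` random attributes.
        self.projections = [
            tuple(rng.sample(range(n_attrs), key_len)) for _ in range(n_tables)
        ]
        self.tables = [defaultdict(set) for _ in range(n_tables)]
        self.modes = {}

    def _keys(self, item):
        return [tuple(item[i] for i in proj) for proj in self.projections]

    def add(self, cluster_id, mode):
        """Register a cluster mode under its hash key in every table."""
        self.modes[cluster_id] = mode
        for table, key in zip(self.tables, self._keys(mode)):
            table[key].add(cluster_id)

    def candidates(self, item):
        """Clusters sharing a bucket with the item; all clusters if none do."""
        found = set()
        for table, key in zip(self.tables, self._keys(item)):
            found |= table.get(key, set())
        return found or set(self.modes)

    def assign(self, item):
        """Assign the item to the closest mode among the candidates only."""
        return min(
            sorted(self.candidates(item)),
            key=lambda c: mismatch(item, self.modes[c]),
        )


if __name__ == "__main__":
    index = ModeIndex(n_attrs=4)
    index.add(0, ("red", "small", "round", "soft"))
    index.add(1, ("blue", "large", "square", "hard"))
    print(index.assign(("red", "small", "round", "hard")))
```

The speed-up comes from computing the dissimilarity only against the candidate set rather than against every cluster; the fallback to all clusters keeps the assignment defined when no bucket matches.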

Relevance:

60.00% 60.00%

Publisher:

Abstract:

The past decade has witnessed an unprecedented growth in the amount of available digital content, and its volume is expected to continue to grow over the next few years. Unstructured text data generated from web and enterprise sources form a large fraction of such content. Many of these contain large volumes of reusable data, such as solutions to frequently occurring problems and general know-how that may be reused in appropriate contexts. In this work, we address issues around leveraging unstructured text data from sources as diverse as the web and the enterprise within the Case-based Reasoning framework. Case-based Reasoning (CBR) provides a framework and methodology for systematic reuse of historical knowledge that is available in the form of problem-solution pairs, in solving new problems. Here, we consider possibilities of enhancing Textual CBR systems under three main themes: procurement, maintenance and retrieval. We adapt and build upon state-of-the-art techniques from data mining and natural language processing in addressing various challenges therein. Under procurement, we investigate the problem of extracting cases (i.e., problem-solution pairs) from data sources such as incident/experience reports. We develop case-base maintenance methods specifically tuned to text, targeted towards retaining solutions such that the utility of the filtered case base in solving new problems is maximized. Further, we address the problem of query suggestions for textual case-bases and show that exploiting the problem-solution partition can enhance retrieval effectiveness by prioritizing more useful query suggestions. Additionally, we illustrate interpretable clustering as a tool to drill down into domain-specific text collections (since CBR systems are usually very domain specific) and develop techniques for improved similarity assessment in social media sources such as microblogs. Through extensive empirical evaluations, we illustrate the improvements that we are able to achieve over the state-of-the-art methods for the respective tasks.

Relevance:

60.00% 60.00%

Publisher:

Abstract:

Air pollution is currently one of the leading environmental causes of mortality. About 30% of the European population living in urban areas is exposed to air pollution levels above the air quality limit values legislated for the protection of human health, with road traffic being one of the main sources of urban pollution. Besides the traditional pollutants assessed in urban areas, pollutants classified as hazardous to health (Hazardous Air Pollutants, HAPs) are particularly relevant because of their known toxic and carcinogenic effects. Exposure assessment is therefore essential for determining the relationship between urban air pollution and health effects. The main objective of this study is to develop and implement a methodology for assessing individual exposure to traffic-related urban air pollution. To achieve this objective, the parameters relevant to quantifying exposure were identified, and the current and potential future health impacts associated with exposure to urban pollution were analyzed. In this context, the ExPOSITION model (EXPOSure model to traffIc-relaTed aIr pOllutioN) was developed, based on an innovative approach that analyzes individuals' trajectories, collected by GPS-enabled mobile phones and processed using data mining and geospatial analysis. The ExPOSITION model also uses a probabilistic approach to characterize the variability of microenvironmental parameters and their contribution to individual exposure. Additionally, to achieve the study's objectives, a new module was developed for calculating HAP emissions from road transport.
In this study, a modeling system comprising transport, emissions, dispersion, and exposure models was applied to the urban area of Leiria to quantify individual exposure to PM2.5 and benzene. The modeling results were validated against measurements obtained by personal monitoring and biological monitoring, showing good agreement between the model results and the measurement data. The methodology developed and implemented in this work makes it possible to analyze and estimate the magnitude, frequency, and inter- and intra-variability of individual exposure levels, as well as the contribution of different microenvironments, while explicitly considering the sequence of exposure events and the source-receptor relationship, which is fundamental for assessing health effects and for epidemiological studies. This work contributes to a better understanding of individual exposure in urban areas, providing new perspectives on individual exposure that are essential for selecting strategies to reduce exposure to urban air pollution and its consequent health effects.

Relevance:

60.00% 60.00%

Publisher:

Abstract:

The evolution of calcified tissues is a defining feature in vertebrate evolution. Investigating the evolution of proteins involved in tissue calcification should help elucidate how calcified tissues have evolved. The purpose of this study was to collect and compare sequences of matrix and bone γ-carboxyglutamic acid proteins (MGP and BGP, respectively) to identify common features and determine the evolutionary relationship between MGP and BGP. Thirteen cDNAs and genes were cloned using standard methods or reconstructed through the use of comparative genomics and data mining. These sequences were compared with available annotated sequences; a total of 48 complete or nearly complete sequences (28 BGPs and 20 MGPs) were identified across 32 different species (representing most classes of vertebrates), and evolutionarily conserved features in both MGP and BGP were analyzed using bioinformatic tools and the Tree-Puzzle software. We propose that: 1) MGP and BGP genes originated from two genome duplications that occurred around 500 and 400 million years ago before jawless and jawed fish evolved, respectively; 2) MGP appeared first concomitantly with the emergence of cartilaginous structures, and BGP appeared thereafter along with bony structures; and 3) BGP derives from MGP. We also propose a highly specific pattern definition for the Gla domain of BGP and MGP.

BGP (bone Gla protein or osteocalcin) and MGP (matrix Gla protein) belong to the growing family of vitamin K-dependent (VKD) proteins, the members of which are involved in a broad range of biological functions such as skeletogenesis and bone maintenance (BGP and MGP), hemostasis (prothrombin, clotting factors VII, IX, and X, and proteins C, S, and Z), growth control (gas6), and potentially signal transduction (proline-rich Gla proteins 1 and 2).
VKD proteins are characterized by the presence of several Gla residues resulting from the post-translational vitamin K-dependent γ-carboxylation of specific glutamates, through which they can bind to calcium-containing minerals such as hydroxyapatite. To date, VKD proteins have only been clearly identified in vertebrates (1), although the presence of a γ-glutamyl carboxylase has been reported in the fruit fly Drosophila melanogaster (2) and in marine snails belonging to the genus Conus (3). Gla residues have also been found in neuropeptides from Conus venoms (4), suggesting a wider prevalence of γ-carboxylation.