803 results for genoma, genetica, dna, bioinformatica, mapreduce, snp, gwas, big data, sequenziamento, pipeline


Relevance:

100.00%

Publisher:

Abstract:

The induction of classification rules from previously unseen examples is one of the most important data mining tasks in science as well as in commercial applications. To reduce the influence of noise in the data, ensemble learners are often applied. However, most ensemble learners are based on decision tree classifiers, which are themselves affected by noise. The Random Prism classifier has recently been proposed as an alternative to the popular Random Forests classifier, which is based on decision trees. Random Prism is based on the Prism family of algorithms, which is more robust to noise. However, like most ensemble classification approaches, Random Prism does not scale well to large training data. This paper presents a thorough discussion of Random Prism and a recently proposed parallel version of it called Parallel Random Prism, which is based on the MapReduce programming paradigm. The paper provides, for the first time, a novel theoretical analysis of the proposed technique and an in-depth experimental study showing that Parallel Random Prism scales well with the number of training examples, the number of data features, and the number of processors. The expressiveness of the decision rules our technique produces makes it a natural choice for Big Data applications, where informed decision making increases the user's trust in the system.
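The map/reduce split described here can be sketched generically: each mapper trains one base classifier on a bootstrap sample of the training data, and the reducer combines the base predictions by majority vote. The sketch below substitutes a trivial majority-class learner for the PrismTCS rule inducer the paper actually uses, so it illustrates only the ensemble plumbing, not the rule induction itself.

```python
import random
from collections import Counter

def train_one_rule(sample):
    """Stand-in base learner: always predict the sample's majority class.
    (Random Prism uses PrismTCS rule induction at this step.)"""
    majority = Counter(label for _, label in sample).most_common(1)[0][0]
    return lambda x: majority

def map_stage(data, n_learners, rng):
    """Map: each 'mapper' trains one base classifier on a bootstrap sample."""
    models = []
    for _ in range(n_learners):
        sample = [rng.choice(data) for _ in data]  # sample with replacement
        models.append(train_one_rule(sample))
    return models

def reduce_stage(models, x):
    """Reduce: combine the base predictions by majority vote."""
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]

rng = random.Random(42)
data = [((0.1,), "a"), ((0.2,), "a"), ((0.9,), "b")]
models = map_stage(data, n_learners=5, rng=rng)
pred = reduce_stage(models, (0.15,))
```

In an actual MapReduce deployment the bootstrap samples would be built from data blocks local to each node, which is what allows the ensemble to scale with the number of processors.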

Relevance:

100.00%

Publisher:

Abstract:

This dissertation analyses the influence of the sugar-phosphate structure on electronic transport in the double-stranded DNA molecule, with the base-pair sequence modeled by two types of quasi-periodic sequences: Rudin-Shapiro and Fibonacci. For each sequence, the density of states was calculated and compared with that of a segment of human DNA (Ch22); afterwards, the electronic transmittance was investigated. The Hamiltonians differ between the two analyses: the density of states was obtained using the Dyson equation, while the transmittance was computed from the time-independent Schrödinger equation. In both cases, the tight-binding model was applied. The density of states obtained for the Rudin-Shapiro sequence proved similar to that of Ch22, while transmittance results were obtained only up to the fifth generation of the Fibonacci sequence. Long-range correlations were considered in both transport mechanisms.
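A tight-binding transmittance calculation of this kind is commonly done with transfer matrices. The sketch below builds a Fibonacci on-site sequence and computes the transmission of the resulting 1D chain; the on-site energies chosen for the A and B letters are hypothetical illustration values, not those of the dissertation.

```python
import math

def fibonacci_word(generation):
    """Build the Fibonacci substitution sequence: A -> AB, B -> A."""
    s = "A"
    for _ in range(generation - 1):
        s = "".join("AB" if c == "A" else "A" for c in s)
    return s

def transmittance(onsite, energy, hopping=1.0):
    """Transmission of a 1D tight-binding chain via transfer matrices:
    E*psi_n = eps_n*psi_n + t*(psi_{n+1} + psi_{n-1})."""
    # Accumulate the global transfer matrix M = M_N ... M_1,
    # where M_n = [[(E - eps_n)/t, -1], [1, 0]].
    m11, m12, m21, m22 = 1.0, 0.0, 0.0, 1.0
    for eps in onsite:
        a = (energy - eps) / hopping
        m11, m12, m21, m22 = a * m11 - m21, a * m12 - m22, m11, m12
    # Perfect leads: E = 2t*cos(k), valid inside the band |E| < 2t.
    k = math.acos(energy / (2.0 * hopping))
    num = 4.0 * math.sin(k) ** 2
    den = (m12 - m21 + (m11 - m22) * math.cos(k)) ** 2 \
        + ((m11 + m22) ** 2) * math.sin(k) ** 2
    return num / den

eps = {"A": 0.4, "B": -0.4}  # hypothetical on-site energies
chain = [eps[c] for c in fibonacci_word(5)]
t_fib = transmittance(chain, energy=0.1)
t_perfect = transmittance([0.0] * len(chain), energy=0.1)  # equals 1
```

For the uniform chain the formula recovers perfect transmission (T = 1), a useful sanity check; the quasi-periodic chain scatters the electron and gives 0 <= T <= 1.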

Relevance:

100.00%

Publisher:

Abstract:

Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES)

Relevance:

100.00%

Publisher:

Abstract:

Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES)

Relevance:

100.00%

Publisher:

Abstract:

The bicuspid aortic valve (BAV) is the most common congenital cardiac anomaly, with an incidence of 0.5%-2% in the general population. It is characterized by the presence of two valve cusps instead of three and comprises several forms. BAV is frequently associated with thoracic aortic aneurysms (TAA), and dilatation of the aorta carries the risk of acute aortic complications. Materials and methods: Twenty consecutive probands with TAA associated with BAV, who underwent surgery of the aortic valve and ascending aorta at the Cardiac Surgery Unit of the Policlinico S.Orsola-Malpighi, were recruited. Individuals with a syndromic condition predisposing to aortic aneurysm were excluded. Every adult first-degree relative was enrolled in the study. Mutation analysis of the entire ACTA2 gene was performed by bidirectional direct sequencing. In the familial forms, the entire coding portion of the genome was analysed using exome sequencing. Results: After sequencing all 20 exons and splice junctions of ACTA2 in the 20 probands, no mutation was identified. Seventy-seven first-degree relatives were enrolled, and five familial forms were identified. In one family, a mutation of the MYH11 gene was found but was not considered pathogenic. Conclusions: The absence of mutations, in both the sporadic and the familial forms, suggests that this gene is not involved in the development of BAV and TAA, and the previously reported association should be considered incidental. The genetic architecture of BAV most likely consists of several different genetic variants that interact additively to increase risk.

Relevance:

100.00%

Publisher:

Abstract:

Due to the large increase in digital data in recent years, a new parallel computing paradigm has arisen for processing big data efficiently. Many of the systems based on this paradigm, also called data-intensive computing systems, follow the Google MapReduce programming model. The main advantage of MapReduce systems is the idea of sending the computation to where the data resides, aiming to provide scalability and efficiency. In failure-free scenarios, these frameworks usually achieve good results; however, most scenarios in which they are used are characterized by failures, so these platforms typically incorporate fault tolerance and dependability techniques as built-in features. On the other hand, dependability improvements are known to carry additional resource costs. This is reasonable, and the providers offering these infrastructures are aware of it. Nevertheless, not all approaches provide the same trade-off between fault-tolerance capabilities (or, more generally, reliability capabilities) and cost. This thesis addresses the coexistence of reliability and resource efficiency in MapReduce-based systems, through methodologies that introduce minimal cost while guaranteeing an appropriate level of reliability. To achieve this, it proposes: (i) a formalization of a failure detector abstraction; (ii) an alternative solution to the single points of failure of these platforms; and, finally, (iii) a novel feedback-based resource allocation system at the container level.
These generic contributions have been evaluated on the Hadoop YARN architecture, which is today the reference platform in the data-intensive computing community. The thesis demonstrates how all of its contributions outperform Hadoop YARN in both reliability and resource efficiency.
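The failure detector abstraction mentioned in (i) is commonly realized as a heartbeat-with-timeout mechanism: workers periodically report liveness, and a node silent for longer than the timeout becomes suspected. The sketch below is a generic illustration of that idea, not the thesis's formalization.

```python
class HeartbeatFailureDetector:
    """Minimal timeout-based failure detector sketch: nodes that have
    not sent a heartbeat within `timeout` time units are suspected."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}

    def heartbeat(self, node, now):
        """Record a heartbeat received from `node` at time `now`."""
        self.last_seen[node] = now

    def suspected(self, now):
        """Return the set of nodes whose last heartbeat is too old."""
        return {n for n, t in self.last_seen.items()
                if now - t > self.timeout}

fd = HeartbeatFailureDetector(timeout=3.0)
fd.heartbeat("worker-1", now=0.0)
fd.heartbeat("worker-2", now=2.0)
suspects = fd.suspected(now=5.0)  # worker-1 has been silent for 5 > 3 units
```

Such detectors are only eventually accurate: under network delays a slow node may be wrongly suspected, which is exactly the trade-off between detection latency and false positives that reliability/cost analyses have to account for.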

Relevance:

100.00%

Publisher:

Abstract:

A substantial amount of information on the Internet is present in the form of text. The value of this semi-structured and unstructured data has been widely acknowledged, with consequent scientific and commercial exploitation. The ever-increasing data production, however, pushes data analytic platforms to their limits. This thesis proposes techniques for more efficient textual big data analysis suitable for the Hadoop analytic platform. The research explores the direct processing of compressed textual data, focusing on novel compression methods with a number of desirable properties to support text-based big data analysis in distributed environments. The novel contributions of this work include the following. Firstly, a Content-aware Partial Compression (CaPC) scheme is developed. CaPC distinguishes between informational and functional content and compresses only the informational content; the compressed data thus remains transparent to existing software libraries, which often rely on functional content to work. Secondly, a context-free, bit-oriented compression scheme (Approximated Huffman Compression) based on the Huffman algorithm is developed. This uses a hybrid data structure that allows pattern searching in compressed data in linear time. Thirdly, several modern compression schemes have been extended so that the compressed data can be safely split with respect to logical data records in distributed file systems. Furthermore, an innovative two-layer compression architecture is used, in which each compression layer is appropriate for the corresponding stage of data processing. Peripheral libraries are developed that seamlessly link the proposed compression schemes to existing analytic platforms and computational frameworks, and also make the use of the compressed data transparent to developers. The compression schemes have been evaluated on a number of standard MapReduce analysis tasks using a collection of real-world datasets; in comparison with existing solutions, they show substantial improvement in performance and significant reduction in system resource requirements.
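The distinction CaPC draws between informational and functional content can be illustrated with a toy scheme that dictionary-codes the words (informational content) while leaving newlines and tabs (the functional record and field separators) untouched, so that line- and field-oriented tools still see the same structure. The coding below is deliberately naive and stands in for CaPC's actual method.

```python
def capc_compress(text):
    """Toy content-aware partial compression: replace words with short
    dictionary codes, leaving '\n' and '\t' (the functional record and
    field separators) intact so record structure is preserved."""
    codes, out_lines = {}, []
    for line in text.split("\n"):
        fields = []
        for field in line.split("\t"):
            words = []
            for w in field.split(" "):
                if w not in codes:
                    codes[w] = "~%d" % len(codes)
                words.append(codes[w])
            fields.append(" ".join(words))
        out_lines.append("\t".join(fields))
    return "\n".join(out_lines), codes

def capc_decompress(packed, codes):
    """Invert the dictionary coding, field by field."""
    rev = {v: k for k, v in codes.items()}
    return "\n".join(
        "\t".join(" ".join(rev[w] for w in f.split(" "))
                  for f in line.split("\t"))
        for line in packed.split("\n"))

text = "chr1\t100\tgene alpha\nchr1\t200\tgene beta"
packed, codes = capc_compress(text)
```

Because every newline and tab survives compression, a splitter or field extractor can operate on `packed` exactly as it would on `text`; only repeated-word substitution is shown here, so short inputs may not actually shrink.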

Relevance:

100.00%

Publisher:

Abstract:

The identification of fishery products is a key topic in food safety. Mislabelling of food products and the substitution of certain ingredients are emerging issues in terms of food quality and nutritional safety. Food authentication and traceability, taxonomic and population genetics studies, as well as analyses of animal feeding habits and prey selection, rely on genetic analyses including the molecular technique of DNA barcoding, which consists of the amplification and sequencing of a specific region of the mitochondrial gene COI. This biomolecular technique is used to meet the demand for species determination and/or verification of the true origin of marketed products, as well as to expose labelling errors and fraudulent substitutions, which are difficult to detect especially in processed fishery products. Different kits for DNA extraction from fresh and preserved samples are available on the market; their use, however, drastically increases the cost of projects for the characterization and genotyping of the samples to be analysed. In this context, a fast DNA extraction method was developed. It requires no purification step for fresh and processed fishery products and is suitable for any analysis involving the PCR technique. The protocol allows efficient DNA amplification from any industrial waste from fish processing, regardless of the sample's preservation method. The application of this DNA extraction method, combined with the success and robustness of the PCR amplification (following the barcoding protocol), made it possible to obtain DNA sequences very quickly and at minimal cost.

Relevance:

100.00%

Publisher:

Abstract:

Master's dissertation, Biomedical Sciences, Department of Biomedical Sciences and Medicine, Universidade do Algarve, 2014

Relevance:

100.00%

Publisher:

Abstract:

Objective: To perform a 1-stage meta-analysis of genome-wide association studies (GWAS) of multiple sclerosis (MS) susceptibility and to explore the functional consequences of new susceptibility loci. Methods: We synthesized 7 MS GWAS. Each data set was imputed using HapMap phase II, and a per single nucleotide polymorphism (SNP) meta-analysis was performed across the 7 data sets. We explored RNA expression data using a quantitative trait analysis in peripheral blood mononuclear cells (PBMCs) of 228 subjects with demyelinating disease. Results: We meta-analyzed 2,529,394 unique SNPs in 5,545 cases and 12,153 controls. We identified 3 novel susceptibility alleles: rs170934T at 3p24.1 (odds ratio [OR], 1.17; p = 1.6 × 10⁻⁸) near EOMES, rs2150702G in the second intron of MLANA on chromosome 9p24.1 (OR, 1.16; p = 3.3 × 10⁻⁸), and rs6718520A in an intergenic region on chromosome 2p21, with THADA as the nearest flanking gene (OR, 1.17; p = 3.4 × 10⁻⁸). The 3 new loci do not have a strong cis effect on RNA expression in PBMCs. Ten other susceptibility loci had a suggestive p < 1 × 10⁻⁶; some of these loci have evidence of association in other inflammatory diseases (i.e., IL12B, TAGAP, PLEK, and ZMIZ1). Interpretation: We have performed a meta-analysis of GWAS in MS that more than doubles the size of previous gene discovery efforts and highlights 3 novel MS susceptibility loci. These and additional loci with suggestive evidence of association are excellent candidates for further investigations to refine and validate their role in the genetic architecture of MS.
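A per-SNP GWAS meta-analysis of this kind typically combines per-study effect estimates by inverse-variance weighting under a fixed-effects model. The sketch below uses hypothetical log odds ratios and standard errors for one SNP, not the study's data.

```python
import math

def fixed_effect_meta(betas, ses):
    """Fixed-effects inverse-variance meta-analysis of per-study effect
    estimates (here, log odds ratios for one SNP across studies)."""
    weights = [1.0 / se ** 2 for se in ses]           # w_i = 1 / SE_i^2
    beta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))                # pooled standard error
    z = beta / se                                     # z-score for the p-value
    return beta, se, z

# Hypothetical per-study log(OR) estimates for a single SNP.
betas = [math.log(1.20), math.log(1.12), math.log(1.18)]
ses = [0.05, 0.04, 0.06]
beta, se, z = fixed_effect_meta(betas, ses)
odds_ratio = math.exp(beta)
```

The pooled estimate always lies within the range of the study estimates and has a smaller standard error than any single study, which is what gives the combined analysis its extra power to push loci past the genome-wide significance threshold.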

Relevance:

100.00%

Publisher:

Abstract:

Subterranean clover stunt disease is an economically important aphid-borne virus disease affecting certain pasture and grain legumes in Australia. The virus associated with the disease, subterranean clover stunt virus (SCSV), was previously found to be representative of a new type of single-stranded DNA virus. Analysis of the virion DNA and restriction mapping of double-stranded cDNA synthesized from virion DNA suggested that SCSV has a segmented genome composed of 3 or 4 different species of circular ssDNA each of about 850-880 nucleotides. To further investigate the complexity of the SCSV genome, we have isolated the replicative form DNA from infected pea and from it prepared putative full-length clones representing the SCSV genome segments. Analysis of these clones by restriction mapping indicated that clones representing at least 4 distinct genomic segments were obtained. This method is thus suitable for generating an extensive genomic library of novel ssDNA viruses containing multiple genome segments such as SCSV and banana bunchy top virus. The N-terminal amino acid sequence and amino acid composition of the coat protein of SCSV were determined. Comparison of the amino acid sequence with partial DNA sequence data, and the distinctly different restriction maps obtained for the full-length clones suggested that only one of these clones contained the coat protein gene. The results confirmed that SCSV has a functionally divided genome composed of several distinct ssDNA circles each of about 1 kb.

Relevance:

100.00%

Publisher:

Abstract:

Environmental monitoring is becoming critical as human activity and climate change place greater pressures on biodiversity, leading to an increasing need for data to make informed decisions. Acoustic sensors can help collect data across large areas for extended periods, making them attractive for environmental monitoring. However, managing and analysing large volumes of environmental acoustic data is a great challenge and is consequently hindering the effective utilization of the big datasets collected. This paper presents an overview of our current techniques for collecting, storing, and analysing large volumes of acoustic data efficiently, accurately, and cost-effectively.

Relevance:

100.00%

Publisher:

Abstract:

Twitter is a particularly useful source of social media data: with the Twitter API (the Application Programming Interface, which offers structured access to communication data in standardized formats), researchers can, with a little effort and sufficient technical resources, build very large archives of publicly posted tweets on specific topics, areas of interest, or events. Essentially, the API delivers very long lists of hundreds, thousands, or millions of tweets and the metadata attached to those tweets; these data can then be extracted, combined, and visualized in a wide variety of ways to understand the dynamics of social media communication. This research is often built around long-established questions, but is typically conducted at a previously unprecedented scale. The projects of media and communication scholars such as Papacharissi and de Fatima Oliveira (2012), Wood and Baughman (2012), or Lotan et al. (2011), to name just a handful of recent examples, are fundamentally built on Twitter datasets that now routinely comprise millions of tweets and associated metadata, collected according to a wide range of criteria. What all of these cases have in common, however, is the need to break new methodological ground in the processing and analysis of such large datasets on media-mediated social interaction.
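The extraction-and-combination step described above can be sketched as a simple aggregation over tweet records. The records below are simplified, hypothetical stand-ins for the entity structure that Twitter API responses carry; real responses contain many more fields.

```python
from collections import Counter

def hashtag_counts(tweets):
    """Aggregate hashtag frequencies from a list of tweet records,
    assuming a simplified version of the Twitter API entity structure."""
    counts = Counter()
    for tweet in tweets:
        for tag in tweet.get("entities", {}).get("hashtags", []):
            counts[tag["text"].lower()] += 1  # hashtags are case-insensitive
    return counts

tweets = [  # hypothetical records, not real API output
    {"entities": {"hashtags": [{"text": "ausvotes"}, {"text": "politics"}]}},
    {"entities": {"hashtags": [{"text": "AusVotes"}]}},
    {"entities": {}},  # a tweet with no hashtags
]
top = hashtag_counts(tweets).most_common(1)
```

At the archive scales discussed above the same aggregation would be streamed over millions of stored tweets rather than held in a single in-memory list, but the per-record extraction logic stays the same.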