919 results for genoma, genetica, dna, bioinformatica, mapreduce, snp, gwas, big data, sequenziamento, pipeline
Abstract:
The induction of classification rules from previously unseen examples is one of the most important data mining tasks in science as well as in commercial applications. In order to reduce the influence of noise in the data, ensemble learners are often applied. However, most ensemble learners are based on decision tree classifiers, which are affected by noise. The Random Prism classifier has recently been proposed as an alternative to the popular Random Forests classifier, which is based on decision trees. Random Prism is based on the Prism family of algorithms, which is more robust to noise. However, like most ensemble classification approaches, Random Prism does not scale well on large training data. This paper presents a thorough discussion of Random Prism and a recently proposed parallel version of it called Parallel Random Prism. Parallel Random Prism is based on the MapReduce programming paradigm. The paper provides, for the first time, a novel theoretical analysis of the proposed technique and an in-depth experimental study showing that Parallel Random Prism scales well on a large number of training examples, a large number of data features and a large number of processors. The expressiveness of the decision rules that our technique produces makes it a natural choice for Big Data applications where informed decision making increases the user's trust in the system.
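To make the map/reduce structure concrete, here is a minimal Python sketch of a Parallel-Random-Prism-style ensemble, with a trivial majority-class learner standing in for the actual Prism base classifier; the function names and the stand-in learner are illustrative assumptions, not the paper's implementation.

```python
# Minimal MapReduce-style sketch of a bagged rule-learner ensemble in the
# spirit of Parallel Random Prism. The base learner below is a stand-in
# majority-class "rule" (an assumption for brevity), not the actual Prism
# algorithm; the map/reduce structure is the point of the example.
import random
from collections import Counter

def bag(data, rng):
    """Bootstrap sample: draw len(data) examples with replacement."""
    return [rng.choice(data) for _ in data]

def learn_base_classifier(sample):
    """Placeholder for a PrismTCS-style rule induction step: here we just
    predict the majority class of the bootstrap sample."""
    majority = Counter(label for _, label in sample).most_common(1)[0][0]
    return lambda features: majority

def map_phase(data, n_learners, seed=0):
    """Each 'mapper' independently bags the data and induces one base
    classifier; mappers share nothing, so this step parallelizes freely."""
    rng = random.Random(seed)
    return [learn_base_classifier(bag(data, rng)) for _ in range(n_learners)]

def reduce_phase(classifiers, features):
    """The 'reducer' combines the base predictions by majority vote."""
    votes = Counter(clf(features) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Toy usage: feature vectors paired with class labels.
data = [((0, 1), "a"), ((1, 1), "a"), ((1, 0), "b"), ((0, 0), "b")]
ensemble = map_phase(data, n_learners=10)
print(reduce_phase(ensemble, (1, 1)))
```

Because each mapper touches only its own bootstrap sample, the ensemble scales out with the number of processors, while the reduce step merely aggregates votes.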
Abstract:
This dissertation analyses the influence of the sugar-phosphate structure on electronic transport in the double-strand DNA molecule, with the base-pair sequence modeled by two types of quasi-periodic sequences: Rudin-Shapiro and Fibonacci. For each sequence the density of states was calculated and compared with the density of states of a segment of human DNA (Ch22); the electronic transmittance was then investigated. The Hamiltonians are different in the two situations. For the analysis of the density of states the Dyson equation was employed; for the transmittance, the time-independent Schrödinger equation was used. In both cases the tight-binding model was applied. The density of states obtained from the Rudin-Shapiro sequence proved to be similar to the density of states of Ch22, while the transmittance was obtained only up to the fifth generation of the Fibonacci sequence. Long-range correlations were considered in both transport mechanisms.
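As an illustration of the transmittance calculation, the following is a minimal sketch under the tight-binding model, assuming a single-channel chain whose on-site energies follow the Fibonacci substitution A → AB, B → A, perfect leads, and the standard transfer-matrix transmission formula; the energies chosen for the two base-pair types are arbitrary placeholders, not the dissertation's parameters.

```python
import numpy as np

def fibonacci_word(generations):
    """Fibonacci base-pair sequence via the substitution A -> AB, B -> A."""
    word = "A"
    for _ in range(generations):
        word = "".join("AB" if c == "A" else "A" for c in word)
    return word

def transmittance(onsite, energy, t=1.0):
    """Transmission coefficient of a 1D tight-binding chain embedded in
    perfect leads (on-site energy 0, hopping t), via transfer matrices."""
    # Total transfer matrix P = M_N ... M_1, M_n = [[(E - eps_n)/t, -1], [1, 0]].
    P = np.eye(2)
    for eps in onsite:
        P = np.array([[(energy - eps) / t, -1.0], [1.0, 0.0]]) @ P
    # Lead dispersion E = 2 t cos k; valid inside the propagating band |E| < 2t.
    k = np.arccos(energy / (2.0 * t))
    num = 4.0 * np.sin(k) ** 2
    den = ((P[0, 1] - P[1, 0] + (P[0, 0] - P[1, 1]) * np.cos(k)) ** 2
           + (P[0, 0] + P[1, 1]) ** 2 * np.sin(k) ** 2)
    return num / den

# Illustrative on-site energies for the two base-pair types (arbitrary units).
eps = {"A": 0.3, "B": -0.3}
chain = [eps[c] for c in fibonacci_word(5)]   # fifth Fibonacci generation
print(transmittance(chain, energy=0.1))
```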
Abstract:
The Bicuspid Aortic Valve (BAV) is the most common congenital cardiac anomaly, with an incidence of 0.5%-2% in the general population. It is characterized by the presence of two valve cusps instead of three and comprises several forms. BAV is frequently associated with thoracic aortic aneurysms (TAA). Dilatation of the aorta carries the risk of acute aortic complications. Materials and methods: Twenty consecutive probands with TAA associated with BAV, who had undergone surgery of the aortic valve and the ascending aorta at the Cardiac Surgery Unit of the Policlinico S.Orsola-Malpighi, were recruited. Individuals with a syndromic condition predisposing to aortic aneurysm were excluded. Every adult first-degree relative was enrolled in the study. Mutation analysis of the entire ACTA2 gene was performed by bidirectional direct sequencing. In the familial forms, the entire coding portion of the genome was analysed using exome sequencing. Results: After sequencing all 20 exons and splice junctions of ACTA2 in the 20 probands, no mutation was identified. Seventy-seven first-degree relatives were enrolled, and five familial forms were identified. In one family a mutation of the MYH11 gene was found, but it was not considered pathogenic. Conclusions: The absence of mutations, in both the sporadic and the familial forms, suggests that this gene is not involved in the development of BAV and TAA, and that the reported association should be considered coincidental. The genetic architecture of BAV most likely consists of several different genetic variants that interact additively to determine an increased risk.
Abstract:
Due to the enormous growth of digital data in recent years, a new parallel computing paradigm has arisen for processing big data efficiently. Many of the systems based on this paradigm, also called data-intensive computing systems, follow Google's MapReduce programming model. The main advantage of MapReduce systems lies in the idea of sending the computation to where the data resides, aiming to provide scalability and efficiency. In failure-free scenarios these frameworks usually achieve good results; however, most of the scenarios in which they are used are characterized by the presence of failures, so these platforms incorporate fault-tolerance and dependability techniques as built-in features. On the other hand, dependability improvements are known to come with additional resource costs. This is reasonable, and the providers offering these infrastructures are aware of it. Nevertheless, not all approaches offer the same trade-off between fault-tolerance capabilities (or, more generally, reliability capabilities) and their cost. This thesis addresses the coexistence of reliability and resource efficiency in systems based on the MapReduce paradigm, through methodologies that introduce minimal cost while guaranteeing an appropriate level of reliability. To achieve this, it proposes: (i) a formalization of a failure-detector abstraction; (ii) an alternative solution to the single points of failure of these platforms; and, finally, (iii) a novel feedback-based resource-allocation system at the container level. These generic contributions have been evaluated using the Hadoop YARN architecture as a reference, which is nowadays the state-of-the-art framework in the data-intensive computing systems community. The thesis demonstrates how all of its contributions outperform Hadoop YARN in terms of both reliability and resource efficiency.
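As an illustration of the failure-detector abstraction mentioned in contribution (i), here is a minimal heartbeat-with-timeout sketch in Python; it only conveys the idea of the abstraction and is neither the thesis's formalization nor Hadoop YARN's actual mechanism.

```python
# Illustrative heartbeat-based failure detector: a node is suspected once its
# last heartbeat is older than a timeout. This sketches the abstraction only.
import time

class HeartbeatFailureDetector:
    def __init__(self, timeout_s=10.0):
        self.timeout_s = timeout_s
        self.last_seen = {}          # node id -> timestamp of last heartbeat

    def heartbeat(self, node_id):
        """Record a heartbeat from a worker node."""
        self.last_seen[node_id] = time.monotonic()

    def suspected(self):
        """Return the nodes whose heartbeats have expired."""
        now = time.monotonic()
        return {n for n, t in self.last_seen.items()
                if now - t > self.timeout_s}

# Usage: a resource manager would poll suspected() and reschedule the
# containers that were running on suspected nodes.
fd = HeartbeatFailureDetector(timeout_s=0.05)
fd.heartbeat("worker-1")
fd.heartbeat("worker-2")
time.sleep(0.1)
fd.heartbeat("worker-2")              # worker-2 stays alive
print(fd.suspected())                 # -> {'worker-1'}
```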
Abstract:
A substantial amount of information on the Internet is present in the form of text. The value of this semi-structured and unstructured data has been widely acknowledged, with consequent scientific and commercial exploitation. The ever-increasing data production, however, pushes data analytic platforms to their limits. This thesis proposes techniques for more efficient textual big data analysis suitable for the Hadoop analytic platform. This research explores the direct processing of compressed textual data. The focus is on developing novel compression methods with a number of desirable properties to support text-based big data analysis in distributed environments. The novel contributions of this work include the following. Firstly, a Content-aware Partial Compression (CaPC) scheme is developed. CaPC makes a distinction between informational and functional content and compresses only the informational content; the compressed data thus remains transparent to existing software libraries, which often rely on functional content to work. Secondly, a context-free, bit-oriented compression scheme (Approximated Huffman Compression) based on the Huffman algorithm is developed. It uses a hybrid data structure that allows pattern searching in compressed data in linear time. Thirdly, several modern compression schemes have been extended so that the compressed data can be safely split with respect to logical data records in distributed file systems. Furthermore, an innovative two-layer compression architecture is used, in which each compression layer is appropriate for the corresponding stage of data processing. Peripheral libraries are developed that seamlessly link the proposed compression schemes to existing analytic platforms and computational frameworks, and also make the use of the compressed data transparent to developers. The compression schemes have been evaluated on a number of standard MapReduce analysis tasks using a collection of real-world datasets. In comparison with existing solutions, they show substantial improvement in performance and a significant reduction in system resource requirements.
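To convey the content-aware idea behind CaPC, here is a toy Python sketch that codes frequent word tokens (informational content) while leaving whitespace delimiters (functional content) verbatim, so record-splitting logic still sees the structure it expects; the escape-byte dictionary coder is an illustrative assumption, far simpler than the actual CaPC scheme.

```python
# Toy sketch of content-aware partial compression: code frequent word tokens
# ("informational content") as two-byte escape sequences while leaving
# whitespace delimiters ("functional content") verbatim. Assumes ASCII text
# containing no raw 0x01 bytes; far simpler than the actual CaPC scheme.
import re
from collections import Counter

ESC = 0x01  # escape byte introducing a dictionary code

def build_dictionary(text, max_entries=255):
    """Map the most frequent words to two-byte codes: ESC + index."""
    words = Counter(re.findall(r"\S+", text)).most_common(max_entries)
    return {w: bytes([ESC, i]) for i, (w, _) in enumerate(words)}

def compress(text, table):
    out = []
    for token in re.split(r"(\s+)", text):       # keep delimiters as tokens
        out.append(table.get(token) or token.encode("ascii"))
    return b"".join(out)

def decompress(blob, table):
    inverse = {code[1]: word for word, code in table.items()}
    out, i = [], 0
    while i < len(blob):
        if blob[i] == ESC:
            out.append(inverse[blob[i + 1]])     # coded word
            i += 2
        else:
            out.append(chr(blob[i]))             # verbatim delimiter/rare byte
            i += 1
    return "".join(out)

text = "the quick fox and the lazy dog and the fox\nthe end"
table = build_dictionary(text)
blob = compress(text, table)
assert decompress(blob, table) == text
print(len(blob), "bytes vs", len(text.encode("ascii")))
```

Because every coded token has a fixed two-byte form and the delimiters survive untouched, downstream code can still split the stream on newlines, which is the kind of record-safe splittability the thesis targets for distributed file systems.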
Abstract:
The identification of fishery products is one of the key issues in food safety. Mislabeling of food products and the substitution of certain ingredients are emerging issues in terms of food quality, safety and nutrition. Authentication and traceability of food products, taxonomy and population genetics studies, as well as the analysis of animal feeding habits and prey selection, rely on genetic analyses including the molecular method of DNA barcoding, which consists of amplifying and sequencing a specific region of the mitochondrial gene COI. This biomolecular technique is used to meet the demand for species determination and/or verification of the real origin of marketed products, as well as to expose labeling errors and fraudulent substitutions, which are hard to detect especially in processed fishery products. Different kits for DNA extraction from fresh and preserved samples are available on the market; their use, however, drastically increases the cost of projects for characterizing and genotyping the samples to be analysed. In this scenario, a fast DNA extraction method was developed. It requires no purification step for fresh and processed fishery products and is suitable for any analysis based on the PCR technique. The protocol allows efficient amplification of DNA from any industrial waste deriving from fish processing, regardless of the method used to preserve the sample. The application of this DNA extraction method, combined with the success and robustness of the PCR amplification (following the barcoding protocol), made it possible to obtain DNA sequencing in a very short time and at minimal cost.
Abstract:
Master's dissertation, Biomedical Sciences, Departamento de Ciências Biomédicas e Medicina, Universidade do Algarve, 2014.
Abstract:
CREB is a cAMP-responsive nuclear DNA-binding protein that binds to cAMP response elements and stimulates gene transcription upon activation of the cAMP signalling pathway. The protein consists of an amino-terminal transcriptional transactivation domain and a carboxyl-terminal DNA-binding domain (bZIP domain) comprising a basic region and a leucine zipper involved in DNA recognition and dimerization, respectively. Recently, we discovered a testis-specific transcript of CREB that contains an alternatively spliced exon encoding multiple stop codons. CREB encoded by this transcript is a truncated protein lacking the bZIP domain. We postulated that the antigen detected by CREB antiserum in the cytoplasm of germinal cells is the truncated CREB that must also lack its nuclear translocation signal (NTS). To test this hypothesis we prepared multiple expression plasmids encoding carboxyl-terminal deletions of CREB and transiently expressed them in COS-1 cells. By Western immunoblot analysis as well as immunocytochemistry of transfected cells, we show that CREB proteins truncated to amino acid 286 or shorter are sequestered in the cytoplasm, whereas a CREB of 295 amino acids is translocated into the nucleus. Chimeric CREBs containing a heterologous NTS fused to the first 248 or 261 amino acids of CREB are able to drive the translocation of the protein into the nucleus. Thus, the nine amino acids in the basic region involved in DNA recognition between positions 287 and 295 (RRKKKEYVK) of CREB contain the NTS. Further, mutation of the lysine at position 290 in CREB to an asparagine diminishes nuclear translocation of the protein. (ABSTRACT TRUNCATED AT 250 WORDS)
Abstract:
The aim of this study was to describe the demographic, clinicopathological, biological and morphometric features of Libyan breast cancer patients. The supporting value of nuclear morphometry and static image cytometry for the sensitivity of detecting breast cancer in conventional fine-needle aspiration biopsies was estimated. The findings were compared with findings in breast cancer in Finland and Nigeria. In addition, the value of ER and PR was evaluated. There were 131 histological samples, 41 cytological samples, and demographic and clinicopathological data from 234 Libyan patients. Libyan breast cancer is predominantly premenopausal; in this feature it is similar to breast cancer in sub-Saharan Africans, but clearly different from breast cancer in Europeans, which is predominantly postmenopausal in character. At presentation most Libyan patients have locally advanced disease, which is associated with poor survival rates. Nuclear morphometry and image DNA cytometry agree with earlier published data on the Finnish population and indicate that nuclear size and DNA analysis of nuclear content can be used to increase cytological sensitivity and specificity in doubtful breast lesions, particularly when the free cell sampling method is used. Combining the morphometric data with earlier free cell data gave the following diagnostic guidelines: range of overlap in free cell samples: 55 μm²-71 μm²; cut-off values for diagnostic purposes: mean nuclear area (MNA) > 54 μm² for 100% detection of malignant cases (specificity 84%), MNA < 72 μm² for 100% detection of benign cases (sensitivity 91%). Histomorphometry showed a significant correlation between the MNA and most clinicopathological features, with the strongest association observed for histological grade (p < 0.0001). MNA seems to be a prognosticator in Libyan breast cancer (Pearson's test r = -0.29, p = 0.019), but at a lower level of significance than in the European material. A corresponding relationship was not found in shape-related morphometric features. ER and PR staining scores correlated with clinical stage (p = 0.017 and 0.015, respectively) and were also associated with lymph-node-negative status (p = 0.03 and p = 0.05, respectively). Receptor-positive (HR+) patients had better survival. The fraction of HR+ cases among Libyan breast cancers is about the same as the fraction of positive cases in European breast cancer. The study suggests that even weak staining (corresponding to as few as 1% positive cells) has prognostic value. The prognostic significance may be associated with the practice of using antihormonal therapy in HR+ cases. The low survival and advanced presentation are associated with active cell proliferation, atypical nuclear morphology and aneuploid nuclear DNA content in Libyan breast cancer patients. The findings support the idea that breast cancer is not one type of disease, but should probably be classified into premenopausal and postmenopausal types.
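To make the quoted cut-offs concrete, the following small sketch applies them as a three-way triage rule; the thresholds are those reported above, while the function name and labels are illustrative only, not clinical guidance.

```python
# Illustrative triage using the mean nuclear area (MNA) cut-offs quoted above:
# MNA <= 54 um^2 lies below the threshold that caught all malignant cases,
# MNA >= 72 um^2 lies above the threshold that caught all benign cases,
# and 55-71 um^2 is the reported overlap zone needing further work-up.
def triage_by_mna(mna_um2: float) -> str:
    if mna_um2 <= 54.0:
        return "benign-range"
    if mna_um2 >= 72.0:
        return "malignant-range"
    return "overlap-zone"

for mna in (48.0, 63.0, 80.0):
    print(mna, "->", triage_by_mna(mna))
```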
Abstract:
This master's thesis examined how Entrepreneurial, Customer and Knowledge Management Orientations are needed in small retail firms' use of Big Data technology for Customer Knowledge Management. A view of small retailers' ability to move into the Big Data era is built on empirical evidence of owner-managers' attitudes and the firms' processes. Abductive content analysis was used as the research strategy, and the qualitative data were collected through theme interviews with the owner-managers of 11 small retail firms. The biggest obstacles to the use of Big Data by small retail firms are: a lack of information about the new technology; a lack of Knowledge Management Orientation; and a lack of a proactive dimension in Entrepreneurial and Customer Orientations. A strong reactive customer-led orientation and the owner-manager's capacity for systems thinking will support the development of Customer Knowledge Management. The low level of technology use prevents the utilization of customer information. Co-operation between firms or with educational organizations may significantly enhance the use of Big Data technology by small retail firms.
Abstract:
Big data has been predicted to hold hundreds of billions of dollars in exploitation potential. Big data refers to huge and rapidly growing masses of data originating from numerous different sources. The aim of this bachelor's thesis is to examine how big data can be utilized in supply chain management and in the different areas of the supply chain. The work was carried out as a literature review based on the literature on big data and supply chain management, and in particular on scientific articles combining the two. By exploiting big data, the supply chain can be made more efficient, profits maximized, and supply and demand matched better. Exploiting big data also improves risk management, decision-making, readiness for change and stakeholder relations. With big data, a complete view of the customer can be created, with which marketing, segmentation, pricing and product placement can be optimized. Big data also makes it possible to improve procurement, production and maintenance, and to monitor transports and inventories more efficiently. Utilizing big data is challenging and involves technological, organizational and process-related challenges. One solution is to outsource the adoption and use of big data analytics, but this carries its own risks.
Abstract:
The purpose of this bachelor's thesis was to examine what kinds of business opportunities and challenges are associated with Big Data and its characteristics, and how Big Data is defined in a modern, up-to-date way. The research problem was approached by means of a narrative literature review. In other words, the thesis is a unified overview of the current situation compiled from scattered information. The source material consists mainly of scientific articles, but textbook material, conference publications and news articles were also used. The academic sources used in the study contained many mutually similar views on the research topic. Based on them, two tables of the observed opportunities and challenges were compiled, and the rows of the tables were named after the characteristics they describe. In the study, the business opportunities and challenges were divided into five main categories and four subcategories. The study was conducted from a business perspective, so many technical aspects of Big Data were set aside. The thesis is interdisciplinary in nature and seeks to observe one of the newest terms in computer science in a business context. In the thesis, the characteristics associated with Big Data were found to present opportunities which, based on the detection of correlations, could be divided into opportunities for more precise market segmentation and for supporting decision-making. The opportunities for real-time monitoring are based on the velocity and volume of Big Data, that is, its continuous growth. The challenges related to the characteristics can be divided into five categories, some of which relate to the operating environment and some to the organization's internal operations.