919 results for genoma, genetica, dna, bioinformatica, mapreduce, snp, gwas, big data, sequenziamento, pipeline


Relevance:

100.00% 100.00%

Publisher:

Abstract:

Analytics is the technology that manipulates data to produce information capable of changing the world we live in every day. Over the last decade, analytics has been widely used to cluster people’s behaviour in order to predict their preferences for items to buy, music to listen to, movies to watch, and even electoral choices. The most advanced companies have succeeded in controlling people’s behaviour using analytics. Despite the evident power of analytics, it is rarely applied to the big data collected within supply chain systems (i.e. distribution networks, storage systems and production plants). This PhD thesis explores the fourth research paradigm (i.e. the generation of knowledge from data) applied to supply chain system design and operations management. An ontology defining the entities and the metrics of supply chain systems is used to design data structures for data collection in supply chain systems. The consistency of this data is established by mathematical demonstrations inspired by factory physics theory. The availability, quantity and quality of the data within these data structures define different decision patterns. Ten decision patterns are identified, and validated in the field, to address ten different classes of design and control problems in supply chain systems research.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

A High-Performance Computing (HPC) job dispatcher is a critical piece of software that assigns finite computing resources to submitted jobs. This resource assignment over time is known as the on-line job dispatching problem in HPC systems. Because the problem is on-line, solutions must be computed in real time, and the time required cannot exceed a threshold beyond which normal system functioning would be affected. In addition, a job dispatcher must deal with considerable uncertainty: submission times, the number of requested resources, and the duration of jobs. Heuristic-based techniques have been broadly used in HPC systems, achieving (sub-)optimal solutions in a short time. However, their scheduling and resource allocation components are separated, which generates decoupled decisions that may cause a performance loss. Optimization-based techniques are less used for this problem, although they can significantly improve the performance of HPC systems at the expense of higher computation time. Nowadays, HPC systems are being used for modern applications, such as big data analytics and predictive model building, that generally employ many short jobs. However, this information is unknown at dispatching time, and job dispatchers need to process large numbers of jobs quickly while ensuring high Quality-of-Service (QoS) levels. Constraint Programming (CP) has been shown to be an effective approach to job dispatching problems. However, state-of-the-art CP-based job dispatchers are unable to meet the challenges of on-line dispatching, such as generating dispatching decisions in a brief period and integrating current and past information about the housing system.
For these reasons, we propose CP-based dispatchers that are better suited to HPC systems running modern applications: they generate on-line dispatching decisions in a suitable time and make effective use of job duration predictions to improve QoS levels, especially for workloads dominated by short jobs.
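
As a point of comparison for the dispatching problem described above, the sketch below shows a plain greedy heuristic, not the CP-based dispatchers the thesis proposes. It orders waiting jobs by predicted duration and starts each one that fits in the free resources. The `Job` fields, the `dispatch` function, and the single `cores` resource are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cores: int                 # requested resources (hypothetical single resource type)
    predicted_duration: float  # predicted run time, e.g. from a duration predictor


def dispatch(queue, free_cores):
    """One greedy dispatching step: shortest predicted job first.

    Returns the jobs to start now and the number of cores left free.
    """
    started = []
    for job in sorted(queue, key=lambda j: j.predicted_duration):
        if job.cores <= free_cores:
            started.append(job)
            free_cores -= job.cores
    return started, free_cores


# Toy queue; all values are made up for illustration.
queue = [Job("a", 4, 120.0), Job("b", 2, 5.0), Job("c", 3, 30.0), Job("d", 8, 1.0)]
started, left = dispatch(queue, free_cores=6)
print([j.name for j in started], left)
```

Because each job is decided in isolation, a heuristic of this kind cannot weigh one placement against another, which is the kind of decoupling an optimization-based dispatcher is meant to avoid.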

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Using Big Data and Natural Language Processing (NLP) tools, this dissertation investigates the narrative strategies that atypical actors can leverage to deal with the adverse reactions they often elicit. Extensive research shows that atypical actors, those who fail to abide by established contextual standards and norms, are subject to skepticism and face a higher risk of rejection. Indeed, atypical actors combine features and behaviors in unconventional ways, thereby generating confusion in the audience and instilling doubts about the legitimacy of their propositions. However, the same atypicality is often cited as the precursor to socio-cultural innovation and as a strategic act to expand the capacity for delivering valued goods and services. Contextualizing the conditions under which atypicality is celebrated or punished has been a significant theoretical challenge for scholars interested in reconciling this tension. Nevertheless, prior work has focused on audience-side factors or on actor-side characteristics that are only scantily under an actor's control (e.g., status and reputation). This dissertation demonstrates that atypical actors can use strategically crafted narratives to mitigate the audience’s negative response. In particular, when atypical actors evoke conventional features in their story, they are more likely to overcome the illegitimacy discount usually applied to them. Moreover, narratives become successful navigational devices for atypicality when atypical actors use more abstract language. This simplifies classification and gives the audience more flexibility to interpret and understand them.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Processing ever-growing volumes of data within a reasonable time is one of the main technological challenges of the moment. The difficulty lies not only in having efficient processing engines capable of performing coordinated computation over an enormous amount of data, but also in providing the developers of such applications with development tools that are intuitive to use and easy to deploy, with the aim of reducing the time needed to turn an application idea into a concrete implementation and of lowering the barriers to entry of the available software tools. This thesis examines the RAM3S project, whose intent is to simplify the development of data processing applications based on stream processing platforms such as Spark, Storm, Flink and Samza. The thesis sets out to fulfil the project's original purpose by providing an abstract, extensible framework for defining stream processing applications that can run interchangeably on the platforms available on the market.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Biology is now a “Big Data Science” thanks to technological advancements that allow the characterization of the whole macromolecular content of a cell or a collection of cells. This opens interesting perspectives, but only a small portion of this data can be experimentally characterized. Hence the demand for accurate and efficient computational tools for the automatic annotation of biological molecules. This is even more true for membrane proteins, on which my research project is focused, leading to the development of two machine learning-based methods: BetAware-Deep and SVMyr. BetAware-Deep is a tool for the detection and topology prediction of transmembrane beta-barrel proteins found in Gram-negative bacteria. These proteins are involved in many biological processes and are primary candidates as drug targets. BetAware-Deep combines a deep learning framework (bidirectional long short-term memory) with a probabilistic graphical model (grammatical-restrained hidden conditional random field). Moreover, it introduces a modified formulation of the hydrophobic moment, designed to include evolutionary information. BetAware-Deep outperformed all the available methods in topology prediction and reported high scores in the detection task. Glycine myristoylation in Eukaryotes is the binding of a myristic acid to an N-terminal glycine. SVMyr is a fast method based on support vector machines, designed to predict this modification in datasets of proteomic scale. It takes octapeptides as input and exploits computational scores derived from experimental examples and mean physicochemical features. SVMyr outperformed all the available methods for co-translational myristoylation prediction. In addition, it allows (as a unique feature) the prediction of post-translational myristoylation. Both tools described here were designed with in mind the best practices for the development of machine learning-based tools outlined by the bioinformatics community.
Moreover, they are made available via user-friendly web servers. All this makes them valuable tools for filling the gap between sequence data and annotated data.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Intelligent systems are now inherent to society, supporting a synergistic human-machine collaboration. Beyond economic and climate factors, energy consumption is strongly affected by the performance of computing systems. Poorly functioning software may invalidate any attempt at improvement. In addition, data-driven machine learning algorithms are the basis for human-centered applications, and their interpretability is one of the most important features of computational systems. Software maintenance is a critical discipline for supporting automatic and life-long system operation. As most software registers its inner events by means of logs, log analysis is an approach to maintaining system operation. Logs are characterized as Big Data assembled in large-flow streams, and they are unstructured, heterogeneous, imprecise, and uncertain. This thesis addresses fuzzy and neuro-granular methods that provide maintenance solutions for anomaly detection (AD) and log parsing (LP), dealing with data uncertainty and identifying ideal time periods for detailed software analyses. LP provides a deeper semantic interpretation of the anomalous occurrences. The solutions evolve over time and are general-purpose, being highly applicable, scalable, and maintainable. Granular classification models, namely the Fuzzy set-Based evolving Model (FBeM), the evolving Granular Neural Network (eGNN), and the evolving Gaussian Fuzzy Classifier (eGFC), are compared on the AD problem. The evolving Log Parsing (eLP) method is proposed for the automatic parsing of system logs. All the methods use recursive mechanisms to create, update, merge, and delete information granules according to the data behavior. For the first time in the evolving intelligent systems literature, the proposed method, eLP, is able to process streams of words and sentences.
Regarding AD accuracy, FBeM achieved (85.64 ± 3.69)%; eGNN reached (96.17 ± 0.78)%; eGFC obtained (92.48 ± 1.21)%; and eLP reached (96.05 ± 1.04)%. Besides being competitive, eLP generates a log grammar and offers a higher level of model interpretability.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

The ANTE project concerns new machine translation (MT) systems and their application in the business world. The study draws on recent developments in artificial intelligence and Big Data, which in recent years have allowed MT to reach very high levels of quality, to the point of being used by large multinationals to gain new market share. MT can also meet the needs of small, low-technology companies, improving the quality of multilingual communication through fast, low-cost translations. The study therefore aims to help strengthen the international competitiveness of small and medium-sized enterprises (SMEs) in Emilia-Romagna by improving their ability to communicate in one or more foreign languages through the introduction and the effective, informed use of latest-generation ICT solutions, thereby providing new opportunities for internationalization.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

The multifaceted evolution of network technologies ranges from big data centers to specialized network infrastructures and protocols for mission-critical operations. For instance, technologies such as Software Defined Networking (SDN) revolutionized the world of static network configuration by removing the distributed and proprietary configuration of switched networks and centralizing the control plane. While this disruptive approach is interesting from several points of view, it can introduce new, unforeseen classes of vulnerabilities. One topic of particular interest in recent years is industrial network security, an interest that started to rise in 2016 with the introduction of the Industry 4.0 (I4.0) movement. Networks that were essentially isolated by design are now connected to the internet to collect, archive, and analyze data. While this approach has gained a lot of momentum thanks to its predictive maintenance capabilities, these network technologies can be exploited in various ways from a cybersecurity perspective. Some of these technologies lack security measures and can introduce new families of vulnerabilities. On the other hand, these networks can be used to enable accurate monitoring, formal verification, or defenses that were not practical before. This thesis explores both sides: it introduces monitoring, protection, and detection mechanisms where the new network technologies make them feasible, and it demonstrates attacks in practical scenarios involving emerging network infrastructures that are insufficiently protected. The goal of this thesis is to highlight this lack of protection in terms of attacks on, and possible defenses enabled by, emerging technologies. We pursue this goal by analyzing the aforementioned technologies and by presenting three years of contributions to this field. In conclusion, we recapitulate the research questions and answer them.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

The development of Next Generation Sequencing has brought Biology into the Big Data era. The ever-increasing gap between proteins with known sequences and those with a complete functional annotation requires computational methods for automatic structural and functional annotation. My research has focused on proteins and has so far led to the development of three novel tools, DeepREx, E-SNPs&GO and ISPRED-SEQ, based on Machine and Deep Learning approaches. DeepREx computes the solvent exposure of residues in a protein chain. This problem is relevant for defining structural constraints on the possible folding of the protein. DeepREx exploits Long Short-Term Memory layers to capture residue-level interactions between positions distant in the sequence, achieving state-of-the-art performance. With DeepREx, I conducted a large-scale analysis investigating the relationship between the solvent exposure of a residue and its probability of being pathogenic upon mutation. E-SNPs&GO predicts the pathogenicity of a Single Residue Variation. Variations occurring in a protein sequence can have different effects, possibly leading to the onset of diseases. E-SNPs&GO exploits protein embeddings generated by two novel Protein Language Models (PLMs), as well as a new way of representing functional information from the Gene Ontology. The method achieves state-of-the-art performance and is extremely time-efficient compared with traditional approaches. ISPRED-SEQ predicts the presence of Protein-Protein Interaction sites in a protein sequence. Knowing how a protein interacts with other molecules is crucial for accurate functional characterization. ISPRED-SEQ exploits a convolutional layer to parse local context after embedding the protein sequence with two novel PLMs, greatly surpassing the current state of the art. All methods are published in international journals and are available as user-friendly web servers.
They have been developed following standard guidelines for FAIRness (FAIR: Findable, Accessible, Interoperable, Reusable) and are integrated into the public collection of tools provided by ELIXIR, the European infrastructure for Bioinformatics.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Hematological cancers are a heterogeneous family of diseases that can be divided into leukemias, lymphomas, and myelomas, often called “liquid tumors”. Since they cannot be surgically removed, chemotherapy represents the mainstay of their treatment. However, it still faces several challenges, such as drug resistance and low response rates, and the need for new anticancer agents is compelling. The drug discovery process is lengthy, costly, and prone to high failure rates. With the rapid expansion of biological and chemical "big data", computational techniques such as machine learning tools have been increasingly employed to speed up the whole process and make it cheaper. Machine learning algorithms can create complex models that aim to determine the biological activity of compounds against several targets, based on their chemical properties. These models are known as multi-target Quantitative Structure-Activity Relationship (mt-QSAR) models and can be used to virtually screen small and large chemical libraries to identify new molecules with anticancer activity. The aim of my Ph.D. project was to employ machine learning techniques to build an mt-QSAR classification model for the prediction of cytotoxic drugs simultaneously active against 43 hematological cancer cell lines. For this purpose, I first constructed a large and diversified dataset of molecules extracted from the ChEMBL database. Then, I compared the performance of different ML classification algorithms and identified Random Forest as the one returning the best predictions. Finally, I used different approaches to maximize the performance of the model, which achieved an accuracy of 88% by correctly classifying 93% of inactive molecules and 72% of active molecules in a validation set. This model was further applied to the virtual screening of a small dataset of molecules tested in our laboratory, where it showed 100% accuracy, correctly classifying all molecules.
This result is confirmed by our previous in vitro experiments.
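
The three figures reported for the validation set are linked by a simple identity: overall accuracy is the class-weighted average of the per-class rates. The sketch below uses that identity to back out the fraction of active molecules that the figures would imply. It assumes the rounded percentages are exact, which they are not, so the result is only a rough indication and is not a number stated in the abstract. The function name is illustrative.

```python
def implied_active_fraction(accuracy, inactive_rate, active_rate):
    """Solve accuracy = inactive_rate * (1 - p) + active_rate * p for p,
    where p is the fraction of active molecules in the evaluated set."""
    return (inactive_rate - accuracy) / (inactive_rate - active_rate)


# Rounded figures quoted in the abstract (assumed exact for this sketch).
p = implied_active_fraction(accuracy=0.88, inactive_rate=0.93, active_rate=0.72)
print(f"implied fraction of active molecules: {p:.2f}")
```

A value well below one half would indicate an imbalanced set, which is why the per-class rates are more informative than the overall accuracy alone.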

Relevance:

100.00% 100.00%

Publisher:

Abstract:

In the era of precision medicine and large-scale medical data sharing, the workflow of digital radiological big data must be managed productively and effectively. In particular, it is now possible to extract information “hidden” in digital images in order to create diagnostic algorithms that help clinicians set up more personalized therapies, which are a particular target of modern oncological medicine. Digital images generated from the patient have a “texture” structure that is not visible but encrypted; it is “hidden” because it cannot be recognized by sight alone. Thanks to artificial intelligence, pre- and post-processing software, and the generation of mathematical calculation algorithms, we could perform a classification based on non-visible data contained in radiological images. Being able to calculate the volume of body composition tissues could lead to the creation of clustered classes of patients placed in standard morphological reference tables, based on human anatomy distinguished by gender and age, and perhaps in future also by race. Furthermore, the branch of “morpho-radiology” is a useful modality for solving problems regarding personalized therapies, which are particularly needed in the oncological field. Currently, oncological therapies are no longer based on generic drugs but on targeted, personalized therapy. The lack of gender- and age-specific therapy tables could be filled through the application of morpho-radiology data analysis.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

The purpose of this thesis is to obtain raw data from the major offerwalls so that they can be processed, analyzed, and made available to those responsible for account management at a potential Ad Network such as MyAppFree. The first competitor Ad Network to be integrated into this Business Intelligence tool is OfferToro, followed by AdGem, which is currently being integrated. Before presenting the results of the tool, to which the last chapter of the thesis is devoted, the fundamental concepts needed to understand the project are examined in depth, together with the tools used to build the software architecture. Next, the architecture of the individual microservices is presented, along with the overall system architecture, which describes how the parts that make up iBiT interact with one another. Finally, the last part of the discussion is devoted to how the front end works for the account manager, who is the end user of the project, together with the analysis of the results obtained through a benchmark testing phase. A benchmark is a metric that measures a repeatable set of quantifiable results and serves as a reference point against which products and services can be compared. The purpose of the benchmark test results is to compare present and future versions of the software through their respective benchmarks.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

In recent years, owing to the growing trend towards Big Data, machine learning has become a fundamental prediction approach because it can predict house prices accurately from the attributes of the dwellings. In this thesis, several machine learning techniques are put into practice with the aim of predicting housing prices. For example, when buying a new house there are many factors to consider: location, floor area in square meters, air pollution, the number of rooms, the number of bathrooms, and so on. All these factors can affect the price of the house to a greater or lesser extent. It is precisely in cases like these that artificial intelligence, and machine learning in particular, can be applied to find a model that best approximates a price given a set of features. This thesis shows how machine learning can be used to estimate house prices as precisely as possible. The thesis is divided into five chapters. The first chapter introduces the basic concepts on which the work is based and gives some explanation of the individual models. The second chapter covers the working environment, the language, and the libraries used. The third chapter contains an exploratory analysis of the dataset and the operations carried out to prepare the data for the algorithms applied later. In chapter 4 the various models are built and the price predictions are made, while chapter 5 analyzes the results obtained and presents the conclusions.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

In recent years the need to process and store data of any kind has increased considerably; in addition, the obsolescence of the centralized model has contributed to the increasingly frequent adoption of the distributed model. The traffic crossing the nodes of these infrastructures has therefore inevitably increased, and with the advent of the IoT, Big Data, Cloud Computing, Serverless Computing, etc., it has reached very high peaks. Whereas data used to be kept on premises, today it is not unusual for the storage of one's data to be entrusted entirely to third parties. Hence, as the traffic crossing the nodes of an infrastructure grows, so does the need for this traffic to be filtered and managed by the nodes themselves. The goal of this thesis is to extend a Message-oriented Middleware, capable of guaranteeing different qualities of service for message delivery, so as to speed up the routing phase towards the destination nodes. The extension consists of adding to the previously implemented Message-oriented Middleware the ability to intercept incoming packets (which, in the case of the middleware in question, may represent the propagation of events) and redirect them to a new node based on certain parameters. The Message-oriented Middleware under study is treated as the message broker of a pub/sub model, so the redirection must take place with very low latency and, to that end, without leaving the kernel space of the operating system. For this reason eBPF was chosen, in particular the XDP module, which makes it possible to write programs that run inside the kernel.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

In recent years, owing to the enormous progress of computer science and the ever-growing quantity of data generated, there has been an increasing need for new techniques, approaches, and algorithms for searching data. Indeed, the quantity of information to be stored has become such that the term "Big Data" is now heard more and more often. This new scenario has made traditional approaches to data search increasingly ineffective. New search techniques have therefore recently been proposed, such as Nearest Neighbor search. This thesis analyzes the performance of neighbor search in a vector space using Elasticsearch as the data storage system on a cloud infrastructure. In particular, the search times of exact and approximate Nearest Neighbor searches were analyzed and compared, also evaluating the loss of precision in the case of approximate searches, using two different distance metrics: cosine similarity and dot product.
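
Cosine similarity and the dot product can rank the same vectors differently, because the dot product rewards vector length while cosine similarity ignores it. The sketch below is a brute-force exact search in plain Python, not Elasticsearch's implementation. The function names and toy vectors are illustrative assumptions, and `recall_at_k` is one common way to quantify the precision lost by an approximate search, not necessarily the measure used in the thesis.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: dot product of the vectors after length normalization."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def exact_knn(query, vectors, k, score):
    """Return the indices of the k highest-scoring vectors (exhaustive scan)."""
    ranked = sorted(range(len(vectors)),
                    key=lambda i: score(query, vectors[i]),
                    reverse=True)
    return ranked[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact neighbors that an approximate search also returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


# Toy data, made up for illustration.
vectors = [[1.0, 0.0], [3.0, 3.0], [0.0, 2.0]]
query = [1.0, 0.0]

print(exact_knn(query, vectors, k=1, score=cosine))  # ranks by direction only
print(exact_knn(query, vectors, k=1, score=dot))     # ranks by direction and magnitude
```

On these toy vectors the two metrics pick different nearest neighbors, which is why results obtained with one metric cannot be assumed to carry over to the other unless the vectors are normalized to unit length.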