864 results for Data mining methods
Abstract:
We review recent visualization techniques aimed at supporting tasks that require the analysis of text documents, from approaches targeted at visually summarizing the relevant content of a single document to those aimed at assisting exploratory investigation of whole collections of documents. Techniques are organized according to their target input material (either single texts or collections of texts) and their focus, which may be displaying content, emphasizing relevant relationships, highlighting the temporal evolution of a document or collection, or helping users to handle results from a query posed to a search engine. We describe the approaches adopted by distinct techniques and briefly review the strategies they employ to obtain meaningful text models. We also discuss how they extract the information required to produce representative visualizations, the tasks they intend to support, the interaction issues involved, and their strengths and limitations. Finally, we present a summary of techniques, highlighting their goals and distinguishing characteristics, and briefly discuss some open problems and research directions in the fields of visual text mining and text analytics.
Abstract:
Context. The ESO public survey VISTA variables in the Via Lactea (VVV) started in 2010. VVV targets 562 sq. deg in the Galactic bulge and an adjacent plane region and is expected to run for about five years. Aims. We describe the progress of the survey observations in the first observing season, the observing strategy, and the quality of the data obtained. Methods. The observations are carried out on the 4-m VISTA telescope in the ZYJHKs filters. In addition to the multi-band imaging, the variability monitoring campaign in the Ks filter has started. Data reduction is carried out using the pipeline at the Cambridge Astronomical Survey Unit. The photometric and astrometric calibration is performed via the numerous 2MASS sources observed in each pointing. Results. The first data release contains the aperture photometry and astrometric catalogues for 348 individual pointings in the ZYJHKs filters taken in the 2010 observing season. The typical image quality is ~0.9-1.0 arcsec. The stringent photometric and image quality requirements of the survey are satisfied in 100% of the JHKs images in the disk area and 90% of the JHKs images in the bulge area. The completeness in the Z and Y images is 84% in the disk and 40% in the bulge. The first season catalogues contain 1.28 × 10^8 stellar sources in the bulge and 1.68 × 10^8 in the disk area detected in at least one of the photometric bands. The combined, multi-band catalogues contain more than 1.63 × 10^8 stellar sources. About 10% of these are double detections because of overlapping adjacent pointings. These overlapping multiple detections are used to characterise the quality of the data. The images in the JHKs bands typically extend ~4 mag deeper than 2MASS. The magnitude limit and photometric quality depend strongly on crowding in the inner Galactic regions. The astrometry for Ks = 15-18 mag has rms ~35-175 mas. Conclusions. The VVV Survey data products offer a unique dataset to map the stellar populations in the Galactic bulge and the adjacent plane and provide an exciting new tool for the study of the structure, content, and star-formation history of our Galaxy, as well as for investigations of the newly discovered star clusters, star-forming regions in the disk, high proper motion stars, asteroids, planetary nebulae, and other interesting objects.
Abstract:
Background. Mycelium-to-yeast transition in the human host is essential for pathogenicity by the fungus Paracoccidioides brasiliensis, and both cell types are therefore critical to the establishment of paracoccidioidomycosis (PCM), a systemic mycosis endemic to Latin America. The infected population is about 10 million individuals, 2% of whom will eventually develop the disease. Previously, transcriptome analysis of mycelium and yeast cells resulted in the assembly of 6,022 sequence groups. Gene expression analysis, using both in silico EST subtraction and cDNA microarray, revealed genes that were differential to yeast or mycelium, and we discussed those involved in sugar metabolism. To advance our understanding of the molecular mechanisms of dimorphic transition, we performed an extended analysis of gene expression profiles using the methods mentioned above. Results. In this work, continuous data mining revealed 66 new differentially expressed sequences, which were categorised according to MIPS (Munich Information Center for Protein Sequences) by the cellular process in which they are presumably involved. Two well-represented classes were chosen for further analysis: (i) control of cell organisation (cell wall, membrane and cytoskeleton), whose representatives were hex (encoding a hexagonal peroxisome protein) and bgl (encoding a 1,3-β-glucosidase) in mycelium cells, and ags (an α-1,3-glucan synthase), cda (a chitin deacetylase) and vrp (a verprolin) in yeast cells; (ii) ion metabolism and transport, in which two genes putatively implicated in ion transport were confirmed to be highly expressed in mycelium cells (isc and ktp, respectively an iron-sulphur cluster-like protein and a cation transporter), as was a putative P-type cation pump (pct) in yeast. Also, several enzymes from the cysteine de novo biosynthesis pathway were shown to be up-regulated in the yeast form, including ATP sulphurylase, APS kinase and PAPS reductase.
Conclusion. Taken together, these data show that several genes involved in cell organisation and ion metabolism/transport are expressed differentially along the dimorphic transition. Hyperexpression in yeast of the enzymes of sulphur metabolism reinforces the idea that this metabolic pathway could be important for this process. Functional analysis of such genes may lead to a better understanding of the infective process, thus providing new targets and strategies to control PCM.
Abstract:
This work stems from the goal of identifying statistical tools to investigate, from several angles, the workflow of an Anatomical Pathology Laboratory. The starting point of the study is the working environment of ATHENA, a management software package used in Anatomical Pathology and developed by NoemaLife S.p.A., a company specialising in healthcare IT. Starting from this application, the laboratory workflow was first formalised (Chapter 2), in its characteristics and possible variants, by identifying the main operations through a series of "phases". These phases, together with the additional information associated with them, are the focus of the study throughout and from several points of view. For completeness, the analysis we present was developed in two scenarios that take into account different aspects of the available information. The first scenario considers the sequences of phases in chronological order, including any repetitions or cycles of phases that precede completion. After processing the data into specific formats, an initial graphical Workflow Mining investigation was carried out (Chapter 3) with the aid of EMiT, a software tool that takes a set of process logs and returns a graphical representation of the workflow they describe. This investigation already makes it possible to assess how fully an application is used relative to its potential. Subsequently, the same phases were processed with a specific adaptation of a common global alignment algorithm, the Needleman-Wunsch algorithm (Chapter 4). Alignment techniques applied to process sequences can identify, within a specific encoding of the phases, the similarities between clinical cases.
The Needleman-Wunsch algorithm identifies the matches and mismatches between two character strings, assigning scores that lead to an evaluation of their similarity. The algorithm was suitably modified so that it can recognise cycles and repetitions and penalise them differently from missing phases. Also with a view to alignment, the heuristic Clustal algorithm is used, which starts from a pairwise comparison between sequences and builds a dendrogram that graphically represents the aggregation of the cases according to their similarity. The dendrogram, thanks to its tree-like graphical structure, can intuitively show how the similarity of a pattern of cases evolves. The second scenario (Chapter 5) adds temporal information to the sequences, in terms of the instant at which each phase is executed. From a domain based on sequences of phases, we thus move to a time-series scenario. Times are in fact essential data for evaluating the performance of a laboratory and for checking conformity with the required standards. The comparison between cases was carried out in several ways, so as to establish the distance between all pairs under different aspects: the sequences, represented in a specific reference system, were compared using the Euclidean distance and Dynamic Time Warping, which capture, respectively, temporal discrepancies and discrepancies of shape and, hence, of process. In light of the results and their comparison, initial assessments of the relevance of the distances and of the information that can be deduced from them are presented at this stage. Chapter 6 covers the search for correlations between characteristic elements of the process and its performance. Various factors, such as the procedures used, the users involved and further specifics, directly or indirectly determine the quality of the service provided.
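As an illustration of the global alignment step mentioned in this abstract, here is a minimal sketch of the standard Needleman-Wunsch algorithm in Python. It is not the modified version developed in the thesis: the single-character phase encoding, the score values (`match`, `mismatch`, `gap`) and the example strings are assumptions chosen only for illustration, and cycles and repetitions receive no special penalty here.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Standard global alignment of two strings.

    Returns (score, aligned_a, aligned_b). The scoring values are
    illustrative assumptions, not those used in the thesis.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for a[i-1] against b[j-1]
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # match or mismatch
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment
    i, j = n, m
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


# Hypothetical encoding: each letter stands for one workflow phase.
print(needleman_wunsch("ABCD", "ABD"))
```

With these assumed scores, a case that skips phase `C` aligns against the full sequence with a single gap. A variant that penalises repeated phases differently from missing ones, as the abstract describes, would need a different gap rule.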
The previously calculated distances are therefore subjected to clustering, a technique that, starting from a heterogeneous set of elements, identifies families or groups of similar items. The algorithm used is UPGMA, commonly applied in clustering because, by using a weighted-average logic, it leads to pertinent clusterings in a range of settings, from biology to industry. The resulting clusters can then finally be searched for useful correlations, which are identified and interpreted in relation to the management activity of the laboratory. This work therefore proposes experimental models adapted to the case at hand but ideally extensible, in whole or in part, to all processes with similar characteristics.
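The two distances named in this abstract for the time-series scenario can be sketched as follows. This is a minimal illustration, not the thesis implementation: the absolute difference used as the local cost in Dynamic Time Warping and the sample series are assumptions.

```python
import math


def euclidean(x, y):
    """Point-by-point Euclidean distance; requires series of equal length."""
    if len(x) != len(y):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))


def dtw(x, y):
    """Classic Dynamic Time Warping cost with |x_i - y_j| as local cost."""
    n, m = len(x), len(y)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of warping x[:i] onto y[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(x[i - 1] - y[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # x advances, y stays
                cost[i][j - 1],      # y advances, x stays
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Hypothetical phase timings: the second case is the first with one delay.
case_a = [0, 1, 2, 3]
case_b = [0, 0, 1, 2, 3]
print(dtw(case_a, case_b))
```

Because warping absorbs the repeated sample, the two hypothetical cases have zero DTW cost although they differ in length, which is the sense in which DTW compares shape rather than exact timing. The Euclidean distance is not even defined for them without resampling.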
Abstract:
The aims of this work were the identification of new genes selectively activated in tumours and the development of a methodological process for investigating the molecular effects of the aberrant activation of such genes. For the first question we developed two complementary methods. First, we searched for new members of the Cancer/Germline (CG) family of genes, which are already attractive targets of ongoing phase I/IIa trials. For this purpose, a bioinformatic data mining approach was devised. This led to the successful in silico cloning of new CG genes. To identify genes overexpressed in tumour cells, we used a cDNA microarray with 1152 selected genes of direct or indirect relevance to tumour immunology or tumour biology. The comparative transcriptional analysis of human tumour and normal tissues with this array led to the rediscovery of already known tumour-associated transcriptional changes and also to the discovery of previously undescribed ones. The second major focus of this work was the technological development of a versatile process for investigating the molecular effects of a gene aberrantly expressed in cells. To simulate this situation, we produced in vitro transcribed RNA of this gene and electroporated it into target cells. Transcriptional analyses of such transfectants with Affymetrix oligonucleotide microarrays revealed, at the whole-genome level, entire cascades of consecutive transcriptional alterations.
Abstract:
Supernovae are among the most energetic events occurring in the universe and are so far the only verified extrasolar source of neutrinos. As the explosion mechanism is still not well understood, recording a burst of neutrinos from such a stellar explosion would be an important benchmark for particle physics as well as for core collapse models. The neutrino telescope IceCube is located at the Geographic South Pole and monitors the Antarctic glacier for Cherenkov photons. Even though it was conceived for the detection of high-energy neutrinos, it is capable of identifying a burst of low-energy neutrinos ejected from a supernova in the Milky Way by exploiting the low photomultiplier noise in the Antarctic ice and extracting a collective rate increase. A signal Monte Carlo specifically developed for water Cherenkov telescopes is presented. With its help, we investigate how well IceCube can distinguish between core collapse models and oscillation scenarios. In the second part, nine years of data taken with the IceCube precursor AMANDA are analyzed. Intensive data cleaning methods are presented along with a background simulation. From the result, an upper limit on the expected occurrence of supernovae within the Milky Way is determined.
Abstract:
The problem of prediction, that is, the search for predictive patterns within data, has been studied extensively. Many robust and efficient methodologies have been developed, procedures based on the analysis of structured numerical information. Textual information, on the other hand, is highly unstructured. An immediate conclusion would therefore be that predictive analysis of textual data requires methods completely different from the well-known data mining techniques. A prediction problem can instead be solved with the same methods: textual data and documents can be transformed into numerical values, for example by considering the absence or presence of terms, which makes it possible to use the already developed techniques efficiently. Text mining enables the combination of concepts from extremely heterogeneous fields of application. With the immense amount of textual data available, for instance on the World Wide Web, and continually growing owing to the pervasive use of smartphones and computers, the fields of application of text analysis become countless. The advent and spread of social networks and of micro-blogging enables people to share opinions and moods, creating a text corpus of incalculable size that is updated daily. The new techniques of Sentiment Analysis, or Opinion Mining, analyse the emotional state or the type of opinion expressed within a text document. They are disciplines through which one can, for example, extract indicators of the mood of an individual, or of a set of individuals, creating a representation of the social emotional state. Can the trend of the social emotional state macroscopically influence the course of global events?
Studies in Behavioural Economics and Finance affirm a link between emotional state, decision-making ability and economic indicators. Thanks to the available techniques and to the mass of continuously updated textual data concerning the mood of millions of individuals, it becomes possible to analyse such correlations. In this study a system is built for forecasting changes in stock market indices, based on textual data extracted from the microblogging platform Twitter in the form of public tweets; the system includes techniques for improving the forecast based on the study of text similarity, categorising each text's actual contribution to the forecast.
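The transformation of text into numerical values by the presence or absence of terms, mentioned in this abstract, can be sketched as below. This is only an illustration: the function names, the lower-casing and whitespace tokenisation, and the sample texts are assumptions, not details taken from the study.

```python
def build_vocabulary(documents):
    """Collect the distinct lower-cased tokens of all documents, sorted."""
    return sorted({token for doc in documents for token in doc.lower().split()})


def presence_vector(document, vocabulary):
    """Encode a document as 1/0 flags: is each vocabulary term present?"""
    tokens = set(document.lower().split())
    return [1 if term in tokens else 0 for term in vocabulary]


# Hypothetical short texts standing in for public tweets.
docs = ["markets rise today", "markets fall"]
vocab = build_vocabulary(docs)
vectors = [presence_vector(d, vocab) for d in docs]
print(vocab)
print(vectors)
```

Each document becomes a fixed-length binary vector, so the usual data mining methods for structured numerical data can be applied to it directly.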
Abstract:
Nowadays, more and more data is collected in large amounts, so the need to study it both efficiently and profitably is growing; we want to obtain new and significant information that was not known before the analysis. Many graph mining algorithms have been developed to date, but an algebra that systematically defines how to generalize such operations is missing. To advance the development of such an algebra for automatic analysis, we propose for the first time (to the best of my knowledge) some primitive operators that may be the prelude to the systematic definition of a hypergraph algebra.
Abstract:
In this work we discuss a project started by the Emilia-Romagna Regional Government regarding the management of public transport. In particular, we perform a data mining analysis on the data set of this project. After introducing the Weka software used for our analysis, we identify the most useful data mining techniques and algorithms, and we show how these results can be used to violate the privacy of the public transport operators themselves. Finally, although it is outside the main topic of this work, we also say a few words about how this kind of attack can be prevented.
Abstract:
Autism Spectrum Disorders (ASDs) are a set of neurodevelopmental disorders and represent a significant public health problem. Currently, ASDs are not diagnosed before the 2nd year of life, but early identification would be crucial, as early interventions are much more effective than specific therapies starting in later childhood. To this end, cheap and contactless automatic approaches have recently aroused great clinical interest. Among them, the cry and the movements of the newborn, both involving the central nervous system, are proposed as possible indicators of neurological disorders. This PhD work is a first step towards solving this challenging problem. An integrated system is presented that enables the recording of audio (crying) and video (movements) data of the newborn, their automatic analysis with innovative techniques for the extraction of clinically relevant parameters, and their classification with data mining techniques. New robust algorithms were developed for the selection of the voiced parts of the cry signal, the estimation of acoustic parameters based on the wavelet transform, and the analysis of the infant’s general movements (GMs) through a new body model for segmentation and 2D reconstruction. Through a thorough literature review, this thesis presents the state of the art on these topics, which shows that no studies exist concerning normative ranges for newborn infant cry in the first 6 months of life or the correlation between cry and movements. Using the new automatic methods, a population of control infants (“low-risk”, LR) was compared to a group of “high-risk” (HR) infants, i.e. siblings of children already diagnosed with ASD. A subset of LR infants clinically diagnosed as newborns with Typical Development (TD) and one infant affected by ASD were also compared. The results show that the selected acoustic parameters allow good differentiation between the two groups. This result opens new diagnostic and therapeutic perspectives.
Abstract:
Development and analysis of a sample dataset, consisting of about 3 million entries and extracted from a data warehouse of information on the energy consumption of several smart homes.
Abstract:
Advances in the area of mobile and wireless communication for healthcare (m-Health) along with the improvements in information science allow the design and development of new patient-centric models for the provision of personalised healthcare services, increase of patient independence and improvement of patient's self-control and self-management capabilities. This paper comprises a brief overview of the m-Health applications towards the self-management of individuals with diabetes mellitus and the enhancement of their quality of life. Furthermore, the design and development of a mobile phone application for Type 1 Diabetes Mellitus (T1DM) self-management is presented. The technical evaluation of the application, which permits the management of blood glucose measurements, blood pressure measurements, insulin dosage, food/drink intake and physical activity, has shown that the use of the mobile phone technologies along with data analysis methods might improve the self-management of T1DM.
Abstract:
Sinotubular junction dilation is one of the most frequent pathologies associated with aortic root incompetence. Hence, we create a finite element model considering the whole root geometry; then, starting from healthy valve models and referring to measures of pathological valves reported in the literature, we reproduce the pathology of the aortic root by imposing appropriate boundary conditions. After evaluating the virtual pathological process, we are able to correlate dimensions of non-functional valves with dimensions of competent valves. Such a relation could be helpful in recreating a competent aortic root and, in particular, it could provide useful information in advance in aortic valve sparing surgery.