877 results for data-mining application
Abstract:
This thesis reports on the evolution of methods for analyzing techno-social systems, drawing on research experiences in which the author took part directly. The first case is a data-mining study of a word-association dataset named Human Brain Cloud: the dataset is validated and, with the help of a non-trivial model, a better understanding of language properties is obtained. Next, a real complex-system experiment is introduced: the WideNoise experiment, run within the EveryAware European project. The project and the course of the experiment are described, and the data analysis is presented. The Experimental Tribe platform for social computation is then introduced. It was conceived to help researchers implement web experiments, and it also aims to catalyze the cumulative growth of experimental methodologies and the standardization of the tools mentioned above. The last part discusses in detail three further research experiences that have already taken place on the Experimental Tribe platform, from the design of each experiment to the analysis of the results and, finally, the modeling of the systems involved. The experiments are: CityRace, which measures the strategies people adopt when facing traffic; laPENSOcosì, which aims to unveil the structure of political opinion; and AirProbe, again implemented within the EveryAware project, which monitored shifts in opinion about air quality in a community informed about local air pollution. In conclusion, the evolution of the methods for investigating techno-social systems emerges, together with the opportunities and threats offered by this new scientific path.
Abstract:
Nowadays it is very common to search Google for any kind of information, and many people with health problems search Google for symptoms, medical advice and possible remedies. This holds for both occasional and chronic patients: the first group often searches for reassurance and for information about symptoms and recovery times, while the second group looks for new treatments and solutions. Social networks have also become places for medical communication, where patients share their experiences, listen to those of others and exchange advice. All this searching, question asking and posting has contributed to the growth of very large distributed online databases of information, known as Big Data, which are very useful but also very complex and therefore require specific algorithms to extract and understand the variables of interest. To analyze this interesting group of patients, efforts were concentrated on patients affected by Crohn's disease, a type of inflammatory bowel disease (IBD) that can affect any part of the gastrointestinal tract, from the mouth to the anus, causing a wide variety of symptoms. Medical and computer science expertise was used to identify and study what patients with this disease experience and write on social media, in order to understand how their disease evolves over time and what their mood about it is.
Abstract:
Information is nowadays a key resource: machine learning and data mining techniques have been developed to extract high-level information from large amounts of data. As most data comes in the form of unstructured natural-language text, research on text mining is currently very active and addresses practical problems. Among these, text categorization deals with the automatic organization of large quantities of documents into predefined taxonomies of topic categories, possibly arranged in large hierarchies. In commonly proposed machine learning approaches, classifiers are automatically trained from pre-labeled documents: they can perform very accurate classification, but often require a sizable training set and notable computational effort. Methods for cross-domain text categorization have been proposed, which make it possible to leverage a set of labeled documents from one domain to classify those of another. Most methods use advanced statistical techniques, usually involving parameter tuning. A first contribution presented here is a method based on nearest centroid classification, where profiles of categories are generated from the known domain and then iteratively adapted to the unknown one. Despite being conceptually simple and having easily tuned parameters, this method achieves state-of-the-art accuracy on most benchmark datasets with fast running times. A second, deeper contribution involves the design of a domain-independent model to distinguish the degree and type of relatedness between arbitrary documents and topics, inferred from the different types of semantic relationships between their respective representative words, which are identified by specific search algorithms. The application of this model is tested on both flat and hierarchical text categorization, where it potentially allows the efficient addition of new categories during classification.
Results show that classification accuracy still requires improvement, but that models generated from one domain can be effectively reused in a different one.
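The iterative adaptation of category profiles mentioned in this abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the bag-of-words vectors, the cosine similarity, the stopping rule and every function name below are assumptions chosen only to show the general idea of building centroids on a labeled source domain and re-estimating them on an unlabeled target domain.

```python
import math
from collections import Counter


def vectorize(text):
    """Hypothetical bag-of-words representation: term -> raw count."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Sum of vectors; cosine ignores scale, so no division is needed."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return total


def classify(doc, centroids):
    """Label of the most similar centroid (ties go to the first label in sorted order)."""
    return max(sorted(centroids), key=lambda label: cosine(doc, centroids[label]))


def adapt_centroids(source, target_texts, max_iterations=10):
    """source: dict label -> list of labeled texts from the known domain.
    target_texts: unlabeled texts from the unknown domain.
    Returns the predicted label of each target text."""
    centroids = {label: centroid([vectorize(t) for t in texts])
                 for label, texts in source.items()}
    targets = [vectorize(t) for t in target_texts]
    labels = None
    for _ in range(max_iterations):
        new_labels = [classify(doc, centroids) for doc in targets]
        if new_labels == labels:  # assignments stable: stop
            break
        labels = new_labels
        # Re-estimate each profile from the target documents assigned to it.
        for label in centroids:
            members = [d for d, assigned in zip(targets, labels) if assigned == label]
            if members:
                centroids[label] = centroid(members)
    return labels


if __name__ == "__main__":
    source = {
        "sports": ["football match goal team", "team wins match"],
        "tech": ["computer software code", "software bug code"],
    }
    target = ["hockey team match puck", "puck hockey rink",
              "laptop software chip", "chip laptop silicon"]
    print(adapt_centroids(source, target))
```

In the toy data, the last target document shares no vocabulary with the source domain, so it can only be placed correctly after the profiles have been re-estimated from target documents. This is the effect the iterative adaptation is meant to capture.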
Abstract:
Autism Spectrum Disorders (ASDs) are a set of neurodevelopmental disorders and represent a significant public health problem. Currently, ASDs are not diagnosed before the 2nd year of life, but early identification would be crucial, as early interventions are much more effective than specific therapies started in later childhood. To this end, cheap and contactless automatic approaches have recently aroused great clinical interest. Among them, the cry and the movements of the newborn, both involving the central nervous system, have been proposed as possible indicators of neurological disorders. This PhD work is a first step towards solving this challenging problem. An integrated system is presented that enables the recording of audio (crying) and video (movement) data of the newborn, their automatic analysis with innovative techniques for the extraction of clinically relevant parameters, and their classification with data mining techniques. New robust algorithms were developed for the selection of the voiced parts of the cry signal, the estimation of acoustic parameters based on the wavelet transform, and the analysis of the infant’s general movements (GMs) through a new body model for segmentation and 2D reconstruction. The thesis also presents a thorough review of the literature and of the state of the art on these topics, which shows that no studies exist on normative ranges for infant cry in the first 6 months of life, nor on the correlation between cry and movements. Using the new automatic methods, a population of control infants (“low-risk”, LR) was compared to a group of “high-risk” (HR) infants, i.e. siblings of children already diagnosed with ASD. A subset of LR infants clinically classified as having Typical Development (TD) was also compared with one infant affected by ASD. The results show that the selected acoustic parameters allow good differentiation between the two groups. This result opens new diagnostic and therapeutic perspectives.
Abstract:
Open Data, literally “open data”, is the school of thought (and the related “movement”) that seeks to meet the need for data that are legally “open”, that is, freely reusable by the user for any purpose. The goal of Open Data can be achieved by law, as in the USA, where information generated by the federal public sector is in the public domain, or by choice of the rights holders, through appropriate licenses. To motivate the need for data in an open format, we can use a comparison of this kind: Open Data is to Linked Data as the Internet is to the Web. Open Data is therefore the infrastructure (or “platform”) that Linked Data needs in order to create the network of inferences among the various data scattered across the Web. Linked Data, in other words, is a technology that is by now fairly mature and has great potential, but it needs large masses of interconnected, that is “linked”, data to become concretely useful. This has in part already been achieved and is being improved, thanks to projects such as DBpedia or FreeBase. In parallel with the contributions of online communities, another important piece (a very valuable sort of “bulk upload”) could come from the availability of large masses of public data, ideally already linked by the institutions themselves or at least made available in a structured way, which would help reach a “mass” of Linked Data. Starting from the substrate represented by the actual availability of the data and their full (legal) reusability, Linked Data can offer a powerful representation of them in terms of relations (links): in this sense, Linked Data and Open Data converge and reach their full realization in the Linked Open Data approach.
The aim of this thesis is to explore and present the fundamentals of how Linked Open Data work and the areas in which they are used.
Abstract:
The aim of this thesis, entitled “Analisi di tecniche per l’estrazione di informazioni da documenti testuali e non strutturati” (Analysis of techniques for extracting information from textual and unstructured documents), is to present computing techniques and methodologies for deriving information and knowledge from data in textual form. The topics covered include the analysis of software for information extraction, the Semantic Web, and the importance of data, in particular Big Data, Open Data and Linked Data. Data mining and text mining are also discussed.
Abstract:
The ability to extract entities from text, link them together and remove possible ambiguities among them is one of the goals of the Semantic Web. Also called Web 3.0, it introduces numerous innovations aimed at enriching the Web with structured data understandable by both humans and computers. In retrieving these terms and defining the entities, their uniqueness is of fundamental importance. Our field of work is that of Italian universities, and the entities we want to extract, link and make unique are the names of Italian professors. The starting set of information is, by its nature, ambiguous. Adhering as closely as possible to its semantics, we studied these data and resolved the collisions among the professors' names. Arald, our software architecture for the Semantic Web, extracts entities and links them, but above all it resolves ambiguities and homonymy among the professors of Italian universities. To do so it relies on the semantics of their academic works and on the co-author network that can be derived from the articles they have published, represented through a data cluster.
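As a rough illustration of how a co-author network can help separate homonymous authors, the sketch below groups publication records that carry the same author name by the overlap of their co-author sets. This is not Arald's algorithm, which the abstract does not detail: the Jaccard measure, the greedy grouping, the threshold value and all names in the example are assumptions.

```python
def jaccard(a, b):
    """Jaccard overlap between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def disambiguate(records, threshold=0.2):
    """Greedy grouping of records that share an author name.

    records: list of (author_name, set_of_coauthor_names).
    Returns a list of groups, each a list of record indices that are
    assumed to belong to the same person.
    """
    clusters = []  # each: {"name": str, "coauthors": set, "records": [int]}
    for index, (name, coauthors) in enumerate(records):
        best, best_score = None, threshold
        for cluster in clusters:
            if cluster["name"] != name:
                continue
            score = jaccard(coauthors, cluster["coauthors"])
            if score >= best_score:
                best, best_score = cluster, score
        if best is None:
            # No sufficiently similar group: treat as a new person.
            clusters.append({"name": name,
                             "coauthors": set(coauthors),
                             "records": [index]})
        else:
            best["coauthors"] |= coauthors
            best["records"].append(index)
    return [cluster["records"] for cluster in clusters]


if __name__ == "__main__":
    # Hypothetical records: four papers signed "M. Rossi".
    papers = [
        ("M. Rossi", {"Bianchi", "Verdi"}),
        ("M. Rossi", {"Verdi", "Neri"}),
        ("M. Rossi", {"Smith", "Jones"}),
        ("M. Rossi", {"Jones"}),
    ]
    print(disambiguate(papers))
```

A real system would combine this structural signal with the semantic content of the papers, as the abstract indicates. The sketch covers only the co-author part.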
Abstract:
This analysis tries to understand what characterizes the wave of technological progress that is changing the labor market. The main negative aspect of this progress is called "technological unemployment". Although experts disagree on the causes of persistently high unemployment, Brynjolfsson and McAfee point the finger at automation, which has supplanted repetitive jobs in companies. However, it is also true that progress has always brought productivity gains and, above all, new kinds of occupations that have compensated for job losses in the medium to long term. Keynes points out that unemployment due to the discovery of labor-saving tools proceeds at a faster pace than that at which we manage to find new uses for that labor. This creates anxiety about the future, more or less justified. The experts themselves are split between those who trust in the possible positive outcomes of progress and those who fear it may lead to catastrophic scenarios. Do machines steal our work or free us from it? This research aims to analyze the actual prospects for the coming decades. Chapter 2, the core of the thesis, mainly considers the academic work of Frey and Osborne of the Oxford Martin School, entitled "The future of employment: how susceptible are jobs to computerisation?" (2013). They were among the first to study and quantify what the new technologies will entail in terms of employment. Their aim was to identify the occupations at risk over the next twenty years in the United States labor market, and the relationship between the probability of those occupations being computerized and their average wages and education levels, all assessed with the aid of a new methodology that will be examined in detail.
Autor later reached conclusions that are in some respects similar to theirs; he is moreover often cited for his other works by Frey and Osborne themselves, who use his categorizations to set up the structure of their calculation of the automatability of jobs, using recent advances in engineering sciences such as ML (machine learning, for example data mining, machine vision, computational statistics or, more generally, AI) and MR (mobile robotics) as assessment tools. In addition to his research, the results of a recent survey by the Pew Research Center are briefly presented, in which leading figures in computer science and economics give their views on the future landscape of the world of work in light of the imminent wave of technological innovation. The thesis concludes with a personal elaboration. In this way the reader becomes aware of the concrete problems that technological progress could cause, but also of its positive aspects.
Abstract:
SMARTDIAB is a platform designed to support the monitoring, management, and treatment of patients with type 1 diabetes mellitus (T1DM) by combining state-of-the-art approaches in the fields of database (DB) technologies, communications, simulation algorithms, and data mining. SMARTDIAB consists mainly of two units: 1) the patient unit (PU); and 2) the patient management unit (PMU), which communicate with each other for data exchange. The PMU can be accessed by the PU through the internet using devices such as PCs/laptops with direct internet access or mobile phones via a Wi-Fi/General Packet Radio Service access network. The PU consists of an insulin pump for subcutaneous insulin infusion to the patient and a continuous glucose measurement system. These devices, running a user-friendly application, gather patient-related information and transmit it to the PMU. The PMU consists of a diabetes data management system (DDMS), a decision support system (DSS) that provides risk assessment for long-term diabetes complications, and an insulin infusion advisory system (IIAS), all of which reside on a Web server. The DDMS can be accessed by both medical personnel and patients, with appropriate security access rights and front-end interfaces. Apart from data storage and retrieval, the DDMS also provides advanced tools for the intelligent processing of the patient's data, supporting the physician in decisions about the patient's treatment. The IIAS is used to close the loop between the insulin pump and the continuous glucose monitoring system by providing the pump with the appropriate insulin infusion rate in order to keep the patient's glucose levels within predefined limits. A pilot version of SMARTDIAB has already been implemented, and evaluation of the platform in a clinical environment is in progress.
Abstract:
The spectacular recent advances in computer science applied to geographic information systems (GIS) have favored the emergence of several technological solutions. These developments have created enormous opportunities for the digital management of territory. Among these solutions, the best known, Google Maps, offers free, dynamic and comprehensive online mapping. To help meet the enormous need for geotagged urban indicators, we worked on the project “Integration of an urban observatory on Google Maps.” The problem of geolocation in the urban observatory is particularly relevant because there are currently no reliable data (descriptive or geographic) on the urban sector; one has to extrapolate from old and obsolete data. This limits the effectiveness of urban management, makes investment programming difficult, and prevents the acquisition of the knowledge needed to make cities engines of growth. A geolocation tool coupled to the data would allow better monitoring of indicators. Our project's objective is to develop an interactive map server (WebMapping) whose base map layer comes from the Google Maps servers and which matches it with field information to produce, at the client's request, maps of a city's urban equipment and infrastructure. To achieve this goal, we will on the one hand take part in a GPS survey of strategic sites in our core sector (health facilities); on the other hand, using the field information, we will build a PostgreSQL database that links this information to the Google Maps map through appropriate KML and PHP scripts. We limit our work to the city of Douala, Cameroon, and to the health facilities sector, with the possibility of extension to other sectors and other cities. Keywords: Geographic Information System (GIS), Thematic Mapping, Web Mapping, data mining, Google API.
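The step of turning field records into KML for display on a map can be sketched as follows. The project itself mentions PHP and PostgreSQL; this Python version, the record layout (`name`, `lat`, `lon`) and the sample facility are assumptions used only to show the shape of a KML document. KML stores coordinates in longitude,latitude order.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def _tag(name):
    """Qualified tag name in the KML namespace."""
    return f"{{{KML_NS}}}{name}"


def facilities_to_kml(facilities):
    """Build a KML document with one Placemark per facility.

    facilities: iterable of dicts with hypothetical keys
    'name', 'lat' and 'lon' (decimal degrees), e.g. rows read
    from a database table of surveyed sites.
    """
    ET.register_namespace("", KML_NS)  # emit KML as the default namespace
    kml = ET.Element(_tag("kml"))
    document = ET.SubElement(kml, _tag("Document"))
    for facility in facilities:
        placemark = ET.SubElement(document, _tag("Placemark"))
        ET.SubElement(placemark, _tag("name")).text = facility["name"]
        point = ET.SubElement(placemark, _tag("Point"))
        # KML order is longitude first, then latitude.
        ET.SubElement(point, _tag("coordinates")).text = (
            f"{facility['lon']},{facility['lat']}"
        )
    return ET.tostring(kml, encoding="unicode")


if __name__ == "__main__":
    # Illustrative record only; not data from the project.
    sample = [{"name": "Example Clinic", "lat": 4.05, "lon": 9.7}]
    print(facilities_to_kml(sample))
```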
Abstract:
Functional neuroimaging techniques enable investigations into the neural basis of human cognition, emotions, and behaviors. In practice, applications of functional magnetic resonance imaging (fMRI) have provided novel insights into the neuropathophysiology of major psychiatric, neurological, and substance abuse disorders, as well as into the neural responses to their treatments. Modern activation studies often compare localized task-induced changes in brain activity between experimental groups. One may also extend voxel-level analyses by simultaneously considering the ensemble of voxels constituting an anatomically defined region of interest (ROI) or by considering means or quantiles of the ROI. In this work we present a Bayesian extension of voxel-level analyses that offers several notable benefits. First, it combines whole-brain voxel-by-voxel modeling and ROI analyses within a unified framework. Second, an unstructured variance/covariance for regional mean parameters allows for the study of inter-regional functional connectivity, provided enough subjects are available to allow for accurate estimation. Finally, an exchangeable correlation structure within regions allows for the consideration of intra-regional functional connectivity. We perform estimation for our model using Markov Chain Monte Carlo (MCMC) techniques implemented via Gibbs sampling which, despite the high-throughput nature of the data, can be executed quickly (less than 30 minutes). We apply our Bayesian hierarchical model to two novel fMRI data sets: one considering inhibitory control in cocaine-dependent men and the second considering verbal memory in subjects at high risk for Alzheimer’s disease. The unifying hierarchical model presented in this manuscript is shown to enhance the interpretation of these data sets.
Abstract:
Accurate seasonal to interannual streamflow forecasts based on climate information are critical for optimal management and operation of water resources systems. Because most water supply systems are multipurpose, operating them to meet increasing demand under the growing stresses of climate variability and climate change, population and economic growth, and environmental concerns can be very challenging. This study investigates how water resources systems management can be improved through the use of seasonal climate forecasts. Hydrological persistence (streamflow and precipitation) and large-scale recurrent oceanic-atmospheric patterns such as the El Niño/Southern Oscillation (ENSO), the Pacific Decadal Oscillation (PDO), the North Atlantic Oscillation (NAO), the Atlantic Multidecadal Oscillation (AMO), the Pacific North American (PNA) pattern, and customized sea surface temperature (SST) indices were investigated for their potential to improve streamflow forecast accuracy and increase forecast lead time in a river basin in central Texas. First, an ordinal polytomous logistic regression approach is proposed as a means of incorporating multiple predictor variables into a probabilistic forecast model. Forecast performance is assessed through a cross-validation procedure, using distributions-oriented metrics, and implications for decision making are discussed. Results indicate that, of the predictors evaluated, only hydrologic persistence and Pacific Ocean sea surface temperature patterns associated with ENSO and PDO provide forecasts that are statistically better than climatology. Second, a class of data mining techniques known as tree-structured models is investigated to address the nonlinear dynamics of climate teleconnections and to screen promising probabilistic streamflow forecast models for river-reservoir systems. Results show that the tree-structured models can effectively capture the nonlinear features hidden in the data.
Skill scores of probabilistic forecasts generated by both classification trees and logistic regression trees indicate that seasonal inflows throughout the system can be predicted with sufficient accuracy to improve water management, especially in the winter and spring seasons in central Texas. Lastly, a simplified two-stage stochastic economic-optimization model was proposed to investigate improvement in water use efficiency and the potential value of using seasonal forecasts, under the assumption of optimal decision making under uncertainty. Model results demonstrate that incorporating the probabilistic inflow forecasts into the optimization model can provide a significant improvement in seasonal water contract benefits over climatology, with lower average deficits (increased reliability) for a given average contract amount, or improved mean contract benefits for a given level of reliability compared to climatology. The results also illustrate the trade-off between the expected contract amount and reliability, i.e., larger contracts can be signed at greater risk.
Abstract:
The primary goal of this project is to demonstrate the practical use of data mining algorithms to cluster a solved steady-state computational fluid dynamics (CFD) flow domain into a simplified lumped-parameter network. A commercial-quality code, “cfdMine”, was created using volume-weighted k-means clustering; it can cluster a 20 million cell CFD domain on a single CPU in several hours or less. Additionally, agglomeration and Mahalanobis k-means were added as optional post-processing steps to further enhance the separation of the clusters. The resultant nodal network is considered a reduced-order model and can be solved transiently at minimal computational cost. The reduced-order network is then instantiated in the commercial thermal solver MuSES to perform transient conjugate heat transfer, with convection predicted by the lumped network (based on steady-state CFD). When the lumped nodal network is inserted into a MuSES model, the potential for developing a “localized heat transfer coefficient” is shown to be an improvement over existing techniques. It was also found that the clustering yields a new flow visualization technique. Finally, fixing clusters near equipment demonstrates a new capability to track temperatures near specific objects (such as equipment in vehicles).
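A volume-weighted k-means step can be sketched in a few lines. This is not the cfdMine code: the feature vectors, the initial centers and the function name are assumptions, and a production implementation for millions of cells would be vectorized. The only difference from plain k-means is that each cell contributes to its cluster center in proportion to its volume.

```python
def weighted_kmeans(points, volumes, centers, max_iterations=50):
    """Lloyd-style k-means in which each point is weighted by a volume.

    points:  list of equal-length tuples (hypothetical per-cell features,
             e.g. position or temperature).
    volumes: positive weight per point (cell volume).
    centers: initial cluster centers, one tuple per cluster.
    Returns (centers, assignment), where assignment[i] is the cluster of point i.
    """
    centers = [tuple(c) for c in centers]
    assignment = []
    for _ in range(max_iterations):
        # Assignment step: nearest center by squared Euclidean distance.
        assignment = [
            min(range(len(centers)),
                key=lambda k: sum((x - c) ** 2 for x, c in zip(point, centers[k])))
            for point in points
        ]
        # Update step: volume-weighted mean of the members of each cluster.
        new_centers = []
        for k, old_center in enumerate(centers):
            members = [(p, v) for p, v, a in zip(points, volumes, assignment) if a == k]
            total = sum(v for _, v in members)
            if total == 0:
                new_centers.append(old_center)  # keep empty clusters in place
                continue
            new_centers.append(tuple(
                sum(p[d] * v for p, v in members) / total
                for d in range(len(old_center))
            ))
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers, assignment


if __name__ == "__main__":
    cells = [(0.0,), (1.0,), (10.0,), (12.0,)]
    cell_volumes = [1.0, 3.0, 1.0, 1.0]
    print(weighted_kmeans(cells, cell_volumes, centers=[(0.0,), (10.0,)]))
```

In the example, the first cluster center is pulled toward the cell with the larger volume, which is the behavior the weighting is meant to produce.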
Abstract:
The municipality of San Juan La Laguna, Guatemala, is home to approximately 5,200 people and is located on the western side of the Lake Atitlán caldera. Steep slopes surround all but the eastern side of San Juan. The Lake Atitlán watershed is susceptible to many natural hazards, but the most predictable are the landslides that can occur annually with each rainy season, especially during high-intensity events. Hurricane Stan hit Guatemala in October 2005; the resulting flooding and landslides devastated the Atitlán region. Locations of landslide and non-landslide points were obtained from field observations and orthophotos taken following Hurricane Stan. This study used data from multiple attributes at every landslide and non-landslide point, and applied different multivariate analyses to optimize a model for landslide prediction during high-intensity precipitation events like Hurricane Stan. The attributes considered in this study are: geology, geomorphology, distance to faults and streams, land use, slope, aspect, curvature, plan curvature, profile curvature and topographic wetness index. The attributes were pre-evaluated for their ability to predict landslides using four different attribute evaluators, all available in the open source data mining software Weka: filtered subset, information gain, gain ratio and chi-squared. Three multivariate algorithms (decision tree J48, logistic regression and BayesNet) were optimized for landslide prediction using different attributes. The following statistical parameters were used to evaluate model accuracy: precision, recall, F-measure and area under the receiver operating characteristic (ROC) curve. The BayesNet algorithm yielded the most accurate model and was used to build a probability map of landslide initiation points. The probability map developed in this study was also compared to the results of a bivariate landslide susceptibility analysis conducted for the watershed encompassing Lake Atitlán and San Juan.
Landslides from Tropical Storm Agatha (2010) were used to independently validate this study’s multivariate model and the bivariate model. The ultimate aim of this study is to share the methodology and results with municipal contacts from the author's time as a U.S. Peace Corps volunteer, to facilitate more effective future landslide hazard planning and mitigation.
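The evaluation measures named in this abstract (precision, recall, F-measure and area under the ROC curve) can be computed as in the following sketch. The study itself used Weka; the Python functions and the toy labels below are assumptions for illustration, with 1 standing for a landslide point and 0 for a non-landslide point.

```python
def precision_recall_f(actual, predicted, positive=1):
    """Precision, recall and F-measure (harmonic mean) for one class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure


def roc_auc(actual, scores, positive=1):
    """Area under the ROC curve as the fraction of (positive, negative)
    pairs in which the positive example gets the higher score
    (ties count as one half)."""
    pos = [s for a, s in zip(actual, scores) if a == positive]
    neg = [s for a, s in zip(actual, scores) if a != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data only; not results from the study.
    truth = [1, 1, 1, 0, 0, 0]
    guess = [1, 1, 0, 1, 0, 0]
    print(precision_recall_f(truth, guess))
    print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```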