845 resultados para mining data streams
Resumo:
Analisi e applicazione dei processi di data mining al flusso informativo di sistemi real-time. Implementazione e analisi di un algoritmo autoadattivo per la ricerca di frequent patterns su macchine automatiche.
Resumo:
La tesi da me svolta durante questi ultimi sei mesi è stata sviluppata presso i laboratori di ricerca di IMA S.p.a.. IMA (Industria Macchine Automatiche) è una azienda italiana che naque nel 1961 a Bologna ed oggi riveste il ruolo di leader mondiale nella produzione di macchine automatiche per il packaging di medicinali. Vorrei subito mettere in luce che in tale contesto applicativo l’utilizzo di algoritmi di data-mining risulta essere ostico a causa dei due ambienti in cui mi trovo. Il primo è quello delle macchine automatiche che operano con sistemi in tempo reale dato che non presentano a pieno le risorse di cui necessitano tali algoritmi. Il secondo è relativo alla produzione di farmaci in quanto vige una normativa internazionale molto restrittiva che impone il tracciamento di tutti gli eventi trascorsi durante l’impacchettamento ma che non permette la visione al mondo esterno di questi dati sensibili. Emerge immediatamente l’interesse nell’utilizzo di tali informazioni che potrebbero far affiorare degli eventi riconducibili a un problema della macchina o a un qualche tipo di errore al fine di migliorare l’efficacia e l’efficienza dei prodotti IMA. Lo sforzo maggiore per riuscire ad ideare una strategia applicativa è stata nella comprensione ed interpretazione dei messaggi relativi agli aspetti software. Essendo i dati molti, chiusi, e le macchine con scarse risorse per poter applicare a dovere gli algoritmi di data mining ho provveduto ad adottare diversi approcci in diversi contesti applicativi: • Sistema di identificazione automatica di errore al fine di aumentare di diminuire i tempi di correzione di essi. • Modifica di un algoritmo di letteratura per la caratterizzazione della macchina. La trattazione è così strutturata: • Capitolo 1: descrive la macchina automatica IMA Adapta della quale ci sono stati forniti i vari file di log. Essendo lei l’oggetto di analisi per questo lavoro verranno anche riportati quali sono i flussi di informazioni che essa genera. • Capitolo 2: verranno riportati degli screenshoot dei dati in mio possesso al fine di, tramite un’analisi esplorativa, interpretarli e produrre una formulazione di idee/proposte applicabili agli algoritmi di Machine Learning noti in letteratura. • Capitolo 3 (identificazione di errore): in questo capitolo vengono riportati i contesti applicativi da me progettati al fine di implementare una infrastruttura che possa soddisfare il requisito, titolo di questo capitolo. • Capitolo 4 (caratterizzazione della macchina): definirò l’algoritmo utilizzato, FP-Growth, e mostrerò le modifiche effettuate al fine di poterlo impiegare all’interno di macchine automatiche rispettando i limiti stringenti di: tempo di cpu, memoria, operazioni di I/O e soprattutto la non possibilità di aver a disposizione l’intero dataset ma solamente delle sottoporzioni. Inoltre verranno generati dei DataSet per il testing di dell’algoritmo FP-Growth modificato.
Resumo:
PURPOSE: Tumor stage and nuclear grade are the most important prognostic parameters of clear cell renal cell carcinoma (ccRCC). The progression risk of ccRCC remains difficult to predict particularly for tumors with organ-confined stage and intermediate differentiation grade. Elucidating molecular pathways deregulated in ccRCC may point to novel prognostic parameters that facilitate planning of therapeutic approaches. EXPERIMENTAL DESIGN: Using tissue microarrays, expression patterns of 15 different proteins were evaluated in over 800 ccRCC patients to analyze pathways reported to be physiologically controlled by the tumor suppressors von Hippel-Lindau protein and phosphatase and tensin homologue (PTEN). Tumor staging and grading were improved by performing variable selection using Cox regression and a recursive bootstrap elimination scheme. RESULTS: Patients with pT2 and pT3 tumors that were p27 and CAIX positive had a better outcome than those with all remaining marker combinations. A prolonged survival among patients with intermediate grade (grade 2) correlated with both nuclear p27 and cytoplasmic PTEN expression, as well as with inactive, nonphosphorylated ribosomal protein S6. By applying graphical log-linear modeling for over 700 ccRCC for which the molecular parameters were available, only a weak conditional dependence existed between the expression of p27, PTEN, CAIX, and p-S6, suggesting that the dysregulation of several independent pathways are crucial for tumor progression. CONCLUSIONS: The use of recursive bootstrap elimination, as well as graphical log-linear modeling for comprehensive tissue microarray (TMA) data analysis allows the unraveling of complex molecular contexts and may improve predictive evaluations for patients with advanced renal cancer.
Resumo:
The copper mining boom in Michigan's Upper Peninsula ended in the mid-1960s, but the historical mining still affects the region to this day. Earlier studies conducted in the Keweenaw have shown that trace metals in the sediments negatively affect benthic macroinvertebrate populations. However, because the concentrations of trace metals that are observed to be toxic often differ significantly between the laboratory and the environment, a better method for determining toxic levels of trace metals in the natural environment is desirable in order to establish surface water quality guidelines that effectively protect aquatic life. There were four research objectives for this research project. First, to determine if trace-level concentrations of copper can result in detectable ecological impacts even in the presence of high dissolved organic carbon (DOC). Second, to determine if there is a "safe" concentration of total dissolved copper below which there is little to no ecological impairment. Third, to establish which streams in the Keweenaw Peninsula have been most impacted by elevated levels of total dissolved copper. Fourth, to use this information to evaluate revisions to the water quality criterion for copper that were recently proposed by the Michigan Department of Environmental Quality (MDEQ). In order to collect water quality and macroinvertebrate data, two sampling surveys of approximately 50 streams were completed in the spring and summer of 2012. Our findings demonstrate that negative ecological impacts can be detected even in the presence of high concentrations of DOC. The majority of surveyed streams showed evidence of total dissolved copper concentrations that were elevated above background levels. Our findings suggest that there are detectable negative impacts below the current water quality standard for copper in many Keweenaw streams. The diversity of benthic macroinvertebrates and the number of species present has been reduced as a result of exposure to copper. Additionally, the multimetric approach used by MDEQ is unable to detect copper impairment in local streams due to the use of several insensitive metrics. The proposed changes to the copper criterion would increase the amount of total dissolved copper allowable despite the fact that approximately 25% of streams sampled have aquatic chemistries that would leave them vulnerable to high levels of copper ions.
Artisanal and small scale mining in Mongolia: Statistical overview based on survey data by suom 2012
Resumo:
Index tracking has become one of the most common strategies in asset management. The index-tracking problem consists of constructing a portfolio that replicates the future performance of an index by including only a subset of the index constituents in the portfolio. Finding the most representative subset is challenging when the number of stocks in the index is large. We introduce a new three-stage approach that at first identifies promising subsets by employing data-mining techniques, then determines the stock weights in the subsets using mixed-binary linear programming, and finally evaluates the subsets based on cross validation. The best subset is returned as the tracking portfolio. Our approach outperforms state-of-the-art methods in terms of out-of-sample performance and running times.
Resumo:
Academic and industrial research in the late 90s have brought about an exponential explosion of DNA sequence data. Automated expert systems are being created to help biologists to extract patterns, trends and links from this ever-deepening ocean of information. Two such systems aimed on retrieving and subsequently utilizing phylogenetically relevant information have been developed in this dissertation, the major objective of which was to automate the often difficult and confusing phylogenetic reconstruction process. ^ Popular phylogenetic reconstruction methods, such as distance-based methods, attempt to find an optimal tree topology (that reflects the relationships among related sequences and their evolutionary history) by searching through the topology space. Various compromises between the fast (but incomplete) and exhaustive (but computationally prohibitive) search heuristics have been suggested. An intelligent compromise algorithm that relies on a flexible “beam” search principle from the Artificial Intelligence domain and uses the pre-computed local topology reliability information to adjust the beam search space continuously is described in the second chapter of this dissertation. ^ However, sometimes even a (virtually) complete distance-based method is inferior to the significantly more elaborate (and computationally expensive) maximum likelihood (ML) method. In fact, depending on the nature of the sequence data in question either method might prove to be superior. Therefore, it is difficult (even for an expert) to tell a priori which phylogenetic reconstruction method—distance-based, ML or maybe maximum parsimony (MP)—should be chosen for any particular data set. ^ A number of factors, often hidden, influence the performance of a method. For example, it is generally understood that for a phylogenetically “difficult” data set more sophisticated methods (e.g., ML) tend to be more effective and thus should be chosen. However, it is the interplay of many factors that one needs to consider in order to avoid choosing an inferior method (potentially a costly mistake, both in terms of computational expenses and in terms of reconstruction accuracy.) ^ Chapter III of this dissertation details a phylogenetic reconstruction expert system that selects a superior proper method automatically. It uses a classifier (a Decision Tree-inducing algorithm) to map a new data set to the proper phylogenetic reconstruction method. ^
Resumo:
Abstract Due to recent scientific and technological advances in information sys¬tems, it is now possible to perform almost every application on a mobile device. The need to make sense of such devices more intelligent opens an opportunity to design data mining algorithm that are able to autonomous execute in local devices to provide the device with knowledge. The problem behind autonomous mining deals with the proper configuration of the algorithm to produce the most appropriate results. Contextual information together with resource information of the device have a strong impact on both the feasibility of a particu¬lar execution and on the production of the proper patterns. On the other hand, performance of the algorithm expressed in terms of efficacy and efficiency highly depends on the features of the dataset to be analyzed together with values of the parameters of a particular implementation of an algorithm. However, few existing approaches deal with autonomous configuration of data mining algorithms and in any case they do not deal with contextual or resources information. Both issues are of particular significance, in particular for social net¬works application. In fact, the widespread use of social networks and consequently the amount of information shared have made the need of modeling context in social application a priority. Also the resource consumption has a crucial role in such platforms as the users are using social networks mainly on their mobile devices. This PhD thesis addresses the aforementioned open issues, focusing on i) Analyzing the behavior of algorithms, ii) mapping contextual and resources information to find the most appropriate configuration iii) applying the model for the case of a social recommender. Four main contributions are presented: - The EE-Model: is able to predict the behavior of a data mining algorithm in terms of resource consumed and accuracy of the mining model it will obtain. - The SC-Mapper: maps a situation defined by the context and resource state to a data mining configuration. - SOMAR: is a social activity (event and informal ongoings) recommender for mobile devices. - D-SOMAR: is an evolution of SOMAR which incorporates the configurator in order to provide updated recommendations. Finally, the experimental validation of the proposed contributions using synthetic and real datasets allows us to achieve the objectives and answer the research questions proposed for this dissertation.