845 results for mining data streams
Abstract:
The Iowa Department of Natural Resources uses benthic macroinvertebrate and fish sampling data to assess stream biological condition and the support status of designated aquatic life uses (Wilton 2004; IDNR 2013). Stream physical habitat data assist with the interpretation of biological sampling results by quantifying important physical characteristics that influence a stream's ability to support a healthy aquatic community (Heitke et al. 2006; Rowe et al. 2009; Sindt et al. 2012). This document describes aquatic community sampling and physical habitat assessment procedures currently followed in the Iowa stream biological assessment program. Standardized biological sampling and physical habitat assessment procedures were first established following a pilot sampling study in 1994 (IDNR 1994a, 1994b). The procedure documents were last updated in 2001 (IDNR 2001a, 2001b). The biological sampling and physical habitat assessment procedures described below are evaluated on a continual basis. Revision of this working document will occur periodically to reflect additional changes.
Abstract:
It is common practice in genome-wide association studies (GWAS) to focus on the relationship between disease risk and genetic variants one marker at a time. When relevant genes are identified it is often possible to implicate biological intermediates and pathways likely to be involved in disease aetiology. However, single genetic variants typically explain small amounts of disease risk. Our idea is to construct allelic scores that explain greater proportions of the variance in biological intermediates, and subsequently use these scores to mine GWAS data. To investigate the approach's properties, we indexed three biological intermediates for which the results of large GWAS meta-analyses were available: body mass index, C-reactive protein and low-density lipoprotein levels. We generated allelic scores in the Avon Longitudinal Study of Parents and Children, and in publicly available data from the first Wellcome Trust Case Control Consortium. We compared the explanatory ability of allelic scores in terms of their capacity to proxy for the intermediate of interest, and the extent to which they associated with disease. We found that allelic scores derived from known variants and allelic scores derived from hundreds of thousands of genetic markers explained significant portions of the variance in the biological intermediates of interest, and many of these scores showed the expected correlations with disease. Genome-wide allelic scores, however, tended to lack specificity, suggesting that they should be used with caution and perhaps only to proxy biological intermediates for which there are no known individual variants. Power calculations confirm the feasibility of extending our strategy to the analysis of tens of thousands of molecular phenotypes in large genome-wide meta-analyses. We conclude that our method represents a simple way in which potentially tens of thousands of molecular phenotypes could be screened for causal relationships with disease without the expense of measuring these variables in individual disease collections.
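To make the core calculation concrete, the following is a minimal sketch, not the authors' code, of how a weighted allelic score can be built from genotype dosages and externally published per-allele effect sizes, and how much variance in a biological intermediate (such as BMI) it explains. All variable names and the simulated data are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): building a weighted allelic score
# from genotype dosages and published per-allele effect sizes, then checking
# how much variance in a biological intermediate it explains.
import numpy as np

def allelic_score(dosages, effect_sizes):
    """dosages: (n_individuals, n_snps) counts of risk alleles (0, 1, 2);
    effect_sizes: per-allele effects from an external GWAS meta-analysis."""
    return dosages @ effect_sizes

def variance_explained(score, intermediate):
    """R^2 of the intermediate (e.g. BMI or CRP) on the allelic score."""
    r = np.corrcoef(score, intermediate)[0, 1]
    return r ** 2

# Illustrative use with simulated data (all values hypothetical).
rng = np.random.default_rng(0)
dosages = rng.integers(0, 3, size=(5000, 30)).astype(float)
effects = rng.normal(0.0, 0.05, size=30)
bmi = dosages @ effects + rng.normal(0.0, 1.0, size=5000)

score = allelic_score(dosages, effects)
print(f"variance explained: {variance_explained(score, bmi):.3f}")
```

In practice the score would be built in one cohort with effect sizes taken from an independent meta-analysis, so that the variance-explained estimate is not inflated by overfitting.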
Abstract:
This thesis is devoted to the analysis, modelling and visualisation of spatially referenced environmental data using machine learning algorithms. Machine learning can be considered, in a broad sense, as a subfield of artificial intelligence concerned in particular with the development of techniques and algorithms that allow a machine to learn from data. In this thesis, machine learning algorithms are adapted to be applied to environmental data and to spatial prediction. Why machine learning? Because most machine learning algorithms are universal, adaptive, non-linear, robust and efficient modelling tools. They can solve classification, regression and probability density modelling problems in high-dimensional spaces composed of spatially referenced informative variables ("geo-features") in addition to the geographical coordinates. Moreover, they are well suited to implementation as decision-support tools for environmental questions ranging from pattern recognition to modelling and prediction, including automatic mapping. Their efficiency is comparable to that of geostatistical models in the space of geographical coordinates, but they are indispensable for high-dimensional data that include geo-features. The most important and most popular machine learning algorithms are presented theoretically and implemented as software tools for the environmental sciences. The main algorithms described are the multilayer perceptron (MLP), the best-known algorithm in artificial intelligence; general regression neural networks (GRNN); probabilistic neural networks (PNN); self-organising maps (SOM); Gaussian mixture models (GMM); radial basis function networks (RBF); and mixture density networks (MDN). This range of algorithms covers varied tasks such as classification, regression and probability density estimation. Exploratory data analysis (EDA) is the first step of any data analysis. In this thesis the concepts of exploratory spatial data analysis (ESDA) are treated both according to the traditional geostatistical approach, with experimental variography, and according to the principles of machine learning. Experimental variography, which studies the relations between pairs of points, is a basic tool for the geostatistical analysis of anisotropic spatial correlations and allows the detection of spatial patterns describable by a two-point statistic. The machine learning approach to ESDA is presented through the application of the k-nearest-neighbours method, which is very simple and has excellent interpretation and visualisation properties. An important part of the thesis deals with topical subjects such as the automatic mapping of spatial data. The general regression neural network is proposed to solve this task efficiently. The performance of the GRNN is demonstrated on the Spatial Interpolation Comparison (SIC) 2004 data, for which the GRNN significantly outperformed all other methods, particularly in emergency situations. The thesis is composed of four chapters: theory, applications, software tools and guided examples. An important part of the work consists of a collection of software tools, Machine Learning Office. This software collection has been developed over the last 15 years and has been used for teaching numerous courses, including international workshops in China, France, Italy, Ireland and Switzerland, as well as in fundamental and applied research projects. The case studies considered cover a wide spectrum of real low- and high-dimensional geo-environmental problems, such as air, soil and water pollution by radioactive products and heavy metals, classification of soil types and hydrogeological units, uncertainty mapping for decision support, and the assessment of natural hazards (landslides, avalanches). Complementary tools for exploratory data analysis and visualisation have also been developed, with care taken to create a user-friendly and easy-to-use interface.

Machine Learning for geospatial data: algorithms, software tools and case studies. Abstract: The thesis is devoted to the analysis, modeling and visualisation of spatial environmental data using machine learning algorithms. In a broad sense, machine learning can be considered a subfield of artificial intelligence. It is mainly concerned with the development of techniques and algorithms that allow computers to learn from data. In this thesis machine learning algorithms are adapted to learn from spatial environmental data and to make spatial predictions. Why machine learning? In a few words, most machine learning algorithms are universal, adaptive, nonlinear, robust and efficient modeling tools. They can find solutions for classification, regression, and probability density modeling problems in high-dimensional geo-feature spaces, composed of geographical space and additional relevant spatially referenced features. They are well suited to being implemented as predictive engines in decision support systems, for the purposes of environmental data mining including pattern recognition, modeling and predictions as well as automatic data mapping. Their efficiency is competitive with geostatistical models in low-dimensional geographical spaces, but they are indispensable in high-dimensional geo-feature spaces. The most important and popular machine learning algorithms and models of interest for geo- and environmental sciences are presented in detail, from a theoretical description of the concepts to the software implementation. The main algorithms and models considered are the following: the multi-layer perceptron (a workhorse of machine learning), general regression neural networks, probabilistic neural networks, self-organising (Kohonen) maps, Gaussian mixture models, radial basis function networks, and mixture density networks. This set of models covers machine learning tasks such as classification, regression, and density estimation. Exploratory data analysis (EDA) is an initial and very important part of data analysis.
In this thesis the concepts of exploratory spatial data analysis (ESDA) are considered using both the traditional geostatistical approach, such as experimental variography, and machine learning. Experimental variography is a basic tool for the geostatistical analysis of anisotropic spatial correlations which helps to understand the presence of spatial patterns, at least those described by two-point statistics. A machine learning approach for ESDA is presented by applying the k-nearest neighbors (k-NN) method, which is simple and has very good interpretation and visualization properties. An important part of the thesis deals with a currently hot topic, namely the automatic mapping of geospatial data. The general regression neural network (GRNN) is proposed as an efficient model to solve this task. The performance of the GRNN model is demonstrated on the Spatial Interpolation Comparison (SIC) 2004 data, where the GRNN model significantly outperformed all other approaches, especially under emergency conditions. The thesis consists of four chapters and has the following structure: theory, applications, software tools, and how-to-do-it examples. An important part of the work is a collection of software tools, Machine Learning Office. The Machine Learning Office tools were developed during the last 15 years and have been used both for many teaching courses, including international workshops in China, France, Italy, Ireland and Switzerland, and for fundamental and applied research projects. The case studies considered cover a wide spectrum of real-life low- and high-dimensional geo- and environmental problems, such as air, soil and water pollution by radionuclides and heavy metals, classification of soil types and hydro-geological units, decision-oriented mapping with uncertainties, and natural hazard (landslide, avalanche) assessment and susceptibility mapping. Complementary tools useful for exploratory data analysis and visualisation were developed as well. The software is user-friendly and easy to use.
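As a rough illustration of the automatic-mapping idea, the sketch below implements a GRNN in its usual form as Nadaraya-Watson kernel regression over spatial coordinates. It is not the Machine Learning Office code; the function names, bandwidth value and synthetic data are assumptions made for the example.

```python
# Minimal sketch (not the thesis software): a general regression neural
# network (GRNN), i.e. Nadaraya-Watson kernel regression, used here for
# 2-D spatial interpolation. Names and the bandwidth value are illustrative.
import numpy as np

def grnn_predict(train_xy, train_z, query_xy, sigma=1.0):
    """Predict values at query_xy from training coordinates/values.
    sigma is the single smoothing parameter of the GRNN."""
    # Pairwise squared distances between query and training points.
    d2 = ((query_xy[:, None, :] - train_xy[None, :, :]) ** 2).sum(axis=-1)
    w = np.exp(-d2 / (2.0 * sigma ** 2))          # kernel weights
    return (w @ train_z) / w.sum(axis=1)          # weighted average

# Illustrative use on synthetic data.
rng = np.random.default_rng(1)
xy = rng.uniform(0, 10, size=(200, 2))            # sampling locations
z = np.sin(xy[:, 0]) + 0.1 * rng.normal(size=200) # measured field
grid = rng.uniform(0, 10, size=(5, 2))            # prediction locations
print(grnn_predict(xy, z, grid, sigma=0.5))
```

In practice the single smoothing parameter sigma would be chosen by cross-validation, which is what makes this family of models attractive for automatic mapping.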
Abstract:
Summary of biological water quality data collected during the Floods of 2008.
Abstract:
US Geological Survey (USGS)-based elevation data are the most commonly used data source for highway hydraulic analysis; however, due to the vertical accuracy of USGS-based elevation data, USGS data may be too "coarse" to adequately describe surface profiles of watershed areas or drainage patterns. Additionally, hydraulic design requires delineation of much smaller drainage areas (watersheds) than other hydrologic applications, such as environmental, ecological, and water resource management. This research study investigated whether higher-resolution LIDAR-based surface models would provide better delineation of watersheds and drainage patterns as compared to surface models created from standard USGS-based elevation data. Differences in runoff values were the metric used to compare the data sets. The two data sets were compared for a pilot study area along the Iowa 1 corridor between Iowa City and Mount Vernon. Given the limited breadth of the analysis corridor, areas of particular emphasis were the location of drainage area boundaries and flow patterns parallel to and intersecting the road cross section. Traditional highway hydrology does not appear to be significantly impacted, or benefited, by the increased terrain detail that LIDAR provided for the study area. In fact, hydrologic outputs, such as streams and watersheds, may be too sensitive to the increased horizontal resolution and/or errors in the data set. However, a true comparison of LIDAR and USGS-based data sets of equal size and encompassing entire drainage areas could not be performed in this study. Differences may also result in areas with much steeper slopes or significant changes in terrain. LIDAR may provide valuable detail in areas of modified terrain, such as roads. Better representations of channel and terrain detail in the vicinity of the roadway may be useful in modeling problem drainage areas and evaluating structural surety during and after significant storm events. Furthermore, LIDAR may be used to verify the intended/expected drainage patterns at newly constructed highways. LIDAR will likely provide the greatest benefit for highway projects in flood plains and areas with relatively flat terrain where slight changes in terrain may have a significant impact on drainage patterns.
Abstract:
Physical habitat characteristics such as stream width, depth, instream cover, and substrate composition are important environmental factors that shape Iowa’s stream fish species assemblages. The Iowa Department of Natural Resources (IDNR) stream biological assessment program collects physical habitat data to help interpret fish assemblage sampling results in order to assess stream health condition and the attainment status of designated aquatic life uses. The quantitative habitat indicators and interpretative guidelines developed in this study are designed for specific applications within the stream bioassessment program. These tools might also be useful to natural resource managers for purposes such as stream habitat improvement prioritization, goal-setting, and performance assessment.
Abstract:
In anticipation of regulation involving a numeric turbidity limit at highway construction sites, research was done into the most appropriate, affordable methods for surface water monitoring. Sediment concentration in streams may be measured in a number of ways. As part of a project funded by the Iowa Department of Transportation, several testing methods were explored to determine the most affordable, appropriate methods for data collection both in the field and in the lab. The primary purpose of the research was to determine whether the acrylic transparency tube could be used interchangeably with the turbidimeter for water clarity analysis.
Abstract:
Biological water quality changes in two Mediterranean river basins, assessed since 1979 through a network of 42 sampling sites, are presented. In order to characterize the biological quality, the FBILL index, designed to characterize the quality of these rivers using aquatic macroinvertebrates, is used. When comparing the data from recent years to older ones, only two headwater sites out of the 42 had improved their water quality to good or very good conditions. In the middle or lower river basin sites, and even in headwater localities where river flow is reduced, the important investment in building sewage treatment systems and plants (more than 70 in 15 years) allowed only a small recovery from poor or very poor conditions to moderate water quality. Nevertheless, a significant number (25%) of the localities still remain in poor condition. The evolution of the quality at several points of both basins shows that the main obstacles to the recovery of biological quality are the water diverted to small hydropower plants, the presence of saline pollution in the Llobregat River, and insufficient wastewater treatment. In the smaller rivers, and especially the Besòs, the lack of flow to dilute the effluent from the treatment plants is the main problem for water quality recovery.
Abstract:
Biomedical research is currently facing a new type of challenge: an excess of information, both in terms of raw data from experiments and in the number of scientific publications describing their results. Mirroring the focus on data mining techniques to address the issues of structured data, there has recently been great interest in the development and application of text mining techniques to make more effective use of the knowledge contained in biomedical scientific publications, accessible only in the form of natural human language. This thesis describes research done in the broader scope of projects aiming to develop methods, tools and techniques for text mining tasks in general and for the biomedical domain in particular. The work described here more specifically involves the goal of extracting information from statements concerning relations of biomedical entities, such as protein-protein interactions. The approach taken uses full parsing (syntactic analysis of the entire structure of sentences) and machine learning, aiming to develop reliable methods that can further be generalized to apply to other domains as well. The five papers at the core of this thesis describe research on a number of distinct but related topics in text mining. In the first of these studies, we assessed the applicability of two popular general English parsers to biomedical text mining and, finding their performance limited, identified several specific challenges to accurate parsing of domain text. In a follow-up study focusing on parsing issues related to specialized domain terminology, we evaluated three lexical adaptation methods. We found that the accurate resolution of unknown words can considerably improve parsing performance and introduced a domain-adapted parser that reduced the error rate of the original by 10% while also roughly halving parsing time. To establish the relative merits of parsers that differ in the applied formalisms and the representation given to their syntactic analyses, we have also developed evaluation methodology, considering different approaches to establishing comparable dependency-based evaluation results. We introduced a methodology for creating highly accurate conversions between different parse representations, demonstrating the feasibility of unifying diverse syntactic schemes under a shared, application-oriented representation. In addition to allowing formalism-neutral evaluation, we argue that such unification can also increase the value of parsers for domain text mining. As a further step in this direction, we analysed the characteristics of publicly available biomedical corpora annotated for protein-protein interactions and created tools for converting them into a shared form, thus contributing also to the unification of text mining resources. The introduced unified corpora allowed us to perform a task-oriented comparative evaluation of biomedical text mining corpora. This evaluation established clear limits on the comparability of results for text mining methods evaluated on different resources, prompting further efforts toward standardization. To support this and other research, we have also designed and annotated BioInfer, the first domain corpus of its size combining annotation of syntax and biomedical entities with a detailed annotation of their relationships. The corpus represents a major design and development effort of the research group, with manual annotation that identifies over 6,000 entities, 2,500 relationships and 28,000 syntactic dependencies in 1,100 sentences. In addition to combining these key annotations for a single set of sentences, BioInfer was also the first domain resource to introduce a representation of entity relations that is supported by ontologies and able to capture complex, structured relationships. Part I of this thesis presents a summary of this research in the broader context of a text mining system, and Part II contains reprints of the five included publications.
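As a rough illustration of the kind of dependency-based comparison that a shared parse representation makes possible, the sketch below computes an unlabeled attachment score between two parsers' outputs. It is not the evaluation code from the papers; the head-index representation and the example sentence are assumptions made for illustration.

```python
# Minimal sketch (not the papers' evaluation code): comparing two parsers'
# dependency analyses once both have been converted to a shared head-index
# representation. The attachment-score metric and data are illustrative.
from typing import List

def attachment_score(gold_heads: List[int], pred_heads: List[int]) -> float:
    """Fraction of tokens whose predicted head matches the gold head
    (unlabeled attachment score)."""
    assert len(gold_heads) == len(pred_heads)
    correct = sum(g == p for g, p in zip(gold_heads, pred_heads))
    return correct / len(gold_heads)

# Hypothetical sentence: "ProteinA binds ProteinB", with 0 marking the root.
gold = [2, 0, 2]        # gold heads for the three tokens
parser_a = [2, 0, 2]    # output of parser A after conversion
parser_b = [2, 0, 1]    # output of parser B after conversion
print(attachment_score(gold, parser_a))  # 1.0
print(attachment_score(gold, parser_b))  # roughly 0.667
```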
Abstract:
Temporary streams are those water courses that undergo the recurrent cessation of flow or the complete drying of their channel. The structure and composition of biological communities in temporary stream reaches are strongly dependent on the temporal changes of the aquatic habitats determined by the hydrological conditions. Therefore, the structural and functional characteristics of the aquatic fauna cannot be used to assess the ecological quality of a temporary stream reach without taking into account the controls imposed by the hydrological regime. This paper develops methods for analysing temporary streams' aquatic regimes, based on the definition of six aquatic states that summarize the transient sets of mesohabitats occurring on a given reach at a particular moment, depending on the hydrological conditions: Hyperrheic, Eurheic, Oligorheic, Arheic, Hyporheic and Edaphic. When the hydrological conditions lead to a change in the aquatic state, the structure and composition of the aquatic community change according to the new set of available habitats. We used water discharge records from gauging stations or simulations with rainfall-runoff models to infer the temporal patterns of occurrence of these states in the Aquatic States Frequency Graph we developed. The visual analysis of this graph is complemented by two metrics which describe the permanence of flow and the seasonal predictability of zero-flow periods. Finally, a classification of temporary streams into four aquatic regimes, in terms of their influence over the development of aquatic life, is updated from the existing classifications, with stream aquatic regimes defined as Permanent, Temporary-pools, Temporary-dry and Episodic. While aquatic regimes describe the long-term overall variability of the hydrological conditions of the river section and have been used for many years by hydrologists and ecologists, aquatic states describe the availability of mesohabitats in given periods that determine the presence of different biotic assemblages. This novel concept links hydrological and ecological conditions in a unique way. All these methods were implemented with data from eight temporary streams around the Mediterranean within the MIRAGE project. Their application was a precondition to assessing the ecological quality of these streams.
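As an illustration of the kind of summary such metrics provide, the sketch below derives a flow-permanence value and a crude seasonal-predictability value for zero-flow periods from a daily discharge series. It does not reproduce the paper's exact metric definitions; the names, the simple monthly measure of predictability and the synthetic record are assumptions.

```python
# Minimal sketch (not the paper's exact metric definitions): deriving a flow
# permanence value and a crude seasonal-predictability value for zero-flow
# periods from a daily discharge record, the kind of summary used to
# characterise temporary-stream regimes. Thresholds and names are illustrative.
import numpy as np
import pandas as pd

def flow_metrics(discharge: pd.Series):
    """discharge: daily flow indexed by a DatetimeIndex (m3/s)."""
    flowing = discharge > 0
    permanence = flowing.mean()                    # fraction of days with flow

    # Seasonal predictability of zero flow: how strongly dry days cluster in
    # one calendar month (0 = evenly spread, 1 = all in one month). This is
    # only a simple illustration of the idea.
    dry_by_month = (~flowing).groupby(discharge.index.month).sum()
    total_dry = dry_by_month.sum()
    predictability = (dry_by_month.max() / total_dry) if total_dry else 0.0
    return permanence, predictability

# Illustrative use with a synthetic two-year record that dries every summer.
idx = pd.date_range("2000-01-01", "2001-12-31", freq="D")
q = pd.Series(np.where(idx.month.isin([7, 8, 9]), 0.0, 1.5), index=idx)
print(flow_metrics(q))
```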
Abstract:
In this thesis we study the field of opinion mining by giving a comprehensive review of the available research on this topic. Using this knowledge, we also present a case study of a multilevel opinion mining system for a student organization's sales management system. We describe the field of opinion mining by discussing its historical roots, its motivations and applications, as well as the different scientific approaches that have been used to solve this challenging problem. To deal with this huge subfield of natural language processing, we first give an abstraction of the problem of opinion mining and describe the theoretical frameworks that are available for dealing with appraisal language. Then we discuss the relation between opinion mining and computational linguistics, which is a crucial pre-processing step for the accuracy of the subsequent steps of opinion mining. The second part of the thesis deals with the semantics of opinions, where we describe the different ways used to collect lists of opinion words as well as the methods and techniques available for extracting knowledge from opinions present in unstructured textual data. In the part about collecting lists of opinion words we describe manual, semi-manual and automatic ways to do so and give a review of the available lists that are used as gold standards in opinion mining research. For the methods and techniques of opinion mining we divide the task into three levels: the document, sentence and feature level. The techniques presented at the document and sentence level are divided into supervised and unsupervised approaches used to determine the subjectivity and polarity of texts and sentences at these levels of analysis. At the feature level we describe the techniques available for finding the opinion targets, the polarity of the opinions about these targets, and the opinion holders. Also at the feature level, we discuss the various ways to summarize and visualize the results of this level of analysis. In the third part of the thesis we present a case study of a sales management system that uses free-form text and that can benefit from an opinion mining system. Using the knowledge gathered in the review of this field, we propose a theoretical multi-level opinion mining system (MLOM) that can perform most of the tasks needed from an opinion mining system. Based on previous research, we suggest that such a system could relieve the sales force that uses this sales management system of many laborious market research tasks, improve its insight into its partners, and thereby increase the quality of its sales services and its overall results.
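As a small illustration of the unsupervised, lexicon-based techniques reviewed at the sentence level, the sketch below classifies sentence polarity with a toy opinion lexicon and a simple negation rule. It is not the proposed MLOM system; the lexicon entries and the negation heuristic are assumptions made for the example.

```python
# Minimal sketch (not the MLOM system): an unsupervised, lexicon-based
# sentence-level polarity classifier of the kind reviewed in the thesis.
# The tiny lexicon and the negation rule are toy assumptions.
POSITIVE = {"good", "great", "excellent", "helpful", "reliable"}
NEGATIVE = {"bad", "poor", "slow", "unreliable", "disappointing"}
NEGATORS = {"not", "never", "no"}

def sentence_polarity(sentence: str) -> str:
    tokens = sentence.lower().split()
    score = 0
    for i, tok in enumerate(tokens):
        weight = 1 if tok in POSITIVE else -1 if tok in NEGATIVE else 0
        # Flip the contribution if the previous token is a negator.
        if weight and i > 0 and tokens[i - 1] in NEGATORS:
            weight = -weight
        score += weight
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentence_polarity("The partner was not reliable and the service was slow"))
# -> negative
```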
Abstract:
The 1980-1990 Amazonian gold rush left an enormous environmental liability that has increasingly been converted to fish aquaculture. This work aimed to identify mercury levels in the environment associated with fish farms located in the north of Mato Grosso State, southern Amazon. Sediment and soil samples were analyzed for total organic carbon and total mercury. The results indicate that the chemical characteristics of the sediment largely depend on the management procedures of the fish pond (liming, fish food used and fish population). The soils presented relatively low mercury concentrations when compared with other data from the literature.
Abstract:
Raw measurement data does not always immediately convey useful information, but applying mathematical and statistical analysis tools to the data can improve the situation. Data analysis can offer benefits such as acquiring meaningful insight from the dataset, basing critical decisions on the findings, and ruling out human bias through proper statistical treatment. In this thesis we analyze data from an industrial mineral processing plant with the aim of studying the possibility of forecasting the quality of the final product, given by one variable, with a model based on the other variables. For the study, mathematical tools such as Qlucore Omics Explorer (QOE) and Sparse Bayesian regression (SB) are used. Later on, linear regression is used to build a model based on a subset of variables that have the most significant weights in the SB model. The results obtained from QOE show that the variable representing the desired final product does not correlate with the other variables. For SB and linear regression, the results show that both SB and linear regression models built on 1-day averaged data seriously underestimate the variance of the true data, whereas the two models built on 1-month averaged data are reliable and able to explain a larger proportion of the variability in the available data, making them suitable for prediction purposes. However, it is concluded that no single model can fit the whole available dataset well, and it is therefore proposed for future work to build piecewise non-linear regression models if the same dataset is used, or for the plant to provide another dataset, collected in a more systematic fashion than the present data, for further analysis.
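As an illustration of the modelling comparison described, the sketch below fits an ordinary linear regression to process data averaged over two different windows and reports how much of the target variance each model reproduces. It is not the thesis code, the Sparse Bayesian variable-selection step is not reproduced, and the column names, window lengths and synthetic data are assumptions.

```python
# Minimal sketch (not the thesis code): fitting a linear model to process
# data averaged over different windows and comparing how much of the target
# variance each model reproduces. Column names and windows are hypothetical.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def fit_on_window(df: pd.DataFrame, target: str, window: str):
    """Average raw measurements over `window`, fit ordinary least squares,
    and report R^2 and the ratio of predicted to observed variance."""
    avg = df.resample(window).mean().dropna()
    X = avg.drop(columns=[target]).values
    y = avg[target].values
    model = LinearRegression().fit(X, y)
    pred = model.predict(X)
    return model.score(X, y), pred.var() / y.var()

# Illustrative use with synthetic hourly plant data.
rng = np.random.default_rng(2)
idx = pd.date_range("2020-01-01", periods=24 * 365, freq="h")
df = pd.DataFrame({"feed_rate": rng.normal(size=len(idx)),
                   "reagent_dose": rng.normal(size=len(idx))}, index=idx)
df["quality"] = (0.4 * df["feed_rate"] - 0.2 * df["reagent_dose"]
                 + rng.normal(scale=2.0, size=len(idx)))
for window in ("1D", "30D"):
    r2, var_ratio = fit_on_window(df, "quality", window)
    print(window, round(r2, 3), round(var_ratio, 3))
```

A variance ratio well below one is the symptom described in the abstract: the fitted model reproduces much less variability than the measured quality actually shows.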
Abstract:
Environmental accountability has become a major source of competitive advantage for industrial companies, because customers consider it a relevant buying criterion. However, in order to leverage their environmental responsibility, industrial suppliers have to be able to demonstrate the environmental value of their products and services, which is also the aim of Kemira, the global water chemistry company considered in this study. The aim of this thesis is to develop a tool which Kemira can use to assess the environmental value of their solutions for customer companies in the mining industry. This study answers the questions of what kinds of methods exist to assess environmental impacts, and what kind of tool could be used to assess the environmental value of Kemira's water treatment solutions. The environmental impacts of mining activities vary greatly between different mines. Generally, the major impacts include water-related issues and wastes. Energy consumption is also a significant environmental aspect. Water-related issues include water consumption and impacts on water quality. There are several methods to assess environmental impacts, for example life cycle assessment, eco-efficiency tools, footprint calculations and process simulation. In addition, the corresponding financial value may be estimated using monetary assessment methods. Some of the industrial companies considered in the analysis of industry best practices use environmental and sustainability assessments. Based on the theoretical research and the interviews conducted, an Excel-based tool utilizing reference data on previous customer cases and customer-specific test results was considered the most suitable way to assess the environmental value of Kemira's solutions. The tool can be used to demonstrate the functionality of Kemira's solutions in customers' processes, their impacts on other process parameters, and their environmental and financial aspects. In the future, the tool may be adapted to Kemira's other segments as well, not only the mining industry.
Abstract:
The Pasvik monitoring programme was created in 2006 as a result of trilateral cooperation, with the intention of following changes in the environment under variable pollution levels. Water quality is one of the basic elements of the programme when assessing the effects of the emissions from the Pechenganikel mining and metallurgical industry (Kola GMK). The Metallurgic Production Renovation Programme was implemented by OJSC Kola GMK to reduce emissions of sulphur and dust with high heavy metal concentrations. However, the expected reduction in emissions from the smelter in the settlement of Nikel was not realized, and Kola GMK has itself found that the modernization programme's measures do not provide the planned reductions in sulphur dioxide emissions. In this report, temporal trends in water chemistry during 2000–2009 are examined on the basis of data gathered from Lake Inari, the River Pasvik and directly connected lakes, as well as from 26 small lakes in three areas: Pechenganikel (Russia), Jarfjord (Norway) and Vätsäri (Finland). The lower parts of the Pasvik watercourse are impacted by both atmospheric pollution and direct wastewater discharge from the Pechenganikel smelter and the settlement of Nikel. The upper section of the watercourse, and the small lakes and streams which are not directly linked to the Pasvik watercourse, receive only atmospheric pollution. The data obtained confirm the ongoing pollution of the river and water system. Copper (Cu), nickel (Ni) and sulphates are the main pollution components. The highest levels were observed close to the smelters. The most polluted water source of the basin is the River Kolosjoki, as it directly receives the sewage discharge from the smelters and the stream connecting Lakes Salmijarvi and Kuetsjarvi. The concentrations of metals and sulphates in the River Pasvik are higher downstream of Lake Kuetsjarvi. There has been no fall in the concentrations of pollutants in the Pasvik watercourse over the last 10 years. Ongoing recovery from acidification has been evident in the small lakes of the Jarfjord and Vätsäri areas during the 2000s. The buffering capacity of these lakes has improved and the pH has increased. The reason for this recovery is that sulphate deposition has decreased, which is also evident in the water quality. However, concentrations of some metals, especially Ni and Cu, have risen during the 2000s. Ni concentrations have increased in all three areas, and Cu concentrations in the Pechenganikel and Jarfjord areas, which are located closer to the smelters. Emission levels of Ni and Cu did not fall during the 2000s. In fact, the emission levels of Ni compounds even increased compared to the 1990s.