18 results for mining data streams

in Doria (National Library of Finland DSpace Services) - National Library of Finland, Finland


Relevance:

40.00%

Publisher:

Abstract:

Recent advances in machine learning increasingly enable the automatic construction of various types of computer-assisted methods that have been difficult or laborious for human experts to program. The need for such tools arises in many areas, here especially in the fields of bioinformatics and natural language processing. Machine learning methods may not work satisfactorily if they are not appropriately tailored to the task in question. However, their learning performance can often be improved by taking advantage of deeper insight into the application domain or the learning problem at hand. This thesis considers developing kernel-based learning algorithms that incorporate this kind of prior knowledge of the task in question in an advantageous way. Moreover, computationally efficient algorithms for training the learning machines for specific tasks are presented. In the context of kernel-based learning methods, prior knowledge is often incorporated by designing appropriate kernel functions. Another well-known way is to develop cost functions that fit the task under consideration. For disambiguation tasks in natural language, we develop kernel functions that take account of the positional information and the mutual similarities of words. It is shown that the use of this information significantly improves the disambiguation performance of the learning machine. Further, we design a new cost function that is better suited to the task of information retrieval and to more general ranking problems than the cost functions designed for regression and classification. We also consider other applications of the kernel-based learning algorithms, such as text categorization and pattern recognition in differential display. We develop computationally efficient algorithms for training the considered learning machines with the proposed kernel functions.
We also design a fast cross-validation algorithm for regularized least-squares type learning algorithms. Further, we propose an efficient version of the regularized least-squares algorithm that can be used together with the new cost function for preference learning and ranking tasks. In summary, we demonstrate that the incorporation of prior knowledge is possible and beneficial, and that novel advanced kernels and cost functions can be used in algorithms efficiently.
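The abstract does not describe its fast cross-validation algorithm, so the following is only an illustration of why cross-validation can be cheap for regularized least squares. It is a minimal sketch of the well-known leave-one-out identity for ridge regression, in which the held-out residual equals the ordinary residual divided by `1 - h_ii` (with `h_ii` the leverage of sample `i`). The function names, the linear (non-kernel) setting and the toy data are all assumptions made for this example and are not taken from the thesis.

```python
def solve(a, b):
    """Solve the small dense system a x = b by Gauss-Jordan elimination."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ridge_fit(xs, ys, lam):
    """Return A = X^T X + lam * I and the ridge weights w = A^{-1} X^T y."""
    d = len(xs[0])
    a = [[sum(x[j] * x[k] for x in xs) + (lam if j == k else 0.0)
          for k in range(d)] for j in range(d)]
    b = [sum(x[j] * y for x, y in zip(xs, ys)) for j in range(d)]
    return a, solve(a, b)


def loo_residuals_fast(xs, ys, lam):
    """Leave-one-out residuals from a single fit: (y_i - f_i) / (1 - h_ii)."""
    a, w = ridge_fit(xs, ys, lam)
    out = []
    for x, y in zip(xs, ys):
        pred = sum(wj * xj for wj, xj in zip(w, x))
        z = solve(a, x)                          # z = A^{-1} x_i
        h = sum(xj * zj for xj, zj in zip(x, z))  # leverage h_ii
        out.append((y - pred) / (1.0 - h))
    return out


def loo_residuals_naive(xs, ys, lam):
    """Reference implementation: refit the model n times, once per held-out sample."""
    out = []
    for i in range(len(xs)):
        _, w = ridge_fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:], lam)
        out.append(ys[i] - sum(wj * xj for wj, xj in zip(w, xs[i])))
    return out


# Hypothetical toy data: a constant feature plus one input feature.
XS = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.0], [1.0, 3.5], [1.0, 4.0]]
YS = [1.1, 2.0, 2.4, 4.2, 4.3]
fast = loo_residuals_fast(XS, YS, lam=0.1)
naive = loo_residuals_naive(XS, YS, lam=0.1)
```

Both functions return the same residuals up to rounding, but the fast version never refits the model, which is what makes it practical to try many regularization parameters.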

Relevance:

40.00%

Publisher:

Abstract:

Visual data mining (VDM) tools employ information visualization techniques in order to represent large amounts of high-dimensional data graphically and to involve the user in exploring data at different levels of detail. The users are looking for outliers, patterns and models – in the form of clusters, classes, trends, and relationships – in different categories of data, e.g., financial and business information. The focus of this thesis is the evaluation of multidimensional visualization techniques, especially from the business user’s perspective. We address three research problems. The first problem is the evaluation of projection-based visualizations with respect to their effectiveness in preserving the original distances between data points and the clustering structure of the data. In this respect, we propose the use of existing clustering validity measures. We illustrate their usefulness in evaluating five visualization techniques: Principal Components Analysis (PCA), Sammon’s Mapping, Self-Organizing Map (SOM), Radial Coordinate Visualization and Star Coordinates. The second problem is concerned with evaluating different visualization techniques as to their effectiveness in visual data mining of business data. For this purpose, we propose an inquiry evaluation technique and conduct the evaluation of nine visualization techniques. The visualizations under evaluation are Multiple Line Graphs, Permutation Matrix, Survey Plot, Scatter Plot Matrix, Parallel Coordinates, Treemap, PCA, Sammon’s Mapping and the SOM. The third problem is the evaluation of the quality of use of VDM tools. We provide a conceptual framework for evaluating the quality of use of VDM tools and apply it to the evaluation of the SOM. In the evaluation, we use an inquiry technique for which we developed a questionnaire based on the proposed framework. The contributions of the thesis consist of three new evaluation techniques and the results obtained by applying these evaluation techniques.
The thesis provides a systematic approach to evaluation of various visualization techniques. In this respect, first, we performed and described the evaluations in a systematic way, highlighting the evaluation activities, and their inputs and outputs. Secondly, we integrated the evaluation studies in the broad framework of usability evaluation. The results of the evaluations are intended to help developers and researchers of visualization systems to select appropriate visualization techniques in specific situations. The results of the evaluations also contribute to the understanding of the strengths and limitations of the visualization techniques evaluated and further to the improvement of these techniques.
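The first research problem above asks how well a projection preserves the original distances between data points. The thesis proposes clustering validity measures for this, which the abstract does not specify. As a simpler and clearly different illustration of the distance-preservation idea, here is a minimal sketch of Sammon's stress, the error that Sammon's Mapping minimizes. The function name and the sample points are hypothetical.

```python
import math
from itertools import combinations


def sammon_stress(high, low):
    """Sammon's stress between original points and their projections.

    high: points in the original space; low: the same points after projection.
    Returns 0.0 when every pairwise distance is preserved exactly.
    Assumes all original points are distinct (non-zero pairwise distances).
    """
    pairs = list(combinations(range(len(high)), 2))
    d_high = [math.dist(high[i], high[j]) for i, j in pairs]
    d_low = [math.dist(low[i], low[j]) for i, j in pairs]
    total = sum(d_high)
    # Each squared distance error is weighted by the inverse original distance,
    # so distortions of small distances are penalized more heavily.
    return sum((dh - dl) ** 2 / dh for dh, dl in zip(d_high, d_low)) / total


# Hypothetical example: three 3-D points projected by dropping the last coordinate.
points_3d = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 5.0)]
points_2d = [(0.0, 0.0), (3.0, 4.0), (0.0, 0.0)]
stress = sammon_stress(points_3d, points_2d)
```

A lower value means the projection distorts pairwise distances less, which allows different projection techniques to be compared on the same dataset.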

Relevance:

40.00%

Publisher:

Abstract:

This thesis introduces heat demand forecasting models generated using data mining algorithms. The forecast spans one full day and can be used in regulating the heat consumption of buildings. For training the data mining models, two years of heat consumption data from a case building and weather measurement data from the Finnish Meteorological Institute are used. The thesis uses Microsoft SQL Server Analysis Services data mining tools to generate the data mining models and the CRISP-DM process framework to implement the research. Results show that the built models can predict heat demand at best with mean average percentage errors of 3.8% for the 24-h profile and 5.9% for the full day. A deployment model for integrating the generated data mining models into an existing building energy management system is also discussed.
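The accuracy figures above are percentage errors averaged over the forecast. The sketch below assumes that the abstract's "mean average percentage error" denotes the usual mean absolute percentage error (MAPE); the function name and the sample numbers are hypothetical and are not results from the thesis.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent.

    Assumes every actual value is non-zero, since each error is
    expressed relative to the corresponding actual value.
    """
    if not actual or len(actual) != len(forecast):
        raise ValueError("actual and forecast must be non-empty and of equal length")
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)


# Hypothetical hourly heat demand (kW) and the corresponding forecast.
measured = [100.0, 200.0]
predicted = [110.0, 180.0]
error_pct = mape(measured, predicted)
```

Because each error is taken relative to the measured value, hours with low demand weigh as much as hours with high demand, which may be one reason the hourly profile and the full-day total are reported separately.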

Relevance:

40.00%

Publisher:

Abstract:

Data mining is a much-discussed term that has been studied in various fields. Its potential for refining the decision-making process, revealing patterns and creating valuable knowledge has won the attention of scholars and practitioners. However, fewer studies attempt to combine data mining with libraries, where data is generated all the time. This thesis therefore aims to fill that gap. At the same time, the opportunities created by data mining are explored to enhance one of the most important elements of libraries: reference service. In order to thoroughly demonstrate the feasibility and applicability of data mining, literature is reviewed to establish a critical understanding of data mining in libraries and to ascertain the current status of library reference service. The literature review indicates that free online data resources, other than data generated on social media, are rarely considered in current library data mining mandates. This finding motivates the present study to utilize free online resources. Furthermore, the natural match between data mining and libraries is established. The natural match is explained by emphasizing the richness of library data and by considering data mining as one kind of knowledge, an easy choice for libraries, and a wise method to overcome reference service challenges. The natural match, especially the aspect that data mining could be helpful for library reference service, lays the main theoretical foundation for the empirical work in this study. Turku Main Library was selected as the case to answer the research question: whether data mining is feasible and applicable for reference service improvement. In this case, the daily visits to Turku Main Library from 2009 to 2015 are used as the resource for data mining. In addition, corresponding weather conditions are collected from Weather Underground, which is freely available online.
Before being analyzed, the collected dataset is cleansed and preprocessed in order to ensure the quality of data mining. Multiple regression analysis is employed to mine the final dataset. Hourly visits are the dependent variable, and weather conditions, the Discomfort Index and the seven days of the week are the independent variables. In the end, four models, one for each season, are established to predict visiting situations in each season. Patterns are identified in different seasons and implications are drawn from the discovered patterns. In addition, library-climate points are generated by a clustering method, which simplifies the process for librarians using weather data to forecast the library visiting situation. Then the data mining result is interpreted from the perspective of improving reference service. After this data mining work, the result of the case study is presented to librarians so as to collect professional opinions regarding the possibility of employing data mining to improve reference services. In the end, positive opinions are collected, which implies that it is feasible to utilize data mining as a tool to enhance library reference service.
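The regression inputs described above can be sketched as one feature row per observation. The abstract does not give its Discomfort Index formula, so this example assumes Thom's commonly cited form, DI = T - 0.55 (1 - 0.01 RH)(T - 14.5), with T in degrees Celsius and RH in percent. The function names, the choice of Monday as the baseline day and the sample values are likewise assumptions for illustration.

```python
from datetime import date


def discomfort_index(temp_c, rel_humidity_pct):
    """Discomfort Index in Thom's commonly cited form (an assumed formula)."""
    return temp_c - 0.55 * (1.0 - 0.01 * rel_humidity_pct) * (temp_c - 14.5)


def feature_row(day, temp_c, rel_humidity_pct):
    """Build one row of independent variables for a multiple regression.

    Layout: [intercept, temperature, Discomfort Index, Tue, Wed, Thu, Fri, Sat, Sun].
    """
    # Six weekday dummies with Monday as the baseline, so that the dummies
    # are not collinear with the intercept column.
    weekday_dummies = [1.0 if day.weekday() == k else 0.0 for k in range(1, 7)]
    return [1.0, temp_c, discomfort_index(temp_c, rel_humidity_pct)] + weekday_dummies


# Hypothetical observation: a Tuesday in January with -3 degrees C and 80% humidity.
row = feature_row(date(2015, 1, 6), temp_c=-3.0, rel_humidity_pct=80.0)
```

Fitting one such model per season, as the abstract describes, would then amount to splitting the rows by season and running an ordinary least-squares fit on each subset.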

Relevance:

40.00%

Publisher:

Abstract:

The incredibly rapid growth of air travel to huge volumes, mainly because of the jet airliners that appeared in the skies in the 1950s, created the need for systematic aviation safety research and for collecting data about air traffic. Structured data can be analysed easily by querying databases and running the results through graphic tools. However, analysing narratives, which often give more accurate information about a case, requires mining tools. The analysis of textual data with computers was not possible until data mining tools were developed. Their use, at least in aviation, is still at a moderate level. The research aims at discovering lethal trends in flight safety reports. The narratives of 1,200 flight safety reports in Finnish from the years 1994–1996 were processed with three text mining tools. One of them was totally language independent, another had a specific configuration for Finnish, and the third was originally created for English; because encouraging results had been achieved with it in Spanish, a Finnish test was undertaken, too. The global rate of accidents is stabilising and the situation can now be regarded as satisfactory, but because of the growth in air traffic, the absolute number of fatal accidents per year might increase if flight safety is not improved. The collection of data and reporting systems have reached their top level. The focal point in increasing flight safety is analysis. Air traffic has generally been forecast to grow 5–6 per cent annually over the next two decades. During this period, global air travel will probably double even with relatively conservative expectations of economic growth. This development puts airline management under growing pressure due to increasing competition, a significant rise in fuel prices and the need to reduce the incident rate due to the expected growth in air traffic volumes.
All this emphasises the urgent need for new tools and methods. All systems provided encouraging results, but also revealed challenges still to be overcome. Flight safety can be improved through the development and utilisation of sophisticated analysis tools and methods, such as data mining, whose results can support the decision process of executives.

Relevance:

30.00%

Publisher:

Abstract:

Biomedical research is currently facing a new type of challenge: an excess of information, both in terms of raw data from experiments and in the number of scientific publications describing their results. Mirroring the focus on data mining techniques to address the issues of structured data, there has recently been great interest in the development and application of text mining techniques to make more effective use of the knowledge contained in biomedical scientific publications, accessible only in the form of natural human language. This thesis describes research done in the broader scope of projects aiming to develop methods, tools and techniques for text mining tasks in general and for the biomedical domain in particular. The work described here involves more specifically the goal of extracting information from statements concerning relations of biomedical entities, such as protein-protein interactions. The approach taken is one using full parsing (syntactic analysis of the entire structure of sentences) and machine learning, aiming to develop reliable methods that can further be generalized to apply also to other domains. The five papers at the core of this thesis describe research on a number of distinct but related topics in text mining. In the first of these studies, we assessed the applicability of two popular general English parsers to biomedical text mining and, finding their performance limited, identified several specific challenges to accurate parsing of domain text. In a follow-up study focusing on parsing issues related to specialized domain terminology, we evaluated three lexical adaptation methods. We found that the accurate resolution of unknown words can considerably improve parsing performance and introduced a domain-adapted parser that reduced the error rate of the original by 10% while also roughly halving parsing time.
To establish the relative merits of parsers that differ in the applied formalisms and the representation given to their syntactic analyses, we have also developed evaluation methodology, considering different approaches to establishing comparable dependency-based evaluation results. We introduced a methodology for creating highly accurate conversions between different parse representations, demonstrating the feasibility of unification of diverse syntactic schemes under a shared, application-oriented representation. In addition to allowing formalism-neutral evaluation, we argue that such unification can also increase the value of parsers for domain text mining. As a further step in this direction, we analysed the characteristics of publicly available biomedical corpora annotated for protein-protein interactions and created tools for converting them into a shared form, thus contributing also to the unification of text mining resources. The introduced unified corpora allowed us to perform a task-oriented comparative evaluation of biomedical text mining corpora. This evaluation established clear limits on the comparability of results for text mining methods evaluated on different resources, prompting further efforts toward standardization. To support this and other research, we have also designed and annotated BioInfer, the first domain corpus of its size combining annotation of syntax and biomedical entities with a detailed annotation of their relationships. The corpus represents a major design and development effort of the research group, with manual annotation that identifies over 6000 entities, 2500 relationships and 28,000 syntactic dependencies in 1100 sentences. In addition to combining these key annotations for a single set of sentences, BioInfer was also the first domain resource to introduce a representation of entity relations that is supported by ontologies and able to capture complex, structured relationships.
Part I of this thesis presents a summary of this research in the broader context of a text mining system, and Part II contains reprints of the five included publications.

Relevance:

30.00%

Publisher:

Abstract:

In this thesis we study the field of opinion mining by giving a comprehensive review of the available research on this topic. Using this knowledge, we also present a case study of a multilevel opinion mining system for a student organization's sales management system. We describe the field of opinion mining by discussing its historical roots, its motivations and applications, as well as the different scientific approaches that have been used to solve the challenging problem of mining opinions. To deal with this huge subfield of natural language processing, we first give an abstraction of the problem of opinion mining and describe the theoretical frameworks that are available for dealing with appraisal language. Then we discuss the relation between opinion mining and computational linguistics, which is a crucial pre-processing step for the accuracy of the subsequent steps of opinion mining. The second part of our thesis deals with the semantics of opinions, where we describe the different ways used to collect lists of opinion words as well as the methods and techniques available for extracting knowledge from opinions present in unstructured textual data. In the part about collecting lists of opinion words we describe manual, semi-manual and automatic ways to do so and give a review of the available lists that are used as gold standards in opinion mining research. For the methods and techniques of opinion mining we divide the task into three levels: the document, sentence and feature levels. The techniques presented at the document and sentence levels are divided into supervised and unsupervised approaches that are used to determine the subjectivity and polarity of texts and sentences at these levels of analysis. At the feature level we give a description of the techniques available for finding the opinion targets, the polarity of the opinions about these opinion targets and the opinion holders.
Also at the feature level we discuss the various ways to summarize and visualize the results of this level of analysis. In the third part of our thesis we present a case study of a sales management system that uses free-form text and that can benefit from an opinion mining system. Using the knowledge gathered in the review of this field, we propose a theoretical multilevel opinion mining system (MLOM) that can perform most of the tasks needed from an opinion mining system. Based on the previous research, we suggest that such a system could ease many of the laborious market research tasks done by the sales force that uses this sales management system, improving their insight into their partners and thereby increasing the quality of their sales services and their overall results.

Relevance:

30.00%

Publisher:

Abstract:

Raw measurement data does not always immediately convey useful information, but applying statistical analysis tools to the measurement data can improve the situation. Data analysis can offer benefits such as acquiring meaningful insight from the dataset, basing critical decisions on the findings, and ruling out human bias through proper statistical treatment. In this thesis we analyze data from an industrial mineral processing plant with the aim of studying the possibility of forecasting the quality of the final product, given by one variable, with a model based on the other variables. For the study, tools such as Qlucore Omics Explorer (QOE) and Sparse Bayesian regression (SB) are used. Later on, linear regression is used to build a model based on a subset of variables that seem to have the most significant weights in the SB model. The results obtained from QOE show that the variable representing the desired final product does not correlate with the other variables. For SB and linear regression, the results show that the models built on 1-day averaged data seriously underestimate the variance of the true data, whereas the two models built on 1-month averaged data are reliable and able to explain a larger proportion of the variability in the available data, making them suitable for prediction purposes. However, it is concluded that no single model can fit the whole available dataset well. Therefore, future work should either build piecewise nonlinear regression models on the same dataset, or have the plant provide another dataset, collected in a more systematic fashion than the present data, for further analysis.

Relevance:

30.00%

Publisher:

Abstract:

Environmental accountability has become a major source of competitive advantage for industrial companies, because customers consider it a relevant buying criterion. However, in order to leverage their environmental responsibility, industrial suppliers have to be able to demonstrate the environmental value of their products and services, which is also the aim of Kemira, a global water chemistry company considered in this study. The aim of this thesis is to develop a tool which Kemira can use to assess the environmental value of their solutions for customer companies in the mining industry. This study answers the questions of what kinds of methods exist to assess environmental impacts, and what kind of tool could be used to assess the environmental value of Kemira’s water treatment solutions. The environmental impacts of mining activities vary greatly between different mines. Generally the major impacts include water-related issues and wastes. Energy consumption is also a significant environmental aspect. Water-related issues include water consumption and impacts on water quality. There are several methods to assess environmental impacts, for example life cycle assessment, eco-efficiency tools, footprint calculations and process simulation. In addition, the corresponding financial value may be estimated using monetary assessment methods. Some of the industrial companies considered in the analysis of industry best practices use environmental and sustainability assessments. Based on the theoretical research and the interviews conducted, an Excel-based tool utilizing reference data on previous customer cases and customer-specific test results was considered to be most suitable to assess the environmental value of Kemira’s solutions. The tool can be used to demonstrate the functionality of Kemira’s solutions in customers’ processes, their impacts on other process parameters and their environmental and financial aspects.
In the future, the tool may be adapted to fit Kemira’s other segments as well, not only the mining industry.

Relevance:

30.00%

Publisher:

Abstract:

The Pasvik monitoring programme was created in 2006 as a result of the trilateral cooperation, and with the intention of following changes in the environment under variable pollution levels. Water quality is one of the basic elements of the Programme when assessing the effects of the emissions from the Pechenganikel mining and metallurgical industry (Kola GMK). The Metallurgic Production Renovation Programme was implemented by OJSC Kola GMK to reduce emissions of sulphur and heavy metal concentrated dust. However, the expectations for the reduction in emissions from the smelter in the settlement Nikel were not realized. Kola GMK itself has found that the modernization programme’s measures do not provide the planned reductions of sulphur dioxide emissions. In this report, temporal trends in water chemistry during 2000–2009 are examined on the basis of the data gathered from Lake Inari, River Pasvik and directly connected lakes, as well as from 26 small lakes in three areas: Pechenganikel (Russia), Jarfjord (Norway) and Vätsäri (Finland). The lower parts of the Pasvik watercourse are impacted by both atmospheric pollution and direct wastewater discharge from the Pechenganikel smelter and the settlement of Nikel. The upper section of the watercourse, and the small lakes and streams which are not directly linked to the Pasvik watercourse, only receive atmospheric pollution. The data obtained confirms the ongoing pollution of the river and water system. Copper (Cu), nickel (Ni) and sulphates are the main pollution components. The highest levels were observed close to the smelters. The most polluted water bodies of the basin are the River Kolosjoki, which directly receives the sewage discharge from the smelters, and the stream connecting Lakes Salmijarvi and Kuetsjarvi. The concentrations of metals and sulphates in the River Pasvik are higher downstream from Lake Kuetsjarvi.
There has been no fall in the concentrations of pollutants in the Pasvik watercourse over the last 10 years. Ongoing recovery from acidification has been evident in the small lakes of the Jarfjord and Vätsäri areas during the 2000s. The buffering capacity of these lakes has improved and the pH has increased. The reason for this recovery is that sulphate deposition has decreased, which is also evident in the water quality. However, concentrations of some metals, especially Ni and Cu, have risen during the 2000s. Ni concentrations have increased in all three areas, and Cu concentrations in the Pechenganikel and Jarfjord areas, which are located closer to the smelters. Emission levels of Ni and Cu did not fall during the 2000s. In fact, the emission levels of Ni compounds even increased compared to the 1990s.

Relevance:

30.00%

Publisher:

Abstract:

The aim of this master’s thesis is to analyze the mining industry customers' current and future needs for water treatment services and to discover new business development opportunities in the context of mine water treatment. In addition, the study focuses on specifying the service offerings needed and evaluating suitable revenue generation models for them. The main research question of the study is: What kind of service needs related to water treatment can be identified in the Finnish mining industry? The literature examined in the study focused on industrial service classification and the new service development process as well as the revenue generation of services. A qualitative research approach employing a case study method was chosen for the study. The study uses customer and expert interviews as its primary data source, complemented by archival data. The primary data was gathered by conducting a total of 13 interviews, and the interviews were analyzed using qualitative content analysis. Abductive logic was chosen as the way of conducting scientific reasoning in this study. As a result, new service proposals were developed for Finnish mining industry suppliers. The main areas of development were asset efficiency services and process support services. The service needs were strongly associated with suppliers’ know-how of water treatment process optimization and cost-effectiveness, as well as of alternative technologies. The study provides insight for managers who wish to pursue water treatment services as a part of their business offering.

Relevance:

30.00%

Publisher:

Abstract:

Presentation at Open Repositories 2014, Helsinki, Finland, June 9-13, 2014

Relevance:

30.00%

Publisher:

Abstract:

Purification of hydrocarbon waste streams is needed to recycle valuable hydrocarbon products, reduce hazardous impacts on the environment, and save energy. To achieve these goals, research must focus on the search for effective and feasible purification and re-refining technologies. Hydrocarbon waste streams can contain both additives deliberately added to the original product and undesired contaminants accumulated during the operation cycle. Compounds may have degenerated or cross-reacted. Thus, the presence of unknown species causes additional challenges for the purification process. The adsorption process is most suitable for reducing impurities to very low concentrations. Its main advantages are the availability of selective commercial adsorbents and the option to regenerate and recycle used separation material. In the experimental part, a used hydrocarbon fraction was purified with various separation materials. First, suitable materials were screened. In the second stage, temperature dependence and adsorption kinetics were studied. Finally, one fixed bed experiment was done with the most suitable material. Additionally, FTIR measurements of hydrocarbon samples were carried out to develop a model to monitor the concentrations of three target impurities based on spectral data. The adsorption capacities of the tested separation materials were observed to be too low to achieve high enough removal efficiencies for the target impurities. Based on the obtained data, a batch process would be more suitable than a fixed bed process, and operation at high temperatures is favorable. An additional pretreatment step is recommended to improve removal efficiency. The FTIR measurement proved to be a reliable and fast analysis method for challenging hydrocarbon samples.

Relevance:

30.00%

Publisher:

Abstract:

The Pasvik monitoring programme was created in 2006 as a result of the trilateral cooperation and with the intention of following changes in the environment under variable pollution levels. Water quality is one of the basic elements of the programme when assessing the effects of the emissions from the Pechenganikel mining and metallurgical industry (Kola GMK). In this report temporal trends of the water chemistry during 2000–2013 are examined on the basis of the data gathered from Lake Inari, River Pasvik and directly connected lakes, Lake Kuetsjarvi and 25 small lakes in three areas: Pechenganikel (Russia), Jarfjord (Norway) and Vätsäri (Finland). The lower parts of the Pasvik watercourse are impacted by both atmospheric pollution and direct wastewater discharge from the Pechenganikel smelter and the settlement of Nikel. The upper section of the watercourse and the small lakes and streams which are not directly linked to the Pasvik watercourse only receive atmospheric pollution. Lake Inari is free of direct emissions from Pechenganikel and its water quality is excellent. In River Pasvik and the directly connected lakes, copper, nickel and sulphates are the main pollutants. The most polluted water bodies are the Kolosjoki River and the stream connecting Lakes Salmijarvi and Kuetsjarvi. The concentration of metals and sulphates in the water notably increases in the river downstream of Lake Kuetsjarvi. In Lake Kuetsjarvi copper and nickel concentrations are clearly elevated and have changed insignificantly in the last years of the research period. In the small border area lakes, recovery from acidification is evident in Vätsäri and Jarfjord. Nickel and copper concentrations have fluctuated but remained at a clearly elevated level in Jarfjord and Pechenga. Copper concentrations have been rising slightly in recent years.
In the Pechenga area, nickel concentrations during the last four monitoring years are decreasing in some places, but the regional trend over the whole time series is still positive.