20 results for Data Mining, Big Data, Consumi energetici, Weka Data Cleaning

in Doria (National Library of Finland DSpace Services) - National Library of Finland, Finland


Relevance:

80.00% 80.00%

Publisher:

Abstract:

Recent advances in machine learning increasingly enable the automatic construction of various types of computer-assisted methods that have been difficult or laborious for human experts to program. The tasks for which such tools are needed arise in many areas, here especially in the fields of bioinformatics and natural language processing. Machine learning methods may not work satisfactorily if they are not appropriately tailored to the task in question. However, their learning performance can often be improved by taking advantage of deeper insight into the application domain or the learning problem at hand. This thesis considers developing kernel-based learning algorithms that incorporate this kind of prior knowledge of the task in an advantageous way. Moreover, computationally efficient algorithms for training the learning machines for specific tasks are presented. In the context of kernel-based learning methods, prior knowledge is often incorporated by designing appropriate kernel functions. Another well-known way is to develop cost functions that fit the task under consideration. For disambiguation tasks in natural language, we develop kernel functions that take account of the positional information and the mutual similarities of words. It is shown that the use of this information significantly improves the disambiguation performance of the learning machine. Further, we design a new cost function that is better suited to information retrieval and to more general ranking problems than the cost functions designed for regression and classification. We also consider other applications of the kernel-based learning algorithms, such as text categorization and pattern recognition in differential display. We develop computationally efficient algorithms for training the considered learning machines with the proposed kernel functions.
We also design a fast cross-validation algorithm for regularized least-squares type learning algorithms. Further, we propose an efficient version of the regularized least-squares algorithm that can be used together with the new cost function for preference learning and ranking tasks. In summary, we demonstrate that the incorporation of prior knowledge is possible and beneficial, and that novel advanced kernels and cost functions can be used efficiently in algorithms.
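The abstract does not spell out its fast cross-validation algorithm, so the following is only a sketch of a well-known shortcut for kernel regularized least squares, not the thesis's own method. With G = (K + lam*I)^-1 and dual coefficients a = G y, the leave-one-out residual of example i equals a_i / G_ii, so all n held-out errors come from a single fit. The Gaussian kernel, the synthetic data and every function name are assumptions made for illustration.

```python
import numpy as np


def gaussian_kernel(X, Z, gamma=0.5):
    """Gaussian (RBF) kernel matrix between the rows of X and Z."""
    sq = (X ** 2).sum(1)[:, None] + (Z ** 2).sum(1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def rls_fit(K, y, lam):
    """Dual coefficients a = (K + lam*I)^-1 y, plus the inverse itself."""
    G = np.linalg.inv(K + lam * np.eye(len(y)))
    return G @ y, G


def loo_residuals_fast(K, y, lam):
    """All leave-one-out residuals from a single fit: a_i / G_ii."""
    a, G = rls_fit(K, y, lam)
    return a / np.diag(G)


def loo_residuals_naive(K, y, lam):
    """Reference version: retrain n times, holding out one example each time."""
    n = len(y)
    res = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        a, _ = rls_fit(K[np.ix_(keep, keep)], y[keep], lam)
        res[i] = y[i] - K[i, keep] @ a
    return res


# Synthetic data (an assumption; the thesis uses real task data).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)
K = gaussian_kernel(X, X)
fast = loo_residuals_fast(K, y, lam=1.0)
naive = loo_residuals_naive(K, y, lam=1.0)
```

The naive version solves n linear systems, while the shortcut needs one matrix inverse, which is what makes cross-validated selection of the regularization parameter cheap in this family of methods.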

Relevance:

80.00% 80.00%

Publisher:

Abstract:

Visual data mining (VDM) tools employ information visualization techniques in order to represent large amounts of high-dimensional data graphically and to involve the user in exploring data at different levels of detail. The users are looking for outliers, patterns and models – in the form of clusters, classes, trends, and relationships – in different categories of data, e.g., financial data and business information. The focus of this thesis is the evaluation of multidimensional visualization techniques, especially from the business user’s perspective. We address three research problems. The first problem is the evaluation of projection-based visualizations with respect to their effectiveness in preserving the original distances between data points and the clustering structure of the data. In this respect, we propose the use of existing clustering validity measures. We illustrate their usefulness in evaluating five visualization techniques: Principal Component Analysis (PCA), Sammon’s Mapping, Self-Organizing Map (SOM), Radial Coordinate Visualization and Star Coordinates. The second problem is concerned with evaluating different visualization techniques as to their effectiveness in visual data mining of business data. For this purpose, we propose an inquiry evaluation technique and conduct the evaluation of nine visualization techniques. The visualizations under evaluation are Multiple Line Graphs, Permutation Matrix, Survey Plot, Scatter Plot Matrix, Parallel Coordinates, Treemap, PCA, Sammon’s Mapping and the SOM. The third problem is the evaluation of the quality of use of VDM tools. We provide a conceptual framework for evaluating the quality of use of VDM tools and apply it to the evaluation of the SOM. In the evaluation, we use an inquiry technique for which we developed a questionnaire based on the proposed framework. The contributions of the thesis consist of three new evaluation techniques and the results obtained by applying these evaluation techniques.
The thesis provides a systematic approach to the evaluation of various visualization techniques. In this respect, first, we performed and described the evaluations in a systematic way, highlighting the evaluation activities and their inputs and outputs. Secondly, we integrated the evaluation studies into the broad framework of usability evaluation. The results of the evaluations are intended to help developers and researchers of visualization systems to select appropriate visualization techniques in specific situations. The results also contribute to the understanding of the strengths and limitations of the visualization techniques evaluated, and further to the improvement of these techniques.
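As an illustration of what "preserving the original distances" can mean for a projection, the sketch below projects data with PCA and scores the result with Sammon's stress. This is a swapped-in, simpler measure chosen for brevity; the thesis itself proposes clustering validity measures, which are not reproduced here. The synthetic data and all function names are assumptions.

```python
import numpy as np


def pairwise_distances(X):
    """Euclidean distances between all pairs of rows, as a flat vector."""
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(-1))
    iu = np.triu_indices(len(X), k=1)
    return D[iu]


def pca_project(X, n_components=2):
    """Project centred data onto its leading principal components via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T


def sammon_stress(X_high, X_low):
    """Sammon's stress; 0 means every pairwise distance is preserved exactly.

    Assumes no two rows of X_high coincide (distances must be non-zero).
    """
    d_high = pairwise_distances(X_high)
    d_low = pairwise_distances(X_low)
    return ((d_high - d_low) ** 2 / d_high).sum() / d_high.sum()


# Synthetic stand-in for a real high-dimensional data set.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 5))
stress_1d = sammon_stress(X, pca_project(X, 1))
stress_2d = sammon_stress(X, pca_project(X, 2))
```

Lower stress indicates a more faithful layout, so the same score could in principle be used to compare different projection techniques on one data set.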

Relevance:

80.00% 80.00%

Publisher:

Abstract:

Presentation by Anders Söderbäck at Kirjastoverkkopäivät (Library Network Days), Helsinki, 26 October 2011.

Relevance:

80.00% 80.00%

Publisher:

Abstract:

Poster at Open Repositories 2014, Helsinki, Finland, June 9-13, 2014

Relevance:

80.00% 80.00%

Publisher:

Abstract:

Presentation at Open Repositories 2014, Helsinki, Finland, June 9-13, 2014

Relevance:

80.00% 80.00%

Publisher:

Abstract:

The research in this Master's thesis project concerns Big Data transfer over parallel data links. My main objective is to assist the research team at Saint-Petersburg National Research University ITMO in accomplishing this project and to apply Green IT methods to the data transfer system. The goal of the team is to transfer Big Data over parallel data links using the SDN OpenFlow approach. My task as a team member was to compare existing data transfer applications in order to verify which one achieves the highest data transfer speed under which conditions, and to explain the reasons. In this thesis, five utilities were compared: Fast Data Transfer (FDT), BBCP, BBFTP, GridFTP, and FTS3. A number of scripts were developed that create random binary data, which is incompressible and therefore allows a fair comparison between the utilities; execute the utilities with specified parameters; record log files, results, and system parameters; and plot graphs to compare the results. Transferring such an enormous variety of data can take a long time, hence the need to reduce energy consumption and make the transfers greener. As part of the Green IT approach, our team used the OpenStack cloud computing infrastructure. It is more efficient to allocate a specific amount of hardware resources to test different scenarios than to use all the resources of our testbed. Testing our implementation on the OpenStack infrastructure showed that the virtual channel carries no other traffic, so we can achieve the highest possible throughput. With the final results we are in a position to identify which utilities transfer data faster in different scenarios with specific TCP parameters, and we can use them on real network data links.
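The thesis scripts themselves are not shown, so the following is only a sketch of the "random, incompressible test data" step under simple assumptions: the payload comes from the operating system's random source, and a zlib pass checks that it barely compresses. The function names and the file name in the usage note are hypothetical.

```python
import os
import zlib


def write_random_file(path, size_bytes, chunk_size=1 << 20):
    """Write size_bytes of OS-provided random bytes to path, chunk by chunk."""
    remaining = size_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(chunk_size, remaining)
            f.write(os.urandom(n))
            remaining -= n


def compression_ratio(path):
    """Compressed size divided by original size (near 1.0 means incompressible).

    Reads the whole file into memory, so use it on small samples only.
    """
    with open(path, "rb") as f:
        data = f.read()
    return len(zlib.compress(data, 9)) / len(data)
```

Calling `write_random_file("payload.bin", 10 * 1024 * 1024)` followed by `compression_ratio("payload.bin")` should give a ratio very close to 1, which suggests that a transfer utility with built-in compression gains no hidden advantage on such a payload.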

Relevance:

80.00% 80.00%

Publisher:

Abstract:

This thesis introduces heat demand forecasting models generated using data mining algorithms. The forecast spans one full day and can be used to regulate the heat consumption of buildings. To train the data mining models, two years of heat consumption data from a case building and weather measurement data from the Finnish Meteorological Institute are used. The thesis uses the Microsoft SQL Server Analysis Services data mining tools to generate the models and the CRISP-DM process framework to carry out the research. Results show that, at best, the built models predict heat demand with mean average percentage errors of 3.8% for the 24-h profile and 5.9% for the full day. A deployment model for integrating the generated data mining models into an existing building energy management system is also discussed.
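The abstract does not define its error measure. Assuming it refers to the usual mean absolute percentage error (MAPE), the quantity can be sketched as below; the function name and the numbers in the usage note are illustrative and are not taken from the thesis.

```python
def mape(actual, forecast):
    """Mean absolute percentage error in percent; actual values must be non-zero."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((a - f) / a) for a, f in zip(actual, forecast))
    return 100.0 * total / len(actual)
```

For measured values `[100, 200, 400]` and forecasts `[110, 190, 400]`, the relative errors are 10%, 5% and 0%, so `mape` returns approximately 5.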

Relevance:

80.00% 80.00%

Publisher:

Abstract:

The rapid development of technology and changes in the operating environment encourage organizations to adopt innovative solutions in order to remain competitive and keep pace with development. The traditional pursuit of cost savings and operational efficiency has been displaced by a desire to strengthen and grow competitive advantage. Despite these incentives, studies show that most organizations operating in Finland do not manage innovations based on different technologies very holistically; rather, their use of IT is more reactive. In this study we examine how the adoption of IT innovations, and of big data in particular, is managed in Finland, as well as possible differences between the two. The study was carried out with qualitative research methods, using the focus group method to collect the empirical data. The study presents the management of IT innovation adoption as a process that starts from a stimulus received through communication channels. Knowledge is strengthened in the persuasion stage, which triggers the evaluation stage. The process finally proceeds, through an assessment of the benefits and costs of the IT innovation, to a decision on adoption. Adoption is also influenced by various background factors that can promote or hinder the adoption of an IT innovation. In the decision-making stage, the organization's IT management and business management each have their own roles, which are shaped by the organization's division of labour and the size of the investment. The management approaches of the organizations studied are revealed by how the stages of their adoption and decision-making processes proceed, which background factors influence the adoption decision, and what kinds of benefits are sought. Determining how big data is managed is also affected by whether the organization has a strategy or action plan for exploiting it.
As the conclusion of the thesis we state that the adoption of IT innovations in general is managed in three ways: strategically, reactively, and as forced by change. The differences between the management approaches emerge through the size of the investment, the decision-making leading to adoption, and the reasons underlying adoption. In general, IT innovations appear to be managed in roughly equal proportions strategically, reactively, and as forced by change. In the management of big data adoption we observed features of only strategic and reactive management. The management of IT innovation adoption in general and of big data adoption differ in that big data appears to be managed even less strategically and its decision-making responsibilities are more fragmented. In general it can be said that the organizations studied rarely had clear and comprehensive strategies or action plans for managing the adoption of IT innovations.

Relevance:

80.00% 80.00%

Publisher:

Abstract:

In the new age of information technology, big data has grown to be a prominent phenomenon. As information technology evolves, organizations have begun to adopt big data and apply it as a tool throughout their decision-making processes. Research on big data has grown in recent years, but mainly from a technical stance, and there is a void in business-related cases. This thesis fills that gap by addressing big data challenges and failure cases. The Technology-Organization-Environment framework was applied to carry out a literature review on trends in Business Intelligence and Knowledge Management information system failures. A review of extant literature was carried out using a collection of leading information system journals. Academic papers and articles on big data, Business Intelligence, Decision Support Systems, and Knowledge Management systems were studied from both failure and success aspects in order to build a model for big data failure. I then delineate the contribution of the Information System failure literature, as it is the principal dynamic behind the Technology-Organization-Environment framework. The gathered literature was then categorised and a failure model was developed from the identified critical failure points. The failure constructs were further categorized, defined, and tabulated into a contextual diagram. The developed model and table were designed to act as a comprehensive starting point and as general guidance for academics, CIOs or other system stakeholders, facilitating decision-making in the big data adoption process by measuring the effect of technological, organizational, and environmental variables on perceived benefits, dissatisfaction and discontinued use.

Relevance:

80.00% 80.00%

Publisher:

Abstract:

Data mining, a hotly discussed term, has been studied in various fields. Its potential for refining decision-making, revealing patterns and creating valuable knowledge has won the attention of scholars and practitioners. However, few studies have attempted to combine data mining and libraries, where data is generated all the time. This thesis therefore aims to fill that gap. In addition, the opportunities created by data mining are explored as a means to enhance one of the most important elements of libraries: reference service. In order to thoroughly demonstrate the feasibility and applicability of data mining, the literature is reviewed to establish a critical understanding of data mining in libraries and to ascertain the current status of library reference service. The literature review indicates that free online data resources, other than data generated on social media, are rarely considered in current library data mining efforts. This finding motivates the present study to utilize free online resources. Furthermore, the natural match between data mining and libraries is established. The natural match is explained by emphasizing the richness of library data and by considering data mining as one kind of knowledge, an easy choice for libraries, and a wise method for overcoming reference service challenges. The natural match, especially the aspect that data mining could be helpful for library reference service, lays the main theoretical foundation for the empirical work in this study. Turku Main Library was selected as the case to answer the research question: whether data mining is feasible and applicable for improving reference service. In this case, daily visits to Turku Main Library from 2009 to 2015 are used as the resource for data mining. In addition, the corresponding weather conditions are collected from Weather Underground, which is freely available online.
Before being analyzed, the collected dataset is cleansed and preprocessed in order to ensure the quality of data mining. Multiple regression analysis is employed to mine the final dataset. Hourly visits are the dependent variable, and weather conditions, the Discomfort Index and the seven days of the week are the independent variables. Four models, one per season, are established to predict visiting situations in each season. Patterns are identified in the different seasons and implications are drawn from the discovered patterns. In addition, library-climate points are generated by a clustering method, which simplifies the process for librarians of using weather data to forecast library visits. The data mining result is then interpreted from the perspective of improving reference service. After this data mining work, the result of the case study is presented to librarians in order to collect professional opinions on the possibility of employing data mining to improve reference services. The opinions collected are positive, which implies that it is feasible to utilize data mining as a tool to enhance library reference service.
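A regression of visit counts on weather and weekday can be sketched as follows. This is an illustration under stated assumptions, not the thesis's model: the Discomfort Index is computed with Thom's commonly cited formula (the abstract does not say which variant was used), weekdays are encoded as six dummy columns against a baseline day, a single model is fitted rather than one per season, and all names are hypothetical.

```python
import numpy as np


def discomfort_index(temp_c, rel_humidity):
    """Thom's Discomfort Index (assumed form): T - 0.55*(1 - 0.01*RH)*(T - 14.5)."""
    return temp_c - 0.55 * (1.0 - 0.01 * rel_humidity) * (temp_c - 14.5)


def design_matrix(temp_c, rel_humidity, weekday):
    """Columns: intercept, temperature, Discomfort Index, six weekday dummies.

    Weekday 0 is the baseline and gets no dummy column, which avoids
    perfect collinearity with the intercept.
    """
    weekday = np.asarray(weekday)
    dummies = (weekday[:, None] == np.arange(1, 7)[None, :]).astype(float)
    di = discomfort_index(temp_c, rel_humidity)
    return np.column_stack([np.ones(len(weekday)), temp_c, di, dummies])


def fit_ols(X, y):
    """Ordinary least-squares coefficients for y ~ X @ beta."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta
```

Fitting one such model per season, as the abstract describes, would amount to calling `fit_ols` on four seasonal subsets of the rows.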

Relevance:

80.00% 80.00%

Publisher:

Abstract:

Relational databases have been the dominant approach in large database systems since the 1980s. Over the past decade, almost all industrial and personal information exchange has moved to the electronic world. This has caused enormous growth in data volumes, and that growth continues exponentially. At the same time, non-relational databases, i.e. NoSQL databases, have risen to a prominent position. Many organizations process large amounts of unstructured data, in which case using a traditional relational database alone is not necessarily the best, or even a sufficient, option. The change in internet culture behind the term Web 2.0 favours more adaptable and scalable NoSQL systems. Internet users, especially on social media, produce huge amounts of unstructured data. The collected information is no longer formatted according to a fixed schema; a single record may be associated with, for example, images, videos, references to instances created by other users, or address information. This thesis discusses the structure of NoSQL systems and their position, particularly in large information systems, and compares their advantages and disadvantages with respect to relational databases.
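The point about records that no longer follow one fixed schema can be illustrated with a toy, in-memory document collection. This is a sketch only: it uses plain Python dictionaries rather than any real NoSQL product, and the collection, field names and helper functions are all hypothetical.

```python
# Two records in the same hypothetical "posts" collection with different fields.
posts = [
    {"id": 1, "author": "user_a", "text": "Hello", "images": ["a.jpg", "b.jpg"]},
    {"id": 2, "author": "user_b", "video": "clip.mp4", "mentions": [1],
     "address": {"city": "Turku", "country": "FI"}},
]


def find(collection, **criteria):
    """Return the documents whose top-level fields match all given criteria."""
    return [doc for doc in collection
            if all(doc.get(key) == value for key, value in criteria.items())]


def fields_used(collection):
    """Union of top-level field names across all documents, sorted."""
    return sorted({key for doc in collection for key in doc})
```

A single relational table holding both records would need a column for every name returned by `fields_used(posts)`, most of them empty for any given row, or several joined tables; a document store simply keeps each record as it arrives.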

Relevance:

80.00% 80.00%

Publisher:

Abstract:

The incredibly rapid growth of air travel, mainly due to the jet airliners that appeared in the skies in the 1950s, created the need for systematic aviation safety research and for collecting data about air traffic. Structured data can be analysed easily by querying databases and running the results through graphical tools. However, analysing narratives, which often give more accurate information about a case, requires mining tools. Computer analysis of textual data was not possible until data mining tools were developed. Their use, at least in aviation, is still at a moderate level. The research aims at discovering lethal trends in flight safety reports. The narratives of 1,200 flight safety reports from the years 1994 – 1996, written in Finnish, were processed with three text mining tools. One of them was completely language independent, another had a specific configuration for Finnish, and the third was originally created for English; because encouraging results had been achieved with it in Spanish, a Finnish test was undertaken too. The global accident rate is stabilising and the situation can now be regarded as satisfactory, but because of the growth in air traffic, the absolute number of fatal accidents per year might increase if flight safety is not improved. Data collection and reporting systems have reached their top level. The focal point in increasing flight safety is analysis. Air traffic has generally been forecast to grow 5 – 6 per cent annually over the next two decades. During this period, global air travel will probably double even with relatively conservative expectations of economic growth. This development confronts airline management with growing pressure from increasing competition, a significant rise in fuel prices, and the need to reduce the incident rate as air traffic volumes grow.
All this emphasises the urgent need for new tools and methods. All three systems provided encouraging results, but also revealed challenges still to be overcome. Flight safety can be improved through the development and utilisation of sophisticated analysis tools and methods, such as data mining, whose results support executives' decision processes.