11 results for Structured data
in Doria (National Library of Finland DSpace Services) - National Library of Finland, Finland
Abstract:
Despite the fact that the literature on Business Intelligence and managerial decision-making is extensive, relatively little effort has been made to research the relationship between the two. This field of study is becoming increasingly important, since the amount of data in the world is growing every second. Companies require capabilities and resources in order to utilize both structured and unstructured data from internal and external data sources. However, present-day Business Intelligence technologies enable managers to utilize data effectively in decision-making. Based on the prior literature, the empirical part of the thesis identifies the enablers and constraints in the computer-aided managerial decision-making process. The theoretical part provides a preliminary understanding of the research area through a literature review. The key concepts, such as Business Intelligence and managerial decision-making, are explored by reviewing the relevant literature, and different data sources as well as data forms are analyzed in further detail. All key concepts are taken into account when the empirical part is carried out. The empirical part examines how the themes covered in the theoretical part appear in real-world situations.
Three selected case companies are analyzed through statements that are considered critical prerequisites for successful computer-aided managerial decision-making. The case study analysis, which forms part of the empirical section, enables the researcher to examine the relationship between Business Intelligence and managerial decision-making. Based on the case study interviews, the researcher identifies the relevant enablers and constraints. The findings indicate that the constraints have a strongly negative influence on the decision-making process. In addition, managers are aware of the positive implications that Business Intelligence has for decision-making, but not all possibilities are yet utilized. As the main result of this study, a data-driven framework for managerial decision-making is introduced. This framework can be used when managerial decision-making processes are evaluated and analyzed.
Abstract:
The incredibly rapid growth of air travel to huge volumes, driven mainly by the jet airliners that appeared in the skies in the 1950s, created the need for systematic aviation safety research and for collecting data about air traffic. Structured data can be analysed easily by querying databases and running the results through graphic tools. However, analysing the narratives, which often give more accurate information about a case, requires mining tools. The analysis of textual data with computers has only become possible as data mining tools have been developed, and their use, at least in aviation, is still at a moderate level. The research aims at discovering lethal trends in the flight safety reports. The narratives of 1,200 Finnish-language flight safety reports from the years 1994–1996 were processed with three text mining tools. One of them was completely language independent, another had a specific configuration for Finnish, and the third was originally created for English but had achieved encouraging results with Spanish, which is why a Finnish test was undertaken as well. The global rate of accidents is stabilising and the situation can now be regarded as satisfactory, but because of the growth in air traffic, the absolute number of fatal accidents per year might increase unless flight safety is improved. Data collection and reporting systems have already reached a high level of maturity; the focal point in improving flight safety is analysis. Air traffic has generally been forecast to grow 5–6 per cent annually over the next two decades. During this period, global air travel will probably double, even under relatively conservative expectations of economic growth. This development puts airline management under growing pressure due to increasing competition, a significant rise in fuel prices, and the need to reduce the incident rate in the face of the expected growth in air traffic volumes. All this emphasises the urgent need for new tools and methods. All three systems provided encouraging results, while also revealing challenges still to be overcome. Flight safety can be improved through the development and utilisation of sophisticated analysis tools and methods, such as data mining, with their results supporting the decision processes of executives.
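As an illustration only (the thesis used three specific text mining tools that are not reproduced here), the following Python sketch shows one simple way of grouping incident narratives so that recurring themes surface; the sample narratives, library choices and parameter values are assumptions, not material from the study.

# Illustrative sketch only: groups incident narratives by textual similarity
# to surface recurring themes, using scikit-learn as a stand-in for the
# text mining tools evaluated in the thesis.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical sample narratives (English here for readability;
# the reports analysed in the thesis were in Finnish).
narratives = [
    "Bird strike on final approach, go-around performed.",
    "TCAS resolution advisory during climb, separation restored.",
    "Runway incursion by ground vehicle, takeoff rejected.",
    "Bird flock observed near threshold, landing continued.",
]

# Character n-grams cope reasonably well with highly inflected languages
# such as Finnish without language-specific preprocessing.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
X = vectorizer.fit_transform(narratives)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
for label, text in zip(kmeans.labels_, narratives):
    print(label, text)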
Abstract:
The purpose of this Master's thesis is to examine the possibilities that XML offers for integrating a heterogeneous service network. The thesis describes the general theory of the XML language and focuses in particular on the features that are important for communication between applications. It also reviews how application development environments have become more heterogeneous, the resulting evolution of service architectures, and how these changes affect the use of XML. As part of the work, the Fuse service platform was designed and implemented for natural language service development. The thesis describes the architecture of the platform and examines how XML is used to integrate a natural language interpreter with a service. Other possible uses of XML for improving the Fuse service platform are also evaluated.
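As a hedged illustration of the kind of XML-based integration discussed above, the sketch below wraps an interpreted natural-language request into an XML message and reads a service's XML reply; the element names and the example utterance are hypothetical and are not taken from the Fuse platform's actual message formats.

# Minimal sketch of an XML message exchange between a natural-language
# interpreter and a service. Element names are hypothetical.
import xml.etree.ElementTree as ET

def build_request(utterance: str, intent: str, slots: dict) -> bytes:
    """Wrap an interpreted utterance into an XML service request."""
    req = ET.Element("serviceRequest")
    ET.SubElement(req, "utterance").text = utterance
    ET.SubElement(req, "intent").text = intent
    params = ET.SubElement(req, "parameters")
    for name, value in slots.items():
        ET.SubElement(params, "param", name=name).text = value
    return ET.tostring(req, encoding="utf-8")

def parse_response(xml_bytes: bytes) -> dict:
    """Extract status and payload fields from an XML service response."""
    root = ET.fromstring(xml_bytes)
    return {"status": root.findtext("status"), "message": root.findtext("message")}

request = build_request(
    "What is the weather in Lappeenranta tomorrow?",
    intent="weather_forecast",
    slots={"location": "Lappeenranta", "date": "tomorrow"},
)
print(request.decode("utf-8"))

reply = b"<serviceResponse><status>ok</status><message>-2 C, light snow</message></serviceResponse>"
print(parse_response(reply))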
Abstract:
Current-day web search engines (e.g., Google) do not crawl and index a significant portion of the Web and, hence, web users relying on search engines only are unable to discover and access a large amount of information from the non-indexable part of the Web. Specifically, dynamic pages generated based on parameters provided by a user via web search forms (or search interfaces) are not indexed by search engines and cannot be found in search results. Such search interfaces provide web users with online access to myriads of databases on the Web. In order to obtain some information from a web database of interest, a user issues his/her query by specifying query terms in a search form and receives the query results, a set of dynamic pages that embed the required information from the database. At the same time, issuing a query via an arbitrary search interface is an extremely complex task for any kind of automatic agent, including web crawlers, which, at least up to the present day, do not even attempt to pass through web forms on a large scale. In this thesis, our primary and key object of study is a huge portion of the Web (hereafter referred to as the deep Web) hidden behind web search interfaces. We concentrate on three classes of problems around the deep Web: characterization of the deep Web, finding and classifying deep web resources, and querying web databases. Characterizing the deep Web: Though the term deep Web was coined in 2000, which is a long time ago for any web-related concept or technology, we still do not know many important characteristics of the deep Web. Another matter of concern is that the surveys of the deep Web existing so far are predominantly based on the study of deep web sites in English. One can then expect that findings from these surveys may be biased, especially owing to a steady increase in non-English web content. In this way, surveying national segments of the deep Web is of interest not only to national communities but to the whole web community as well. In this thesis, we propose two new methods for estimating the main parameters of the deep Web. We use the suggested methods to estimate the scale of one specific national segment of the Web and report our findings. We also build and make publicly available a dataset describing more than 200 web databases from the national segment of the Web. Finding deep web resources: The deep Web has been growing at a very fast pace. It has been estimated that there are hundreds of thousands of deep web sites. Due to the huge volume of information in the deep Web, there has been significant interest in approaches that allow users and computer applications to leverage this information. Most approaches assumed that search interfaces to web databases of interest are already discovered and known to query systems. However, such assumptions do not hold true, mostly because of the large scale of the deep Web: indeed, for any given domain of interest there are too many web databases with relevant content. Thus, the ability to locate search interfaces to web databases becomes a key requirement for any application accessing the deep Web. In this thesis, we describe the architecture of the I-Crawler, a system for finding and classifying search interfaces. Specifically, the I-Crawler is intentionally designed to be used in deep Web characterization studies and for constructing directories of deep web resources.
Unlike almost all other approaches to the deep Web existing so far, the I-Crawler is able to recognize and analyze JavaScript-rich and non-HTML searchable forms. Querying web databases: Retrieving information by filling out web search forms is a typical task for a web user. This is all the more so as the interfaces of conventional search engines are also web forms. At present, a user needs to manually provide input values to search interfaces and then extract the required data from the result pages. Manually filling out forms is cumbersome and not feasible for complex queries, yet such queries are essential for many web searches, especially in the area of e-commerce. In this way, the automation of querying and retrieving data behind search interfaces is desirable and essential for such tasks as building domain-independent deep web crawlers and automated web agents, searching for domain-specific information (vertical search engines), and extracting and integrating information from various deep web resources. We present a data model for representing search interfaces and discuss techniques for extracting field labels, client-side scripts and structured data from HTML pages. We also describe a representation of result pages and discuss how to extract and store the results of form queries. In addition, we present a user-friendly and expressive form query language that allows one to retrieve information behind search interfaces and extract useful data from the result pages based on specified conditions. We implement a prototype system for querying web databases and describe its architecture and component design.
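To make the form-querying step more concrete, here is a deliberately simplified Python sketch that extracts the named input fields of an HTML search form and issues a GET query with user-supplied values; it ignores the harder cases treated in the thesis (JavaScript-rich forms, POST submissions, result page extraction), and the form markup and URL are invented examples.

# Simplified sketch: collect the action URL and named input fields of a
# search form, fill in a query term, and build the resulting GET request.
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin

class FormFieldExtractor(HTMLParser):
    """Collects the action URL and named input fields of the first form."""
    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and self.action is None:
            self.action = attrs.get("action", "")
        elif tag in ("input", "select") and "name" in attrs:
            self.fields[attrs["name"]] = attrs.get("value", "")

page = """
<form action="/search" method="get">
  <input type="text" name="title">
  <input type="hidden" name="lang" value="en">
  <input type="submit" value="Search">
</form>
"""

extractor = FormFieldExtractor()
extractor.feed(page)
extractor.fields["title"] = "structured data"   # user-provided query term
query_url = urljoin("http://example.org/", extractor.action)
print(query_url + "?" + urlencode(extractor.fields))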
Abstract:
Biomedical research is currently facing a new type of challenge: an excess of information, both in terms of raw data from experiments and in the number of scientific publications describing their results. Mirroring the focus on data mining techniques to address the issues of structured data, there has recently been great interest in the development and application of text mining techniques to make more effective use of the knowledge contained in biomedical scientific publications, accessible only in the form of natural human language. This thesis describes research done in the broader scope of projects aiming to develop methods, tools and techniques for text mining tasks in general and for the biomedical domain in particular. The work described here involves more specifically the goal of extracting information from statements concerning relations of biomedical entities, such as protein-protein interactions. The approach taken uses full parsing, that is, syntactic analysis of the entire structure of sentences, together with machine learning, aiming to develop reliable methods that can further be generalized to apply also to other domains. The five papers at the core of this thesis describe research on a number of distinct but related topics in text mining. In the first of these studies, we assessed the applicability of two popular general English parsers to biomedical text mining and, finding their performance limited, identified several specific challenges to accurate parsing of domain text. In a follow-up study focusing on parsing issues related to specialized domain terminology, we evaluated three lexical adaptation methods. We found that the accurate resolution of unknown words can considerably improve parsing performance and introduced a domain-adapted parser that reduced the error rate of the original by 10% while also roughly halving parsing time. To establish the relative merits of parsers that differ in the applied formalisms and the representation given to their syntactic analyses, we have also developed evaluation methodology, considering different approaches to establishing comparable dependency-based evaluation results. We introduced a methodology for creating highly accurate conversions between different parse representations, demonstrating the feasibility of unifying diverse syntactic schemes under a shared, application-oriented representation. In addition to allowing formalism-neutral evaluation, we argue that such unification can also increase the value of parsers for domain text mining. As a further step in this direction, we analysed the characteristics of publicly available biomedical corpora annotated for protein-protein interactions and created tools for converting them into a shared form, thus contributing also to the unification of text mining resources. The introduced unified corpora allowed us to perform a task-oriented comparative evaluation of biomedical text mining corpora. This evaluation established clear limits on the comparability of results for text mining methods evaluated on different resources, prompting further efforts toward standardization. To support this and other research, we have also designed and annotated BioInfer, the first domain corpus of its size combining annotation of syntax and biomedical entities with a detailed annotation of their relationships.
The corpus represents a major design and development effort of the research group, with manual annotation that identifies over 6,000 entities, 2,500 relationships and 28,000 syntactic dependencies in 1,100 sentences. In addition to combining these key annotations for a single set of sentences, BioInfer was also the first domain resource to introduce a representation of entity relations that is supported by ontologies and able to capture complex, structured relationships. Part I of this thesis presents a summary of this research in the broader context of a text mining system, and Part II contains reprints of the five included publications.
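As a toy illustration of the dependency-based, formalism-neutral evaluation idea mentioned above, the following Python sketch compares two parsers' analyses once both have already been converted into a shared (head, dependent, label) representation; the sentence, the labels and the conversion step itself are assumptions for illustration, not material from the thesis.

# Toy sketch: once two parsers' outputs share a dependency representation,
# labeled and unlabeled agreement can be measured directly.
def dependency_agreement(gold: set, predicted: set) -> dict:
    """Labeled and unlabeled agreement between two sets of dependencies."""
    unlabeled_gold = {(h, d) for h, d, _ in gold}
    unlabeled_pred = {(h, d) for h, d, _ in predicted}
    return {
        "labeled": len(gold & predicted) / len(gold),
        "unlabeled": len(unlabeled_gold & unlabeled_pred) / len(gold),
    }

# "MAPK phosphorylates ERK": hypothetical analyses from two parsers.
parser_a = {("phosphorylates", "MAPK", "nsubj"),
            ("phosphorylates", "ERK", "dobj")}
parser_b = {("phosphorylates", "MAPK", "nsubj"),
            ("phosphorylates", "ERK", "obj")}   # same attachment, different label

print(dependency_agreement(parser_a, parser_b))
# -> {'labeled': 0.5, 'unlabeled': 1.0}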
Abstract:
Learning of preference relations has recently received significant attention in the machine learning community. It is closely related to classification and regression analysis and can be reduced to these tasks. However, preference learning involves the prediction of an ordering of the data points rather than the prediction of a single numerical value, as in regression, or a class label, as in classification. Therefore, studying preference relations within a separate framework not only facilitates better theoretical understanding of the problem, but also motivates the development of efficient algorithms for the task. Preference learning has many applications in domains such as information retrieval, bioinformatics and natural language processing. For example, algorithms that learn to rank are frequently used in search engines for ordering the documents retrieved by a query. Preference learning methods have also been applied to collaborative filtering problems for predicting individual customer choices from the vast amount of user-generated feedback. In this thesis we propose several algorithms for learning preference relations. These algorithms stem from the well-founded and robust class of regularized least-squares methods and have many attractive computational properties. In order to improve the performance of our methods, we introduce several non-linear kernel functions. Thus, the contribution of this thesis is twofold: kernel functions for structured data, used to take advantage of various non-vectorial data representations, and preference learning algorithms suitable for different tasks, namely efficient learning of preference relations, learning with large amounts of training data, and semi-supervised preference learning. The proposed kernel-based algorithms and kernels are applied to the parse ranking task in natural language processing, document ranking in information retrieval, and remote homology detection in bioinformatics. Training of kernel-based ranking algorithms can be infeasible when the size of the training set is large. This problem is addressed by proposing a preference learning algorithm whose computational complexity scales linearly with the number of training data points. We also introduce a sparse approximation of the algorithm that can be efficiently trained with large amounts of data. For situations where only a small amount of labeled data but a large amount of unlabeled data is available, we propose a co-regularized preference learning algorithm. To conclude, the methods presented in this thesis address not only the problem of efficient training of the algorithms but also fast regularization parameter selection, multiple output prediction, and cross-validation. Furthermore, the proposed algorithms lead to notably better performance in many of the preference learning tasks considered.
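The following Python sketch conveys the flavour of the regularized least-squares approach to preference learning in its simplest, linear special case: each training pair states that one item should be ranked above another, and a weight vector is fitted so that score differences approach a margin of one. The thesis's algorithms are kernel-based and considerably more general; the data and parameter values below are invented.

# Simplified linear sketch of pairwise regularized least-squares
# preference learning (ridge regression on pairwise difference vectors).
import numpy as np

def fit_linear_preferences(X, pairs, lam=1.0):
    """X: (n, d) feature matrix; pairs: list of (i, j) with item i preferred over item j."""
    D = np.array([X[i] - X[j] for i, j in pairs])   # pairwise difference vectors
    d = D.shape[1]
    # Closed-form ridge solution pushing each score difference toward 1.
    w = np.linalg.solve(D.T @ D + lam * np.eye(d), D.T @ np.ones(len(pairs)))
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
pairs = [(0, 1), (2, 3), (4, 5)]        # hypothetical preference judgements
w = fit_linear_preferences(X, pairs)
scores = X @ w                           # higher score means ranked higher
print(scores)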
Abstract:
XML-based data representation is increasingly used for presenting structured information. The aim is to provide a generally useful and reusable way to share common data across different interfaces. XML techniques are also used to correct shortcomings found in earlier applications and to improve their operation. This Master's thesis presents a driver redesign for Teleste's LabView-based test application environment. The earlier driver model was improved by adding functionality based on XML techniques. The goal was to reduce the programming effort required in test application development by replacing hard-coded application features with XML-based configuration files. The system is built on a general-purpose driver that uses Teleste's own EMS protocol to communicate with the products under test. The driver model uses XML-based configuration files to define the properties of the products under test, and XML schema files to describe the message types of the driver's communication protocol and their structures. As a result of the work, a new kind of driver model utilizing XML techniques was created. A model based on a single shared driver makes the implementation of test applications more uniform and reduces the programming effort required. The use of the driver was further eased by implementing a dedicated editor in the test application development environment, with which driver-based functions can easily be created.
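The driver described above is implemented in LabView; purely as an illustration of the underlying idea of moving hard-coded test parameters into XML configuration files, the Python sketch below parses a small configuration describing product properties and protocol message types. The file structure shown is hypothetical and is not Teleste's actual EMS configuration or schema format.

# Illustrative only: a generic driver can read product properties and
# message definitions from XML instead of hard-coded constants.
import xml.etree.ElementTree as ET

CONFIG = """
<product name="amplifier-x">
  <property id="gain"  type="float" min="0.0" max="35.0"/>
  <property id="slope" type="float" min="0.0" max="15.0"/>
  <message id="READ_STATUS" code="0x10"/>
  <message id="SET_GAIN"    code="0x21"/>
</product>
"""

root = ET.fromstring(CONFIG)
properties = {p.get("id"): {"type": p.get("type"),
                            "min": float(p.get("min")),
                            "max": float(p.get("max"))}
              for p in root.findall("property")}
messages = {m.get("id"): int(m.get("code"), 16) for m in root.findall("message")}

# The driver can now validate values and build protocol messages
# from the configuration rather than from code.
print(properties["gain"], hex(messages["SET_GAIN"]))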
Abstract:
Machine learning provides tools for the automated construction of predictive models in data-intensive areas of engineering and science. The family of regularized kernel methods has in recent years become one of the mainstream approaches to machine learning, due to a number of advantages the methods share. The approach provides theoretically well-founded solutions to the problems of under- and overfitting, allows learning from structured data, and has been empirically demonstrated to yield high predictive performance on a wide range of application domains. Historically, the problems of classification and regression have gained the majority of attention in the field. In this thesis we focus on another type of learning problem, that of learning to rank. In learning to rank, the aim is to learn, from a set of past observations, a ranking function that can order new objects according to how well they match some underlying criterion of goodness. As an important special case of the setting, we can recover the bipartite ranking problem, corresponding to maximizing the area under the ROC curve (AUC) in binary classification. Ranking applications appear in a large variety of settings; examples encountered in this thesis include document retrieval in web search, recommender systems, information extraction and automated parsing of natural language. We consider the pairwise approach to learning to rank, where ranking models are learned by minimizing the expected probability of ranking any two randomly drawn test examples incorrectly. The development of computationally efficient kernel methods based on this approach has in the past proven to be challenging. Moreover, it is not clear which techniques for estimating the predictive performance of learned models are the most reliable in the ranking setting, and how these techniques can be implemented efficiently. The contributions of this thesis are as follows. First, we develop RankRLS, a computationally efficient kernel method for learning to rank that is based on minimizing a regularized pairwise least-squares loss. In addition to training methods, we introduce a variety of algorithms for tasks such as model selection, multi-output learning, and cross-validation, based on computational shortcuts from matrix algebra. Second, we improve the fastest known training method for the linear version of the RankSVM algorithm, which is one of the most well-established methods for learning to rank. Third, we study the combination of the empirical kernel map and reduced set approximation, which allows the large-scale training of kernel machines using linear solvers, and propose computationally efficient solutions to cross-validation when using this approach. Next, we explore the problem of reliable cross-validation when using AUC as a performance criterion, through an extensive simulation study. We demonstrate that the proposed leave-pair-out cross-validation approach leads to more reliable performance estimation than commonly used alternative approaches. Finally, we present a case study on applying machine learning to information extraction from biomedical literature, which combines several of the approaches considered in the thesis. The thesis is divided into two parts. Part I provides the background for the research work and summarizes the most central results, while Part II consists of the five original research articles that are the main contribution of this thesis.
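For readers unfamiliar with the pairwise least-squares approach, one common way to write such an objective (given here as a general formulation from the regularized least-squares ranking literature, not as a quotation from the thesis) is

\[
J(f) = \sum_{(i,j) \in P} \Big( (y_i - y_j) - \big( f(x_i) - f(x_j) \big) \Big)^2 + \lambda \, \lVert f \rVert_{\mathcal{H}}^2 ,
\]

where \(P\) is the set of training example pairs whose relative order is considered, the \(y\) values are the associated utility or relevance scores, \(f\) is sought from the reproducing kernel Hilbert space \(\mathcal{H}\) induced by the chosen kernel, and \(\lambda > 0\) is the regularization parameter.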
Abstract:
Metadata, in increasing levels of sophistication, has been the most powerful concept used in the management of unstructured information ever since the first librarian used the Dewey decimal system for library classification. It remains to be seen, however, what the best approach is to implementing metadata to manage huge volumes of unstructured information in a large organization. Also, once metadata has been implemented, how can one track whether it is adding value to the company and whether the implementation has been successful? The existing literature on metadata tends either to focus too much on technical and quality aspects or to describe adoption issues for general information management initiatives. This research therefore strives to address these gaps by providing a consolidated framework for understanding the value added by implementing metadata. The basic methodology used is that of a case study, which incorporates aspects of design science, surveys, and interviews in order to provide a holistic approach to quantitative and qualitative analysis of the case. The research identifies the various approaches to implementing metadata, particularly studying the one followed by the unit of analysis of the case study, a large company in the oil and gas sector. Of the three approaches identified, the selected company already follows an approach that appears to be superior. The researcher further explores its shortcomings and proposes a slightly modified approach that can handle them. The research categorically and thoroughly (in context) identifies the top effectiveness criteria, and the corresponding key performance indicators (KPIs) that can be measured to understand the level of advancement of the metadata management initiative in the company. In order to contrast the findings and provide a basis of comparison, the research also includes views from information managers dealing with core structured data stored in ERPs and other databases. In addition, the results include the basic criteria that can be used to evaluate metrics in order to classify a metric as a KPI.
Abstract:
The purpose of this study is to examine what kind of effect corporate image has on a company's recruitment process. Based on the corporate image and recruitment literature, a framework was created for the study, and the relationship between corporate image and the recruitment process was analyzed against it. The empirical part of the study was carried out as a case study using qualitative research methods. The case company was Gigantti Oy Ab, and data were collected through semi-structured theme interviews in three Gigantti stores and at the head office. The results show that corporate image affects how attractive the company is to applicants, and a positive corporate image increases the number of applicants. The way a company conducts its recruitment also affects its corporate image, since it shapes applicants' perceptions of the company.
Abstract:
The purpose of the study was to explore how a public IT-services-transferor organization, comprised of autonomous entities, can effectively develop and organize its data center cost recovery mechanisms in a fair manner. The lack of a well-defined model for charges and a cost recovery scheme could cause various problems; for example, one entity may end up subsidizing the costs of another. Transfer pricing is in the best interest of each autonomous entity in a CCA. While transfer pricing plays a pivotal role in setting the prices of services and intangible assets, TCE focuses on the arrangement at the boundary between entities. TCE is concerned with the cost, autonomy, and cooperation issues of an organization; the theory considers the factors that influence intra-firm transaction costs and attempts to expose the problems involved in determining the charges or prices of transactions. This study was carried out as a single case study in a public organization. The organization intended to transfer the IT services of its own affiliated public entities and was in the process of establishing a municipal joint data center. Nine semi-structured interviews, including two pilot interviews, were conducted with experts and managers of the case company and its affiliated entities. The purpose of these interviews was to explore the charging and pricing issues of intra-firm transactions. In order to process and summarize the findings, the study employed qualitative techniques with multiple methods of data collection. By reviewing TCE theory and a sample of the transfer pricing literature, the study created an IT services pricing framework as a conceptual tool for illustrating the structure of transferring costs. Antecedents and consequences of the transfer price based on TCE were developed. An explanatory fair-charging model was eventually developed and suggested. The findings of the study suggest that a chargeback system is an inappropriate scheme for an organization with affiliated autonomous entities. The main contribution of the study is the application of TP methodologies in the public sphere without consideration of tax issues.