181 results for Datavetenskap (datalogi)


Relevance: 10.00%

Publisher:

Abstract:

In this thesis a manifold learning method is applied to the problem of WLAN positioning and automatic radio map creation. Due to the nature of WLAN signal strength measurements, a signal map created from raw measurements results in non-linear distance relations between measurement points. These signal strength vectors reside in a high-dimensional coordinate system. With the help of the so-called Isomap algorithm the dimensionality of this map can be reduced, and the map thus processed more easily. By embedding position-labeled strategic key points, we can automatically adjust the mapping to match the surveyed environment. The environment is thus learned in a semi-supervised way; gathering training points and embedding them in a two-dimensional manifold gives us a rough mapping of the measured environment. After a calibration phase, where the labeled key points in the training data are used to associate coordinates in the manifold representation with geographical locations, we can perform positioning using the adjusted map. This can be achieved through a traditional supervised learning process, which in our case is a simple nearest-neighbor matching of a sampled signal strength vector. We deployed this system at two locations on the Kumpula campus in Helsinki, Finland. Results indicate that positioning based on the learned radio map can achieve good accuracy, especially in hallways or other areas where the WLAN signal is constrained by obstacles such as walls.
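
The pipeline described lends itself to a compact sketch. The following is a minimal illustration of the idea, assuming scikit-learn and NumPy; the data, variable names, and parameters are hypothetical stand-ins, not taken from the thesis:

```python
# Sketch: embed raw signal-strength vectors with Isomap, align the
# embedding to floor coordinates via labelled key points, then position
# new scans by nearest-neighbour matching. All data here is synthetic.
import numpy as np
from sklearn.manifold import Isomap
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
rss = rng.normal(-70, 10, size=(500, 30))      # 500 scans, 30 access points
key_idx = rng.choice(500, size=20, replace=False)
key_xy = rng.uniform(0, 50, size=(20, 2))      # surveyed (x, y) key points

embedding = Isomap(n_neighbors=10, n_components=2).fit_transform(rss)

# Calibration: least-squares affine map from manifold coordinates to
# geographical coordinates, fitted on the position-labelled key points.
A = np.hstack([embedding[key_idx], np.ones((20, 1))])
affine, *_ = np.linalg.lstsq(A, key_xy, rcond=None)
positions = np.hstack([embedding, np.ones((500, 1))]) @ affine

# Positioning: nearest-neighbour match of a sampled vector in signal space.
locator = KNeighborsRegressor(n_neighbors=3).fit(rss, positions)
print(locator.predict(rss[:1]))                # estimated (x, y)
```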

Relevance: 10.00%

Publisher:

Abstract:

Genotypes at different locations in the genome are associated if there is a statistical dependency between them. This thesis presents and compares methods for finding genotype associations between chromosomes. From the available genotype datasets, billions of inter-chromosomal candidate pairs of potentially associated genotypes can be formed. The search task can be divided into three separate parts: choosing a statistic that describes the strength of the association, computing the significance of the result, and selecting the sufficiently significant results. Regarding the choice of statistic and the computation of significance, the thesis presents a couple of traditional measures of allelic association as well as more general independence tests such as the chi-squared test, the G-test, and various testing approaches based on random sampling. In addition, two methods for exact significance computation are proposed: a genotype-specific exact test and a maximum-deviation test. For the selection of significant results, the thesis reviews the Bonferroni correction, which bounds the family-wise error rate, FDR control, which bounds the false discovery rate, and variants of these. Finally, a few of the presented methods are tried out on both artificially generated and real genotype data, and the discovered associations are analysed. The experimental results reveal a set of strongly significant associations between chromosomes. Some of these can be explained by subpopulations within the population, and a few appear to result from markers misplaced in the dataset. A large share of the dependencies is caused by three genomic locations strongly associated with sex. Beyond these, a set of associations remains whose causes are unknown.
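
For illustration, here is a hedged sketch of a single candidate-pair test together with the two correction schemes mentioned, assuming SciPy and statsmodels; the contingency table and the batch of p-values are invented:

```python
# One candidate genotype pair tested for independence with a chi-squared
# test, then a batch of such p-values adjusted by Bonferroni (family-wise
# error rate) and Benjamini-Hochberg (false discovery rate).
import numpy as np
from scipy.stats import chi2_contingency
from statsmodels.stats.multitest import multipletests

# 3x3 table: genotype counts (AA, Aa, aa) at locus 1 vs locus 2.
table = np.array([[120, 60, 15],
                  [ 55, 80, 30],
                  [ 10, 35, 45]])
chi2, p, dof, _ = chi2_contingency(table)

pvals = [p, 0.003, 0.04, 0.20, 0.0004]       # results from several pairs
reject_bonf, *_ = multipletests(pvals, alpha=0.05, method="bonferroni")
reject_fdr, *_ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print(chi2, p, reject_bonf, reject_fdr)
```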

Relevance: 10.00%

Publisher:

Abstract:

After more than half a century of research, natural language processing has become a very important area within computer science. Several scientifically important problems have been solved, and practical applications have reached the software market. Word sense disambiguation means finding the correct sense of an ambiguous word. The context, the surrounding words, and knowledge of the subject domain are factors that can be used to disambiguate a word. Automatic summarization means shortening a text without losing the relevant information. Relevant sentences can be extracted from the text, or a new, shorter text can be generated based on the facts in the original text. The thesis gives a general overview and brief history of language processing and compares some methods for word sense disambiguation and automatic summarization. The similarities and differences between the problem areas are highlighted, and the position of the methods within computer science is discussed.
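
Purely as an illustration of the sentence-extraction flavour of summarization mentioned above (a naive baseline, not a method evaluated in the thesis), assuming nothing beyond the Python standard library:

```python
# Frequency-based extractive summarization: score each sentence by the
# corpus frequency of its words, keep the top-scoring sentences, and
# emit them in their original order. Tokenisation is deliberately crude.
import re
from collections import Counter

def summarize(text, n_sentences=2):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))
    scored = sorted(sentences, reverse=True,
                    key=lambda s: sum(freq[w]
                                      for w in re.findall(r"\w+", s.lower())))
    keep = set(scored[:n_sentences])
    return " ".join(s for s in sentences if s in keep)
```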

Relevance: 10.00%

Publisher:

Abstract:

The core aim of machine learning is to make a computer program learn from experience. Learning from data is usually defined as the task of learning regularities or patterns in data in order to extract useful information, or to learn the underlying concept. An important sub-field of machine learning is called multi-view learning, where the task is to learn from multiple data sets or views describing the same underlying concept. A typical example of such a scenario would be to study a biological concept using several biological measurements such as gene expression, protein expression, and metabolic profiles, or to classify web pages based on their content and the contents of their hyperlinks. In this thesis, novel problem formulations and methods for multi-view learning are presented. The contributions include a linear data fusion approach for exploratory data analysis, a new measure to evaluate different kinds of representations for textual data, and an extension of multi-view learning to novel scenarios where the correspondence of samples in the different views or data sets is not known in advance. In order to infer the one-to-one correspondence of samples between two views, a novel concept of multi-view matching is proposed. The matching algorithm is completely data-driven and is demonstrated in several applications such as matching of metabolites between humans and mice, and matching of sentences between documents in two languages.
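
As an illustration of the matching problem (not the thesis's algorithm, which is fully data-driven), the following sketch learns a shared space with CCA from a few seed pairs of known correspondence and then solves a one-to-one assignment on pairwise distances; it assumes scikit-learn and SciPy, and all data is synthetic:

```python
# Two views of the same 120 samples in different feature spaces; the
# correspondence of the last 100 samples is hidden and then recovered
# by matching in a CCA-learned shared space.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(1)
latent = rng.normal(size=(120, 5))
view_a = latent @ rng.normal(size=(5, 20)) + 0.1 * rng.normal(size=(120, 20))
view_b = latent @ rng.normal(size=(5, 30)) + 0.1 * rng.normal(size=(120, 30))

cca = CCA(n_components=4).fit(view_a[:20], view_b[:20])   # 20 seed pairs
a_c, b_c = cca.transform(view_a[20:], view_b[20:])
perm = rng.permutation(100)                     # hide the true correspondence
row, col = linear_sum_assignment(cdist(a_c, b_c[perm]))
print((perm[col] == np.arange(100)).mean())     # fraction matched correctly
```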

Relevance: 10.00%

Publisher:

Abstract:

As the virtual world grows more complex, finding a standard way of storing data becomes increasingly important. Ideally, each data item would be brought into the computer system only once. References to data items need to be cryptographically verifiable, so that the data can maintain its identity while being passed around. This way there will be only one copy of the user's family photo album, while the user can use multiple tools to show or manipulate the album. Copies of a user's data could be stored on the computers of family members, on some of the user's own computers, and at online services the user relies on. When all actors operate on one replicated copy of the data, the system automatically avoids a single point of failure: the data will not disappear when one computer breaks or one service provider goes out of business. One shared copy also makes it possible to delete a piece of data from all systems at once, at the user's request. In our research we tried to find a model that would make data manageable to users and make it possible to have the same data stored at various locations. We studied three systems, Persona, Freenet, and GNUnet, that suggest different models for protecting user data. The main application areas of the systems studied include securing online social networks, providing anonymous web access, and preventing censorship in file sharing. Each of the systems studied stores user data on machines belonging to third parties. The systems differ in the measures they take to protect their users from data loss, forged information, censorship, and monitoring. All of the systems use cryptography to secure the names used for content and to protect the data from outsiders. Based on the knowledge gained, we built a prototype platform called Peerscape, which stores user data in a synchronized, protected database. Data items themselves are protected with cryptography against forgery, but not encrypted, as the focus has been on disseminating the data directly among family and friends instead of letting third parties store the information. We turned the synchronizing database into a peer-to-peer web by exposing its contents through an integrated HTTP server. The REST-like HTTP API supports the development of applications in JavaScript. To evaluate the platform's suitability for application development we wrote some simple applications, including a public chat room, a BitTorrent site, and a flower-growing game. During our early tests we came to the conclusion that using the platform for simple applications works well. As web standards develop further, writing applications for the platform should become easier. Any system this complex will have its problems, and we do not expect our platform to replace the existing web, but we are fairly impressed with the results and consider our work important from the perspective of managing user data.
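
A minimal sketch of cryptographically verifiable references in this spirit, assuming the third-party `cryptography` package; Peerscape's actual data formats and key handling are not shown, and the item below is invented:

```python
# A data item is referenced by the hash of its content and signed, so a
# replica fetched from any machine can be checked for forgery without
# trusting its host. Signing protects integrity, not confidentiality,
# matching the "protected but not encrypted" design described above.
import hashlib
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

key = Ed25519PrivateKey.generate()
item = b'{"album": "family photos", "items": ["IMG_001.jpg"]}'

ref = hashlib.sha256(item).hexdigest()   # content-addressed reference
sig = key.sign(item)                     # forgery protection

def verify(data, signature, public_key, expected_ref):
    # Accept a replica only if both the hash and the signature match.
    if hashlib.sha256(data).hexdigest() != expected_ref:
        return False
    try:
        public_key.verify(signature, data)
        return True
    except InvalidSignature:
        return False

print(verify(item, sig, key.public_key(), ref))  # True
```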

Relevance: 10.00%

Publisher:

Abstract:

A distributed system is a collection of networked autonomous processing units which must work in a cooperative manner. Currently, large-scale distributed systems, such as various telecommunication and computer networks, are abundant and used in a multitude of tasks. The field of distributed computing studies what can be computed efficiently in such systems. Distributed systems are usually modelled as graphs where nodes represent the processors and edges denote communication links between processors. This thesis concentrates on the computational complexity of the distributed graph colouring problem. The objective of the graph colouring problem is to assign a colour to each node in such a way that no two nodes connected by an edge share the same colour. In particular, it is often desirable to use only a small number of colours. This task is a fundamental symmetry-breaking primitive in various distributed algorithms. A graph that has been coloured in this manner using at most k different colours is said to be k-coloured. This work examines the synchronous message-passing model of distributed computation: every node runs the same algorithm, and the system operates in discrete synchronous communication rounds. During each round, a node can communicate with its neighbours and perform local computation. In this model, the time complexity of a problem is the number of synchronous communication rounds required to solve it. It is known that 3-colouring any k-coloured directed cycle requires at least ½(log* k - 3) communication rounds and is possible in ½(log* k + 7) communication rounds for all k ≥ 3. This work shows that for any k ≥ 3, colouring a k-coloured directed cycle with at most three colours is possible in ½(log* k + 3) rounds. In contrast, it is also shown that for some values of k, colouring a directed cycle with at most three colours requires at least ½(log* k + 1) communication rounds. Furthermore, in the case of directed rooted trees, reducing a k-colouring into a 3-colouring requires at least log* k + 1 rounds for some k and is possible in log* k + 3 rounds for all k ≥ 3. The new positive and negative results are derived using computational methods, as the existence of distributed colouring algorithms corresponds to the colourability of so-called neighbourhood graphs. The colourability of these graphs is analysed using Boolean satisfiability (SAT) solvers. Finally, this thesis shows that similar methods are applicable in capturing the existence of distributed algorithms for other graph problems, such as the maximal matching problem.
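
The log*-round bounds refer to iterated colour-reduction rounds of the kind pioneered by Cole and Vishkin. As background only (the thesis derives its exact algorithms via SAT solving, not shown here), a sketch of one classic reduction round on a directed cycle: each node keeps just the index and value of the lowest bit in which its colour differs from its successor's, which drives any k-colouring down to at most 6 colours; the final reduction to 3 colours takes a few extra rounds not sketched here.

```python
def cv_round(colours):
    """One synchronous Cole-Vishkin round: all nodes recolour at once."""
    n = len(colours)
    out = []
    for v in range(n):
        own, succ = colours[v], colours[(v + 1) % n]
        diff = own ^ succ                       # nonzero in a proper colouring
        i = (diff & -diff).bit_length() - 1     # lowest differing bit index
        out.append(2 * i + ((own >> i) & 1))    # new colour: index + bit value
    return out

colours = [5, 12, 7, 3, 9, 0]                   # a properly coloured 6-cycle
while max(colours) > 5:                         # iterate down to <= 6 colours
    colours = cv_round(colours)
print(colours)                                  # still a proper colouring
```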

Relevance: 10.00%

Publisher:

Abstract:

Modern smart phones often come with a significant amount of computational power and an integrated digital camera, making them an ideal platform for intelligent assistants. This work is restricted to retail environments, where users could be provided with, for example, navigational instructions to desired products or information about special offers in their close proximity. Applications of this kind usually require information about the user's current location in the domain environment, which in our case is a retail store. We propose a vision-based positioning approach that recognizes the products the user's mobile phone camera is currently pointing at. The products are related to locations within the store, which enables us to locate the user by pointing the mobile phone's camera at a group of products. The first step of our method is to extract meaningful features from digital images. We use the Scale-Invariant Feature Transform (SIFT) algorithm, which extracts features that are highly distinctive in the sense that they can be correctly matched against a large database of features from many images. We collect a comprehensive set of images from all meaningful locations within our domain and extract the SIFT features from each of these images. As the SIFT features are of high dimensionality, and comparing individual features is thus infeasible, we apply the Bags of Keypoints method, which creates a generic representation, a visual category, from all features extracted from images taken at a specific location. The category for an unseen image can be deduced by extracting the corresponding SIFT features and choosing the category that best fits them. We have applied the proposed method in a Finnish supermarket. We consider grocery shelves as categories, which is a sufficient level of accuracy to help users navigate or to provide useful information about nearby products. We achieve a 40% accuracy, which is quite low for commercial applications, while significantly outperforming the random-guess baseline. Our results suggest that the accuracy of the classification could be increased with a deeper analysis of the domain and by combining existing positioning methods with ours.
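
The described pipeline might look roughly as follows, assuming OpenCV (cv2) and scikit-learn; the image paths, shelf labels, and codebook size are placeholders, and this is a structural sketch rather than the thesis's implementation:

```python
# SIFT features quantised into visual words with k-means, one histogram
# per image, and a linear classifier over the histograms.
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

sift = cv2.SIFT_create()

def sift_descriptors(path):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, desc = sift.detectAndCompute(img, None)
    return desc

def build_locator(train_paths, train_labels, n_words=200):
    """Train a visual-word codebook and a shelf-category classifier."""
    descs = [sift_descriptors(p) for p in train_paths]
    codebook = KMeans(n_clusters=n_words, n_init=4).fit(np.vstack(descs))

    def bag(desc):  # normalised visual-word histogram for one image
        hist = np.bincount(codebook.predict(desc), minlength=n_words)
        return hist / hist.sum()

    clf = LinearSVC().fit([bag(d) for d in descs], train_labels)
    return lambda path: clf.predict([bag(sift_descriptors(path))])[0]

# locate = build_locator(shelf_images, shelf_labels)  # hypothetical data
# print(locate("query_photo.jpg"))                    # predicted shelf
```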

Relevance: 10.00%

Publisher:

Abstract:

Reorganizing a dataset so that its hidden structure can be observed is useful in any data analysis task. For example, detecting a regularity in a dataset helps us to interpret the data, compress the data, and explain the processes behind the data. We study datasets that come in the form of binary matrices (tables with 0s and 1s). Our goal is to develop automatic methods that bring out certain patterns by permuting the rows and columns. We concentrate on the following patterns in binary matrices: consecutive-ones (C1P), simultaneous consecutive-ones (SC1P), nestedness, k-nestedness, and bandedness. These patterns reflect specific types of interplay and variation between the rows and columns, such as continuity and hierarchies. Furthermore, their combinatorial properties are interlinked, which helps us to develop the theory of binary matrices and efficient algorithms. Indeed, we can detect all these patterns in a binary matrix efficiently, that is, in polynomial time in the size of the matrix. Since real-world datasets often contain noise and errors, we rarely witness perfect patterns. Therefore we also need to assess how far an input matrix is from a pattern: we count the number of flips (from 0s to 1s or vice versa) needed to bring out the perfect pattern in the matrix. Unfortunately, for most patterns it is an NP-complete problem to find the minimum distance to a matrix that has the perfect pattern, which means that the existence of a polynomial-time algorithm is unlikely. To find patterns in datasets with noise, we need methods that are noise-tolerant and work in practical time with large datasets. The theory of binary matrices gives rise to robust heuristics that have good performance with synthetic data and discover easily interpretable structures in real-world datasets: dialectal variation in spoken Finnish, a division of European locations by the hierarchies found in mammal occurrences, and co-occurring groups in network data. In addition to determining the distance from a dataset to a pattern, we need to determine whether the pattern is significant or merely a product of random chance. To this end, we use significance testing: we deem a dataset significant if it appears exceptional when compared to datasets generated from a certain null hypothesis. After detecting a significant pattern in a dataset, it is up to domain experts to interpret the results in terms of the application.
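
As a small concrete instance of one of the listed patterns, here is a sketch that tests whether a binary matrix is nested, i.e. whether the rows' sets of 1-columns form a chain under inclusion once rows are sorted by row sum; it runs in polynomial time, as the text notes, but has no noise tolerance:

```python
import numpy as np

def is_nested(m):
    """True iff each row's 1-columns contain those of every sparser row."""
    rows = sorted(map(frozenset, (np.flatnonzero(r) for r in m)),
                  key=len, reverse=True)
    return all(b <= a for a, b in zip(rows, rows[1:]))

nested = np.array([[1, 1, 1, 1],
                   [1, 1, 1, 0],
                   [1, 0, 1, 0],
                   [1, 0, 0, 0]])
print(is_nested(nested))                      # True
print(is_nested(np.array([[1, 0], [0, 1]])))  # False
```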

Relevance: 10.00%

Publisher:

Abstract:

Gene mapping is a systematic search for genes that affect observable characteristics of an organism. In this thesis we offer computational tools to improve the efficiency of (disease) gene-mapping efforts. In the first part of the thesis we propose an efficient simulation procedure for generating realistic genetic data from isolated populations. Simulated data is useful for evaluating hypothesised gene-mapping study designs and computational analysis tools. As an example of such evaluation, we demonstrate how a population-based study design can be a powerful alternative to traditional family-based designs in association-based gene-mapping projects. In the second part of the thesis we consider the prioritisation of a (typically large) set of putative disease-associated genes acquired from an initial gene-mapping analysis. Prioritisation is necessary to be able to focus on the most promising candidates. We show how to harness current biomedical knowledge for the prioritisation task by integrating various publicly available biological databases into a weighted biological graph. We then demonstrate how to find and evaluate connections between entities, such as genes and diseases, from this unified schema by graph mining techniques. Finally, in the last part of the thesis, we define the concept of a reliable subgraph and the corresponding subgraph extraction problem. Reliable subgraphs concisely describe strong and independent connections between two given vertices in a random graph, and hence they are especially useful for visualising such connections. We propose novel algorithms for extracting reliable subgraphs from large random graphs. The efficiency and scalability of the proposed graph mining methods are backed by extensive experiments on real data. While our application focus is in genetics, the concepts and algorithms can be applied to other domains as well. We demonstrate this generality by considering co-author graphs in addition to biological graphs in the experiments.
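
As a toy illustration of querying such a unified graph (a stand-in for the idea, not the thesis's reliable-subgraph algorithms), the sketch below treats edge weights as probabilities and finds the most probable gene-disease connection as a shortest path under -log costs; it assumes networkx, and the entities and weights are invented:

```python
import math
import networkx as nx

G = nx.Graph()
edges = [("GENE1", "proteinA", 0.9), ("proteinA", "pathwayX", 0.7),
         ("pathwayX", "diseaseD", 0.8), ("GENE1", "articleY", 0.5),
         ("articleY", "diseaseD", 0.4)]
for u, v, p in edges:
    G.add_edge(u, v, cost=-math.log(p))   # product of probs -> sum of costs

path = nx.shortest_path(G, "GENE1", "diseaseD", weight="cost")
prob = math.exp(-nx.path_weight(G, path, weight="cost"))
print(path, round(prob, 3))               # best single connection
```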

Relevance: 10.00%

Publisher:

Abstract:

A significant share of an algorithm's running time is often spent moving data between the levels of the memory hierarchy. The problem is pronounced for search structures, as they process large amounts of data. The goal of this work is to determine the practical benefits achievable for search trees by minimizing memory transfers. A second goal is to chart the advantages and weaknesses of cache-oblivious search trees, based on the ideal-cache model, relative to cache-aware search trees based on the external-memory model. Cache-obliviousness means that the algorithm does not know the parameters of the underlying design model, such as the size of the cache. Among static search trees, the work examines binary search trees stored in breadth-first order, in preorder, and in van Emde Boas order, as well as a static B-tree. Among dynamic search trees, the B+-tree and a cache-oblivious B-tree, the COB-tree, are covered. Both cache-oblivious and cache-aware search trees aim to minimize the number of required memory transfers by improving the locality of computation. The practical speed of the search trees under consideration is tested extensively. Based on the results, both static and dynamic cache-oblivious search trees match the speed of the corresponding cache-aware search trees on random operations. They fall somewhat behind, however, when performing sequential operations.
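
The van Emde Boas layout mentioned above can be sketched compactly: split a complete binary tree of height h into a top subtree of height h/2 and its bottom subtrees, lay out the top recursively, then each bottom subtree recursively. A minimal sketch in plain Python, assuming heap numbering (root 1, children 2v and 2v+1):

```python
def veb_order(root, height):
    """Nodes of a complete binary tree in van Emde Boas storage order,
    which keeps any root-to-leaf path cache-friendly at every block size."""
    if height == 1:
        return [root]
    top_h = height // 2
    order = veb_order(root, top_h)          # lay out the top subtree first
    bottoms = [root]                        # descend to bottom-subtree roots
    for _ in range(top_h):
        bottoms = [c for v in bottoms for c in (2 * v, 2 * v + 1)]
    for r in bottoms:                       # then each bottom subtree
        order += veb_order(r, height - top_h)
    return order

print(veb_order(1, 4))
# [1, 2, 3, 4, 8, 9, 5, 10, 11, 6, 12, 13, 7, 14, 15]
```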

Relevance: 10.00%

Publisher:

Abstract:

This thesis surveys, based on the literature, commonly used spam filtering methods, and presents a system that applies them. The methods covered include, for example, DNS blacklists, collaborative techniques, and greylisting. Content-based methods, in particular Bayesian classification and logistic regression, are examined in more detail. The thesis also reviews legislation restricting spamming and discusses what kinds of measures would lead to the best overall outcome. In the experimental part of the work, logistic regression and Bayesian classification are compared for spam detection in a realistic experimental setup using a genuine e-mail corpus as data. The most important experiment-based conclusions are that detection based on logistic regression would, in terms of classification results, complement a spam filtering system excellently alongside a Bayesian classifier, but as a method it is too slow due to the I/O load caused by database fetches. Furthermore, it is noted that an even more important factor than the classification method in a system using learning-based spam detection may be the data fed to the classifier; ensuring its quality deserves particular attention in a multi-user spam filtering system in which all users' messages are classified based on the same data.
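
The shape of the experimental comparison can be sketched as follows, assuming scikit-learn; the six messages stand in for the genuine corpus, and MultinomialNB stands in for the Bayesian classifier:

```python
# Same bag-of-words features fed to a naive Bayes classifier and to
# logistic regression, as in the comparison described above.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

mails = ["cheap pills buy now", "meeting agenda attached",
         "win money now claim prize", "lunch tomorrow?",
         "free offer buy cheap", "project status report"]
labels = [1, 0, 1, 0, 1, 0]                 # 1 = spam, 0 = ham

X = CountVectorizer().fit_transform(mails)
for clf in (MultinomialNB(), LogisticRegression(max_iter=1000)):
    clf.fit(X, labels)
    print(type(clf).__name__, clf.predict(X))
```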

Relevance: 10.00%

Publisher:

Abstract:

Software development has been performed in a somewhat systematic way for half a century. Despite this, serious failures in software development projects still occur. The pertinent mission of software project management is to achieve more and more successful projects. The application of agile software methods and, more recently, the integration of Lean practices contribute to this trend of continuous improvement in the software industry. One area warranting proper empirical evidence is the operational efficiency of projects. In the field of software development, Kanban as a process management method has gained momentum recently, mostly due to its linkage to Lean thinking. However, only a few empirical studies investigate the impact of Kanban on projects in that particular area. The aim of this doctoral thesis is to improve the understanding of how Kanban impacts software development projects. The research is carried out in the area of Lean thinking, which contains a variety of concepts including Kanban. This article-based thesis conducts a set of case studies, expanded with a quasi-controlled experiment as a research strategy. The data-gathering techniques of interviews, questionnaires, and different types of observation are used to study the case projects and thereby to understand the impact of Kanban on software development projects. The research papers of the thesis are refereed international journal and conference publications. The results highlight new findings regarding the application of Kanban in the software context. The key findings of the thesis suggest that Kanban is applicable to software development. Despite the several benefits reported in this thesis, the empirical evidence implies that Kanban is not all-encompassing but requires additional practices to keep development projects performing appropriately. Implications for research are given as well. In addition, the thesis contributes to the area of plan-driven software development by suggesting implications for both researchers and practitioners. In conclusion, Kanban can benefit software development projects, but additional practices would increase its potential for the projects.

Relevance: 10.00%

Publisher:

Abstract:

Bayesian networks are compact, flexible, and interpretable representations of a joint distribution. When the network structure is unknown but there are observational data at hand, one can try to learn the network structure. This is called structure discovery. This thesis contributes to two areas of structure discovery in Bayesian networks: space-time tradeoffs and learning ancestor relations. The fastest exact algorithms for structure discovery in Bayesian networks are based on dynamic programming and use excessive amounts of space. Motivated by the space usage, several schemes for trading space against time are presented. These schemes are presented in a general setting for a class of computational problems called permutation problems; structure discovery in Bayesian networks is seen as a challenging variant of the permutation problems. The main contribution in the area of space-time tradeoffs is the partial order approach, in which the standard dynamic programming algorithm is extended to run over partial orders. In particular, a certain family of partial orders called parallel bucket orders is considered. A partial order scheme that provably yields an optimal space-time tradeoff within parallel bucket orders is presented. Practical issues concerning parallel bucket orders are also discussed. Learning ancestor relations, that is, directed paths between nodes, is motivated by the need for robust summaries of the network structures when there are unobserved nodes at work. Ancestor relations are nonmodular features and hence harder to learn than modular features. A dynamic programming algorithm is presented for computing posterior probabilities of ancestor relations exactly. Empirical tests suggest that ancestor relations can be learned from observational data almost as accurately as arcs, even in the presence of unobserved nodes.
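
The dynamic programme underlying exact structure discovery (and whose exponential-size table motivates the space-time tradeoffs above) can be sketched over bitmask subsets: the best network over a set S is obtained by choosing the best "sink" of S, scored against parents drawn from the rest of S. The sketch below uses a placeholder local score; in practice the scores (e.g. BDeu) are computed from data:

```python
from functools import lru_cache
from itertools import combinations

N = 4  # nodes 0..3; node sets are represented as bitmasks

def local_score(v, parents):
    return -abs(v - len(parents))           # placeholder scoring function

def best_parent_score(v, candidates):
    # Best local score of v over all parent sets drawn from candidates.
    return max(local_score(v, set(c))
               for k in range(len(candidates) + 1)
               for c in combinations(candidates, k))

@lru_cache(maxsize=None)                    # the table that eats the space
def best_score(mask):
    if mask == 0:
        return 0.0
    nodes = [u for u in range(N) if mask & (1 << u)]
    return max(best_score(mask ^ (1 << v)) +
               best_parent_score(v, tuple(u for u in nodes if u != v))
               for v in nodes)

print(best_score((1 << N) - 1))             # score of an optimal network
```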