39 resultados para tietojenkäsittelytiede
Resumo:
Metabolism is the cellular subsystem responsible for generation of energy from nutrients and production of building blocks for larger macromolecules. Computational and statistical modeling of metabolism is vital to many disciplines including bioengineering, the study of diseases, drug target identification, and understanding the evolution of metabolism. In this thesis, we propose efficient computational methods for metabolic modeling. The techniques presented are targeted particularly at the analysis of large metabolic models encompassing the whole metabolism of one or several organisms. We concentrate on three major themes of metabolic modeling: metabolic pathway analysis, metabolic reconstruction and the study of evolution of metabolism. In the first part of this thesis, we study metabolic pathway analysis. We propose a novel modeling framework called gapless modeling to study biochemically viable metabolic networks and pathways. In addition, we investigate the utilization of atom-level information on metabolism to improve the quality of pathway analyses. We describe efficient algorithms for discovering both gapless and atom-level metabolic pathways, and conduct experiments with large-scale metabolic networks. The presented gapless approach offers a compromise in terms of complexity and feasibility between the previous graph-theoretic and stoichiometric approaches to metabolic modeling. Gapless pathway analysis shows that microbial metabolic networks are not as robust to random damage as suggested by previous studies. Furthermore the amino acid biosynthesis pathways of the fungal species Trichoderma reesei discovered from atom-level data are shown to closely correspond to those of Saccharomyces cerevisiae. In the second part, we propose computational methods for metabolic reconstruction in the gapless modeling framework. We study the task of reconstructing a metabolic network that does not suffer from connectivity problems. Such problems often limit the usability of reconstructed models, and typically require a significant amount of manual postprocessing. We formulate gapless metabolic reconstruction as an optimization problem and propose an efficient divide-and-conquer strategy to solve it with real-world instances. We also describe computational techniques for solving problems stemming from ambiguities in metabolite naming. These techniques have been implemented in a web-based sofware ReMatch intended for reconstruction of models for 13C metabolic flux analysis. In the third part, we extend our scope from single to multiple metabolic networks and propose an algorithm for inferring gapless metabolic networks of ancestral species from phylogenetic data. Experimenting with 16 fungal species, we show that the method is able to generate results that are easily interpretable and that provide hypotheses about the evolution of metabolism.
Resumo:
Large-scale chromosome rearrangements such as copy number variants (CNVs) and inversions encompass a considerable proportion of the genetic variation between human individuals. In a number of cases, they have been closely linked with various inheritable diseases. Single-nucleotide polymorphisms (SNPs) are another large part of the genetic variance between individuals. They are also typically abundant and their measuring is straightforward and cheap. This thesis presents computational means of using SNPs to detect the presence of inversions and deletions, a particular variety of CNVs. Technically, the inversion-detection algorithm detects the suppressed recombination rate between inverted and non-inverted haplotype populations whereas the deletion-detection algorithm uses the EM-algorithm to estimate the haplotype frequencies of a window with and without a deletion haplotype. As a contribution to population biology, a coalescent simulator for simulating inversion polymorphisms has been developed. Coalescent simulation is a backward-in-time method of modelling population ancestry. Technically, the simulator also models multiple crossovers by using the Counting model as the chiasma interference model. Finally, this thesis includes an experimental section. The aforementioned methods were tested on synthetic data to evaluate their power and specificity. They were also applied to the HapMap Phase II and Phase III data sets, yielding a number of candidates for previously unknown inversions, deletions and also correctly detecting known such rearrangements.
Resumo:
Topic detection and tracking (TDT) is an area of information retrieval research the focus of which revolves around news events. The problems TDT deals with relate to segmenting news text into cohesive stories, detecting something new, previously unreported, tracking the development of a previously reported event, and grouping together news that discuss the same event. The performance of the traditional information retrieval techniques based on full-text similarity has remained inadequate for online production systems. It has been difficult to make the distinction between same and similar events. In this work, we explore ways of representing and comparing news documents in order to detect new events and track their development. First, however, we put forward a conceptual analysis of the notions of topic and event. The purpose is to clarify the terminology and align it with the process of news-making and the tradition of story-telling. Second, we present a framework for document similarity that is based on semantic classes, i.e., groups of words with similar meaning. We adopt people, organizations, and locations as semantic classes in addition to general terms. As each semantic class can be assigned its own similarity measure, document similarity can make use of ontologies, e.g., geographical taxonomies. The documents are compared class-wise, and the outcome is a weighted combination of class-wise similarities. Third, we incorporate temporal information into document similarity. We formalize the natural language temporal expressions occurring in the text, and use them to anchor the rest of the terms onto the time-line. Upon comparing documents for event-based similarity, we look not only at matching terms, but also how near their anchors are on the time-line. Fourth, we experiment with an adaptive variant of the semantic class similarity system. The news reflect changes in the real world, and in order to keep up, the system has to change its behavior based on the contents of the news stream. We put forward two strategies for rebuilding the topic representations and report experiment results. We run experiments with three annotated TDT corpora. The use of semantic classes increased the effectiveness of topic tracking by 10-30\% depending on the experimental setup. The gain in spotting new events remained lower, around 3-4\%. The anchoring the text to a time-line based on the temporal expressions gave a further 10\% increase the effectiveness of topic tracking. The gains in detecting new events, again, remained smaller. The adaptive systems did not improve the tracking results.
Resumo:
Telecommunications network management is based on huge amounts of data that are continuously collected from elements and devices from all around the network. The data is monitored and analysed to provide information for decision making in all operation functions. Knowledge discovery and data mining methods can support fast-pace decision making in network operations. In this thesis, I analyse decision making on different levels of network operations. I identify the requirements decision-making sets for knowledge discovery and data mining tools and methods, and I study resources that are available to them. I then propose two methods for augmenting and applying frequent sets to support everyday decision making. The proposed methods are Comprehensive Log Compression for log data summarisation and Queryable Log Compression for semantic compression of log data. Finally I suggest a model for a continuous knowledge discovery process and outline how it can be implemented and integrated to the existing network operations infrastructure.
Resumo:
In this Thesis, we develop theory and methods for computational data analysis. The problems in data analysis are approached from three perspectives: statistical learning theory, the Bayesian framework, and the information-theoretic minimum description length (MDL) principle. Contributions in statistical learning theory address the possibility of generalization to unseen cases, and regression analysis with partially observed data with an application to mobile device positioning. In the second part of the Thesis, we discuss so called Bayesian network classifiers, and show that they are closely related to logistic regression models. In the final part, we apply the MDL principle to tracing the history of old manuscripts, and to noise reduction in digital signals.
Resumo:
Reuse of existing carefully designed and tested software improves the quality of new software systems and reduces their development costs. Object-oriented frameworks provide an established means for software reuse on the levels of both architectural design and concrete implementation. Unfortunately, due to frame-works complexity that typically results from their flexibility and overall abstract nature, there are severe problems in using frameworks. Patterns are generally accepted as a convenient way of documenting frameworks and their reuse interfaces. In this thesis it is argued, however, that mere static documentation is not enough to solve the problems related to framework usage. Instead, proper interactive assistance tools are needed in order to enable system-atic framework-based software production. This thesis shows how patterns that document a framework s reuse interface can be represented as dependency graphs, and how dynamic lists of programming tasks can be generated from those graphs to assist the process of using a framework to build an application. This approach to framework specialization combines the ideas of framework cookbooks and task-oriented user interfaces. Tasks provide assistance in (1) cre-ating new code that complies with the framework reuse interface specification, (2) assuring the consistency between existing code and the specification, and (3) adjusting existing code to meet the terms of the specification. Besides illustrating how task-orientation can be applied in the context of using frameworks, this thesis describes a systematic methodology for modeling any framework reuse interface in terms of software patterns based on dependency graphs. The methodology shows how framework-specific reuse interface specifi-cations can be derived from a library of existing reusable pattern hierarchies. Since the methodology focuses on reusing patterns, it also alleviates the recog-nized problem of framework reuse interface specification becoming complicated and unmanageable for frameworks of realistic size. The ideas and methods proposed in this thesis have been tested through imple-menting a framework specialization tool called JavaFrames. JavaFrames uses role-based patterns that specify a reuse interface of a framework to guide frame-work specialization in a task-oriented manner. This thesis reports the results of cases studies in which JavaFrames and the hierarchical framework reuse inter-face modeling methodology were applied to the Struts web application frame-work and the JHotDraw drawing editor framework.
Resumo:
Event-based systems are seen as good candidates for supporting distributed applications in dynamic and ubiquitous environments because they support decoupled and asynchronous many-to-many information dissemination. Event systems are widely used, because asynchronous messaging provides a flexible alternative to RPC (Remote Procedure Call). They are typically implemented using an overlay network of routers. A content-based router forwards event messages based on filters that are installed by subscribers and other routers. The filters are organized into a routing table in order to forward incoming events to proper subscribers and neighbouring routers. This thesis addresses the optimization of content-based routing tables organized using the covering relation and presents novel data structures and configurations for improving local and distributed operation. Data structures are needed for organizing filters into a routing table that supports efficient matching and runtime operation. We present novel results on dynamic filter merging and the integration of filter merging with content-based routing tables. In addition, the thesis examines the cost of client mobility using different protocols and routing topologies. We also present a new matching technique called temporal subspace matching. The technique combines two new features. The first feature, temporal operation, supports notifications, or content profiles, that persist in time. The second feature, subspace matching, allows more expressive semantics, because notifications may contain intervals and be defined as subspaces of the content space. We also present an application of temporal subspace matching pertaining to metadata-based continuous collection and object tracking.
Resumo:
The ever expanding growth of the wireless access to the Internet in recent years has led to the proliferation of wireless and mobile devices to connect to the Internet. This has created the possibility of mobile devices equipped with multiple radio interfaces to connect to the Internet using any of several wireless access network technologies such as GPRS, WLAN and WiMAX in order to get the connectivity best suited for the application. These access networks are highly heterogeneous and they vary widely in their characteristics such as bandwidth, propagation delay and geographical coverage. The mechanism by which a mobile device switches between these access networks during an ongoing connection is referred to as vertical handoff and it often results in an abrupt and significant change in the access link characteristics. The most common Internet applications such as Web browsing and e-mail make use of the Transmission Control Protocol (TCP) as their transport protocol and the behaviour of TCP depends on the end-to-end path characteristics such as bandwidth and round-trip time (RTT). As the wireless access link is most likely the bottleneck of a TCP end-to-end path, the abrupt changes in the link characteristics due to a vertical handoff may affect TCP behaviour adversely degrading the performance of the application. The focus of this thesis is to study the effect of a vertical handoff on TCP behaviour and to propose algorithms that improve the handoff behaviour of TCP using cross-layer information about the changes in the access link characteristics. We begin this study by identifying the various problems of TCP due to a vertical handoff based on extensive simulation experiments. We use this study as a basis to develop cross-layer assisted TCP algorithms in handoff scenarios involving GPRS and WLAN access networks. We then extend the scope of the study by developing cross-layer assisted TCP algorithms in a broader context applicable to a wide range of bandwidth and delay changes during a handoff. And finally, the algorithms developed here are shown to be easily extendable to the multiple-TCP flow scenario. We evaluate the proposed algorithms by comparison with standard TCP (TCP SACK) and show that the proposed algorithms are effective in improving TCP behavior in vertical handoff involving a wide range of bandwidth and delay of the access networks. Our algorithms are easy to implement in real systems and they involve modifications to the TCP sender algorithm only. The proposed algorithms are conservative in nature and they do not adversely affect the performance of TCP in the absence of cross-layer information.
Resumo:
Sensor networks represent an attractive tool to observe the physical world. Networks of tiny sensors can be used to detect a fire in a forest, to monitor the level of pollution in a river, or to check on the structural integrity of a bridge. Application-specific deployments of static-sensor networks have been widely investigated. Commonly, these networks involve a centralized data-collection point and no sharing of data outside the organization that owns it. Although this approach can accommodate many application scenarios, it significantly deviates from the pervasive computing vision of ubiquitous sensing where user applications seamlessly access anytime, anywhere data produced by sensors embedded in the surroundings. With the ubiquity and ever-increasing capabilities of mobile devices, urban environments can help give substance to the ubiquitous sensing vision through Urbanets, spontaneously created urban networks. Urbanets consist of mobile multi-sensor devices, such as smart phones and vehicular systems, public sensor networks deployed by municipalities, and individual sensors incorporated in buildings, roads, or daily artifacts. My thesis is that "multi-sensor mobile devices can be successfully programmed to become the underpinning elements of an open, infrastructure-less, distributed sensing platform that can bring sensor data out of their traditional close-loop networks into everyday urban applications". Urbanets can support a variety of services ranging from emergency and surveillance to tourist guidance and entertainment. For instance, cars can be used to provide traffic information services to alert drivers to upcoming traffic jams, and phones to provide shopping recommender services to inform users of special offers at the mall. Urbanets cannot be programmed using traditional distributed computing models, which assume underlying networks with functionally homogeneous nodes, stable configurations, and known delays. Conversely, Urbanets have functionally heterogeneous nodes, volatile configurations, and unknown delays. Instead, solutions developed for sensor networks and mobile ad hoc networks can be leveraged to provide novel architectures that address Urbanet-specific requirements, while providing useful abstractions that hide the network complexity from the programmer. This dissertation presents two middleware architectures that can support mobile sensing applications in Urbanets. Contory offers a declarative programming model that views Urbanets as a distributed sensor database and exposes an SQL-like interface to developers. Context-aware Migratory Services provides a client-server paradigm, where services are capable of migrating to different nodes in the network in order to maintain a continuous and semantically correct interaction with clients. Compared to previous approaches to supporting mobile sensing urban applications, our architectures are entirely distributed and do not assume constant availability of Internet connectivity. In addition, they allow on-demand collection of sensor data with the accuracy and at the frequency required by every application. These architectures have been implemented in Java and tested on smart phones. They have proved successful in supporting several prototype applications and experimental results obtained in ad hoc networks of phones have demonstrated their feasibility with reasonable performance in terms of latency, memory, and energy consumption.
Resumo:
The analysis of sequential data is required in many diverse areas such as telecommunications, stock market analysis, and bioinformatics. A basic problem related to the analysis of sequential data is the sequence segmentation problem. A sequence segmentation is a partition of the sequence into a number of non-overlapping segments that cover all data points, such that each segment is as homogeneous as possible. This problem can be solved optimally using a standard dynamic programming algorithm. In the first part of the thesis, we present a new approximation algorithm for the sequence segmentation problem. This algorithm has smaller running time than the optimal dynamic programming algorithm, while it has bounded approximation ratio. The basic idea is to divide the input sequence into subsequences, solve the problem optimally in each subsequence, and then appropriately combine the solutions to the subproblems into one final solution. In the second part of the thesis, we study alternative segmentation models that are devised to better fit the data. More specifically, we focus on clustered segmentations and segmentations with rearrangements. While in the standard segmentation of a multidimensional sequence all dimensions share the same segment boundaries, in a clustered segmentation the multidimensional sequence is segmented in such a way that dimensions are allowed to form clusters. Each cluster of dimensions is then segmented separately. We formally define the problem of clustered segmentations and we experimentally show that segmenting sequences using this segmentation model, leads to solutions with smaller error for the same model cost. Segmentation with rearrangements is a novel variation to the segmentation problem: in addition to partitioning the sequence we also seek to apply a limited amount of reordering, so that the overall representation error is minimized. We formulate the problem of segmentation with rearrangements and we show that it is an NP-hard problem to solve or even to approximate. We devise effective algorithms for the proposed problem, combining ideas from dynamic programming and outlier detection algorithms in sequences. In the final part of the thesis, we discuss the problem of aggregating results of segmentation algorithms on the same set of data points. In this case, we are interested in producing a partitioning of the data that agrees as much as possible with the input partitions. We show that this problem can be solved optimally in polynomial time using dynamic programming. Furthermore, we show that not all data points are candidates for segment boundaries in the optimal solution.
Resumo:
Segmentation is a data mining technique yielding simplified representations of sequences of ordered points. A sequence is divided into some number of homogeneous blocks, and all points within a segment are described by a single value. The focus in this thesis is on piecewise-constant segments, where the most likely description for each segment and the most likely segmentation into some number of blocks can be computed efficiently. Representing sequences as segmentations is useful in, e.g., storage and indexing tasks in sequence databases, and segmentation can be used as a tool in learning about the structure of a given sequence. The discussion in this thesis begins with basic questions related to segmentation analysis, such as choosing the number of segments, and evaluating the obtained segmentations. Standard model selection techniques are shown to perform well for the sequence segmentation task. Segmentation evaluation is proposed with respect to a known segmentation structure. Applying segmentation on certain features of a sequence is shown to yield segmentations that are significantly close to the known underlying structure. Two extensions to the basic segmentation framework are introduced: unimodal segmentation and basis segmentation. The former is concerned with segmentations where the segment descriptions first increase and then decrease, and the latter with the interplay between different dimensions and segments in the sequence. These problems are formally defined and algorithms for solving them are provided and analyzed. Practical applications for segmentation techniques include time series and data stream analysis, text analysis, and biological sequence analysis. In this thesis segmentation applications are demonstrated in analyzing genomic sequences.
Resumo:
In recent years, XML has been widely adopted as a universal format for structured data. A variety of XML-based systems have emerged, most prominently SOAP for Web services, XMPP for instant messaging, and RSS and Atom for content syndication. This popularity is helped by the excellent support for XML processing in many programming languages and by the variety of XML-based technologies for more complex needs of applications. Concurrently with this rise of XML, there has also been a qualitative expansion of the Internet's scope. Namely, mobile devices are becoming capable enough to be full-fledged members of various distributed systems. Such devices are battery-powered, their network connections are based on wireless technologies, and their processing capabilities are typically much lower than those of stationary computers. This dissertation presents work performed to try to reconcile these two developments. XML as a highly redundant text-based format is not obviously suitable for mobile devices that need to avoid extraneous processing and communication. Furthermore, the protocols and systems commonly used in XML messaging are often designed for fixed networks and may make assumptions that do not hold in wireless environments. This work identifies four areas of improvement in XML messaging systems: the programming interfaces to the system itself and to XML processing, the serialization format used for the messages, and the protocol used to transmit the messages. We show a complete system that improves the overall performance of XML messaging through consideration of these areas. The work is centered on actually implementing the proposals in a form usable on real mobile devices. The experimentation is performed on actual devices and real networks using the messaging system implemented as a part of this work. The experimentation is extensive and, due to using several different devices, also provides a glimpse of what the performance of these systems may look like in the future.
Resumo:
In this thesis we study a series of multi-user resource-sharing problems for the Internet, which involve distribution of a common resource among participants of multi-user systems (servers or networks). We study concurrently accessible resources, which for end-users may be exclusively accessible or non-exclusively. For all kinds we suggest a separate algorithm or a modification of common reputation scheme. Every algorithm or method is studied from different perspectives: optimality of protocols, selfishness of end users, fairness of the protocol for end users. On the one hand the multifaceted analysis allows us to select the most suited protocols among a set of various available ones based on trade-offs of optima criteria. On the other hand, the future Internet predictions dictate new rules for the optimality we should take into account and new properties of the networks that cannot be neglected anymore. In this thesis we have studied new protocols for such resource-sharing problems as the backoff protocol, defense mechanisms against Denial-of-Service, fairness and confidentiality for users in overlay networks. For backoff protocol we present analysis of a general backoff scheme, where an optimization is applied to a general-view backoff function. It leads to an optimality condition for backoff protocols in both slot times and continuous time models. Additionally we present an extension for the backoff scheme in order to achieve fairness for the participants in an unfair environment, such as wireless signal strengths. Finally, for the backoff algorithm we suggest a reputation scheme that deals with misbehaving nodes. For the next problem -- denial-of-service attacks, we suggest two schemes that deal with the malicious behavior for two conditions: forged identities and unspoofed identities. For the first one we suggest a novel most-knocked-first-served algorithm, while for the latter we apply a reputation mechanism in order to restrict resource access for misbehaving nodes. Finally, we study the reputation scheme for the overlays and peer-to-peer networks, where resource is not placed on a common station, but spread across the network. The theoretical analysis suggests what behavior will be selected by the end station under such a reputation mechanism.
Resumo:
XML documents are becoming more and more common in various environments. In particular, enterprise-scale document management is commonly centred around XML, and desktop applications as well as online document collections are soon to follow. The growing number of XML documents increases the importance of appropriate indexing methods and search tools in keeping the information accessible. Therefore, we focus on content that is stored in XML format as we develop such indexing methods. Because XML is used for different kinds of content ranging all the way from records of data fields to narrative full-texts, the methods for Information Retrieval are facing a new challenge in identifying which content is subject to data queries and which should be indexed for full-text search. In response to this challenge, we analyse the relation of character content and XML tags in XML documents in order to separate the full-text from data. As a result, we are able to both reduce the size of the index by 5-6\% and improve the retrieval precision as we select the XML fragments to be indexed. Besides being challenging, XML comes with many unexplored opportunities which are not paid much attention in the literature. For example, authors often tag the content they want to emphasise by using a typeface that stands out. The tagged content constitutes phrases that are descriptive of the content and useful for full-text search. They are simple to detect in XML documents, but also possible to confuse with other inline-level text. Nonetheless, the search results seem to improve when the detected phrases are given additional weight in the index. Similar improvements are reported when related content is associated with the indexed full-text including titles, captions, and references. Experimental results show that for certain types of document collections, at least, the proposed methods help us find the relevant answers. Even when we know nothing about the document structure but the XML syntax, we are able to take advantage of the XML structure when the content is indexed for full-text search.
Resumo:
Wireless network access is gaining increased heterogeneity in terms of the types of IP capable access technologies. The access network heterogeneity is an outcome of incremental and evolutionary approach of building new infrastructure. The recent success of multi-radio terminals drives both building a new infrastructure and implicit deployment of heterogeneous access networks. Typically there is no economical reason to replace the existing infrastructure when building a new one. The gradual migration phase usually takes several years. IP-based mobility across different access networks may involve both horizontal and vertical handovers. Depending on the networking environment, the mobile terminal may be attached to the network through multiple access technologies. Consequently, the terminal may send and receive packets through multiple networks simultaneously. This dissertation addresses the introduction of IP Mobility paradigm into the existing mobile operator network infrastructure that have not originally been designed for multi-access and IP Mobility. We propose a model for the future wireless networking and roaming architecture that does not require revolutionary technology changes and can be deployed without unnecessary complexity. The model proposes a clear separation of operator roles: (i) access operator, (ii) service operator, and (iii) inter-connection and roaming provider. The separation allows each type of an operator to have their own development path and business models without artificial bindings with each other. We also propose minimum requirements for the new model. We present the state of the art of IP Mobility. We also present results of standardization efforts in IP-based wireless architectures. Finally, we present experimentation results of IP-level mobility in various wireless operator deployments.