937 results for Data structures (Computer science)


Abstract:

A key trait of Free and Open Source Software (FOSS) development is its distributed nature. Nevertheless, two project-level operations, the fork and the merge of program code, are among the least well understood events in the lifespan of a FOSS project. Some projects have explicitly adopted these operations as the primary means of concurrent development. In this study, we examine the effect of highly distributed software development, as found in the Linux kernel project, on the collection and modelling of software development data. We find that distributed development calls for sophisticated temporal modelling techniques where several versions of the source code tree can exist at once. Attention must be turned towards the methods of quality assurance and peer review that projects employ to manage these parallel source trees. Our analysis indicates that two new metrics, fork rate and merge rate, could be useful for determining the role of distributed version control systems in FOSS projects. The study presents a preliminary data set consisting of version control and mailing list data.
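
As one plausible operationalization of the proposed metrics (an assumption for illustration, not the paper's definition), the sketch below computes fork and merge rates from a commit graph: a commit with more than one parent is a merge, and a commit that acquires more than one child is a fork point.

```python
# Illustrative sketch: fork and merge rates over a version-control history,
# given as (sha, [parent_shas]) pairs, oldest first.
from collections import defaultdict

def fork_and_merge_rates(commits):
    children = defaultdict(list)
    merges = 0
    for sha, parents in commits:
        if len(parents) > 1:          # a merge joins several source trees
            merges += 1
        for p in parents:
            children[p].append(sha)
    forks = sum(1 for kids in children.values() if len(kids) > 1)
    n = len(commits)
    return forks / n, merges / n      # fork rate, merge rate per commit

# Tiny history: 'b' forks into 'c' and 'd', which are merged at 'e'.
history = [("a", []), ("b", ["a"]), ("c", ["b"]),
           ("d", ["b"]), ("e", ["c", "d"])]
print(fork_and_merge_rates(history))  # (0.2, 0.2): one fork point, one merge
```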

Abstract:

The core aim of machine learning is to make a computer program learn from experience. Learning from data is usually defined as a task of learning regularities or patterns in data in order to extract useful information, or to learn the underlying concept. An important sub-field of machine learning is called multi-view learning, where the task is to learn from multiple data sets or views describing the same underlying concept. A typical example of such a scenario would be to study a biological concept using several biological measurements like gene expression, protein expression and metabolic profiles, or to classify web pages based on their content and the contents of their hyperlinks. In this thesis, novel problem formulations and methods for multi-view learning are presented. The contributions include a linear data fusion approach during exploratory data analysis, a new measure to evaluate different kinds of representations for textual data, and an extension of multi-view learning for novel scenarios where the correspondence of samples in the different views or data sets is not known in advance. In order to infer the one-to-one correspondence of samples between two views, a novel concept of multi-view matching is proposed. The matching algorithm is completely data-driven and is demonstrated in several applications such as matching of metabolites between humans and mice, and matching of sentences between documents in two languages.
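
To make the matching step concrete, here is a minimal sketch (an illustrative assumption, not the thesis's exact algorithm, which is fully data-driven and would alternate this step with learning a shared representation, e.g. via CCA): given comparable representations of the samples in the two views, the one-to-one correspondence is recovered by solving a linear assignment problem on pairwise distances.

```python
# Assignment step of multi-view matching on comparable representations.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_samples(Xc, Yc):
    """Xc, Yc: (n, d) representations of the two views in a shared space.
    Returns match such that sample i in X corresponds to Yc[match[i]]."""
    cost = np.linalg.norm(Xc[:, None, :] - Yc[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)   # Hungarian algorithm, O(n^3)
    return cols

# Demo: the second view is a noisy, shuffled copy of the first; the
# recovered assignment inverts the shuffle.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
perm = rng.permutation(30)
Y = X[perm] + 0.05 * rng.normal(size=(30, 4))
match = match_samples(X, Y)
print(np.array_equal(perm[match], np.arange(30)))  # True: matching recovered
```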

Abstract:

We propose a novel second order cone programming formulation for designing robust classifiers which can handle uncertainty in observations. Similar formulations are derived for designing regression functions which are robust to uncertainties in the regression setting. The proposed formulations are independent of the underlying distribution, requiring only the existence of second order moments. These formulations are then specialized to the case of missing values in observations for both classification and regression problems. Experiments show that the proposed formulations outperform imputation-based approaches.
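
As a sketch of how such a formulation looks in practice, the small program below solves a chance-constraint style robust classifier with cvxpy; the objective, the confidence parameter gamma, the covariance square roots S_half, and the slack trade-off C are illustrative assumptions, not the paper's precise formulation.

```python
# Robust SOCP classifier sketch: each uncertain observation x_i with
# covariance S_i must satisfy the cone constraint
#   y_i (w^T x_i + b) >= 1 + gamma * || S_i^{1/2} w ||_2 - xi_i.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
n, d = 40, 3
X = np.vstack([rng.normal(-1.0, 1.0, (n // 2, d)),
               rng.normal(1.0, 1.0, (n // 2, d))])
y = np.array([-1.0] * (n // 2) + [1.0] * (n // 2))
S_half = [0.3 * np.eye(d) for _ in range(n)]  # square roots of covariances
gamma, C = 1.0, 1.0                           # confidence / slack trade-off

w, b = cp.Variable(d), cp.Variable()
xi = cp.Variable(n, nonneg=True)              # soft-margin slack
margins = cp.multiply(y, X @ w + b)
penalty = cp.hstack([cp.norm(S @ w, 2) for S in S_half])
prob = cp.Problem(cp.Minimize(cp.norm(w, 2) + C * cp.sum(xi)),
                  [margins + xi - gamma * penalty >= 1])
prob.solve()
print("w =", w.value, "b =", b.value)
```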

Abstract:

This paper discusses a method for scaling SVMs with the Gaussian kernel function to handle large data sets by using a selective sampling strategy for the training set. It employs a scalable hierarchical clustering algorithm to construct cluster indexing structures of the training data in the kernel-induced feature space. These are then used for selective sampling of the training data for SVM to impart scalability to the training process. Empirical studies on real-world data sets show that the proposed strategy performs well on large data sets.
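
A simplified sketch of the overall strategy follows; it clusters in input space with BIRCH for brevity, whereas the paper builds its hierarchical cluster indexing structures in the kernel-induced feature space, and the cluster count and boundary threshold here are illustrative.

```python
# Cluster-based selective sampling for SVM training (simplified sketch).
import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=5000, n_features=10, random_state=0)

# Step 1: scalable hierarchical clustering summarizes the training set.
labels = Birch(n_clusters=50).fit_predict(X)

# Step 2: first-pass SVM on one representative (centroid) per cluster.
reps = np.array([X[labels == c].mean(axis=0) for c in range(50)])
rep_y = np.array([int(y[labels == c].mean() > 0.5) for c in range(50)])
coarse = SVC(kernel="rbf").fit(reps, rep_y)

# Step 3: retrain only on points from clusters near the decision boundary.
near = np.abs(coarse.decision_function(reps)) <= 1.0
sel = np.isin(labels, np.flatnonzero(near))
final = SVC(kernel="rbf").fit(X[sel], y[sel])
print(f"trained final SVM on {sel.sum()} of {len(X)} samples")
```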

Abstract:

This paper presents a detailed description of the hardware design and implementation of PROMIDS: a PROtotype Multi-rIng Data flow System for functional programming languages. The hardware constraints and the design trade-offs are discussed. The design of the functional units is described in detail. Finally, we report our experience with PROMIDS.

Abstract:

The domination and Hamilton circuit problems are of interest both in algorithm design and complexity theory. The domination problem has applications in facility location, and the Hamilton circuit problem has applications in routing problems in communications and operations research. The problem of deciding if G has a dominating set of cardinality at most k, and the problem of determining if G has a Hamilton circuit, are NP-complete. Polynomial time algorithms are, however, available for a large number of restricted classes. A motivation for the study of these algorithms is that they not only give insight into the characterization of these classes but also require a variety of algorithmic techniques and data structures. The search for efficient algorithms for these problems on many classes therefore continues. A class of perfect graphs which is practically important and mathematically interesting is the class of permutation graphs. The domination problem is polynomial time solvable on permutation graphs. Algorithms that are already available are of time complexity O(n^2) or more, and space complexity O(n^2), on these graphs. The Hamilton circuit problem is open for this class. We present a simple O(n) time and O(n) space algorithm for the domination problem on permutation graphs. Unlike the existing algorithms, we use the concept of the geometric representation of permutation graphs. Further, exploiting this geometric notion, we develop an O(n^2) time and O(n) space algorithm for the Hamilton circuit problem.
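
The geometric representation underlying these algorithms is easy to state: vertex i is a segment joining position i on a top line to the position of i in the permutation on a bottom line, and two vertices are adjacent exactly when their segments cross. A minimal sketch (it enumerates edges in O(n^2) time, which the algorithms above are designed to avoid):

```python
def permutation_graph_edges(pi):
    """pi[k] = vertex at bottom position k (0-indexed). Vertices i < j are
    adjacent exactly when their segments cross, i.e. pos[i] > pos[j]."""
    pos = {v: k for k, v in enumerate(pi)}   # bottom position of each vertex
    n = len(pi)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if pos[i] > pos[j]]

# Example: the inverted pairs of the permutation become the edges.
print(permutation_graph_edges([2, 0, 3, 1]))  # [(0, 2), (1, 2), (1, 3)]
```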

Abstract:

Reorganizing a dataset so that its hidden structure can be observed is useful in any data analysis task. For example, detecting a regularity in a dataset helps us to interpret the data, compress the data, and explain the processes behind the data. We study datasets that come in the form of binary matrices (tables with 0s and 1s). Our goal is to develop automatic methods that bring out certain patterns by permuting the rows and columns. We concentrate on the following patterns in binary matrices: consecutive-ones (C1P), simultaneous consecutive-ones (SC1P), nestedness, k-nestedness, and bandedness. These patterns reflect specific types of interplay and variation between the rows and columns, such as continuity and hierarchies. Furthermore, their combinatorial properties are interlinked, which helps us to develop the theory of binary matrices and efficient algorithms. Indeed, we can detect all these patterns in a binary matrix efficiently, that is, in polynomial time in the size of the matrix. Since real-world datasets often contain noise and errors, we rarely witness perfect patterns. Therefore we also need to assess how far an input matrix is from a pattern: we count the number of flips (from 0s to 1s or vice versa) needed to bring out the perfect pattern in the matrix. Unfortunately, for most patterns it is an NP-complete problem to find the minimum distance to a matrix that has the perfect pattern, which means that the existence of a polynomial-time algorithm is unlikely. To find patterns in datasets with noise, we need methods that are noise-tolerant and work in practical time with large datasets. The theory of binary matrices gives rise to robust heuristics that have good performance with synthetic data and discover easily interpretable structures in real-world datasets: dialectal variation in spoken Finnish, division of European locations by the hierarchies found in mammal occurrences, and co-occurring groups in network data. In addition to determining the distance from a dataset to a pattern, we need to determine whether the pattern is significant or a mere occurrence of random chance. To this end, we use significance testing: we deem a dataset significant if it appears exceptional when compared to datasets generated from a certain null hypothesis. After detecting a significant pattern in a dataset, it is up to domain experts to interpret the results in terms of the application.
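
As a small illustration of one of these patterns, the sketch below checks nestedness (the rows, viewed as sets of 1-columns, must form a chain under inclusion) and computes a crude greedy upper bound on the flip distance to a nested matrix; both functions are illustrative stand-ins, not the heuristics developed in the thesis.

```python
import numpy as np

def is_nested(M):
    """True iff the rows, as sets of 1-columns, form a chain under inclusion."""
    rows = [frozenset(np.flatnonzero(r)) for r in np.asarray(M)]
    rows.sort(key=len, reverse=True)
    return all(b <= a for a, b in zip(rows, rows[1:]))

def flips_to_nested_greedy(M):
    """Crude upper bound on the flip distance: sort rows and columns by sums,
    then compare against the left-justified 'staircase' of a nested matrix.
    (Finding the exact minimum is NP-complete for most of these patterns.)"""
    M = np.asarray(M)
    r = np.argsort(-M.sum(axis=1)); c = np.argsort(-M.sum(axis=0))
    S = M[np.ix_(r, c)]
    target = np.zeros_like(S)
    for i, row in enumerate(S):
        target[i, :row.sum()] = 1           # prefix supports form a chain
    return int(np.abs(S - target).sum())    # number of differing entries

print(is_nested([[1, 1, 1], [1, 1, 0], [1, 0, 0]]))  # True
print(is_nested([[1, 1, 0], [0, 1, 1]]))             # False
```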

Abstract:

We study the following problem: given a geometric graph G and an integer k, determine if G has a planar spanning subgraph (with the original embedding and straight-line edges) such that all nodes have degree at least k. If G is a unit disk graph, the problem is trivial to solve for k = 1. We show that even the slightest deviation from the trivial case (e.g., quasi unit disk graphs or k = 2) leads to NP-hard problems.
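
One way to see why k = 1 is trivial on unit disk graphs (an illustration inferred from standard facts, not spelled out in the abstract): if every node has a neighbour within distance 1, the nearest-neighbour graph uses only unit disk edges, gives every node degree at least one, and is planar with straight-line edges because it is a subgraph of the Delaunay triangulation.

```python
import numpy as np

def planar_degree1_subgraph(points):
    """points: (n, 2) array; assumes every node has another node within
    distance 1, so every returned edge belongs to the unit disk graph."""
    P = np.asarray(points, dtype=float)
    D = np.linalg.norm(P[:, None] - P[None, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    nn = D.argmin(axis=1)                 # nearest neighbour of each node
    return {tuple(sorted((i, int(j)))) for i, j in enumerate(nn)}

pts = np.array([[0, 0], [0.5, 0.1], [1.2, 0.0], [0.6, 0.9]])
print(planar_degree1_subgraph(pts))       # every node incident to an edge
```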

Abstract:

We introduce a variation density function that profiles the relationship between multiple scalar fields over isosurfaces of a given scalar field. This profile serves as a valuable tool for multifield data exploration because it provides the user with cues to identify interesting isovalues of scalar fields. Existing isosurface-based techniques for scalar data exploration like Reeb graphs, contour spectra, isosurface statistics, etc., study a scalar field in isolation. We argue that the identification of interesting isovalues in a multifield data set should necessarily be based on the interaction between the different fields. We demonstrate the effectiveness of our approach by applying it to explore data from a wide variety of applications.
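
A toy version of the idea (an assumed, heavily simplified estimator, not the paper's variation density function): bin the isovalues of one field and record how much a second field varies within each bin; peaks in the resulting profile suggest isovalues worth inspecting.

```python
import numpy as np

def variation_profile(f, g, n_bins=64):
    """Spread of field g over the level sets of field f, per isovalue bin."""
    f, g = np.ravel(f), np.ravel(g)
    edges = np.linspace(f.min(), f.max(), n_bins + 1)
    idx = np.clip(np.digitize(f, edges) - 1, 0, n_bins - 1)
    prof = np.array([g[idx == b].std() if np.any(idx == b) else 0.0
                     for b in range(n_bins)])
    return edges, prof

# Two synthetic scalar fields on a 3D grid.
x, y, z = np.meshgrid(*[np.linspace(-1, 1, 32)] * 3, indexing="ij")
f = x**2 + y**2 + z**2             # spherical isosurfaces
g = np.sin(4 * x) * np.cos(4 * y)  # second field profiled over them
edges, profile = variation_profile(f, g)
print(edges[profile.argmax()])     # isovalue (bin edge) where g varies most
```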

Abstract:

This paper discusses implementation details of efficient schemes for lenient execution and concurrent execution of re-entrant routines in a data flow model. The proposed schemes require no extra hardware support and effectively utilise existing hardware resources, such as the Matching Unit and the Memory Network Interface, to achieve these goals.

Abstract:

Bayesian networks are compact, flexible, and interpretable representations of a joint distribution. When the network structure is unknown but there are observational data at hand, one can try to learn the network structure. This is called structure discovery. This thesis contributes to two areas of structure discovery in Bayesian networks: space-time tradeoffs and learning ancestor relations. The fastest exact algorithms for structure discovery in Bayesian networks are based on dynamic programming and use excessive amounts of space. Motivated by the space usage, several schemes for trading space against time are presented. These schemes are presented in a general setting for a class of computational problems called permutation problems; structure discovery in Bayesian networks is seen as a challenging variant of the permutation problems. The main contribution in the area of space-time tradeoffs is the partial order approach, in which the standard dynamic programming algorithm is extended to run over partial orders. In particular, a certain family of partial orders called parallel bucket orders is considered. A partial order scheme that provably yields an optimal space-time tradeoff within parallel bucket orders is presented. Practical issues concerning parallel bucket orders are also discussed. Learning ancestor relations, that is, directed paths between nodes, is motivated by the need for robust summaries of the network structures when there are unobserved nodes at work. Ancestor relations are nonmodular features, and hence learning them is more difficult than learning modular features. A dynamic programming algorithm is presented for computing posterior probabilities of ancestor relations exactly. Empirical tests suggest that ancestor relations can be learned from observational data almost as accurately as arcs, even in the presence of unobserved nodes.
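
The dynamic programming baseline that these tradeoffs generalize can be sketched compactly (a Silander-Myllymäki style subset DP; local_score stands in for an actual scoring function such as BIC or BDeu, and the sketch returns only the optimal score rather than the structure, which would be recovered by tracing the argmax choices back):

```python
from itertools import combinations

def best_network(nodes, local_score):
    """Exact score of the best Bayesian network; O(2^n) space, which is
    precisely the cost that partial-order schemes trade against time."""
    n = len(nodes)
    full = (1 << n) - 1
    # best_ps[v][S]: best local score of v with parents drawn from subset S.
    best_ps = [{} for _ in range(n)]
    for v in range(n):
        for size in range(n):
            for S in combinations([u for u in range(n) if u != v], size):
                mask = sum(1 << u for u in S)
                inherited = max((best_ps[v][mask ^ (1 << u)] for u in S),
                                default=float("-inf"))
                best_ps[v][mask] = max(local_score(v, S), inherited)
    # F[mask]: best score of a network over the nodes in 'mask', built by
    # choosing a sink v whose parents lie inside mask \ {v}.
    F = {0: 0.0}
    for mask in range(1, full + 1):
        F[mask] = max(F[mask ^ (1 << v)] + best_ps[v][mask ^ (1 << v)]
                      for v in range(n) if mask & (1 << v))
    return F[full]

# Toy score (stand-in for BIC/BDeu): reward the parent v-1, penalize size.
score = lambda v, parents: 1.0 * ((v - 1) in parents) - 0.1 * len(parents)
print(best_network(range(3), score))  # ~1.8: the chain 0 -> 1 -> 2
```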

Abstract:

A graphical display of the frequency content of background electroencephalogram (EEG) activity is obtained by calculating spectral estimates using the autocorrelation-based autoregressive method and the classical Fourier transform method. The spectra of consecutive data segments are displayed using a hidden-line suppression technique so as to obtain a spectral array. The autoregressive spectral array (ASA) is found to be sensitive to baseline drift. Following baseline correction, the autoregressive technique is found to be superior to the Fourier method of the compressed spectral array (CSA) in detecting transitions in the frequencies of the signal. The smoothed ASA gives a better picture of transitions and changes in the background activity. The ASA can be made to adapt to specific changes of dominant frequencies while eliminating unnecessary peaks in the spectrum. The utility of the ASA for background EEG analysis is discussed.
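
A sketch of the autoregressive estimator that underlies the ASA, applied to a single data segment (an assumed Yule-Walker implementation with illustrative order and sampling parameters, not the paper's code); stacking such spectra for consecutive segments would yield the spectral array:

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def ar_spectrum(x, order=8, n_freq=256):
    """Yule-Walker autoregressive spectrum of one data segment."""
    x = x - x.mean()
    r = np.correlate(x, x, "full")[len(x) - 1:] / len(x)   # autocorrelation
    a = solve_toeplitz(r[:order], r[1:order + 1])          # AR coefficients
    sigma2 = r[0] - a @ r[1:order + 1]                     # prediction error
    w = np.linspace(0, np.pi, n_freq)
    A = 1 - np.exp(-1j * np.outer(w, np.arange(1, order + 1))) @ a
    return w, sigma2 / np.abs(A) ** 2

# Demo: a 10 Hz "alpha" component in noise, sampled at 128 Hz.
fs = 128
t = np.arange(0, 2, 1 / fs)
seg = np.sin(2 * np.pi * 10 * t) + \
      0.5 * np.random.default_rng(0).normal(size=t.size)
w, P = ar_spectrum(seg)
print(f"dominant frequency ~ {w[P.argmax()] * fs / (2 * np.pi):.1f} Hz")
```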

Abstract:

Interactive visualization applications benefit from simplification techniques that generate good-quality coarse meshes from high-resolution meshes that represent the domain. These meshes often contain interesting substructures, called embedded structures, and it is desirable to preserve the topology of the embedded structures during simplification, in addition to preserving the topology of the domain. This paper describes a proof that link conditions, proposed earlier, are sufficient to ensure that edge contractions preserve the topology of the embedded structures and the domain. Excluding two specific configurations, the link conditions are also shown to be necessary for topology preservation. Repeated application of edge contraction on an extended complex produces a coarser representation of the domain and the embedded structures. An extension of the quadric error metric is used to schedule edge contractions, resulting in a good-quality coarse mesh that closely approximates the input domain and the embedded structures.
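
The quadric error metric used for scheduling can be sketched in a few lines (a minimal version: each vertex accumulates the plane quadrics of its incident triangles, and an edge's cost is the combined quadric evaluated at the midpoint; the full method solves for an optimal position, and the extension to embedded structures is not shown):

```python
import numpy as np

def plane_quadric(p0, p1, p2):
    """Fundamental quadric of a triangle's supporting plane."""
    n = np.cross(p1 - p0, p2 - p0)
    n = n / np.linalg.norm(n)
    plane = np.append(n, -n @ p0)       # [a, b, c, d] with ax+by+cz+d = 0
    return np.outer(plane, plane)       # 4x4 quadric K_p

def edge_cost(Q1, Q2, position):
    """Sum of squared distances from 'position' to both vertices' planes."""
    v = np.append(position, 1.0)
    return v @ (Q1 + Q2) @ v

# Toy mesh (a tetrahedron): accumulate quadrics, then cost one contraction.
verts = np.array([[0., 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]])
tris = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
Q = np.zeros((4, 4, 4))
for a, b, c in tris:
    K = plane_quadric(verts[a], verts[b], verts[c])
    Q[a] += K; Q[b] += K; Q[c] += K
mid = (verts[0] + verts[1]) / 2
print(edge_cost(Q[0], Q[1], mid))       # cost of contracting edge (0, 1)
```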