937 resultados para Data structures (Computer science)
Resumo:
In the last few years we have observed a proliferation of approaches for clustering XML docu- ments and schemas based on their structure and content. The presence of such a huge amount of approaches is due to the different applications requiring the XML data to be clustered. These applications need data in the form of similar contents, tags, paths, structures and semantics. In this paper, we first outline the application contexts in which clustering is useful, then we survey approaches so far proposed relying on the abstract representation of data (instances or schema), on the identified similarity measure, and on the clustering algorithm. This presentation leads to draw a taxonomy in which the current approaches can be classified and compared. We aim at introducing an integrated view that is useful when comparing XML data clustering approaches, when developing a new clustering algorithm, and when implementing an XML clustering compo- nent. Finally, the paper moves into the description of future trends and research issues that still need to be faced.
Resumo:
Citizen Science projects are initiatives in which members of the general public participate in scientific research projects and perform or manage research-related tasks such as data collection and/or data annotation. Citizen Science is technologically possible and scientifically significant. However, although research teams can save time and money by recruiting general citizens to volunteer their time and skills to help data analysis, the reliability of contributed data varies a lot. Data reliability issues are significant to the domain of Citizen Science due to the quantity and diversity of people and devices involved. Participants may submit low quality, misleading, inaccurate, or even malicious data. Therefore, finding a way to improve the data reliability has become an urgent demand. This study aims to investigate techniques to enhance the reliability of data contributed by general citizens in scientific research projects especially for acoustic sensing projects. In particular, we propose to design a reputation framework to enhance data reliability and also investigate some critical elements that should be aware of during developing and designing new reputation systems.
Resumo:
This article describes a Matlab toolbox for parametric identification of fluid-memory models associated with the radiation forces ships and offshore structures. Radiation forces are a key component of force-to-motion models used in simulators, motion control designs, and also for initial performance evaluation of wave-energy converters. The software described provides tools for preparing non-parmatric data and for identification with automatic model-order detection. The identification problem is considered in the frequency domain.
Resumo:
This paper addresses the problem of joint identification of infinite-frequency added mass and fluid memory models of marine structures from finite frequency data. This problem is relevant for cases where the code used to compute the hydrodynamic coefficients of the marine structure does not give the infinite-frequency added mass. This case is typical of codes based on 2D-potential theory since most 3D-potential-theory codes solve the boundary value associated with the infinite frequency. The method proposed in this paper presents a simpler alternative approach to other methods previously presented in the literature. The advantage of the proposed method is that the same identification procedure can be used to identify the fluid-memory models with or without having access to the infinite-frequency added mass coefficient. Therefore, it provides an extension that puts the two identification problems into the same framework. The method also exploits the constraints related to relative degree and low-frequency asymptotic values of the hydrodynamic coefficients derived from the physics of the problem, which are used as prior information to refine the obtained models.
Resumo:
Time-domain models of marine structures based on frequency domain data are usually built upon the Cummins equation. This type of model is a vector integro-differential equation which involves convolution terms. These convolution terms are not convenient for analysis and design of motion control systems. In addition, these models are not efficient with respect to simulation time, and ease of implementation in standard simulation packages. For these reasons, different methods have been proposed in the literature as approximate alternative representations of the convolutions. Because the convolution is a linear operation, different approaches can be followed to obtain an approximately equivalent linear system in the form of either transfer function or state-space models. This process involves the use of system identification, and several options are available depending on how the identification problem is posed. This raises the question whether one method is better than the others. This paper therefore has three objectives. The first objective is to revisit some of the methods for replacing the convolutions, which have been reported in different areas of analysis of marine systems: hydrodynamics, wave energy conversion, and motion control systems. The second objective is to compare the different methods in terms of complexity and performance. For this purpose, a model for the response in the vertical plane of a modern containership is considered. The third objective is to describe the implementation of the resulting model in the standard simulation environment Matlab/Simulink.
Resumo:
Dealing with the large amount of data resulting from association rule mining is a big challenge. The essential issue is how to provide efficient methods for summarizing and representing meaningful discovered knowledge from databases. This paper presents a new approach called multi-tier granule mining to improve the performance of association rule mining. Rather than using patterns, it uses granules to represent knowledge that is implicitly contained in relational databases. This approach also uses multi-tier structures and association mappings to interpret association rules in terms of granules. Consequently, association rules can be quickly assessed and meaningless association rules can be justified according to these association mappings. The experimental results indicate that the proposed approach is promising
Resumo:
We present an approach for the inspection of vertical pole-like infrastructure using a vertical take-off and landing (VTOL) unmanned aerial vehicle and shared autonomy. Inspecting vertical structures, such as light and power distribution poles, is a time consuming, dangerous and expensive task with high operator workload. To address these issues, we propose a VTOL platform that can operate at close-quarters, whilst maintaining a safe stand-off distance and rejecting environmental disturbances. We adopt an Image based Visual Servoing (IBVS) technique using only two line features to stabilise the vehicle with respect to a pole. Visual, inertial and sonar data are used, making the approach suitable for indoor or GPS-denied environments. Results from simulation and outdoor flight experiments demonstrate the system is able to successfully inspect and circumnavigate a pole.
Resumo:
One of the main challenges in data analytics is that discovering structures and patterns in complex datasets is a computer-intensive task. Recent advances in high-performance computing provide part of the solution. Multicore systems are now more affordable and more accessible. In this paper, we investigate how this can be used to develop more advanced methods for data analytics. We focus on two specific areas: model-driven analysis and data mining using optimisation techniques.
Resumo:
Bird species richness survey is one of the most intriguing ecological topics for evaluating environmental health. Here, bird species richness denotes the number of unique bird species in a particular area. Factors affecting the investigation of bird species richness include weather, observation bias, and most importantly, the prohibitive costs of conducting surveys at large spatiotemporal scales. Thanks to advances in recording techniques, these problems have been alleviated by deploying sensors for acoustic data collection. Although automated detection techniques have been introduced to identify various bird species, the innate complexity of bird vocalizations, the background noise present in the recording and the escalating volumes of acoustic data pose a challenging task on determination of bird species richness. In this paper we proposed a two-step computer-assisted sampling approach for determining bird species richness in one-day acoustic data. First, a classification model is built based on acoustic indices for filtering out minutes that contain few bird species. Then the classified bird minutes are ordered by an acoustic index and the redundant temporal minutes are removed from the ranked minute sequence. The experimental results show that our method is more efficient in directing experts for determination of bird species compared with the previous methods.
Resumo:
Background: A genetic network can be represented as a directed graph in which a node corresponds to a gene and a directed edge specifies the direction of influence of one gene on another. The reconstruction of such networks from transcript profiling data remains an important yet challenging endeavor. A transcript profile specifies the abundances of many genes in a biological sample of interest. Prevailing strategies for learning the structure of a genetic network from high-dimensional transcript profiling data assume sparsity and linearity. Many methods consider relatively small directed graphs, inferring graphs with up to a few hundred nodes. This work examines large undirected graphs representations of genetic networks, graphs with many thousands of nodes where an undirected edge between two nodes does not indicate the direction of influence, and the problem of estimating the structure of such a sparse linear genetic network (SLGN) from transcript profiling data. Results: The structure learning task is cast as a sparse linear regression problem which is then posed as a LASSO (l1-constrained fitting) problem and solved finally by formulating a Linear Program (LP). A bound on the Generalization Error of this approach is given in terms of the Leave-One-Out Error. The accuracy and utility of LP-SLGNs is assessed quantitatively and qualitatively using simulated and real data. The Dialogue for Reverse Engineering Assessments and Methods (DREAM) initiative provides gold standard data sets and evaluation metrics that enable and facilitate the comparison of algorithms for deducing the structure of networks. The structures of LP-SLGNs estimated from the INSILICO1, INSILICO2 and INSILICO3 simulated DREAM2 data sets are comparable to those proposed by the first and/or second ranked teams in the DREAM2 competition. The structures of LP-SLGNs estimated from two published Saccharomyces cerevisae cell cycle transcript profiling data sets capture known regulatory associations. In each S. cerevisiae LP-SLGN, the number of nodes with a particular degree follows an approximate power law suggesting that its degree distributions is similar to that observed in real-world networks. Inspection of these LP-SLGNs suggests biological hypotheses amenable to experimental verification. Conclusion: A statistically robust and computationally efficient LP-based method for estimating the topology of a large sparse undirected graph from high-dimensional data yields representations of genetic networks that are biologically plausible and useful abstractions of the structures of real genetic networks. Analysis of the statistical and topological properties of learned LP-SLGNs may have practical value; for example, genes with high random walk betweenness, a measure of the centrality of a node in a graph, are good candidates for intervention studies and hence integrated computational – experimental investigations designed to infer more realistic and sophisticated probabilistic directed graphical model representations of genetic networks. The LP-based solutions of the sparse linear regression problem described here may provide a method for learning the structure of transcription factor networks from transcript profiling and transcription factor binding motif data.
Resumo:
A variety of data structures such as inverted file, multi-lists, quad tree, k-d tree, range tree, polygon tree, quintary tree, multidimensional tries, segment tree, doubly chained tree, the grid file, d-fold tree. super B-tree, Multiple Attribute Tree (MAT), etc. have been studied for multidimensional searching and related problems. Physical data base organization, which is an important application of multidimensional searching, is traditionally and mostly handled by employing inverted file. This study proposes MAT data structure for bibliographic file systems, by illustrating the superiority of MAT data structure over inverted file. Both the methods are compared in terms of preprocessing, storage and query costs. Worst-case complexity analysis of both the methods, for a partial match query, is carried out in two cases: (a) when directory resides in main memory, (b) when directory resides in secondary memory. In both cases, MAT data structure is shown to be more efficient than the inverted file method. Arguments are given to illustrate the superiority of MAT data structure in an average case also. An efficient adaptation of MAT data structure, that exploits the special features of MAT structure and bibliographic files, is proposed for bibliographic file systems. In this adaptation, suitable techniques for fixing and ranking of the attributes for MAT data structure are proposed. Conclusions and proposals for future research are presented.
Resumo:
Compositional data analysis usually deals with relative information between parts where the total (abundances, mass, amount, etc.) is unknown or uninformative. This article addresses the question of what to do when the total is known and is of interest. Tools used in this case are reviewed and analysed, in particular the relationship between the positive orthant of D-dimensional real space, the product space of the real line times the D-part simplex, and their Euclidean space structures. The first alternative corresponds to data analysis taking logarithms on each component, and the second one to treat a log-transformed total jointly with a composition describing the distribution of component amounts. Real data about total abundances of phytoplankton in an Australian river motivated the present study and are used for illustration.
Resumo:
Dispersing a data object into a set of data shares is an elemental stage in distributed communication and storage systems. In comparison to data replication, data dispersal with redundancy saves space and bandwidth. Moreover, dispersing a data object to distinct communication links or storage sites limits adversarial access to whole data and tolerates loss of a part of data shares. Existing data dispersal schemes have been proposed mostly based on various mathematical transformations on the data which induce high computation overhead. This paper presents a novel data dispersal scheme where each part of a data object is replicated, without encoding, into a subset of data shares according to combinatorial design theory. Particularly, data parts are mapped to points and data shares are mapped to lines of a projective plane. Data parts are then distributed to data shares using the point and line incidence relations in the plane so that certain subsets of data shares collectively possess all data parts. The presented scheme incorporates combinatorial design theory with inseparability transformation to achieve secure data dispersal at reduced computation, communication and storage costs. Rigorous formal analysis and experimental study demonstrate significant cost-benefits of the presented scheme in comparison to existing methods.
Resumo:
Bacteriorhodopsin has been the subject of intense study in order to understand its photochemical function. The recent atomic model proposed by Henderson and coworkers based on electron cryo-microscopic studies has helped in understanding many of the structural and functional aspects of bacteriorhodopsin. However, the accuracy of the positions of the side chains is not very high since the model is based on low-resolution data. In this study, we have minimized the energy of this structure of bacteriorhodopsin and analyzed various types of interactions such as - intrahelical and interhelical hydrogen bonds and retinal environment. In order to understand the photochemical action, it is necessary to obtain information on the structures adopted at the intermediate states. In this direction, we have generated some intermediate structures taking into account certain experimental data, by computer modeling studies. Various isomers of retinal with 13-cis and/or 15-cis conformations and all possible staggered orientations of Lys-216 side chain were generated. The resultant structures were examined for the distance between Lys-216-schiff base nitrogen and the carboxylate oxygen atoms of Asp-96 - a residue which is known to reprotonate the schiff base at later stages of photocycle. Some of the structures were selected on the basis of suitable retinal orientation and the stability of these structures were tested by energy minimization studies. Further, the minimized structures are analyzed for the hydrogen bond interactions and retinal environment and the results are compared with those of the minimized rest state structure. The importance of functional groups in stabilizing the structure of bacteriorhodopsin and in participating dynamically during the photocycle have been discussed.
Resumo:
We present external memory data structures for efficiently answering range-aggregate queries. The range-aggregate problem is defined as follows: Given a set of weighted points in R-d, compute the aggregate of the weights of the points that lie inside a d-dimensional orthogonal query rectangle. The aggregates we consider in this paper include COUNT, sum, and MAX. First, we develop a structure for answering two-dimensional range-COUNT queries that uses O(N/B) disk blocks and answers a query in O(log(B) N) I/Os, where N is the number of input points and B is the disk block size. The structure can be extended to obtain a near-linear-size structure for answering range-sum queries using O(log(B) N) I/Os, and a linear-size structure for answering range-MAX queries in O(log(B)(2) N) I/Os. Our structures can be made dynamic and extended to higher dimensions. (C) 2012 Elsevier B.V. All rights reserved.