902 results for Large Data Sets
Abstract:
Currently several thousand objects are being tracked in the MEO and GEO regions through optical means. The problem faced in this framework is that of Multiple Target Tracking (MTT). In this context, both the correct associations among the observations and the orbits of the objects have to be determined. The complexity of the MTT problem is defined by its dimension S, which corresponds to the number of fences involved in the problem. Each fence consists of a set of observations where each observation belongs to a different object. The S ≥ 3 MTT problem is an NP-hard combinatorial optimization problem. There are two general ways to solve it. One way is to seek the optimum solution, which can be achieved by applying a branch-and-bound algorithm. When using these algorithms, the problem has to be greatly simplified to keep the computational cost at a reasonable level. Another option is to approximate the solution by using meta-heuristic methods. These methods aim to explore the possible combinations efficiently, so that a reasonable result can be obtained with a reasonable computational effort. To this end, several population-based meta-heuristic methods are implemented and tested on simulated optical measurements. With the advent of improved sensors and a heightened interest in the problem of space debris, it is expected that the number of tracked objects will grow by an order of magnitude in the near future. This research aims to provide a method that can treat the correlation and orbit determination problems simultaneously, and is able to efficiently process large data sets with minimal manual intervention.
Abstract:
Currently several thousand objects are being tracked in the MEO and GEO regions through optical means. The problem faced in this framework is that of Multiple Target Tracking (MTT). In this context, both the correct associations among the observations and the orbits of the objects have to be determined. The complexity of the MTT problem is defined by its dimension S, where S stands for the number of 'fences' used in the problem. Each fence consists of a set of observations that all originate from different targets. For a dimension of S > 2 the MTT problem becomes NP-hard. No algorithm is currently known that can solve an NP-hard problem optimally within a reasonable (polynomial) computation time. However, there are algorithms that can approximate the solution with a realistic computational effort. To this end, an Elitist Genetic Algorithm is implemented to approximately solve the S > 2 MTT problem in an efficient manner. Its complexity is studied and it is found that an approximate solution can be obtained in polynomial time. With the advent of improved sensors and a heightened interest in the problem of space debris, it is expected that the number of tracked objects will grow by an order of magnitude in the near future. This research aims to provide a method that can treat the correlation and orbit determination problems simultaneously, and is able to efficiently process large data sets with minimal manual intervention.
Abstract:
A number of analyses of large data sets have suggested that the reading achievement gap between African American and White U.S. students is negligible or small at school entry, but widens substantially during the school years because African American students show slower rates of growth in elementary and secondary school. Identifying when and why gaps occur, therefore, is an important research endeavor. In addition, being able to predict which African American children are most likely to fall behind can contribute to efforts to close the achievement gap. This paper analyzes first grade and third grade data on African American and White children in Massachusetts who were all identified in first grade as struggling readers and enrolled in Reading Recovery, an individualized intervention. All the children were low-income and attending urban schools. Using Observation Survey data from first grade and MCAS Reading data from third grade, we found that the African American and White students made equal average progress while in first grade, but by the end of third grade showed a large gap in MCAS proficiency rates. We discuss the results in terms of school quality, reading development, dialect issues, testing formats, and the need to provide long-term support to vulnerable learners.
Abstract:
Logistic regression is one of the most important tools in the analysis of epidemiological and clinical data. Such data often contain missing values for one or more variables. Common practice is to eliminate all individuals for whom any information is missing. This deletion approach does not make efficient use of available information and often introduces bias. Two methods were developed to estimate logistic regression coefficients for mixed dichotomous and continuous covariates, including partially observed binary covariates. The data were assumed to be missing at random (MAR). One method (PD) used the predictive distribution as a weight to average the logistic regressions performed over all possible values of the missing observations, and the second method (RS) used a variant of a resampling technique. Seven additional methods were compared with these two approaches in a simulation study: (1) analysis based on only the complete cases, (2) substituting the mean of the observed values for the missing value, (3) an imputation technique based on the proportions of observed data, (4) regressing the partially observed covariates on the remaining continuous covariates, (5) regressing the partially observed covariates on the remaining continuous covariates conditional on the response variable, (6) regressing the partially observed covariates on the remaining continuous covariates and the response variable, and (7) the EM algorithm. Both proposed methods showed smaller standard errors (s.e.) for the coefficient involving the partially observed covariate and for the other coefficients as well. However, both methods, especially PD, are computationally demanding; thus, for analysis of large data sets with partially observed covariates, further refinement of these approaches is needed.
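The two simplest comparison methods in the list above, complete-case analysis (1) and mean substitution (2), can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the use of `None` for a missing value, and the toy rows are hypothetical and do not come from the study, and the proposed PD and RS methods are not shown.

```python
def complete_cases(rows):
    """Method (1): keep only rows in which no value is missing (None)."""
    return [r for r in rows if all(v is not None for v in r)]


def mean_substitute(rows, col):
    """Method (2): replace missing values in column `col` by the observed mean."""
    observed = [r[col] for r in rows if r[col] is not None]
    mean = sum(observed) / len(observed)
    return [
        tuple(mean if (i == col and v is None) else v for i, v in enumerate(r))
        for r in rows
    ]


# Hypothetical toy data: (binary covariate, continuous covariate, response).
# The binary covariate is missing for the second individual.
rows = [(1, 2.0, 1), (None, 3.5, 0), (0, 1.0, 1), (1, 4.0, 0)]

kept = complete_cases(rows)          # drops the incomplete individual
filled = mean_substitute(rows, 0)    # keeps all individuals, imputes 2/3
```

Complete-case analysis discards the whole second row, while mean substitution keeps it at the price of a non-binary imputed value, which illustrates the trade-off the abstract describes.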
Abstract:
The main problem of pedestrian dead-reckoning (PDR) using only a body-attached inertial measurement unit is the accumulation of heading errors. The heading provided by magnetometers inside buildings is in general not reliable and therefore it is commonly not used. Recently, a new method called heuristic drift elimination (HDE) was proposed that minimises the heading error when navigating in buildings. It assumes that the majority of buildings have their corridors parallel to each other, or intersecting at right angles, and consequently most of the time the person walks along a straight path with a heading constrained to one of four possible directions. In this article we study the performance of HDE-based methods in complex buildings, i.e. with pathways also oriented at 45°, long curved corridors, and wide areas where non-oriented motion is possible. We explain how the performance of the original HDE method can deteriorate in complex buildings, and also how severe errors can appear in the case of false matches with the building's dominant directions. Although magnetic compassing indoors has a chaotic behaviour, in this article we analyse large data sets in order to study the potential use of magnetic compassing to estimate the absolute yaw angle of a walking person. Apart from these analyses, this article also proposes an improved HDE method called Magnetically-aided Improved Heuristic Drift Elimination (MiHDE), which is implemented over a PDR framework that uses foot-mounted inertial navigation with an extended Kalman filter (EKF). The EKF is fed with the MiHDE-estimated orientation error and gyro bias corrections, as well as the confidence in those corrections. We experimentally evaluated the performance of the proposed MiHDE-based PDR method, comparing it with the original HDE implementation. Results show that both methods perform very well in ideal orthogonal narrow-corridor buildings, and that MiHDE outperforms HDE for non-ideal trajectories (e.g. curved paths) and is also robust against potential false matches with dominant directions.
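The geometric idea behind the dominant-direction assumption can be sketched as follows: given an estimated heading, compute its signed offset from the nearest dominant direction of the building. This is a minimal illustration only, not the HDE or MiHDE algorithm itself; the function name, the `building_offset_deg` parameter, and the 90° default spacing are assumptions made for the example, and straight-path detection, confidence weighting, and the EKF are not shown.

```python
def nearest_dominant_error(heading_deg, building_offset_deg=0.0, step_deg=90.0):
    """Signed difference (degrees) between a heading and the nearest
    dominant direction, where dominant directions are spaced `step_deg`
    apart starting at `building_offset_deg`."""
    rel = (heading_deg - building_offset_deg) % step_deg  # in [0, step_deg)
    if rel > step_deg / 2:
        rel -= step_deg                                   # wrap to (-step/2, step/2]
    return rel


# A heading of 93 degrees is 3 degrees past the 90-degree corridor direction.
err_a = nearest_dominant_error(93.0)
# With 45-degree spacing (pathways also oriented at 45 degrees), a heading
# of 50 degrees is 5 degrees past the 45-degree direction.
err_b = nearest_dominant_error(50.0, step_deg=45.0)
```

A heading that truly lies between dominant directions, as on a curved corridor, still produces a small offset here, which is one way to picture the false matches the abstract warns about.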
Abstract:
Background: Gray scale images make up the bulk of data in bio-medical image analysis, and hence the main focus of many image processing tasks lies in the processing of these monochrome images. With ever-improving acquisition devices, spatial and temporal image resolution increases, and data sets become very large. Various image processing frameworks exist that make the development of new algorithms easy by using high-level programming languages or visual programming. These frameworks are also accessible to researchers who have little or no background in software development, because they take care of otherwise complex tasks. Specifically, the management of working memory is taken care of automatically, usually at the price of requiring more of it. As a result, processing large data sets with these tools becomes increasingly difficult on workstation-class computers. One alternative to using these high-level processing tools is the development of new algorithms in a language like C++, which gives the developer full control over how memory is handled, but the resulting workflow for the prototyping of new algorithms is rather time-intensive, and also not appropriate for a researcher with little or no knowledge of software development. Another alternative is to use command line tools that run image processing tasks, use the hard disk to store intermediate results, and provide automation through shell scripts. Although not as convenient as, e.g., visual programming, this approach is still accessible to researchers without a background in computer science. However, only a few tools exist that provide this kind of processing interface; they are usually quite task-specific, and they don't provide a clear approach when one wants to shape a new command line tool from a prototype shell script.
Results: The proposed framework, MIA, provides a combination of command line tools, plug-ins, and libraries that make it possible to run image processing tasks interactively in a command shell and to prototype by using the corresponding shell scripting language. Since the hard disk becomes the temporary storage, memory management is usually a non-issue in the prototyping phase. By using string-based descriptions for filters, optimizers, and the like, the transition from shell scripts to full-fledged programs implemented in C++ is also made easy. In addition, its design based on atomic plug-ins and single-task command line tools makes it easy to extend MIA, usually without the requirement to touch or recompile existing code. Conclusion: In this article, we describe the general design of MIA, a general-purpose framework for gray scale image processing. We demonstrated the applicability of the software with example applications from three different research scenarios, namely motion compensation in myocardial perfusion imaging, the processing of high-resolution image data that arises in virtual anthropology, and retrospective analysis of treatment outcome in orthognathic surgery. With MIA, prototyping algorithms by using shell scripts that combine small, single-task command line tools is a viable alternative to the use of high-level languages, an approach that is especially useful when large data sets need to be processed.
Abstract:
Thesis (Master's)--University of Washington, 2016-06
Abstract:
The n-tuple pattern recognition method has been tested using a selection of 11 large data sets from the European Community StatLog project, so that the results could be compared with those reported for the 23 other algorithms the project tested. The results indicate that this ultra-fast memory-based method is a viable competitor with the others, which include optimisation-based neural network algorithms, even though the theory of memory-based neural computing is less highly developed in terms of statistical theory.
Abstract:
We develop an approach for a sparse representation for Gaussian Process (GP) models in order to overcome the limitations of GPs caused by large data sets. The method is based on a combination of a Bayesian online algorithm together with a sequential construction of a relevant subsample of the data which fully specifies the prediction of the model. Experimental results on toy examples and large real-world datasets indicate the efficiency of the approach.
Abstract:
We develop an approach for sparse representations of Gaussian Process (GP) models (which are Bayesian types of kernel machines) in order to overcome their limitations for large data sets. The method is based on a combination of a Bayesian online algorithm together with a sequential construction of a relevant subsample of the data which fully specifies the prediction of the GP model. By using an appealing parametrisation and projection techniques that use the RKHS norm, recursions for the effective parameters and a sparse Gaussian approximation of the posterior process are obtained. This allows for the propagation of both predictions and Bayesian error measures. The significance and robustness of our approach are demonstrated in a variety of experiments.
Abstract:
We have recently developed a principled approach to interactive non-linear hierarchical visualization [8] based on the Generative Topographic Mapping (GTM). Hierarchical plots are needed when a single visualization plot is not sufficient (e.g. when dealing with large quantities of data). In this paper we extend our system by giving the user a choice of initializing the child plots of the current plot in either interactive or automatic mode. In the interactive mode the user interactively selects "regions of interest" as in [8], whereas in the automatic mode an unsupervised minimum message length (MML)-driven construction of a mixture of GTMs is used. The latter is particularly useful when the plots are covered with dense clusters of highly overlapping data projections, making it difficult to use the interactive mode. Such a situation often arises when visualizing large data sets. We illustrate our approach on a data set of 2300 18-dimensional points and mention an extension of our system to accommodate discrete data types.
Abstract:
Automatically generating maps of a measured variable of interest can be problematic. In this work we focus on the monitoring network context where observations are collected and reported by a network of sensors, and are then transformed into interpolated maps for use in decision making. Using traditional geostatistical methods, estimating the covariance structure of data collected in an emergency situation can be difficult. Variogram determination, whether by method-of-moment estimators or by maximum likelihood, is very sensitive to extreme values. Even when a monitoring network is in a routine mode of operation, sensors can sporadically malfunction and report extreme values. If this extreme data destabilises the model, causing the covariance structure of the observed data to be incorrectly estimated, the generated maps will be of little value, and the uncertainty estimates in particular will be misleading. Marchant and Lark [2007] propose a REML estimator for the covariance, which is shown to work on small data sets with a manual selection of the damping parameter in the robust likelihood. We show how this can be extended to allow treatment of large data sets together with an automated approach to all parameter estimation. The projected process kriging framework of Ingram et al. [2007] is extended to allow the use of robust likelihood functions, including the two component Gaussian and the Huber function. We show how our algorithm is further refined to reduce the computational complexity while at the same time minimising any loss of information. To show the benefits of this method, we use data collected from radiation monitoring networks across Europe. We compare our results to those obtained from traditional kriging methodologies and include comparisons with Box-Cox transformations of the data. 
We discuss the issue of whether to treat or ignore extreme values, making the distinction between the robust methods, which ignore outliers, and transformation methods, which treat them as part of the (transformed) process. Using a case study based on an extreme radiological event over a large area, we show how radiation data collected from monitoring networks can be analysed automatically and then used to generate reliable maps to inform decision making. We show the limitations of the methods and discuss potential extensions to remedy these.
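The Huber function named in this abstract can be illustrated with a short sketch. It grows quadratically for small residuals and only linearly for large ones, which is why an extreme sensor reading has less influence than under a Gaussian (squared-error) criterion. The function name and the default threshold `c = 1.345` are assumptions for the example (1.345 is a conventional tuning constant, not a value taken from the paper), and the projected process kriging framework itself is not shown.

```python
def huber(r, c=1.345):
    """Huber function of a residual r with threshold c:
    quadratic for |r| <= c, linear beyond it."""
    a = abs(r)
    if a <= c:
        return 0.5 * r * r
    return c * (a - 0.5 * c)


# Compare the penalty for an ordinary and an extreme residual (c = 2).
small = huber(1.0, c=2.0)       # same as the squared-error value 0.5 * 1**2
extreme = huber(4.0, c=2.0)     # linear branch: 2 * (4 - 1)
squared = 0.5 * 4.0 ** 2        # what a Gaussian criterion would charge
```

The two branches meet at |r| = c, so the function is continuous; choosing c is the analogue of selecting a damping parameter, which the abstract says is automated in the proposed approach.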
Abstract:
Development of mass spectrometry techniques to detect protein oxidation, which contributes to signalling and inflammation, is important. Label-free approaches have the advantage of reduced sample manipulation, but are challenging in complex samples owing to undirected analysis of large data sets using statistical search engines. To identify oxidised proteins in biological samples, we previously developed a targeted approach involving precursor ion scanning for diagnostic MS3 ions from oxidised residues. Here, we tested this approach for other oxidations, and compared it with an alternative approach involving the use of extracted ion chromatograms (XICs) generated from high-resolution MSMS data using very narrow mass windows. This accurate mass XIC data methodology was effective at identifying nitrotyrosine, chlorotyrosine, and oxidative deamination of lysine, and for tyrosine oxidations highlighted more modified peptide species than precursor ion scanning or statistical database searches. Although some false positive peaks still occurred in the XICs, these could be identified by comparative assessment of the peak intensities. The method has the advantage that a number of different modifications can be analysed simultaneously in a single LC-MSMS run. This article is part of a Special Issue entitled: Posttranslational Protein modifications in biology and Medicine. Biological significance: The use of accurate mass extracted product ion chromatograms to detect oxidised peptides could improve the identification of oxidatively damaged proteins in inflammatory conditions. © 2013 Elsevier B.V.
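The extracted ion chromatogram idea described above, summing only the signal that falls in a very narrow mass window around a target m/z, can be sketched as follows. This is a simplified illustration under stated assumptions: the function name, the in-memory scan layout, the parts-per-million tolerance, and all numeric values are hypothetical and are not taken from the paper or from any instrument software.

```python
def extract_xic(scans, target_mz, tol_ppm=5.0):
    """Build an extracted ion chromatogram.

    scans: list of (retention_time, [(mz, intensity), ...]) tuples.
    Returns a list of (retention_time, summed intensity within the window).
    """
    half_width = target_mz * tol_ppm * 1e-6
    lo, hi = target_mz - half_width, target_mz + half_width
    return [
        (rt, sum((inten for mz, inten in peaks if lo <= mz <= hi), 0.0))
        for rt, peaks in scans
    ]


# Hypothetical scans: only the first peak lies within 10 ppm of m/z 500.0
# (a window of +/- 0.005).
scans = [
    (1.0, [(500.001, 100.0), (500.02, 50.0)]),
    (2.0, [(499.9, 10.0)]),
]
xic = extract_xic(scans, target_mz=500.0, tol_ppm=10.0)
```

A narrower window rejects more interfering ions but demands better mass accuracy, which is why the approach depends on high-resolution data.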
Abstract:
Purpose: To describe and validate bespoke software designed to extract morphometric data from ciliary muscle Visante Anterior Segment Optical Coherence Tomography (AS-OCT) images. Method: Initially, to ensure the software was capable of appropriately applying tiered refractive index corrections and accurately measuring orthogonal and oblique parameters, 5 sets of custom-made rigid gas-permeable lenses aligned to simulate the sclera and ciliary muscle were imaged by the Visante AS-OCT and analysed by the software. Human temporal ciliary muscle data from 50 participants, extracted via the internal Visante AS-OCT caliper method and via the software, were compared. The repeatability of the software was also investigated by imaging the temporal ciliary muscle of 10 participants on 2 occasions. Results: The mean difference between the software and the absolute thickness measurements of the rigid gas-permeable lenses was not statistically significantly different from 0 (t = -1.458, p = 0.151). Good correspondence was observed between human ciliary muscle measurements obtained by the software and the internal Visante AS-OCT calipers (maximum thickness t = -0.864, p = 0.392; total length t = 0.860, p = 0.394). The software extracted highly repeatable ciliary muscle measurements (variability ≤6% of mean value). Conclusion: The bespoke software is capable of extracting accurate and repeatable ciliary muscle measurements and is suitable for analysing large data sets.