960 results for "data sets"


Relevance: 70.00%

Abstract:

In the first part of the thesis we explore three fundamental questions that arise naturally when we conceive a machine learning scenario where the training and test distributions can differ. Contrary to conventional wisdom, we show that mismatched training and test distributions can in fact yield better out-of-sample performance. This optimal performance is obtained by training with the dual distribution, which depends on the test distribution set by the problem but not on the target function that we want to learn. We show how to obtain this distribution in both discrete and continuous input spaces, as well as how to approximate it in a practical scenario. The benefits of using this distribution are exemplified on both synthetic and real data sets.

In order to apply the dual distribution in the supervised learning scenario where the training data set is fixed, it is necessary to use weights to make the sample appear as if it came from the dual distribution. We explore the negative effect that weighting a sample can have. The theoretical decomposition of the use of weights regarding its effect on the out-of-sample error is easy to understand but not actionable in practice, as the quantities involved cannot be computed. Hence, we propose the Targeted Weighting algorithm that determines if, for a given set of weights, the out-of-sample performance will improve or not in a practical setting. This is necessary as the setting assumes there are no labeled points distributed according to the test distribution, only unlabeled samples.
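The weighting step described above can be sketched numerically. The snippet below is a minimal illustration of covariate-shift importance weighting with made-up Gaussian distributions standing in for the training and target distributions; it does not reproduce the thesis's dual distribution or the Targeted Weighting algorithm.

```python
# Hedged sketch of covariate-shift importance weighting: reweight a fixed
# training sample drawn from p(x) so that weighted averages behave as if the
# sample came from a target distribution q(x). The Gaussians p and q are
# illustrative assumptions; this is not the thesis's dual distribution or
# the Targeted Weighting algorithm.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
x = rng.normal(0, 1, 5000)            # training sample from p = N(0, 1)
w = norm.pdf(x, loc=1) / norm.pdf(x)  # importance weights w(x) = q(x) / p(x)

# the weighted sample mean approximates the mean under q (here, 1)
weighted_mean = np.average(x, weights=w)
print(round(weighted_mean, 2))
```

The same weights can be attached to a loss function during training, which is exactly the step whose negative effects the Targeted Weighting algorithm is designed to detect.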

Finally, we propose a new class of matching algorithms that can be used to match the training set to a desired distribution, such as the dual distribution (or the test distribution). These algorithms can be applied to very large data sets, and we show how they lead to improved performance on a large real data set, the Netflix dataset. Their low computational complexity is their main advantage over previous algorithms proposed in the covariate shift literature.

In the second part of the thesis we apply machine learning to the problem of behavior recognition. We develop a specific behavior classifier to study fly aggression, and we develop a system that allows behavior to be analyzed in videos of animals with minimal supervision. The system, which we call CUBA (Caltech Unsupervised Behavior Analysis), detects movemes, actions, and stories from time series describing the position of animals in videos. The method both summarizes the data and provides biologists with a mathematical tool to test new hypotheses. Other benefits of CUBA include finding classifiers for specific behaviors without the need for annotation, as well as providing means to discriminate groups of animals, for example according to their genetic line.

Relevance: 70.00%

Abstract:

The report describes the results of preliminary analyses of data obtained from a series of water temperature loggers sited at various distances (0.8 to 21.8 km) downstream of Kielder dam on the River North Tyne and in two natural tributaries. The report deals with three aspects of the water temperature records: an analysis of an operational aspect of the data sets for selected stations; a simple examination of the effects of impoundment upon water temperature at or close to the point of release, relative to natural river temperatures; and an examination of the rate of change of monthly means of daily mean, maximum, minimum, and range (maximum minus minimum) with distance downstream of the point of release during 1983.

Relevance: 70.00%

Abstract:

This report describes the general background to the project, defines the stations from which data sets have been obtained, and lists the available data. The project had the following aims: to develop a more accurate and less labour-intensive system for the collection and processing of water temperature data from a number of stations within a stream/river system, and to use the River North Tyne downstream of the Kielder impoundment as a test bed for the system. This should yield useful information on the effects of impoundment upon downstream water temperatures.

Relevance: 70.00%

Abstract:

Research on assessment and monitoring methods has primarily focused on fisheries with long multivariate data sets. Less research exists on methods applicable to data-poor fisheries with univariate data sets with a small sample size. In this study, we examine the capabilities of seasonal autoregressive integrated moving average (SARIMA) models to fit, forecast, and monitor the landings of such data-poor fisheries. We use a European fishery on meagre (Sciaenidae: Argyrosomus regius), where only a short time series of landings was available to model (n=60 months), as our case study. We show that despite the limited sample size, a SARIMA model could be found that adequately fitted and forecasted the time series of meagre landings (12-month forecasts; mean error: 3.5 tons (t); annual absolute percentage error: 15.4%). We derive model-based prediction intervals and show how they can be used to detect problematic situations in the fishery. Our results indicate that over the course of one year the meagre landings remained within the prediction limits of the model and therefore indicated no need for urgent management intervention. We discuss the information that SARIMA model structure conveys about the meagre lifecycle and fishery, the methodological requirements of SARIMA forecasting of data-poor fisheries landings, and the capabilities SARIMA models present within current efforts to monitor the world's most data-poor resources.

Relevance: 70.00%

Abstract:

In the problem of one-class classification (OCC), one of the classes, the target class, has to be distinguished from all other possible objects, considered as nontargets. This situation arises in many biomedical problems, for example in diagnosis, image-based tumor recognition, or the analysis of electrocardiogram data. In this paper an approach to OCC based on a typicality test is experimentally compared with reference state-of-the-art OCC techniques (Gaussian, mixture of Gaussians, naive Parzen, Parzen, and support vector data description) using biomedical data sets. We evaluate the ability of the procedures using twelve experimental data sets with not necessarily continuous data. As there are few benchmark data sets for one-class classification, all data sets considered in the evaluation have multiple classes. Each class in turn is considered as the target class, and the units in the other classes are considered as new units to be classified. The results of the comparison show the good performance of the typicality approach, which remains applicable to high-dimensional data; it is worth mentioning that it can be used for any kind of data (continuous, discrete, or nominal), whereas the application of state-of-the-art approaches is not straightforward when nominal variables are present.
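As a minimal sketch of the simplest reference technique named above, the Gaussian one-class classifier, the snippet below fits a Gaussian to target-class data only and accepts a new point when its squared Mahalanobis distance falls below a chi-square quantile. The data and the 95% threshold are illustrative assumptions; this is not the paper's typicality test.

```python
# Minimal sketch of the Gaussian one-class classifier used as a baseline in
# the abstract: fit a Gaussian to the target class only and accept a new
# point if its squared Mahalanobis distance falls below a chi-square
# quantile. The data and the 95% threshold are illustrative assumptions.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
target = rng.normal(0, 1, size=(200, 2))  # training data: target class only

mu = target.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(target, rowvar=False))
threshold = chi2.ppf(0.95, df=target.shape[1])  # 95% acceptance region

def is_target(x):
    d = x - mu
    return d @ cov_inv @ d <= threshold  # squared Mahalanobis distance

print(is_target(np.array([0.0, 0.0])))  # near the training mean
print(is_target(np.array([8.0, 8.0])))  # far outlier, rejected as nontarget
```

Note how only target-class data enters the fit, which is the defining constraint of OCC: nontargets are never seen during training.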

Relevance: 70.00%

Abstract:

EXTRACT (SEE PDF FOR FULL ABSTRACT): This is a preliminary presentation of observations made at points spread across Mexico. The amount of existing data is large enough that an atlas was published in 1977. This atlas contains information that goes back to the country's earliest records. The original data sets from which the atlas was compiled exist in a variety of storage forms, ranging from simple paper records to books and magnetic tapes.

Relevance: 70.00%

Abstract:

Freehand 3D ultrasound can be acquired without a position sensor by finding the separations of pairs of frames using information in the images themselves. Previous work has not considered how to reconstruct entirely freehand data, which can exhibit irregularly spaced frames, non-monotonic out-of-plane probe motion, and significant in-plane motion. This paper presents reconstruction methods that overcome these limitations and are able to robustly reconstruct freehand data. The methods are assessed on freehand data sets and compared to reconstructions obtained using a position sensor.

Relevance: 70.00%

Abstract:

Harmful algal blooms (HABs) are a significant and potentially expanding problem around the world. Resource management and public health protection require sufficient information to reduce the impacts of HABs through response strategies, warnings, and advisories. To be effective, these programs can best be served by an integration of improved detection methods with both evolving monitoring systems and new communications capabilities. Data sets are typically collected from a variety of sources and can be grouped into several types: point data, such as water samples; transect data, such as from shipboard continuous sampling; and synoptic data, such as from satellite imagery. Generating a field of the HAB distribution requires all of these sampling approaches, which means that the data sets need to be interpreted and analyzed together to create the field, or distribution, of the HAB. The HAB field is also a necessary input into models that forecast blooms. Several systems have developed strategies that demonstrate these approaches, ranging from data sets collected at key sites, such as swimming beaches, to automated collection systems, to the integration of interpreted satellite data. Improved data collection, particularly in speed and cost, will be one of the advances of the next few years. Methods to improve the creation of the HAB field from the variety of data types will be necessary for routine nowcasting and forecasting of HABs.

Relevance: 70.00%

Abstract:

In the face of dramatic declines in groundfish populations and a lack of sufficient stock assessment information, a need has arisen for new methods of assessing groundfish populations. We describe the integration of seafloor transect data gathered by a manned submersible with high-resolution sonar imagery to produce a habitat-based stock assessment system for groundfish. The data sets used in this study were collected from Heceta Bank, Oregon, and were derived from 42 submersible dives (1988–90) and a multibeam sonar survey (1998). The submersible habitat survey investigated seafloor topography and groundfish abundance along 30-minute transects over six predetermined stations and found a statistical relationship between habitat variability and groundfish distribution and abundance. These transects were analyzed in a geographic information system (GIS) by using dynamic segmentation to display changes in habitat along the transects. We used the submersible data to extrapolate fish abundance within uniform habitat patches over broad areas of the bank by means of a habitat classification based on the sonar imagery. After applying a navigation correction to the submersible-based habitat segments, a good correspondence with major backscatter and topographic boundaries on the imagery was apparent. Extrapolation of the extent of uniform habitats was made in the vicinity of the dive stations, and a preliminary stock assessment of several species of demersal fish was calculated. Such a habitat-based approach will allow researchers to characterize marine communities over large areas of the seafloor.

Relevance: 70.00%

Abstract:

I simulated somatic growth and accompanying otolith growth using an individual-based bioenergetics model in order to examine the performance of several back-calculation methods. Four shapes of otolith radius-total length relations (OR-TL) were simulated. Ten different back-calculation equations, two different regression models of radius length, and two schemes of annulus selection were examined, for a total of 20 different methods to estimate size at age from simulated data sets of length and annulus measurements. The accuracy of each of the 20 methods was evaluated by comparing the back-calculated length-at-age with the true length-at-age. The best back-calculation technique was directly related to how well the OR-TL model fitted. When the OR-TL was sigmoid shaped and all annuli were used, employing a least squares linear regression coupled with a log-transformed Lee back-calculation equation (y-intercept corrected) resulted in the least error; when only the last annulus was used, employing a direct proportionality back-calculation equation resulted in the least error. When the OR-TL was linear, employing a functional regression coupled with the Lee back-calculation equation resulted in the least error, both when all annuli were used and when only the last annulus was used. If the OR-TL was exponentially shaped, direct substitution into the fitted quadratic equation resulted in the least error, both when all annuli were used and when only the last annulus was used. Finally, an asymptotically shaped OR-TL was best modeled by the individually corrected Weibull cumulative distribution function, both when all annuli were used and when only the last annulus was used.
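Two of the back-calculation equations named above can be stated compactly. The sketch below implements the direct proportionality (Dahl-Lea) equation and the y-intercept-corrected Lee (Fraser-Lee) equation in their standard textbook forms; the numeric values are illustrative, not from the simulations.

```python
# Sketch of two back-calculation equations named in the abstract, in their
# standard textbook forms. Symbols: Lc = fish length at capture, Oc = otolith
# radius at capture, Oi = otolith radius at annulus i, c = y-intercept of the
# length-on-radius regression. The numeric values are illustrative.

def direct_proportion(Lc, Oc, Oi):
    """Direct proportionality (Dahl-Lea): Li = Lc * Oi / Oc."""
    return Lc * Oi / Oc

def fraser_lee(Lc, Oc, Oi, c):
    """Y-intercept corrected Lee (Fraser-Lee): Li = c + (Lc - c) * Oi / Oc."""
    return c + (Lc - c) * Oi / Oc

# a 300 mm fish with otolith radius 1.5 mm; annulus 1 sits at radius 0.5 mm
print(direct_proportion(300, 1.5, 0.5))  # 100.0 mm
print(fraser_lee(300, 1.5, 0.5, 30))     # 30 + 270 * 0.5 / 1.5 = 120.0 mm
```

The y-intercept c is where the choice of regression model (least squares vs. functional) enters, which is why the study treats the regression model and the back-calculation equation as separate design choices.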

Relevance: 70.00%

Abstract:

This is an interim report for a study of mussel recovery and species dynamics at four California rocky intertidal sites. Conducted by Kinnetic Laboratories, Inc. (KLI), and funded by the Minerals Management Service (MMS), the initial experimental field study began in spring 1985 and continued through spring 1991. The initial field study included six sites along the central and northern California coast. In 1992, MMS decided to continue the work started by KLI through an in-house study and the establishment of the MMS Intertidal (MINT) team. Four of the original six sites have been continued by MMS. The methods of the original study have been retained by the MINT team, and close coordination with the original KLI team continues. In 1994, the MMS Environmental Studies Program officially awarded a contract to the MINT team for this in-house study. This interim report presents the results from the fall 1992 sampling, the first year of sampling by the MINT team. The report presents a limited statistical analysis and visual comparison of the 1992 data. The next interim report will include data collected during fall 1994 and will present a broader statistical analysis of both the 1992 and 1994 data sets.

Relevance: 70.00%

Abstract:

There is increasing evidence that many of the mitochondrial DNA (mtDNA) databases published in the fields of forensic science and molecular anthropology are flawed. An a posteriori phylogenetic analysis of the sequences could help to eliminate most of the errors and thus greatly improve data quality. However, previously published caveats and recommendations along these lines have not yet been taken up by all researchers. Here we call for stringent quality control of mtDNA data by haplogroup-directed database comparisons. We take some problematic databases of East Asian mtDNAs, published in the Journal of Forensic Sciences and Forensic Science International, as examples to demonstrate the process of pinpointing obvious errors. Our results show that data sets are not only notoriously plagued by base shifts and artificial recombination but also by lab-specific phantom mutations, especially in the second hypervariable region (HVR-II). (C) 2003 Elsevier Ireland Ltd. All rights reserved.

Relevance: 70.00%

Abstract:

We live in an era of abundant data. This has necessitated the development of new and innovative statistical algorithms to get the most from experimental data. For example, faster algorithms make practical the analysis of larger genomic data sets, allowing us to extend the utility of cutting-edge statistical methods. We present a randomised algorithm that accelerates the clustering of time series data using the Bayesian Hierarchical Clustering (BHC) statistical method. BHC is a general method for clustering any discretely sampled time series data. In this paper we focus on a particular application to microarray gene expression data. We define and analyse the randomised algorithm, before presenting results on both synthetic and real biological data sets. We show that the randomised algorithm leads to substantial gains in speed with minimal loss in clustering quality. The randomised time series BHC algorithm is available as part of the R package BHC, which is available for download from Bioconductor (version 2.10 and above) via http://bioconductor.org/packages/2.10/bioc/html/BHC.html. We have also made available a set of R scripts which can be used to reproduce the analyses carried out in this paper. These are available from https://sites.google.com/site/randomisedbhc/.
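BHC itself is distributed as an R package; as a language-neutral illustration of the underlying idea of agglomerative clustering of discretely sampled time series, the sketch below uses scipy's (non-Bayesian) hierarchical clustering on synthetic profiles. It illustrates the task, not the BHC algorithm.

```python
# Illustrative sketch: agglomerative clustering of short synthetic time
# series with scipy. BHC replaces the distance-based merge criterion used
# here with a Bayesian model comparison; this sketch shows only the generic
# task, not the paper's method. All data are made up.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(2)
t = np.linspace(0, 1, 10)
# two groups of noisy "expression profiles": rising vs falling over time
rising = t + rng.normal(0, 0.05, size=(5, 10))
falling = 1 - t + rng.normal(0, 0.05, size=(5, 10))
series = np.vstack([rising, falling])

tree = linkage(series, method="average", metric="euclidean")
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)  # the rising series share one label, the falling the other
```

The randomised speed-up the paper describes would replace the exhaustive pairwise merge search inside such a clustering loop with a randomised subset search.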

Relevance: 70.00%

Abstract:

Decision tree classification algorithms have significant potential for land cover mapping problems and have not been tested in detail by the remote sensing community relative to more conventional pattern recognition techniques such as maximum likelihood classification. In this paper, we present several types of decision tree classification algorithms and evaluate them on three different remote sensing data sets. The decision tree classification algorithms tested include a univariate decision tree, a multivariate decision tree, and a hybrid decision tree capable of including several different types of classification algorithms within a single decision tree structure. Classification accuracies produced by each of these decision tree algorithms are compared with both maximum likelihood and linear discriminant function classifiers. Results from this analysis show that the decision tree algorithms consistently outperform the maximum likelihood and linear discriminant function classifiers with regard to classification accuracy. In particular, the hybrid tree consistently produced the highest classification accuracies for the data sets tested. More generally, the results from this work show that decision trees have several advantages for remote sensing applications by virtue of their relatively simple, explicit, and intuitive classification structure. Further, decision tree algorithms are strictly nonparametric and, therefore, make no assumptions regarding the distribution of input data, and are flexible and robust with respect to nonlinear and noisy relations among input features and class labels.
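As a small illustration of the kind of comparison described above, the sketch below trains a univariate decision tree and a linear discriminant classifier on synthetic stand-ins for multispectral pixel data, using scikit-learn. It does not reproduce the paper's hybrid tree, its maximum likelihood classifier, or its data sets.

```python
# Hedged sketch of the comparison in the abstract: a univariate decision
# tree vs. a linear discriminant classifier. The data are synthetic stand-ins
# for multispectral pixels (6 "bands", 3 land cover classes), not the
# paper's remote sensing data sets.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=6, n_informative=4,
                           n_classes=3, n_clusters_per_class=1,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)
lda = LinearDiscriminantAnalysis().fit(X_tr, y_tr)

print(f"decision tree accuracy: {tree.score(X_te, y_te):.2f}")
print(f"LDA accuracy:           {lda.score(X_te, y_te):.2f}")
```

The tree's fitted structure (a sequence of single-band thresholds) is what the abstract means by a "simple, explicit, and intuitive classification structure"; unlike LDA, it makes no distributional assumption about the input bands.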

Relevance: 70.00%

Abstract:

Mark Pagel and Andrew Meade (2004). A phylogenetic mixture model for detecting pattern-heterogeneity in gene sequence or character-state data. Systematic Biology, 53(4), 571-581.