9 results for data sets

in the UNESP Institutional Repository - Universidade Estadual Paulista "Julio de Mesquita Filho"


Relevance: 100.00%

Abstract:

Empirical phylogeographic studies have progressively sampled greater numbers of loci over time, in part motivated by theoretical papers showing that estimates of key demographic parameters improve as the number of loci increases. Recently, next-generation sequencing has been applied to questions about organismal history, with the promise of revolutionizing the field. However, no systematic assessment of how phylogeographic data sets have changed over time with respect to overall size and information content has been performed. Here, we quantify the changing nature of these genetic data sets over the past 20 years, focusing on papers published in Molecular Ecology. We found that the number of independent loci, the total number of alleles sampled and the total number of single nucleotide polymorphisms (SNPs) per data set have increased over time, with particularly dramatic increases within the past 5 years. Interestingly, uniparentally inherited organellar markers (e.g. animal mitochondrial and plant chloroplast DNA) continue to represent an important component of phylogeographic data. Single-species studies (cf. comparative studies) that focus on vertebrates (particularly fish and, to some extent, birds) represent the gold standard of phylogeographic data collection. Based on the current trajectory seen in our survey data, forecast modelling indicates that the median number of SNPs per data set for studies published by the end of the year 2016 may approach ~20,000. This survey provides baseline information for understanding the evolution of phylogeographic data sets and underscores the fact that development of analytical methods for handling very large genetic data sets will be critical for facilitating growth of the field.

Relevance: 70.00%

Abstract:

The objective of the present study was to investigate the effect of data structure on estimated genetic parameters and predicted breeding values of direct and maternal genetic effects for weaning weight (WW) and weight gain from birth to weaning (BWG), with or without the genetic covariance between direct and maternal effects. Records of 97,490 Nellore animals born between 1993 and 2006, from the Jacarezinho cattle raising farm, were used. Two different data sets were analyzed: DI_all, which included all available progenies of dams without their own performance; DII_all, which included DI_all + 20% of recorded progenies with maternal phenotypes. Two subsets were obtained from each data set (DI_all and DII_all): DI_1 and DII_1, which included only dams with three or fewer progenies; DI_5 and DII_5, which included only dams with five or more progenies. (Co)variance components and heritabilities were estimated by Bayesian inference through Gibbs sampling using univariate animal models. In general, for the population and traits studied, the proportion of dams with known phenotypic information and the number of progenies per dam influenced direct and maternal heritabilities, as well as the contribution of maternal permanent environmental variance to phenotypic variance. Only small differences were observed in the genetic and environmental parameters when the genetic covariance between direct and maternal effects was set to zero in the data sets studied. Thus, including or excluding the genetic covariance between direct and maternal effects had little effect on the ranking of animals according to their breeding values for WW and BWG. Accurate estimation of genetic correlations between direct and maternal genetic effects depends on the data structure. Thus, this covariance should be set to zero in Nellore data sets in which the proportion of dams with phenotypic information is low, the number of progenies per dam is small, and pedigree relationships are poorly known.
© 2012 Elsevier B.V. All rights reserved.

Relevance: 70.00%

Abstract:

In livestock genetic resource conservation, decision making about conservation priorities is based on the simultaneous analysis of several different criteria that may contribute to long-term sustainable breeding conditions, such as genetic and demographic characteristics, environmental conditions, and role of the breed in the local or regional economy. Here we address methods to integrate different data sets and highlight problems related to interdisciplinary comparisons. Data integration is based on the use of geographic coordinates and Geographic Information Systems (GIS). In addition to technical problems related to projection systems, GIS have to face the challenging issue of the non-homogeneous scale of their data sets. We give examples of the successful use of GIS for data integration and examine the risk of obtaining biased results when integrating data sets that have been captured at different scales.

Relevance: 70.00%

Abstract:

This article introduces the software program called EthoSeq, which is designed to extract probabilistic behavioral sequences (tree-generated sequences, or TGSs) from observational data and to prepare a TGS-species matrix for phylogenetic analysis. The program uses Graph Theory algorithms to automatically detect behavioral patterns within the observational sessions. It includes filtering tools to adjust the search procedure to user-specified statistical needs. Preliminary analyses of data sets, such as grooming sequences in birds and foraging tactics in spiders, uncover a large number of TGSs which together yield single phylogenetic trees. An example of the use of the program is our analysis of felid grooming sequences, in which we have obtained 1,386 felid grooming TGSs for seven species, resulting in a single phylogeny. These results show that behavior is definitely useful in phylogenetic analysis. EthoSeq simplifies and automates such analyses, uncovers many of the hidden patterns of long behavioral sequences, and prepares these data for further analysis with standard phylogenetic programs. We hope it will encourage many empirical studies on the evolution of behavior.

Relevance: 70.00%

Abstract:

The identification of non-technical losses has been paramount in the last decade. Since we have data sets with hundreds of legal and illegal profiles, a method is needed to group data into subprofiles in order to narrow the search for consumers who commit major fraud. In this context, an electric power company may be interested in examining a specific profile of illegal consumer in greater depth. In this paper, we introduce the Optimum-Path Forest (OPF) clustering technique to this task, and we evaluate its behavior on a data set provided by a Brazilian electric power company with different values of an OPF parameter. © 2011 IEEE.

Relevance: 70.00%

Abstract:

This paper presents a method for indirect orientation of aerial images using ground control lines extracted from airborne laser system (ALS) data. This data integration strategy has shown good potential in the automation of photogrammetric tasks, including the indirect orientation of images. The most important characteristic of the proposed approach is that the exterior orientation parameters (EOP) of a single or multiple images can be automatically computed with a space resection procedure from data derived from different sensors. The suggested method works as follows. First, the straight lines are automatically extracted in the digital aerial image (s) and in the intensity image derived from an ALS data set (S). Then, correspondence between s and S is automatically determined. A line-based coplanarity model that establishes the relationship between straight lines in the object and in the image space is used to estimate the EOP with the iterated extended Kalman filtering (IEKF). Implementation and testing of the method have employed data from different sensors. Experiments were conducted to assess the proposed method, and the results obtained showed that the estimation of the EOP is a function of ALS positional accuracy.

Relevance: 70.00%

Abstract:

We present the results of the combination of searches for the standard model Higgs boson produced in association with a W or Z boson and decaying into bb̄ using the data sample collected with the D0 detector in pp̄ collisions at √s = 1.96 TeV at the Fermilab Tevatron Collider. We derive 95% C.L. upper limits on the Higgs boson cross section relative to the standard model prediction in the mass range 100 GeV ≤ M_H ≤ 150 GeV, and we exclude Higgs bosons with masses smaller than 102 GeV at the 95% C.L. In the mass range 120 GeV ≤ M_H ≤ 145 GeV, the data exhibit an excess above the background prediction with a global significance of 1.5 standard deviations, consistent with the expectation in the presence of a standard model Higgs boson. © 2012 American Physical Society.

Relevance: 70.00%

Abstract:

Semi-supervised learning is applied to classification problems where only a small portion of the data items is labeled. In these cases, the reliability of the labels is a crucial factor, because mislabeled items may propagate wrong labels to a large portion or even the entire data set. This paper aims to address this problem by presenting a graph-based (network-based) semi-supervised learning method, specifically designed to handle data sets with mislabeled samples. The method uses teams of walking particles, with competitive and cooperative behavior, for label propagation in the network constructed from the input data set. The proposed model is nature-inspired and it incorporates some features to make it robust to a considerable amount of mislabeled data items. Computer simulations show the performance of the method in the presence of different percentages of mislabeled data, in networks of different sizes and average node degrees. Importantly, these simulations reveal the existence of critical points of the mislabeled subset size, below which the network is free of wrong label contamination, but above which the mislabeled samples start to propagate their labels to the rest of the network. Moreover, numerical comparisons have been made between the proposed method and other representative graph-based semi-supervised learning methods using both artificial and real-world data sets. Interestingly, the proposed method performs increasingly better than the others as the percentage of mislabeled samples grows. © 2012 IEEE.

Relevance: 70.00%

Abstract:

In the instrumental records of daily precipitation, we often encounter one or more periods in which values below some threshold were not registered. Such periods, besides lacking small values, also have a large number of dry days. Their cumulative distribution function is shifted to the right in relation to that for other portions of the record having more reliable observations. Such problems are examined in this work, based mostly on the two-sample Kolmogorov–Smirnov (KS) test, where the portion of the series with a greater number of dry days is compared with the portion with fewer dry days. Another relatively common problem in daily rainfall data is the prevalence of integers either throughout the period of record or in some part of it, likely resulting from truncation during data compilation prior to archiving or from coarse rounding of daily readings by observers. This problem is identified by simple calculation of the proportion of integers in the series, taking the expected proportion as 10%. The above two procedures were applied to the daily rainfall data sets from the European Climate Assessment (ECA), Southeast Asian Climate Assessment (SACA), and Brazilian Water Resources Agency (BRA). Taking the statistic D of the KS test >0.15 and the corresponding p-value <0.001 as the condition to classify a given series as suspicious, the proportions of the ECA, SACA, and BRA series falling into this category are, respectively, 34.5%, 54.3%, and 62.5%. With regard to the coarse rounding problem, the proportions of series exceeding twice the 10% reference level are 3%, 60%, and 43% for the ECA, SACA, and BRA data sets, respectively. A simple way to visualize the two problems addressed here is by plotting the time series of daily rainfall for a limited range, for instance, 0–10 mm day−1.
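
The two checks described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names are hypothetical, the p-value uses the standard asymptotic Kolmogorov series rather than whatever the paper used, and the integer proportion is computed over wet days only, which is an assumption the abstract does not state. Only the thresholds (D > 0.15, p < 0.001, twice the 10% reference level) come from the abstract.

```python
import math

def ks_two_sample(a, b):
    """Two-sample KS statistic D and an asymptotic p-value (approximation)."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    # Walk both sorted samples and track the largest gap between the two ECDFs.
    while i < n and j < m:
        x = min(a[i], b[j])
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.2:
        # The series below converges poorly here; the true value is close to 1.
        return d, 1.0
    # Kolmogorov series: Q(lam) = 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 lam^2)
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                  for k in range(1, 101))
    return d, min(1.0, max(0.0, p))

def is_suspicious(a, b, d_min=0.15, p_max=0.001):
    """Apply the abstract's rule: D > 0.15 and p-value < 0.001."""
    d, p = ks_two_sample(a, b)
    return d > d_min and p < p_max

def integer_proportion(values):
    """Share of whole-number readings among wet days (assumption: zeros excluded)."""
    wet = [v for v in values if v > 0]
    if not wet:
        return 0.0
    return sum(1 for v in wet if float(v).is_integer()) / len(wet)

def is_coarsely_rounded(values, expected=0.10):
    """Flag a series whose integer share exceeds twice the 10% reference level."""
    return integer_proportion(values) > 2 * expected

# Made-up data: portion_a has many dry days and no small values.
portion_a = [0.0] * 80 + [5.0 + k for k in range(20)]
portion_b = [0.0] * 50 + [k / 10 for k in range(1, 51)]
d_stat, p_value = ks_two_sample(portion_a, portion_b)
flag = is_suspicious(portion_a, portion_b)
```

In this made-up example the two portions differ most at zero rainfall, where 80% of one portion but only 50% of the other is dry, so the rule flags the series as suspicious.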