978 resultados para Imbalanced datasets


Relevância:

20.00% 20.00%

Publicador:

Resumo:

In this paper we propose a novel scheme for carrying out speaker diarization in an iterative manner. We aim to show that the information obtained through the first pass of speaker diarization can be reused to refine and improve the original diarization results. We call this technique speaker rediarization and demonstrate the practical application of our rediarization algorithm using a large archive of two-speaker telephone conversation recordings. We use the NIST 2008 SRE summed telephone corpora for evaluating our speaker rediarization system. This corpus contains recurring speaker identities across independent recording sessions that need to be linked across the entire corpus. We show that our speaker rediarization scheme can take advantage of inter-session speaker information, linked in the initial diarization pass, to achieve a 30% relative improvement over the original diarization error rate (DER) after only two iterations of rediarization.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Background Australian national biomonitoring for persistent organic pollutants (POPs) relies upon age-specific pooled serum samples to characterize central tendencies of concentrations but does not provide estimates of upper bound concentrations. This analysis compares population variation from biomonitoring datasets from the US, Canada, Germany, Spain, and Belgium to identify and test patterns potentially useful for estimating population upper bound reference values for the Australian population. Methods Arithmetic means and the ratio of the 95th percentile to the arithmetic mean (P95:mean) were assessed by survey for defined age subgroups for three polychlorinated biphenyls (PCBs 138, 153, and 180), hexachlorobenzene (HCB), p,p-dichlorodiphenyldichloroethylene (DDE), 2,2′,4,4′ tetrabrominated diphenylether (PBDE 47), perfluorooctanoic acid (PFOA) and perfluorooctane sulfonate (PFOS). Results Arithmetic mean concentrations of each analyte varied widely across surveys and age groups. However, P95:mean ratios differed to a limited extent, with no systematic variation across ages. The average P95:mean ratios were 2.2 for the three PCBs and HCB; 3.0 for DDE; 2.0 and 2.3 for PFOA and PFOS, respectively. The P95:mean ratio for PBDE 47 was more variable among age groups, ranging from 2.7 to 4.8. The average P95:mean ratios accurately estimated age group-specific P95s in the Flemish Environmental Health Survey II and were used to estimate the P95s for the Australian population by age group from the pooled biomonitoring data. Conclusions Similar population variation patterns for POPs were observed across multiple surveys, even when absolute concentrations differed widely. These patterns can be used to estimate population upper bounds when only pooled sampling data are available.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

BACKGROUND Endometriosis is a heritable common gynaecological condition influenced by multiple genetic and environmental factors. Genome-wide association studies (GWASs) have proved successful in identifying common genetic variants of moderate effects for various complex diseases. To date, eight GWAS and replication studies from multiple populations have been published on endometriosis. In this review, we investigate the consistency and heterogeneity of the results across all the studies and their implications for an improved understanding of the aetiology of the condition. METHODS Meta-analyses were conducted on four GWASs and four replication studies including a total of 11 506 cases and 32 678 controls, and on the subset of studies that investigated associations for revised American Fertility Society (rAFS) Stage III/IV including 2859 cases. The datasets included 9039 cases and 27 343 controls of European (Australia, Belgium, Italy, UK, USA) and 2467 cases and 5335 controls of Japanese ancestry. Fixed and Han and Elkin random-effects models, and heterogeneity statistics (Cochran's Q test), were used to investigate the evidence of the nine reported genome-wide significant loci across datasets and populations. RESULTS Meta-analysis showed that seven out of nine loci had consistent directions of effect across studies and populations, and six out of nine remained genome-wide significant (P < 5 × 10(-8)), including rs12700667 on 7p15.2 (P = 1.6 × 10(-9)), rs7521902 near WNT4 (P = 1.8 × 10(-15)), rs10859871 near VEZT (P = 4.7 × 10(-15)), rs1537377 near CDKN2B-AS1 (P = 1.5 × 10(-8)), rs7739264 near ID4 (P = 6.2 × 10(-10)) and rs13394619 in GREB1 (P = 4.5 × 10(-8)). In addition to the six loci, two showed borderline genome-wide significant associations with Stage III/IV endometriosis, including rs1250248 in FN1 (P = 8 × 10(-8)) and rs4141819 on 2p14 (P = 9.2 × 10(-8)). Two independent inter-genic loci, rs4141819 and rs6734792 on chromosome 2, showed significant evidence of heterogeneity across datasets (P < 0.005). Eight of the nine loci had stronger effect sizes among Stage III/IV cases, implying that they are likely to be implicated in the development of moderate to severe, or ovarian, disease. While three out of nine loci were inter-genic, the remaining were in or near genes with known functions of biological relevance to endometriosis, varying from roles in developmental pathways to cellular growth/carcinogenesis. CONCLUSIONS Our meta-analysis shows remarkable consistency in endometriosis GWAS results across studies, with little evidence of population-based heterogeneity. They also show that the phenotypic classifications used in GWAS to date have been limited. Stronger associations with Stage III/IV disease observed for most loci emphasize the importance for future studies to include detailed sub-phenotype information. Functional studies in relevant tissues are needed to understand the effect of the variants on downstream biological pathways.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Marker ordering during linkage map construction is a critical component of QTL mapping research. In recent years, high-throughput genotyping methods have become widely used, and these methods may generate hundreds of markers for a single mapping population. This poses problems for linkage analysis software because the number of possible marker orders increases exponentially as the number of markers increases. In this paper, we tested the accuracy of linkage analyses on simulated recombinant inbred line data using the commonly used Map Manager QTX (Manly et al. 2001: Mammalian Genome 12, 930-932) software and RECORD (Van Os et al. 2005: Theoretical and Applied Genetics 112, 30-40). Accuracy was measured by calculating two scores: % correct marker positions, and a novel, weighted rank-based score derived from the sum of absolute values of true minus observed marker ranks divided by the total number of markers. The accuracy of maps generated using Map Manager QTX was considerably lower than those generated using RECORD. Differences in linkage maps were often observed when marker ordering was performed several times using the identical dataset. In order to test the effect of reducing marker numbers on the stability of marker order, we pruned marker datasets focusing on regions consisting of tightly linked clusters of markers, which included redundant markers. Marker pruning improved the accuracy and stability of linkage maps because a single unambiguous marker order was produced that was consistent across replications of analysis. Marker pruning was also applied to a real barley mapping population and QTL analysis was performed using different map versions produced by the different programs. While some QTLs were identified with both map versions, there were large differences in QTL mapping results. Differences included maximum LOD and R-2 values at QTL peaks and map positions, thus highlighting the importance of marker order for QTL mapping

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Birds represent the most diverse extant tetrapod clade, with ca. 10,000 extant species, and the timing of the crown avian radiation remains hotly debated. The fossil record supports a primarily Cenozoic radiation of crown birds, whereas molecular divergence dating analyses generally imply that this radiation was well underway during the Cretaceous. Furthermore, substantial differences have been noted between published divergence estimates. These have been variously attributed to clock model, calibration regime, and gene type. One underappreciated phenomenon is that disparity between fossil ages and molecular dates tends to be proportionally greater for shallower nodes in the avian Tree of Life. Here, we explore potential drivers of disparity in avian divergence dates through a set of analyses applying various calibration strategies and coding methods to a mitochondrial genome dataset and an 18-gene nuclear dataset, both sampled across 72 taxa. Our analyses support the occurrence of two deep divergences (i.e., the Palaeognathae/Neognathae split and the Galloanserae/Neoaves split) well within the Cretaceous, followed by a rapid radiation of Neoaves near the K-Pg boundary. However, 95% highest posterior density intervals for most basal divergences in Neoaves cross the boundary, and we emphasize that, barring unreasonably strict prior distributions, distinguishing between a rapid Early Paleocene radiation and a Late Cretaceous radiation may be beyond the resolving power of currently favored divergence dating methods. In contrast to recent observations for placental mammals, constraining all divergences within Neoaves to occur in the Cenozoic does not result in unreasonably high inferred substitution rates. Comparisons of nuclear DNA (nDNA) versus mitochondrial DNA (mtDNA) datasets and NT- versus RY-coded mitochondrial data reveal patterns of disparity that are consistent with substitution model misspecifications that result in tree compression/tree extension artifacts, which may explain some discordance between previous divergence estimates based on different sequence types. Comparisons of fully calibrated and nominally calibrated trees support a correlation between body mass and apparent dating error. Overall, our results are consistent with (but do not require) a Paleogene radiation for most major clades of crown birds.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Daily rainfall datasets of 10 years (1998-2007) of Tropical Rainfall Measuring Mission (TRMM) Multi-satellite Precipitation Analysis (TMPA) version 6 and India Meteorological Department (IMD) gridded rain gauge have been compared over the Indian landmass, both in large and small spatial scales. On the larger spatial scale, the pattern correlation between the two datasets on daily scales during individual years of the study period is ranging from 0.4 to 0.7. The correlation improved significantly (similar to 0.9) when the study was confined to specific wet and dry spells each of about 5-8 days. Wavelet analysis of intraseasonal oscillations (ISO) of the southwest monsoon rainfall show the percentage contribution of the major two modes (30-50 days and 10-20 days), to be ranging respectively between similar to 30-40% and 5-10% for the various years. Analysis of inter-annual variability shows the satellite data to be underestimating seasonal rainfall by similar to 110 mm during southwest monsoon and overestimating by similar to 150 mm during northeast monsoon season. At high spatio-temporal scales, viz., 1 degrees x1 degrees grid, TMPA data do not correspond to ground truth. We have proposed here a new analysis procedure to assess the minimum spatial scale at which the two datasets are compatible with each other. This has been done by studying the contribution to total seasonal rainfall from different rainfall rate windows (at 1 mm intervals) on different spatial scales (at daily time scale). The compatibility spatial scale is seen to be beyond 5 degrees x5 degrees average spatial scale over the Indian landmass. This will help to decide the usability of TMPA products, if averaged at appropriate spatial scales, for specific process studies, e.g., cloud scale, meso scale or synoptic scale.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

This paper describes a spatio-temporal registration approach for speech articulation data obtained from electromagnetic articulography (EMA) and real-time Magnetic Resonance Imaging (rtMRI). This is motivated by the potential for combining the complementary advantages of both types of data. The registration method is validated on EMA and rtMRI datasets obtained at different times, but using the same stimuli. The aligned corpus offers the advantages of high temporal resolution (from EMA) and a complete mid-sagittal view (from rtMRI). The co-registration also yields optimum placement of EMA sensors as articulatory landmarks on the magnetic resonance images, thus providing richer spatio-temporal information about articulatory dynamics. (C) 2014 Acoustical Society of America

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Supporting presentation slides as part of the Janet network end to end performance initiative

Relevância:

20.00% 20.00%

Publicador:

Resumo:

More and more users aim at taking advantage of the existing Linked Open Data environment to formulate a query over a dataset and to then try to process the same query over different datasets, one after another, in order to obtain a broader set of answers. However, the heterogeneity of vocabularies used in the datasets on the one side, and the fact that the number of alignments among those datasets is scarce on the other, makes that querying task difficult for them. Considering this scenario we present in this paper a proposal that allows on demand translations of queries formulated over an original dataset, into queries expressed using the vocabulary of a targeted dataset. Our approach relieves users from knowing the vocabulary used in the targeted datasets and even more it considers situations where alignments do not exist or they are not suitable for the formulated query. Therefore, in order to favour the possibility of getting answers, sometimes there is no guarantee of obtaining a semantically equivalent translation. The core component of our proposal is a query rewriting model that considers a set of transformation rules devised from a pragmatic point of view. The feasibility of our scheme has been validated with queries defined in well known benchmarks and SPARQL endpoint logs, as the obtained results confirm.