957 results for Imbalanced datasets
Abstract:
OBJECTIVES: To determine effective and efficient monitoring criteria for ocular hypertension [raised intraocular pressure (IOP)] through (i) identification and validation of glaucoma risk prediction models; and (ii) development of models to determine optimal surveillance pathways.
DESIGN: A discrete event simulation economic modelling evaluation. Data from systematic reviews of risk prediction models and agreement between tonometers, secondary analyses of existing datasets (to validate identified risk models and determine optimal monitoring criteria) and public preferences were used to structure and populate the economic model.
SETTING: Primary and secondary care.
PARTICIPANTS: Adults with ocular hypertension (IOP > 21 mmHg) and the public (surveillance preferences).
INTERVENTIONS: We compared five pathways: two based on National Institute for Health and Clinical Excellence (NICE) guidelines with monitoring interval and treatment depending on initial risk stratification, 'NICE intensive' (4-monthly to annual monitoring) and 'NICE conservative' (6-monthly to biennial monitoring); two pathways, differing in location (hospital and community), with monitoring biennially and treatment initiated for a ≥ 6% 5-year glaucoma risk; and a 'treat all' pathway involving treatment with a prostaglandin analogue if IOP > 21 mmHg and IOP measured annually in the community.
MAIN OUTCOME MEASURES: Glaucoma cases detected; tonometer agreement; public preferences; costs; willingness to pay and quality-adjusted life-years (QALYs).
RESULTS: The best available glaucoma risk prediction model estimated the 5-year risk based on age and ocular predictors (IOP, central corneal thickness, optic nerve damage and index of visual field status). Taking the average of two IOP readings by tonometry, true change was detectable at two years. Sizeable measurement variability was noted between tonometers. There was a general public preference for monitoring; good communication and understanding of the process predicted service value. 'Treat all' was the least costly and 'NICE intensive' the most costly pathway. Biennial monitoring reduced the number of cases of glaucoma conversion compared with a 'treat all' pathway and provided more QALYs, but the incremental cost-effectiveness ratio (ICER) was considerably more than £30,000. The 'NICE intensive' pathway also avoided glaucoma conversion, but NICE-based pathways were either dominated (more costly and less effective) by biennial hospital monitoring or had ICERs > £30,000. Results were not sensitive to the risk threshold for initiating surveillance but were sensitive to the risk threshold for initiating treatment, NHS costs and treatment adherence.
LIMITATIONS: Optimal monitoring intervals were based on IOP data. There were insufficient data to determine the optimal frequency of measurement of the visual field or optic nerve head for identification of glaucoma. The economic modelling took a 20-year time horizon, which may be insufficient to capture long-term benefits. Sensitivity analyses may not fully capture the uncertainty surrounding parameter estimates.
CONCLUSIONS: For confirmed ocular hypertension, findings suggest that there is no clear benefit from intensive monitoring. Consideration of the patient experience is important. A cohort study is recommended to provide data to refine the glaucoma risk prediction model, determine the optimum type and frequency of serial glaucoma tests and estimate costs and patient preferences for monitoring and treatment.
FUNDING: The National Institute for Health Research Health Technology Assessment Programme.
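The cost-effectiveness comparisons above all reduce to an incremental cost-effectiveness ratio judged against a willingness-to-pay threshold. A minimal Python sketch of that arithmetic, with placeholder numbers rather than values from the study:

```python
def icer(cost_a, qaly_a, cost_b, qaly_b):
    """Incremental cost-effectiveness ratio of pathway A over pathway B:
    extra cost per extra QALY gained."""
    return (cost_a - cost_b) / (qaly_a - qaly_b)

# Placeholder numbers, not values from the study.
ratio = icer(cost_a=13_000.0, qaly_a=10.2, cost_b=9_000.0, qaly_b=10.1)
THRESHOLD = 30_000.0  # GBP per QALY, the threshold cited in the results
verdict = "cost-effective" if ratio <= THRESHOLD else "not cost-effective"
print(f"ICER = £{ratio:,.0f}/QALY -> {verdict}")
```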
Abstract:
This research uses the multivariate geochemical dataset generated by the Tellus project to investigate the appropriate use of transformation methods to maintain the integrity of geochemical data and the inherent constrained behaviour in multivariate relationships. The widely used normal score transform is compared with a stepwise conditional transform technique. The Tellus Project, managed by GSNI and funded by the Department of Enterprise, Trade and Development and the EU's Building Sustainable Prosperity Fund, is the most comprehensive geological mapping project ever undertaken in Northern Ireland. Previous work has demonstrated spatial variability in the Tellus data, but geostatistical analysis and interpretation of the datasets requires a methodology that reproduces the inherently complex multivariate relations. Previous investigation of the Tellus geochemical data has included Gaussian-based techniques; however, earth science variables are rarely Gaussian, so transformation of the data is integral to the approach. The Tellus dataset thus provides an opportunity to investigate the transformation methods required for Gaussian-based geostatistical analysis. In particular, the stepwise conditional transform is investigated and developed for the Tellus geochemical datasets. The transform is applied to four variables in a bivariate nested fashion owing to the limited availability of data. Simulation of these transformed variables is then carried out, along with a corresponding back transformation to original units. Results show that the stepwise transform is successful in reproducing both the univariate statistics and the complex bivariate relations exhibited by the data. Greater fidelity to multivariate relationships will improve uncertainty models, which are required for consequent geological, environmental and economic inferences.
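To make the contrast concrete, here is a minimal Python sketch of the two transforms the abstract compares: a rank-based normal score transform, and a bivariate stepwise conditional transform in which the second variable is normal-scored within probability classes of the first. This is an illustrative simplification, not the Tellus workflow itself.

```python
import numpy as np
from scipy import stats

def normal_score(x):
    """Rank-based transform of a sample to standard normal scores."""
    ranks = stats.rankdata(x)                    # average ranks for ties
    return stats.norm.ppf(ranks / (len(x) + 1))  # empirical quantiles -> Gaussian

def stepwise_conditional(primary, secondary, n_bins=10):
    """Bivariate stepwise conditional transform: normal-score the primary
    variable, then normal-score the secondary *within* quantile classes of
    the primary, preserving the bivariate dependence between them."""
    y1 = normal_score(primary)
    y2 = np.zeros_like(y1)
    edges = np.quantile(y1, np.linspace(0, 1, n_bins + 1))
    classes = np.clip(np.searchsorted(edges, y1, side="right") - 1, 0, n_bins - 1)
    for c in range(n_bins):
        mask = classes == c
        if mask.sum() > 1:          # singleton classes keep a neutral score of 0
            y2[mask] = normal_score(secondary[mask])
    return y1, y2
```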
Abstract:
In recent years, gradient vector flow (GVF) based algorithms have been successfully used to segment a variety of 2-D and 3-D imagery. However, due to the compromise between internal and external energy forces within the resulting partial differential equations, these methods may lead to biased segmentation results. In this paper, we propose MSGVF, a mean shift based GVF segmentation algorithm that can successfully locate the correct borders. MSGVF is developed so that when the contour reaches equilibrium, the various forces resulting from the different energy terms are balanced. In addition, the smoothness constraint of image pixels is kept so that over- or under-segmentation can be reduced. Experimental results on publicly accessible datasets of dermoscopic and optic disc images demonstrate that the proposed method effectively detects the borders of the objects of interest.
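For readers unfamiliar with the underlying field computation, a minimal numpy sketch of classic gradient vector flow (Xu and Prince) follows; the mean shift refinement that distinguishes MSGVF is not reproduced here.

```python
import numpy as np

def gradient_vector_flow(edge_map, mu=0.2, n_iter=200):
    """Classic GVF: diffuse the edge-map gradient into homogeneous regions by
    iterating v <- v + mu*laplacian(v) - |grad f|^2 (v - grad f).
    edge_map is assumed normalised to [0, 1]; boundaries are periodic for brevity."""
    fy, fx = np.gradient(edge_map.astype(float))
    u, v = fx.copy(), fy.copy()
    mag2 = fx**2 + fy**2
    for _ in range(n_iter):
        # 5-point Laplacians via shifted copies of the field
        lap_u = (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
                 np.roll(u, 1, 1) + np.roll(u, -1, 1) - 4.0 * u)
        lap_v = (np.roll(v, 1, 0) + np.roll(v, -1, 0) +
                 np.roll(v, 1, 1) + np.roll(v, -1, 1) - 4.0 * v)
        u += mu * lap_u - mag2 * (u - fx)
        v += mu * lap_v - mag2 * (v - fy)
    return u, v  # the external force field that drives the active contour
```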
Abstract:
The advent of next generation sequencing (NGS) technologies has expanded the area of genomic research, offering high coverage and increased sensitivity over older microarray platforms. Although the current cost of next generation sequencing still exceeds that of microarray approaches, the rapid advances in NGS will likely make it the platform of choice for future research in differential gene expression. Connectivity mapping is a procedure for examining the connections among diseases, genes and drugs through differential gene expression; it was initially based on microarray technology, with which a large collection of compound-induced reference gene expression profiles has been accumulated. In this work, we test the feasibility of incorporating NGS RNA-Seq data into the current connectivity mapping framework by using the microarray-based reference profiles and constructing a differentially expressed gene signature from an NGS dataset. This allows connections to be established between the NGS gene signature and those microarray reference profiles, avoiding the cost of re-creating drug profiles with NGS technology. We examined the connectivity mapping approach on a publicly available NGS dataset with androgen stimulation of LNCaP cells in order to extract candidate compounds that could inhibit the proliferative phenotype of LNCaP cells and to elucidate their potential in a laboratory setting. In addition, we also analyzed an independent microarray dataset of similar experimental settings. We found a high level of concordance between the top compounds identified using the gene signatures from the two datasets. The nicotine derivative cotinine was returned as the top candidate among the overlapping compounds with potential to suppress this proliferative phenotype. Subsequent lab experiments validated this connectivity mapping hit, showing that cotinine inhibits cell proliferation in an androgen dependent manner. Thus the results in this study suggest a promising prospect of integrating NGS data with connectivity mapping. © 2013 McArt et al.
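The core of the framework is scoring a query gene signature against each reference profile. A simplified signed-rank score in Python, illustrative of the idea rather than the exact sscMap statistic; the gene names below are toy values:

```python
def connection_score(reference_ranks, signature):
    """Simplified signed connectivity score (not the exact sscMap statistic).
    reference_ranks maps gene -> signed rank in the reference profile (most
    up-regulated = +N, most down-regulated = -N); signature maps gene -> +1
    (up) or -1 (down)."""
    raw = sum(sign * reference_ranks.get(gene, 0)
              for gene, sign in signature.items())
    max_rank = max(abs(r) for r in reference_ranks.values())
    return raw / (max_rank * len(signature))  # scale to [-1, 1]

# Hypothetical toy data: gene symbols and ranks are illustrative only.
ref = {"KLK3": 3, "TMPRSS2": 2, "FKBP5": 1, "GAPDH": -1, "TP53": -2, "CDKN1A": -3}
sig = {"KLK3": +1, "TMPRSS2": +1, "CDKN1A": -1}
print(connection_score(ref, sig))  # positive -> the compound mimics the signature
```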
Abstract:
Integrating evidence from multiple domains is useful in prioritizing disease candidate genes for subsequent testing. We ranked all known human genes (n = 3819) under linkage peaks in the Irish Study of High-Density Schizophrenia Families using three different evidence domains: 1) a meta-analysis of microarray gene expression results using the Stanley Brain collection, 2) a schizophrenia protein-protein interaction network, and 3) a systematic literature search. Each gene was assigned a domain-specific p-value and ranked after evaluating the evidence within each domain. For comparison to this ranking process, a large-scale candidate gene hypothesis was also tested by including genes with Gene Ontology terms related to neurodevelopment. Subsequently, genotypes of 3725 SNPs in 167 genes from a custom Illumina iSelect array were used to evaluate the top ranked vs. hypothesis selected genes. Seventy-three genes were both highly ranked and involved in neurodevelopment (category 1) while 42 and 52 genes were exclusive to neurodevelopment (category 2) or highly ranked (category 3), respectively. The most significant associations were observed in genes PRKG1, PRKCE, and CNTN4 but no individual SNPs were significant after correction for multiple testing. Comparison of the approaches showed an excess of significant tests using the hypothesis-driven neurodevelopment category. Random selection of similar sized genes from two independent genome-wide association studies (GWAS) of schizophrenia showed the excess was unlikely by chance. In a further meta-analysis of three GWAS datasets, four candidate SNPs reached nominal significance. Although gene ranking using integrated sources of prior information did not enrich for significant results in the current experiment, gene selection using an a priori hypothesis (neurodevelopment) was superior to random selection. As such, further development of gene ranking strategies using more carefully selected sources of information is warranted.
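The abstract does not specify how the three domain-specific p-values were merged into a single ranking; Fisher's method is one standard choice, sketched here with hypothetical values:

```python
import numpy as np
from scipy import stats

def fisher_combined(pvals):
    """Fisher's method: combine k independent domain-specific p-values into
    one p-value via a chi-squared statistic with 2k degrees of freedom."""
    pvals = np.asarray(pvals, dtype=float)
    statistic = -2.0 * np.log(pvals).sum()
    return stats.chi2.sf(statistic, df=2 * len(pvals))

# Hypothetical gene -> (expression, PPI-network, literature) p-values.
domains = {"PRKG1": (0.01, 0.20, 0.05), "PRKCE": (0.30, 0.02, 0.10)}
ranked = sorted(domains, key=lambda g: fisher_combined(domains[g]))
print(ranked)  # genes ordered by combined evidence, strongest first
```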
Abstract:
FLT3-ITD mutations are prevalent mutations in acute myeloid leukaemia (AML). PRL-3, a metastasis-associated phosphatase, is a downstream target of FLT3-ITD. This study investigates the regulation and function of PRL-3 in leukaemia cell lines and AML patients associated with FLT3-ITD mutations. PRL-3 expression is upregulated by the FLT3-STAT5 signalling pathway in leukaemia cells, leading to activation of AP-1 transcription factors via the ERK and JNK pathways. PRL-3-depleted AML cells showed a significant decrease in cell growth. Clinically, high PRL-3 mRNA expression was associated with FLT3-ITD mutations in four independent AML datasets totalling 1158 patients. Multivariable Cox regression analysis on our Cohort 1, with 221 patients, identified PRL-3 as a novel prognostic marker independent of other clinical parameters. Kaplan-Meier analysis showed that high PRL-3 mRNA expression was significantly associated with poorer survival among 491 patients with normal karyotype. Targeting PRL-3 reversed the oncogenic effects in FLT3-ITD AML models in vitro and in vivo. Herein, we suggest that PRL-3 could serve as a prognostic marker to predict poorer survival and as a promising novel therapeutic target for AML patients.
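A multivariable Cox model of the kind used for Cohort 1 can be fitted in a few lines; this sketch assumes the Python lifelines package, and the toy cohort below is illustrative only (a real analysis needs hundreds of patients).

```python
import pandas as pd
from lifelines import CoxPHFitter

# Hypothetical toy cohort: column names and values are illustrative only.
df = pd.DataFrame({
    "time_months": [12, 30, 7, 48, 22, 15],
    "event":       [1, 0, 1, 0, 1, 1],    # 1 = death observed, 0 = censored
    "PRL3_high":   [1, 0, 1, 0, 1, 0],    # dichotomised PRL-3 mRNA expression
    "age":         [61, 54, 70, 49, 66, 58],
    "flt3_itd":    [1, 0, 1, 1, 0, 1],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="time_months", event_col="event")
cph.print_summary()  # hazard ratio for PRL3_high, adjusted for age and FLT3-ITD
```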
Abstract:
Web sites that rely on databases for their content are now ubiquitous. Query result pages are dynamically generated from these databases in response to user-submitted queries. Automatically extracting structured data from query result pages is a challenging problem, as the structure of the data is not explicitly represented. While humans have shown good intuition in visually understanding data records on a query result page as displayed by a web browser, no existing approach to data record extraction has made full use of this intuition. We propose a novel approach, in which we make use of the common sources of evidence that humans use to understand data records on a displayed query result page. These include structural regularity, and visual and content similarity between data records displayed on a query result page. Based on these observations we propose new techniques that can identify each data record individually, while ignoring noise items, such as navigation bars and adverts. We have implemented these techniques in a software prototype, rExtractor, and tested it using two datasets. Our experimental results show that our approach achieves significantly higher accuracy than previous approaches. Furthermore, it establishes the case for use of vision-based algorithms in the context of data extraction from web sites.
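The kind of visual regularity the approach exploits can be illustrated with a toy grouping heuristic. This is a schematic of the intuition only, not rExtractor's actual algorithm, and the Block type and its fields are invented for the example:

```python
from dataclasses import dataclass

@dataclass
class Block:
    """A rendered candidate region: geometry plus a coarse content signature."""
    x: int          # left edge in pixels
    width: int
    tag_path: str   # e.g. "div/ul/li", a crude structural fingerprint

def group_records(blocks, x_tol=5, w_tol=10):
    """Toy version of the intuition the paper formalises: data records repeat
    with near-identical left edge, width and internal structure, while
    navigation bars and adverts do not."""
    groups = []
    for b in blocks:
        for g in groups:
            ref = g[0]
            if (abs(b.x - ref.x) <= x_tol and abs(b.width - ref.width) <= w_tol
                    and b.tag_path == ref.tag_path):
                g.append(b)
                break
        else:
            groups.append([b])
    # The dominant group of aligned, structurally similar blocks is the record set.
    return max(groups, key=len)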
Abstract:
Identifying differential expression of genes in psoriatic and healthy skin by microarray data analysis is a key approach to understanding the pathogenesis of psoriasis. Analysis of more than one dataset to identify genes commonly upregulated reduces the likelihood of false positives and narrows down the possible signature genes. Genes controlling the critical balance between T helper 17 and regulatory T cells are of special interest in psoriasis. Our objective was to identify genes that are consistently upregulated in lesional skin across three published microarray datasets. We carried out a reanalysis of gene expression data extracted from three experiments on samples from psoriatic and nonlesional skin using the same stringency threshold and software, and further compared the expression levels of 92 genes related to the T helper 17 and regulatory T cell signaling pathways. We found 73 probe sets, representing 57 genes, commonly upregulated in lesional skin in all datasets. These included 26 probe sets representing 20 genes that have no previous link to the etiopathogenesis of psoriasis. These genes may represent novel therapeutic targets and will require more rigorous experimental testing to be validated. Our analysis also identified 12 of the 92 genes known to be related to the T helper 17 and regulatory T cell signaling pathways as differentially expressed in the lesional skin samples.
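Once each dataset has been thresholded with the same stringency, the cross-dataset step reduces to a set intersection. A minimal sketch with invented gene symbols:

```python
# Each set holds the genes passing the same fold-change/p-value threshold in
# one dataset; commonly upregulated genes must appear in all three.
# Gene symbols here are toy examples, not the paper's results.
dataset_a = {"IL17A", "DEFB4A", "S100A7", "KRT16"}
dataset_b = {"IL17A", "DEFB4A", "S100A7", "CCL20"}
dataset_c = {"IL17A", "DEFB4A", "S100A7", "STAT3"}

common_up = dataset_a & dataset_b & dataset_c
print(sorted(common_up))  # ['DEFB4A', 'IL17A', 'S100A7']
```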
Abstract:
We examine mid- to late Holocene centennial-scale climate variability in Ireland using proxy data from peatlands, lakes and a speleothem. A high degree of between-record variability is apparent in the proxy data and significant chronological uncertainties are present. However, tephra layers provide a robust tool for correlation and improve the chronological precision of the records. Although we can find no statistically significant coherence in the dataset as a whole, a selection of high-quality peatland water table reconstructions co-vary more than would be expected by chance alone. A locally weighted regression model with bootstrapping can be used to construct a ‘best-estimate’ palaeoclimatic reconstruction from these datasets. Visual comparison and cross-wavelet analysis of peatland water table compilations from Ireland and Northern Britain show that there are some periods of coherence between these records. Some terrestrial palaeoclimatic changes in Ireland appear to coincide with changes in the North Atlantic thermohaline circulation and solar activity. However, these relationships are inconsistent and may be obscured by chronological uncertainties. We conclude by suggesting an agenda for future Holocene climate research in Ireland. ©2013 Elsevier B.V. All rights reserved.
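A locally weighted regression with bootstrapping of the kind described can be sketched as follows, assuming statsmodels' lowess; resampling the records and summarising percentiles of the refitted curves is one common way to carry record-level uncertainty into the composite reconstruction.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)

def bootstrap_lowess(x, y, frac=0.3, n_boot=500):
    """LOWESS with bootstrap resampling: refit the smoother on resampled
    (x, y) records and summarise the spread of the fitted curves to give a
    'best-estimate' composite with an uncertainty envelope."""
    grid = np.linspace(x.min(), x.max(), 200)
    fits = np.empty((n_boot, grid.size))
    for i in range(n_boot):
        idx = rng.integers(0, len(x), len(x))          # resample with replacement
        sm = lowess(y[idx], x[idx], frac=frac, return_sorted=True)
        fits[i] = np.interp(grid, sm[:, 0], sm[:, 1])  # evaluate on a common grid
    lo, mid, hi = np.percentile(fits, [2.5, 50, 97.5], axis=0)
    return grid, mid, (lo, hi)
```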
Abstract:
In this paper, we introduce an application of matrix factorization to produce corpus-derived, distributional models of semantics that demonstrate cognitive plausibility. We find that word representations learned by Non-Negative Sparse Embedding (NNSE), a variant of matrix factorization, are sparse, effective, and highly interpretable. To the best of our knowledge, this is the first approach which yields semantic representations of words satisfying these three desirable properties. Through extensive experimental evaluations on multiple real-world tasks and datasets, we demonstrate the superiority of semantic models learned by NNSE over other state-of-the-art baselines.
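A generic sparse non-negative factorization conveys the flavour of such representations. The sketch below uses simple multiplicative updates and is not the paper's online NNSE solver (which constrains only the code matrix to be non-negative):

```python
import numpy as np

def sparse_nmf(X, k, lam=0.1, n_iter=300, eps=1e-9):
    """Multiplicative-update NMF with an L1 penalty on the word codes W:
    X (words x contexts) ~ W @ H with W, H >= 0 and W sparse. A generic
    stand-in for the NNSE-style factorization, not the paper's exact method."""
    rng = np.random.default_rng(0)
    n, m = X.shape
    W = rng.random((n, k))
    H = rng.random((k, m))
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)          # update basis
        W *= (X @ H.T) / (W @ H @ H.T + lam + eps)    # update sparse codes
    return W, H  # rows of W: sparse, non-negative, interpretable word codes

# Toy co-occurrence matrix; real inputs would be corpus statistics.
X = np.abs(np.random.default_rng(1).random((100, 50)))
W, H = sparse_nmf(X, k=10)
print(f"fraction of near-zero code entries: {(W < 1e-3).mean():.2f}")
```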
Abstract:
In most previous research on distributional semantics, Vector Space Models (VSMs) of words are built either from topical information (e.g., documents in which a word is present), or from syntactic/semantic types of words (e.g., dependency parse links of a word in sentences), but not both. In this paper, we explore the utility of combining these two representations to build VSMs for the task of semantic composition of adjective-noun phrases. Through extensive experiments on benchmark datasets, we find that even though a type-based VSM is effective for semantic composition, it is often outperformed by a VSM built using a combination of topic- and type-based statistics. We also introduce a new evaluation task wherein we predict the composed vector representation of a phrase from the brain activity of a human subject reading that phrase. We exploit a large syntactically parsed corpus of 16 billion tokens to build our VSMs, with vectors for both phrases and words, and make them publicly available.
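One simple way to realise both ideas, sketched here with random toy vectors: concatenate the topic- and type-based statistics into a single word vector, then compose an adjective-noun phrase with a weighted-additive baseline (the paper's best-performing composition model may differ).

```python
import numpy as np

def compose(adj_vec, noun_vec, alpha=0.4):
    """Weighted-additive composition, a standard baseline for adjective-noun
    phrases: the phrase vector is a convex combination of its parts."""
    return alpha * adj_vec + (1 - alpha) * noun_vec

# Toy vectors standing in for corpus statistics: each word gets a topic-based
# and a type-based view, combined here by simple concatenation.
rng = np.random.default_rng(0)
topic_adj, type_adj = rng.random(50), rng.random(50)
topic_noun, type_noun = rng.random(50), rng.random(50)
adj = np.concatenate([topic_adj, type_adj])
noun = np.concatenate([topic_noun, type_noun])
phrase = compose(adj, noun)  # combined-view vector for the adjective-noun phrase
```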
Abstract:
Our review of paleoclimate information for New Zealand pertaining to the past 30,000 years has identified a general sequence of climatic events, spanning the onset of cold conditions marking the final phase of the Last Glaciation, through to the emergence to full interglacial conditions in the early Holocene. In order to facilitate more detailed assessments of climate variability and any leads or lags in the timing of climate changes across the region, a composite stratotype is proposed for New Zealand. The stratotype is based on terrestrial stratigraphic records and is intended to provide a standard reference for the intercomparison and evaluation of climate proxy records. We nominate a specific stratigraphic type record for each climatic event, using either natural exposure or drill core stratigraphic sections. Type records were selected on the basis of having very good numerical age control and a clear proxy record. In all cases the main proxy of the type record is subfossil pollen. The type record for the period from ca 30 to ca 18 calendar kiloyears BP (cal. ka BP) is designated in lake-bed sediments from a small morainic kettle lake (Galway tarn) in western South Island. The Galway tarn type record spans a period of full glacial conditions (Last Glacial Coldest Period, LGCP) within the Otira Glaciation, and includes three cold stadials separated by two cool interstadials. The type record for the emergence from glacial conditions following the termination of the Last Glaciation (post-Termination amelioration) is in a core of lake sediments from a maar (Pukaki volcanic crater) in Auckland, northern North Island, and spans from ca 18 to 15.64±0.41 cal. ka BP. The type record for the Lateglacial period is an exposure of interbedded peat and mud at montane Kaipo bog, eastern North Island. In this high-resolution type record, an initial mild period was succeeded at 13.74±0.13 cal. ka BP by a cooler period, which after 12.55±0.14 cal. ka BP gave way to a progressive ascent to full interglacial conditions that were achieved by 11.88±0.18 cal. ka BP. Although a type section is not formally designated for the Holocene Interglacial (11.88±0.18 cal. ka BP to the present day), the sedimentary record of Lake Maratoto on the Waikato lowlands, northwestern North Island, is identified as a prospective type section pending the integration and updating of existing stratigraphic and proxy datasets, and age models. The type records are interconnected by one or more dated tephra layers, the ages of which are derived from Bayesian depositional modelling and OxCal-based calibrations using the IntCal09 dataset. Along with the type sections and the Lake Maratoto record, important, well-dated terrestrial reference records are provided for each climate event. Climate proxies from these reference records include pollen flora, stable isotopes from speleothems, beetle and chironomid fauna, and glacier moraines. The regional composite stratotype provides a benchmark against which to compare other records and proxies. Based on the composite stratotype, we provide an updated climate event stratigraphic classification for the New Zealand region. © 2013 Elsevier Ltd.
Abstract:
Background: Modern cancer research often involves large datasets and the use of sophisticated statistical techniques. Together these add a heavy computational load to the analysis, which is often coupled with issues surrounding data accessibility. Connectivity mapping is an advanced bioinformatic and computational technique dedicated to therapeutics discovery and drug re-purposing around differential gene expression analysis. On a normal desktop PC, it is common for the connectivity mapping task with a single gene signature to take more than 2 hours to complete using sscMap, a popular Java application that runs on standard CPUs (Central Processing Units). Here, we describe new software, cudaMap, which has been implemented using CUDA C/C++ to harness the computational power of NVIDIA GPUs (Graphics Processing Units) to greatly reduce processing times for connectivity mapping.
Results: cudaMap can identify candidate therapeutics from the same signature in just over thirty seconds when using an NVIDIA Tesla C2050 GPU. Results from the analysis of multiple gene signatures, which would previously have taken several days, can now be obtained in as little as 10 minutes, greatly facilitating high-throughput candidate therapeutics discovery. We demonstrate dramatic speed differentials between GPU-assisted and CPU-only execution as the computational load increases for high-accuracy evaluation of statistical significance.
Conclusion: Emerging 'omics' technologies are constantly increasing the volume of data and information to be processed in all areas of biomedical research. Embracing the multicore functionality of GPUs represents a major avenue of local accelerated computing. cudaMap will make a strong contribution in the discovery of candidate therapeutics by enabling speedy execution of heavy-duty connectivity mapping tasks, which are increasingly required in modern cancer research. cudaMap is open source and can be freely downloaded from http://purl.oclc.org/NET/cudaMap.
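The workload that benefits from the GPU is embarrassingly parallel: every signature is scored against every reference profile independently. A schematic of that batched scoring in numpy (not cudaMap's CUDA kernels); swapping numpy for CuPy would run the same array code on an NVIDIA GPU.

```python
import numpy as np  # CuPy exposes the same array API for NVIDIA GPUs

def batch_connection_scores(ref_ranks, sig_signs):
    """Score many signatures against many reference profiles in one matrix
    multiply - a schematic of the parallel workload, not cudaMap's kernels.
    ref_ranks: (n_profiles, n_genes) signed ranks per reference profile.
    sig_signs: (n_signatures, n_genes) with +1/-1 for signature genes, else 0."""
    raw = sig_signs @ ref_ranks.T                 # (n_signatures, n_profiles)
    max_rank = np.abs(ref_ranks).max(axis=1)      # per-profile scale factor
    n_sig_genes = np.abs(sig_signs).sum(axis=1)   # per-signature gene count
    return raw / (n_sig_genes[:, None] * max_rank[None, :])
```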
Abstract:
The start of the Upper Würmian in the Alps was marked by massive fluvioglacial aggradation prior to the arrival of the Central Alpine glaciers. In 1984, the Subcommission on European Quaternary Stratigraphy defined the clay pit of Baumkirchen (in the foreland of the Inn Valley, Austria) as the stratotype for the Middle to Upper Würmian boundary in the Alps. Key for the selection of this site was its radiocarbon chronology, which still ranks among the most important datasets of this time interval in the Alps. In this study we re-sampled all available original plant specimens and established an accelerator mass spectrometry chronology which supersedes the published 40-year-old chronology. The new data show a much smaller scatter and yielded slightly older conventional radiocarbon dates clustering at ca. 31 ¹⁴C ka BP. When calibrated using IntCal13, the new data suggest that the sampled interval of 653-681 m in the clay pit was deposited 34-36 cal ka BP. Two new radiocarbon dates on bone fragments found in the fluvioglacial gravel above the banded clays allow us to constrain the timing of the marked change from lacustrine to fluvioglacial sedimentation to ca. 32-33 cal ka BP, which suggests a possible link to the Heinrich 3 event in the North Atlantic. Copyright © 2013 John Wiley & Sons, Ltd.
Abstract:
Model selection between competing models is a key consideration in the discovery of prognostic multigene signatures. The use of appropriate statistical performance measures, as well as verification of the biological significance of the signatures, is imperative to maximise the chance of external validation of the generated signatures. Current approaches in time-to-event studies often use only a single measure of performance in model selection, such as logrank test p-values, or dichotomise the follow-up times at some phase of the study to facilitate signature discovery. In this study we improve the prognostic signature discovery process through the application of the multivariate partial Cox model combined with the concordance index, hazard ratio of predictions, independence from available clinical covariates and biological enrichment as measures of signature performance. The proposed framework was applied to discover prognostic multigene signatures from early breast cancer data. The partial Cox model, combined with the multiple performance measures, was used both to guide the selection of the optimal panel of prognostic genes and to predict risk within cross-validation, without dichotomising the follow-up times at any stage. The signatures were successfully externally cross-validated in independent breast cancer datasets, yielding a hazard ratio of 2.55 [1.44, 4.51] for the top-ranking signature.
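Of the performance measures listed, the concordance index is the easiest to demonstrate compactly. A minimal sketch assuming the lifelines package, with invented follow-up data:

```python
from lifelines.utils import concordance_index

# Toy follow-up data (illustrative only): times in months, 1 = event observed.
times  = [5, 12, 20, 25, 40, 44]
events = [1, 1, 0, 1, 0, 1]
risk   = [0.9, 0.7, 0.4, 0.5, 0.1, 0.2]  # model risk score per patient

# lifelines expects higher predictions for longer survival, so negate the risk.
cindex = concordance_index(times, [-r for r in risk], events)
print(f"c-index = {cindex:.2f}")  # 0.5 = random ranking, 1.0 = perfect ranking
```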