950 results for Cross-validation
Abstract:
Climate and environmental reconstructions from natural archives are important for the interpretation of current climatic change. Few quantitative high-resolution reconstructions exist for South America, which is the only land mass extending from the tropics to the southern high latitudes at 56°S. We analyzed sediment cores from two adjacent lakes in Northern Chilean Patagonia, Lago Castor (45°36′S, 71°47′W) and Laguna Escondida (45°31′S, 71°49′W). Radiometric dating (210Pb, 137Cs, 14C-AMS) suggests that the cores reach back to c. 900 BC (Laguna Escondida) and c. 1900 BC (Lago Castor). Both lakes show similar and reproducible changes in sedimentation rate and tephra layer deposition. We found eight macroscopic tephras (0.2–5.5 cm thick) dated to c. 1950 BC, 1700 BC, 300 BC, 50 BC, AD 90, AD 160, AD 400 and AD 900. These can be used as regional time-synchronous stratigraphic markers. The two thickest tephras represent known, well-dated explosive eruptions of Hudson volcano around 1950 and 300 BC. Biogenic silica flux in both lakes revealed a climate signal and correlated with annual temperature reanalysis data (calibration 1900–2006 AD; Lago Castor r = 0.37; Laguna Escondida r = 0.42, seven-year filtered data). We used a linear inverse regression plus scaling model for calibration and leave-one-out cross-validation (RMSEv = 0.56 °C) to reconstruct sub-decadal-scale temperature variability for Laguna Escondida back to AD 400. The lower part of the Laguna Escondida core prior to AD 400 and the Lago Castor core are strongly influenced by primary and secondary tephras and were therefore not used for the temperature reconstruction. The temperature reconstruction from Laguna Escondida shows cold conditions in the 5th century (relative to the 20th-century mean), warmer temperatures from AD 600 to AD 1150 and colder temperatures from AD 1200 to AD 1450.
From AD 1450 to AD 1700 our reconstruction shows a period of stronger variability, with values on average higher than the 20th-century mean. Until AD 1900 the temperatures decrease but remain slightly above the 20th-century mean. Most of the centennial-scale features are reproduced in the few other natural climate archives in the region. The early onset of cool conditions from c. AD 1200 onward seems to be confirmed for this region.
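The calibration workflow this abstract describes (a linear proxy–temperature model evaluated by leave-one-out cross-validation) can be illustrated generically. In the sketch below the proxy and temperature series are invented toy data, not the Laguna Escondida record, and plain ordinary least squares stands in for the authors' inverse regression plus scaling model:

```python
# Minimal sketch of leave-one-out cross-validation (LOO-CV) for a
# linear calibration model (proxy -> temperature).  The toy data are
# illustrative only, not the Laguna Escondida series.
import math

def fit_linear(xs, ys):
    """Ordinary least-squares fit y = a + b*x (stand-in model)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def loo_rmse(xs, ys):
    """Root-mean-square error of prediction under leave-one-out CV:
    each sample is predicted by a model fitted without it."""
    errs = []
    for i in range(len(xs)):
        a, b = fit_linear(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        errs.append(ys[i] - (a + b * xs[i]))
    return math.sqrt(sum(e * e for e in errs) / len(errs))

proxy = [1.0, 1.4, 2.1, 2.5, 3.0, 3.3, 4.1, 4.8]   # invented values
temp  = [8.2, 8.5, 9.1, 9.0, 9.8, 9.9, 10.6, 11.1]  # invented values
rmse = loo_rmse(proxy, temp)
```

A cross-validated RMSE such as the 0.56 °C reported above quantifies how well the calibration predicts samples it was not fitted on, rather than how well it fits the calibration period itself.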
Abstract:
Nitazoxanide (2-acetyloxy-N-(5-nitro-2-thiazolyl)benzamide; NTZ) represents the parent compound of a novel class of broad-spectrum anti-parasitic compounds named thiazolides. NTZ is active against a wide variety of intestinal and tissue-dwelling helminths, protozoa, enteric bacteria and a number of viruses infecting animals and humans. While potent, this breadth poses a problem in practice, since such obvious non-selectivity can lead to undesired side effects in both humans and animals. In this study, we used real-time PCR to determine the in vitro activities of 29 different thiazolides (NTZ derivatives), which carry distinct modifications on both the thiazole and benzene moieties, against the tachyzoite stage of the intracellular protozoan Neospora caninum. The goal was to identify a highly active compound lacking the undesirable nitro group, which would have a more specific applicability, such as in food animals. By applying self-organizing molecular field analysis (SOMFA), these data were used to develop a predictive model for future drug design. SOMFA performs self-alignment of the molecules and takes into account their steric and electrostatic properties in order to derive 3D quantitative structure-activity relationship models. The best model was obtained by overlay of the thiazole moieties. Plotting predicted versus experimentally determined activity produced an r² value of 0.8052, and cross-validation using the "leave one out" methodology resulted in a q² value of 0.7987. A master grid map showed that large steric groups at the R2 position, at the nitrogen of the amide bond and at position Y could greatly reduce activity, whereas large steric groups placed at positions X and R4 and surrounding the oxygen atom of the amide bond may increase the activity of thiazolides against Neospora caninum tachyzoites. The model obtained here will be an important predictive tool for the future development of this important class of drugs.
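The leave-one-out q² statistic quoted above is conventionally computed as 1 − PRESS/TSS, where PRESS accumulates squared leave-one-out prediction errors and TSS is the total sum of squares. A minimal univariate sketch, with ordinary least squares as a stand-in for the SOMFA model and invented activity data (not the thiazolide measurements):

```python
# Generic leave-one-out q^2 (cross-validated r^2) for a univariate
# linear QSAR-style model: q^2 = 1 - PRESS / TSS.
# Toy descriptor/activity values, not the thiazolide data set.
def q_squared(xs, ys):
    n = len(ys)
    press = 0.0
    for i in range(n):
        xtr = xs[:i] + xs[i + 1:]
        ytr = ys[:i] + ys[i + 1:]
        mx, my = sum(xtr) / (n - 1), sum(ytr) / (n - 1)
        b = sum((x - mx) * (y - my) for x, y in zip(xtr, ytr)) / \
            sum((x - mx) ** 2 for x in xtr)
        a = my - b * mx
        press += (ys[i] - (a + b * xs[i])) ** 2   # LOO prediction error
    my_all = sum(ys) / n
    tss = sum((y - my_all) ** 2 for y in ys)
    return 1.0 - press / tss

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]      # invented descriptor
y = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 14.0, 16.2]   # invented activity
q2 = q_squared(x, y)
```

With a strongly linear toy relationship the statistic approaches 1; in QSAR practice q² values above roughly 0.5 are usually taken to indicate useful predictivity.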
Abstract:
Smoothing splines are a popular approach for non-parametric regression problems. We use periodic smoothing splines to fit a periodic signal plus noise model to data for which we assume there are underlying circadian patterns. In the smoothing spline methodology, choosing an appropriate smoothness parameter is an important step in practice. In this paper, we draw a connection between smoothing splines and REACT estimators that provides motivation for new criteria for choosing the smoothness parameter. The new criteria are compared to three existing methods, namely cross-validation, generalized cross-validation, and the generalized maximum likelihood criterion, by a Monte Carlo simulation and by an application to the study of circadian patterns. For most of the situations presented in the simulations, including the practical example, the new criteria outperform the three existing criteria.
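Generalized cross-validation, one of the comparison criteria named above, selects the smoothness parameter of a linear smoother ŷ = S(h)y by minimizing GCV(h) = n·RSS(h)/(n − tr S(h))². A minimal sketch, using a Gaussian kernel smoother as a simple stand-in for a periodic smoothing spline and an invented periodic signal:

```python
# Sketch of generalized cross-validation (GCV) for a linear smoother
# y_hat = S(h) y, here a Gaussian kernel smoother with bandwidth h:
#   GCV(h) = n * RSS(h) / (n - tr S(h))^2.
# The toy periodic signal is illustrative, not the circadian data.
import math

def smoother_matrix(xs, h):
    """Row-normalized Gaussian kernel weights: the hat matrix S(h)."""
    S = []
    for xi in xs:
        w = [math.exp(-((xi - xj) / h) ** 2) for xj in xs]
        tot = sum(w)
        S.append([wj / tot for wj in w])
    return S

def gcv(xs, ys, h):
    n = len(xs)
    S = smoother_matrix(xs, h)
    fitted = [sum(S[i][j] * ys[j] for j in range(n)) for i in range(n)]
    rss = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    tr = sum(S[i][i] for i in range(n))   # effective degrees of freedom
    return n * rss / (n - tr) ** 2

xs = [i / 24 for i in range(48)]                    # two toy "days"
ys = [math.sin(2 * math.pi * x) + 0.1 * math.sin(17 * x) for x in xs]
best_h = min((gcv(xs, ys, h), h) for h in (0.02, 0.05, 0.1, 0.2, 0.4))[1]
```

The trace term penalizes rough fits: as h shrinks, RSS falls but tr S(h) approaches n, inflating the denominator, so GCV balances fidelity against effective degrees of freedom.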
Abstract:
The early detection of subjects with probable Alzheimer's disease (AD) is crucial for the effective application of treatment strategies. Here we explored the ability of a multitude of linear and non-linear classification algorithms to discriminate between the electroencephalograms (EEGs) of patients with varying degrees of AD and their age-matched control subjects. Absolute and relative spectral power, distribution of spectral power, and measures of spatial synchronization were calculated from recordings of resting eyes-closed continuous EEGs of 45 healthy controls, 116 patients with mild AD and 81 patients with moderate AD, recruited in two different centers (Stockholm, New York). The applied classification algorithms were: principal component linear discriminant analysis (PC LDA), partial least squares LDA (PLS LDA), principal component logistic regression (PC LR), partial least squares logistic regression (PLS LR), bagging, random forest, support vector machines (SVM) and feed-forward neural networks. Based on 10-fold cross-validation runs it could be demonstrated that even though modern computer-intensive classification algorithms such as random forests, SVM and neural networks show a slight superiority, the more classical classification algorithms performed nearly equally well. Using random forest classification, a considerable sensitivity of up to 85% and a specificity of 78% were reached even when testing only mild AD patients, whereas for the comparison of moderate AD vs. controls, using SVM and neural networks, values of 89% and 88% for sensitivity and specificity were achieved. Such a remarkable performance proves the value of these classification algorithms for clinical diagnostics.
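A k-fold cross-validated estimate of sensitivity and specificity, as reported above, can be sketched as follows. A one-dimensional nearest-centroid rule stands in for the classifiers named in the abstract, and the Gaussian toy features are invented, not EEG spectral measures:

```python
# Sketch: k-fold CV estimate of sensitivity and specificity for a
# 1-D nearest-centroid classifier.  Synthetic toy features stand in
# for the EEG spectral measures; not real patient data.
import random

def kfold_sens_spec(xs, ys, k=10, seed=0):
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]          # k disjoint test folds
    tp = tn = fp = fn = 0
    for fold in folds:
        train = [i for i in idx if i not in fold]
        m1 = sum(xs[i] for i in train if ys[i] == 1) / \
             sum(1 for i in train if ys[i] == 1)
        m0 = sum(xs[i] for i in train if ys[i] == 0) / \
             sum(1 for i in train if ys[i] == 0)
        for i in fold:                              # classify held-out fold
            pred = 1 if abs(xs[i] - m1) < abs(xs[i] - m0) else 0
            if ys[i] == 1 and pred == 1: tp += 1
            elif ys[i] == 0 and pred == 0: tn += 1
            elif ys[i] == 0 and pred == 1: fp += 1
            else: fn += 1
    return tp / (tp + fn), tn / (tn + fp)           # sensitivity, specificity

rng = random.Random(1)
xs = [rng.gauss(1.0, 0.5) for _ in range(60)] + \
     [rng.gauss(-1.0, 0.5) for _ in range(60)]      # toy "patients"/"controls"
ys = [1] * 60 + [0] * 60
sens, spec = kfold_sens_spec(xs, ys)
```

Because every subject is classified exactly once, by a model that never saw it during training, the pooled counts give an approximately unbiased estimate of out-of-sample sensitivity and specificity.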
Abstract:
The advances in computational biology have made simultaneous monitoring of thousands of features possible. High-throughput technologies not only bring about a much richer information context in which to study various aspects of gene function, but they also present the challenge of analyzing data with a large number of covariates and few samples. As an integral part of machine learning, classification of samples into two or more categories is almost always of interest to scientists. In this paper, we address the question of classification in this setting by extending partial least squares (PLS), a popular dimension reduction tool in chemometrics, to the context of generalized linear regression, based on a previous approach, Iteratively ReWeighted Partial Least Squares, i.e. IRWPLS (Marx, 1996). We compare our results with two-stage PLS (Nguyen and Rocke, 2002a; Nguyen and Rocke, 2002b) and other classifiers. We show that by phrasing the problem in a generalized linear model setting and by applying bias correction to the likelihood to avoid (quasi)separation, we often obtain lower classification error rates.
Abstract:
The construction of a reliable, practically useful prediction rule for future responses depends heavily on the "adequacy" of the fitted regression model. In this article, we consider the absolute prediction error, the expected value of the absolute difference between the future and predicted responses, as the model evaluation criterion. This prediction error is easier to interpret than the average squared error and is equivalent to the misclassification error for binary outcomes. We show that the distributions of the apparent error and its cross-validation counterparts are approximately normal even under a misspecified fitted model. When the prediction rule is "unsmooth", the variance of the above normal distribution can be estimated well via a perturbation-resampling method. We also show how to approximate the distribution of the difference of the estimated prediction errors from two competing models. With two real examples, we demonstrate that the resulting interval estimates for prediction errors provide much more information about model adequacy than the point estimates alone.
Abstract:
Suppose that we are interested in establishing simple but reliable rules for predicting future t-year survivors via censored regression models. In this article, we present inference procedures for evaluating such binary classification rules based on various prediction precision measures quantified by the overall misclassification rate, sensitivity and specificity, and positive and negative predictive values. Specifically, under various working models we derive consistent estimators for the above measures via substitution and cross-validation estimation procedures. Furthermore, we provide large sample approximations to the distributions of these nonsmooth estimators without assuming that the working model is correctly specified. Confidence intervals, for example, for the difference of the precision measures between two competing rules can then be constructed. All the proposals are illustrated with two real examples and their finite sample properties are evaluated via a simulation study.
Abstract:
BACKGROUND: Many HIV-infected patients on highly active antiretroviral therapy (HAART) experience metabolic complications including dyslipidaemia and insulin resistance, which may increase their coronary heart disease (CHD) risk. We developed a prognostic model for CHD tailored to the changes in risk factors observed in patients starting HAART. METHODS: Data from five cohort studies (British Regional Heart Study, Caerphilly and Speedwell Studies, Framingham Offspring Study, Whitehall II) on 13,100 men aged 40–70, with 114,443 years of follow-up, were used. CHD was defined as myocardial infarction or death from CHD. Model fit was assessed using the Akaike Information Criterion; generalizability across cohorts was examined using internal-external cross-validation. RESULTS: A parametric model based on the Gompertz distribution generalized best. Variables included in the model were systolic blood pressure, total cholesterol, high-density lipoprotein cholesterol, triglycerides, glucose, diabetes mellitus, body mass index and smoking status. Compared with patients not on HAART, the estimated CHD hazard ratio (HR) for patients on HAART was 1.46 (95% CI 1.15–1.86) for moderate and 2.48 (95% CI 1.76–3.51) for severe metabolic complications. CONCLUSIONS: The change in the risk of CHD in HIV-infected men starting HAART can be estimated based on typical changes in risk factors, assuming that HRs estimated using data from non-infected men are applicable to HIV-infected men. Based on this model, the risk of CHD is likely to increase, but increases may often be modest and could be offset by lifestyle changes.
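Internal-external cross-validation, as used above, repeatedly fits the model on all cohorts but one and evaluates it on the held-out cohort, so that generalizability across study populations is tested directly. A schematic leave-one-cohort-out loop; the simple linear model, the error metric and the synthetic data are illustrative stand-ins, not the actual Gompertz survival model or cohort data:

```python
# Schematic leave-one-cohort-out ("internal-external") cross-
# validation.  fit() and mae() are placeholder stand-ins for fitting
# a parametric survival model and assessing it; cohort names mirror
# the abstract but every data point is a fabricated stand-in.
import random

def fit(data):
    """OLS fit y = a + b*x on pooled (x, y) pairs (placeholder model)."""
    xs, ys = zip(*data)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def mae(model, data):
    """Mean absolute error of the model on held-out data."""
    a, b = model
    return sum(abs(y - (a + b * x)) for x, y in data) / len(data)

rng = random.Random(0)
cohorts = {name: [(x, 2 * x + 1 + rng.gauss(0, 0.2))
                  for x in (rng.uniform(0, 1) for _ in range(30))]
           for name in ("BRHS", "Caerphilly", "Speedwell",
                        "Framingham", "WhitehallII")}

scores = {}
for held_out in cohorts:
    train = [p for c, pts in cohorts.items() if c != held_out for p in pts]
    scores[held_out] = mae(fit(train), cohorts[held_out])
```

A model "generalizes best" in this scheme when its held-out scores stay uniformly good across all cohorts, not just on the pooled training data.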
Abstract:
Accurate seasonal-to-interannual streamflow forecasts based on climate information are critical for the optimal management and operation of water resources systems. Considering that most water supply systems are multipurpose, operating these systems to meet increasing demand under the growing stresses of climate variability and climate change, population and economic growth, and environmental concerns can be very challenging. The objective of this study was to investigate improvements in water resources systems management through the use of seasonal climate forecasts. Hydrological persistence (streamflow and precipitation) and large-scale recurrent oceanic-atmospheric patterns such as the El Niño/Southern Oscillation (ENSO), Pacific Decadal Oscillation (PDO), North Atlantic Oscillation (NAO), Atlantic Multidecadal Oscillation (AMO), Pacific-North American pattern (PNA), and customized sea surface temperature (SST) indices were investigated for their potential to improve streamflow forecast accuracy and increase forecast lead time in a river basin in central Texas. First, an ordinal polytomous logistic regression approach is proposed as a means of incorporating multiple predictor variables into a probabilistic forecast model. Forecast performance is assessed through a cross-validation procedure, using distributions-oriented metrics, and implications for decision making are discussed. Results indicate that, of the predictors evaluated, only hydrologic persistence and Pacific Ocean sea surface temperature patterns associated with ENSO and PDO provide forecasts that are statistically better than climatology. Second, a class of data-mining techniques known as tree-structured models is investigated to address the nonlinear dynamics of climate teleconnections and to screen promising probabilistic streamflow forecast models for river-reservoir systems. Results show that the tree-structured models can effectively capture the nonlinear features hidden in the data.
Skill scores of probabilistic forecasts generated by both classification trees and logistic regression trees indicate that seasonal inflows throughout the system can be predicted with sufficient accuracy to improve water management, especially in the winter and spring seasons in central Texas. Lastly, a simplified two-stage stochastic economic-optimization model was proposed to investigate improvement in water use efficiency and the potential value of using seasonal forecasts, under the assumption of optimal decision making under uncertainty. Model results demonstrate that incorporating the probabilistic inflow forecasts into the optimization model can provide a significant improvement in seasonal water contract benefits over climatology, with lower average deficits (increased reliability) for a given average contract amount, or improved mean contract benefits for a given level of reliability compared to climatology. The results also illustrate the trade-off between the expected contract amount and reliability, i.e., larger contracts can be signed at greater risk.
Abstract:
In this study, we demonstrate the power of applying complementary DNA (cDNA) microarray technology to identify candidate loci that exhibit subtle differences in expression levels associated with a complex trait in natural populations of a non-model organism. Using a highly replicated experimental design involving 180 cDNA microarray experiments, we measured gene-expression levels from 1098 transcript probes in 90 individuals originating from six brown trout (Salmo trutta) populations and one Atlantic salmon (Salmo salar) population, which follow either a migratory or a sedentary life history. We identified several candidate genes associated with preparatory adaptations to different life histories in salmonids, including genes encoding transaldolase 1, the constitutive heat-shock protein HSC70-1 and endozepine. Some of these genes clustered into functional groups, providing insight into the physiological pathways potentially involved in the expression of life-history-related phenotypic differences. Such differences included the down-regulation of genes involved in the respiratory system of future migratory individuals. In addition, we used linear discriminant analysis to identify a set of 12 genes that correctly classified immature individuals as migratory or sedentary with high accuracy. Using the expression levels of these 12 genes, 17 out of 18 individuals used for cross-validation were correctly assigned to their respective life-history phenotype. Finally, we found various candidate genes associated with physiological changes that are likely to be involved in preadaptations to seawater in anadromous populations of the genus Salmo, one of which was identified as encoding nucleophosmin 1. Our findings thus provide new molecular insights into salmonid life-history variation, opening new perspectives in the study of this complex trait.
Abstract:
High-resolution and highly precise age models for recent lake sediments (last 100–150 years) are essential for quantitative paleoclimate research. These are particularly important for sedimentological and geochemical proxies, where transfer functions cannot be established and calibration must be based upon the relation of sedimentary records to instrumental data. High-precision dating of the calibration period is most critical, as it directly determines the quality of the calibration statistics. Here, as an example, we compare radionuclide age models obtained on two high-elevation glacial lakes in the Central Chilean Andes (Laguna Negra: 33°38′S/70°08′W, 2,680 m a.s.l. and Laguna El Ocho: 34°02′S/70°19′W, 3,250 m a.s.l.). We show the different numerical models that produce accurate age-depth chronologies based on 210Pb profiles, and we explain how to obtain reduced age-error bars at the bottom part of the profiles, i.e., typically around the end of the 19th century. In order to constrain the age models, we propose a step-wise method: (i) sampling at irregularly spaced intervals for 226Ra, 210Pb and 137Cs, depending on the stratigraphy and microfacies; (ii) systematic comparison of numerical models for the calculation of 210Pb-based age models: constant flux constant sedimentation (CFCS), constant initial concentration (CIC), constant rate of supply (CRS) and sediment isotope tomography (SIT); (iii) numerical constraining of the CRS and SIT models with the 137Cs chronomarker of AD 1964; and (iv) step-wise cross-validation with independent diagnostic environmental stratigraphic markers of known age (e.g., volcanic ash layers, historical floods and earthquakes). In both examples, we also use airborne pollutants such as spheroidal carbonaceous particles (reflecting the history of fossil fuel emissions), excess atmospheric Cu deposition (reflecting the production history of a large local Cu mine), and turbidites related to historical earthquakes.
Our results show that the SIT model constrained with the 137Cs AD 1964 peak performs best over the entire chronological profile (last 100–150 years) and yields the smallest standard deviations for the sediment ages. Such precision is critical for the calibration statistics and, ultimately, for the quality of the quantitative paleoclimate reconstruction. The systematic comparison of CRS and SIT models also helps to validate the robustness of the chronologies in different sections of the profile. Although surprisingly poorly known and under-explored in paleolimnological research, the SIT model has great potential for paleoclimatological reconstructions based on lake sediments.
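The CRS model compared above dates a depth z from the unsupported 210Pb inventory remaining below it, t(z) = (1/λ)·ln(A(0)/A(z)), with λ = ln 2 / 22.3 yr⁻¹ the 210Pb decay constant. A minimal sketch; the inventory values are invented for illustration, not measurements from either lake:

```python
# Minimal sketch of the constant-rate-of-supply (CRS) 210Pb model:
#   t(z) = (1/lambda) * ln( A(0) / A(z) ),
# where A(z) is the cumulative unsupported 210Pb inventory below
# depth z and lambda = ln(2) / 22.3 yr^-1.  Inventories are made up.
import math

LAMBDA = math.log(2) / 22.3          # 210Pb decay constant (1/yr)

def crs_ages(inventories):
    """inventories[i] = unsupported 210Pb inventory below the top of
    slice i (e.g. Bq m^-2), ordered top to bottom; returns ages (yr)."""
    total = inventories[0]           # A(0): whole-core inventory
    return [math.log(total / a) / LAMBDA for a in inventories]

# invented cumulative inventories for successive core slices
inv = [1000.0, 700.0, 450.0, 250.0, 100.0, 30.0]
ages = crs_ages(inv)
```

An independent chronomarker such as the 137Cs AD 1964 fallout peak can then be checked against, or used to constrain, the resulting age-depth curve, which is the constraining step the abstract describes.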