939 results for STATISTICAL DATA


Relevance: 30.00%

Abstract:

Royal palm tree peroxidase (RPTP) is a very stable enzyme with regard to acidity, temperature, H2O2, and organic solvents. RPTP is thus a promising candidate for developing H2O2-sensitive biosensors for diverse applications in industry and analytical chemistry. RPTP belongs to the family of class III secretory plant peroxidases, which includes horseradish peroxidase isozyme C and the soybean and peanut peroxidases. Here we report the X-ray structure of native RPTP, isolated from the royal palm tree (Roystonea regia), refined to a resolution of 1.85 Å. RPTP shows the overall folding pattern of the plant peroxidase superfamily, and it contains one heme group and two calcium-binding sites in similar locations. The three-dimensional structure of RPTP was solved in a hydroperoxide complex state, and it revealed a bound 2-(N-morpholino)ethanesulfonic acid (MES) molecule positioned at a putative secondary substrate-binding site. Nine N-glycosylation sites are clearly defined in the RPTP electron-density maps, revealing for the first time the conformations of the glycan chains of this highly glycosylated enzyme. Furthermore, a statistical coupling analysis (SCA) of the plant peroxidase superfamily was performed. This sequence-based method identified a set of evolutionarily conserved sites that map to regions surrounding the heme prosthetic group. The SCA matrix also predicted a set of energetically coupled residues involved in maintaining the structural fold of plant peroxidases. The combination of crystallographic data and SCA analysis provides information about the key structural elements that could help explain the unique stability of RPTP.
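The SCA step is purely sequence-based: it looks for alignment positions whose residue compositions co-vary across the superfamily. As a rough sketch of that idea (a random toy alignment and a reduced alphabet, not the authors' peroxidase data or their exact SCA weighting), in R:

```r
# Toy rendition of statistical coupling analysis (SCA): positions whose
# residue patterns co-vary across sequences get large coupling scores.
set.seed(1)
aa  <- c("A", "G", "H", "R")                       # reduced alphabet (toy)
msa <- matrix(sample(aa, 20 * 6, replace = TRUE),  # 20 sequences x 6 positions
              nrow = 20)

# Indicator of the most frequent residue at each position
mode_res <- apply(msa, 2, function(col) names(which.max(table(col))))
ind <- sapply(seq_len(ncol(msa)),
              function(j) as.numeric(msa[, j] == mode_res[j]))

# Coupling matrix: covariation between positional profiles
coupling <- abs(cov(ind))
diag(coupling) <- 0
round(coupling, 3)
```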

Relevance: 30.00%

Abstract:

In the context of either Bayesian or classical sensitivity analyses of over-parametrized models for incomplete categorical data, it is well known that the dependence of posterior inferences for nonidentifiable parameters on the prior, or the choice of overly parsimonious over-parametrized models, may lead to erroneous conclusions. Nevertheless, some authors either pay no attention to which parameters are nonidentifiable or do not appropriately account for possible prior-dependence. We review the literature on this topic and consider simple examples to emphasize that, in both inferential frameworks, the subjective components can influence results in nontrivial ways, irrespective of the sample size. Specifically, we show that prior distributions commonly regarded as slightly informative or noninformative may actually be too informative for nonidentifiable parameters, and that the choice of over-parametrized models may drastically affect the results, suggesting that a careful examination of their effects should precede any conclusions.
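The core phenomenon is easy to reproduce: when the likelihood depends on two parameters only through their sum, the posterior of each one echoes its prior no matter how large the sample. A hedged grid-approximation sketch in R (all numbers hypothetical):

```r
# The data inform only psi = theta1 + theta2, so theta1 is nonidentifiable:
# its posterior mean moves with the prior even at n = 10000.
set.seed(2)
n <- 1e4
y <- rbinom(1, n, 0.6)

grid <- expand.grid(t1 = seq(0.01, 0.99, 0.01),
                    t2 = seq(0.01, 0.99, 0.01))
grid <- grid[grid$t1 + grid$t2 < 1, ]          # keep psi a valid probability
loglik <- dbinom(y, n, grid$t1 + grid$t2, log = TRUE)

post_mean_t1 <- function(logprior) {
  w <- exp(loglik + logprior - max(loglik + logprior))
  sum(w * grid$t1) / sum(w)
}

flat <- rep(0, nrow(grid))                     # uniform prior on (t1, t2)
tilt <- dbeta(grid$t1, 4, 2, log = TRUE)       # mildly informative on t1
c(flat = post_mean_t1(flat), tilted = post_mean_t1(tilt))  # they disagree
```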

Relevance: 30.00%

Abstract:

When missing data occur in studies designed to compare the accuracy of diagnostic tests, a common, though naive, practice is to base the comparison of sensitivity, specificity, as well as of positive and negative predictive values on some subset of the data that fits into methods implemented in standard statistical packages. Such methods are usually valid only under the strong missing completely at random (MCAR) assumption and may generate biased and less precise estimates. We review some models that use the dependence structure of the completely observed cases to incorporate the information of the partially categorized observations into the analysis and show how they may be fitted via a two-stage hybrid process involving maximum likelihood in the first stage and weighted least squares in the second. We indicate how computational subroutines written in R may be used to fit the proposed models and illustrate the different analysis strategies with observational data collected to compare the accuracy of three distinct non-invasive diagnostic methods for endometriosis. The results indicate that even when the MCAR assumption is plausible, the naive partial analyses should be avoided.
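As a rough illustration of the two-stage logic (hypothetical counts, not the authors' actual models or the endometriosis data), sensitivities can be estimated by maximum likelihood as simple proportions in a first stage and compared across tests by weighted least squares in a second:

```r
# Stage 1: ML estimates of sensitivity for three diagnostic tests
pos   <- c(42, 55, 61)     # true positives detected by tests A, B, C
ncase <- c(70, 80, 85)     # diseased subjects evaluated by each test
p_hat <- pos / ncase
v_hat <- p_hat * (1 - p_hat) / ncase       # estimated variances

# Stage 2: WLS common estimate and homogeneity test across tests
p_bar <- sum(p_hat / v_hat) / sum(1 / v_hat)
X2    <- sum((p_hat - p_bar)^2 / v_hat)
pchisq(X2, df = 2, lower.tail = FALSE)     # H0: equal sensitivities
```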

Relevance: 30.00%

Abstract:

We review some issues related to the implications of different missing data mechanisms on statistical inference for contingency tables and consider simulation studies to compare the results obtained under such models to those where the units with missing data are disregarded. We confirm that although, in general, analyses under the correct missing at random and missing completely at random models are more efficient even for small sample sizes, there are exceptions where they may not improve the results obtained by ignoring the partially classified data. We show that under the missing not at random (MNAR) model, estimates on the boundary of the parameter space, as well as lack of identifiability of the parameters of saturated models, may be associated with undesirable asymptotic properties of maximum likelihood estimators and likelihood ratio tests; even in standard cases, the bias of the estimators may be low only for very large samples. We also show that the probability of a boundary solution obtained under the correct MNAR model may be large even for large samples and that, consequently, we may not always conclude that an MNAR model is misspecified because the estimate is on the boundary of the parameter space.
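A toy version of the identifiability issue is compact: for a binary outcome with nonignorable missingness, a saturated MNAR model has three parameters while the observed counts carry only two degrees of freedom, so the likelihood is maximized along a ridge rather than at a point. A hedged R sketch (hypothetical counts):

```r
# theta = P(Y = 1); phi1, phi0 = P(missing | Y = 1), P(missing | Y = 0)
counts <- c(y1 = 30, y0 = 50, miss = 60)

negll <- function(p) {
  theta <- p[1]; phi1 <- p[2]; phi0 <- p[3]
  -(counts["y1"]   * log(theta * (1 - phi1)) +
    counts["y0"]   * log((1 - theta) * (1 - phi0)) +
    counts["miss"] * log(theta * phi1 + (1 - theta) * phi0))
}

fit1 <- optim(c(0.5, 0.5, 0.5), negll, method = "L-BFGS-B",
              lower = 1e-6, upper = 1 - 1e-6)
fit2 <- optim(c(0.9, 0.1, 0.9), negll, method = "L-BFGS-B",
              lower = 1e-6, upper = 1 - 1e-6)
rbind(fit1$par, fit2$par)    # different estimates from different starts ...
c(fit1$value, fit2$value)    # ... with (essentially) the same likelihood
```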

Relevance: 30.00%

Abstract:

We discuss the connection between information and copula theories by showing that a copula can be employed to decompose the information content of a multivariate distribution into marginal and dependence components, with the latter quantified by the mutual information. We define the information excess as a measure of deviation from a maximum-entropy distribution. The idea of margin-invariant dependence measures is also discussed and used to show that the empirical linear correlation underestimates the amplitude of the actual correlation in the case of non-Gaussian marginals. The mutual information is shown to provide an upper bound for the asymptotic empirical log-likelihood of a copula. An analytical expression for the information excess of T-copulas is provided, allowing for simple model identification within this family. We illustrate the framework with a financial data set.
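The underestimation claim is straightforward to demonstrate: keep the copula fixed and make the marginals heavy-tailed; the empirical Pearson correlation drops while a margin-invariant measure barely moves. An illustrative R sketch (arbitrary parameter values):

```r
set.seed(3)
n   <- 1e5
rho <- 0.8
z1  <- rnorm(n)
z2  <- rho * z1 + sqrt(1 - rho^2) * rnorm(n)     # Gaussian pair, cor = rho

u1 <- pnorm(z1); u2 <- pnorm(z2)                 # the copula: uniform scores
x1 <- qlnorm(u1, sdlog = 2)                      # heavy-tailed marginals
x2 <- qlnorm(u2, sdlog = 2)

c(gaussian   = cor(z1, z2),                      # ~ 0.80
  lognormal  = cor(x1, x2),                      # clearly smaller
  rank_based = cor(x1, x2, method = "spearman")) # margin-invariant

# For the bivariate Gaussian copula the mutual information is closed-form:
-0.5 * log(1 - rho^2)
```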

Relevance: 30.00%

Abstract:

We propose a likelihood ratio test (LRT) with Bartlett correction to identify Granger causality between sets of time-series gene expression data. The performance of the proposed test is compared to that of a previously published bootstrap-based approach. The LRT is shown to be significantly faster and to retain statistical power even under non-normal distributions. An R package named gGranger, containing implementations of both Granger causality identification tests, is also provided.
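The gGranger interface itself is not reproduced here; the following base-R sketch shows the uncorrected Granger causality LRT on simulated series (the Bartlett correction rescales this statistic by a data-dependent factor of order 1 + O(1/n), omitted here):

```r
set.seed(4)
n <- 200
x <- arima.sim(list(ar = 0.5), n)
y <- filter(0.4 * c(0, x[-n]), filter = 0.3, method = "recursive") + rnorm(n)

d <- data.frame(y = y[-1], ylag = y[-n], xlag = x[-n])
restricted   <- lm(y ~ ylag, data = d)           # y explained by its own past
unrestricted <- lm(y ~ ylag + xlag, data = d)    # ... plus x's past

lrt <- nrow(d) * log(sum(residuals(restricted)^2) /
                     sum(residuals(unrestricted)^2))
pchisq(lrt, df = 1, lower.tail = FALSE)      # H0: x does not Granger-cause y
```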

Relevance: 30.00%

Abstract:

The aim of this study is to evaluate the variation in solar radiation data between different data sources that will be freely available at the Solar Energy Research Center (SERC). The comparison between data sources is carried out for two locations: Stockholm, Sweden and Athens, Greece. For each location, data are gathered for four south-facing tilt angles: 0°, 30°, 45°, and 60°. The full dataset is available in two Excel files, "Stockholm annual irradiation" and "Athens annual irradiation". The World Radiation Data Center (WRDC) is taken as the reference for the comparison with the other datasets, because it has the longest recorded time span for Stockholm (1964–2010) and Athens (1964–1986), in the form of average monthly irradiation expressed in kWh/m². The indicator used for the data comparison is the estimated standard deviation; the mean bias error (MBE) and the root mean square error (RMSE) are also used as statistical indicators for the horizontal solar irradiation data. The variation in solar irradiation data is grouped into three categories: natural (inter-annual) variability, differences between data sources, and differences between calculation models. The inter-annual variation is 140.4 kWh/m² (14.4%) for Stockholm and 124.3 kWh/m² (8.0%) for Athens. The estimated deviation for horizontal solar irradiation is 3.7% for Stockholm and 4.4% for Athens; it is respectively 4.5% and 3.6% at 30° tilt, 5.2% and 4.5% at 45°, and 5.9% and 7.0% at 60°. NASA's SSE, SAM, and RETScreen exhibited the highest deviation from the WRDC data for Stockholm, and Satel-light for Athens. The main source of variation is the difference in horizontal solar irradiation. The variation increases by 1–2% per degree of tilt when different calculation models are used, as in PVSYST and Meteonorm. The location and altitude of the data source did not directly influence the deviation from the WRDC data. Further work is suggested to improve the methodology for selecting locations, to examine the functional dependence of ground-reflected radiation on ambient temperature, to assess the variation of ambient temperature and its impact on different solar energy systems, and to evaluate the impact of variation in solar irradiation and ambient temperature on system output.
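For reference, the two error indicators named above are simple to compute; a sketch in R with made-up monthly irradiation values, where ref stands in for the WRDC record:

```r
ref <- c(12, 25, 62, 95, 142, 151, 146, 112, 64, 30, 13, 8)  # kWh/m2, invented
alt <- c(14, 24, 66, 99, 150, 158, 149, 117, 66, 29, 14, 9)  # competing source

mbe    <- mean(alt - ref)                    # mean bias error
rmse   <- sqrt(mean((alt - ref)^2))          # root mean square error
rel_sd <- sd(alt - ref) / mean(ref) * 100    # one possible relative spread, %
c(MBE = mbe, RMSE = rmse, rel_sd_pct = rel_sd)
```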

Relevance: 30.00%

Abstract:

Background: qtl.outbred is an extendible interface in the statistical environment R for combining quantitative trait loci (QTL) mapping tools. It is built as an umbrella package that enables outbred genotype probabilities to be calculated and/or imported into the software package R/qtl. Findings: Using qtl.outbred, the genotype probabilities from outbred line cross data can be calculated by interfacing with a new and efficient algorithm developed for analyzing arbitrarily large datasets (included in the package), or imported from other sources such as the web-based tool GridQTL. Conclusion: qtl.outbred improves the speed of calculating probabilities and the ability to analyse large future datasets. The package enables the user to analyse outbred line cross data accurately, with effort similar to that required for inbred line cross data.
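For orientation, the corresponding steps inside R/qtl itself look as follows; these are real R/qtl calls on its bundled example cross, while the qtl.outbred functions that replace the probability step for outbred data are not shown:

```r
library(qtl)
data(hyper)                              # example backcross shipped with R/qtl
hyper <- calc.genoprob(hyper, step = 1)  # genotype probabilities on a 1 cM grid
out   <- scanone(hyper, method = "hk")   # Haley-Knott genome scan
summary(out, threshold = 3)              # peaks above a LOD threshold of 3
```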

Relevance: 30.00%

Abstract:

This thesis develops and evaluates statistical methods for different types of genetic analyses, including quantitative trait loci (QTL) analysis, genome-wide association studies (GWAS), and genomic evaluation. The main contribution of the thesis is to provide novel insights into modeling genetic variance, especially via random effects models. In variance component QTL analysis, a full likelihood model accounting for uncertainty in the identity-by-descent (IBD) matrix was developed; it was found to correctly adjust the bias in genetic variance component estimation and to gain precision in QTL mapping. Double hierarchical generalized linear models, and a non-iterative simplified version, were implemented and applied to fit data of an entire genome. These whole-genome models were shown to perform well in both QTL mapping and genomic prediction. A re-analysis of a publicly available GWAS data set identified significant loci in Arabidopsis that control phenotypic variance instead of the mean, which validated the idea of variance-controlling genes. The work in the thesis is accompanied by R packages available online, including a general statistical tool for fitting random effects models (hglm), an efficient generalized ridge regression for high-dimensional data (bigRR), a double-layer mixed model for genomic data analysis (iQTL), a stochastic IBD matrix calculator (MCIBD), a computational interface for QTL mapping (qtl.outbred), and a GWAS analysis tool for mapping variance-controlling loci (vGWAS).
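One listed contribution, bigRR's generalized ridge regression, rests on a linear-algebra shortcut worth sketching: with p predictors and n << p observations, the ridge solution can be obtained from an n x n system instead of a p x p one. A plain-ridge illustration in base R (simulated data; bigRR's generalized, SNP-specific shrinkage is not reproduced):

```r
set.seed(5)
n <- 100; p <- 1000
X <- matrix(rnorm(n * p), n, p)              # e.g., a SNP matrix
beta <- c(rnorm(10), rep(0, p - 10))         # a few true effects
y <- X %*% beta + rnorm(n)
lambda <- 10

# Direct ridge estimate: a p x p solve
beta_hat  <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y))

# Same estimate via the n x n system, the trick that makes p >> n feasible
beta_dual <- crossprod(X, solve(tcrossprod(X) + lambda * diag(n), y))
max(abs(beta_hat - beta_dual))               # equal up to numerical error
```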

Relevance: 30.00%

Abstract:

Parkinson's disease (PD) is an increasingly common neurological disorder in an aging society. The motor and non-motor symptoms of PD advance with disease progression and occur with varying frequency and duration. To assess the full extent of a patient's condition, repeated assessments are necessary to adjust medical prescription. In clinical studies, symptoms are assessed using the unified Parkinson's disease rating scale (UPDRS). On the one hand, subjective rating using the UPDRS relies on clinical expertise; on the other hand, it requires the physical presence of patients in clinics, which implies high logistical costs. A further limitation of clinical assessment is that in-hospital observation may not accurately represent a patient's situation at home. For such reasons, the practical frequency of tracking PD symptoms may under-represent the true time scale of PD fluctuations and may result in an overall inaccurate assessment. Current technologies for at-home PD treatment are based on data-driven approaches for which the interpretation and reproduction of results are problematic. The overall objective of this thesis is to develop and evaluate unobtrusive computer methods for enabling remote monitoring of patients with PD. It investigates novel signal and image processing techniques, based on first-principles and data-driven models, for extracting clinically useful information from audio recordings of speech (texts read aloud) and video recordings of gait and finger-tapping motor examinations. The aim is to map between PD symptom severities estimated using the novel computer methods and clinical ratings based on UPDRS part III (motor examination). A web-based test battery system consisting of self-assessment of symptoms and motor function tests was previously constructed for a touch-screen mobile device. A comprehensive speech framework has been developed for this device to analyze text-dependent running speech by: (1) extracting novel signal features able to represent PD deficits in each individual component of the speech system; (2) mapping between clinical ratings and feature estimates of speech symptom severity; and (3) classifying between UPDRS part-III severity levels using speech features and statistical machine learning tools. A novel speech processing method called cepstral separation difference showed a stronger ability to classify between speech symptom severities than existing features of PD speech. For finger tapping, the recorded videos of the rapid finger-tapping examination were processed using a novel computer-vision (CV) algorithm that extracts symptom information from video-based tapping signals using motion analysis of the index finger, incorporating a face detection module for signal calibration. This algorithm was able to discriminate between UPDRS part-III severity levels of finger tapping with high classification rates. Further analysis was performed on novel CV-based gait features, constructed using a standard human model, to discriminate between a healthy gait and a Parkinsonian gait. The findings of this study suggest that symptom severity levels in PD can be discriminated with high accuracy by combining first-principles (feature) and data-driven (classification) approaches. The processing of audio and video recordings allows remote monitoring of speech, gait and finger-tapping examinations by clinical staff, while the first-principles approach eases clinicians' understanding of the symptom estimates. We have demonstrated that the selected features of speech, gait and finger tapping were able to discriminate between symptom severity levels, as well as between healthy controls and PD patients, with high classification rates. The findings support the suitability of these methods as decision support tools in the context of PD assessment.
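The speech features build on cepstral analysis; the thesis's cepstral separation difference method is not reproduced here, but a minimal real-cepstrum computation in R shows the underlying quantity (synthetic two-harmonic signal, toy parameters):

```r
fs  <- 8000
t   <- seq(0, 0.2, by = 1 / fs)
sig <- sin(2 * pi * 120 * t) + 0.4 * sin(2 * pi * 240 * t)  # toy voiced tone

spec     <- fft(sig)
cepstrum <- Re(fft(log(Mod(spec) + 1e-12), inverse = TRUE)) / length(sig)
quef     <- (seq_along(sig) - 1) / fs        # quefrency axis, in seconds

# Largest rahmonic away from the origin: expected near the fundamental
# period 1/120 s (or a multiple of it) for this synthetic signal
quef[which.max(cepstrum[20:400]) + 19]
```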

Relevance: 30.00%

Abstract:

Generalized linear mixed models are flexible tools for modeling non-normal data and are useful for accommodating overdispersion in Poisson regression models with random effects. Their main difficulty resides in parameter estimation, because there is no analytic solution for maximizing the marginal likelihood. Many methods have been proposed for this purpose, and many of them are implemented in software packages. The purpose of this study is to compare, via simulation studies, the performance of three different statistical principles: marginal likelihood, extended likelihood, and Bayesian analysis. Real data on contact wrestling are used for illustration.
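As a concrete instance of the setting (not the study's own code), an overdispersed Poisson model with a random intercept can be simulated and then fitted with lme4's Laplace approximation to the marginal likelihood:

```r
library(lme4)
set.seed(6)
id <- factor(rep(1:50, each = 4))
u  <- rnorm(50, sd = 0.7)[id]              # subject-level random effect
x  <- rnorm(200)
y  <- rpois(200, exp(0.2 + 0.5 * x + u))   # marginally overdispersed counts

fit <- glmer(y ~ x + (1 | id), family = poisson)
summary(fit)$varcor                        # estimated random-effect variance
```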

Relevance: 30.00%

Abstract:

A number of recent works have introduced statistical methods for detecting genetic loci that affect phenotypic variability, which we refer to as variability-controlling quantitative trait loci (vQTL). These are genetic variants whose allelic state predicts how much the phenotype values will vary about their expected means. Such loci are of great potential interest in both human and non-human genetic studies, one reason being that a detected vQTL could represent a previously undetected interaction with other genes or environmental factors. The simultaneous publication of these new methods in different journals has in many cases precluded the opportunity for comparison. We survey some of these methods, the respective trade-offs they imply, and the connections between them. The methods fall into three main groups: classical non-parametric, fully parametric, and semi-parametric two-stage approximations. Choosing between alternatives involves balancing the need for robustness, flexibility, and speed. For each method, we identify important assumptions and limitations, including those of practical importance, such as their scope for including covariates and random effects. We show in simulations that both parametric methods and their semi-parametric approximations can give elevated false positive rates when they ignore mean-variance relationships intrinsic to the data generation process. We conclude that the choice of method depends on the trait distribution, the need to include non-genetic covariates, and the population size and structure, coupled with a critical evaluation of how these fit the assumptions of the statistical model.
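The two-stage approximation mentioned above can be sketched in a few lines of R: stage one removes the genotype's effect on the mean, stage two tests whether genotype predicts the residual spread, which amounts to a Levene-type test (simulated data, not from any real study):

```r
set.seed(7)
g <- factor(sample(0:2, 500, replace = TRUE))   # SNP genotypes
y <- rnorm(500, mean = 0.3 * as.numeric(g),
           sd = c(1, 1, 1.8)[as.numeric(g)])    # one genotype inflates variance

stage1 <- lm(y ~ g)                             # stage 1: mean model
z      <- abs(residuals(stage1))                # dispersion proxy
anova(lm(z ~ g))                                # stage 2: vQTL signal if g
                                                # explains the residual spread
```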

Relevance: 30.00%

Abstract:

Delineation of commuting regions has always been based on statistical units, often municipalities or wards. However, using these units has certain disadvantages because their land areas differ considerably. Much information is lost in the larger spatial base units, and distortions occur in self-containment values, the main criterion in rule-based delineation procedures. Alternatively, one can start from relatively small standard-size units such as hexagons; in this way, much greater detail in spatial patterns is obtained. In this paper, regions are built by means of intrazonal maximization (Intramax) on the basis of hexagons. The use of geoprocessing tools developed specifically for processing commuting data speeds up processing time considerably. The results of the Intramax analysis are evaluated against travel-to-work area constraints, and comparisons are made with commuting fields, accessibility to employment, commuting flow density, and network commuting flow size. From selected steps in the regionalization process, a hierarchy of nested commuting regions emerges, revealing the complexity of commuting patterns.
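Intramax is a stepwise aggregation: at each step, the pair of zones with the strongest mutual interaction relative to their flow totals is merged. A toy single-step sketch in R, with a made-up five-zone flow matrix standing in for hexagon-level commuting flows:

```r
flows <- matrix(c( 0, 40,  5,  2,  1,
                  35,  0,  8,  3,  2,
                   4,  6,  0, 25,  9,
                   3,  2, 30,  0, 12,
                   1,  2, 10, 14,  0), 5, 5, byrow = TRUE)

intramax_step <- function(f) {
  o <- rowSums(f); d <- colSums(f)
  score <- f / (o %o% d) + t(f) / (d %o% o)  # interaction measure in the
  diag(score) <- -Inf                        # spirit of Intramax
  best <- which(score == max(score), arr.ind = TRUE)[1, ]
  i <- min(best); j <- max(best)
  f[i, ] <- f[i, ] + f[j, ]                  # merge zone j into zone i
  f[, i] <- f[, i] + f[, j]
  f[i, i] <- 0                               # drop the new intrazonal flow
  f[-j, -j, drop = FALSE]
}
dim(intramax_step(flows))                    # one merge: 5 zones -> 4
```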

Relevance: 30.00%

Abstract:

Researchers analyzing spatiotemporal or panel data, which varies both in location and over time, often find that their data has holes or gaps. This thesis explores alternative methods for filling those gaps and also suggests a set of techniques for evaluating those gap-filling methods to determine which works best.
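A common evaluation approach in this spirit (not necessarily the thesis's own): withhold values that are actually known, fill the artificial gaps, and score the fills against the held-out truth. A sketch in R, with linear interpolation standing in for whichever gap-filling method is under evaluation (simulated series):

```r
set.seed(8)
truth <- sin(seq(0, 4 * pi, length.out = 120)) + rnorm(120, sd = 0.1)

holes <- sample(10:110, 15)                  # hide 15 interior observations
obs <- truth; obs[holes] <- NA

filled <- approx(which(!is.na(obs)), obs[!is.na(obs)],
                 xout = seq_along(obs))$y    # linear gap fill

sqrt(mean((filled[holes] - truth[holes])^2)) # RMSE on withheld points
```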

Relevance: 30.00%

Abstract:

Enriquillo and Azuei are saltwater lakes located in a closed water basin in the southwestern region of the island of Hispaniola; they have been experiencing dramatic changes in total lake-surface area during the period 1980-2012. Lake Enriquillo had a surface area of approximately 276 km² in 1984, gradually decreasing to 172 km² in 1996. The surface area of the lake reached its lowest point in the satellite observation record in 2004, at 165 km². The lake then began to grow, reaching its 1984 size by 2006; based on surface area measurements for June and July 2013, Lake Enriquillo now has a surface area of ~358 km². Lake Azuei measured 116 km² in 1984 and 134 km² in 2013, at the two ends of the record, an overall 15.8% increase in 30 years. Determining the causes of lake surface area changes is of extreme importance due to their environmental, social, and economic impacts. The overall goal of this study is to quantify the changing water balance in these lakes and their catchment area using satellite and ground observations and a regional atmospheric-hydrologic modeling approach. Data analyses of environmental variables in the region reflect a hydrological imbalance of the lakes due to changing regional hydro-climatic conditions. Historical data show precipitation, land surface temperature and humidity, and sea surface temperature (SST) increasing over the region during the past decades. Salinity levels have also been decreasing by more than 30% from previously reported baseline levels. Here we present a summary of the historical data obtained, the new sensors deployed in the surrounding sierras and the lakes, and the integrated modeling exercises, as well as the challenges of gathering, storing, sharing, and analyzing this large volume of data from such a diverse number of sources at a remote location.