774 resultados para binary data

em Queensland University of Technology - ePrints Archive


Relevância:

70.00% 70.00%

Publicador:

Resumo:

This dissertation is primarily an applied statistical modelling investigation, motivated by a case study comprising real data and real questions. Theoretical questions on modelling and computation of normalization constants arose from pursuit of these data analytic questions. The essence of the thesis can be described as follows. Consider binary data observed on a two-dimensional lattice. A common problem with such data is the ambiguity of zeroes recorded. These may represent zero response given some threshold (presence) or that the threshold has not been triggered (absence). Suppose that the researcher wishes to estimate the effects of covariates on the binary responses, whilst taking into account underlying spatial variation, which is itself of some interest. This situation arises in many contexts and the dingo, cypress and toad case studies described in the motivation chapter are examples of this. Two main approaches to modelling and inference are investigated in this thesis. The first is frequentist and based on generalized linear models, with spatial variation modelled by using a block structure or by smoothing the residuals spatially. The EM algorithm can be used to obtain point estimates, coupled with bootstrapping or asymptotic MLE estimates for standard errors. The second approach is Bayesian and based on a three- or four-tier hierarchical model, comprising a logistic regression with covariates for the data layer, a binary Markov Random field (MRF) for the underlying spatial process, and suitable priors for parameters in these main models. The three-parameter autologistic model is a particular MRF of interest. Markov chain Monte Carlo (MCMC) methods comprising hybrid Metropolis/Gibbs samplers is suitable for computation in this situation. Model performance can be gauged by MCMC diagnostics. Model choice can be assessed by incorporating another tier in the modelling hierarchy. This requires evaluation of a normalization constant, a notoriously difficult problem. Difficulty with estimating the normalization constant for the MRF can be overcome by using a path integral approach, although this is a highly computationally intensive method. Different methods of estimating ratios of normalization constants (N Cs) are investigated, including importance sampling Monte Carlo (ISMC), dependent Monte Carlo based on MCMC simulations (MCMC), and reverse logistic regression (RLR). I develop an idea present though not fully developed in the literature, and propose the Integrated mean canonical statistic (IMCS) method for estimating log NC ratios for binary MRFs. The IMCS method falls within the framework of the newly identified path sampling methods of Gelman & Meng (1998) and outperforms ISMC, MCMC and RLR. It also does not rely on simplifying assumptions, such as ignoring spatio-temporal dependence in the process. A thorough investigation is made of the application of IMCS to the three-parameter Autologistic model. This work introduces background computations required for the full implementation of the four-tier model in Chapter 7. Two different extensions of the three-tier model to a four-tier version are investigated. The first extension incorporates temporal dependence in the underlying spatio-temporal process. The second extensions allows the successes and failures in the data layer to depend on time. The MCMC computational method is extended to incorporate the extra layer. A major contribution of the thesis is the development of a fully Bayesian approach to inference for these hierarchical models for the first time. Note: The author of this thesis has agreed to make it open access but invites people downloading the thesis to send her an email via the 'Contact Author' function.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Optimal design for generalized linear models has primarily focused on univariate data. Often experiments are performed that have multiple dependent responses described by regression type models, and it is of interest and of value to design the experiment for all these responses. This requires a multivariate distribution underlying a pre-chosen model for the data. Here, we consider the design of experiments for bivariate binary data which are dependent. We explore Copula functions which provide a rich and flexible class of structures to derive joint distributions for bivariate binary data. We present methods for deriving optimal experimental designs for dependent bivariate binary data using Copulas, and demonstrate that, by including the dependence between responses in the design process, more efficient parameter estimates are obtained than by the usual practice of simply designing for a single variable only. Further, we investigate the robustness of designs with respect to initial parameter estimates and Copula function, and also show the performance of compound criteria within this bivariate binary setting.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

The method of generalized estimating equations (GEEs) provides consistent estimates of the regression parameters in a marginal regression model for longitudinal data, even when the working correlation model is misspecified (Liang and Zeger, 1986). However, the efficiency of a GEE estimate can be seriously affected by the choice of the working correlation model. This study addresses this problem by proposing a hybrid method that combines multiple GEEs based on different working correlation models, using the empirical likelihood method (Qin and Lawless, 1994). Analyses show that this hybrid method is more efficient than a GEE using a misspecified working correlation model. Furthermore, if one of the working correlation structures correctly models the within-subject correlations, then this hybrid method provides the most efficient parameter estimates. In simulations, the hybrid method's finite-sample performance is superior to a GEE under any of the commonly used working correlation models and is almost fully efficient in all scenarios studied. The hybrid method is illustrated using data from a longitudinal study of the respiratory infection rates in 275 Indonesian children.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

We investigate methods for data-based selection of working covariance models in the analysis of correlated data with generalized estimating equations. We study two selection criteria: Gaussian pseudolikelihood and a geodesic distance based on discrepancy between model-sensitive and model-robust regression parameter covariance estimators. The Gaussian pseudolikelihood is found in simulation to be reasonably sensitive for several response distributions and noncanonical mean-variance relations for longitudinal data. Application is also made to a clinical dataset. Assessment of adequacy of both correlation and variance models for longitudinal data should be routine in applications, and we describe open-source software supporting this practice.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

The method of generalized estimating equation-, (GEEs) has been criticized recently for a failure to protect against misspecification of working correlation models, which in some cases leads to loss of efficiency or infeasibility of solutions. However, the feasibility and efficiency of GEE methods can be enhanced considerably by using flexible families of working correlation models. We propose two ways of constructing unbiased estimating equations from general correlation models for irregularly timed repeated measures to supplement and enhance GEE. The supplementary estimating equations are obtained by differentiation of the Cholesky decomposition of the working correlation, or as score equations for decoupled Gaussian pseudolikelihood. The estimating equations are solved with computational effort equivalent to that required for a first-order GEE. Full details and analytic expressions are developed for a generalized Markovian model that was evaluated through simulation. Large-sample ".sandwich" standard errors for working correlation parameter estimates are derived and shown to have good performance. The proposed estimating functions are further illustrated in an analysis of repeated measures of pulmonary function in children.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

The method of generalised estimating equations for regression modelling of clustered outcomes allows for specification of a working matrix that is intended to approximate the true correlation matrix of the observations. We investigate the asymptotic relative efficiency of the generalised estimating equation for the mean parameters when the correlation parameters are estimated by various methods. The asymptotic relative efficiency depends on three-features of the analysis, namely (i) the discrepancy between the working correlation structure and the unobservable true correlation structure, (ii) the method by which the correlation parameters are estimated and (iii) the 'design', by which we refer to both the structures of the predictor matrices within clusters and distribution of cluster sizes. Analytical and numerical studies of realistic data-analysis scenarios show that choice of working covariance model has a substantial impact on regression estimator efficiency. Protection against avoidable loss of efficiency associated with covariance misspecification is obtained when a 'Gaussian estimation' pseudolikelihood procedure is used with an AR(1) structure.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

The article describes a generalized estimating equations approach that was used to investigate the impact of technology on vessel performance in a trawl fishery during 1988-96, while accounting for spatial and temporal correlations in the catch-effort data. Robust estimation of parameters in the presence of several levels of clustering depended more on the choice of cluster definition than on the choice of correlation structure within the cluster. Models with smaller cluster sizes produced stable results, while models with larger cluster sizes, that may have had complex within-cluster correlation structures and that had within-cluster covariates, produced estimates sensitive to the correlation structure. The preferred model arising from this dataset assumed that catches from a vessel were correlated in the same years and the same areas, but independent in different years and areas. The model that assumed catches from a vessel were correlated in all years and areas, equivalent to a random effects term for vessel, produced spurious results. This was an unexpected finding that highlighted the need to adopt a systematic strategy for modelling. The article proposes a modelling strategy of selecting the best cluster definition first, and the working correlation structure (within clusters) second. The article discusses the selection and interpretation of the model in the light of background knowledge of the data and utility of the model, and the potential for this modelling approach to apply in similar statistical situations.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The reduction of CO2 emissions and social exclusion are two key elements of UK transport strategy. Despite intensive research on each theme, little effort has so far been made linking the relationship between emissions and social exclusion. In addition, current knowledge on each theme is limited to urban areas; little research is available on these themes for rural areas. This research contributes to this gap in the literature by analysing 157 weekly activity-travel diary data collected from three case study areas with differential levels of area accessibility and area mobility options, located in rural Northern Ireland. Individual weekly CO2 emission levels from personal travel diaries (both hot exhaust emission and cold-start emission) were calculated using average speed models for different modes of transport. The socio-spatial patterns associated with CO2 emissions were identified using a general linear model whereas binary logistic regression analyses were conducted to identify mode choice behaviour and activity patterns. This research found groups that emitted a significantly lower level of CO2 included individuals living in an area with a higher level of accessibility and mobility, non-car, non-working, and low-income older people. However, evidence in this research also shows that although certain groups (e.g. those working, and residing in an area with a lower level of accessibility) emitted higher levels of CO2, their rate of participation in activities was however found to be significantly lower compared to their counterparts. Based on the study findings, this research highlights the need for both soft (e.g. teleworking) and physical (e.g. accessibility planning) policy measures in rural areas in order to meet government’s stated CO2 reduction targets while at the same time enhancing social inclusion.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

High levels of sitting have been linked with poor health outcomes. Previously a pragmatic MTI accelerometer data cut-point (100 count/min-1) has been used to estimate sitting. Data on the accuracy of this cut-point is unavailable. PURPOSE: To ascertain whether the 100 count/min-1 cut-point accurately isolates sitting from standing activities. METHODS: Participants fitted with an MTI accelerometer were observed performing a range of sitting, standing, light & moderate activities. 1-min epoch MTI data were matched to observed activities, then re-categorized as either sitting or not using the 100 count/min-1 cut-point. Self-report demographics and current physical activity were collected. Generalized estimating equation for repeated measures with a binary logistic model analyses (GEE), corrected for age, gender and BMI, were conducted to ascertain the odds of the MTI data being misclassified. RESULTS: Data were from 26 healthy subjects (8 men; 50% aged <25 years; mean BMI (SD) 22.7(3.8)m/kg2). MTI sitting and standing data mode was 0 count/min-1, with 46% of sitting activities and 21% of standing activities recording 0 count/min-1. The GEE was unable to accurately isolate sitting from standing activities using the 100 count/min-1 cut-point, since all sitting activities were incorrectly predicted as standing (p=0.05). To further explore the sensitivity of MTI data to delineate sitting from standing, the upper 95% confidence interval of the mean for the sitting activities (46 count/min-1) was used to re-categorise the data; this resulted in the GEE correctly classifying 49% of sitting, and 69% of standing activities. Using the 100 count/min-1 cut-point the data were re-categorised into a combined ‘sit/stand’ category and tested against other light activities: 88% of sit/stand and 87% of light activities were accurately predicted. Using Freedson’s moderate cut-point of 1952 count/min-1 the GEE accurately predicted 97% of light vs. 90% of moderate activities. CONCLUSION: The distributions of MTI recorded sitting and standing data overlap considerably, as such the 100 count/min -1 cut-point did not accurately isolate sitting from other static standing activities. The 100 count/min -1 cut-point more accurately predicted sit/stand vs. other movement orientated activities.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

miRDeep and its varieties are widely used to quantify known and novel micro RNA (miRNA) from small RNA sequencing (RNAseq). This article describes miRDeep*, our integrated miRNA identification tool, which is modeled off miRDeep, but the precision of detecting novel miRNAs is improved by introducing new strategies to identify precursor miRNAs. miRDeep* has a user-friendly graphic interface and accepts raw data in FastQ and Sequence Alignment Map (SAM) or the binary equivalent (BAM) format. Known and novel miRNA expression levels, as measured by the number of reads, are displayed in an interface, which shows each RNAseq read relative to the pre-miRNA hairpin. The secondary pre-miRNA structure and read locations for each predicted miRNA are shown and kept in a separate figure file. Moreover, the target genes of known and novel miRNAs are predicted using the TargetScan algorithm, and the targets are ranked according to the confidence score. miRDeep* is an integrated standalone application where sequence alignment, pre-miRNA secondary structure calculation and graphical display are purely Java coded. This application tool can be executed using a normal personal computer with 1.5 GB of memory. Further, we show that miRDeep* outperformed existing miRNA prediction tools using our LNCaP and other small RNAseq datasets. miRDeep* is freely available online at http://www.australianprostatecentre.org/research/software/mirdeep-star