Digital Library

156 results for Ordered Categorical Data

in University of Queensland eSpace - Australia

Fitting genetic models to twin data with binary and ordered categorical responses: A comparison of structural equation modelling and Bayesian hierarchical models

Relevance:

100.00%

Abstract:

We compare Bayesian methodology utilizing the freeware BUGS (Bayesian Inference Using Gibbs Sampling) with the traditional structural equation modelling approach based on another freeware package, Mx. Dichotomous and ordinal (three category) twin data were simulated according to different additive genetic and common environment models for phenotypic variation. Practical issues are discussed in using Gibbs sampling as implemented by BUGS to fit subject-specific Bayesian generalized linear models, where the components of variation may be estimated directly. The simulation study (based on 2000 twin pairs) indicated that there is a consistent advantage in using the Bayesian method to detect a correct model under certain specifications of additive genetics and common environmental effects. For binary data, both methods had difficulty in detecting the correct model when the additive genetic effect was low (between 10 and 20%) or of moderate range (between 20 and 40%). Furthermore, neither method could adequately detect a correct model that included a modest common environmental effect (20%) even when the additive genetic effect was large (50%). Power was significantly improved with ordinal data for most scenarios, except for the case of low heritability under a true ACE model. We illustrate and compare both methods using data from 1239 twin pairs over the age of 50 years, who were registered with the Australian National Health and Medical Research Council Twin Registry (ATR) and presented symptoms associated with osteoarthritis occurring in joints of the hand.

Data mining and simulation: a grey relationship demonstration

Relevance:

90.00%

Abstract:

Fuzzy data has grown to be an important factor in data mining. Whenever uncertainty exists, simulation can be used as a model. Simulation is very flexible, although it can involve significant levels of computation. This article discusses fuzzy decision-making using the grey related analysis method. Fuzzy models are expected to better reflect decision-making uncertainty, at some cost in accuracy relative to crisp models. Monte Carlo simulation is used to incorporate experimental levels of uncertainty into the data and to measure the impact of fuzzy decision tree models using categorical data. Results are compared with decision tree models based on crisp continuous data.

Genetic variability in cultivated common bean beyond the two major gene pools

Relevance:

80.00%

Abstract:

It is generally accepted that two major gene pools exist in cultivated common bean (Phaseolus vulgaris L.), a Middle American and an Andean one. Some evidence, based on unique phaseolin morphotypes and AFLP analysis, suggests that at least one more gene pool exists in cultivated common bean. To investigate this hypothesis, 1072 accessions from a common bean core collection from the primary centres of origin, held at CIAT, were investigated. Various agronomic and morphological attributes (14 categorical and 11 quantitative) were measured. Multivariate analyses, consisting of homogeneity analysis and clustering for categorical data, clustering and ordination techniques for quantitative data and nonlinear principal component analysis for mixed data, were undertaken. The results of most analyses supported the existence of the two major gene pools. However, the analysis of categorical data of protein types showed an additional minor gene pool. The minor gene pool is designated North Andean and includes phaseolin types CH, S and T; lectin types 312, Pr, B and K; and mostly A5, A6 and A4 types of alpha-amylase inhibitor. Analysis of the combined categorical data of protein types and some plant categorical data also suggested that some other germplasm with C type phaseolin is distinguished from the major gene pools.

Prediction of protein continuum secondary structure with probabilistic models based on NMR solved structures

Relevance:

80.00%

Abstract:

Background: The structure of proteins may change as a result of the inherent flexibility of some protein regions. We develop and explore probabilistic machine learning methods for predicting a continuum secondary structure, i.e. assigning probabilities to the conformational states of a residue. We train our methods using data derived from high-quality NMR models. Results: Several probabilistic models not only successfully estimate the continuum secondary structure, but also provide a categorical output on par with models directly trained on categorical data. Importantly, models trained on the continuum secondary structure are also better than their categorical counterparts at identifying the conformational state for structurally ambivalent residues. Conclusion: Cascaded probabilistic neural networks trained on the continuum secondary structure exhibit better accuracy in structurally ambivalent regions of proteins, while sustaining an overall classification accuracy on par with standard, categorical prediction methods.

A Bayesian hierarchical model for categorical longitudinal data from a social survey of immigrants

Relevance:

40.00%

Abstract:

The paper investigates a Bayesian hierarchical model for the analysis of categorical longitudinal data from a large social survey of immigrants to Australia. Data for each subject are observed on three separate occasions, or waves, of the survey. One of the features of the data set is that observations for some variables are missing for at least one wave. A model for the employment status of immigrants is developed by introducing, at the first stage of a hierarchical model, a multinomial model for the response and then subsequent terms are introduced to explain wave and subject effects. To estimate the model, we use the Gibbs sampler, which allows missing data for both the response and the explanatory variables to be imputed at each iteration of the algorithm, given some appropriate prior distributions. After accounting for significant covariate effects in the model, results show that the relative probability of remaining unemployed diminished with time following arrival in Australia.

Methods for Categorical Longitudinal Survey Data: Understanding Employment Status of Australian Women

Relevance:

40.00%

Abstract:

Many variables that are of interest in social science research are nominal variables with two or more categories, such as employment status, occupation, political preference, or self-reported health status. With longitudinal survey data it is possible to analyse, for example, the transitions of individuals between different employment states or occupations. In the statistical literature, models for analysing categorical dependent variables with repeated observations belong to the family of models known as generalized linear mixed models (GLMMs). The specific GLMM for a dependent variable with three or more categories is the multinomial logit random effects model. For these models, the marginal distribution of the response does not have a closed form solution and hence numerical integration must be used to obtain maximum likelihood estimates for the model parameters. Techniques for implementing the numerical integration are available, but they are computationally intensive, requiring a large amount of computer processing time that increases with the number of clusters (or individuals) in the data, and they are not always readily accessible to the practitioner in standard software. For the purposes of analysing categorical response data from a longitudinal social survey, there is clearly a need to evaluate the existing procedures for estimating multinomial logit random effects models in terms of accuracy, efficiency and computing time. The computational time will have significant implications for the approach preferred by researchers. In this paper we evaluate statistical software procedures that utilise adaptive Gaussian quadrature and MCMC methods, with specific application to modelling the employment status of women using a GLMM, over three waves of the HILDA survey.

Special issue on advances in data mining and its applications

Relevance:

30.00%

Abstract:

Data mining is the process of identifying valid, implicit, previously unknown, potentially useful and understandable information in large databases. It is an important step in the process of knowledge discovery in databases (Olaru & Wehenkel, 1999). In a data mining process, input data can be structured, semi-structured, or unstructured. Data can be text, categorical or numerical values. One of the important characteristics of data mining is its ability to deal with data that are large in volume, distributed, time-variant, noisy, and high-dimensional. A large number of data mining algorithms have been developed for different applications. For example, association rules mining can be useful for market basket problems, clustering algorithms can be used to discover trends in unsupervised learning problems, classification algorithms can be applied in decision-making problems, and sequential and time series mining algorithms can be used in predicting events, fault detection, and other supervised learning problems (Vapnik, 1999). Classification is among the most important tasks in data mining, particularly for data mining applications in engineering fields. Together with regression, classification is mainly used for predictive modelling. So far, a number of classification algorithms have been used in practice. According to Sebastiani (2002), the main classification algorithms can be categorized as: decision tree and rule-based approaches such as C4.5 (Quinlan, 1996); probability methods such as the Bayesian classifier (Lewis, 1998); on-line methods such as Winnow (Littlestone, 1988) and CVFDT (Hulten, 2001); neural network methods (Rumelhart, Hinton & Williams, 1986); example-based methods such as k-nearest neighbors (Duda & Hart, 1973); and SVM (Cortes & Vapnik, 1995). Other important techniques for classification tasks include Associative Classification (Liu et al., 1998) and Ensemble Classification (Tumer, 1996).

Repeated occurrence of basal cell carcinoma of the skin and multifailure survival analysis: Follow-up data from the Nambour Skin Cancer Prevention Trial

Relevance:

30.00%

Abstract:

The aim of this study was to apply multifailure survival methods to analyze time to multiple occurrences of basal cell carcinoma (BCC). Data from 4.5 years of follow-up in a randomized controlled trial, the Nambour Skin Cancer Prevention Trial (1992-1996), to evaluate skin cancer prevention were used to assess the influence of sunscreen application on the time to first BCC and the time to subsequent BCCs. Three different approaches of time to ordered multiple events were applied and compared: the Andersen-Gill, Wei-Lin-Weissfeld, and Prentice-Williams-Peterson models. Robust variance estimation approaches were used for all multifailure survival models. Sunscreen treatment was not associated with time to first occurrence of a BCC (hazard ratio = 1.04, 95% confidence interval: 0.79, 1.45). Time to subsequent BCC tumors using the Andersen-Gill model resulted in a lower estimated hazard among the daily sunscreen application group, although statistical significance was not reached (hazard ratio = 0.82, 95% confidence interval: 0.59, 1.15). Similarly, both the Wei-Lin-Weissfeld marginal-hazards and the Prentice-Williams-Peterson gap-time models revealed trends toward a lower risk of subsequent BCC tumors among the sunscreen intervention group. These results demonstrate the importance of conducting multiple-event analysis for recurring events, as risk factors for a single event may differ from those where repeated events are considered.

Migrating eprints.org data to a Fez repository

Relevance:

20.00%

Abstract:

This document records the process of migrating eprints.org data to a Fez repository. Fez is a Web-based digital repository and workflow management system based on Fedora (http://www.fedora.info/). At the time of migration, the University of Queensland Library was using EPrints 2.2.1 [pepper] for its ePrintsUQ repository. Once we began to develop Fez, we did not upgrade to later versions of eprints.org software since we knew we would be migrating data from ePrintsUQ to the Fez-based UQ eSpace. Since this document records our experiences of migration from an earlier version of eprints.org, anyone seeking to migrate eprints.org data into a Fez repository might encounter some small differences. Moving UQ publication data from an eprints.org repository into a Fez repository (hereafter called UQ eSpace, http://espace.uq.edu.au/) was part of a plan to integrate metadata (and, in some cases, full texts) about all UQ research outputs, including theses, images, multimedia and datasets, in a single repository. This tied in with the plan to identify and capture the research output of a single institution, the main task of the eScholarshipUQ testbed for the Australian Partnership for Sustainable Repositories project (http://www.apsr.edu.au/). The migration could not occur at UQ until the functionality in Fez was at least equal to that of the existing ePrintsUQ repository. Accordingly, as Fez development occurred throughout 2006, a list of eprints.org functionality not currently supported in Fez was created so that programming of such development could be planned for and implemented.

Test–retest repeatability of self-reported environmental exposures in Parkinson’s disease cases and healthy controls

Relevance:

20.00%

Abstract:

There is substantial disagreement among published epidemiological studies regarding environmental risk factors for Parkinson’s disease (PD). Differences in the quality of measurement of environmental exposures may contribute to this variation. The current study examined the test–retest repeatability of self-report data on risk factors for PD obtained from a series of 32 PD cases recruited from neurology clinics and 29 healthy sex-, age- and residential suburb-matched controls. Exposure data were collected in face-to-face interviews using a structured questionnaire derived from previous epidemiological studies. High repeatability was demonstrated for ‘lifestyle’ exposures, such as smoking and coffee/tea consumption (kappas 0.70–1.00). Environmental exposures that involved some action by the person, such as pesticide application and use of solvents and metals, also showed high repeatability (kappas > 0.78). Lower repeatability was seen for rural residency and bore water consumption (kappas 0.39–0.74). In general, we found that case and control participants provided similar rates of incongruent and missing responses for categorical and continuous occupational, domestic, lifestyle and medical exposures.
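
The kappa values quoted in this abstract are chance-corrected agreement statistics. The sketch below is an illustration only, not the authors' analysis: it assumes an unweighted Cohen's kappa for a single yes/no exposure item, and the function name `cohen_kappa` and the ten pairs of responses are invented for the example.

```python
from collections import Counter


def cohen_kappa(first, second):
    """Unweighted Cohen's kappa for two sets of ratings of the same subjects."""
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(first)
    # Observed agreement: share of subjects giving the same answer twice.
    observed = sum(a == b for a, b in zip(first, second)) / n
    # Chance agreement: product of the marginal proportions, summed over categories.
    counts_first, counts_second = Counter(first), Counter(second)
    expected = sum(counts_first[c] * counts_second[c] for c in counts_first) / (n * n)
    if expected == 1.0:
        return 1.0  # every answer falls in one category on both occasions
    return (observed - expected) / (1.0 - expected)


# Hypothetical responses to one exposure question at interview and re-interview.
first_interview = ["yes", "yes", "no", "no", "yes", "no", "no", "yes", "no", "no"]
second_interview = ["yes", "yes", "no", "no", "no", "no", "no", "yes", "no", "no"]

kappa = cohen_kappa(first_interview, second_interview)
print(f"kappa = {kappa:.2f}")
```

In this made-up example nine of ten answers agree (observed agreement 0.90), while the marginal totals alone would give 0.54 agreement by chance, so kappa is (0.90 - 0.54) / (1 - 0.54).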

A simple data logger for student-designed rocket experiments.

Relevance:

20.00%

Abstract:

The final-year project for Mechanical & Space Engineering students at UQ often involves the design and flight testing of an experiment. This report describes the design and use of a simple data logger that should be suitable for collecting data from the students' flight experiments. The exercise here was taken as far as the construction of a prototype device that is suitable for ground-based testing, say, the static firing of a hybrid rocket motor.

Equational Reasoning as a Tool for Data Analysis

Relevance:

20.00%

Abstract:

A combination of deductive reasoning, clustering, and inductive learning is given as an example of a hybrid system for exploratory data analysis. Visualization is replaced by a dialogue with the data.

Leadership Attributes and Cultural Values in Australia and New Zealand Compared: An Initial Report Based on GLOBE Data.

Relevance:

20.00%

Abstract:

This paper reports a comparative study of Australian and New Zealand leadership attributes, based on the GLOBE (Global Leadership and Organizational Behavior Effectiveness) program. Responses from 344 Australian managers and 184 New Zealand managers in three industries were analyzed using exploratory and confirmatory factor analysis. Results supported some of the etic leadership dimensions identified in the GLOBE study, but also found some emic dimensions of leadership for each country. An interesting finding of the study was that the New Zealand data fitted the Australian model, but not vice versa, suggesting asymmetric perceptions of leadership in the two countries.

Selection bias in gene extraction on the basis of microarray gene-expression data

Relevance:

20.00%

Abstract:

In the context of cancer diagnosis and treatment, we consider the problem of constructing an accurate prediction rule on the basis of a relatively small number of tumor tissue samples of known type containing the expression data on very many (possibly thousands of) genes. Recently, results have been presented in the literature suggesting that it is possible to construct a prediction rule from only a few genes such that it has a negligible prediction error rate. However, in these results the test error or the leave-one-out cross-validated error is calculated without allowance for the selection bias. There is no allowance either because the rule is tested on tissue samples that were used in the first instance to select the genes being used in the rule or because the cross-validation of the rule is not external to the selection process; that is, gene selection is not performed in training the rule at each stage of the cross-validation process. We describe how in practice the selection bias can be assessed and corrected for by either performing a cross-validation or applying the bootstrap external to the selection process. We recommend using 10-fold rather than leave-one-out cross-validation, and concerning the bootstrap, we suggest using the so-called .632+ bootstrap error estimate designed to handle overfitted prediction rules. Using two published data sets, we demonstrate that when correction is made for the selection bias, the cross-validated error is no longer zero for a subset of only a few genes.
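
The distinction between cross-validation that is internal and external to gene selection can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the authors' procedure: the data are pure simulated noise, genes are ranked by the absolute difference in class means, the classifier is a nearest-centroid rule, and all names (`top_genes`, `centroid_predict`, `cv_error`) are hypothetical.

```python
import random


def mean(values):
    return sum(values) / len(values)


def top_genes(X, y, rows, k):
    """Pick the k genes with the largest class-mean difference, using only `rows`."""
    def score(j):
        class0 = [X[i][j] for i in rows if y[i] == 0]
        class1 = [X[i][j] for i in rows if y[i] == 1]
        return abs(mean(class0) - mean(class1))
    return sorted(range(len(X[0])), key=score, reverse=True)[:k]


def centroid_predict(X, y, train, genes, i):
    """Nearest-centroid prediction for sample i, trained on the `train` rows."""
    def squared_distance(label):
        rows = [r for r in train if y[r] == label]
        return sum((X[i][j] - mean([X[r][j] for r in rows])) ** 2 for j in genes)
    return 0 if squared_distance(0) <= squared_distance(1) else 1


def cv_error(X, y, folds, k, external):
    """Cross-validated error; `external=True` redoes gene selection in every fold."""
    all_rows = list(range(len(y)))
    # Biased variant: genes are chosen once, using the samples later held out.
    fixed = None if external else top_genes(X, y, all_rows, k)
    wrong = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in all_rows if i not in held_out]
        genes = top_genes(X, y, train, k) if external else fixed
        wrong += sum(centroid_predict(X, y, train, genes, i) != y[i] for i in fold)
    return wrong / len(y)


# Simulated noise: 40 samples, 500 genes, labels unrelated to expression.
rng = random.Random(0)
n_samples, n_genes, k = 40, 500, 10
X = [[rng.gauss(0.0, 1.0) for _ in range(n_genes)] for _ in range(n_samples)]
y = [0] * 20 + [1] * 20
folds = [list(range(f, n_samples, 10)) for f in range(10)]  # stratified 10-fold

internal = cv_error(X, y, folds, k, external=False)
external = cv_error(X, y, folds, k, external=True)
print(f"selection outside the CV loop (biased): {internal:.2f}")
print(f"selection inside each fold (external):  {external:.2f}")
```

Because the labels here carry no signal, an honest error estimate should sit near chance; the variant that selects genes before cross-validating tends to report a noticeably lower error, which is the selection bias the abstract describes.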

A framework for electricity price spike analysis with advanced data mining methods

Relevance:

20.00%

Abstract:

There are many techniques for electricity market price forecasting. However, most of them are designed for expected price analysis rather than price spike forecasting. An effective method of predicting the occurrence of spikes has not yet been reported in the literature. In this paper, a data mining based approach is presented to give a reliable forecast of the occurrence of price spikes. Combined with the spike value prediction techniques developed by the same authors, the proposed approach aims at providing a comprehensive tool for price spike forecasting. Feature selection techniques are first described to identify the attributes relevant to the occurrence of spikes. A brief introduction to the classification techniques is given for completeness. Two algorithms, the support vector machine and the probability classifier, are chosen as the spike occurrence predictors and are discussed in detail. Realistic market data are used to test the proposed model, with promising results.
