933 resultados para Data Models


Relevância:

100.00% 100.00%

Publicador:

Resumo:

Here we present a sequential Monte Carlo approach to Bayesian sequential design for the incorporation of model uncertainty. The methodology is demonstrated through the development and implementation of two model discrimination utilities; mutual information and total separation, but it can also be applied more generally if one has different experimental aims. A sequential Monte Carlo algorithm is run for each rival model (in parallel), and provides a convenient estimate of the marginal likelihood (of each model) given the data, which can be used for model comparison and in the evaluation of utility functions. A major benefit of this approach is that it requires very little problem specific tuning and is also computationally efficient when compared to full Markov chain Monte Carlo approaches. This research is motivated by applications in drug development and chemical engineering.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Le but de cette thèse est d étendre la théorie du bootstrap aux modèles de données de panel. Les données de panel s obtiennent en observant plusieurs unités statistiques sur plusieurs périodes de temps. Leur double dimension individuelle et temporelle permet de contrôler l 'hétérogénéité non observable entre individus et entre les périodes de temps et donc de faire des études plus riches que les séries chronologiques ou les données en coupe instantanée. L 'avantage du bootstrap est de permettre d obtenir une inférence plus précise que celle avec la théorie asymptotique classique ou une inférence impossible en cas de paramètre de nuisance. La méthode consiste à tirer des échantillons aléatoires qui ressemblent le plus possible à l échantillon d analyse. L 'objet statitstique d intérêt est estimé sur chacun de ses échantillons aléatoires et on utilise l ensemble des valeurs estimées pour faire de l inférence. Il existe dans la littérature certaines application du bootstrap aux données de panels sans justi cation théorique rigoureuse ou sous de fortes hypothèses. Cette thèse propose une méthode de bootstrap plus appropriée aux données de panels. Les trois chapitres analysent sa validité et son application. Le premier chapitre postule un modèle simple avec un seul paramètre et s 'attaque aux propriétés théoriques de l estimateur de la moyenne. Nous montrons que le double rééchantillonnage que nous proposons et qui tient compte à la fois de la dimension individuelle et la dimension temporelle est valide avec ces modèles. Le rééchantillonnage seulement dans la dimension individuelle n est pas valide en présence d hétérogénéité temporelle. Le ré-échantillonnage dans la dimension temporelle n est pas valide en présence d'hétérogénéité individuelle. Le deuxième chapitre étend le précédent au modèle panel de régression. linéaire. Trois types de régresseurs sont considérés : les caractéristiques individuelles, les caractéristiques temporelles et les régresseurs qui évoluent dans le temps et par individu. En utilisant un modèle à erreurs composées doubles, l'estimateur des moindres carrés ordinaires et la méthode de bootstrap des résidus, on montre que le rééchantillonnage dans la seule dimension individuelle est valide pour l'inférence sur les coe¢ cients associés aux régresseurs qui changent uniquement par individu. Le rééchantillonnage dans la dimen- sion temporelle est valide seulement pour le sous vecteur des paramètres associés aux régresseurs qui évoluent uniquement dans le temps. Le double rééchantillonnage est quand à lui est valide pour faire de l inférence pour tout le vecteur des paramètres. Le troisième chapitre re-examine l exercice de l estimateur de différence en di¤érence de Bertrand, Duflo et Mullainathan (2004). Cet estimateur est couramment utilisé dans la littérature pour évaluer l impact de certaines poli- tiques publiques. L exercice empirique utilise des données de panel provenant du Current Population Survey sur le salaire des femmes dans les 50 états des Etats-Unis d Amérique de 1979 à 1999. Des variables de pseudo-interventions publiques au niveau des états sont générées et on s attend à ce que les tests arrivent à la conclusion qu il n y a pas d e¤et de ces politiques placebos sur le salaire des femmes. Bertrand, Du o et Mullainathan (2004) montre que la non-prise en compte de l hétérogénéité et de la dépendance temporelle entraîne d importantes distorsions de niveau de test lorsqu'on évalue l'impact de politiques publiques en utilisant des données de panel. Une des solutions préconisées est d utiliser la méthode de bootstrap. La méthode de double ré-échantillonnage développée dans cette thèse permet de corriger le problème de niveau de test et donc d'évaluer correctement l'impact des politiques publiques.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

In many applications the observed data can be viewed as a censored high dimensional full data random variable X. By the curve of dimensionality it is typically not possible to construct estimators that are asymptotically efficient at every probability distribution in a semiparametric censored data model of such a high dimensional censored data structure. We provide a general method for construction of one-step estimators that are efficient at a chosen submodel of the full-data model, are still well behaved off this submodel and can be chosen to always improve on a given initial estimator. These one-step estimators rely on good estimators of the censoring mechanism and thus will require a parametric or semiparametric model for the censoring mechanism. We present a general theorem that provides a template for proving the desired asymptotic results. We illustrate the general one-step estimation methods by constructing locally efficient one-step estimators of marginal distributions and regression parameters with right-censored data, current status data and bivariate right-censored data, in all models allowing the presence of time-dependent covariates. The conditions of the asymptotics theorem are rigorously verified in one of the examples and the key condition of the general theorem is verified for all examples.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

We consider nonparametric missing data models for which the censoring mechanism satisfies coarsening at random and which allow complete observations on the variable X of interest. W show that beyond some empirical process conditions the only essential condition for efficiency of an NPMLE of the distribution of X is that the regions associated with incomplete observations on X contain enough complete observations. This is heuristically explained by describing the EM-algorithm. We provide identifiably of the self-consistency equation and efficiency of the NPMLE in order to make this statement rigorous. The usual kind of differentiability conditions in the proof are avoided by using an identity which holds for the NPMLE of linear parameters in convex models. We provide a bivariate censoring application in which the condition and hence the NPMLE fails, but where other estimators, not based on the NPMLE principle, are highly inefficient. It is shown how to slightly reduce the data so that the conditions hold for the reduced data. The conditions are verified for the univariate censoring, double censored, and Ibragimov-Has'minski models.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The schema of an information system can significantly impact the ability of end users to efficiently and effectively retrieve the information they need. Obtaining quickly the appropriate data increases the likelihood that an organization will make good decisions and respond adeptly to challenges. This research presents and validates a methodology for evaluating, ex ante, the relative desirability of alternative instantiations of a model of data. In contrast to prior research, each instantiation is based on a different formal theory. This research theorizes that the instantiation that yields the lowest weighted average query complexity for a representative sample of information requests is the most desirable instantiation for end-user queries. The theory was validated by an experiment that compared end-user performance using an instantiation of a data structure based on the relational model of data with performance using the corresponding instantiation of the data structure based on the object-relational model of data. Complexity was measured using three different Halstead metrics: program length, difficulty, and effort. For a representative sample of queries, the average complexity using each instantiation was calculated. As theorized, end users querying the instantiation with the lower average complexity made fewer semantic errors, i.e., were more effective at composing queries. (c) 2005 Elsevier B.V. All rights reserved.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Traditional vegetation mapping methods use high cost, labour-intensive aerial photography interpretation. This approach can be subjective and is limited by factors such as the extent of remnant vegetation, and the differing scale and quality of aerial photography over time. An alternative approach is proposed which integrates a data model, a statistical model and an ecological model using sophisticated Geographic Information Systems (GIS) techniques and rule-based systems to support fine-scale vegetation community modelling. This approach is based on a more realistic representation of vegetation patterns with transitional gradients from one vegetation community to another. Arbitrary, though often unrealistic, sharp boundaries can be imposed on the model by the application of statistical methods. This GIS-integrated multivariate approach is applied to the problem of vegetation mapping in the complex vegetation communities of the Innisfail Lowlands in the Wet Tropics bioregion of Northeastern Australia. The paper presents the full cycle of this vegetation modelling approach including sampling sites, variable selection, model selection, model implementation, internal model assessment, model prediction assessments, models integration of discrete vegetation community models to generate a composite pre-clearing vegetation map, independent data set model validation and model prediction's scale assessments. An accurate pre-clearing vegetation map of the Innisfail Lowlands was generated (0.83r(2)) through GIS integration of 28 separate statistical models. This modelling approach has good potential for wider application, including provision of. vital information for conservation planning and management; a scientific basis for rehabilitation of disturbed and cleared areas; a viable method for the production of adequate vegetation maps for conservation and forestry planning of poorly-studied areas. (c) 2006 Elsevier B.V. All rights reserved.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Conceptual modeling forms an important part of systems analysis. If this is done incorrectly or incompletely, there can be serious implications for the resultant system, specifically in terms of rework and useability. One approach to improving the conceptual modelling process is to evaluate how well the model represents reality. Emergence of the Bunge-Wand-Weber (BWW) ontological model introduced a platform to classify and compare the grammar of conceptual modelling languages. This work applies the BWW theory to a real world example in the health arena. The general practice computing group data model was developed using the Barker Entity Relationship Modelling technique. We describe an experiment, grounded in ontological theory, which evaluates how well the GPCG data model is understood by domain experts. The results show that with the exception of the use of entities to represent events, the raw model is better understood by domain experts

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Even when data repositories exhibit near perfect data quality, users may formulate queries that do not correspond to the information requested. Users’ poor information retrieval performance may arise from either problems understanding of the data models that represent the real world systems, or their query skills. This research focuses on users’ understanding of the data structures, i.e., their ability to map the information request and the data model. The Bunge-Wand-Weber ontology was used to formulate three sets of hypotheses. Two laboratory experiments (one using a small data model and one using a larger data model) tested the effect of ontological clarity on users’ performance when undertaking component, record, and aggregate level tasks. The results indicate for the hypotheses associated with different representations but equivalent semantics that parsimonious data model participants performed better for component level tasks but that ontologically clearer data model participants performed better for record and aggregate level tasks.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Two distinct maintenance-data-models are studied: a government Enterprise Resource Planning (ERP) maintenance-data-model, and the Software Engineering Industries (SEI) maintenance-data-model. The objective is to: (i) determine whether the SEI maintenance-data-model is sufficient in the context of ERP (by comparing with an ERP case), (ii) identify whether the ERP maintenance-data-model in this study has adequately captured the essential and common maintenance attributes (by comparing with the SEI), and (iii) proposed a new ERP maintenance-data-model as necessary. Our findings suggest that: (i) there are variations to the SEI model in an ERP-context, and (ii) there are rooms for improvements in our ERP case’s maintenance-data-model. Thus, a new ERP maintenance-data-model capturing the fundamental ERP maintenance attributes is proposed. This model is imperative for: (i) enhancing the reporting and visibility of maintenance activities, (ii) monitoring of the maintenance problems, resolutions and performance, and (iii) helping maintenance manager to better manage maintenance activities and make well-informed maintenance decisions.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

An educational priority of many nations is to enhance mathematical learning in early childhood. One area in need of special attention is that of statistics. This paper argues for a renewed focus on statistical reasoning in the beginning school years, with opportunities for children to engage in data modelling activities. Such modelling involves investigations of meaningful phenomena, deciding what is worthy of attention (i.e., identifying complex attributes), and then progressing to organising, structuring, visualising, and representing data. Results are reported from the first year of a three-year longitudinal study in which three classes of first-grade children and their teachers engaged in activities that required the creation of data models. The theme of “Looking after our Environment,” a component of the children’s science curriculum at the time, provided the context for the activities. Findings focus on how the children dealt with given complex attributes and how they generated their own attributes in classifying broad data sets, and the nature of the models the children created in organising, structuring, and representing their data.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

It is a big challenge to acquire correct user profiles for personalized text classification since users may be unsure in providing their interests. Traditional approaches to user profiling adopt machine learning (ML) to automatically discover classification knowledge from explicit user feedback in describing personal interests. However, the accuracy of ML-based methods cannot be significantly improved in many cases due to the term independence assumption and uncertainties associated with them. This paper presents a novel relevance feedback approach for personalized text classification. It basically applies data mining to discover knowledge from relevant and non-relevant text and constraints specific knowledge by reasoning rules to eliminate some conflicting information. We also developed a Dempster-Shafer (DS) approach as the means to utilise the specific knowledge to build high-quality data models for classification. The experimental results conducted on Reuters Corpus Volume 1 and TREC topics support that the proposed technique achieves encouraging performance in comparing with the state-of-the-art relevance feedback models.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

This thesis report attempts to improve the models for predicting forest stand structure for practical use, e.g. forest management planning (FMP) purposes in Finland. Comparisons were made between Weibull and Johnson s SB distribution and alternative regression estimation methods. Data used for preliminary studies was local but the final models were based on representative data. Models were validated mainly in terms of bias and RMSE in the main stand characteristics (e.g. volume) using independent data. The bivariate SBB distribution model was used to mimic realistic variations in tree dimensions by including within-diameter-class height variation. Using the traditional method, diameter distribution with the expected height resulted in reduced height variation, whereas the alternative bivariate method utilized the error-term of the height model. The lack of models for FMP was covered to some extent by the models for peatland and juvenile stands. The validation of these models showed that the more sophisticated regression estimation methods provided slightly improved accuracy. A flexible prediction and application for stand structure consisted of seemingly unrelated regression models for eight stand characteristics, the parameters of three optional distributions and Näslund s height curve. The cross-model covariance structure was used for linear prediction application, in which the expected values of the models were calibrated with the known stand characteristics. This provided a framework to validate the optional distributions and the optional set of stand characteristics. Height distribution is recommended for the earliest state of stands because of its continuous feature. From the mean height of about 4 m, Weibull dbh-frequency distribution is recommended in young stands if the input variables consist of arithmetic stand characteristics. In advanced stands, basal area-dbh distribution models are recommended. Näslund s height curve proved useful. Some efficient transformations of stand characteristics are introduced, e.g. the shape index, which combined the basal area, the stem number and the median diameter. Shape index enabled SB model for peatland stands to detect large variation in stand densities. This model also demonstrated reasonable behaviour for stands in mineral soils.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

The aim of this research, which focused on the Irish adult population, was to generate information for policymakers by applying statistical analyses and current technologies to oral health administrative and survey databases. Objectives included identifying socio-demographic influences on oral health and utilisation of dental services, comparing epidemiologically-estimated dental treatment need with treatment provided, and investigating the potential of a dental administrative database to provide information on utilisation of services and the volume and types of treatment provided over time. Information was extracted from the claims databases for the Dental Treatment Benefit Scheme (DTBS) for employed adults and the Dental Treatment Services Scheme (DTSS) for less-well-off adults, the National Surveys of Adult Oral Health, and the 2007 Survey of Lifestyle Attitudes and Nutrition in Ireland. Factors associated with utilisation and retention of natural teeth were analysed using count data models and logistic regression. The chi-square test and the student’s t-test were used to compare epidemiologically-estimated need in a representative sample of adults with treatment provided. Differences were found in dental care utilisation and tooth retention by Socio-Economic Status. An analysis of the five-year utilisation behaviour of a 2003 cohort of DTBS dental attendees revealed that age and being female were positively associated with visiting annually and number of treatments. Number of adults using the DTBS increased, and mean number of treatments per patient decreased, between 1997 and 2008. As a percentage of overall treatments, restorations, dentures, and extractions decreased, while prophylaxis increased. Differences were found between epidemiologically-estimated treatment need and treatment provided for those using the DTBS and DTSS. This research confirms the utility of survey and administrative data to generate knowledge for policymakers. Public administrative databases have not been designed for research purposes, but they have the potential to provide a wealth of knowledge on treatments provided and utilisation patterns.