911 results for Statistical Language Model
Abstract:
Prediction of random effects is an important problem with expanding applications. In the simplest context, the problem corresponds to prediction of the latent value (the mean) of a realized cluster selected via two-stage sampling. Recently, Stanek and Singer [Predicting random effects from finite population clustered samples with response error. J. Amer. Statist. Assoc. 99, 119-130] developed best linear unbiased predictors (BLUP) under a finite population mixed model that outperform BLUPs from mixed models and superpopulation models. Their setup, however, does not allow for unequally sized clusters. To overcome this drawback, we consider an expanded finite population mixed model based on a larger set of random variables that span a higher dimensional space than those typically applied to such problems. We show that BLUPs for linear combinations of the realized cluster means derived under such a model have considerably smaller mean squared error (MSE) than those obtained from mixed models, superpopulation models, and finite population mixed models. We motivate our general approach by an example developed for two-stage cluster sampling and show that it faithfully captures the stochastic aspects of sampling in the problem. We also consider simulation studies to illustrate the increased accuracy of the BLUP obtained under the expanded finite population mixed model. (C) 2007 Elsevier B.V. All rights reserved.
Abstract:
In clinical trials, it may be of interest to take into account physical and emotional well-being in addition to survival when comparing treatments. Quality-adjusted survival time has the advantage of incorporating information about both survival time and quality of life. In this paper, we discuss the estimation of the expected value of the quality-adjusted survival, based on multistate models for the sojourn times in health states. Semiparametric and parametric (with exponential distribution) approaches are considered. A simulation study is presented to evaluate the performance of the proposed estimator, and the jackknife resampling method is used to compute the bias and variance of the estimator. (C) 2007 Elsevier B.V. All rights reserved.
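As a rough illustration of the jackknife step mentioned in this abstract, the sketch below computes leave-one-out estimates of bias and variance for a generic estimator. It is not the authors' quality-adjusted survival estimator: the function name `jackknife`, the sample values, and the use of the plug-in variance as the estimator are all assumptions made for illustration.

```python
import statistics

def jackknife(data, estimator):
    """Return jackknife (bias, variance) estimates for `estimator` on the list `data`."""
    n = len(data)
    theta_hat = estimator(data)
    # Leave-one-out replicates: recompute the estimator with each observation removed.
    loo = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    loo_mean = sum(loo) / n
    bias = (n - 1) * (loo_mean - theta_hat)
    variance = (n - 1) / n * sum((t - loo_mean) ** 2 for t in loo)
    return bias, variance

# Hypothetical sample (made-up numbers, illustration only).
sample = [2.1, 3.4, 1.9, 5.0, 4.2, 2.8]
bias, var = jackknife(sample, statistics.pvariance)
print(f"jackknife bias: {bias:.4f}, jackknife variance: {var:.4f}")
```

For the sample mean, this construction gives a bias estimate of zero and a variance estimate equal to the usual s²/n, which is a convenient sanity check.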
Abstract:
We discuss the connection between information and copula theories by showing that a copula can be employed to decompose the information content of a multivariate distribution into marginal and dependence components, with the latter quantified by the mutual information. We define the information excess as a measure of deviation from a maximum-entropy distribution. The idea of marginal invariant dependence measures is also discussed and used to show that empirical linear correlation underestimates the amplitude of the actual correlation in the case of non-Gaussian marginals. The mutual information is shown to provide an upper bound for the asymptotic empirical log-likelihood of a copula. An analytical expression for the information excess of T-copulas is provided, allowing for simple model identification within this family. We illustrate the framework on a financial data set. Copyright (C) EPLA, 2009
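The decomposition described in this abstract can be written out as follows. This is the standard identity for continuous distributions, not a formula quoted from the paper, and the symbols (f, F_i, f_i, c, H, I) are chosen here for illustration.

```latex
% Sklar's theorem for a continuous random vector X = (X_1, ..., X_n)
% with joint density f, marginal densities f_i, marginal CDFs F_i,
% and copula density c on [0,1]^n:
f(x_1,\dots,x_n) = c\bigl(F_1(x_1),\dots,F_n(x_n)\bigr)\prod_{i=1}^{n} f_i(x_i)

% Taking -E[\log f] splits the joint entropy into marginal and dependence parts:
H(X) = \sum_{i=1}^{n} H(X_i) + H(c),
\qquad
H(c) = -\int_{[0,1]^n} c(u)\,\log c(u)\,\mathrm{d}u \;\le\; 0

% The dependence part is minus the mutual information:
I(X) = \sum_{i=1}^{n} H(X_i) - H(X) = -H(c) = \int_{[0,1]^n} c(u)\,\log c(u)\,\mathrm{d}u
```

Because I(X) depends only on c, it is unchanged by strictly increasing transformations of the individual marginals, which is the sense in which a dependence measure can be marginal invariant.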
Abstract:
We analyse the finite-sample behaviour of two second-order bias-corrected alternatives to the maximum-likelihood estimator of the parameters in a multivariate normal regression model with general parametrization proposed by Patriota and Lemonte [A. G. Patriota and A. J. Lemonte, Bias correction in a multivariate regression model with general parameterization, Stat. Prob. Lett. 79 (2009), pp. 1655-1662]. The two finite-sample corrections we consider are the conventional second-order bias-corrected estimator and the bootstrap bias correction. We present numerical results comparing the performance of these estimators. Our results reveal that analytical bias correction outperforms numerical bias corrections obtained from bootstrapping schemes.
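To make the idea of a bootstrap bias correction concrete, here is a minimal sketch on a toy problem. It assumes a simple nonparametric bootstrap and uses the plug-in (maximum-likelihood) variance of a univariate sample as the estimator; the multivariate regression model of the abstract is not reproduced, and the function name, sample values, and number of replicates are illustrative choices.

```python
import random
import statistics

def bootstrap_bias_corrected(data, estimator, n_boot=2000, seed=0):
    """Return a bootstrap bias-corrected estimate: theta_hat - (mean(theta*) - theta_hat)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    theta_hat = estimator(data)
    # Resample the data with replacement and re-evaluate the estimator each time.
    reps = [estimator([rng.choice(data) for _ in range(n)]) for _ in range(n_boot)]
    bias = sum(reps) / n_boot - theta_hat
    return theta_hat - bias

# Hypothetical sample (made-up numbers, illustration only).
sample = [4.8, 5.1, 3.9, 6.2, 5.5, 4.4, 5.0, 6.0]
mle_var = statistics.pvariance(sample)  # biased downward by the factor (n - 1) / n
corrected = bootstrap_bias_corrected(sample, statistics.pvariance)
print(f"plug-in variance: {mle_var:.4f}, bootstrap-corrected: {corrected:.4f}")
```

An analytical correction for this toy case would simply rescale by n / (n - 1); the bootstrap version approximates the same adjustment numerically, at the cost of Monte Carlo noise.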
Abstract:
This thesis takes as its starting point a questioning of the effectiveness of the EU's conditionality policy regarding minority rights. Based on the rationalist theoretical model, the External Incentives Model of Governance, this hypothesis-testing thesis aims to explain whether the temporal distance to potential EU membership affects the level of legislation on minority language rights. The measurement of the level of legislation on minority language rights is limited to non-discrimination, the use of minority languages in official contexts, and minorities' linguistic rights in education. Methodologically, a comparative approach is used both with respect to the time frame of the study, which extends from 2003 to 2010, and with respect to the selection of states. On the basis of the "most similar systems" design, the states are categorized into three groups according to their different temporal distances from potential EU membership. The hypothesis tested is the following: the shorter the temporal distance to potential EU membership, the greater the likelihood that the states' level of legislation in the three areas studied has developed to a high level. The study shows that the hypothesis is only partially confirmed. The results for non-discrimination show that the association between temporal distance and the level of legislation increased markedly during the period studied. This association has strengthened only between the category of states furthest in time from potential EU membership and the two categories that are closer and closest, respectively, to potential EU membership. The results for the use of minority languages in official contexts and minorities' linguistic rights in education show no association and almost no association, respectively, between temporal distance and the development of legislation between 2003 and 2010.
Abstract:
In holistic theories of protolanguage, a vital step is the fractionation process where holistic utterances are broken down into segments, and segments associated with semantic components. One problem for this process may be the occurrence of counterexamples to any segment-meaning connection. The actual abundance of such counterexamples is a contentious issue \cite{smith06,taller07}. Here I present calculations of the prevalence of counterexamples in model languages. It is found that counterexamples are indeed abundant, much more numerous than positive examples for any plausible holistic language.
Abstract:
This thesis develops and evaluates statistical methods for different types of genetic analyses, including quantitative trait loci (QTL) analysis, genome-wide association study (GWAS), and genomic evaluation. The main contribution of the thesis is to provide novel insights in modeling genetic variance, especially via random effects models. In variance component QTL analysis, a full likelihood model accounting for uncertainty in the identity-by-descent (IBD) matrix was developed. It was found to be able to correctly adjust the bias in genetic variance component estimation and gain power in QTL mapping in terms of precision. Double hierarchical generalized linear models, and a non-iterative simplified version, were implemented and applied to fit data of an entire genome. These whole genome models were shown to have good performance in both QTL mapping and genomic prediction. A re-analysis of a publicly available GWAS data set identified significant loci in Arabidopsis that control phenotypic variance instead of mean, which validated the idea of variance-controlling genes. The works in the thesis are accompanied by R packages available online, including a general statistical tool for fitting random effects models (hglm), an efficient generalized ridge regression for high-dimensional data (bigRR), a double-layer mixed model for genomic data analysis (iQTL), a stochastic IBD matrix calculator (MCIBD), a computational interface for QTL mapping (qtl.outbred), and a GWAS analysis tool for mapping variance-controlling loci (vGWAS).
Abstract:
A number of recent works have introduced statistical methods for detecting genetic loci that affect phenotypic variability, which we refer to as variability-controlling quantitative trait loci (vQTL). These are genetic variants whose allelic state predicts how much phenotype values will vary about their expected means. Such loci are of great potential interest in both human and non-human genetic studies, one reason being that a detected vQTL could represent a previously undetected interaction with other genes or environmental factors. The simultaneous publication of these new methods in different journals has in many cases precluded opportunity for comparison. We survey some of these methods, the respective trade-offs they imply, and the connections between them. The methods fall into three main groups: classical non-parametric, fully parametric, and semi-parametric two-stage approximations. Choosing between alternatives involves balancing the need for robustness, flexibility, and speed. For each method, we identify important assumptions and limitations, including those of practical importance, such as their scope for including covariates and random effects. We show in simulations that both parametric methods and their semi-parametric approximations can give elevated false positive rates when they ignore mean-variance relationships intrinsic to the data generation process. We conclude that choice of method depends on the trait distribution, the need to include non-genetic covariates, and the population size and structure, coupled with a critical evaluation of how these fit with the assumptions of the statistical model.
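As one concrete example of a classical test for unequal variances across genotype groups, the sketch below computes the Brown-Forsythe statistic (Levene's test with median centering). Whether this particular test is among the methods surveyed is an assumption; the function name, genotype labels, and phenotype values are hypothetical. Only the test statistic is computed, since a p-value would require the F distribution from a numerical library.

```python
import statistics

def brown_forsythe_statistic(groups):
    """Return the Brown-Forsythe W statistic for a list of groups (lists of floats)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviations of each observation from its own group median.
    z = [[abs(y - statistics.median(g)) for y in g] for g in groups]
    z_means = [sum(zi) / len(zi) for zi in z]
    z_grand = sum(sum(zi) for zi in z) / n_total
    between = sum(len(zi) * (m - z_grand) ** 2 for zi, m in zip(z, z_means))
    within = sum((v - m) ** 2 for zi, m in zip(z, z_means) for v in zi)
    return (n_total - k) / (k - 1) * between / within

# Hypothetical phenotype values grouped by genotype at one locus (illustration only).
phenotype_by_genotype = {
    "AA": [10.1, 9.8, 10.3, 10.0, 9.9],
    "Aa": [10.4, 9.1, 11.0, 9.5, 10.6],
    "aa": [12.5, 7.9, 11.8, 8.2, 10.0],
}
w = brown_forsythe_statistic(list(phenotype_by_genotype.values()))
print(f"Brown-Forsythe W = {w:.3f}")
```

Under the null hypothesis of equal variances, W is compared against an F distribution with (k - 1, N - k) degrees of freedom. Groups that differ only by a shift in location give W = 0.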
Abstract:
Renewable energy production is an essential supplement that helps to stabilize rapidly increasing global energy demand and skyrocketing energy prices, and to balance fluctuations in supply from non-renewable energy sources at electrical grid hubs. European energy traders, government and private energy providers, and other stakeholders have recently become major beneficiaries and customers of hydropower simulation solutions. The relationship between rainfall-runoff model outputs and the energy production of hydropower plants has not been clearly studied. In this research, the association of rainfall, catchment characteristics, river network and runoff with the energy production of a particular hydropower station is examined. The aims of this study are to justify the correspondence between runoff extracted from a calibrated catchment and the energy production of a hydropower plant located at the catchment outlet; to employ a unique technique to convert runoff to energy based on statistical and graphical trend analysis of the two; and to provide an environment for energy forecasting. The MIKE 11 NAM model is applied for rainfall-runoff model setup and calibration, while the MIKE 11 SO model is used to track, adopt and set a control strategy at the hydropower location for runoff-energy correlation. The model is tested at two selected micro run-of-river hydropower plants located in southern Germany. Two consecutive calibrations are carried out to test the model: one for the rainfall-runoff model and the other for the energy simulation. Calibration results and supporting verification plots for the two case studies indicate that simulated discharge and energy production are comparable with the measured discharge and energy production, respectively.
Abstract:
As the service life of water supply networks (WSN) grows, the aging of pipe networks has become an increasingly serious problem. Because an urban water supply network is a hidden underground asset, it is difficult for monitoring staff to classify pipe network faults directly by means of modern detection technology. In this paper, based on the basic property data of the water supply network (e.g. diameter, material, pressure, distance to pump, distance to tank, load, etc.), the decision tree algorithm C4.5 is used to classify the condition of water supply pipelines. Part of the historical data was used to build a decision tree classification model, and the remaining historical data was used to validate it. Statistical methods were used to assess the decision tree model, including basic statistical methods, Receiver Operating Characteristic (ROC) curves and Recall-Precision Curves (RPC). These methods were successfully used to assess the accuracy of the classification model. The purpose of the classification model is to classify the condition of the water pipe network. It is important to maintain the pipelines according to the classification results, which comprise asset unserviceable (AU), near perfect condition (NPC) and serious deterioration (SD). Finally, this research focuses on pipe classification, which plays a significant role in maintaining water supply networks in the future.
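The evaluation step described in this abstract can be sketched as follows for a single class treated one-vs-rest. The labels below are made up and do not come from the paper; the function name is hypothetical. The sketch yields one point each on an ROC curve (true positive rate vs. false positive rate) and a recall-precision curve; tracing full curves would additionally require per-pipe class scores swept over a threshold.

```python
def one_vs_rest_metrics(y_true, y_pred, positive):
    """Return precision, recall (true positive rate) and false positive rate for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return precision, recall, fpr

# Hypothetical held-out labels and decision tree predictions (illustration only).
y_true = ["AU", "NPC", "SD", "SD", "NPC", "AU", "SD", "NPC"]
y_pred = ["AU", "NPC", "SD", "NPC", "NPC", "SD", "SD", "NPC"]

precision, recall, fpr = one_vs_rest_metrics(y_true, y_pred, positive="SD")
print(f"SD: precision={precision:.3f} recall={recall:.3f} fpr={fpr:.3f}")
```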
Abstract:
The US term structure of interest rates plays a central role in fixed-income analysis. For example, accurately estimating the US term structure is a crucial step for those interested in analyzing Brazilian Brady bonds such as IDUs, DCBs, FLIRBs, EIs, etc. In this work we present a statistical model to estimate the US term structure of interest rates. We address in this report all major issues that guided us in the process of implementing the model, concentrating on important practical issues such as computational efficiency, robustness of the final implementation, the statistical properties of the final model, etc. Numerical examples are provided in order to illustrate the use of the model on a daily basis.
Abstract:
Atypical points in the data may result in meaningless efficient frontiers. This follows since portfolios constructed using classical estimates may reflect the patterns of neither the usual nor the unusual days. On the other hand, portfolios constructed using robust approaches are able to capture just the dynamics of the usual days, which constitute the majority of the business days. In this paper we propose a statistical model and a robust estimation procedure to obtain an efficient frontier which would take into account the behavior of both the usual and most of the atypical days. We show, using real data and simulations, that portfolios constructed in this way require less frequent rebalancing, and may yield higher expected returns for any risk level.
Abstract:
This paper uses an output-oriented Data Envelopment Analysis (DEA) measure of technical efficiency to assess the technical efficiencies of the Brazilian banking system. Four approaches to estimation are compared in order to assess the significance of factors affecting inefficiency. These are nonparametric Analysis of Covariance, maximum likelihood using a family of exponential distributions, maximum likelihood using a family of truncated normal distributions, and the normal Tobit model. The sole focus of the paper is on a combined measure of output, and the data analyzed refer to the year 2001. The factors of interest in the analysis and likely to affect efficiency are bank nature (multiple and commercial), bank type (credit, business, bursary and retail), bank size (large, medium, small and micro), bank control (private and public), bank origin (domestic and foreign), and non-performing loans. The latter is a measure of bank risk. All quantitative variables, including non-performing loans, are measured on a per-employee basis. The best fits to the data are provided by the exponential family and the nonparametric Analysis of Covariance. The significance of a factor, however, varies according to the model fit, although there is some agreement between the best models. A highly significant association in all models fitted is observed only for non-performing loans. The nonparametric Analysis of Covariance is more consistent with the inefficiency median responses observed for the qualitative factors. The findings of the analysis reinforce the significant association of the level of bank inefficiency, measured by DEA residuals, with the risk of bank failure.
Abstract:
The objective of this study is to propose the implementation of a statistical model for computing volatility that is not widespread in the Brazilian literature, the local scale model (LSM), presenting its advantages and disadvantages relative to the models commonly used for risk measurement. The parameters are estimated using daily quotes of the Ibovespa from January 2009 to December 2014, and the empirical accuracy of the models is assessed with out-of-sample tests comparing the VaR obtained for the period from January to December 2014. Explanatory variables were introduced in an attempt to improve the models, and the American counterpart of the Ibovespa, the Dow Jones index, was chosen because it exhibited properties such as high correlation, Granger causality, and a significant log-likelihood ratio. One of the innovations of the local scale model is that it does not use the variance directly but rather its reciprocal, called the "precision" of the series, which follows a kind of multiplicative random walk. The LSM captured all the stylized facts of financial series, and the results favored its use; the model therefore becomes an efficient and parsimonious alternative specification for estimating and forecasting volatility, in that it has only one parameter to be estimated, which represents a paradigm shift relative to conditional heteroskedasticity models.
Abstract:
This study aims to contribute to the literature on forecasting stock returns in emerging markets. We use Autometrics to select relevant predictors among macroeconomic, microeconomic and technical variables. We develop predictive models for the Brazilian market premium, measured as the excess return over the Selic interest rate, and for Itaú SA, Itaú-Unibanco and Bradesco stock returns. We find that for the market premium, an ADL with error correction is able to outperform the benchmarks in terms of economic performance. For individual stock returns, there is a trade-off between statistical properties and out-of-sample performance of the model.