963 results for Statistics and probability
On nonparametric estimators and tests for conditional distributions and copulas
Abstract:
To model a random vector in the presence of a covariate, one can first turn to the conditional distribution function, which contains all the information about the behavior of the vector given a value taken by the covariate. It can also be convenient to separate the study of the joint behavior of the vector from that of the individual behavior of each of its components. This is done with the conditional copula, which completely characterizes the conditional dependence governing the associations among the variables. In either case, implementing an estimation and inference strategy is an essential step toward practical use. When no prior information is available to guide the choice of a model, nonparametric methods become relevant. The first article of this thesis, co-written by Jean-François Quessy and myself, proposes a way of resampling nonparametric estimators of conditional distributions. This article was published in the journal Statistics and Computing. Among other things, we show how to obtain confidence intervals for statistics that can be written in terms of the conditional distribution function. The second article, co-written by Taoufik Bouezmarni, Jean-François Quessy and myself, studies two nonparametric estimators of the conditional copula, proposed by Gijbels et al., in the presence of serial data. This article was submitted to the journal Statistics and Probability Letters. We identify the asymptotic distribution of each of these estimators for mixing data. The third article, co-written by Taoufik Bouezmarni, Jean-François Quessy and myself, proposes a new way of studying causality relationships between two time series. This article was submitted to the Electronic Journal of Statistics. In it, we use the conditional copula to characterize a local version of Granger causality, and then propose causality measures based on the conditional copula. The fourth article, co-written by Taoufik Bouezmarni, Anouar El Ghouch and myself, proposes a method for adequately estimating the conditional copula in the presence of incomplete data. This article was submitted to the Scandinavian Journal of Statistics. The asymptotic properties of the proposed estimator are also studied. Finally, the last part of this thesis contains unpublished work on the implementation of statistical tests for determining whether two conditional copulas are concordant. In addition to presenting original results, this study illustrates the usefulness of the resampling techniques developed in our first article.
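As an illustration of what a nonparametric estimator of a conditional distribution function can look like, here is a minimal sketch of a kernel-weighted (Nadaraya-Watson type) estimate of P(Y <= y | X = x0). This is a generic textbook construction, not necessarily the estimator studied in the thesis; the function name, the Gaussian kernel and the `bandwidth` parameter are assumptions made for the example.

```python
import math


def conditional_cdf(xs, ys, x0, y, bandwidth):
    """Kernel-weighted estimate of P(Y <= y | X = x0).

    Each observation (x_i, y_i) receives a Gaussian kernel weight that
    decreases with the distance between x_i and x0; the estimate is the
    weighted share of observations with y_i <= y.
    """
    weights = [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x in xs]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all kernel weights underflowed; increase the bandwidth")
    return sum(w for w, yi in zip(weights, ys) if yi <= y) / total


if __name__ == "__main__":
    # Toy data (hypothetical): Y tends to be larger when X is larger.
    xs = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
    ys = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
    print(conditional_cdf(xs, ys, x0=0.1, y=3.0, bandwidth=0.2))
    print(conditional_cdf(xs, ys, x0=5.1, y=3.0, bandwidth=0.2))
```

The bandwidth controls how local the estimate is: a small value uses only observations with a covariate close to x0, while a large value approaches the unconditional empirical distribution function.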
Abstract:
Probability and Statistics—Selected Problems is a book for senior undergraduate and graduate students who want a fast review of basic material in probability and statistics. Descriptive statistics are presented first, followed by a review of probability. Discrete and continuous distributions are then presented. Sampling and estimation, together with hypothesis testing, are covered in the last two chapters. Solutions to the proposed exercises are listed for the reader's reference.
Abstract:
Diagnostic methods have been an important tool in regression analysis for detecting anomalies in fitted models, such as departures from the error assumptions and the presence of outliers and influential observations. Assuming censored data, we considered a classical analysis and a Bayesian analysis with noninformative priors for the parameters of a model with a cure fraction. The Bayesian approach used Markov chain Monte Carlo methods with Metropolis-Hastings steps to obtain the posterior summaries of interest. Some influence methods, such as local influence, total local influence of an individual, local influence on predictions and generalized leverage, were derived, analyzed and discussed for survival data with a cure fraction and covariates. The relevance of the approach was illustrated with a real data set, where it is shown that removing the most influential observations changes the decision about which model best fits the data.
Abstract:
In this paper, an alternative to the approach in Henze (1986) is proposed for deriving the odd moments of the skew-normal distribution considered in Azzalini (1985). The approach is based on a Pascal-type triangle, which seems to greatly simplify the computation of moments. Moreover, it is shown that the likelihood equation for estimating the asymmetry parameter in such a model is generated by functions orthogonal to the sample vector. As a consequence, conditions for a unique solution of the likelihood equation are established, which seem to hold in a more general setting.
Abstract:
These notes present a detailed review of existing results in dissipative kinetic theory that make use of the contraction properties of two main families of probability metrics: optimal mass transport and Fourier-based metrics. The first part of the notes is devoted to a self-consistent summary and presentation of the properties of both families of probability metrics, including new aspects of the relationships between them and other metrics widely used in probability theory. These results are of independent interest, with potential use in other contexts in partial differential equations and probability theory. The second part of the notes presents the asymptotic behavior of Inelastic Maxwell Models in a different way from the existing literature and shows a new example of application: particle's bath heating. We show how, starting from the contraction properties in probability metrics, one can deduce existence, uniqueness and asymptotic stability in classical spaces. A global strategy with this aim is set up and applied to two dissipative models.
Abstract:
OBJECTIVE: To investigate the association between handgrip strength (HS) and physical activity in physically frail elderly people. METHOD: Cross-sectional quantitative study with a sample of 203 elderly people, calculated from the estimated population proportion. Tests were applied to detect cognitive impairment and to assess physical frailty. Descriptive statistics and multivariate analysis by binary logistic regression were used, along with Student's t-test and Fisher's exact test. RESULTS: A total of 99 (64.3%) elderly people showed decreased handgrip strength and 90 (58.4%) showed decreased physical activity levels. There was a statistically significant difference between these two components (p=0.019): elderly people with decreased HS have lower levels of physical activity. For low levels of physical activity and decreased HS, there was evidence of a significant difference in the probability of being classified as frail (p<0.001). CONCLUSION: Handgrip strength and physical activity are associated with frailty in the elderly. The joint presence of low levels of physical activity and decreased handgrip strength leads to a significantly higher probability of an elderly person being categorized as frail.
Abstract:
Machine Learning for geospatial data: algorithms, software tools and case studies. The thesis is devoted to the analysis, modeling and visualisation of spatial environmental data using machine learning algorithms. In a broad sense, machine learning can be considered a subfield of artificial intelligence. It is mainly concerned with the development of techniques and algorithms that allow computers to learn from data. In this thesis, machine learning algorithms are adapted to learn from spatial environmental data and to make spatial predictions. Why machine learning?
In short, most machine learning algorithms are universal, adaptive, nonlinear, robust and efficient modeling tools. They can find solutions to classification, regression and probability density modeling problems in high-dimensional geo-feature spaces, composed of geographical space and additional relevant spatially referenced features. They are well suited to be implemented as predictive engines in decision support systems, for the purposes of environmental data mining, including pattern recognition, modeling and prediction as well as automatic data mapping. Their efficiency is competitive with that of geostatistical models in low-dimensional geographical spaces, and they are indispensable in high-dimensional geo-feature spaces. The most important and popular machine learning algorithms and models of interest for geo- and environmental sciences are presented in detail, from the theoretical description of the concepts to the software implementation. The main algorithms and models considered are the following: multi-layer perceptron (MLP, a workhorse of machine learning), general regression neural networks (GRNN), probabilistic neural networks (PNN), self-organising (Kohonen) maps (SOM), Gaussian mixture models (GMM), radial basis function networks (RBF) and mixture density networks (MDN). This set of models covers machine learning tasks such as classification, regression and density estimation. Exploratory data analysis (EDA) is an initial and very important part of data analysis. In this thesis, the concepts of exploratory spatial data analysis (ESDA) are considered using both a traditional geostatistical approach, such as experimental variography, and machine learning. Experimental variography is a basic tool for the geostatistical analysis of anisotropic spatial correlations, which helps to understand the presence of spatial patterns, at least those described by two-point statistics.
A machine learning approach to ESDA is presented by applying the k-nearest neighbors (k-NN) method, which is simple and has very good interpretation and visualization properties. An important part of the thesis deals with a current hot topic, namely the automatic mapping of geospatial data. The general regression neural network (GRNN) is proposed as an efficient model for this task. The performance of the GRNN model is demonstrated on the Spatial Interpolation Comparison (SIC) 2004 data, where the GRNN model significantly outperformed all other approaches, especially under emergency conditions. The thesis consists of four chapters with the following structure: theory, applications, software tools, and how-to-do-it examples. An important part of the work is a collection of software tools, Machine Learning Office. The Machine Learning Office tools were developed over the last 15 years and have been used both in many teaching courses, including international workshops in China, France, Italy, Ireland and Switzerland, and in fundamental and applied research projects. The case studies considered cover a wide spectrum of real-life low- and high-dimensional geo- and environmental problems, such as air, soil and water pollution by radionuclides and heavy metals, classification of soil types and hydrogeological units, decision-oriented mapping with uncertainties, and natural hazard (landslides, avalanches) assessment and susceptibility mapping. Complementary tools useful for exploratory data analysis and visualisation were developed as well. The software is user-friendly and easy to use.
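To make the GRNN idea concrete: a general regression neural network is equivalent to Nadaraya-Watson kernel regression, so a spatial prediction is a kernel-weighted average of the measured values. The sketch below assumes two-dimensional coordinates, a Gaussian kernel and a single isotropic width `sigma`; the function name and the data are hypothetical and are not taken from Machine Learning Office.

```python
import math


def grnn_predict(train_xy, train_values, query_xy, sigma):
    """Predict a value at query_xy as a Gaussian-weighted average of training values.

    train_xy     -- list of (x, y) coordinates of the measurements
    train_values -- measured values at those coordinates
    query_xy     -- (x, y) coordinate where a prediction is wanted
    sigma        -- kernel width (the only tuning parameter in this sketch)
    """
    numerator = 0.0
    denominator = 0.0
    for (x, y), value in zip(train_xy, train_values):
        squared_distance = (x - query_xy[0]) ** 2 + (y - query_xy[1]) ** 2
        weight = math.exp(-squared_distance / (2.0 * sigma ** 2))
        numerator += weight * value
        denominator += weight
    if denominator == 0.0:
        raise ValueError("all kernel weights underflowed; increase sigma")
    return numerator / denominator


if __name__ == "__main__":
    # Three hypothetical measurement locations and values.
    points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
    values = [1.0, 2.0, 3.0]
    print(grnn_predict(points, values, (0.5, 0.5), sigma=0.5))
```

Because the model has a single width parameter, it can be tuned automatically (for example by cross-validation), which is one reason such a model lends itself to automatic mapping.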
Abstract:
This presentation examines the degree of certainty that can be attained in science. The scientific paradigm comprises two extremes: causality and determinism on one side, probability and indeterminism on the other. Drawing on Hume's notions of resemblance and contiguity, one can reject causality or objective chance as unfounded and non-empirical. The problem of induction and the gambler's fallacy stem from the same cognitive/heuristic source. Hume describes these mental tendencies in his essays "Of Probability" and "Of the Idea of Necessary Connexion". A discussion of Hume's conception of probability, as well as of other interpretations of probability, will be necessary. Even though science glorifies and idealizes causality, probability can be understood as equally coherent. A probabilistic attitude, even though it is also non-empirical, could be more advantageous than the old paradigm of causality.
Abstract:
This paper presents our experience with combining statistical principles and participatory methods to generate national statistics. The methodology was developed in Malawi during 1999–2002. We demonstrate that if PRA is combined with statistical principles (including probability-based sampling and standardization), it can produce total population statistics and estimates of the proportion of households with certain characteristics (e.g., poverty). It can also provide quantitative data on complex issues of national importance such as poverty targeting. This approach is distinct from previous PRA-based approaches, which generate numbers at community level but only provide qualitative information at national level.
Abstract:
The co-polar correlation coefficient (ρhv) has many applications, including hydrometeor classification, ground clutter and melting layer identification, interpretation of ice microphysics and the retrieval of rain drop size distributions (DSDs). However, we currently lack the quantitative error estimates that are necessary if these applications are to be fully exploited. Previous error estimates of ρhv rely on knowledge of the unknown "true" ρhv and implicitly assume a Gaussian probability distribution function of ρhv samples. We show that frequency distributions of ρhv estimates are in fact highly negatively skewed. A new variable: L = -log10(1 - ρhv) is defined, which does have Gaussian error statistics, and a standard deviation depending only on the number of independent radar pulses. This is verified using observations of spherical drizzle drops, allowing, for the first time, the construction of rigorous confidence intervals in estimates of ρhv. In addition, we demonstrate how the imperfect co-location of the horizontal and vertical polarisation sample volumes may be accounted for. The possibility of using L to estimate the dispersion parameter (µ) in the gamma drop size distribution is investigated. We find that including drop oscillations is essential for this application, otherwise there could be biases in retrieved µ of up to ~8. Preliminary results in rainfall are presented. In a convective rain case study, our estimates show µ to be substantially larger than 0 (an exponential DSD). In this particular rain event, rain rate would be overestimated by up to 50% if a simple exponential DSD is assumed.
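The transform described above can be sketched in a few lines: compute L = -log10(1 - ρhv), build a symmetric Gaussian interval in L, and map its endpoints back to ρhv. The abstract states that the standard deviation of L depends only on the number of independent radar pulses but does not give the formula, so `sigma_L` is taken here as a user-supplied input; the function names and the default `z = 1.96` (an approximate 95% Gaussian interval) are assumptions for the example.

```python
import math


def rho_to_L(rho_hv):
    """Transform a correlation coefficient to L = -log10(1 - rho_hv)."""
    if not 0.0 <= rho_hv < 1.0:
        raise ValueError("rho_hv must lie in [0, 1)")
    return -math.log10(1.0 - rho_hv)


def L_to_rho(L):
    """Inverse transform: rho_hv = 1 - 10**(-L)."""
    return 1.0 - 10.0 ** (-L)


def rho_confidence_interval(rho_hv, sigma_L, z=1.96):
    """Confidence interval for rho_hv built symmetrically in L-space.

    sigma_L is the standard deviation of L (assumed known here); the
    interval L +/- z * sigma_L is mapped back to rho_hv.
    """
    L = rho_to_L(rho_hv)
    return L_to_rho(L - z * sigma_L), L_to_rho(L + z * sigma_L)


if __name__ == "__main__":
    lower, upper = rho_confidence_interval(0.99, sigma_L=0.1)
    print(lower, upper)
```

For example, ρhv = 0.99 corresponds to L = 2. An interval that is symmetric in L becomes asymmetric in ρhv, with the longer tail toward lower values, which is consistent with the negative skew of ρhv estimates reported in the abstract.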
Abstract:
We review some issues related to the implications of different missing data mechanisms on statistical inference for contingency tables and consider simulation studies to compare the results obtained under such models to those where the units with missing data are disregarded. We confirm that although, in general, analyses under the correct missing at random and missing completely at random models are more efficient even for small sample sizes, there are exceptions where they may not improve the results obtained by ignoring the partially classified data. We show that under the missing not at random (MNAR) model, estimates on the boundary of the parameter space as well as lack of identifiability of the parameters of saturated models may be associated with undesirable asymptotic properties of maximum likelihood estimators and likelihood ratio tests; even in standard cases the bias of the estimators may be low only for very large samples. We also show that the probability of a boundary solution obtained under the correct MNAR model may be large even for large samples and that, consequently, we may not always conclude that a MNAR model is misspecified because the estimate is on the boundary of the parameter space.
Abstract:
Excessive labor turnover may be considered, to a great extent, an undesirable feature of a given economy. This follows from considerations such as underinvestment in human capital by firms. Understanding the determinants and the evolution of turnover in a particular labor market is therefore of paramount importance, including policy considerations. The present paper proposes an econometric analysis of turnover in the Brazilian labor market, based on a partial observability bivariate probit model. This model considers the interdependence of decisions taken by workers and firms, helping to elucidate the causes that lead each of them to end an employment relationship. The Employment and Unemployment Survey (PED) conducted by the State System of Data Analysis (SEADE) and by the Inter-Union Department of Statistics and Socioeconomic Studies (DIEESE) provides data at the individual worker level, allowing for the estimation of the joint probabilities of decisions to quit or stay on the job on the worker’s side, and to maintain or fire the employee on the firm’s side, during a given time period. The estimated parameters relate these estimated probabilities to the characteristics of workers, job contracts, and to the potential macroeconomic determinants in different time periods. The results confirm the theoretical prediction that the probability of termination of an employment relationship tends to be smaller as the worker acquires specific skills. The results also show that the establishment of a formal employment relationship reduces the probability of a quit decision by the worker, and also the firm’s firing decision in non-industrial sectors. With regard to the evolution of quit probability over time, the results show that an increase in the unemployment rate inhibits quitting, although this tends to wane as the unemployment rate rises.
Abstract:
Includes bibliography
Abstract:
The unavailability of data to inform policy planning and formulation has been repeatedly cited as the main challenge to economic and social progress in the Caribbean. Furthermore, even when data are produced, broad gaps exist between their production and their eventual use for evidence-based policy formulation. Owing to those challenges, this report explores the use of databases of social and gender statistics in the development of policies and programmes in the Caribbean subregion. The report offers a general appraisal of databases against two main considerations: (i) maximizing the use of existing databases in relevant policies and programmes; and (ii) bridging the gaps in data availability of relevant statistical databases and their analyses. The assessment entailed an inventory of social and gender databases maintained by data producers in the region and an analysis of the extent to which the databases are used for policy formulation. To that end, a literature search and consultations with a number of knowledgeable persons active in the field of statistics and data provision were conducted. Based on the review, a set of recommendations was produced to improve current practices within the region with respect to evidence-based policy formulation.