80 resultados para Multivariate statistics
Resumo:
We consider two fundamental properties in the analysis of two-way tables of positive data: the principle of distributional equivalence, one of the cornerstones of correspondence analysis of contingency tables, and the principle of subcompositional coherence, which forms the basis of compositional data analysis. For an analysis to be subcompositionally coherent, it suffices to analyse the ratios of the data values. The usual approach to dimension reduction in compositional data analysis is to perform principal component analysis on the logarithms of ratios, but this method does not obey the principle of distributional equivalence. We show that by introducing weights for the rows and columns, the method achieves this desirable property. This weighted log-ratio analysis is theoretically equivalent to spectral mapping , a multivariate method developed almost 30 years ago for displaying ratio-scale data from biological activity spectra. The close relationship between spectral mapping and correspondence analysis is also explained, as well as their connection with association modelling. The weighted log-ratio methodology is applied here to frequency data in linguistics and to chemical compositional data in archaeology.
Resumo:
In the analysis of multivariate categorical data, typically the analysis of questionnaire data, it is often advantageous, for substantive and technical reasons, to analyse a subset of response categories. In multiple correspondence analysis, where each category is coded as a column of an indicator matrix or row and column of Burt matrix, it is not correct to simply analyse the corresponding submatrix of data, since the whole geometric structure is different for the submatrix . A simple modification of the correspondence analysis algorithm allows the overall geometric structure of the complete data set to be retained while calculating the solution for the selected subset of points. This strategy is useful for analysing patterns of response amongst any subset of categories and relating these patterns to demographic factors, especially for studying patterns of particular responses such as missing and neutral responses. The methodology is illustrated using data from the International Social Survey Program on Family and Changing Gender Roles in 1994.
Resumo:
Principal curves have been defined Hastie and Stuetzle (JASA, 1989) assmooth curves passing through the middle of a multidimensional dataset. They are nonlinear generalizations of the first principalcomponent, a characterization of which is the basis for the principalcurves definition.In this paper we propose an alternative approach based on a differentproperty of principal components. Consider a point in the space wherea multivariate normal is defined and, for each hyperplane containingthat point, compute the total variance of the normal distributionconditioned to belong to that hyperplane. Choose now the hyperplaneminimizing this conditional total variance and look for thecorresponding conditional mean. The first principal component of theoriginal distribution passes by this conditional mean and it isorthogonal to that hyperplane. This property is easily generalized todata sets with nonlinear structure. Repeating the search from differentstarting points, many points analogous to conditional means are found.We call them principal oriented points. When a one-dimensional curveruns the set of these special points it is called principal curve oforiented points. Successive principal curves are recursively definedfrom a generalization of the total variance.
Resumo:
A biplot, which is the multivariate generalization of the two-variable scatterplot, can be used to visualize the results of many multivariate techniques, especially those that are based on the singular value decomposition. We consider data sets consisting of continuous-scale measurements, their fuzzy coding and the biplots that visualize them, using a fuzzy version of multiple correspondence analysis. Of special interest is the way quality of fit of the biplot is measured, since it is well-known that regular (i.e., crisp) multiple correspondence analysis seriously under-estimates this measure. We show how the results of fuzzy multiple correspondence analysis can be defuzzified to obtain estimated values of the original data, and prove that this implies an orthogonal decomposition of variance. This permits a measure of fit to be calculated in the familiar form of a percentage of explained variance, which is directly comparable to the corresponding fit measure used in principal component analysis of the original data. The approach is motivated initially by its application to a simulated data set, showing how the fuzzy approach can lead to diagnosing nonlinear relationships, and finally it is applied to a real set of meteorological data.
Resumo:
We have analyzed the spatial accuracy of European foreign trade statistics compared to Latin American. We have also included USA s data because of the importance of this country in Latin American trade. We have developed a method for mapping discrepancies between exporters and importers, trying to isolate systematic spatial deviations. Although our results don t allow a unique explanation, they present some interesting clues to the distribution channels in the Latin American Continent as well as some spatial deviations for statistics in individual countries. Connecting our results with the literature specialized in the accuracy of foreign trade statistics; we can revisit Morgernstern (1963) as well as Federico and Tena (1991). Morgernstern had had a really pessimistic view on the reliability of this statistic source, but his main alert was focused on the trade balances, not in gross export or import values. Federico and Tena (1991) have demonstrated howaccuracy increases by aggregation, geographical and of product at the same time. But they still have a pessimistic view with relation to distribution questions, remarking that perhaps it will be more accurate to use import sources in this latest case. We have stated that the data set coming from foreign trade statistics for a sample in 1925, being it exporters or importers, it s a valuable tool for geography of trade patterns, although in some specific cases it needs some spatial adjustments.
Resumo:
The goal of this paper is to estimate time-varying covariance matrices.Since the covariance matrix of financial returns is known to changethrough time and is an essential ingredient in risk measurement, portfolioselection, and tests of asset pricing models, this is a very importantproblem in practice. Our model of choice is the Diagonal-Vech version ofthe Multivariate GARCH(1,1) model. The problem is that the estimation ofthe general Diagonal-Vech model model is numerically infeasible indimensions higher than 5. The common approach is to estimate more restrictive models which are tractable but may not conform to the data. Our contributionis to propose an alternative estimation method that is numerically feasible,produces positive semi-definite conditional covariance matrices, and doesnot impose unrealistic a priori restrictions. We provide an empiricalapplication in the context of international stock markets, comparing thenew estimator to a number of existing ones.
Resumo:
We consider the joint visualization of two matrices which have common rowsand columns, for example multivariate data observed at two time pointsor split accord-ing to a dichotomous variable. Methods of interest includeprincipal components analysis for interval-scaled data, or correspondenceanalysis for frequency data or ratio-scaled variables on commensuratescales. A simple result in matrix algebra shows that by setting up thematrices in a particular block format, matrix sum and difference componentscan be visualized. The case when we have more than two matrices is alsodiscussed and the methodology is applied to data from the InternationalSocial Survey Program.
Resumo:
A class of composite estimators of small area quantities that exploit spatial (distancerelated)similarity is derived. It is based on a distribution-free model for the areas, but theestimators are aimed to have optimal design-based properties. Composition is applied alsoto estimate some of the global parameters on which the small area estimators depend.It is shown that the commonly adopted assumption of random effects is not necessaryfor exploiting the similarity of the districts (borrowing strength across the districts). Themethods are applied in the estimation of the mean household sizes and the proportions ofsingle-member households in the counties (comarcas) of Catalonia. The simplest version ofthe estimators is more efficient than the established alternatives, even though the extentof spatial similarity is quite modest.
Resumo:
The generalization of simple (two-variable) correspondence analysis to more than two categorical variables, commonly referred to as multiple correspondence analysis, is neither obvious nor well-defined. We present two alternative ways of generalizing correspondence analysis, one based on the quantification of the variables and intercorrelation relationships, and the other based on the geometric ideas of simple correspondence analysis. We propose a version of multiple correspondence analysis, with adjusted principal inertias, as the method of choice for the geometric definition, since it contains simple correspondence analysis as an exact special case, which is not the situation of the standard generalizations. We also clarify the issue of supplementary point representation and the properties of joint correspondence analysis, a method that visualizes all two-way relationships between the variables. The methodology is illustrated using data on attitudes to science from the International Social Survey Program on Environment in 1993.
Resumo:
This paper proposes a nonparametric test in order to establish the level of accuracy of theforeign trade statistics of 17 Latin American countries when contrasted with the trade statistics of the main partners in 1925. The Wilcoxon Matched-Pairs Ranks test is used to determine whether the differences between the data registered by exporters and importers are meaningful, and if so, whether the differences are systematic in any direction. The paper tests for the reliability of the data registered for two homogeneous products, petroleum and coal, both in volume and value. The conclusion of the several exercises performed is that we cannot accept the existence of statistically significant differences between the data provided by the exporters and the registered by the importing countries in most cases. The qualitative historiography of Latin American describes its foreign trade statistics as mostly unusable. Our quantitative results contest this view.
Resumo:
A national survey designed for estimating a specific population quantity is sometimes used for estimation of this quantity also for a small area, such as a province. Budget constraints do not allow a greater sample size for the small area, and so other means of improving estimation have to be devised. We investigate such methods and assess them by a Monte Carlo study. We explore how a complementary survey can be exploited in small area estimation. We use the context of the Spanish Labour Force Survey (EPA) and the Barometer in Spain for our study.
Resumo:
The singular value decomposition and its interpretation as alinear biplot has proved to be a powerful tool for analysing many formsof multivariate data. Here we adapt biplot methodology to the specifficcase of compositional data consisting of positive vectors each of whichis constrained to have unit sum. These relative variation biplots haveproperties relating to special features of compositional data: the studyof ratios, subcompositions and models of compositional relationships. Themethodology is demonstrated on a data set consisting of six-part colourcompositions in 22 abstract paintings, showing how the singular valuedecomposition can achieve an accurate biplot of the colour ratios and howpossible models interrelating the colours can be diagnosed.
Resumo:
We continue the development of a method for the selection of a bandwidth or a number of design parameters in density estimation. We provideexplicit non-asymptotic density-free inequalities that relate the $L_1$ error of the selected estimate with that of the best possible estimate,and study in particular the connection between the richness of the classof density estimates and the performance bound. For example, our methodallows one to pick the bandwidth and kernel order in the kernel estimatesimultaneously and still assure that for {\it all densities}, the $L_1$error of the corresponding kernel estimate is not larger than aboutthree times the error of the estimate with the optimal smoothing factor and kernel plus a constant times $\sqrt{\log n/n}$, where $n$ is the sample size, and the constant only depends on the complexity of the family of kernels used in the estimate. Further applications include multivariate kernel estimates, transformed kernel estimates, and variablekernel estimates.
Resumo:
Accomplish high quality of final products in pharmaceutical industry is a challenge that requires the control and supervision of all the manufacturing steps. This request created the necessity of developing fast and accurate analytical methods. Near infrared spectroscopy together with chemometrics, fulfill this growing demand. The high speed providing relevant information and the versatility of its application to different types of samples lead these combined techniques as one of the most appropriated. This study is focused on the development of a calibration model able to determine amounts of API from industrial granulates using NIR, chemometrics and process spectra methodology.
Resumo:
This work proposes novel network analysis techniques for multivariate time series.We define the network of a multivariate time series as a graph where verticesdenote the components of the process and edges denote non zero long run partialcorrelations. We then introduce a two step LASSO procedure, called NETS, toestimate high dimensional sparse Long Run Partial Correlation networks. This approachis based on a VAR approximation of the process and allows to decomposethe long run linkages into the contribution of the dynamic and contemporaneousdependence relations of the system. The large sample properties of the estimatorare analysed and we establish conditions for consistent selection and estimation ofthe non zero long run partial correlations. The methodology is illustrated with anapplication to a panel of U.S. bluechips.