836 resultados para CATEGORICAL-DATA ANALYSIS
Resumo:
Thesis (Ph.D.)--University of Washington, 2016-06
Resumo:
Quantile computation has many applications including data mining and financial data analysis. It has been shown that an is an element of-approximate summary can be maintained so that, given a quantile query d (phi, is an element of), the data item at rank [phi N] may be approximately obtained within the rank error precision is an element of N over all N data items in a data stream or in a sliding window. However, scalable online processing of massive continuous quantile queries with different phi and is an element of poses a new challenge because the summary is continuously updated with new arrivals of data items. In this paper, first we aim to dramatically reduce the number of distinct query results by grouping a set of different queries into a cluster so that they can be processed virtually as a single query while the precision requirements from users can be retained. Second, we aim to minimize the total query processing costs. Efficient algorithms are developed to minimize the total number of times for reprocessing clusters and to produce the minimum number of clusters, respectively. The techniques are extended to maintain near-optimal clustering when queries are registered and removed in an arbitrary fashion against whole data streams or sliding windows. In addition to theoretical analysis, our performance study indicates that the proposed techniques are indeed scalable with respect to the number of input queries as well as the number of items and the item arrival rate in a data stream.
Resumo:
The importance of availability of comparable real income aggregates and their components to applied economic research is highlighted by the popularity of the Penn World Tables. Any methodology designed to achieve such a task requires the combination of data from several sources. The first is purchasing power parities (PPP) data available from the International Comparisons Project roughly every five years since the 1970s. The second is national level data on a range of variables that explain the behaviour of the ratio of PPP to market exchange rates. The final source of data is the national accounts publications of different countries which include estimates of gross domestic product and various price deflators. In this paper we present a method to construct a consistent panel of comparable real incomes by specifying the problem in state-space form. We present our completed work as well as briefly indicate our work in progress.
Resumo:
This paper describes how the statistical technique of cluster analysis and the machine learning technique of rule induction can be combined to explore a database. The ways in which such an approach alleviates the problems associated with other techniques for data analysis are discussed. We report the results of experiments carried out on a database from the medical diagnosis domain. Finally we describe the future developments which we plan to carry out to build on our current work.
Resumo:
Principal component analysis (PCA) is a ubiquitous technique for data analysis and processing, but one which is not based upon a probability model. In this paper we demonstrate how the principal axes of a set of observed data vectors may be determined through maximum-likelihood estimation of parameters in a latent variable model closely related to factor analysis. We consider the properties of the associated likelihood function, giving an EM algorithm for estimating the principal subspace iteratively, and discuss the advantages conveyed by the definition of a probability density function for PCA.
Resumo:
Principal component analysis (PCA) is a ubiquitous technique for data analysis and processing, but one which is not based upon a probability model. In this paper we demonstrate how the principal axes of a set of observed data vectors may be determined through maximum-likelihood estimation of parameters in a latent variable model closely related to factor analysis. We consider the properties of the associated likelihood function, giving an EM algorithm for estimating the principal subspace iteratively, and discuss the advantages conveyed by the definition of a probability density function for PCA.
Resumo:
This paper examines the source country determinants of FDI into Japan. The paper highlights certain methodological and theoretical weaknesses in the previous literature and offers some explanations for hitherto ambiguous results. Specifically, the paper highlights the importance of panel data analysis, and the identification of fixed effects in the analysis rather than simply pooling the data. Indeed, we argue that many of the results reported elsewhere are a feature of this mis-specification. To this end, pooled, fixed effects and random effects estimates are compared. The results suggest that FDI into Japan is inversely related to trade flows, such that trade and FDI are substitutes. Moreover, the results also suggest that FDI increases with home country political and economic stability. The paper also shows that previously reported results, regarding the importance of exchange rates, relative borrowing costs and labour costs in explaining FDI flows, are sensitive to the econometric specification and estimation approach. The paper also discusses the importance of these results within a policy context. In recent years Japan has sought to attract FDI, though many firms still complain of barriers to inward investment penetration in Japan. The results show that cultural and geographic distance are only of marginal importance in explaining FDI, and that the results are consistent with the market-seeking explanation of FDI. As such, the attitude to risk in the source country is strongly related to the size of FDI flows to Japan. © 2007 The Authors Journal compilation © 2007 Blackwell Publishing Ltd.
Resumo:
Analysis of variance (ANOVA) is the most efficient method available for the analysis of experimental data. Analysis of variance is a method of considerable complexity and subtlety, with many different variations, each of which applies in a particular experimental context. Hence, it is possible to apply the wrong type of ANOVA to data and, therefore, to draw an erroneous conclusion from an experiment. This article reviews the types of ANOVA most likely to arise in clinical experiments in optometry including the one-way ANOVA ('fixed' and 'random effect' models), two-way ANOVA in randomised blocks, three-way ANOVA, and factorial experimental designs (including the varieties known as 'split-plot' and 'repeated measures'). For each ANOVA, the appropriate experimental design is described, a statistical model is formulated, and the advantages and limitations of each type of design discussed. In addition, the problems of non-conformity to the statistical model and determination of the number of replications are considered. © 2002 The College of Optometrists.
Resumo:
The present work describes the development of a proton induced X-ray emission (PIXE) analysis system, especially designed and builtfor routine quantitative multi-elemental analysis of a large number of samples. The historical and general developments of the analytical technique and the physical processes involved are discussed. The philosophy, design, constructional details and evaluation of a versatile vacuum chamber, an automatic multi-sample changer, an on-demand beam pulsing system and ion beam current monitoring facility are described.The system calibration using thin standard foils of Si, P, S,Cl, K, Ca, Ti, V, Fe, Cu, Ga, Ge, Rb, Y and Mo was undertaken at proton beam energies of 1 to 3 MeV in steps of 0.5 MeV energy and compared with theoretical calculations. An independent calibration check using bovine liver Standard Reference Material was performed. The minimum detectable limits have been experimentally determined at detector positions of 90° and 135° with respect to the incident beam for the above range of proton energies as a function of atomic number Z. The system has detection limits of typically 10- 7 to 10- 9 g for elements 14
Resumo:
This article examines the negotiation of face in post observation feedback conferences on an initial teacher training programme. The conferences were held in groups with one trainer and up to four trainees and followed a set of generic norms. These norms include the right to offer advice and to criticise, speech acts which are often considered to be face threatening in more normal contexts. However, as the data analysis shows, participants also interact in ways that challenge the generic norms, some of which might be considered more conventionally face attacking. The article argues that face should be analysed at the level of interaction (Haugh and Bargiela-Chiappini, 2010) and that situated and contextual detail is relevant to its analysis. It suggests that linguistic ethnography, which 'marries' (Wetherell, 2007) linguistics and ethnography, provides a useful theoretical framework for doing so. To this end the study draws on real-life talk-in-interaction (from transcribed recordings), the participants' perspectives (from focus groups and interviews) and situated detail (from fieldnotes) to produce a contextualised and nuanced analysis. © 2011 Elsevier B.V.
Resumo:
The goal of this study is to determine if various measures of contraction rate are regionally patterned in written Standard American English. In order to answer this question, this study employs a corpus-based approach to data collection and a statistical approach to data analysis. Based on a spatial autocorrelation analysis of the values of eleven measures of contraction across a 25 million word corpus of letters to the editor representing the language of 200 cities from across the contiguous United States, two primary regional patterns were identified: easterners tend to produce relatively few standard contractions (not contraction, verb contraction) compared to westerners, and northeasterners tend to produce relatively few non-standard contractions (to contraction, non-standard not contraction) compared to southeasterners. These findings demonstrate that regional linguistic variation exists in written Standard American English and that regional linguistic variation is more common than is generally assumed.
Resumo:
This thesis seeks to describe the development of an inexpensive and efficient clustering technique for multivariate data analysis. The technique starts from a multivariate data matrix and ends with graphical representation of the data and pattern recognition discriminant function. The technique also results in distances frequency distribution that might be useful in detecting clustering in the data or for the estimation of parameters useful in the discrimination between the different populations in the data. The technique can also be used in feature selection. The technique is essentially for the discovery of data structure by revealing the component parts of the data. lhe thesis offers three distinct contributions for cluster analysis and pattern recognition techniques. The first contribution is the introduction of transformation function in the technique of nonlinear mapping. The second contribution is the us~ of distances frequency distribution instead of distances time-sequence in nonlinear mapping, The third contribution is the formulation of a new generalised and normalised error function together with its optimal step size formula for gradient method minimisation. The thesis consists of five chapters. The first chapter is the introduction. The second chapter describes multidimensional scaling as an origin of nonlinear mapping technique. The third chapter describes the first developing step in the technique of nonlinear mapping that is the introduction of "transformation function". The fourth chapter describes the second developing step of the nonlinear mapping technique. This is the use of distances frequency distribution instead of distances time-sequence. The chapter also includes the new generalised and normalised error function formulation. Finally, the fifth chapter, the conclusion, evaluates all developments and proposes a new program. for cluster analysis and pattern recognition by integrating all the new features.
Resumo:
This book is aimed primarily at microbiologists who are undertaking research and who require a basic knowledge of statistics to analyse their experimental data. Computer software employing a wide range of data analysis methods is widely available to experimental scientists. The availability of this software, however, makes it essential that investigators understand the basic principles of statistics. Statistical analysis of data can be complex with many different methods of approach, each of which applies in a particular experimental circumstance. Hence, it is possible to apply an incorrect statistical method to data and to draw the wrong conclusions from an experiment. The purpose of this book, which has its origin in a series of articles published in the Society for Applied Microbiology journal ‘The Microbiologist’, is an attempt to present the basic logic of statistics as clearly as possible and therefore, to dispel some of the myths that often surround the subject. The 28 ‘Statnotes’ deal with various topics that are likely to be encountered, including the nature of variables, the comparison of means of two or more groups, non-parametric statistics, analysis of variance, correlating variables, and more complex methods such as multiple linear regression and principal components analysis. In each case, the relevant statistical method is illustrated with examples drawn from experiments in microbiological research. The text incorporates a glossary of the most commonly used statistical terms and there are two appendices designed to aid the investigator in the selection of the most appropriate test.
Resumo:
DEA literature continues apace but software has lagged behind. This session uses suitably selected data to present newly developed software which includes many of the most recent DEA models. The software enables the user to address a variety of issues not frequently found in existing DEA software such as: -Assessments under a variety of possible assumptions of returns to scale including NIRS and NDRS; -Scale elasticity computations; -Numerous Input/Output variables and truly unlimited number of assessment units (DMUs) -Panel data analysis -Analysis of categorical data (multiple categories) -Malmquist Index and its decompositions -Computations of Supper efficiency -Automated removal of super-efficient outliers under user-specified criteria; -Graphical presentation of results -Integrated statistical tests