980 results for Missing values structures
Abstract:
The research was sparked by an exchange in South Korea, during which the author identified a lack of English-language research offering economic, up-to-date and realistic information about the North Korean market. A need was identified for research that would describe the market’s existing and missing market structures and explore possibilities to overcome the missing ones. Institutional theory was chosen as a suitable framework to describe and explore the market. The research question was formulated as follows: “How can foreign companies overcome institutional voids in the North Korean market?”. To answer the research question, it was divided into three sub-questions: (1) What is the institutional environment in North Korea like? (2) What are the major institutional voids in the North Korean market? (3) What possibilities do foreign companies have to overcome institutional voids? The research is qualitative, owing to the descriptive and exploratory nature of the research question. Data collection consisted of expert interviews and content analysis, resulting in primary data from two interviews and secondary data from 95 articles drawn from 40 different sources. The data were analyzed with the systematic technique of content analysis: they were coded, classified and presented as concepts with the help of a classification system built on the theoretical framework adapted for this study. The findings can be summarized as follows. (1) The market institutions are characterized by an overlapping dual system of formal, socialist structures and informal, market-oriented structures. (2) Institutional voids prevail both at the market’s contextual level and at the market level. They are partly the result of old institutions being replaced by new institutions that lack institutionalization. (3) The identified possibilities to overcome institutional voids correspond with possibilities drawn from previous research. This weakens the image of North Korea as an impossibly unique market to operate in. (4) An emerging middle class, rapidly growing entrepreneurial activity and women’s increasing role in business drive bottom-up change in the market. This signals the recent development of the market, yet it has been overlooked in the Western media. Thus there is a need for further economic, up-to-date research concerning North Korea.
Abstract:
Learning Disability (LD) is a classification that includes several disorders in which a child has difficulty learning in a typical manner, usually caused by an unknown factor or factors. LD affects about 15% of children enrolled in schools. Predicting learning disability is a complicated task, since LD must be identified from diverse features or signs. There is no cure for learning disabilities and they are life-long. The problems of children with specific learning disabilities have been a cause of concern to parents and teachers for some time. The aim of this paper is to develop a new algorithm for imputing missing values and to determine the significance of the missing value imputation method and the dimensionality reduction method for the performance of fuzzy and neuro-fuzzy classifiers, with specific emphasis on the prediction of learning disabilities in school-age children. In the basic assessment method for predicting LD, checklists are generally used; the data cases collected in this way depend heavily on the mood of the children and may contain redundant as well as missing values. Therefore, in this study, we propose a new correlation-based algorithm for imputing the missing values and Principal Component Analysis (PCA) for removing the irrelevant attributes. The study finds that the preprocessing methods applied improve the quality of the data and thereby increase the accuracy of the classifiers. The system is implemented in MathWorks MATLAB 7.10. The results obtained illustrate that the developed missing value imputation method is a valuable contribution to the prediction system and is capable of improving the performance of a classifier.
Abstract:
Learning Disability (LD) is a neurological condition that affects a child’s brain and impairs the child’s ability to carry out one or many specific tasks. LD affects about 15% of children enrolled in schools. The prediction of LD is a vital and intricate job. The aim of this paper is to design an effective and powerful tool, using two intelligent methods, namely Artificial Neural Network and Adaptive Neuro-Fuzzy Inference System, for measuring the percentage of LD affecting school-age children. In this study, we propose some soft computing methods for data preprocessing to improve the accuracy of the tool as well as of the classifier. The data preprocessing uses Principal Component Analysis for attribute reduction and the closest fit algorithm for imputing missing values. The main idea in developing the LD prediction tool is not only to predict the LD present in children but also to measure its percentage along with its class (low, minor or major). The system is implemented in MathWorks MATLAB 7.10. The results obtained from this study illustrate that the designed prediction tool is capable of measuring LD effectively.
Abstract:
As stated in Aitchison (1986), a proper study of relative variation in a compositional data set should be based on logratios, and dealing with logratios excludes dealing with zeros. Nevertheless, it is clear that zero observations might be present in real data sets, either because the corresponding part is completely absent (essential zeros) or because it is below the detection limit (rounded zeros). Because the second kind of zero is usually understood as “a trace too small to measure”, it seems reasonable to replace it by a suitable small value, and this has been the traditional approach. As stated, e.g., by Tauber (1999) and by Martín-Fernández, Barceló-Vidal, and Pawlowsky-Glahn (2000), the principal problem in compositional data analysis is related to rounded zeros. One should be careful to use a replacement strategy that does not seriously distort the general structure of the data. In particular, the covariance structure of the involved parts, and thus the metric properties, should be preserved, as otherwise further analysis on subpopulations could be misleading. Following this point of view, a non-parametric imputation method is introduced in Martín-Fernández, Barceló-Vidal, and Pawlowsky-Glahn (2000). This method is analyzed in depth by Martín-Fernández, Barceló-Vidal, and Pawlowsky-Glahn (2003), where it is shown that the theoretical drawbacks of the additive zero replacement method proposed in Aitchison (1986) can be overcome using a new multiplicative approach on the non-zero parts of a composition. The new approach has reasonable properties from a compositional point of view. In particular, it is “natural” in the sense that it recovers the “true” composition if the replacement values are identical to the missing values, and it is coherent with the basic operations on the simplex. This coherence implies that the covariance structure of subcompositions with no zeros is preserved. As a generalization of the multiplicative replacement, a substitution method for missing values in compositional data sets is introduced in the same paper.
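The multiplicative replacement mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `multiplicative_replace`, the use of one shared replacement value `delta` for every zero, and the default closure constant of 1 are assumptions made here for illustration only.

```python
def multiplicative_replace(x, delta, total=1.0):
    """Replace zeros in a composition by `delta` and rescale the non-zero parts.

    x     : list of non-negative parts that sum to `total`
    delta : small replacement value used for every zero (illustrative choice)
    total : closure constant of the composition (assumed to be 1 here)
    """
    n_zeros = sum(1 for part in x if part == 0)
    # All non-zero parts are multiplied by the same factor, so the ratios
    # between them (and hence their logratios) are left unchanged.
    shrink = 1.0 - n_zeros * delta / total
    return [delta if part == 0 else part * shrink for part in x]


composition = [0.6, 0.3, 0.1, 0.0]
replaced = multiplicative_replace(composition, delta=0.005)
```

Because the non-zero parts share one scaling factor, the replaced composition still sums to the closure constant and the ratio between any two originally non-zero parts is preserved, which is the property the abstract refers to when it says the covariance structure of zero-free subcompositions is kept.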
Abstract:
R, available from http://www.r-project.org/, is ‘GNU S’, a language and environment for statistical computing and graphics. Many classical and modern statistical techniques have been implemented in the environment itself, but many are supplied as packages. There are 8 standard packages, and many more are available through the CRAN family of Internet sites, http://cran.r-project.org. We started to develop a library of functions in R to support the analysis of mixtures, and our goal is a MixeR package for compositional data analysis that provides support for: operations on compositions (perturbation and power multiplication, subcomposition with or without residuals, centering of the data, computing Aitchison’s, Euclidean and Bhattacharyya distances, compositional Kullback-Leibler divergence, etc.); graphical presentation of compositions in ternary diagrams and tetrahedrons with additional features (barycenter, geometric mean of the data set, percentile lines, marking and coloring of subsets of the data set and their geometric means, notation of individual data in the set, ...); dealing with zeros and missing values in compositional data sets, with R procedures for the simple and multiplicative replacement strategies; and the time series analysis of compositional data. We will present the current status of MixeR development and illustrate its use on selected data sets.
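Some of the operations on compositions listed in this abstract can be sketched briefly. The sketch below is written in Python rather than R and does not reproduce the MixeR interface, which the abstract does not describe; the function names `closure`, `perturb`, `power`, `clr` and `aitchison_distance` are hypothetical, and all parts are assumed to be strictly positive.

```python
import math


def closure(x, total=1.0):
    """Rescale the parts so that they sum to `total`."""
    s = sum(x)
    return [total * v / s for v in x]


def perturb(x, y):
    """Perturbation: component-wise product followed by closure."""
    return closure([a * b for a, b in zip(x, y)])


def power(x, alpha):
    """Power multiplication: component-wise power followed by closure."""
    return closure([a ** alpha for a in x])


def clr(x):
    """Centered logratio transform (log of each part minus the mean log)."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def aitchison_distance(x, y):
    """Euclidean distance between the clr-transformed compositions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(clr(x), clr(y))))


x = [0.2, 0.3, 0.5]
y = [0.1, 0.6, 0.3]
neutral = [1 / 3, 1 / 3, 1 / 3]  # neutral element of perturbation
shift = [0.5, 0.25, 0.25]
```

Perturbing by the uniform composition `neutral` leaves a composition unchanged, and the Aitchison distance between `x` and `y` is the same before and after both are perturbed by `shift`.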
Abstract:
The R-package “compositions” is a tool for advanced compositional analysis. Its basic functionality has seen some conceptual improvement, containing now some facilities to work with and represent ilr bases built from balances, and an elaborated subsystem for dealing with several kinds of irregular data: (rounded or structural) zeroes, incomplete observations and outliers. The general approach to these irregularities is based on subcompositions: for an irregular datum, one can distinguish a “regular” subcomposition (where all parts are actually observed and the datum behaves typically) and a “problematic” subcomposition (with those unobserved, zero or rounded parts, or else where the datum shows an erratic or atypical behaviour). Systematic classification schemes are proposed for both outliers and missing values (including zeros) focusing on the nature of irregularities in the datum subcomposition(s). To compute statistics with values missing at random and structural zeros, a projection approach is implemented: a given datum contributes to the estimation of the desired parameters only on the subcomposition where it was observed. For data sets with values below the detection limit, two different approaches are provided: the well-known imputation technique, and also the projection approach. To compute statistics in the presence of outliers, robust statistics are adapted to the characteristics of compositional data, based on the minimum covariance determinant approach. The outlier classification is based on four different models of outlier occurrence and Monte-Carlo-based tests for their characterization. Furthermore the package provides special plots helping to understand the nature of outliers in the dataset. Keywords: coda-dendrogram, lost values, MAR, missing data, MCD estimator, robustness, rounded zeros
Abstract:
Generally, classifiers tend to overfit if there is noise in the training data or there are missing values. Ensemble learning methods are often used to improve a classifier's classification accuracy. Most ensemble learning approaches aim to improve the classification accuracy of decision trees. However, alternative classifiers to decision trees exist. The recently developed Random Prism ensemble learner for classification aims to improve an alternative classification rule induction approach, the Prism family of algorithms, which addresses some of the limitations of decision trees. However, Random Prism suffers, like any ensemble learner, from a high computational overhead due to replication of the data and the induction of multiple base classifiers. Hence even modest-sized datasets may pose a computational challenge to ensemble learners such as Random Prism. Parallelism is often used to scale up algorithms to deal with large datasets. This paper investigates parallelisation for Random Prism, implements a prototype and evaluates it empirically using a Hadoop computing cluster.
Abstract:
The substitution of missing values, also called imputation, is an important data preparation task for many domains. Ideally, the substitution of missing values should not insert biases into the dataset. This aspect has usually been assessed by measures of the prediction capability of imputation methods. Such measures rely on simulating missing entries for some attributes whose values are actually known. These artificially missing values are imputed and then compared with the original values. Although this evaluation is useful, it does not allow the influence of imputed values on the ultimate modelling task (e.g. classification) to be inferred. We argue that imputation cannot be properly evaluated apart from the modelling task. Thus, alternative approaches are needed. This article elaborates on the influence of imputed values in classification. In particular, a practical procedure for estimating the inserted bias is described. As an additional contribution, we have used this procedure to empirically illustrate the performance of three imputation methods (majority, naive Bayes and Bayesian networks) on three datasets. Three classifiers (decision tree, naive Bayes and nearest neighbours) have been used as modelling tools in our experiments. The results illustrate a variety of situations that can arise in data preparation practice.
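The conventional evaluation that this abstract describes (hide known values, impute them, compare with the originals) can be sketched as follows. This is an illustration only: it shows the prediction-capability measure the authors consider insufficient on its own, not their procedure for estimating the inserted bias, and the function names, the `missing_rate` parameter and the fixed random seed are hypothetical.

```python
import random
from collections import Counter


def majority_impute(values):
    """Fill None entries with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def prediction_accuracy(true_values, missing_rate=0.3, seed=0):
    """Hide a share of known values, impute them, and score the matches.

    Assumes at least two values, so that one value always stays observed.
    """
    rng = random.Random(seed)
    n = len(true_values)
    # Hide at least one value, but never all of them.
    n_hidden = min(n - 1, max(1, int(missing_rate * n)))
    hidden = set(rng.sample(range(n), n_hidden))
    masked = [None if i in hidden else v for i, v in enumerate(true_values)]
    imputed = majority_impute(masked)
    hits = sum(imputed[i] == true_values[i] for i in hidden)
    return hits / n_hidden


attribute = ["a", "a", "b", "b", "b", "c", "a", "b", "b", "b"]
score = prediction_accuracy(attribute)
```

A high score here says only that the imputed values resemble the hidden ones; as the abstract argues, it does not show how the imputed values affect a classifier trained on the completed data.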
Abstract:
The paper analyses empirical performance data from five commercial PV plants in Germany. The purpose was, on the one hand, to investigate the weak-light performance of the different PV modules used and, on the other hand, to quantify and compare the shading losses of different PV-array configurations. The study matters because, even if the behavior under weak-light conditions or the shading losses might seem to be a relatively small percentage of the total yearly output, in projects where a performance guarantee is given these variations can make the difference between meeting the conditions or not. When analyzing the data, a high dispersion was found. To reduce the optical losses and spectral effects, a series of data filters was applied based on the angle of incidence and absolute Air Mass. To compensate for the temperature effects and translate the values to STC (25°C), five different methods were assessed. In the end, Procedure 2 of IEC 60891 was selected because of its relative simplicity, its use of mostly standard parameters found in datasheets, its good accuracy even with missing values, and its potential to improve the results when the complete set of inputs is available. After analyzing the data, the weak-light performance of the modules did not show a clear superiority of a certain technology or technology group over the others. Moreover, the uncertainties in the measurements restrict the conclusiveness of the results. In the partial shading analysis, the landscape mounting of mc-Si PV modules in free-field installations showed a significantly better performance than the portrait one. The cross-table string using CIGS modules did not deliver the expected benefits and actually performed worse than a regular one-string-per-table layout. Parallel substrings with CdTe functioned properly and showed relatively low losses. Of the two CdTe product generations analyzed, neither showed a significantly better performance under partial shading.
Abstract:
This work aims to evaluate the efficiency of the Brazilian stock market using statistical tests, with a view to subsequently modelling the stock return series using the ARMA, ARCH and GARCH models, the Decomposition Model and, finally, VAR. For this work, intraday data were collected; these are considered high-frequency data and are less susceptible to possible changes in market structure, both micro- and macroeconomic. We chose to work with data collected every five minutes, owing to the low liquidity of the assets in the financial market (which could lead to missing data for shorter time intervals). The series chosen were Petrobrás PN, Gerdau PN, Bradesco PN, Vale do Rio Doce PN and the Ibovespa index, which are highly representative of the Brazilian stock market for the period analysed. Based on the Dickey-Fuller test, there were indications that the Brazilian stock market may be efficient, and models were therefore proposed for the return series of the stocks mentioned above.
Abstract:
Multi-factor models constitute a useful tool to explain cross-sectional covariance in equity returns. We propose in this paper the use of irregularly spaced returns in the multi-factor model estimation and provide an empirical example with the 389 most liquid equities in the Brazilian market. The market index proves significant in explaining equity returns, while the US$/Brazilian Real exchange rate and the Brazilian standard interest rate do not. This example shows the usefulness of the estimation method when the model is further used to fill in missing values and to provide interval forecasts.
Abstract:
This paper presents new methodology for making Bayesian inference about dynamic models for exponential family observations. The approach is simulation-based and involves the use of Markov chain Monte Carlo techniques. A Metropolis-Hastings algorithm is combined with the Gibbs sampler in repeated use of an adjusted version of normal dynamic linear models. Different alternative schemes are derived and compared. The approach is fully Bayesian in obtaining posterior samples for state parameters and unknown hyperparameters. Illustrations to real data sets with sparse counts and missing values are presented. Extensions to accommodate general distributions for observations and disturbances, intervention, non-linear models and multivariate time series are outlined.
Abstract:
Multi-factor models constitute a useful tool to explain cross-sectional covariance in equity returns. We propose in this paper the use of irregularly spaced returns in the multi-factor model estimation and provide an empirical example with the 389 most liquid equities in the Brazilian market. The market index proves significant in explaining equity returns, while the US$/Brazilian Real exchange rate and the Brazilian standard interest rate do not. This example shows the usefulness of the estimation method when the model is further used to fill in missing values and to provide interval forecasts.
Abstract:
In this work we study the survival cure rate model proposed by Yakovlev (1993), considered in a competing risks setting. Covariates are introduced for modeling the cure rate, and we allow some covariates to have missing values. We consider only the cases in which the missing covariates are categorical and implement the EM algorithm via the method of weights for maximum likelihood estimation. We present a Monte Carlo simulation experiment to compare the properties of the estimators based on this method with those of the estimators under the complete-case scenario. In this experiment we also evaluate the impact on the parameter estimates of increasing the proportion of immune individuals and of censored individuals among the non-immune ones. We demonstrate the proposed methodology with a real data set involving the time until graduation for the undergraduate course in Statistics at the Universidade Federal do Rio Grande do Norte.
Abstract:
To ensure greater reliability and consistency of the data stored in a database, the data cleaning stage is placed early in the process of Knowledge Discovery in Databases (KDD) and is responsible for eliminating problems and adjusting the data for the later stages, especially the data mining stage. Such problems occur at the instance and schema levels and include missing values, null values, duplicate tuples and values outside the domain, among others. Several algorithms have been developed to perform the cleaning step in databases, some of them specifically to work with the phonetics of words, since a word can be written in different ways. Within this perspective, the original contribution of this work is an optimization, based on multithreading, of an algorithm for the phonetic detection of duplicate tuples in databases, which requires no training data, together with a language-independent environment to support it. © 2011 IEEE.
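The general idea of phonetic duplicate detection with a pool of threads can be sketched as follows. The abstract does not name the phonetic algorithm or the threading design, so this sketch swaps in classic American Soundex (which is English-oriented, unlike the language-independent environment the authors describe) and Python's `ThreadPoolExecutor`; the function names and the `workers` parameter are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Letter-to-digit table of classic American Soundex.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """Return the four-character Soundex key of `word` ('' if no letters)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != previous:
            code += digit
        # 'h' and 'w' do not separate two letters that share a code.
        if ch not in "hw":
            previous = digit
    return (code + "000")[:4]


def find_phonetic_duplicates(names, workers=4):
    """Group names that share a phonetic key; keys are computed by a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        keys = list(pool.map(soundex, names))
    groups = defaultdict(list)
    for name, key in zip(names, keys):
        groups[key].append(name)
    return [group for group in groups.values() if len(group) > 1]


records = ["Robert", "Rupert", "Ashcraft", "Ashcroft", "Tymczak"]
duplicates = find_phonetic_duplicates(records)
```

The sketch needs no training data, in line with the abstract. In CPython the thread pool shows the structure of the approach rather than a real speed-up, because key computation is CPU-bound.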