993 resultados para Unbalanced data
Resumo:
This article deals with classification problems involving unequal probabilities in each class and discusses metrics to systems that use multilayer perceptrons neural networks (MLP) for the task of classifying new patterns. In addition we propose three new pruning methods that were compared to other seven existing methods in the literature for MLP networks. All pruning algorithms presented in this paper have been modified by the authors to do pruning of neurons, in order to produce fully connected MLP networks but being small in its intermediary layer. Experiments were carried out involving the E. coli unbalanced classification problem and ten pruning methods. The proposed methods had obtained good results, actually, better results than another pruning methods previously defined at the MLP neural network area. (C) 2014 Elsevier Ltd. All rights reserved.
Resumo:
Questions of handling unbalanced data considered in this article. As models for classification, PNN and MLP are used. Problem of estimation of model performance in case of unbalanced training set is solved. Several methods (clustering approach and boosting approach) considered as useful to deal with the problem of input data.
Resumo:
Linear mixed effects models are frequently used to analyse longitudinal data, due to their flexibility in modelling the covariance structure between and within observations. Further, it is easy to deal with unbalanced data, either with respect to the number of observations per subject or per time period, and with varying time intervals between observations. In most applications of mixed models to biological sciences, a normal distribution is assumed both for the random effects and for the residuals. This, however, makes inferences vulnerable to the presence of outliers. Here, linear mixed models employing thick-tailed distributions for robust inferences in longitudinal data analysis are described. Specific distributions discussed include the Student-t, the slash and the contaminated normal. A Bayesian framework is adopted, and the Gibbs sampler and the Metropolis-Hastings algorithms are used to carry out the posterior analyses. An example with data on orthodontic distance growth in children is discussed to illustrate the methodology. Analyses based on either the Student-t distribution or on the usual Gaussian assumption are contrasted. The thick-tailed distributions provide an appealing robust alternative to the Gaussian process for modelling distributions of the random effects and of residuals in linear mixed models, and the MCMC implementation allows the computations to be performed in a flexible manner.
Resumo:
In the past decade, the advent of efficient genome sequencing tools and high-throughput experimental biotechnology has lead to enormous progress in the life science. Among the most important innovations is the microarray tecnology. It allows to quantify the expression for thousands of genes simultaneously by measurin the hybridization from a tissue of interest to probes on a small glass or plastic slide. The characteristics of these data include a fair amount of random noise, a predictor dimension in the thousand, and a sample noise in the dozens. One of the most exciting areas to which microarray technology has been applied is the challenge of deciphering complex disease such as cancer. In these studies, samples are taken from two or more groups of individuals with heterogeneous phenotypes, pathologies, or clinical outcomes. these samples are hybridized to microarrays in an effort to find a small number of genes which are strongly correlated with the group of individuals. Eventhough today methods to analyse the data are welle developed and close to reach a standard organization (through the effort of preposed International project like Microarray Gene Expression Data -MGED- Society [1]) it is not unfrequant to stumble in a clinician's question that do not have a compelling statistical method that could permit to answer it.The contribution of this dissertation in deciphering disease regards the development of new approaches aiming at handle open problems posed by clinicians in handle specific experimental designs. In Chapter 1 starting from a biological necessary introduction, we revise the microarray tecnologies and all the important steps that involve an experiment from the production of the array, to the quality controls ending with preprocessing steps that will be used into the data analysis in the rest of the dissertation. While in Chapter 2 a critical review of standard analysis methods are provided stressing most of problems that In Chapter 3 is introduced a method to adress the issue of unbalanced design of miacroarray experiments. In microarray experiments, experimental design is a crucial starting-point for obtaining reasonable results. In a two-class problem, an equal or similar number of samples it should be collected between the two classes. However in some cases, e.g. rare pathologies, the approach to be taken is less evident. We propose to address this issue by applying a modified version of SAM [2]. MultiSAM consists in a reiterated application of a SAM analysis, comparing the less populated class (LPC) with 1,000 random samplings of the same size from the more populated class (MPC) A list of the differentially expressed genes is generated for each SAM application. After 1,000 reiterations, each single probe given a "score" ranging from 0 to 1,000 based on its recurrence in the 1,000 lists as differentially expressed. The performance of MultiSAM was compared to the performance of SAM and LIMMA [3] over two simulated data sets via beta and exponential distribution. The results of all three algorithms over low- noise data sets seems acceptable However, on a real unbalanced two-channel data set reagardin Chronic Lymphocitic Leukemia, LIMMA finds no significant probe, SAM finds 23 significantly changed probes but cannot separate the two classes, while MultiSAM finds 122 probes with score >300 and separates the data into two clusters by hierarchical clustering. We also report extra-assay validation in terms of differentially expressed genes Although standard algorithms perform well over low-noise simulated data sets, multi-SAM seems to be the only one able to reveal subtle differences in gene expression profiles on real unbalanced data. In Chapter 4 a method to adress similarities evaluation in a three-class prblem by means of Relevance Vector Machine [4] is described. In fact, looking at microarray data in a prognostic and diagnostic clinical framework, not only differences could have a crucial role. In some cases similarities can give useful and, sometimes even more, important information. The goal, given three classes, could be to establish, with a certain level of confidence, if the third one is similar to the first or the second one. In this work we show that Relevance Vector Machine (RVM) [2] could be a possible solutions to the limitation of standard supervised classification. In fact, RVM offers many advantages compared, for example, with his well-known precursor (Support Vector Machine - SVM [3]). Among these advantages, the estimate of posterior probability of class membership represents a key feature to address the similarity issue. This is a highly important, but often overlooked, option of any practical pattern recognition system. We focused on Tumor-Grade-three-class problem, so we have 67 samples of grade I (G1), 54 samples of grade 3 (G3) and 100 samples of grade 2 (G2). The goal is to find a model able to separate G1 from G3, then evaluate the third class G2 as test-set to obtain the probability for samples of G2 to be member of class G1 or class G3. The analysis showed that breast cancer samples of grade II have a molecular profile more similar to breast cancer samples of grade I. Looking at the literature this result have been guessed, but no measure of significance was gived before.
Resumo:
In many research areas (such as public health, environmental contamination, and others) one deals with the necessity of using data to infer whether some proportion (%) of a population of interest is (or one wants it to be) below and/or over some threshold, through the computation of tolerance interval. The idea is, once a threshold is given, one computes the tolerance interval or limit (which might be one or two - sided bounded) and then to check if it satisfies the given threshold. Since in this work we deal with the computation of one - sided tolerance interval, for the two-sided case we recomend, for instance, Krishnamoorthy and Mathew [5]. Krishnamoorthy and Mathew [4] performed the computation of upper tolerance limit in balanced and unbalanced one-way random effects models, whereas Fonseca et al [3] performed it based in a similar ideas but in a tow-way nested mixed or random effects model. In case of random effects model, Fonseca et al [3] performed the computation of such interval only for the balanced data, whereas in the mixed effects case they dit it only for the unbalanced data. For the computation of twosided tolerance interval in models with mixed and/or random effects we recomend, for instance, Sharma and Mathew [7]. The purpose of this paper is the computation of upper and lower tolerance interval in a two-way nested mixed effects models in balanced data. For the case of unbalanced data, as mentioned above, Fonseca et al [3] have already computed upper tolerance interval. Hence, using the notions persented in Fonseca et al [3] and Krishnamoorthy and Mathew [4], we present some results on the construction of one-sided tolerance interval for the balanced case. Thus, in order to do so at first instance we perform the construction for the upper case, and then the construction for the lower case.
Resumo:
Resumen: Este trabajo estudia los resultados en matemáticas y lenguaje de 32000 estudiantes en la prueba saber 11 del 2008, de la ciudad de Bogotá. Este análisis reconoce que los individuos se encuentran contenidos en barrios y colegios, pero no todos los individuos del mismo barrio asisten a la misma escuela y viceversa. Con el fin de modelar esta estructura de datos se utilizan varios modelos econométricos, incluyendo una regresión jerárquica multinivel de efectos cruzados. Nuestro objetivo central es identificar en qué medida y que condiciones del barrio y del colegio se correlacionan con los resultados educacionales de la población objetivo y cuáles características de los barrios y de los colegios están más asociadas al resultado en las pruebas. Usamos datos de la prueba saber 11, del censo de colegios c600, del censo poblacional del 2005 y de la policía metropolitana de Bogotá. Nuestras estimaciones muestran que tanto el barrio como el colegio están correlacionados con los resultados en las pruebas; pero el efecto del colegio parece ser mucho más fuerte que el del barrio. Las características del colegio que están más asociadas con el resultado en las pruebas son la educación de los profesores, la jornada, el valor de la pensión, y el contexto socio económico del colegio. Las características de los barrios más asociadas con el resultado en las pruebas son, la presencia de universitarios en la UPZ, un clúster de altos niveles de educación y nivel de crimen en el barrio que se correlaciona negativamente. Los resultados anteriores fueron hallados teniendo en cuenta controles familiares y personales.
Resumo:
São estabelecidas as matrizes necessárias para a realização da análise de variância de experimentos em parcelas subdivididas, com dados não-balanceados e balanceados, quando os tratamentos aplicados às parcelas e os tratamentos aplicados às subparcelas são ambos fatores quantitativos, usando a teoria de modelos lineares e de modelos lineares generalizados. Foi desenvolvido um programa computacional, na linguagem GLIM, para a realização da análise.
Resumo:
Este trabalho tem por objetivo formalizar os termos das respectivas somas de quadrados e hipóteses mais usuais, que são testadas nos modelos com três fatores de efeitos fixos hierarquizados para dados desbalanceados. Discute-se, também, o problema da interpretação de hipóteses associadas às somas de quadrados, bem como comparam-se os resultados fornecidos por alguns softwares estatísticos.
Resumo:
In this paper, the main factors that influence the demand for maritime passenger transportation in the Caribbean were studied. While maritime studies in the Caribbean have focused on infrastructural and operational systems for intensifying trade and movement of goods, there is little information on the movement of persons within the region and its potential to encourage further integration and sustainable development. Data to inform studies and policies in this area are particularly difficult to source. For this study, an unbalanced data set for the 2000-2014 period in 15 destinations with a focus on departing ferry passengers was compiled. Further a demand equation for maritime passenger transportation in the Caribbean using panel data methods was estimated. The results showed that this demand is related to the real fare of the service, international economic activity and the number of passengers arriving in the country by air.
Resumo:
A Bayesian approach to estimating the intraclass correlation coefficient was used for this research project. The background of the intraclass correlation coefficient, a summary of its standard estimators, and a review of basic Bayesian terminology and methodology were presented. The conditional posterior density of the intraclass correlation coefficient was then derived and estimation procedures related to this derivation were shown in detail. Three examples of applications of the conditional posterior density to specific data sets were also included. Two sets of simulation experiments were performed to compare the mean and mode of the conditional posterior density of the intraclass correlation coefficient to more traditional estimators. Non-Bayesian methods of estimation used were: the methods of analysis of variance and maximum likelihood for balanced data; and the methods of MIVQUE (Minimum Variance Quadratic Unbiased Estimation) and maximum likelihood for unbalanced data. The overall conclusion of this research project was that Bayesian estimates of the intraclass correlation coefficient can be appropriate, useful and practical alternatives to traditional methods of estimation. ^
Resumo:
As análises biplot que utilizam os modelos de efeitos principais aditivos com inter- ação multiplicativa (AMMI) requerem matrizes de dados completas, mas, frequentemente os ensaios multiambientais apresentam dados faltantes. Nesta tese são propostas novas metodologias de imputação simples e múltipla que podem ser usadas para analisar da- dos desbalanceados em experimentos com interação genótipo por ambiente (G×E). A primeira, é uma nova extensão do método de validação cruzada por autovetor (Bro et al, 2008). A segunda, corresponde a um novo algoritmo não-paramétrico obtido por meio de modificações no método de imputação simples desenvolvido por Yan (2013). Também é incluído um estudo que considera sistemas de imputação recentemente relatados na literatura e os compara com o procedimento clássico recomendado para imputação em ensaios (G×E), ou seja, a combinação do algoritmo de Esperança-Maximização com os modelos AMMI ou EM-AMMI. Por último, são fornecidas generalizações da imputação simples descrita por Arciniegas-Alarcón et al. (2010) que mistura regressão com aproximação de posto inferior de uma matriz. Todas as metodologias têm como base a decomposição por valores singulares (DVS), portanto, são livres de pressuposições distribucionais ou estruturais. Para determinar o desempenho dos novos esquemas de imputação foram realizadas simulações baseadas em conjuntos de dados reais de diferentes espécies, com valores re- tirados aleatoriamente em diferentes porcentagens e a qualidade das imputações avaliada com distintas estatísticas. Concluiu-se que a DVS constitui uma ferramenta útil e flexível na construção de técnicas eficientes que contornem o problema de perda de informação em matrizes experimentais.
Resumo:
The Maser thesis is devoted to developing a model to technical state of gas turbine engine estimation. The approaches to preparation data, especially to handle unbalanced data were presented in the thesis. In order to efficient estimation of model performance, the special metric was chosen. Goal of the master thesis is analyzing of monitoring parameters data and developing a model of technical state of GTE estimation based on the data.
Resumo:
An emerging issue in the field of astronomy is the integration, management and utilization of databases from around the world to facilitate scientific discovery. In this paper, we investigate application of the machine learning techniques of support vector machines and neural networks to the problem of amalgamating catalogues of galaxies as objects from two disparate data sources: radio and optical. Formulating this as a classification problem presents several challenges, including dealing with a highly unbalanced data set. Unlike the conventional approach to the problem (which is based on a likelihood ratio) machine learning does not require density estimation and is shown here to provide a significant improvement in performance. We also report some experiments that explore the importance of the radio and optical data features for the matching problem.
Resumo:
P>The determination of normal parameters is an important procedure in the evaluation of the stomatognathic system. We used the surface electromyography standardization protocol described by Ferrario et al. (J Oral Rehabil. 2000;27:33-40, 2006;33:341) to determine reference values of the electromyographic standardized indices for the assessment of muscular symmetry (left and right side, percentage overlapping coefficient, POC), potential lateral displacing components (unbalanced contractile activities of contralateral masseter and temporalis muscles, TC), relative activity (most prevalent pair of masticatory muscles, ATTIV) and total activity (integrated areas of the electromyographic potentials over time, IMPACT) in healthy Brazilian young adults, and the relevant data reproducibility. Electromyography of the right and left masseter and temporalis muscles was performed during maximum teeth clenching in 20 healthy subjects (10 women and 10 men, mean age 23 years, s.d. 3), free from periodontal problems, temporomandibular disorders, oro-facial myofunctional disorder, and with full permanent dentition (28 teeth at least). Data reproducibility was computed for 75% of the sample. The values obtained were POC Temporal (88 center dot 11 +/- 1 center dot 45%), POC masseter (87 center dot 11 +/- 1 center dot 60%), TC (8 center dot 79 +/- 1 center dot 20%), ATTIV (-0 center dot 33 +/- 9 center dot 65%) and IMPACT (110 center dot 40 +/- 23 center dot 69 mu V/mu V center dot s %). There were no statistical differences between test and retest values (P > 0 center dot 05). The Technical Errors of Measurement (TEM) for 50% of subjects assessed during the same session were 1 center dot 5, 1 center dot 39, 1 center dot 06, 3 center dot 83 and 10 center dot 04. For 25% of the subjects assessed after a 6-month interval, the TEM were 0 center dot 80, 1 center dot 03, 0 center dot 73, 12 center dot 70 and 19 center dot 10. For all indices, there was good reproducibility. These electromyographic indices could be used in the assessment of patients with stomatognathic dysfunction.