906 results for Exploratory statistical data analysis
Abstract:
Due to the imprecise nature of biological experiments, biological data is often characterized by the presence of redundant and noisy data. This may be due to errors that occurred during data collection, such as contamination of laboratory samples. This is the case for gene expression data, where the equipment and tools currently used frequently produce noisy biological data. Machine Learning algorithms have been successfully used in gene expression data analysis. Although many Machine Learning algorithms can deal with noise, detecting and removing noisy instances from the training data set can help the induction of the target hypothesis. This paper evaluates the use of distance-based pre-processing techniques for noise detection in gene expression data classification problems. The evaluation analyzes how effectively the investigated techniques remove noisy data, measured by the accuracy obtained by different Machine Learning classifiers over the pre-processed data.
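The abstract does not name the specific distance-based techniques it evaluates. As an illustration only, the sketch below implements one well-known filter of this family, Edited Nearest Neighbor: an instance is treated as noisy when its label disagrees with the majority label of its k nearest neighbors. The function names, the toy data, and the choice of k = 3 are assumptions made for the example.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def enn_filter(X, y, k=3):
    """Return the indices of instances kept by one Edited Nearest Neighbor pass.

    An instance is dropped when the majority label among its k nearest
    neighbors (excluding itself) differs from its own label.
    """
    kept = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        # Sort all other instances by distance and keep the k closest.
        neighbors = sorted(
            (euclidean(xi, xj), j) for j, xj in enumerate(X) if j != i
        )[:k]
        votes = Counter(y[j] for _, j in neighbors)
        majority_label, _ = votes.most_common(1)[0]
        if majority_label == yi:
            kept.append(i)
    return kept


# Toy data (hypothetical): two well-separated groups, with instance 3
# carrying label "b" although it sits inside the "a" group.
X = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.15, 0.1),
     (5.0, 5.1), (5.2, 4.9), (4.9, 5.3), (5.1, 5.0)]
y = ["a", "a", "a", "b", "b", "b", "b", "b"]
kept = enn_filter(X, y, k=3)
```

A classifier would then be trained only on the instances in `kept`, and its accuracy compared with that obtained on the unfiltered data.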
Abstract:
The objective was to determine the prevalence of self-reported hearing loss in the urban population of four localities in the State of São Paulo, Brazil, and to study the attributed causes and sociodemographic variables. A population-based cross-sectional study was carried out with data on the population aged 12 years or older residing in the four localities in 2001 and 2002. A total of 5,250 subjects participated, selected by stratified, two-stage cluster probability sampling. Data analysis was exploratory, including bivariate analysis and multiple logistic regression. The prevalence of hearing loss was 5.21%, and it was higher among those aged over 59 years (18.7%), those who reported illness in the 15 days before the interview (8.4%), those with a common mental disorder (8.85%), and those who had used medication in the previous 3 days (8.45%). Studying the factors associated with hearing loss guides health interventions so that they meet the real needs of the population, especially in primary care. More population-based studies focused on hearing are needed, since publications in this area are scarce in Brazil.
Abstract:
The objective of the study was to present the within-person variance fraction for adjusting the nutrient intake distribution of adults and older adults. Data came from a population-based survey with a representative sample (n = 511) of individuals aged 19 years or older in the municipality of São Paulo, SP, in 2007. The within-person variance fraction was obtained by the method proposed by Iowa State University. Differences in the within-person variance fractions of nutrients were observed by sex. These values should be used to adjust the distribution of nutrient intake, since failing to use them may bias the analysis and interpretation of data.
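To make the quantity concrete, the sketch below estimates a within-person variance fraction from a balanced one-way random-effects decomposition (the same number of recall days per person). This is a simplified illustration and not the Iowa State University method itself, which involves further steps that the abstract does not describe; the function name and the intake values are hypothetical.

```python
def within_person_variance_fraction(intakes):
    """Within-person share of total variance for balanced repeated measures.

    `intakes` is a list of per-person lists, each holding the same number
    of daily intake values (at least two days, at least two persons).
    """
    n = len(intakes)          # number of persons
    k = len(intakes[0])       # replicate days per person
    person_means = [sum(days) / k for days in intakes]
    grand_mean = sum(person_means) / n

    # Mean squares from a one-way ANOVA with person as the grouping factor.
    ms_within = sum(
        (x - m) ** 2 for days, m in zip(intakes, person_means) for x in days
    ) / (n * (k - 1))
    ms_between = k * sum((m - grand_mean) ** 2 for m in person_means) / (n - 1)

    var_within = ms_within
    # The between-person component is truncated at zero if it comes out negative.
    var_between = max((ms_between - ms_within) / k, 0.0)
    return var_within / (var_within + var_between)


# Hypothetical intakes (arbitrary units): three persons, two days each.
example = [[10.0, 12.0], [20.0, 22.0], [30.0, 28.0]]
fraction = within_person_variance_fraction(example)
```

A fraction close to 1 means that day-to-day variation dominates, so the unadjusted distribution of single-day intakes would be much wider than the distribution of usual intakes.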
Abstract:
We describe the use of factor analysis in assessing the dietary habits of Japanese-Brazilians. Dietary data from 1,283 participants of a cross-sectional study were used. Based on statistical criteria and the conceptual meaning of the identified patterns, scores were generated that defined dietary profiles (Japanese or Western). The paired Student's t-test and linear and Poisson regression models were used to examine the relationships of these scores with, respectively, generation, body mass index (BMI), waist circumference, and the presence of metabolic syndrome. First-generation subjects, compared with second-generation subjects, had higher scores for the Japanese profile and lower scores for the Western profile. The Western profile was associated with BMI (p = 0.001), waist circumference (p = 0.023), and metabolic syndrome (p < 0.05). We conclude that the scores discriminated between subjects who do and do not maintain a traditional Japanese lifestyle, and that the adoption of Western habits was associated with higher BMI, larger waist circumference, and the presence of metabolic syndrome.
Abstract:
Gene clustering is a useful exploratory technique to group together genes with similar expression levels under distinct cell cycle phases or distinct conditions. It helps the biologist to identify potentially meaningful relationships between genes. In this study, we propose a clustering method based on multivariate normal mixture models, where the number of clusters is predicted via sequential hypothesis tests: at each step, the method considers a mixture model of m components (m = 2 in the first step) and tests whether it should in fact be m - 1. If the hypothesis is rejected, m is increased and a new test is carried out. The method continues (increasing m) until the hypothesis is accepted. The theoretical core of the method is the full Bayesian significance test, an intuitive Bayesian approach that needs neither model complexity penalization nor positive probabilities for sharp hypotheses. Numerical experiments were based on a cDNA microarray dataset consisting of expression levels of 205 genes belonging to four functional categories, for 10 distinct strains of Saccharomyces cerevisiae. To analyze the method's sensitivity to data dimension, we performed principal components analysis on the original dataset and predicted the number of classes using 2 to 10 principal components. Compared to Mclust (model-based clustering), our method shows more consistent results.
Abstract:
Aims. We derive lists of proper motions and kinematic membership probabilities for 49 open clusters and possible open clusters in the zone of the Bordeaux PM2000 proper motion catalogue (+11° ≤ δ ≤ +18°). We test different parametrisations of the proper motion and position distribution functions and select the most successful one. In the light of those results, we analyse some objects individually. Methods. We differentiate between cluster and field member stars, and assign membership probabilities, by applying a new and fully automated method based on both parametrisations of the proper motion and position distribution functions, and genetic algorithm optimization heuristics associated with a derivative-based hill climbing algorithm for the likelihood optimization. Results. We present a catalogue comprising kinematic parameters and associated membership probability lists for 49 open clusters and possible open clusters in the Bordeaux PM2000 catalogue region. We note that this is the first determination of proper motions for five open clusters. We confirm the non-existence of two kinematic populations in the region of 15 previously suspected non-existent objects.
Abstract:
Aims. In this work, we describe the pipeline for the fast supervised classification of light curves observed by the CoRoT exoplanet CCDs. We present the classification results obtained for the first four measured fields, which represent one year of in-orbit operation. Methods. The basis of the adopted supervised classification methodology has been described in detail in a previous paper, as has its application to the OGLE database. Here, we present the modifications of the algorithms and of the training set to optimize the performance when applied to the CoRoT data. Results. Classification results are presented for the observed fields IRa01, SRc01, LRc01, and LRa01 of the CoRoT mission. Statistics on the number of variables and the number of objects per class are given and typical light curves of high-probability candidates are shown. We also report on new stellar variability types discovered in the CoRoT data. The full classification results are publicly available.
Abstract:
Recurrences are close returns of a given state in a time series, and can be used to identify different dynamical regimes and other related phenomena, being particularly suited for analyzing experimental data. In this work, we use recurrence quantification analysis to investigate dynamical patterns in scalar data series obtained from measurements of floating potential and ion saturation current at the plasma edge of the Tokamak Chauffage Alfvén Brésilien [R. M. O. Galvão, Plasma Phys. Controlled Fusion 43, 1181 (2001)]. We consider plasma discharges with and without the application of radial electric bias, and also with two different regimes of current ramp. Our results indicate that biasing improves confinement through destroying highly recurrent regions within the plasma column that enhance particle and heat transport.
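As a minimal sketch of what recurrence quantification measures, the code below builds a recurrence matrix for a scalar series and computes two standard measures: the recurrence rate and the determinism (the share of recurrence points lying on diagonal lines). It assumes no time-delay embedding and a fixed distance threshold `eps`; the embedding parameters, thresholds, and measures actually used in the study are not given in the abstract, and the example series is hypothetical.

```python
def recurrence_matrix(series, eps):
    """R[i][j] = 1 when states i and j are closer than eps, else 0."""
    n = len(series)
    return [
        [1 if abs(series[i] - series[j]) <= eps else 0 for j in range(n)]
        for i in range(n)
    ]


def recurrence_rate(R):
    """Fraction of all matrix entries that are recurrence points."""
    n = len(R)
    return sum(map(sum, R)) / (n * n)


def determinism(R, lmin=2):
    """Share of off-diagonal recurrence points on diagonal lines >= lmin.

    Only the upper triangle is scanned because R is symmetric; the main
    diagonal (trivial self-recurrence) is excluded.
    """
    n = len(R)
    total = 0
    on_lines = 0
    for offset in range(1, n):
        run = 0
        for i in range(n - offset):
            if R[i][i + offset]:
                total += 1
                run += 1
            else:
                if run >= lmin:
                    on_lines += run
                run = 0
        if run >= lmin:          # close a line that reaches the matrix edge
            on_lines += run
    return on_lines / total if total else 0.0


# Hypothetical strictly periodic signal: every recurrence lies on a diagonal.
signal = [0.0, 1.0, 2.0] * 3
R = recurrence_matrix(signal, eps=0.1)
rr = recurrence_rate(R)
det = determinism(R)
```

For this periodic toy signal, all recurrence points fall on long diagonal lines, which is the signature of regular dynamics; noisy or turbulent signals produce shorter, more scattered structures.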
Abstract:
Alternative splicing of gene transcripts greatly expands the functional capacity of the genome, and certain splice isoforms may indicate specific disease states such as cancer. Splice junction microarrays interrogate thousands of splice junctions, but data analysis is difficult and error prone because of the increased complexity compared to differential gene expression analysis. We present Rank Change Detection (RCD) as a method to identify differential splicing events based upon a straightforward probabilistic model comparing the over- or underrepresentation of two or more competing isoforms. RCD has advantages over commonly used methods because it is robust to false positive errors due to nonlinear trends in microarray measurements. Further, RCD does not depend on prior knowledge of splice isoforms, yet it takes advantage of the inherent structure of mutually exclusive junctions, and it is conceptually generalizable to other types of splicing arrays or RNA-Seq. RCD specifically identifies the biologically important cases when a splice junction becomes more or less prevalent compared to other mutually exclusive junctions. The example data is from different cell lines of glioblastoma tumors assayed with Agilent microarrays.
Abstract:
A rapid method for classification of mineral waters is proposed. The discrimination power was evaluated by a novel combination of chemometric data analysis and qualitative multi-elemental fingerprints of mineral water samples acquired from different regions of the Brazilian territory. The classification of mineral waters was assessed using only the wavelength emission intensities obtained by inductively coupled plasma optical emission spectrometry (ICP OES), monitoring different lines of Al, B, Ba, Ca, Cl, Cu, Co, Cr, Fe, K, Mg, Mn, Na, Ni, P, Pb, S, Sb, Si, Sr, Ti, V, and Zn, with Be, Dy, Gd, In, La, Sc and Y as internal standards. Data acquisition was done under robust (RC) and non-robust (NRC) conditions. Also, the combination of signal intensities of two or more emission lines for each element was evaluated instead of the individual lines. Two classification algorithms, k-nearest neighbor (kNN) and soft independent modeling of class analogy (SIMCA), and two preprocessing algorithms, autoscaling and Pareto scaling, were evaluated for their ability to differentiate between the various samples in each approach tested (combination of robust or non-robust conditions with use of individual lines or sum of the intensities of emission lines). It was shown that qualitative ICP OES fingerprinting in combination with multivariate analysis is a promising analytical tool that has potential to become a recognized procedure for rapid authenticity and adulteration testing of mineral water samples or other material whose physicochemical properties (or origin) are directly related to mineral content.
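The two preprocessing steps named in the abstract have standard definitions, sketched below for a single variable (for example, the intensities of one emission line across samples). Autoscaling divides the mean-centered values by the standard deviation, while Pareto scaling divides by its square root. The function names and the sample intensities are hypothetical, and the use of the sample standard deviation is an assumption.

```python
import math
import statistics


def autoscale(column):
    """Mean-center and divide by the sample standard deviation."""
    mean = statistics.fmean(column)
    sd = statistics.stdev(column)
    return [(x - mean) / sd for x in column]


def pareto_scale(column):
    """Mean-center and divide by the square root of the standard deviation."""
    mean = statistics.fmean(column)
    sd = statistics.stdev(column)
    return [(x - mean) / math.sqrt(sd) for x in column]


# Hypothetical emission intensities of one line in three samples.
intensities = [2.0, 4.0, 6.0]
auto = autoscale(intensities)
pareto = pareto_scale(intensities)
```

Autoscaling gives every variable unit variance, so weak and strong emission lines weigh equally in a distance-based classifier such as kNN; Pareto scaling reduces, but does not remove, the dominance of high-intensity lines.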
Abstract:
The inverse Weibull distribution has the ability to model failure rates which are quite common in reliability and biological studies. A three-parameter generalized inverse Weibull distribution with decreasing and unimodal failure rate is introduced and studied. We provide a comprehensive treatment of the mathematical properties of the new distribution including expressions for the moment generating function and the rth generalized moment. The mixture model of two generalized inverse Weibull distributions is investigated. The identifiability property of the mixture model is demonstrated. For the first time, we propose a location-scale regression model based on the log-generalized inverse Weibull distribution for modeling lifetime data. In addition, we develop some diagnostic tools for sensitivity analysis. Two applications to real data are given to illustrate the potential of the proposed regression model.
Abstract:
In this paper, we compare three residuals to assess departures from the error assumptions as well as to detect outlying observations in log-Burr XII regression models with censored observations. These residuals can also be used for the log-logistic regression model, which is a special case of the log-Burr XII regression model. For different parameter settings, sample sizes and censoring percentages, various simulation studies are performed and the empirical distribution of each residual is displayed and compared with the standard normal distribution. These studies suggest that the residual analysis usually performed in normal linear regression models can be straightforwardly extended to the modified martingale-type residual in log-Burr XII regression models with censored data.
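The sketch below illustrates the kind of residual discussed, using the log-logistic special case mentioned in the abstract. It assumes the usual definitions from the survival analysis literature: the martingale residual is r_M = δ + log S(y), and the modified martingale-type (deviance) residual is r_D = sign(r_M) · sqrt(-2[r_M + δ log(δ - r_M)]), where δ is 1 for an observed failure and 0 for a censored observation. The paper's exact definitions may differ, and the parameter values are hypothetical.

```python
import math


def loglogistic_survival(y, mu, sigma):
    """Survival function of log-time y under a log-logistic regression model."""
    return 1.0 / (1.0 + math.exp((y - mu) / sigma))


def martingale_residual(y, delta, mu, sigma):
    """r_M = delta + log S(y); delta = 1 for failure, 0 for censoring."""
    return delta + math.log(loglogistic_survival(y, mu, sigma))


def modified_martingale_residual(y, delta, mu, sigma):
    """Deviance-type transform that makes r_M more symmetric around zero."""
    r_m = martingale_residual(y, delta, mu, sigma)
    # For delta = 0 the log term vanishes; for delta = 1, delta - r_m = -log S > 0.
    log_term = delta * math.log(delta - r_m) if delta else 0.0
    inner = max(-2.0 * (r_m + log_term), 0.0)   # guard tiny negative rounding
    return math.copysign(math.sqrt(inner), r_m)


# Hypothetical observation at the fitted location (S = 0.5), sigma = 1.
r_censored = modified_martingale_residual(y=0.0, delta=0, mu=0.0, sigma=1.0)
r_failure = modified_martingale_residual(y=0.0, delta=1, mu=0.0, sigma=1.0)
```

In a simulation study of the kind described, such residuals would be computed for many generated samples and their empirical distribution compared with the standard normal, for instance through normal probability plots.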
Abstract:
Joint generalized linear models and double generalized linear models (DGLMs) were designed to model outcomes for which the variability can be explained using factors and/or covariates. When such factors operate, the usual normal regression models, which inherently exhibit constant variance, will under-represent variation in the data and hence may lead to erroneous inferences. For count and proportion data, such noise factors can generate a so-called overdispersion effect, and the use of binomial and Poisson models underestimates the variability and, consequently, incorrectly indicates significant effects. In this manuscript, we propose a DGLM from a Bayesian perspective, focusing on the case of proportion data, where the overdispersion can be modeled using a random effect that depends on some noise factors. The posterior joint density function was sampled using Markov chain Monte Carlo algorithms, allowing inferences over the model parameters. An application to a data set on apple tissue culture is presented, for which it is shown that the Bayesian approach is quite feasible, even when limited prior information is available, thereby generating valuable insight for researchers about their experimental results.
Abstract:
Traditionally the basal ganglia have been implicated in motor behavior, as they are involved in both the execution of automatic actions and the modification of ongoing actions in novel contexts. With respect to cognition, the role of the basal ganglia has not been defined as explicitly. Relative to linguistic processes, contemporary theories of subcortical participation in language have endorsed a role for the globus pallidus internus (GPi) in the control of lexical-semantic operations. However, attempts to empirically validate these postulates have been largely limited to neuropsychological investigations of verbal fluency abilities subsequent to pallidotomy. We evaluated the impact of bilateral posteroventral pallidotomy (BPVP) on language function across a range of general and high-level linguistic abilities, and validated/extended working theories of pallidal participation in language. Comprehensive linguistic profiles were compiled up to 1 month before and 3 months after BPVP in 6 subjects with Parkinson's disease (PD). Commensurate linguistic profiles were also gathered over a 3-month period for a nonsurgical control cohort of 16 subjects with PD and a group of 16 non-neurologically impaired controls (NC). Nonparametric between-groups comparisons were conducted and reliable change indices calculated, relative to baseline/3-month follow-up difference scores. Group-wise statistical comparisons between the three groups failed to reveal significant postoperative changes in language performance. Case-by-case data analysis relative to clinically consequential change indices revealed reliable alterations in performance across several language variables as a consequence of BPVP. These findings lend support to models of subcortical participation in language, which promote a role for the GPi in lexical-semantic manipulation mechanisms.
Concomitant improvements and decrements in postoperative performance were interpreted within the context of additive and subtractive postlesional effects. Relative to parkinsonian cohorts, clinically reliable versus statistically significant changes on a case by case basis may provide the most accurate method of characterizing the way in which pathophysiologically divergent basal ganglia linguistic circuits respond to BPVP.
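The abstract does not state which reliable change index was used. As an illustration, the sketch below implements the widely cited Jacobson-Truax form, in which the difference between follow-up and baseline scores is divided by the standard error of the difference, derived from the control-group standard deviation and the test-retest reliability. All numeric values are hypothetical.

```python
import math


def reliable_change_index(baseline, follow_up, sd, reliability):
    """Jacobson-Truax RCI: score change in units of the SE of the difference."""
    sem = sd * math.sqrt(1.0 - reliability)     # standard error of measurement
    se_diff = math.sqrt(2.0) * sem              # SE of a difference score
    return (follow_up - baseline) / se_diff


def is_reliable_change(rci, critical=1.96):
    """Flag changes unlikely (two-sided 5% level) to reflect measurement error."""
    return abs(rci) > critical


# Hypothetical language score before and 3 months after surgery.
rci = reliable_change_index(baseline=40.0, follow_up=50.0, sd=5.0, reliability=0.9)
changed = is_reliable_change(rci)
```

Because the index is computed per subject, it can flag individual improvements and declines that cancel out in a group mean, which is consistent with the contrast the abstract draws between group-wise and case-by-case results.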
Abstract:
The hepatic disposition and metabolite kinetics of a homologous series of diflunisal O-acyl esters (acetyl, butanoyl, pentanoyl, and hexanoyl) were determined using a single-pass perfused in situ rat liver preparation. The experiments were conducted using 2% BSA Krebs-Henseleit buffer (pH 7.4), and perfusions were performed at 30 mL/min in each liver. O-Acyl esters of diflunisal and pregenerated diflunisal were injected separately into the portal vein. The venous outflow samples containing the esters and metabolite diflunisal were analyzed by high performance liquid chromatography (HPLC). The normalized outflow concentration-time profiles for each parent ester and the formed metabolite, diflunisal, were analyzed using statistical moments analysis and the two-compartment dispersion model. Data (presented as mean ± standard error for triplicate experiments) were compared using repeated-measures ANOVA, with a significance level of P < 0.05. The hepatic availability (AUC'), the fraction of the injected dose recovered in the outflowing perfusate, for O-acetyldiflunisal (C2D = 0.21 ± 0.03) was significantly lower than that of the other esters (0.34-0.38). However, R_N/f_u, the removal efficiency number R_N divided by the unbound fraction in perfusate f_u, which represents the removal efficiency of unbound ester by the liver, was significantly higher for the most lipophilic ester (O-hexanoyldiflunisal, C6D = 16.50 ± 0.22) compared to the other members of the series (9.57 to 11.17). The most lipophilic ester, C6D, had the largest permeability surface area (PS) product (94.52 ± 38.20 mL min^-1 g^-1 liver) and tissue distribution value V_T (35.62 ± 11.33 mL g^-1 liver) in this series. The mean transit times (MTT) of these O-acyl esters of diflunisal were not significantly different from one another. However, the metabolite diflunisal MTTs tended to increase with the increase in the parent ester lipophilicity (11.41 ± 2.19 s for C2D to 38.63 ± 9.81 s for C6D).
The two-compartment dispersion model equations adequately described the outflow profiles for the parent esters and the metabolite diflunisal formed from the O-acyl esters of diflunisal in the liver.
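As a minimal sketch of the statistical moments part of such an analysis, the code below computes the area under an outflow concentration-time curve (AUC) and the mean transit time as AUMC/AUC, using the trapezoidal rule. It assumes the profile has returned to baseline within the sampled window (no tail extrapolation) and applies no correction for transit through tubing; the function names and the data are hypothetical, not values from the study.

```python
def trapezoid(t, y):
    """Trapezoidal-rule integral of y over the sampling times t."""
    return sum(
        (t[i + 1] - t[i]) * (y[i] + y[i + 1]) / 2.0 for i in range(len(t) - 1)
    )


def moments(t, conc):
    """Return (AUC, MTT), where MTT = AUMC / AUC."""
    auc = trapezoid(t, conc)
    aumc = trapezoid(t, [ti * ci for ti, ci in zip(t, conc)])
    return auc, aumc / auc


# Hypothetical symmetric outflow profile sampled every second.
times = [0.0, 1.0, 2.0, 3.0, 4.0]
outflow = [0.0, 2.0, 4.0, 2.0, 0.0]
auc, mtt = moments(times, outflow)
```

For a normalized outflow profile, the AUC corresponds to the fraction of the dose recovered, and a longer MTT indicates that the solute spends more time in the liver before leaving in the perfusate.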