980 results for Data errors
Abstract:
Due to the imprecise nature of biological experiments, biological data sets are often characterized by the presence of redundant and noisy instances. This may be due to errors that occurred during data collection, such as contamination of laboratory samples. This is the case for gene expression data, where the equipment and tools currently in use frequently produce noisy measurements. Machine Learning algorithms have been successfully used in gene expression data analysis. Although many Machine Learning algorithms can deal with noise, detecting and removing noisy instances from the training data set can help the induction of the target hypothesis. This paper evaluates the use of distance-based pre-processing techniques for noise detection in gene expression data classification problems. The evaluation analyzes the effectiveness of the investigated techniques in removing noisy data, measured by the accuracy obtained by different Machine Learning classifiers on the pre-processed data.
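The abstract does not name the specific distance-based filter, so the following is only an illustrative sketch of one common choice: an edited-nearest-neighbour rule that drops training instances whose k nearest neighbours mostly carry a different class label, after which any classifier is trained on the cleaned set. Function and parameter names are ours, not the paper's.

```python
# Illustrative sketch of a distance-based noise filter (edited nearest neighbours).
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.tree import DecisionTreeClassifier

def knn_noise_filter(X, y, k=5, min_agreement=0.5):
    """Keep instances whose k nearest neighbours mostly agree with their label."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # k+1: each point is its own neighbour
    _, idx = nn.kneighbors(X)
    neighbour_labels = y[idx[:, 1:]]                  # drop the self-match in column 0
    agreement = (neighbour_labels == y[:, None]).mean(axis=1)
    return agreement >= min_agreement                 # boolean mask of instances to keep

# Hypothetical usage on an expression matrix X (samples x genes) with labels y:
# keep = knn_noise_filter(X_train, y_train)
# clf = DecisionTreeClassifier().fit(X_train[keep], y_train[keep])
```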
Abstract:
Background: The inherent complexity of statistical methods and clinical phenomena compels researchers with diverse domains of expertise to work in interdisciplinary teams, where no member has complete knowledge of their counterparts' fields. As a result, knowledge exchange may often be characterized by miscommunication, leading to misinterpretation and ultimately to errors in research and even clinical practice. Although communication has a central role in interdisciplinary collaboration and miscommunication can have a negative impact on research processes, to the best of our knowledge no study has yet explored how data analysis specialists and clinical researchers communicate over time. Methods/Principal Findings: We conducted qualitative analysis of encounters between clinical researchers and data analysis specialists (epidemiologist, clinical epidemiologist, and data mining specialist). These encounters were recorded and systematically analyzed using a grounded theory methodology for extraction of emerging themes, followed by data triangulation and analysis of negative cases for validation. A policy analysis was then performed using a system dynamics methodology, looking for potential interventions to improve this process. Four major emerging themes were found. Definitions using lay language were frequently employed as a way to bridge the language gap between the specialties. Thought experiments presented a series of "what if" situations that helped clarify how the method or information from the other field would behave if exposed to alternative situations, ultimately aiding in explaining their main objective. Metaphors and analogies were used to translate concepts across fields, from the unfamiliar to the familiar. Prolepsis was used to anticipate study outcomes, thus helping specialists understand the current context based on an understanding of their final goal. Conclusion/Significance: The communication between clinical researchers and data analysis specialists presents multiple challenges that can lead to errors.
Abstract:
Background: There are several studies in the literature describing measurement error in gene expression data and several others about regulatory network models. However, only a small fraction combine measurement error with mathematical regulatory network models and show how to identify these networks under different noise levels. Results: This article investigates the effects of measurement error on the estimation of the parameters in regulatory networks. Simulation studies indicate that, in both time series (dependent) and non-time series (independent) data, the measurement error strongly affects the estimated parameters of the regulatory network models, biasing them as predicted by the theory. Moreover, when testing the parameters of the regulatory network models, p-values computed by ignoring the measurement error are not reliable, since the false positive rate is not controlled under the null hypothesis. In order to overcome these problems, we present an improved version of the Ordinary Least Squares estimator for independent (regression models) and dependent (autoregressive models) data when the variables are subject to noise. Measurement error estimation procedures for microarrays are also described. Simulation results show that both corrected methods perform better than the standard ones (i.e., those ignoring measurement error). The proposed methodologies are illustrated using microarray data from lung cancer patients and mouse liver time series data. Conclusions: Measurement error severely affects the identification of regulatory network models; it must therefore be reduced or taken into account in order to avoid erroneous conclusions. This could be one of the reasons for the high biological false positive rates observed in actual regulatory network models.
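The bias referred to above is the classical attenuation effect from errors-in-variables regression. A minimal sketch in generic notation (not the paper's own symbols):

```latex
% Classical errors-in-variables model: the true regressor x is observed as w = x + u,
% with measurement-error variance \sigma_u^2 independent of x and of the outcome noise:
y = \beta x + \varepsilon, \qquad w = x + u .
% Ignoring the measurement error attenuates OLS toward zero:
\operatorname{plim}\,\hat\beta_{\mathrm{OLS}} = \beta\,\frac{\sigma_x^2}{\sigma_x^2 + \sigma_u^2} .
% A corrected estimator, assuming \sigma_u^2 is known or estimated (e.g. from replicates),
% rescales the sample moments:
\hat\beta_{\mathrm{corr}} = \frac{S_{wy}}{S_{ww} - \hat\sigma_u^2} .
```

The corrected regression and autoregressive estimators in the paper generalize this idea; the expressions above are only the simplest one-regressor case.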
Abstract:
This work presents an automated system for the measurement of form errors of mechanical components using an industrial robot. A three-probe error separation technique was employed to allow decoupling between the measured form error and the errors introduced by the robotic system. A mathematical model of the measuring system was developed to provide inspection results by means of the solution of a system of linear equations. A new self-calibration procedure, which employs redundant data from several runs, minimizes the influence of the probes' zero-adjustment on the final result. Experimental tests on the measurement of straightness errors of mechanical components were carried out and demonstrated the effectiveness of the methodology. (C) 2007 Elsevier Ltd. All rights reserved.
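For context, a common formulation of the three-probe (sequential three-point) separation idea, shown here with equal probe spacing d as an illustrative sketch rather than the paper's exact model: each probe reads the surface profile f plus the carriage's translation error z and pitch error theta, and an equally weighted second difference cancels both motion errors.

```latex
% Probe readings at scan position x (three probes spaced d apart on the same carriage):
m_1(x) = f(x)     + z(x)
m_2(x) = f(x+d)   + z(x) + d\,\theta(x)
m_3(x) = f(x+2d)  + z(x) + 2d\,\theta(x)
% The combination below eliminates z(x) and \theta(x), leaving a second difference of the
% profile, which is recovered by double integration up to an arbitrary straight line:
m_1(x) - 2\,m_2(x) + m_3(x) = f(x) - 2 f(x+d) + f(x+2d)
```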
Abstract:
Estimation of Taylor's power law for species abundance data may be performed by linear regression of the log empirical variances on the log means, but this method suffers from a problem of bias for sparse data. We show that the bias may be reduced by using a bias-corrected Pearson estimating function. Furthermore, we investigate a more general regression model allowing for site-specific covariates. This method may be efficiently implemented using a Newton scoring algorithm, with standard errors calculated from the inverse Godambe information matrix. The method is applied to a set of biomass data for benthic macrofauna from two Danish estuaries. (C) 2011 Elsevier B.V. All rights reserved.
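For reference, Taylor's power law and the conventional log-log regression the abstract refers to, in generic notation (not the paper's):

```latex
% Taylor's power law links the variance and mean of abundance across sites:
\sigma_i^2 = a\,\mu_i^{\,b} .
% The conventional estimate regresses log empirical variances on log means,
\log s_i^2 = \log a + b \log \bar{x}_i + \text{error},
% which is biased for sparse counts because E[\log s_i^2] \neq \log E[s_i^2]
% when the empirical variances are themselves highly variable; roughly speaking,
% the bias-corrected Pearson estimating function works with the variance function
% \sigma_i^2 = a\,\mu_i^{\,b} directly rather than with its logarithm.
```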
Abstract:
Potential errors in the application of mixture theory to the analysis of multiple-frequency bioelectrical impedance data for the determination of body fluid volumes are assessed. Potential sources of error include conductive length, tissue fluid resistivity, body density, weight and technical errors of measurement. Inclusion of inaccurate estimates of body density and weight introduces errors of typically < +/-3%, but incorrect assumptions regarding conductive length or fluid resistivities may each incur errors of up to 20%.
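As a hedged illustration of why the length and resistivity assumptions dominate (the mixture-theory equations themselves are not reproduced in the abstract): to first order, each input's relative error is scaled by its log-sensitivity in the volume formula, so inputs entering with exponents larger than one are amplified while weakly weighted inputs such as body density contribute little.

```latex
% First-order relative error propagation for a volume estimate V = g(L, \rho, D_b, W, R):
\frac{\delta V}{V} \;\approx\; \sum_i \left|\frac{\partial \ln g}{\partial \ln x_i}\right| \frac{\delta x_i}{x_i} .
% For example, if V \propto L^{k} with k > 1, a relative error \epsilon in the assumed
% conductive length L produces roughly k\,\epsilon in the estimated volume.
```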
Abstract:
We present a method of estimating HIV incidence rates in epidemic situations from data on age-specific prevalence and changes in the overall prevalence over time. The method is applied to women attending antenatal clinics in Hlabisa, a rural district of KwaZulu/Natal, South Africa, where transmission of HIV is overwhelmingly through heterosexual contact. A model which gives age-specific prevalence rates in the presence of a progressing epidemic is fitted to prevalence data for 1998 using maximum likelihood methods and used to derive the age-specific incidence. Error estimates are obtained using a Monte Carlo procedure. Although the method is quite general, some simplifying assumptions are made concerning the form of the risk function, and sensitivity analyses are performed to explore the importance of these assumptions. The analysis shows that in 1998 the annual incidence of infection per susceptible woman increased from 5.4 per cent (3.3-8.5 per cent; here and elsewhere ranges give 95 per cent confidence limits) at age 15 years to 24.5 per cent (20.6-29.1 per cent) at age 22 years and declined to 1.3 per cent (0.5-2.9 per cent) at age 50 years. Standardized to a uniform age distribution, the overall incidence per susceptible woman aged 15 to 59 was 11.4 per cent (10.0-13.1 per cent); per woman in the population it was 8.4 per cent (7.3-9.5 per cent). Standardized to the age distribution of the female population, the average incidence per woman was 9.6 per cent (8.4-11.0 per cent); standardized to the age distribution of women attending antenatal clinics, it was 11.3 per cent (9.8-13.3 per cent). The estimated incidence depends on the values used for the epidemic growth rate and the AIDS-related mortality. To ensure that, for this population, errors in these two parameters change the age-specific estimates of the annual incidence by less than the standard deviation of those estimates, the AIDS-related mortality should be known to within +/-50 per cent and the epidemic growth rate to within +/-25 per cent, both of which conditions are met. In the absence of cohort studies to measure the incidence of HIV infection directly, useful estimates of the age-specific incidence can be obtained from cross-sectional, age-specific prevalence data and repeat cross-sectional data on the overall prevalence of HIV infection. Several assumptions were made because of the lack of data, but sensitivity analyses show that they are unlikely to affect the overall estimates significantly. These estimates are important in assessing the magnitude of the public health problem, for designing vaccine trials and for evaluating the impact of interventions. Copyright (C) 2001 John Wiley & Sons, Ltd.
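As an illustrative simplification of how age-specific incidence can be recovered from age-specific prevalence (ignoring AIDS-related mortality and epidemic growth, both of which the fitted model does account for): in a closed cohort, the prevalence P(a) accumulates according to the incidence among susceptibles lambda(a).

```latex
% Simplified relation (no mortality, stationary epidemic):
\frac{dP(a)}{da} = \lambda(a)\,\bigl(1 - P(a)\bigr)
\quad\Longrightarrow\quad
\lambda(a) = \frac{P'(a)}{1 - P(a)} .
% The full model additionally adjusts for AIDS-related mortality and for the epidemic
% growth rate, which is why those two parameters drive the sensitivity analysis above.
```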
Abstract:
Matrix population models, elasticity analysis and loop analysis can potentially provide powerful techniques for the analysis of life histories. Data from a capture-recapture study on a population of southern highland water skinks (Eulamprus tympanum) were used to construct a matrix population model. Errors in elasticities were calculated using the parametric bootstrap technique. Elasticity and loop analyses were then conducted to identify the life history stages most important to fitness. The same techniques were used to investigate the relative importance of fast versus slow growth, and of rapid versus delayed reproduction. Mature water skinks were long-lived, but there was high immature mortality. The most sensitive life history stage was the subadult stage. It is suggested that life history evolution in E. tympanum may be strongly affected by predation, particularly by birds. Because our population declined over the study, slow growth and delayed reproduction were the optimal life history strategies over this period. Although the techniques of evolutionary demography provide a powerful approach for the analysis of life histories, there are formidable logistical obstacles to gathering enough high-quality data for robust estimates of the critical parameters.
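A minimal sketch of the elasticity calculation and a parametric bootstrap for its errors; the stage-structured matrix entries and bootstrap sample size below are placeholders, not the published E. tympanum estimates.

```python
# Sketch: elasticities of the population growth rate and their bootstrap standard errors.
import numpy as np

def elasticities(A):
    """Elasticity matrix e_ij = (a_ij / lambda) * v_i * w_j / <v, w>."""
    eigvals, W = np.linalg.eig(A)
    k = np.argmax(eigvals.real)
    lam = eigvals.real[k]
    w = np.abs(W[:, k].real)                              # stable stage distribution (right eigenvector)
    eigvals_T, V = np.linalg.eig(A.T)
    v = np.abs(V[:, np.argmax(eigvals_T.real)].real)      # reproductive values (left eigenvector)
    S = np.outer(v, w) / (v @ w)                          # sensitivities d(lambda)/d(a_ij)
    return A * S / lam

A = np.array([[0.0, 0.2, 1.5],   # fecundities (placeholder values)
              [0.3, 0.4, 0.0],   # survival / transition probabilities (placeholders)
              [0.0, 0.5, 0.8]])

rng = np.random.default_rng(0)
n = 50                            # hypothetical sample size behind each survival estimate
boot = []
for _ in range(1000):
    A_b = A.copy()
    for i, j in [(1, 0), (1, 1), (2, 1), (2, 2)]:         # resample survival entries
        A_b[i, j] = rng.binomial(n, A[i, j]) / n          # parametric (binomial) bootstrap
    boot.append(elasticities(A_b))

se = np.std(boot, axis=0)         # bootstrap standard errors of each elasticity
```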
Abstract:
In this paper we consider the problem of providing standard errors of the component means in normal mixture models fitted to univariate or multivariate data by maximum likelihood via the EM algorithm. Two methods of estimation of the standard errors are considered: the standard information-based method and the computationally intensive bootstrap method. They are compared empirically by their application to three real data sets and by a small-scale Monte Carlo experiment.
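A minimal sketch of the bootstrap route for the standard errors of the fitted component means, using scikit-learn's GaussianMixture as a stand-in for the EM fit; the data below are synthetic, and nonparametric resampling is only one possible choice of bootstrap.

```python
# Sketch: bootstrap standard errors for the component means of a two-component normal mixture.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0.0, 1.0, 150),
                    rng.normal(4.0, 1.5, 100)]).reshape(-1, 1)

def fitted_means(data):
    gm = GaussianMixture(n_components=2, n_init=5, random_state=0).fit(data)
    return np.sort(gm.means_.ravel())     # sort to keep component labels comparable across fits

boot_means = np.array([
    fitted_means(x[rng.integers(0, len(x), len(x))])   # resample observations with replacement
    for _ in range(200)
])
se = boot_means.std(axis=0)               # bootstrap standard errors of the two means
print("means:", fitted_means(x), "bootstrap SEs:", se)
```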
Abstract:
Mitochondrial DNA (mtDNA) population data for forensic purposes are still scarce for some populations, which may limit the evaluation of forensic evidence, especially when the rarity of a haplotype needs to be determined in a database search. In order to improve the collection of mtDNA lineages from the Iberian and South American subcontinents, we here report the results of a collaborative study involving nine laboratories from the Spanish and Portuguese Speaking Working Group of the International Society for Forensic Genetics (GHEP-ISFG) and EMPOP. The individual laboratories contributed population data that were generated throughout the past 10 years but in the majority of cases had not been made available to the scientific community. A total of 1019 haplotypes from Iberia (Basque Country, 2 general Spanish populations, 2 North and 1 Central Portugal populations) and Latin America (3 populations from Sao Paulo) were collected, reviewed and harmonized according to defined EMPOP criteria. The majority of data ambiguities found during the reviewing process (41 in total) were transcription errors, confirming that the documentation process is still the most error-prone stage in reporting mtDNA population data, especially when performed manually. This GHEP-EMPOP collaboration has significantly improved the quality of the individual mtDNA datasets and adds mtDNA population data as a valuable resource to the EMPOP database (www.empop.org). (C) 2010 Elsevier Ireland Ltd. All rights reserved.
Abstract:
The determination of normal parameters is an important procedure in the evaluation of the stomatognathic system. We used the surface electromyography standardization protocol described by Ferrario et al. (J Oral Rehabil 2000;27:33-40; 2006;33:341) to determine reference values of the standardized electromyographic indices for the assessment of muscular symmetry (left and right side, percentage overlapping coefficient, POC), potential lateral displacing components (unbalanced contractile activities of contralateral masseter and temporalis muscles, TC), relative activity (most prevalent pair of masticatory muscles, ATTIV) and total activity (integrated areas of the electromyographic potentials over time, IMPACT) in healthy Brazilian young adults, and the relevant data reproducibility. Electromyography of the right and left masseter and temporalis muscles was performed during maximum teeth clenching in 20 healthy subjects (10 women and 10 men, mean age 23 years, s.d. 3), free from periodontal problems, temporomandibular disorders and oro-facial myofunctional disorders, and with full permanent dentition (at least 28 teeth). Data reproducibility was computed for 75% of the sample. The values obtained were POC temporalis (88.11 +/- 1.45%), POC masseter (87.11 +/- 1.60%), TC (8.79 +/- 1.20%), ATTIV (-0.33 +/- 9.65%) and IMPACT (110.40 +/- 23.69 μV/μV·s %). There were no statistical differences between test and retest values (P > 0.05). The Technical Errors of Measurement (TEM) for the 50% of subjects assessed during the same session were 1.5, 1.39, 1.06, 3.83 and 10.04. For the 25% of subjects assessed after a 6-month interval, the TEM were 0.80, 1.03, 0.73, 12.70 and 19.10. For all indices, there was good reproducibility. These electromyographic indices could be used in the assessment of patients with stomatognathic dysfunction.
Abstract:
The success of a dental implant-supported prosthesis is directly linked to the accuracy obtained during the implant's pose estimation (position and orientation). Although traditional impression techniques and recent digital acquisition methods are acceptably accurate, a simultaneously fast, accurate and operator-independent methodology is still lacking. To this end, an image-based framework is proposed to estimate the patient-specific implant's pose using cone-beam computed tomography (CBCT) and prior knowledge of the implanted model. The pose estimation is accomplished in a three-step approach: (1) a region-of-interest is extracted from the CBCT data using 2 operator-defined points along the implant's main axis; (2) a simulated CBCT volume of the known implanted model is generated through Feldkamp-Davis-Kress reconstruction and coarsely aligned to the defined axis; and (3) a voxel-based rigid registration is performed to optimally align the patient and simulated CBCT data, extracting the implant's pose from the optimal transformation. Three experiments were performed to evaluate the framework: (1) an in silico study using 48 implants distributed through 12 three-dimensional synthetic mandibular models; (2) an in vitro study using an artificial mandible with 2 dental implants acquired with an i-CAT system; and (3) two clinical case studies. The results showed positional errors of 67±34 μm and 108 μm, and angular misfits of 0.15±0.08° and 1.4°, for experiments 1 and 2, respectively. Moreover, in experiment 3, visual assessment of the clinical data showed a coherent alignment of the reference implant. Overall, a novel image-based framework for implant pose estimation from CBCT data was proposed, showing accurate results in agreement with dental prosthesis modelling requirements.
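A hedged sketch of the voxel-based rigid registration step (step 3) using SimpleITK; the file names, similarity metric and optimizer settings below are assumptions for illustration, not the authors' actual implementation.

```python
# Illustrative sketch of voxel-based rigid registration between the patient CBCT
# (fixed image) and the simulated CBCT of the known implant model (moving image).
import SimpleITK as sitk

fixed = sitk.ReadImage("patient_cbct_roi.nii.gz", sitk.sitkFloat32)      # hypothetical file names
moving = sitk.ReadImage("simulated_implant_cbct.nii.gz", sitk.sitkFloat32)

# Coarse initialization, loosely corresponding to step 2 (alignment to the defined axis).
initial = sitk.CenteredTransformInitializer(
    fixed, moving, sitk.Euler3DTransform(),
    sitk.CenteredTransformInitializerFilter.GEOMETRY)

reg = sitk.ImageRegistrationMethod()
reg.SetMetricAsMeanSquares()                                   # assumed intensity similarity metric
reg.SetOptimizerAsRegularStepGradientDescent(learningRate=1.0,
                                             minStep=1e-4,
                                             numberOfIterations=200)
reg.SetInitialTransform(initial, inPlace=False)
reg.SetInterpolator(sitk.sitkLinear)

final_transform = reg.Execute(fixed, moving)
# The implant pose (position and orientation) follows from the optimized rigid
# transform parameters:
print(final_transform.GetParameters())
```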
Abstract:
Cluster scheduling and collision avoidance are crucial issues in large-scale cluster-tree Wireless Sensor Networks (WSNs). The paper presents a methodology that provides a Time Division Cluster Scheduling (TDCS) mechanism based on the cyclic extension of the RCPS/TC (Resource Constrained Project Scheduling with Temporal Constraints) problem for a cluster-tree WSN, assuming bounded communication errors. The objective is to meet all end-to-end deadlines of a predefined set of time-bounded data flows while minimizing the energy consumption of the nodes by setting the TDCS period as long as possible. Since each cluster is active only once during the period, the end-to-end delay of a given flow may span several periods when there are flows in opposite directions. The scheduling tool enables system designers to efficiently configure all required parameters of IEEE 802.15.4/ZigBee beacon-enabled cluster-tree WSNs at network design time. The performance evaluation of the scheduling tool shows that problems with dozens of nodes can be solved using optimal solvers.
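Schematically, and in our notation rather than the paper's, the underlying optimization can be stated as below; a longer period lowers every cluster's duty cycle (and hence energy use), while the deadline constraints bound the delay a flow accumulates when consecutive clusters on its path are scheduled in an unfavourable order within the period.

```latex
% T: TDCS period; \sigma: assignment of each cluster to one non-colliding active slot per period;
% \mathcal{F}: set of time-bounded data flows, each with end-to-end deadline D_f.
\max_{T,\;\sigma} \; T
\qquad \text{subject to} \qquad
\mathrm{delay}_f(\sigma, T) \le D_f \quad \forall f \in \mathcal{F} .
```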
Abstract:
Environment monitoring has an important role in occupational exposure assessment. However, due to several factors it is done with insufficient frequency and normally does not provide the information necessary to choose the most adequate safety measures to avoid or control exposure. Identifying all the tasks performed in each workplace and conducting a task-based exposure assessment help to refine the exposure characterization and reduce assessment errors. A task-based assessment can also provide a better evaluation of exposure variability than assessing personal exposures with continuous 8-hour time-weighted average measurements. Health effects related to exposure to particles have mainly been investigated with mass-measuring instruments or gravimetric analysis. More recently, however, some studies support the view that size distribution and particle number concentration may have advantages over particle mass concentration for assessing the health effects of airborne particles. Several exposure assessments were performed in different occupational settings (bakery, grill house, cork industry and horse stable), applying two approaches: task-based exposure assessment and particle number concentration by size. The task-based approach made it possible to identify the tasks with the highest exposure to the smallest particles (0.3 μm) in the different occupational settings. The data obtained allow a more concrete and effective risk assessment and the identification of priorities for safety investment.
Abstract:
Epidemiological studies have shown the effect of diet on the incidence of chronic diseases; however, proper planning, design, and statistical modeling are necessary to obtain precise and accurate food consumption data. Evaluation methods used for short-term assessment of food consumption of a population, such as 24-hour recalls of food intake or food diaries, can be affected by random errors or biases inherent to the method. Statistical modeling is used to handle random errors, whereas proper design and sampling are essential for controlling biases. The present study aimed to analyze potential biases and random errors and determine how they affect the results. We also aimed to identify ways to prevent them and/or to use statistical approaches in epidemiological studies involving dietary assessments.
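A standard way to formalize the random error mentioned above is the classical measurement error model for short-term dietary instruments; this is a generic sketch, not a method taken from this study.

```latex
% A reported intake W_{ij} for person i on day j is modelled as the usual (true long-term)
% intake T_i plus within-person random error:
W_{ij} = T_i + \epsilon_{ij}, \qquad \epsilon_{ij} \sim (0, \sigma_\epsilon^2) .
% With a single day per person, diet-outcome associations are attenuated by the factor
\lambda = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_\epsilon^2},
% so repeated measurements (or explicit modelling of \sigma_\epsilon^2) are needed to
% estimate usual intake; systematic bias such as under-reporting is not removed by
% replication and must be controlled through study design.
```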