996 resultados para preprocessing data
Resumo:
Big data comes in various ways, types, shapes, forms and sizes. Indeed, almost all areas of science, technology, medicine, public health, economics, business, linguistics and social science are bombarded by ever increasing flows of data begging to be analyzed efficiently and effectively. In this paper, we propose a rough idea of a possible taxonomy of big data, along with some of the most commonly used tools for handling each particular category of bigness. The dimensionality p of the input space and the sample size n are usually the main ingredients in the characterization of data bigness. The specific statistical machine learning technique used to handle a particular big data set will depend on which category it falls in within the bigness taxonomy. Large p small n data sets for instance require a different set of tools from the large n small p variety. Among other tools, we discuss Preprocessing, Standardization, Imputation, Projection, Regularization, Penalization, Compression, Reduction, Selection, Kernelization, Hybridization, Parallelization, Aggregation, Randomization, Replication, Sequentialization. Indeed, it is important to emphasize right away that the so-called no free lunch theorem applies here, in the sense that there is no universally superior method that outperforms all other methods on all categories of bigness. It is also important to stress the fact that simplicity in the sense of Ockham’s razor non-plurality principle of parsimony tends to reign supreme when it comes to massive data. We conclude with a comparison of the predictive performance of some of the most commonly used methods on a few data sets.
Resumo:
Due to the rapid advances in computing and sensing technologies, enormous amounts of data are being generated everyday in various applications. The integration of data mining and data visualization has been widely used to analyze these massive and complex data sets to discover hidden patterns. For both data mining and visualization to be effective, it is important to include the visualization techniques in the mining process and to generate the discovered patterns for a more comprehensive visual view. In this dissertation, four related problems: dimensionality reduction for visualizing high dimensional datasets, visualization-based clustering evaluation, interactive document mining, and multiple clusterings exploration are studied to explore the integration of data mining and data visualization. In particular, we 1) propose an efficient feature selection method (reliefF + mRMR) for preprocessing high dimensional datasets; 2) present DClusterE to integrate cluster validation with user interaction and provide rich visualization tools for users to examine document clustering results from multiple perspectives; 3) design two interactive document summarization systems to involve users efforts and generate customized summaries from 2D sentence layouts; and 4) propose a new framework which organizes the different input clusterings into a hierarchical tree structure and allows for interactive exploration of multiple clustering solutions.
Resumo:
The accurate and reliable estimation of travel time based on point detector data is needed to support Intelligent Transportation System (ITS) applications. It has been found that the quality of travel time estimation is a function of the method used in the estimation and varies for different traffic conditions. In this study, two hybrid on-line travel time estimation models, and their corresponding off-line methods, were developed to achieve better estimation performance under various traffic conditions, including recurrent congestion and incidents. The first model combines the Mid-Point method, which is a speed-based method, with a traffic flow-based method. The second model integrates two speed-based methods: the Mid-Point method and the Minimum Speed method. In both models, the switch between travel time estimation methods is based on the congestion level and queue status automatically identified by clustering analysis. During incident conditions with rapidly changing queue lengths, shock wave analysis-based refinements are applied for on-line estimation to capture the fast queue propagation and recovery. Travel time estimates obtained from existing speed-based methods, traffic flow-based methods, and the models developed were tested using both simulation and real-world data. The results indicate that all tested methods performed at an acceptable level during periods of low congestion. However, their performances vary with an increase in congestion. Comparisons with other estimation methods also show that the developed hybrid models perform well in all cases. Further comparisons between the on-line and off-line travel time estimation methods reveal that off-line methods perform significantly better only during fast-changing congested conditions, such as during incidents. The impacts of major influential factors on the performance of travel time estimation, including data preprocessing procedures, detector errors, detector spacing, frequency of travel time updates to traveler information devices, travel time link length, and posted travel time range, were investigated in this study. The results show that these factors have more significant impacts on the estimation accuracy and reliability under congested conditions than during uncongested conditions. For the incident conditions, the estimation quality improves with the use of a short rolling period for data smoothing, more accurate detector data, and frequent travel time updates.
Resumo:
The importance of non-destructive techniques (NDT) in structural health monitoring programmes is being critically felt in the recent times. The quality of the measured data, often affected by various environmental conditions can be a guiding factor in terms usefulness and prediction efficiencies of the various detection and monitoring methods used in this regard. Often, a preprocessing of the acquired data in relation to the affecting environmental parameters can improve the information quality and lead towards a significantly more efficient and correct prediction process. The improvement can be directly related to the final decision making policy about a structure or a network of structures and is compatible with general probabilistic frameworks of such assessment and decision making programmes. This paper considers a preprocessing technique employed for an image analysis based structural health monitoring methodology to identify sub-marine pitting corrosion in the presence of variable luminosity, contrast and noise affecting the quality of images. A preprocessing of the gray-level threshold of the various images is observed to bring about a significant improvement in terms of damage detection as compared to an automatically computed gray-level threshold. The case dependent adjustments of the threshold enable to obtain the best possible information from an existing image. The corresponding improvements are observed in a qualitative manner in the present study.
Resumo:
A miniaturised gas analyser is described and evaluated based on the use of a substrate-integrated hollow waveguide (iHWG) coupled to a microsized near-infrared spectrophotometer comprising a linear variable filter and an array of InGaAs detectors. This gas sensing system was applied to analyse surrogate samples of natural fuel gas containing methane, ethane, propane and butane, quantified by using multivariate regression models based on partial least square (PLS) algorithms and Savitzky-Golay 1(st) derivative data preprocessing. The external validation of the obtained models reveals root mean square errors of prediction of 0.37, 0.36, 0.67 and 0.37% (v/v), for methane, ethane, propane and butane, respectively. The developed sensing system provides particularly rapid response times upon composition changes of the gaseous sample (approximately 2 s) due the minute volume of the iHWG-based measurement cell. The sensing system developed in this study is fully portable with a hand-held sized analyser footprint, and thus ideally suited for field analysis. Last but not least, the obtained results corroborate the potential of NIR-iHWG analysers for monitoring the quality of natural gas and petrochemical gaseous products.
Resumo:
High-throughput screening of physical, genetic and chemical-genetic interactions brings important perspectives in the Systems Biology field, as the analysis of these interactions provides new insights into protein/gene function, cellular metabolic variations and the validation of therapeutic targets and drug design. However, such analysis depends on a pipeline connecting different tools that can automatically integrate data from diverse sources and result in a more comprehensive dataset that can be properly interpreted. We describe here the Integrated Interactome System (IIS), an integrative platform with a web-based interface for the annotation, analysis and visualization of the interaction profiles of proteins/genes, metabolites and drugs of interest. IIS works in four connected modules: (i) Submission module, which receives raw data derived from Sanger sequencing (e.g. two-hybrid system); (ii) Search module, which enables the user to search for the processed reads to be assembled into contigs/singlets, or for lists of proteins/genes, metabolites and drugs of interest, and add them to the project; (iii) Annotation module, which assigns annotations from several databases for the contigs/singlets or lists of proteins/genes, generating tables with automatic annotation that can be manually curated; and (iv) Interactome module, which maps the contigs/singlets or the uploaded lists to entries in our integrated database, building networks that gather novel identified interactions, protein and metabolite expression/concentration levels, subcellular localization and computed topological metrics, GO biological processes and KEGG pathways enrichment. This module generates a XGMML file that can be imported into Cytoscape or be visualized directly on the web. We have developed IIS by the integration of diverse databases following the need of appropriate tools for a systematic analysis of physical, genetic and chemical-genetic interactions. IIS was validated with yeast two-hybrid, proteomics and metabolomics datasets, but it is also extendable to other datasets. IIS is freely available online at: http://www.lge.ibi.unicamp.br/lnbio/IIS/.
Resumo:
The article seeks to investigate patterns of performance and relationships between grip strength, gait speed and self-rated health, and investigate the relationships between them, considering the variables of gender, age and family income. This was conducted in a probabilistic sample of community-dwelling elderly aged 65 and over, members of a population study on frailty. A total of 689 elderly people without cognitive deficit suggestive of dementia underwent tests of gait speed and grip strength. Comparisons between groups were based on low, medium and high speed and strength. Self-related health was assessed using a 5-point scale. The males and the younger elderly individuals scored significantly higher on grip strength and gait speed than the female and oldest did; the richest scored higher than the poorest on grip strength and gait speed; females and men aged over 80 had weaker grip strength and lower gait speed; slow gait speed and low income arose as risk factors for a worse health evaluation. Lower muscular strength affects the self-rated assessment of health because it results in a reduction in functional capacity, especially in the presence of poverty and a lack of compensatory factors.
Resumo:
Obstructive sleep apnea syndrome has a high prevalence among adults. Cephalometric variables can be a valuable method for evaluating patients with this syndrome. To correlate cephalometric data with the apnea-hypopnea sleep index. We performed a retrospective and cross-sectional study that analyzed the cephalometric data of patients followed in the Sleep Disorders Outpatient Clinic of the Discipline of Otorhinolaryngology of a university hospital, from June 2007 to May 2012. Ninety-six patients were included, 45 men, and 51 women, with a mean age of 50.3 years. A total of 11 patients had snoring, 20 had mild apnea, 26 had moderate apnea, and 39 had severe apnea. The distance from the hyoid bone to the mandibular plane was the only variable that showed a statistically significant correlation with the apnea-hypopnea index. Cephalometric variables are useful tools for the understanding of obstructive sleep apnea syndrome. The distance from the hyoid bone to the mandibular plane showed a statistically significant correlation with the apnea-hypopnea index.
Resumo:
In acquired immunodeficiency syndrome (AIDS) studies it is quite common to observe viral load measurements collected irregularly over time. Moreover, these measurements can be subjected to some upper and/or lower detection limits depending on the quantification assays. A complication arises when these continuous repeated measures have a heavy-tailed behavior. For such data structures, we propose a robust structure for a censored linear model based on the multivariate Student's t-distribution. To compensate for the autocorrelation existing among irregularly observed measures, a damped exponential correlation structure is employed. An efficient expectation maximization type algorithm is developed for computing the maximum likelihood estimates, obtaining as a by-product the standard errors of the fixed effects and the log-likelihood function. The proposed algorithm uses closed-form expressions at the E-step that rely on formulas for the mean and variance of a truncated multivariate Student's t-distribution. The methodology is illustrated through an application to an Human Immunodeficiency Virus-AIDS (HIV-AIDS) study and several simulation studies.
Resumo:
To assess the completeness and reliability of the Information System on Live Births (Sinasc) data. A cross-sectional analysis of the reliability and completeness of Sinasc's data was performed using a sample of Live Birth Certificate (LBC) from 2009, related to births from Campinas, Southeast Brazil. For data analysis, hospitals were grouped according to category of service (Unified National Health System, private or both), 600 LBCs were randomly selected and the data were collected in LBC-copies through mothers and newborns' hospital records and by telephone interviews. The completeness of LBCs was evaluated, calculating the percentage of blank fields, and the LBCs agreement comparing the originals with the copies was evaluated by Kappa and intraclass correlation coefficients. The percentage of completeness of LBCs ranged from 99.8%-100%. For the most items, the agreement was excellent. However, the agreement was acceptable for marital status, maternal education and newborn infants' race/color, low for prenatal visits and presence of birth defects, and very low for the number of deceased children. The results showed that the municipality Sinasc is reliable for most of the studied variables. Investments in training of the professionals are suggested in an attempt to improve system capacity to support planning and implementation of health activities for the benefit of maternal and child population.
Resumo:
Often in biomedical research, we deal with continuous (clustered) proportion responses ranging between zero and one quantifying the disease status of the cluster units. Interestingly, the study population might also consist of relatively disease-free as well as highly diseased subjects, contributing to proportion values in the interval [0, 1]. Regression on a variety of parametric densities with support lying in (0, 1), such as beta regression, can assess important covariate effects. However, they are deemed inappropriate due to the presence of zeros and/or ones. To evade this, we introduce a class of general proportion density, and further augment the probabilities of zero and one to this general proportion density, controlling for the clustering. Our approach is Bayesian and presents a computationally convenient framework amenable to available freeware. Bayesian case-deletion influence diagnostics based on q-divergence measures are automatic from the Markov chain Monte Carlo output. The methodology is illustrated using both simulation studies and application to a real dataset from a clinical periodontology study.
Resumo:
Patients with obstructive sleep apnea syndrome usually present with changes in upper airway morphology and/or body fat distribution, which may occur throughout life and increase the severity of obstructive sleep apnea syndrome with age. To correlate cephalometric and anthropometric measures with obstructive sleep apnea syndrome severity in different age groups. A retrospective study of cephalometric and anthropometric measures of 102 patients with obstructive sleep apnea syndrome was analyzed. Patients were divided into three age groups (≥20 and <40 years, ≥40 and <60 years, and ≥60 years). Pearson's correlation was performed for these measures with the apnea-hypopnea index in the full sample, and subsequently by age group. The cephalometric measures MP-H (distance between the mandibular plane and the hyoid bone) and PNS-P (distance between the posterior nasal spine and the tip of the soft palate) and the neck and waist circumferences showed a statistically significant correlation with apnea-hypopnea index in both the full sample and in the ≥40 and <60 years age group. These variables did not show any significant correlation with the other two age groups (<40 and ≥60 years). Cephalometric measurements MP-H and PNS-P and cervical and waist circumferences correlated with obstructive sleep apnea syndrome severity in patients in the ≥40 and <60 age group.
Resumo:
The syndrome of resistance to thyroid hormone (RTH β) is an inherited disorder characterized by variable tissue hyposensitivity to 3,5,30-l-triiodothyronine (T3), with persistent elevation of free-circulating T3 (FT3) and free thyroxine (FT4) levels in association with nonsuppressed serum thyrotropin (TSH). Clinical presentation is variable and the molecular analysis of THRB gene provides a short cut diagnosis. Here, we describe 2 cases in which RTH β was suspected on the basis of laboratory findings. The diagnosis was confirmed by direct THRB sequencing that revealed 2 novel mutations: the heterozygous p.Ala317Ser in subject 1 and the heterozygous p.Arg438Pro in subject 2. Both mutations were shown to be deleterious by SIFT, PolyPhen, and Align GV-GD predictive methods.
Resumo:
The caffeine solubility in supercritical CO2 was studied by assessing the effects of pressure and temperature on the extraction of green coffee oil (GCO). The Peng-Robinson¹ equation of state was used to correlate the solubility of caffeine with a thermodynamic model and two mixing rules were evaluated: the classical mixing rule of van der Waals with two adjustable parameters (PR-VDW) and a density dependent one, proposed by Mohamed and Holder² with two (PR-MH, two parameters adjusted to the attractive term) and three (PR-MH3 two parameters adjusted to the attractive and one to the repulsive term) adjustable parameters. The best results were obtained with the mixing rule of Mohamed and Holder² with three parameters.
Resumo:
Advances in diagnostic research are moving towards methods whereby the periodontal risk can be identified and quantified by objective measures using biomarkers. Patients with periodontitis may have elevated circulating levels of specific inflammatory markers that can be correlated to the severity of the disease. The purpose of this study was to evaluate whether differences in the serum levels of inflammatory biomarkers are differentially expressed in healthy and periodontitis patients. Twenty-five patients (8 healthy patients and 17 chronic periodontitis patients) were enrolled in the study. A 15 mL blood sample was used for identification of the inflammatory markers, with a human inflammatory flow cytometry multiplex assay. Among 24 assessed cytokines, only 3 (RANTES, MIG and Eotaxin) were statistically different between groups (p<0.05). In conclusion, some of the selected markers of inflammation are differentially expressed in healthy and periodontitis patients. Cytokine profile analysis may be further explored to distinguish the periodontitis patients from the ones free of disease and also to be used as a measure of risk. The present data, however, are limited and larger sample size studies are required to validate the findings of the specific biomarkers.