947 resultados para Data aggregation
Resumo:
In the nonparametric framework of Data Envelopment Analysis the statistical properties of its estimators have been investigated and only asymptotic results are available. For DEA estimators results of practical use have been proved only for the case of one input and one output. However, in the real world problems the production process is usually well described by many variables. In this paper a machine learning approach to variable aggregation based on Canonical Correlation Analysis is presented. This approach is applied for efficiency estimation of all the farms in Terceira Island of the Azorean archipelago.
Resumo:
Big data comes in various ways, types, shapes, forms and sizes. Indeed, almost all areas of science, technology, medicine, public health, economics, business, linguistics and social science are bombarded by ever increasing flows of data begging to be analyzed efficiently and effectively. In this paper, we propose a rough idea of a possible taxonomy of big data, along with some of the most commonly used tools for handling each particular category of bigness. The dimensionality p of the input space and the sample size n are usually the main ingredients in the characterization of data bigness. The specific statistical machine learning technique used to handle a particular big data set will depend on which category it falls in within the bigness taxonomy. Large p small n data sets for instance require a different set of tools from the large n small p variety. Among other tools, we discuss Preprocessing, Standardization, Imputation, Projection, Regularization, Penalization, Compression, Reduction, Selection, Kernelization, Hybridization, Parallelization, Aggregation, Randomization, Replication, Sequentialization. Indeed, it is important to emphasize right away that the so-called no free lunch theorem applies here, in the sense that there is no universally superior method that outperforms all other methods on all categories of bigness. It is also important to stress the fact that simplicity in the sense of Ockham’s razor non-plurality principle of parsimony tends to reign supreme when it comes to massive data. We conclude with a comparison of the predictive performance of some of the most commonly used methods on a few data sets.
Resumo:
An iterative travel time forecasting scheme, named the Advanced Multilane Prediction based Real-time Fastest Path (AMPRFP) algorithm, is presented in this dissertation. This scheme is derived from the conventional kernel estimator based prediction model by the association of real-time nonlinear impacts that caused by neighboring arcs’ traffic patterns with the historical traffic behaviors. The AMPRFP algorithm is evaluated by prediction of the travel time of congested arcs in the urban area of Jacksonville City. Experiment results illustrate that the proposed scheme is able to significantly reduce both the relative mean error (RME) and the root-mean-squared error (RMSE) of the predicted travel time. To obtain high quality real-time traffic information, which is essential to the performance of the AMPRFP algorithm, a data clean scheme enhanced empirical learning (DCSEEL) algorithm is also introduced. This novel method investigates the correlation between distance and direction in the geometrical map, which is not considered in existing fingerprint localization methods. Specifically, empirical learning methods are applied to minimize the error that exists in the estimated distance. A direction filter is developed to clean joints that have negative influence to the localization accuracy. Synthetic experiments in urban, suburban and rural environments are designed to evaluate the performance of DCSEEL algorithm in determining the cellular probe’s position. The results show that the cellular probe’s localization accuracy can be notably improved by the DCSEEL algorithm. Additionally, a new fast correlation technique for overcoming the time efficiency problem of the existing correlation algorithm based floating car data (FCD) technique is developed. The matching process is transformed into a 1-dimensional (1-D) curve matching problem and the Fast Normalized Cross-Correlation (FNCC) algorithm is introduced to supersede the Pearson product Moment Correlation Co-efficient (PMCC) algorithm in order to achieve the real-time requirement of the FCD method. The fast correlation technique shows a significant improvement in reducing the computational cost without affecting the accuracy of the matching process.
Resumo:
Temporal replicate counts are often aggregated to improve model fit by reducing zero-inflation and count variability, and in the case of migration counts collected hourly throughout a migration, allows one to ignore nonindependence. However, aggregation can represent a loss of potentially useful information on the hourly or seasonal distribution of counts, which might impact our ability to estimate reliable trends. We simulated 20-year hourly raptor migration count datasets with known rate of change to test the effect of aggregating hourly counts to daily or annual totals on our ability to recover known trend. We simulated data for three types of species, to test whether results varied with species abundance or migration strategy: a commonly detected species, e.g., Northern Harrier, Circus cyaneus; a rarely detected species, e.g., Peregrine Falcon, Falco peregrinus; and a species typically counted in large aggregations with overdispersed counts, e.g., Broad-winged Hawk, Buteo platypterus. We compared accuracy and precision of estimated trends across species and count types (hourly/daily/annual) using hierarchical models that assumed a Poisson, negative binomial (NB) or zero-inflated negative binomial (ZINB) count distribution. We found little benefit of modeling zero-inflation or of modeling the hourly distribution of migration counts. For the rare species, trends analyzed using daily totals and an NB or ZINB data distribution resulted in a higher probability of detecting an accurate and precise trend. In contrast, trends of the common and overdispersed species benefited from aggregation to annual totals, and for the overdispersed species in particular, trends estimating using annual totals were more precise, and resulted in lower probabilities of estimating a trend (1) in the wrong direction, or (2) with credible intervals that excluded the true trend, as compared with hourly and daily counts.
Resumo:
By providing vehicle-to-vehicle and vehicle-to-infrastructure wireless communications, vehicular ad hoc networks (VANETs), also known as the “networks on wheels”, can greatly enhance traffic safety, traffic efficiency and driving experience for intelligent transportation system (ITS). However, the unique features of VANETs, such as high mobility and uneven distribution of vehicular nodes, impose critical challenges of high efficiency and reliability for the implementation of VANETs. This dissertation is motivated by the great application potentials of VANETs in the design of efficient in-network data processing and dissemination. Considering the significance of message aggregation, data dissemination and data collection, this dissertation research targets at enhancing the traffic safety and traffic efficiency, as well as developing novel commercial applications, based on VANETs, following four aspects: 1) accurate and efficient message aggregation to detect on-road safety relevant events, 2) reliable data dissemination to reliably notify remote vehicles, 3) efficient and reliable spatial data collection from vehicular sensors, and 4) novel promising applications to exploit the commercial potentials of VANETs. Specifically, to enable cooperative detection of safety relevant events on the roads, the structure-less message aggregation (SLMA) scheme is proposed to improve communication efficiency and message accuracy. The scheme of relative position based message dissemination (RPB-MD) is proposed to reliably and efficiently disseminate messages to all intended vehicles in the zone-of-relevance in varying traffic density. Due to numerous vehicular sensor data available based on VANETs, the scheme of compressive sampling based data collection (CS-DC) is proposed to efficiently collect the spatial relevance data in a large scale, especially in the dense traffic. In addition, with novel and efficient solutions proposed for the application specific issues of data dissemination and data collection, several appealing value-added applications for VANETs are developed to exploit the commercial potentials of VANETs, namely general purpose automatic survey (GPAS), VANET-based ambient ad dissemination (VAAD) and VANET based vehicle performance monitoring and analysis (VehicleView). Thus, by improving the efficiency and reliability in in-network data processing and dissemination, including message aggregation, data dissemination and data collection, together with the development of novel promising applications, this dissertation will help push VANETs further to the stage of massive deployment.
Resumo:
Gluten-induced aggregation of K562 cells represents an in vitro model reproducing the early steps occurring in the small bowel of celiac patients exposed to gliadin. Despite the clear involvement of TG2 in the activation of the antigen-presenting cells, it is not yet clear in which compartment it occurs. Herein we study the calcium-dependent aggregation of these cells, using either cell-permeable or cell-impermeable TG2 inhibitors. Gluten induces efficient aggregation when calcium is absent in the extracellular environment, while TG2 inhibitors do not restore the full aggregating potential of gluten in the presence of calcium. These findings suggest that TG2 activity is not essential in the cellular aggregation mechanism. We demonstrate that gluten contacts the cells and provokes their aggregation through a mechanism involving the A-gliadin peptide 31-43. This peptide also activates the cell surface associated extracellular TG2 in the absence of calcium. Using a bioinformatics approach, we identify the possible docking sites of this peptide on the open and closed TG2 structures. Peptide docks with the closed TG2 structure near to the GTP/GDP site, by establishing molecular interactions with the same amino acids involved in stabilization of GTP binding. We suggest that it may occur through the displacement of GTP, switching the TG2 structure from the closed to the active open conformation. Furthermore, docking analysis shows peptide binding with the β-sandwich domain of the closed TG2 structure, suggesting that this region could be responsible for the different aggregating effects of gluten shown in the presence or absence of calcium. We deduce from these data a possible mechanism of action by which gluten makes contact with the cell surface, which could have possible implications in the celiac disease onset.
Resumo:
The present Dissertation shows how recent statistical analysis tools and open datasets can be exploited to improve modelling accuracy in two distinct yet interconnected domains of flood hazard (FH) assessment. In the first Part, unsupervised artificial neural networks are employed as regional models for sub-daily rainfall extremes. The models aim to learn a robust relation to estimate locally the parameters of Gumbel distributions of extreme rainfall depths for any sub-daily duration (1-24h). The predictions depend on twenty morphoclimatic descriptors. A large study area in north-central Italy is adopted, where 2238 annual maximum series are available. Validation is performed over an independent set of 100 gauges. Our results show that multivariate ANNs may remarkably improve the estimation of percentiles relative to the benchmark approach from the literature, where Gumbel parameters depend on mean annual precipitation. Finally, we show that the very nature of the proposed ANN models makes them suitable for interpolating predicted sub-daily rainfall quantiles across space and time-aggregation intervals. In the second Part, decision trees are used to combine a selected blend of input geomorphic descriptors for predicting FH. Relative to existing DEM-based approaches, this method is innovative, as it relies on the combination of three characteristics: (1) simple multivariate models, (2) a set of exclusively DEM-based descriptors as input, and (3) an existing FH map as reference information. First, the methods are applied to northern Italy, represented with the MERIT DEM (∼90m resolution), and second, to the whole of Italy, represented with the EU-DEM (25m resolution). The results show that multivariate approaches may (a) significantly enhance flood-prone areas delineation relative to a selected univariate one, (b) provide accurate predictions of expected inundation depths, (c) produce encouraging results in extrapolation, (d) complete the information of imperfect reference maps, and (e) conveniently convert binary maps into continuous representation of FH.
Resumo:
This article analyzes food insecurity and hunger in Brazilian families with children under five years of age. This was a nationally representative cross-sectional study using data from the National Demographic and Health Survey on Women and Children (PNDS-2006), in which the outcome variable was moderate to severe food insecurity, measured by the Brazilian Food Insecurity Scale (EBIA). Prevalence estimates and prevalence ratios were generated with 95% confidence intervals. The results showed a high prevalence of moderate to severe food insecurity, concentrated in the North and Northeast regions (30.7%), in economic classes D and E (34%), and in beneficiaries of conditional cash transfer programs (36.5%). Multivariate analysis showed that the socioeconomic relative risks (beneficiaries of conditional cash transfers), regional relative risks (North and Northeast regions), and economic relative risks (classes D and E) were 1.8, 2.0 and 2.4, respectively. Aggregation of the three risks showed 48% of families with moderate to severe food insecurity, meaning that adults and children were going hungry during the three months preceding the survey.
Resumo:
High-throughput screening of physical, genetic and chemical-genetic interactions brings important perspectives in the Systems Biology field, as the analysis of these interactions provides new insights into protein/gene function, cellular metabolic variations and the validation of therapeutic targets and drug design. However, such analysis depends on a pipeline connecting different tools that can automatically integrate data from diverse sources and result in a more comprehensive dataset that can be properly interpreted. We describe here the Integrated Interactome System (IIS), an integrative platform with a web-based interface for the annotation, analysis and visualization of the interaction profiles of proteins/genes, metabolites and drugs of interest. IIS works in four connected modules: (i) Submission module, which receives raw data derived from Sanger sequencing (e.g. two-hybrid system); (ii) Search module, which enables the user to search for the processed reads to be assembled into contigs/singlets, or for lists of proteins/genes, metabolites and drugs of interest, and add them to the project; (iii) Annotation module, which assigns annotations from several databases for the contigs/singlets or lists of proteins/genes, generating tables with automatic annotation that can be manually curated; and (iv) Interactome module, which maps the contigs/singlets or the uploaded lists to entries in our integrated database, building networks that gather novel identified interactions, protein and metabolite expression/concentration levels, subcellular localization and computed topological metrics, GO biological processes and KEGG pathways enrichment. This module generates a XGMML file that can be imported into Cytoscape or be visualized directly on the web. We have developed IIS by the integration of diverse databases following the need of appropriate tools for a systematic analysis of physical, genetic and chemical-genetic interactions. IIS was validated with yeast two-hybrid, proteomics and metabolomics datasets, but it is also extendable to other datasets. IIS is freely available online at: http://www.lge.ibi.unicamp.br/lnbio/IIS/.
Resumo:
The article seeks to investigate patterns of performance and relationships between grip strength, gait speed and self-rated health, and investigate the relationships between them, considering the variables of gender, age and family income. This was conducted in a probabilistic sample of community-dwelling elderly aged 65 and over, members of a population study on frailty. A total of 689 elderly people without cognitive deficit suggestive of dementia underwent tests of gait speed and grip strength. Comparisons between groups were based on low, medium and high speed and strength. Self-related health was assessed using a 5-point scale. The males and the younger elderly individuals scored significantly higher on grip strength and gait speed than the female and oldest did; the richest scored higher than the poorest on grip strength and gait speed; females and men aged over 80 had weaker grip strength and lower gait speed; slow gait speed and low income arose as risk factors for a worse health evaluation. Lower muscular strength affects the self-rated assessment of health because it results in a reduction in functional capacity, especially in the presence of poverty and a lack of compensatory factors.
Resumo:
Obstructive sleep apnea syndrome has a high prevalence among adults. Cephalometric variables can be a valuable method for evaluating patients with this syndrome. To correlate cephalometric data with the apnea-hypopnea sleep index. We performed a retrospective and cross-sectional study that analyzed the cephalometric data of patients followed in the Sleep Disorders Outpatient Clinic of the Discipline of Otorhinolaryngology of a university hospital, from June 2007 to May 2012. Ninety-six patients were included, 45 men, and 51 women, with a mean age of 50.3 years. A total of 11 patients had snoring, 20 had mild apnea, 26 had moderate apnea, and 39 had severe apnea. The distance from the hyoid bone to the mandibular plane was the only variable that showed a statistically significant correlation with the apnea-hypopnea index. Cephalometric variables are useful tools for the understanding of obstructive sleep apnea syndrome. The distance from the hyoid bone to the mandibular plane showed a statistically significant correlation with the apnea-hypopnea index.
Resumo:
Histological and histochemical observations support the hypothesis that collagen fibers can link to elastic fibers. However, the resulting organization of elastin and collagen type complexes and differences between these materials in terms of macromolecular orientation and frequencies of their chemical vibrational groups have not yet been solved. This study aimed to investigate the macromolecular organization of pure elastin, collagen type I and elastin-collagen complexes using polarized light DIC-microscopy. Additionally, differences and similarities between pure elastin and collagen bundles (CB) were investigated by Fourier transform-infrared (FT-IR) microspectroscopy. Although elastin exhibited a faint birefringence, the elastin-collagen complex aggregates formed in solution exhibited a deep birefringence and formation of an ordered-supramolecular complex typical of collagen chiral structure. The FT-IR study revealed elastin and CB peptide NH groups involved in different types of H-bonding. More energy is absorbed in the vibrational transitions corresponding to CH, CH2 and CH3 groups (probably associated with the hydrophobicity demonstrated by 8-anilino-1-naphtalene sulfonic acid sodium salt [ANS] fluorescence), and to νCN, δNH and ωCH2 groups of elastin compared to CB. It is assumed that the α-helix contribution to the pure elastin amide I profile is 46.8%, whereas that of the B-sheet is 20% and that unordered structures contribute to the remaining percentage. An FT-IR profile library reveals that the elastin signature within the 1360-1189cm(-1) spectral range resembles that of Conex-Toray aramid fibers.
Resumo:
In acquired immunodeficiency syndrome (AIDS) studies it is quite common to observe viral load measurements collected irregularly over time. Moreover, these measurements can be subjected to some upper and/or lower detection limits depending on the quantification assays. A complication arises when these continuous repeated measures have a heavy-tailed behavior. For such data structures, we propose a robust structure for a censored linear model based on the multivariate Student's t-distribution. To compensate for the autocorrelation existing among irregularly observed measures, a damped exponential correlation structure is employed. An efficient expectation maximization type algorithm is developed for computing the maximum likelihood estimates, obtaining as a by-product the standard errors of the fixed effects and the log-likelihood function. The proposed algorithm uses closed-form expressions at the E-step that rely on formulas for the mean and variance of a truncated multivariate Student's t-distribution. The methodology is illustrated through an application to an Human Immunodeficiency Virus-AIDS (HIV-AIDS) study and several simulation studies.
Resumo:
To assess the completeness and reliability of the Information System on Live Births (Sinasc) data. A cross-sectional analysis of the reliability and completeness of Sinasc's data was performed using a sample of Live Birth Certificate (LBC) from 2009, related to births from Campinas, Southeast Brazil. For data analysis, hospitals were grouped according to category of service (Unified National Health System, private or both), 600 LBCs were randomly selected and the data were collected in LBC-copies through mothers and newborns' hospital records and by telephone interviews. The completeness of LBCs was evaluated, calculating the percentage of blank fields, and the LBCs agreement comparing the originals with the copies was evaluated by Kappa and intraclass correlation coefficients. The percentage of completeness of LBCs ranged from 99.8%-100%. For the most items, the agreement was excellent. However, the agreement was acceptable for marital status, maternal education and newborn infants' race/color, low for prenatal visits and presence of birth defects, and very low for the number of deceased children. The results showed that the municipality Sinasc is reliable for most of the studied variables. Investments in training of the professionals are suggested in an attempt to improve system capacity to support planning and implementation of health activities for the benefit of maternal and child population.
Resumo:
Often in biomedical research, we deal with continuous (clustered) proportion responses ranging between zero and one quantifying the disease status of the cluster units. Interestingly, the study population might also consist of relatively disease-free as well as highly diseased subjects, contributing to proportion values in the interval [0, 1]. Regression on a variety of parametric densities with support lying in (0, 1), such as beta regression, can assess important covariate effects. However, they are deemed inappropriate due to the presence of zeros and/or ones. To evade this, we introduce a class of general proportion density, and further augment the probabilities of zero and one to this general proportion density, controlling for the clustering. Our approach is Bayesian and presents a computationally convenient framework amenable to available freeware. Bayesian case-deletion influence diagnostics based on q-divergence measures are automatic from the Markov chain Monte Carlo output. The methodology is illustrated using both simulation studies and application to a real dataset from a clinical periodontology study.