934 results for Data anonymization and sanitization
Abstract:
Background: High-density tiling arrays and new sequencing technologies are generating rapidly increasing volumes of transcriptome and protein-DNA interaction data. Visualization and exploration of these data are critical to understanding the regulatory logic encoded in the genome by which the cell dynamically affects its physiology and interacts with its environment. Results: The Gaggle Genome Browser is a cross-platform desktop program for interactively visualizing high-throughput data in the context of the genome. Important features include dynamic panning and zooming, keyword search and open interoperability through the Gaggle framework. Users may bookmark locations on the genome with descriptive annotations and share these bookmarks with other users. The program handles large sets of user-generated data using an in-process database and leverages the facilities of SQL and the R environment for importing and manipulating data. A key aspect of the Gaggle Genome Browser is interoperability. By connecting to the Gaggle framework, the genome browser joins a suite of interconnected bioinformatics tools for analysis and visualization with connectivity to major public repositories of sequences, interactions and pathways. To this flexible environment for exploring and combining data, the Gaggle Genome Browser adds the ability to visualize diverse types of data in relation to their coordinates on the genome. Conclusions: Genomic coordinates function as a common key by which disparate biological data types can be related to one another. In the Gaggle Genome Browser, heterogeneous data are joined by their location on the genome to create information-rich visualizations yielding insight into genome organization, transcription and its regulation and, ultimately, a better understanding of the mechanisms that enable the cell to dynamically respond to its environment.
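The abstract's central idea is that an in-process database lets heterogeneous tracks be joined purely by genomic coordinates. A minimal sketch of that idea is given below; the table and column names are hypothetical illustrations, not the browser's actual schema.

```python
# Minimal sketch: joining heterogeneous track data by genomic coordinates in an
# in-process SQLite database. The schema is hypothetical and only illustrates
# the "coordinates as a common key" idea described in the abstract.
import sqlite3

con = sqlite3.connect(":memory:")  # in-process database
con.executescript("""
CREATE TABLE genes(name TEXT, chrom TEXT, start INTEGER, "end" INTEGER);
CREATE TABLE tiling_signal(chrom TEXT, pos INTEGER, value REAL);
INSERT INTO genes VALUES ('geneA', 'chr', 100, 400), ('geneB', 'chr', 900, 1300);
INSERT INTO tiling_signal VALUES ('chr', 150, 2.4), ('chr', 350, 3.1), ('chr', 1000, 0.7);
""")

# Relate expression signal to genes purely by overlapping coordinates.
rows = con.execute("""
    SELECT g.name, AVG(t.value) AS mean_signal
    FROM genes g
    JOIN tiling_signal t
      ON t.chrom = g.chrom AND t.pos BETWEEN g.start AND g."end"
    GROUP BY g.name
""").fetchall()
print(rows)  # e.g. [('geneA', 2.75), ('geneB', 0.7)]
```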
Abstract:
Grass reference evapotranspiration (ETo) is an important agrometeorological parameter for climatological and hydrological studies, as well as for irrigation planning and management. There are several methods to estimate ETo, but their performance in different environments is diverse, since all of them have some empirical background. The FAO Penman-Monteith (FAO PM) method has been considered a universal standard to estimate ETo for more than a decade. This method considers many parameters related to the evapotranspiration process: net radiation (Rn), air temperature (T), vapor pressure deficit (Delta e), and wind speed (U); and has presented very good results when compared to data from lysimeters populated with short grass or alfalfa. In some conditions, the use of the FAO PM method is restricted by the lack of input variables. In these cases, when data are missing, the option is to calculate ETo by the FAO PM method using estimated input variables, as recommended by FAO Irrigation and Drainage Paper 56. Based on that, the objective of this study was to evaluate the performance of the FAO PM method to estimate ETo when Rn, Delta e, and U data are missing, in Southern Ontario, Canada. Other alternative methods were also tested for the region: Priestley-Taylor, Hargreaves, and Thornthwaite. Data from 12 locations across Southern Ontario, Canada, were used to compare ETo estimated by the FAO PM method with a complete data set and with missing data. The alternative ETo equations were also tested and calibrated for each location. When relative humidity (RH) and U data were missing, the FAO PM method was still a very good option for estimating ETo for Southern Ontario, with RMSE smaller than 0.53 mm day(-1). For these cases, U data were replaced by the normal values for the region and Delta e was estimated from temperature data. The Priestley-Taylor method was also a good option for estimating ETo when U and Delta e data were missing, mainly when calibrated locally (RMSE = 0.40 mm day(-1)). When Rn was missing, the FAO PM method was not good enough for estimating ETo, with RMSE increasing to 0.79 mm day(-1). When only T data were available, the adjusted Hargreaves and modified Thornthwaite methods were better options to estimate ETo than the FAO PM method, since the RMSEs from these methods, respectively 0.79 and 0.83 mm day(-1), were significantly smaller than that obtained by FAO PM (RMSE = 1.12 mm day(-1)). (C) 2009 Elsevier B.V. All rights reserved.
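For the temperature-only case, the study favours an adjusted Hargreaves equation over FAO PM with estimated inputs. The sketch below shows the standard, uncalibrated Hargreaves-Samani form; the local adjustment coefficients fitted in the study are not reproduced, and the example inputs are hypothetical.

```python
import math

def hargreaves_eto(tmax_c, tmin_c, ra_mj_m2_day):
    """Uncalibrated Hargreaves-Samani (1985) reference evapotranspiration.

    tmax_c, tmin_c : daily maximum/minimum air temperature (deg C)
    ra_mj_m2_day   : extraterrestrial radiation (MJ m-2 day-1)
    Returns ETo in mm day-1; the factor 0.408 converts MJ m-2 day-1 to mm day-1.
    """
    tmean = (tmax_c + tmin_c) / 2.0
    return 0.0023 * 0.408 * ra_mj_m2_day * (tmean + 17.8) * math.sqrt(max(tmax_c - tmin_c, 0.0))

# Hypothetical mid-summer day (Ra taken from FAO-56 style tables):
print(round(hargreaves_eto(tmax_c=27.0, tmin_c=15.0, ra_mj_m2_day=41.0), 2))  # ~5.2 mm day-1
```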
Abstract:
This paper develops an interactive approach for exploratory spatial data analysis. Measures of attribute similarity and spatial proximity are combined in a clustering model to support the identification of patterns in spatial information. Relationships between the developed clustering approach, spatial data mining and choropleth display are discussed. An analysis of property crime rates in Brisbane, Australia, is presented. A surprising finding in this research is that there are substantial inconsistencies in standard choropleth display options found in two widely used commercial geographical information systems, both in terms of definition and performance. The comparative results demonstrate the usefulness and appeal of the developed approach in a geographical information system environment for exploratory spatial data analysis.
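The abstract does not specify how attribute similarity and spatial proximity are weighted, so the sketch below uses a simple convex combination of the two distance matrices as a generic illustration of the idea; the regions, weights and attribute values are hypothetical.

```python
# Sketch: hierarchical clustering on a convex combination of attribute
# dissimilarity and spatial proximity. This is a generic illustration, not the
# specific clustering model developed in the paper.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
coords = rng.uniform(0, 10, size=(30, 2))        # hypothetical region centroids
crime_rate = rng.gamma(2.0, 1.5, size=(30, 1))   # hypothetical attribute values

def scaled_pdist(x):
    d = pdist(x)
    return d / d.max()                            # rescale to [0, 1] so the terms are comparable

alpha = 0.6                                       # weight on attribute dissimilarity
combined = alpha * scaled_pdist(crime_rate) + (1 - alpha) * scaled_pdist(coords)

labels = fcluster(linkage(combined, method="average"), t=4, criterion="maxclust")
print(labels)                                     # cluster membership for each region
```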
Abstract:
An analysis of the relationships of the major arthropod groups was undertaken using mitochondrial genome data to examine the hypotheses that Hexapoda is polyphyletic and that Collembola is more closely related to branchiopod crustaceans than insects. We sought to examine the sensitivity of this relationship to outgroup choice, data treatment, gene choice and optimality criteria used in the phylogenetic analysis of mitochondrial genome data. Additionally, we sequenced the mitochondrial genome of an archaeognathan, Nesomachilis australica, to improve taxon selection in the apterygote insects, a group poorly represented in previous mitochondrial phylogenies. The sister group of the Collembola was rarely resolved in our analyses with a significant level of support. The use of different outgroups (myriapods, nematodes, or annelids + mollusks) resulted in many different placements of Collembola. The way in which the dataset was coded for analysis (DNA, DNA with the exclusion of third codon positions, and as amino acids) also had marked effects on tree topology. We found that nodal support was spread evenly throughout the 13 mitochondrial genes and the exclusion of genes resulted in significantly less resolution in the inferred trees. Optimality criteria had a much lesser effect on topology than the preceding factors; parsimony and Bayesian trees for a given data set and treatment were quite similar. We therefore conclude that the relationships of the extant arthropod groups as inferred by mitochondrial genomes are highly vulnerable to outgroup choice, data treatment and gene choice, and no consistent alternative hypothesis of Collembola's relationships is supported. Pending the resolution of these identified problems with the application of mitogenomic data to basal arthropod relationships, it is difficult to justify the rejection of hexapod monophyly, which is well supported on morphological grounds. (c) The Willi Hennig Society 2004.
Abstract:
Binning and truncation of data are common in data analysis and machine learning. This paper addresses the problem of fitting mixture densities to multivariate binned and truncated data. The EM approach proposed by McLachlan and Jones (Biometrics, 44: 2, 571-578, 1988) for the univariate case is generalized to multivariate measurements. The multivariate solution requires the evaluation of multidimensional integrals over each bin at each iteration of the EM procedure. Naive implementation of the procedure can lead to computationally inefficient results. To reduce the computational cost a number of straightforward numerical techniques are proposed. Results on simulated data indicate that the proposed methods can achieve significant computational gains with no loss in the accuracy of the final parameter estimates. Furthermore, experimental results suggest that with a sufficient number of bins and data points it is possible to estimate the true underlying density almost as well as if the data were not binned. The paper concludes with a brief description of an application of this approach to diagnosis of iron deficiency anemia, in the context of binned and truncated bivariate measurements of volume and hemoglobin concentration from an individual's red blood cells.
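The structure of the binned-data EM can be seen already in one dimension: bin probabilities replace point densities in the E-step, and the within-bin integrals needed in the M-step are truncated-normal moments. The sketch below is a univariate two-component version on hypothetical counts (the paper treats the multivariate case, and truncation outside the binning range is ignored here).

```python
# Sketch: EM for a two-component Gaussian mixture fitted to binned counts.
import numpy as np
from scipy.stats import norm, truncnorm

edges = np.linspace(0.0, 12.0, 25)               # bin edges (24 bins)
lo, hi = edges[:-1], edges[1:]
rng = np.random.default_rng(1)
sample = np.concatenate([rng.normal(3, 1, 600), rng.normal(8, 1.5, 400)])
counts, _ = np.histogram(sample, bins=edges)     # hypothetical binned data

mu, sigma, pi = np.array([2.0, 9.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])
for _ in range(200):
    # E-step: probability mass of each bin under each component.
    p = np.stack([norm.cdf(hi, m, s) - norm.cdf(lo, m, s) for m, s in zip(mu, sigma)], axis=1)
    resp = pi * p + 1e-300
    resp /= resp.sum(axis=1, keepdims=True)      # responsibilities per bin
    w = counts[:, None] * resp                   # expected counts per bin/component
    # M-step: truncated-normal moments supply the within-bin integrals.
    new = []
    for k in range(2):
        a, b = (lo - mu[k]) / sigma[k], (hi - mu[k]) / sigma[k]
        tn = truncnorm(a, b, loc=mu[k], scale=sigma[k])
        m1, v = tn.mean(), tn.var()
        midpt = (lo + hi) / 2.0
        m1 = np.where(np.isfinite(m1), m1, midpt)  # guard far-tail bins
        v = np.where(np.isfinite(v), v, 0.0)
        nk = w[:, k].sum()
        mu_k = (w[:, k] * m1).sum() / nk
        var_k = (w[:, k] * (v + m1**2)).sum() / nk - mu_k**2
        new.append((nk, mu_k, np.sqrt(var_k)))
    pi = np.array([n for n, _, _ in new]) / counts.sum()
    mu = np.array([m for _, m, _ in new])
    sigma = np.array([s for _, _, s in new])

print(np.round(mu, 2), np.round(sigma, 2), np.round(pi, 2))  # ~[3, 8], ~[1, 1.5], ~[0.6, 0.4]
```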
Abstract:
Background: With the decrease of DNA sequencing costs, sequence-based typing methods are rapidly becoming the gold standard for epidemiological surveillance. These methods provide reproducible and comparable results needed for a global scale bacterial population analysis, while retaining their usefulness for local epidemiological surveys. Online databases that collect the generated allelic profiles and associated epidemiological data are available, but this wealth of data remains underused and frequently poorly annotated, since no user-friendly tool exists to analyze and explore it. Results: PHYLOViZ is platform-independent Java software that allows the integrated analysis of sequence-based typing methods, including SNP data generated from whole genome sequence approaches, and associated epidemiological data. goeBURST and its Minimum Spanning Tree expansion are used for visualizing the possible evolutionary relationships between isolates. The results can be displayed as an annotated graph overlaying the query results of any other epidemiological data available. Conclusions: PHYLOViZ is user-friendly software that allows the combined analysis of multiple data sources for microbial epidemiological and population studies. It is freely available at http://www.phyloviz.net.
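The core visualization is a minimum spanning tree over allelic profiles. The sketch below builds such a tree from Hamming distances between a few hypothetical profiles; goeBURST additionally applies specific tie-break rules that are not reproduced here.

```python
# Sketch: a minimum spanning tree over allelic profiles using Hamming distance.
# Generic illustration only; the profiles are hypothetical.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

profiles = np.array([   # rows = isolates, columns = loci (allele numbers)
    [1, 3, 1, 1, 1, 1, 3],
    [1, 3, 1, 1, 1, 1, 1],
    [1, 3, 2, 1, 1, 1, 1],
    [4, 3, 1, 1, 1, 1, 1],
    [4, 3, 1, 2, 5, 1, 1],
])

# Pairwise Hamming distances (number of differing loci).
diff = (profiles[:, None, :] != profiles[None, :, :]).sum(axis=2)
mst = minimum_spanning_tree(diff).toarray()

for i, j in zip(*np.nonzero(mst)):
    print(f"isolate {i} -- isolate {j}  ({int(mst[i, j])} differing loci)")
```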
Abstract:
Dissertation presented to obtain the Ph.D. degree in Bioinformatics
Abstract:
This paper studies the statistical distributions of worldwide earthquakes from year 1963 up to year 2012. A Cartesian grid, dividing Earth into geographic regions, is considered. Entropy and the Jensen–Shannon divergence are used to analyze and compare real-world data. Hierarchical clustering and multi-dimensional scaling techniques are adopted for data visualization. Entropy-based indices have the advantage of leading to a single parameter expressing the relationships between the seismic data. Classical and generalized (fractional) entropy and Jensen–Shannon divergence are tested. The generalized measures lead to a clear identification of patterns embedded in the data and contribute to better understand earthquake distributions.
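The classical versions of the two measures used in the paper are easy to state: Shannon entropy of a regional event-count distribution and the Jensen-Shannon divergence between two such distributions. The sketch below computes both on hypothetical grid-cell counts; the generalized (fractional) variants are not reproduced.

```python
# Sketch: Shannon entropy and Jensen-Shannon divergence for event-count
# distributions over grid cells (hypothetical counts).
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def jensen_shannon(p, q):
    m = 0.5 * (p + q)
    return entropy(m) - 0.5 * (entropy(p) + entropy(q))

region_a = np.array([120, 40, 5, 0, 35], dtype=float)   # earthquakes per cell, region A
region_b = np.array([10, 80, 60, 15, 0], dtype=float)   # earthquakes per cell, region B
p, q = region_a / region_a.sum(), region_b / region_b.sum()

print(f"H(A) = {entropy(p):.3f} bits, H(B) = {entropy(q):.3f} bits, JSD = {jensen_shannon(p, q):.3f} bits")
```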
Abstract:
New arguments proving that successive (repeated) measurements have a memory and actually remember each other are presented. The recognition of this peculiarity can change essentially the existing paradigm associated with conventional observation of the behavior of different complex systems and lead towards the application of an intermediate model (IM). This IM can provide a very accurate fit of the measured data in terms of the Prony's decomposition. This decomposition, in turn, contains a small set of fitting parameters relative to the number of initial data points and allows comparing the measured data in cases where the "best fit" model based on some specific physical principles is absent. As an example, we consider two X-ray diffractometers (defined in the paper as A ("cheap") and B ("expensive")) that are used, after their proper calibration, for measuring the same substance (corundum α-Al2O3). The amplitude-frequency response (AFR) obtained in the frame of the Prony's decomposition can be used for comparison of the spectra recorded from the (A) and (B) X-ray diffractometers (XRDs) for calibration and other practical purposes. We also prove that the Fourier decomposition can be adapted to an "ideal" experiment without memory, while the Prony's decomposition corresponds to a real measurement and can be fitted in the frame of the IM in this case. New statistical parameters describing the properties of experimental equipment (irrespective of their internal "filling") are found. The suggested approach is rather general and can be used for calibration and comparison of different complex dynamical systems for practical purposes.
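For readers unfamiliar with the Prony decomposition, the sketch below shows the textbook method applied to a synthetic signal: linear prediction, root finding, and a least-squares solve for the complex amplitudes. This is not the paper's specific IM/AFR procedure, and the signal and model order are hypothetical.

```python
# Sketch: classical Prony decomposition of a uniformly sampled signal into a
# small set of damped complex exponentials (textbook method, synthetic data).
import numpy as np

def prony(x, p):
    """Fit x[n] ~ sum_k h[k] * z[k]**n with p exponentials (least squares)."""
    N = len(x)
    # 1) Linear prediction: x[n] + a1*x[n-1] + ... + ap*x[n-p] = 0
    A = np.column_stack([x[p - k - 1:N - k - 1] for k in range(p)])
    a, *_ = np.linalg.lstsq(A, -x[p:], rcond=None)
    # 2) Roots of the characteristic polynomial give the exponential bases z_k.
    z = np.roots(np.concatenate(([1.0], a)))
    # 3) Complex amplitudes from the Vandermonde system x = V h.
    V = z[None, :] ** np.arange(N)[:, None]
    h, *_ = np.linalg.lstsq(V, x, rcond=None)
    return h, z

n = np.arange(200)
signal = 2.0 * 0.98**n * np.cos(0.3 * n) + 0.5 * 0.95**n * np.cos(1.1 * n + 0.4)
h, z = prony(signal, p=4)
recon = (z[None, :] ** n[:, None] @ h).real
print("max reconstruction error:", np.max(np.abs(recon - signal)))
```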
Abstract:
Retail services are a main contributor to municipal budgets and an activity that affects perceived quality of life, especially for those with mobility difficulties (e.g. the elderly and low-income citizens). However, there is evidence of a decline in some of the services market towns provide to their citizens, and this decline has been reported all over the western world, from North America to Australia. The aim of this research was to understand retail decline and shed light on some ways of addressing it, using a case study of Thornbury, a small town in the Southwest of England. Data were collected through two participatory approaches: photo-surveys and multicriteria mapping. Interpretation of the data relied on participants as analysts, but also on systems thinking (systems diagramming and social trap theory) for theory building. This research moves away from mainstream economic and town planning perspectives by making use of methods and concepts from anthropology and visual sociology (photo-surveys) and from decision-making and ecological economics (multicriteria mapping and social trap theory). In sum, this research has experimented with different methods, outside their usual context, to analyse retail decline in a small town. It developed a conceptual model of retail decline and identified conflicting goals and interests, their implications for retail decline, and their causes; most of these potential causes have had little attention in the literature. The research also found that some of the measures commonly used to deal with retail decline may themselves be contributing to its causes. Additionally, it reviewed measures that can be used to deal with retail decline and their implications for policy-making, and reflected on the use of the data collection and analysis methods in the context of small to medium towns.
Abstract:
Contains abstract
Abstract:
Biofilm research is growing more diverse and dependent on high-throughput technologies, and the large-scale production of results aggravates data substantiation. In particular, it is often the case that experimental protocols are adapted to meet the needs of a particular laboratory and no statistical validation of the modified method is provided. This paper discusses the impact of intra-laboratory adaptation and non-rigorous documentation of experimental protocols on biofilm data interchange and validation. The case study is a non-standard, but widely used, workflow for Pseudomonas aeruginosa biofilm development, considering three analysis assays: the crystal violet (CV) assay for biomass quantification, the XTT assay for respiratory activity assessment, and the colony forming units (CFU) assay for determination of cell viability. The ruggedness of the protocol was assessed by introducing small changes in the biofilm growth conditions, which simulate minor protocol adaptations and non-rigorous protocol documentation. Results show that even minor variations in the biofilm growth conditions may affect the results considerably, and that the biofilm analysis assays lack repeatability. Intra-laboratory validation of non-standard protocols is found to be critical to ensure data quality and enable the comparison of results within and among laboratories.
Abstract:
Integrated master's dissertation in Biomedical Engineering
Abstract:
We construct estimates of educational attainment for a sample of OECD countries using previously unexploited sources. We follow a heuristic approach to obtain plausible time profiles for attainment levels by removing sharp breaks in the data that seem to reflect changes in classification criteria. We then construct indicators of the information content of our series and a number of previously available data sets and examine their performance in several growth specifications. We find a clear positive correlation between data quality and the size and significance of human capital coefficients in growth regressions. Using an extension of the classical errors-in-variables model, we construct a set of meta-estimates of the coefficient of years of schooling in an aggregate Cobb-Douglas production function. Our results suggest that, after correcting for measurement error bias, the value of this parameter is well above 0.50.
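The meta-estimation builds on the classical errors-in-variables result that measurement error attenuates OLS coefficients toward zero. A standard statement of that result (illustrative notation, not the paper's exact specification) is:

```latex
% Observed schooling S is the true value S* plus classification noise:
\[ S_{it} = S^{*}_{it} + \varepsilon_{it}, \qquad \varepsilon_{it} \perp S^{*}_{it} \]
% OLS on the noisy regressor is attenuated by the reliability ratio \lambda:
\[ \operatorname{plim}\,\hat{\beta}_{\mathrm{OLS}} = \lambda\,\beta, \qquad
   \lambda = \frac{\operatorname{Var}(S^{*})}{\operatorname{Var}(S^{*}) + \operatorname{Var}(\varepsilon)} \le 1 \]
% so a bias-corrected coefficient can be recovered as \beta = \hat{\beta}_{\mathrm{OLS}} / \lambda
% once the reliability ratio of each data set has been gauged.
```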
Abstract:
The paper investigates the role of real exchange rate misalignment in long-run growth for a set of ninety countries using time series data from 1980 to 2004. We first estimate a panel data model (using fixed and random effects) for the real exchange rate, with different model specifications, in order to produce estimates of the equilibrium real exchange rate, and this is then used to construct measures of real exchange rate misalignment. We also provide an alternative set of estimates of real exchange rate misalignment using panel cointegration methods. The variables used in our real exchange rate models are: real per capita GDP; net foreign assets; terms of trade; and government consumption. The results for the two-step System GMM panel growth models indicate that the coefficients for real exchange rate misalignment are positive for different model specifications and samples, which means that a more depreciated (appreciated) real exchange rate helps (harms) long-run growth. The estimated coefficients are higher for developing and emerging countries.
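A schematic of the two-step procedure described in the abstract, with illustrative notation (the exact controls and lag structure are not spelled out in the abstract):

```latex
% Step 1 - equilibrium real exchange rate from the panel model and the implied misalignment:
\[ \ln RER_{it} = \beta' X_{it} + u_{it}, \qquad
   X_{it} = (\text{real per capita GDP},\ \text{net foreign assets},\ \text{terms of trade},\ \text{gov.\ consumption}) \]
\[ MIS_{it} = \ln RER_{it} - \widehat{\ln RER}_{it} \]
% Step 2 - dynamic growth regression estimated by two-step System GMM:
\[ g_{it} = \alpha\, g_{i,t-1} + \gamma\, MIS_{it} + \delta' Z_{it} + \mu_i + \epsilon_{it},
   \qquad \gamma > 0 \;\Rightarrow\; \text{a more depreciated RER helps long-run growth} \]
```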