924 results for spatial data analysis
Abstract:
Dimensionality reduction is employed in visual data analysis as a way of obtaining reduced spaces for high-dimensional data or of mapping data directly into 2D or 3D spaces. Although techniques have evolved to improve data segregation in reduced or visual spaces, they have limited capabilities for adjusting the results according to the user's knowledge. In this paper, we propose a novel approach to handling both dimensionality reduction and visualization of high-dimensional data, taking the user's input into account. It employs Partial Least Squares (PLS), a statistical tool that retrieves latent spaces focusing on the discriminability of the data. The method employs a training set to build a highly precise model that can then be applied very effectively to a much larger data set. The reduced data set can be displayed using various existing visualization techniques. The training data is important for encoding the user's knowledge into the loop. However, this work also devises a strategy for calculating PLS reduced spaces when no training data is available. The approach produces increasingly precise visual mappings as the user feeds back his or her knowledge, and it is capable of working with small and unbalanced training sets.
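The idea of fitting PLS on a small labelled training set and then projecting a larger data set can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a single numeric class label (PLS1 via the NIPALS iteration), and the function names `fit_pls1` and `project` and the toy data are hypothetical.

```python
import math

def fit_pls1(X, y, n_components=2):
    """Fit PLS1 with NIPALS on a small training set (X: rows of features, y: labels)."""
    n, d = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(d)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(d)] for row in X]  # centered features
    yc = [v - y_mean for v in y]                                # centered labels
    weights, loadings = [], []
    for _ in range(n_components):
        # Weight vector: direction of maximal covariance between X and y.
        w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm < 1e-12:
            break  # nothing left in X that covaries with y
        w = [v / norm for v in w]
        t = [sum(Xc[i][j] * w[j] for j in range(d)) for i in range(n)]  # scores
        tt = sum(v * v for v in t)
        p = [sum(Xc[i][j] * t[i] for i in range(n)) / tt for j in range(d)]
        q = sum(yc[i] * t[i] for i in range(n)) / tt
        # Deflate X and y before extracting the next component.
        Xc = [[Xc[i][j] - t[i] * p[j] for j in range(d)] for i in range(n)]
        yc = [yc[i] - q * t[i] for i in range(n)]
        weights.append(w)
        loadings.append(p)
    return x_mean, weights, loadings

def project(model, x):
    """Map one new observation into the reduced (latent) space."""
    x_mean, weights, loadings = model
    r = [xj - mj for xj, mj in zip(x, x_mean)]
    scores = []
    for w, p in zip(weights, loadings):
        t = sum(rj * wj for rj, wj in zip(r, w))
        scores.append(t)
        r = [rj - t * pj for rj, pj in zip(r, p)]
    return scores

# Toy training set: the class depends on the first two features, the third is noise.
train_X = [[0.0, 0.1, 5.0], [0.2, 0.0, 1.0], [0.1, 0.2, 3.0],
           [1.0, 0.9, 4.0], [0.9, 1.1, 2.0], [1.1, 1.0, 3.0]]
train_y = [0, 0, 0, 1, 1, 1]
model = fit_pls1(train_X, train_y, n_components=2)
print(project(model, [0.05, 0.05, 4.5]))  # 2D coordinates usable for a scatter plot
```

The two scores returned by `project` can be used directly as 2D plot coordinates; in this toy case the first score separates the two classes because the fit is driven by the labels rather than by variance alone.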
Abstract:
In this work we propose a new approach for preliminary epidemiological studies on Standardized Mortality Ratios (SMRs) collected in many spatial regions. A preliminary study on SMRs aims to formulate hypotheses to be investigated via individual epidemiological studies that avoid the bias carried by aggregated analyses. Starting from the collection of disease counts and the calculation of expected disease counts by means of reference population disease rates, an SMR is derived in each area as the MLE under the Poisson assumption on each observation. Such estimators have high standard errors in small areas, i.e. where the expected count is low either because of the low population underlying the area or because of the rarity of the disease under study. Disease mapping models and other techniques for screening disease rates across the map, aiming to detect anomalies and possible high-risk areas, have been proposed in the literature under both the classic and the Bayesian paradigms. Our proposal approaches this issue with a decision-oriented method that focuses on multiple testing control, without, however, leaving the preliminary study perspective that an analysis of SMR indicators requires. We implement control of the False Discovery Rate (FDR), a quantity widely used to address multiple comparison problems in the field of microarray data analysis but not usually employed in disease mapping. Controlling the FDR means providing an estimate of the FDR for a set of rejected null hypotheses. The small areas issue raises difficulties in applying traditional methods for FDR estimation, which are usually based only on knowledge of the p-values (Benjamini and Hochberg, 1995; Storey, 2003). Tests evaluated by a traditional p-value provide weak power in small areas, where the expected number of disease cases is small. Moreover, tests cannot be assumed to be independent when spatial correlation between SMRs is expected, nor are they identically distributed when the population underlying the map is heterogeneous.
The Bayesian paradigm offers a way to overcome the inappropriateness of p-value-based methods. Another peculiarity of the present work is that it proposes a fully Bayesian hierarchical model for FDR estimation when testing many null hypotheses of absence of risk. We use concepts from Bayesian models for disease mapping, referring in particular to the Besag, York and Mollié model (1991), often used in practice for its flexible prior assumption on the distribution of risks across regions. The borrowing of strength between prior and likelihood that is typical of a hierarchical Bayesian model has the advantage of evaluating a single test (i.e. a test in a single area) by means of all observations in the map under study, rather than just by means of the single observation. This improves the power of the test in small areas and addresses more appropriately the spatial correlation issue, which suggests that relative risks are closer in spatially contiguous regions. The proposed model aims to estimate the FDR by means of the MCMC-estimated posterior probabilities $b_i$ of the null hypothesis (absence of risk) for each area. An estimate of the expected FDR conditional on the data ($\widehat{\mathrm{FDR}}$) can be calculated for any set of $b_i$'s relative to areas declared high-risk (where the null hypothesis is rejected) by averaging the $b_i$'s themselves. The $\widehat{\mathrm{FDR}}$ can be used to provide an easy decision rule for selecting high-risk areas, i.e. selecting as many areas as possible such that the $\widehat{\mathrm{FDR}}$ does not exceed a prefixed value; we call these $\widehat{\mathrm{FDR}}$-based decision (or selection) rules. The sensitivity and specificity of such rules depend on the accuracy of the FDR estimate: over-estimation of the FDR causes a loss of power, and under-estimation produces a loss of specificity. Moreover, our model retains the interesting feature of providing estimates of relative risk values, as in the Besag, York and Mollié model (1991).
A simulation study was set up to evaluate the model's performance in terms of FDR estimation accuracy, sensitivity and specificity of the decision rule, and goodness of estimation of the relative risks. We chose a real map from which we generated several spatial scenarios whose disease counts vary according to the degree of spatial correlation, the size of the areas, the number of areas where the null hypothesis is true and the risk level in the latter areas. In summarizing the simulation results we always consider FDR estimation in sets constituted by all areas whose $b_i$ is lower than a threshold $t$. We show graphs of the $\widehat{\mathrm{FDR}}$ and the true FDR (known by simulation) plotted against the threshold $t$ to assess the FDR estimation. By varying the threshold we can learn which FDR values can be accurately estimated by a practitioner willing to apply the model (judged by the closeness between the $\widehat{\mathrm{FDR}}$ and the true FDR). By plotting the calculated sensitivity and specificity (both known by simulation) against the $\widehat{\mathrm{FDR}}$ we can check the sensitivity and specificity of the corresponding $\widehat{\mathrm{FDR}}$-based decision rules. To investigate the degree of over-smoothing of the relative risk estimates we compare box-plots of such estimates in high-risk areas (known by simulation), obtained by both our model and the classic Besag, York and Mollié model. All the summary tools are worked out for all simulated scenarios (54 scenarios in total). Results show that the FDR is well estimated (in the worst case we get an over-estimation, hence a conservative FDR control) in scenarios with small areas, low risk levels and spatially correlated risks, which are our primary aims. In such scenarios we have good estimates of the FDR for all values less than or equal to 0.10. The sensitivity of $\widehat{\mathrm{FDR}}$-based decision rules is generally low but specificity is high. In such scenarios the use of a selection rule based on $\widehat{\mathrm{FDR}} = 0.05$ or $\widehat{\mathrm{FDR}} = 0.10$ can be suggested.
In cases where the number of true alternative hypotheses (the number of true high-risk areas) is small, FDR values of 0.15 are also well estimated, and decision rules based on $\widehat{\mathrm{FDR}} = 0.15$ gain power while maintaining a high specificity. On the other hand, in scenarios with non-small areas and non-small risk levels the FDR is under-estimated except for very small values of it (much lower than 0.05), resulting in a loss of specificity of a decision rule based on $\widehat{\mathrm{FDR}} = 0.05$. In such scenarios, decision rules based on $\widehat{\mathrm{FDR}} = 0.05$ or, even worse, $\widehat{\mathrm{FDR}} = 0.10$ cannot be suggested because the true FDR is actually much higher. As regards relative risk estimation, our model achieves almost the same results as the classic Besag, York and Mollié model. For this reason, our model is interesting for its ability to perform both the estimation of relative risk values and FDR control, except in scenarios with non-small areas and large risk levels. Finally, a case study is presented to show how the method can be used in epidemiology.
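The selection rule described in this abstract (average the posterior null probabilities of the rejected areas, and reject as many areas as possible while that average stays within a prefixed value) can be sketched as below. This is only an illustration under stated assumptions: the posterior probabilities are made-up numbers standing in for the output of an MCMC fit, and the function names are hypothetical.

```python
def estimated_fdr(null_probs, selected):
    """Estimated FDR of a rejection set: mean posterior null probability."""
    if not selected:
        return 0.0
    return sum(null_probs[i] for i in selected) / len(selected)

def select_high_risk(null_probs, max_fdr):
    """Largest set of areas whose estimated FDR does not exceed max_fdr.

    Areas are added in increasing order of posterior null probability, so the
    running mean is non-decreasing and the first violation ends the search.
    """
    order = sorted(range(len(null_probs)), key=lambda i: null_probs[i])
    selected, running_sum = [], 0.0
    for i in order:
        if (running_sum + null_probs[i]) / (len(selected) + 1) > max_fdr:
            break
        running_sum += null_probs[i]
        selected.append(i)
    return sorted(selected)

# Hypothetical posterior probabilities of "no excess risk" for six areas.
posterior_null = [0.01, 0.02, 0.30, 0.04, 0.60, 0.08]
high_risk = select_high_risk(posterior_null, max_fdr=0.05)
print(high_risk, estimated_fdr(posterior_null, high_risk))
```

Because the areas are taken in order of increasing null probability, each selected set is also a set of the form "all areas with probability below a threshold", which matches how the simulation study evaluates the estimate.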
Abstract:
The rotational nature of shifting cultivation poses several challenges to its detection by remote sensing. Consequently, there is a lack of spatial data on the dynamics of shifting cultivation landscapes on a regional, i.e. sub-national, or national level. We present an approach based on a time series of Landsat and MODIS data and landscape metrics to delineate the dynamics of shifting cultivation landscapes. Our results reveal that shifting cultivation is a land use system still widely and dynamically utilized in northern Laos. While there is an overall reduction in the areas dominated by shifting cultivation, some regions also show an expansion. A review of relevant reports and articles indicates that policies tend to lead to a reduction while market forces can result in both expansion and reduction. For a better understanding of the different factors affecting shifting cultivation landscapes in Laos, further research should focus on spatially explicit analyses.
Abstract:
A time series is a sequence of observations made over time. Examples in public health include daily ozone concentrations, weekly admissions to an emergency department or annual expenditures on health care in the United States. Time series models are used to describe the dependence of the response at each time on predictor variables including covariates and possibly previous values in the series. Time series methods are necessary to account for the correlation among repeated responses over time. This paper gives an overview of time series ideas and methods used in public health research.
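The correlation among repeated responses mentioned above can be made concrete with the sample autocorrelation at a given lag. The sketch below is a generic illustration, not taken from the paper; the function name and the toy series are hypothetical.

```python
def autocorrelation(series, lag=1):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    # Products of deviations that are `lag` time steps apart.
    num = sum(dev[t] * dev[t + lag] for t in range(n - lag))
    return num / denom

# A steadily rising series has positive lag-1 autocorrelation;
# a series that flips sign every step has negative lag-1 autocorrelation.
print(autocorrelation([1, 2, 3, 4, 5]))
print(autocorrelation([1, -1, 1, -1]))
```

A value far from zero indicates that consecutive observations are not independent, which is the situation time series models are designed to handle.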
Abstract:
The primary challenge in groundwater and contaminant transport modeling is obtaining the data needed for constructing, calibrating and testing the models. Large amounts of data are necessary for describing the hydrostratigraphy in areas with complex geology. Increasingly, states are making spatial data available that can be used as input to groundwater flow models. The appropriateness of these data for large-scale flow systems has not been tested. This study focuses on modeling a plume of 1,4-dioxane in a heterogeneous aquifer system in Scio Township, Washtenaw County, Michigan. The analysis consisted of: (1) characterization of the hydrogeology of the area and construction of a conceptual model based on publicly available spatial data, (2) development and calibration of a regional flow model for the site, (3) conversion of the regional model to a more highly resolved local model, (4) simulation of the dioxane plume, and (5) evaluation of the model's ability to simulate field data and estimation of the possible dioxane sources and subsequent migration until maximum concentrations are at or below the Michigan Department of Environmental Quality's residential cleanup standard for groundwater (85 ppb). The MODFLOW-2000 and MT3D programs were used to simulate the groundwater flow and the development and movement of the 1,4-dioxane plume, respectively. MODFLOW simulates transient groundwater flow in a quasi-3-dimensional sense, subject to a variety of boundary conditions that can simulate recharge, pumping, and surface-/groundwater interactions. MT3D simulates solute advection with groundwater flow (using the flow solution from MODFLOW), dispersion, source/sink mixing, and chemical reaction of contaminants. This modeling approach was successful at simulating the groundwater flows by calibrating recharge and hydraulic conductivities.
The plume transport was adequately simulated using literature dispersivity and sorption coefficients, although the plume geometries were not well constrained.
Abstract:
Nitrogen and water are essential for plant growth and development. In this study, we designed experiments to produce gene expression data of poplar roots under nitrogen starvation and water deprivation conditions. We found that a low concentration of nitrogen led first to increased root elongation, followed by lateral root proliferation and eventually increased root biomass. To identify genes regulating root growth and development under nitrogen starvation and water deprivation, we designed a series of data analysis procedures, through which we successfully identified biologically important genes. Differentially Expressed Genes (DEGs) analysis identified the genes that are differentially expressed under nitrogen starvation or drought. Protein domain enrichment analysis identified enriched themes (in the same domains) that are highly interactive during the treatment. Gene Ontology (GO) enrichment analysis allowed us to identify biological processes that changed during nitrogen starvation. Based on the above analyses, we examined the local Gene Regulatory Network (GRN) and identified a number of transcription factors. After testing, one of them proved to be a transcription factor ranked high in the hierarchy that affects root growth under nitrogen starvation. Analyzing gene expression data is very tedious and time-consuming. To avoid doing the analysis manually, we attempted to automate a computational pipeline that can now be used for the identification of DEGs and for protein domain analysis in a single run. It is implemented in Perl and R scripts.
Abstract:
Cloud computing provides a promising solution to the genomics data deluge problem resulting from the advent of next-generation sequencing (NGS) technology. Based on the concepts of “resources-on-demand” and “pay-as-you-go”, scientists with no or limited infrastructure can have access to scalable and cost-effective computational resources. However, the large size of NGS data causes a significant data transfer latency from the client’s site to the cloud, which presents a bottleneck for using cloud computing services. In this paper, we provide a streaming-based scheme to overcome this problem, where the NGS data is processed while being transferred to the cloud. Our scheme targets the wide class of NGS data analysis tasks, where the NGS sequences can be processed independently from one another. We also provide the elastream package that supports the use of this scheme with individual analysis programs or with workflow systems. Experiments presented in this paper show that our solution mitigates the effect of data transfer latency and saves both time and cost of computation.
New methods for quantification and analysis of quantitative real-time polymerase chain reaction data
Abstract:
Quantitative real-time polymerase chain reaction (qPCR) is a sensitive gene quantitation method that has been widely used in the biological and biomedical fields. The currently used methods for PCR data analysis, including the threshold cycle (CT) method and linear and non-linear model fitting methods, all require subtracting background fluorescence. However, the removal of background fluorescence is usually inaccurate and can therefore distort results. Here, we propose a new method, the taking-difference linear regression method, to overcome this limitation. Briefly, for each pair of consecutive PCR cycles, we subtracted the fluorescence in the former cycle from that in the latter cycle, transforming the n-cycle raw data into n-1 cycle data. Linear regression was then applied to the natural logarithm of the transformed data. Finally, amplification efficiencies and the initial numbers of DNA molecules were calculated for each PCR run. To evaluate this new method, we compared it in terms of accuracy and precision with the original linear regression method with three background corrections: the mean of cycles 1-3, the mean of cycles 3-7, and the minimum. Three criteria, namely threshold identification, max R2, and max slope, were employed to search for target data points. Considering that PCR data are time series data, we also applied linear mixed models. Collectively, when the threshold identification criterion was applied and when the linear mixed model was adopted, the taking-difference linear regression method was superior, as it gave an accurate estimation of the initial DNA amount and a reasonable estimation of PCR amplification efficiencies. When the criteria of max R2 and max slope were used, the original linear regression method gave an accurate estimation of the initial DNA amount. Overall, the taking-difference linear regression method avoids the error in subtracting an unknown background and is thus theoretically more accurate and reliable.
This method is easy to perform, and the taking-difference strategy can be extended to all current methods for qPCR data analysis.
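The taking-difference steps described in this abstract can be sketched as follows. The sketch assumes a simple model that the abstract does not spell out: fluorescence at cycle n equals a constant unknown background plus F0·E^n, where E is the per-cycle amplification factor. Under that assumption the difference between consecutive cycles is F0·E^n·(E − 1), so the background cancels and the logarithm of the differences is linear in n with slope ln E. The function name and the synthetic data are hypothetical.

```python
import math

def taking_difference_fit(fluorescence, first_cycle=1):
    """Estimate amplification factor E and initial signal F0 without background subtraction."""
    # n raw readings become n-1 differences; a constant background cancels out.
    diffs = [later - former for former, later in zip(fluorescence, fluorescence[1:])]
    cycles = [first_cycle + i for i in range(len(diffs))]
    logs = [math.log(d) for d in diffs]  # requires a strictly increasing signal
    # Ordinary least squares of ln(difference) on cycle number.
    n = len(cycles)
    mean_x = sum(cycles) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in cycles)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(cycles, logs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    efficiency = math.exp(slope)                         # E = exp(slope)
    initial = math.exp(intercept) / (efficiency - 1.0)   # intercept = ln(F0 * (E - 1))
    return efficiency, initial

# Synthetic exponential-phase data with a background the fit never sees explicitly.
background, f0, e = 0.5, 1e-3, 1.9
readings = [background + f0 * e ** n for n in range(1, 11)]
print(taking_difference_fit(readings))
```

On real runs only the exponential phase follows this model, so the choice of which cycles to pass in (the abstract's threshold identification, max R2 and max slope criteria) matters and is not covered by this sketch.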
Abstract:
The linear instability of the three-dimensional boundary layer over the HIFiRE-5 flight test geometry, i.e. a rounded-tip 2:1 elliptic cone, at Mach 7 has been analyzed through spatial BiGlobal analysis, in an effort to understand transition and accurately predict local heat loads on next-generation flight vehicles. The results at an intermediate axial section of the cone, $Re_x = 8 \times 10^5$, show three different families of spatially amplified linear global modes: the attachment-line and cross-flow modes known from earlier analyses, and a new global mode, peaking in the vicinity of the minor axis of the cone, termed the "center-line mode". We discover that a sequence of symmetric and anti-symmetric center-line modes exists and, for the basic flow at hand, these modes are maximally amplified around $F^* = 130$ kHz. The wavenumbers and spatial distribution of amplitude functions of the center-line modes are documented.
Abstract:
In coffee processing, the fermentation stage is considered one of the critical operations because of its impact on the final quality of the product. However, the level of control of the fermentation process on each farm is often inadequate, and the use of sensors for controlling coffee fermentation is not common. The objective of this work is to characterize the fermentation temperature in a fermentation tank by applying spatial interpolation and a new data analysis methodology based on phase space diagrams of temperature data, collected by means of multi-distributed, low-cost and autonomous wireless sensors. A real coffee fermentation was supervised in the Cauca region (Colombia) with a network of 24 semi-passive TurboTag RFID temperature loggers with a vacuum plastic cover, submerged directly in the fermenting mass. The temporal evolution and spatial distribution of temperature are described in terms of the phase diagram areas, which characterize the cyclic behaviour of temperature and highlight the significant heterogeneity of thermal conditions at different locations in the tank: the average temperature of the fermentation was 21.2 °C, although there were temperature ranges of 4.6 °C and an average spatial standard deviation of ±1.21 °C. In the upper part of the tank we found high heterogeneity of temperatures, with the higher temperatures and therefore the higher fermentation rates. At the bottom, the area computed in the phase diagram was practically half of the area occupied by the sensors of the upper tank, so this location showed higher temperature homogeneity.
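As a rough illustration of how a phase diagram area can summarize thermal variability, the sketch below builds a delay-coordinate diagram (temperature at time t against temperature one delay step later) and measures the area of its convex hull. Both choices are assumptions made for illustration; the abstract does not state how the diagram is constructed or how its area is computed, and the function names and sample readings are hypothetical.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def polygon_area(vertices):
    """Shoelace formula; returns 0.0 for fewer than three vertices."""
    total = 0.0
    for i in range(len(vertices)):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % len(vertices)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

def phase_area(temperatures, delay=1):
    """Area covered by the (T(t), T(t + delay)) trajectory of one sensor."""
    points = [(temperatures[i], temperatures[i + delay])
              for i in range(len(temperatures) - delay)]
    return polygon_area(convex_hull(points))

# Hypothetical logger readings in °C: a stable sensor and a fluctuating one.
stable = [21.0, 21.0, 21.0, 21.0, 21.0]
fluctuating = [20.0, 20.0, 21.0, 21.0, 20.0, 20.0, 21.0, 21.0, 20.0]
print(phase_area(stable), phase_area(fluctuating))
```

A sensor whose temperature barely changes traces a near-degenerate diagram with an area close to zero, while a sensor with pronounced cycles covers a larger region, which is the sense in which a smaller area indicates higher homogeneity.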
Abstract:
This work, titled «Analysis of the integration of INSPIRE resources coming from Spanish Spatial Data Infrastructure within the National Public Sector Information portal» (original Spanish title: «Una aproximación a la integración en Open Data de los recursos Inspire de la IDEE»), aims to build a bridge between Spatial Data Infrastructures (SDI) and the world of "Open Data", taking advantage of the legal framework on the Re-use of Public Sector Information (PSI). After analyzing what PSI re-use and, in particular, Open Data are, and how they are implemented by different administrations, the work studies the technical and legal requirements needed to build the "translator" that channels SDI information into the central Spanish portal for PSI re-use, datos.gob.es, giving greater visibility to INSPIRE resources. The work focuses specifically on two points: first, providing and documenting the technical solution that serves primarily for the National Geographic Institute (Instituto Geográfico Nacional) to supply its resources to datos.gob.es more efficiently; second, studying the applicability of the same solution to the whole Spanish SDI (IDEE), noting problems detected in the analysis of its content and suggesting recommendations to minimize problems in its potential re-use.