919 results for exploratory spatial data analysis
Abstract:
Nitrogen and water are essential for plant growth and development. In this study, we designed experiments to produce gene expression data from poplar roots under nitrogen starvation and water deprivation. We found that low nitrogen concentration led first to increased root elongation, followed by lateral root proliferation and eventually increased root biomass. To identify genes regulating root growth and development under nitrogen starvation and water deprivation, we designed a series of data analysis procedures, through which we identified biologically important genes. Differentially expressed gene (DEG) analysis identified the genes that are differentially expressed under nitrogen starvation or drought. Protein domain enrichment analysis identified enriched themes (within the same domains) that are highly interactive during the treatment. Gene Ontology (GO) enrichment analysis allowed us to identify biological processes that changed during nitrogen starvation. Based on these analyses, we examined the local gene regulatory network (GRN) and identified a number of transcription factors. Testing showed that one of them ranks high in the regulatory hierarchy and affects root growth under nitrogen starvation. Analyzing gene expression data manually is tedious and time-consuming. To avoid this, we have worked to automate a computational pipeline, which can now be used to identify DEGs and perform protein domain analysis in a single run. It is implemented in Perl and R scripts.
DIMENSION REDUCTION FOR POWER SYSTEM MODELING USING PCA METHODS CONSIDERING INCOMPLETE DATA READINGS
Abstract:
Principal Component Analysis (PCA) is a popular dimension-reduction method used in many fields, including data compression, image processing, and exploratory data analysis. However, traditional PCA has several drawbacks: it is inefficient for high-dimensional data, and it cannot compute sufficiently accurate principal components when a relatively large portion of the data is missing. In this report, we propose using the EM-PCA method for dimension reduction of power system measurements with missing data, and we provide a comparative study of traditional PCA and EM-PCA. Our extensive experimental results show that EM-PCA is more effective and more accurate than traditional PCA for dimension reduction of power system measurement data when a large portion of the data set is missing.
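The idea of PCA on incomplete readings can be illustrated with a minimal Python sketch. It assumes an iterative SVD-imputation variant of EM-style PCA: missing entries (marked as NaN) start at their column means and are repeatedly replaced by a low-rank reconstruction. The function name `em_pca` and its parameters are hypothetical, and this is not necessarily the exact EM-PCA formulation evaluated in the report.

```python
import numpy as np

def em_pca(X, n_components, n_iter=500, tol=1e-9):
    """EM-style PCA sketch for a matrix whose missing entries are NaN.

    Assumption: alternates between fitting principal axes to the completed
    data and re-estimating the missing entries from the rank-k reconstruction.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Initial guess: column mean of the observed values.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        mean = filled.mean(axis=0)
        # Principal axes of the currently completed data.
        _, _, vt = np.linalg.svd(filled - mean, full_matrices=False)
        components = vt[:n_components]
        # Rank-k reconstruction used to update the missing entries only.
        recon = (filled - mean) @ components.T @ components + mean
        change = np.max(np.abs(recon[missing] - filled[missing])) if missing.any() else 0.0
        filled[missing] = recon[missing]
        if change < tol:
            break
    return components, filled
```

Observed entries are never modified; only the NaN positions are updated, so the returned `filled` matrix can be passed to any downstream routine that expects complete data.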
Abstract:
Cloud computing provides a promising solution to the genomics data deluge problem resulting from the advent of next-generation sequencing (NGS) technology. Based on the concepts of “resources-on-demand” and “pay-as-you-go”, scientists with no or limited infrastructure can have access to scalable and cost-effective computational resources. However, the large size of NGS data causes a significant data transfer latency from the client’s site to the cloud, which presents a bottleneck for using cloud computing services. In this paper, we provide a streaming-based scheme to overcome this problem, where the NGS data is processed while being transferred to the cloud. Our scheme targets the wide class of NGS data analysis tasks, where the NGS sequences can be processed independently from one another. We also provide the elastream package that supports the use of this scheme with individual analysis programs or with workflow systems. Experiments presented in this paper show that our solution mitigates the effect of data transfer latency and saves both time and cost of computation.
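The principle of processing sequences while they are still arriving can be sketched in Python as follows. This is an illustration only and does not use or reproduce the elastream package: the function names are hypothetical, the input is assumed to be simple four-line FASTQ records delivered as text chunks, and GC content stands in for any per-sequence analysis that can run independently.

```python
def iter_fastq_records(chunks):
    """Yield (header, sequence, quality) as soon as a full 4-line record has arrived."""
    buffer = ""
    lines = []
    for chunk in chunks:
        buffer += chunk
        # Everything before the last newline is complete; keep the remainder.
        *complete, buffer = buffer.split("\n")
        for line in complete:
            lines.append(line)
            if len(lines) == 4:
                yield lines[0], lines[1], lines[3]
                lines = []
    # Handle a final record that is not terminated by a newline.
    if buffer:
        lines.append(buffer)
    if len(lines) == 4:
        yield lines[0], lines[1], lines[3]

def gc_content(sequence):
    """Fraction of G and C bases; an example of a per-read, independent task."""
    return sum(base in "GC" for base in sequence) / len(sequence) if sequence else 0.0

def stream_gc(chunks):
    """Process each read while later chunks are still in transit."""
    for header, sequence, _quality in iter_fastq_records(chunks):
        yield header, gc_content(sequence)
```

Because `stream_gc` is a generator, each read is analyzed as soon as its last line arrives, regardless of where the chunk boundaries fall.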
Abstract:
The purpose of this comparative analysis of CHIP Perinatal policy (42 CFR § 457) was to provide a basis for understanding the variation in policy outputs across the twelve states that, as of June 2007, implemented the Unborn Child rule. This Department of Health and Human Services regulation expanded in 2002 the definition of “child” to include the period from conception to birth, allowing states to consider an unborn child a “targeted low-income child” and therefore eligible for SCHIP coverage.
Specific study aims were to (1) describe typologically the structural and contextual features of the twelve states that adopted a CHIP Perinatal policy; (2) describe and differentiate among the various designs of CHIP Perinatal policy implemented in the states; and (3) develop a conceptual model that links the structural and contextual features of the adopting states to differences in the forms the policy assumed once it was implemented.
Secondary data were collected from publicly available information sources to describe characteristics of states’ political system, health system, economic system, sociodemographic context and implemented policy attributes. I posited that sociodemographic differences, political system differences and health system differences would directly account for the observed differences in policy output among the states.
Exploratory data analysis techniques, which included median polishing and multidimensional scaling (MDS), were employed to identify compelling patterns in the data. Scaled results across model components showed that economic system was most closely related to policy output, followed by health system. Political system and sociodemographic characteristics were shown to be weakly associated with policy output. Goodness-of-fit measures for MDS solutions implemented across states and model components, in one and two dimensions, were very good.
This comparative policy analysis of twelve states that adopted and implemented HHS Regulation 42 CFR § 457 contributes to existing knowledge in three areas: CHIP Perinatal policy, public health policy and policy sciences. First, the framework allows for the identification of CHIP Perinatal program design possibilities and provides a basis for future studies that evaluate policy impact or performance. Second, studies of policy determinants are not well represented in the health policy literature. Thus, this study contributes to the development of the literature in public health policy. Finally, the conceptual framework for policy determinants developed in this study suggests new ways for policy makers and practitioners to frame policy arguments, encouraging policy change or reform.
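Median polishing, one of the exploratory techniques named in this abstract, can be sketched in Python as follows. This is a generic illustration of Tukey's procedure for a two-way table (for example, states by indicators), not the study's own code or data; the function name and parameters are hypothetical.

```python
import numpy as np

def median_polish(table, n_iter=10, tol=1e-9):
    """Decompose a two-way table as overall + row effect + column effect + residual."""
    resid = np.asarray(table, dtype=float).copy()
    overall = 0.0
    row_eff = np.zeros(resid.shape[0])
    col_eff = np.zeros(resid.shape[1])
    for _ in range(n_iter):
        # Sweep row medians out of the residuals and into the row effects.
        row_med = np.median(resid, axis=1)
        resid -= row_med[:, None]
        row_eff += row_med
        shift = np.median(col_eff)
        col_eff -= shift
        overall += shift
        # Sweep column medians out of the residuals and into the column effects.
        col_med = np.median(resid, axis=0)
        resid -= col_med[None, :]
        col_eff += col_med
        shift = np.median(row_eff)
        row_eff -= shift
        overall += shift
        if np.max(np.abs(row_med)) < tol and np.max(np.abs(col_med)) < tol:
            break
    return overall, row_eff, col_eff, resid
```

At every step the identity `table = overall + row_eff[i] + col_eff[j] + resid[i, j]` holds, so large residuals point to cells that an additive row-plus-column description does not explain.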
Abstract:
When choosing among models to describe categorical data, the need to consider interactions makes selection more difficult. With just four variables, considering all interactions, there are 166 different hierarchical models and many more non-hierarchical models. Two procedures have been developed for categorical data which produce the "best" subset or subsets of each model size, where size refers to the number of effects in the model. Both procedures are patterned after the Leaps and Bounds approach used by Furnival and Wilson for continuous data and do not generally require fitting all models. For hierarchical models, likelihood ratio statistics ($G^2$) are computed using iterative proportional fitting, and "best" is determined by comparing, among models with the same number of effects, $\Pr(\chi^2_k \ge G^2_{ij})$, where $k$ is the degrees of freedom for the $i$th model of size $j$. To fit non-hierarchical as well as hierarchical models, a weighted least squares procedure has been developed.
The procedures are applied to published occupational data relating to the occurrence of byssinosis. These results are compared to previously published analyses of the same data. The procedures are also applied to published data on symptoms in psychiatric patients and again compared to previously published analyses.
These procedures will make categorical data analysis more accessible to researchers who are not statisticians. The procedures should also encourage more complex exploratory analyses of epidemiologic data and contribute to the development of new hypotheses for study.
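The two building blocks named above, iterative proportional fitting and the likelihood ratio statistic, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the dissertation's procedure: the function names are hypothetical, a model is specified by the tuples of axes whose observed margins must be reproduced, all fitted margins are assumed to be positive, and the model search itself is not shown. The statistic uses the standard form $G^2 = 2\sum O \ln(O/E)$.

```python
import numpy as np

def ipf_fit(observed, margins, n_iter=100, tol=1e-10):
    """Fit a hierarchical log-linear model by iterative proportional fitting.

    margins: list of axis tuples, e.g. [(0,), (1,)] for independence in a
    two-way table or [(0, 1), (0, 2), (1, 2)] for no three-factor interaction.
    """
    observed = np.asarray(observed, dtype=float)
    fitted = np.full(observed.shape, observed.sum() / observed.size)
    all_axes = tuple(range(observed.ndim))
    for _ in range(n_iter):
        previous = fitted.copy()
        for keep in margins:
            # Rescale so the fitted margin over `keep` matches the observed one.
            drop = tuple(ax for ax in all_axes if ax not in keep)
            target = observed.sum(axis=drop, keepdims=True)
            current = fitted.sum(axis=drop, keepdims=True)
            fitted = fitted * target / current
        if np.max(np.abs(fitted - previous)) < tol:
            break
    return fitted

def g_squared(observed, fitted):
    """Likelihood ratio statistic G^2 = 2 * sum(O * ln(O / E)) over cells with O > 0."""
    observed = np.asarray(observed, dtype=float)
    positive = observed > 0
    return 2.0 * np.sum(observed[positive] * np.log(observed[positive] / fitted[positive]))
```

The tail probability $\Pr(\chi^2_k \ge G^2)$ would then be read from a chi-square distribution with the model's residual degrees of freedom; that step is omitted to keep the sketch dependent on NumPy only.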
Abstract:
These three manuscripts are presented as a PhD dissertation on the use of GeoVis applications to evaluate telehealth programs. The primary aim of this research was to understand how GeoVis applications can be designed and developed by combining the Human Centered (HC) approach with Cognitive Fit Theory (CFT), and in turn used to evaluate a telehealth program in Brazil.
First manuscript
The first manuscript presents background on the use of GeoVisualization to facilitate visual exploration of public health data. It covers the challenges associated with adopting existing GeoVis applications. It combines the principles of the HC approach and CFT into a framework that lays the foundation of this research. The framework is then used to propose the design, development and evaluation of “the SanaViz” to evaluate telehealth data in Brazil, as a proof of concept.
Second manuscript
The second manuscript is a methods paper that describes the approaches that can be employed to design and develop “the SanaViz” based on the proposed framework. After defining the various elements of the HC approach and CFT, it applies a mixed-methods approach built on card sorting and sketching techniques. A representative sample of 20 study participants currently involved in the telehealth program at the NUTES telehealth center at UFPE, Recife, Brazil was enrolled. The findings helped us understand the needs of the diverse group of telehealth users and the tasks that they perform, and helped us determine the essential features that might need to be included in the proposed GeoVis application “the SanaViz”.
Third manuscript
The third manuscript used a mixed-methods approach to compare the effectiveness and usefulness of the HC GeoVis application “the SanaViz” against a conventional GeoVis application, “Instant Atlas”. The same group of 20 study participants who had participated during Aim 2 was enrolled, and a combination of quantitative and qualitative assessments was carried out. Effectiveness was gauged by the time that the participants took to complete the tasks using both GeoVis applications, the ease with which they completed the tasks, and the number of attempts taken to complete each task. Usefulness was assessed with the System Usability Scale (SUS), a validated questionnaire tested in prior studies. In-depth interviews were conducted to gather opinions about both GeoVis applications. This manuscript demonstrated the usefulness and effectiveness of HC GeoVis applications for visual exploration of telehealth data, as a proof of concept.
Together, these three manuscripts represent the challenges of combining the principles of the HC approach and CFT to design and develop GeoVis applications as a method to evaluate telehealth data. To our knowledge, this is the first study to explore the usefulness and effectiveness of GeoVis to facilitate visual exploration of telehealth data. The results of the research enabled us to develop a framework for the design and development of GeoVis applications related to public health and especially telehealth. The results revealed the variety of users involved in the telehealth program and the tasks that they performed. They also enabled us to identify the components that might be essential to include in these GeoVis applications.
The results of our research yielded the following findings:
(a) Telehealth users vary in their level of understanding of GeoVis.
(b) Interaction features such as zooming, sorting, linking and multiple views, and representation features such as bar charts and choropleth maps, were considered the most essential features of the GeoVis applications.
(c) Comparing and sorting were two important tasks that the telehealth users would perform for exploratory data analysis.
(d) An HC GeoVis prototype application is more effective and useful for exploration of telehealth data than a conventional GeoVis application.
Future studies should incorporate the proposed HC GeoVis framework to enable comprehensive assessment of the users and the tasks they perform, and so identify the features that might need to be part of GeoVis applications. The results of this study demonstrate a novel approach to comprehensively and systematically enhance the evaluation of telehealth programs using the proposed GeoVis framework.
New methods for quantification and analysis of quantitative real-time polymerase chain reaction data
Abstract:
Quantitative real-time polymerase chain reaction (qPCR) is a sensitive gene quantitation method that has been widely used in the biological and biomedical fields. The currently used methods for PCR data analysis, including the threshold cycle (CT) method and linear and non-linear model fitting methods, all require subtracting background fluorescence. However, the removal of background fluorescence is usually inaccurate and can therefore distort results. Here, we propose a new method, the taking-difference linear regression method, to overcome this limitation. Briefly, for each pair of consecutive PCR cycles, we subtracted the fluorescence in the former cycle from that in the latter cycle, transforming the n-cycle raw data into n-1 cycles of data. Linear regression was then applied to the natural logarithm of the transformed data. Finally, amplification efficiencies and the initial DNA molecule numbers were calculated for each PCR run. To evaluate this new method, we compared it in terms of accuracy and precision with the original linear regression method under three background corrections: the mean of cycles 1-3, the mean of cycles 3-7, and the minimum. Three criteria (threshold identification, max R², and max slope) were employed to search for target data points. Considering that PCR data are time series data, we also applied linear mixed models. Collectively, when the threshold identification criterion was applied and when the linear mixed model was adopted, the taking-difference linear regression method was superior, as it gave an accurate estimation of the initial DNA amount and a reasonable estimation of PCR amplification efficiencies. When the max R² and max slope criteria were used, the original linear regression method gave an accurate estimation of the initial DNA amount. Overall, the taking-difference linear regression method avoids the error in subtracting an unknown background and is thus theoretically more accurate and reliable.
This method is easy to perform, and the taking-difference strategy can be extended to all current methods for qPCR data analysis.
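The taking-difference step can be sketched in Python under an explicit assumption that the abstract does not state: fluorescence in the exponential phase follows $F_c = F_0 E^c + B$, with a constant unknown background $B$ and a per-cycle amplification factor $E$ (one plus the efficiency). Differencing consecutive cycles removes $B$, since $F_{c+1} - F_c = F_0 (E - 1) E^c$, so the logarithm of the differences is linear in $c$ with slope $\ln E$ and intercept $\ln(F_0 (E - 1))$. The function name and parameters are hypothetical, and the selection of target data points described above is not reproduced.

```python
import numpy as np

def taking_difference_fit(fluorescence, first_cycle=1):
    """Estimate amplification factor E and initial signal F0 without background subtraction.

    Assumes F_c = F0 * E**c + B over the supplied cycles (exponential phase only).
    """
    f = np.asarray(fluorescence, dtype=float)
    diffs = np.diff(f)  # n readings become n-1 differences; B cancels out
    if np.any(diffs <= 0):
        raise ValueError("differences must be positive to take logarithms")
    cycles = first_cycle + np.arange(diffs.size)
    # Straight-line fit: ln(diff) = ln(F0 * (E - 1)) + cycle * ln(E)
    slope, intercept = np.polyfit(cycles, np.log(diffs), 1)
    amplification = np.exp(slope)
    f0 = np.exp(intercept) / (amplification - 1.0)
    return amplification, f0
```

On noise-free synthetic data generated from the assumed model, the fit recovers $E$ and $F_0$ regardless of the value of $B$, which is the property the abstract attributes to the method.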
Abstract:
The analysis of time-dependent data is an important problem in many application domains, and interactive visualization can help in understanding patterns in large time series data. Many effective approaches already exist for visual analysis of univariate time series, supporting tasks such as assessment of data quality, detection of outliers, or identification of periodically or frequently occurring patterns. However, far fewer approaches support multivariate time series. The existence of multiple values per time stamp makes the analysis task harder per se, and existing visualization techniques often do not scale well. We introduce an approach for visual analysis of large multivariate time-dependent data, based on the idea of projecting multivariate measurements to a 2D display and visualizing the time dimension by trajectories. We use visual data aggregation metaphors based on grouping of similar data elements to scale with multivariate time series. Aggregation procedures can be based either on statistical properties of the data or on data clustering routines. Appropriately defined user controls allow users to navigate and explore the data and to interactively steer the parameters of the data aggregation to enhance data analysis. We present an implementation of our approach and apply it to a comprehensive data set from the field of earth observation, demonstrating the applicability and usefulness of our approach.
Abstract:
The linear instability of the three-dimensional boundary layer over the HIFiRE-5 flight test geometry, i.e. a rounded-tip 2:1 elliptic cone, at Mach 7, has been analyzed through spatial BiGlobal analysis, in an effort to understand transition and accurately predict local heat loads on next-generation flight vehicles. The results at an intermediate axial section of the cone, $Re_x = 8 \times 10^5$, show three different families of spatially amplified linear global modes: the attachment-line and cross-flow modes known from earlier analyses, and a new global mode, peaking in the vicinity of the minor axis of the cone, termed the “center-line mode”. We discover that a sequence of symmetric and anti-symmetric centerline modes exists and that, for the basic flow at hand, these modes are maximally amplified around F* = 130 kHz. The wavenumbers and spatial distribution of amplitude functions of the centerline modes are documented.
Abstract:
In coffee processing, the fermentation stage is considered one of the critical operations because of its impact on the final quality of the product. However, the level of control of the fermentation process on each farm is often inadequate, and the use of sensors for controlling coffee fermentation is not common. The objective of this work is to characterize the fermentation temperature in a fermentation tank by applying spatial interpolation and a new data analysis methodology based on phase space diagrams of temperature data, collected by means of multi-distributed, low-cost, autonomous wireless sensors. A real coffee fermentation was monitored in the Cauca region (Colombia) with a network of 24 semi-passive TurboTag RFID temperature loggers with vacuum plastic covers, submerged directly in the fermenting mass. The temporal evolution and spatial distribution of temperature are described in terms of the phase diagram areas, which characterize the cyclic behaviour of temperature and highlight the significant heterogeneity of thermal conditions at different locations in the tank. The average fermentation temperature was 21.2 °C, with temperature ranges of 4.6 °C and an average spatial standard deviation of ±1.21 °C. In the upper part of the tank we found high temperature heterogeneity, higher temperatures and therefore higher fermentation rates. At the bottom, the computed phase diagram area was practically half of the area occupied by the sensors in the upper tank; this location therefore showed higher temperature homogeneity.
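The notion of a phase diagram area can be illustrated with a short Python sketch. The abstract does not specify how the phase space is constructed, so this sketch assumes a time-delay embedding, plotting $T(t)$ against $T(t + \tau)$, and measures the area enclosed by one closed cycle with the shoelace formula. The function name and the choice of lag are hypothetical.

```python
import numpy as np

def phase_area(temperatures, lag):
    """Area enclosed by the delay-embedded trajectory (T(t), T(t + lag)).

    Assumes the supplied samples trace approximately one closed cycle.
    """
    t = np.asarray(temperatures, dtype=float)
    x, y = t[:-lag], t[lag:]
    # Shoelace formula for the polygon through the points (x_i, y_i).
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

# Synthetic example: one temperature cycle sampled 360 times, quarter-period lag.
n, lag = 360, 90
phase = 2 * np.pi * np.arange(n + lag) / n
upper_sensor = 21.2 + 2.0 * np.sin(phase)  # wider swing
lower_sensor = 21.2 + 1.0 * np.sin(phase)  # narrower swing
```

With a quarter-period lag a sinusoid of amplitude $A$ traces a circle of radius $A$, so the area is close to $\pi A^2$; a sensor with a smaller temperature swing encloses a smaller area, matching the interpretation of small areas as thermal homogeneity.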
Abstract:
This work, titled «Una aproximación a la integración en Open Data de los recursos Inspire de la IDEE» ("Analysis of the integration of INSPIRE resources coming from the Spanish Spatial Data Infrastructure within the National Public Sector Information portal"), aims to build a bridge between Spatial Data Infrastructures (SDI) and the world of "Open Data", taking advantage of the legal framework on the Re-use of Public Sector Information (PSI; RISP in Spanish). After analyzing what PSI re-use is, and Open Data in particular, and how it is implemented by different administrations, the work studies the technical and legal requirements needed to build the "translator" that will channel SDI information into datos.gob.es, the central Spanish portal for PSI re-use, giving greater visibility to INSPIRE resources. The work focuses on two aspects: first, providing and documenting the technical solution that will initially allow the National Geographic Institute (Instituto Geográfico Nacional) to supply its resources to datos.gob.es more efficiently; second, studying the applicability of the same solution to the whole Spanish SDI (IDEE), noting problems detected in the analysis of its content and suggesting recommendations to minimize problems in its potential re-use.
Abstract:
The large amount of data recorded daily in organizations' database systems has created the need to analyze it. However, organizations face the complexity of processing huge volumes of data with traditional methods of analysis. Moreover, in a globalized and competitive context, organizations constantly seek to improve their processes, for which they need tools that allow them to make better decisions. This means being better informed and knowing their digital history in order to describe their processes and anticipate (predict) unforeseen events. These new data analysis requirements have led to the increasing development of data mining projects. The data mining process seeks to obtain, from a massive data set, models that describe the data or predict new instances in the set. It involves data preparation, followed by partially or fully automated processing to identify models in the data, and finally produces patterns, relationships or rules as output. This output must represent new knowledge for the organization, useful and understandable to end users, that can be integrated into processes to support decision-making. However, the greatest difficulty is precisely enabling the data analyst, who takes part in this whole process, to identify models; this is a complex task that often requires the experience not only of the data analyst but also of the expert in the problem domain. One way to support the analysis of data, models and patterns is through visual representation, which draws on the capabilities of human visual perception to detect patterns more easily. Under this approach, visualization has been used in data mining mostly for descriptive (exploratory) analysis of the data (input) and for presenting the patterns (output), leaving this paradigm of limited use for analyzing models. This document describes the development of the doctoral thesis entitled “Nuevos Esquemas de Visualizaciones para Mejorar la Comprensibilidad de Modelos de Data Mining” ("New Visualization Schemes to Improve the Understandability of Data-Mining Models"). This research aims to contribute a visualization approach that supports the understanding of data mining models, for which it proposes the metaphor of visually augmented models.
Abstract:
8 pages, 2 figures, to be published in the conference proceedings of 11th international conference "Computer Data Analysis & Modeling 2016"
Abstract:
The spatial data set delineates areas with similar environmental properties regarding soil, terrain morphology, climate and affiliation to the same administrative unit (NUTS3 or units of comparable size) at a minimum pixel size of 1 km². The purpose of developing this data set is to provide a link between spatial environmental information (e.g. soil properties) and statistical data (e.g. crop distribution) available at administrative level. Assessing the impact of agricultural management on emissions of pollutants or radiatively active gases, or analyzing the influence of agricultural management on the supply of ecosystem services, requires proper spatial coincidence of the driving factors. The HSU data set provides, for example, the link between the agro-economic model CAPRI and the biophysical assessment of environmental impacts (updating the previous spatial units, Leip et al. 2008) for the analysis of policy scenarios. Recently, a statistical model to disaggregate crop information available from regional statistics to the HSU has been developed (Lamboni et al. 2016). The HSU data set consists of the spatial layers, provided in vector and raster format, as well as attribute tables with information on the properties of the HSU. All input data for the delineation of the HSU are publicly available. For some parameters, the attribute tables provide the link between the HSU data set and, e.g., the soil map(s) rather than the data itself. The HSU data set is closely linked to the USCIE data set.
Abstract:
Mode of access: Internet.