920 results for Multivariate data analysis
Abstract:
In recent decades, the development and use of Geographic Information Systems (GIS) and satellite positioning systems (GPS) have been promoted to improve the productive efficiency of extensive cropping systems in agronomic, economic, and environmental terms. These new technologies make it possible to measure the spatial variability of site properties, such as apparent electrical conductivity and other terrain attributes, as well as their effect on the spatial distribution of yields. Site-specific management can then be applied within fields to improve the efficiency of agrochemical inputs, environmental protection, and the sustainability of rural life. A wide range of precision agriculture technologies is now available to capture spatial variation across sites within a field. Optimal use of the large volume of data generated by precision agriculture machinery depends strongly on the capacity to explore the information on the complex interactions that underlie productive outcomes. The spatial covariation of site properties and crop yields has been studied with classical geostatistical models based on the theory of regionalized variables. New developments in contemporary statistical modelling, notably linear mixed models, are promising tools for handling spatially correlated data. Moreover, given the multivariate nature of the many variables recorded at each site, multivariate analysis techniques could provide valuable information for visualizing and exploiting georeferenced data. Understanding the agronomic basis of the complex interactions that occur at the scale of production fields is now possible with these new technologies. The objectives of this project are: (1) to develop methodological strategies, based on the combination of multivariate and geostatistical techniques, for classifying within-field sites and studying the interdependencies between site variables and yield; and (2) to propose alternative mixed models, based on spatial correlation functions for the error terms, that allow exploring spatial correlation patterns of within-field yields and of soil properties in the delimited sites.
Over the last few decades, the use and development of Geographical Information Systems (GIS) and Satellite Positioning Systems (GPS) have been strongly promoted in cropping systems. Such technologies allow measuring the spatial variability of site properties, including electrical conductivity and other soil features, as well as their impact on the spatial variability of yields. Therefore, site-specific management could be applied to improve the efficiency of agrochemical use, environmental protection, and the sustainability of rural life. Currently, there is a wide offer of technological resources to capture spatial variation across sites within a field. However, the optimum use of data coming from precision agriculture machinery strongly depends on the capability to explore the information about the complex interactions underlying the productive outputs.
The covariation between spatial soil properties and yields from georeferenced data has been treated graphically or with standard geostatistical approaches. New statistical modelling capabilities within the linear mixed model framework are promising for dealing with correlated data such as those produced by precision agriculture. Moreover, given the multivariate nature of the multiple data collected at each site, several multivariate statistical approaches could be crucial tools for the analysis of georeferenced data. Understanding the basis of the complex interactions at the scale of production fields is now within reach with the use of these new techniques. Our main objectives are: (1) to develop new statistical strategies, based on the complementarity of geostatistics and multivariate methods, to classify sites within fields grown with grain crops and to analyze the interrelationships among several soil and yield variables; and (2) to propose mixed linear models to predict yield according to spatial soil variability and to build contour maps that promote a more sustainable agriculture.
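The geostatistical side of such work can be illustrated with a short sketch: an empirical semivariogram of georeferenced yield-monitor points and an exponential model fitted to it. This is a generic illustration and not the project's code; the file name and column names (x, y, yield_t_ha) are hypothetical.

```python
# Sketch: empirical semivariogram of georeferenced yield data plus an exponential
# model fit, as one way to quantify within-field spatial correlation.
# File and column names are hypothetical.
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit
from scipy.spatial.distance import pdist

pts = pd.read_csv("yield_monitor.csv")            # hypothetical yield-monitor export
coords = pts[["x", "y"]].to_numpy()
z = pts["yield_t_ha"].to_numpy()

h = pdist(coords)                                 # pairwise distances between points
g = pdist(z[:, None], metric="sqeuclidean") / 2.0 # semivariance of each pair

# Average the pairwise semivariances within distance bins (up to the median lag)
edges = np.linspace(0.0, np.quantile(h, 0.5), 15)
idx = np.digitize(h, edges)
lag = np.array([h[idx == i].mean() for i in range(1, len(edges))])
gamma = np.array([g[idx == i].mean() for i in range(1, len(edges))])
ok = ~np.isnan(gamma)

def exponential(d, nugget, sill, rng):
    return nugget + sill * (1.0 - np.exp(-d / rng))

(nugget, sill, rng), _ = curve_fit(exponential, lag[ok], gamma[ok],
                                   p0=[0.0, gamma[ok].max(), lag[ok].mean()])
print(f"nugget={nugget:.3f}  partial sill={sill:.3f}  range={rng:.1f}")
```

The fitted range gives a rough idea of the distance over which yields remain correlated, which is the kind of information a spatial correlation function for the error terms of a mixed model would encode.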
Abstract:
This thesis describes a search for very high energy (VHE) gamma-ray emission from the starburst galaxy IC 342. The analysis was based on data from the 2003-2004 observing season recorded using the Whipple 10-metre imaging atmospheric Cherenkov telescope located on Mount Hopkins in southern Arizona. IC 342 may be classed as a non-blazar type galaxy and to date only a few such galaxies (M 87, Cen A, M 82 and NGC 253) have been detected as VHE gamma-ray sources. Analysis of approximately 24 hours of good quality IC 342 data, consisting entirely of ON/OFF observations, was carried out using a number of methods (standard Supercuts, optimised Supercuts, scaled optimised Supercuts and the multivariate kernel analysis technique). No evidence for TeV gamma-ray emission from IC 342 was found. The significance was 0.6 σ with a nominal rate of 0.04 ± 0.06 gamma rays per minute. The flux upper limit above 600 GeV (at 99.9 % confidence) was determined to be 5.5 × 10⁻⁸ m⁻² s⁻¹, corresponding to 8 % of the Crab Nebula flux in the same energy range.
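For context, ON/OFF significances of this kind are conventionally computed with the Li & Ma (1983) formula; the sketch below shows that calculation together with a simple excess rate. The event counts, exposure, and the use of this particular estimator are illustrative assumptions, not values or choices taken from the thesis.

```python
# Sketch: excess rate and Li & Ma (1983, Eq. 17) significance for matched ON/OFF
# Cherenkov observations. Counts and exposure are placeholders, not thesis values.
import numpy as np

def li_ma_significance(n_on, n_off, alpha=1.0):
    """alpha is the ratio of ON to OFF exposure (1.0 for matched ON/OFF pairs)."""
    n_tot = n_on + n_off
    term_on = n_on * np.log((1.0 + alpha) / alpha * n_on / n_tot)
    term_off = n_off * np.log((1.0 + alpha) * n_off / n_tot)
    return np.sign(n_on - alpha * n_off) * np.sqrt(2.0 * (term_on + term_off))

n_on, n_off = 1520, 1480        # placeholder gamma-like counts after image cuts
minutes = 24 * 60               # roughly 24 h of good-quality data
rate = (n_on - n_off) / minutes
print(f"excess rate = {rate:.3f} gamma/min, S = {li_ma_significance(n_on, n_off):.2f} sigma")
```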
Abstract:
This paper examines the relationship between the level of public infrastructure and the level of productivity using panel data for the Spanish provinces over the period 1984-2004, a period which is particularly relevant due to the substantial changes occurring in the Spanish economy at that time. The underlying model used for the data analysis is based on the wage equation, one of a small set of simultaneous equations which, when satisfied, correspond to the short-run equilibrium of New Economic Geography theory. This is estimated using a spatial panel model with fixed time and province effects, so that unmodelled space- and time-constant sources of heterogeneity are eliminated. The model assumes that productivity depends on the level of educational attainment and the public capital stock endowment of each province. The results show that although changes in productivity are positively associated with changes in public investment within the same province, there is a negative relationship between productivity changes and changes in public investment in other regions.
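A minimal sketch of the fixed-effects idea follows: province and year effects are removed by a two-way within transformation and the remaining variation is regressed by OLS. This deliberately simplifies the paper's spatial panel estimator (the spatial lag of public capital in other provinces is just entered as a pre-computed regressor); the file and column names are hypothetical.

```python
# Sketch: two-way (province, year) fixed-effects regression of log productivity
# on own-province public capital, a pre-computed spatial lag of other-province
# public capital, and education. A plain within-estimator stands in for the
# paper's spatial panel model; file and column names are hypothetical.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("spanish_provinces_1984_2004.csv")   # hypothetical panel
cols = ["log_productivity", "log_public_capital",
        "log_public_capital_other_regions", "education"]

def within(col):
    # Two-way within transformation: remove province and year means
    return (col - col.groupby(df["province"]).transform("mean")
                - col.groupby(df["year"]).transform("mean") + col.mean())

demeaned = df[cols].apply(within)
fit = sm.OLS(demeaned["log_productivity"], demeaned[cols[1:]]).fit(cov_type="HC1")
print(fit.params)
```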
Abstract:
The paper investigates the role of real exchange rate misalignment on long-run growth for a set of ninety countries using time series data from 1980 to 2004. We first estimate a panel data model (using fixed and random effects) for the real exchange rate, with different model specifications, in order to produce estimates of the equilibrium real exchange rate, and these are then used to construct measures of real exchange rate misalignment. We also provide an alternative set of estimates of real exchange rate misalignment using panel cointegration methods. The variables used in our real exchange rate models are: real per capita GDP, net foreign assets, terms of trade, and government consumption. The results for the two-step System GMM panel growth models indicate that the coefficients for real exchange rate misalignment are positive for different model specifications and samples, which means that a more depreciated (appreciated) real exchange rate helps (harms) long-run growth. The estimated coefficients are higher for developing and emerging countries.
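The two-step logic can be sketched as follows: fit a fixed-effects panel model of the real exchange rate on its fundamentals, then take the residual (actual minus fitted) as the misalignment measure that would enter the growth regressions. A dummy-variable fixed-effects OLS stands in here for the paper's fixed/random-effects and panel cointegration estimators; the file and column names are hypothetical.

```python
# Sketch: equilibrium real exchange rate from a fixed-effects panel regression,
# with misalignment defined as the residual. File and column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

panel = pd.read_csv("rer_panel_1980_2004.csv")   # hypothetical country-year panel
eq = smf.ols("log_rer ~ log_gdp_pc + net_foreign_assets + terms_of_trade"
             " + gov_consumption + C(country)", data=panel).fit()

panel["equilibrium_rer"] = eq.fittedvalues
panel["misalignment"] = panel["log_rer"] - panel["equilibrium_rer"]
print(panel.groupby("country")["misalignment"].std().head())
```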
Abstract:
This paper investigates the role of institutions in determining per capita income levels and growth. It contributes to the empirical literature by using different variables as proxies for institutions and by developing a deeper analysis of the issues arising from the use of weak and too many instruments in per capita income and growth regressions. The cross-section estimation suggests that institutions seem to matter, regardless of whether they are the only explanatory variable or are combined with geographical and integration variables, although most models suffer from the issue of weak instruments. The growth models provide some interesting results: there is mixed evidence on the role of institutions, and such evidence is more likely to be associated with law and order and investment profile; government spending is an important policy variable; and collapsing the number of instruments results in fewer significant coefficients for institutions.
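The instrumental-variables setting and the weak-instrument diagnostic can be sketched with a manual two-stage least squares regression; variable names and the data file are hypothetical, and the second-stage standard errors from plain OLS on fitted values are not valid 2SLS errors (a proper IV estimator would be used in practice).

```python
# Sketch: manual 2SLS for a per-capita-income regression with institutions
# instrumented, plus the first-stage F-test that signals weak instruments.
# Variable names and data file are hypothetical.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("cross_section.csv")            # hypothetical cross-country data
exog = sm.add_constant(df[["latitude", "integration"]])
instruments = df[["instrument_1", "instrument_2"]]

# First stage: institutions on instruments plus exogenous controls
first = sm.OLS(df["institutions"], pd.concat([exog, instruments], axis=1)).fit()
print(first.f_test("instrument_1 = 0, instrument_2 = 0"))   # weak-instrument check

# Second stage: replace institutions by their first-stage fitted values
second_X = exog.assign(institutions_hat=first.fittedvalues)
second = sm.OLS(df["log_gdp_per_capita"], second_X).fit()
print(second.params)
```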
Abstract:
INTRODUCTION/OBJECTIVES: Detection rates for adenoma and early colorectal cancer (CRC) are insufficient due to low compliance with invasive screening procedures, like colonoscopy. Available non-invasive screening tests unfortunately have low sensitivity and specificity. Therefore, there is a large unmet need for a cost-effective, reliable and non-invasive test to screen for early neoplastic and pre-neoplastic lesions. AIMS & METHODS: The objective is to develop a screening test able to detect early CRCs and adenomas. This test is based on a nucleic acids multi-gene assay performed on peripheral blood mononuclear cells (PBMCs). A colonoscopy-controlled feasibility study was conducted on 179 subjects. The first 92 subjects were used as a training set to generate a statistically significant signature. Colonoscopy revealed 21 subjects with CRC, 30 with adenoma bigger than 1 cm and 41 with no neoplastic or inflammatory lesions. The second group of 48 subjects (controls, CRC and polyps) was used as a test set and will be kept blinded for the entire data analysis. To determine the organ and disease specificity, 38 subjects were used: 24 with inflammatory bowel disease (IBD) and 14 with other cancers than CRC (OC). Blood samples were taken from each patient on the day of the colonoscopy and PBMCs were purified. Total RNA was extracted following standard procedures. Multiplex RT-qPCR was applied on 92 different candidate biomarkers. Different univariate and multivariate statistical methods were applied to these candidates, and among them 60 biomarkers with significant p-values (<0.01) were selected. These biomarkers are involved in several different biological functions such as cellular movement, cell signaling and interaction, tissue and cellular development, cancer, and cell growth and proliferation. Two distinct biomarker signatures are used to separate patients without lesion from those with cancer or with adenoma, named COLOX CRC and COLOX POL respectively. COLOX performances were validated using a random resampling method, the bootstrap. RESULTS: The COLOX CRC and POL tests successfully separate patients without lesions from those with CRC (Se 67%, Sp 93%, AUC 0.87) and from those with adenoma bigger than 1 cm (Se 63%, Sp 83%, AUC 0.77), respectively. 6/24 patients in the IBD group and 1/14 patients in the OC group have a positive COLOX CRC. CONCLUSION: The two COLOX tests demonstrated a high sensitivity and specificity to detect the presence of CRCs and adenomas bigger than 1 cm. A prospective, multicenter, pivotal study is underway in order to confirm these promising results in a larger cohort.
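The general shape of such a pipeline, univariate screening of candidate markers at p < 0.01, a multivariate signature, and bootstrap resampling of the AUC, is sketched below. The expression values are simulated, and the classifier (logistic regression), thresholds and marker names are assumptions rather than the COLOX method itself.

```python
# Sketch: univariate marker screen, multivariate signature, and bootstrap AUC.
# Simulated data; classifier and thresholds are illustrative assumptions.
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n, p = 92, 92                                    # training subjects x candidate markers
y = rng.integers(0, 2, size=n)                   # 1 = lesion, 0 = no lesion (simulated)
X = rng.normal(size=(n, p))
X[y == 1, :6] += 1.5                             # make a few markers informative

# Univariate screen: keep markers with Mann-Whitney p < 0.01
pvals = np.array([mannwhitneyu(X[y == 1, j], X[y == 0, j]).pvalue for j in range(p)])
selected = np.where(pvals < 0.01)[0]

# Multivariate signature validated by bootstrap resampling of the AUC
clf, aucs = LogisticRegression(max_iter=1000), []
for _ in range(200):
    boot = rng.integers(0, n, size=n)
    oob = np.setdiff1d(np.arange(n), boot)
    if len(set(y[boot])) < 2 or len(set(y[oob])) < 2:
        continue
    clf.fit(X[np.ix_(boot, selected)], y[boot])
    aucs.append(roc_auc_score(y[oob], clf.predict_proba(X[np.ix_(oob, selected)])[:, 1]))
print(f"{selected.size} markers selected; bootstrap AUC {np.mean(aucs):.2f} ± {np.std(aucs):.2f}")
```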
Abstract:
SUMMARY: Eukaryotic DNA interacts with nuclear proteins through non-covalent ionic interactions. Proteins can recognize specific nucleotide sequences based on steric interactions with the DNA, and these specific protein-DNA interactions are the basis for many nuclear processes, e.g. gene transcription, chromosomal replication, and recombination. A new technology termed ChIP-Seq has recently been developed for the analysis of protein-DNA interactions on a whole-genome scale; it is based on immunoprecipitation of chromatin followed by high-throughput DNA sequencing. ChIP-Seq is a novel technique with great potential to replace older techniques for mapping protein-DNA interactions. In this thesis, we bring some new insights into ChIP-Seq data analysis. First, we point out some common and so far unrecognized artifacts of the method. The sequence tag distribution in the genome does not follow a uniform distribution, and we have found extreme hot-spots of tag accumulation over specific loci in the human and mouse genomes. These artifactual sequence tag accumulations will create false peaks in every ChIP-Seq dataset, and we propose different filtering methods to reduce the number of false positives. Next, we propose random sampling as a powerful analytical tool in ChIP-Seq data analysis that can be used to infer biological knowledge from massive ChIP-Seq datasets. We created an unbiased random sampling algorithm and used this methodology to reveal some of the important biological properties of Nuclear Factor I (NFI) DNA-binding proteins. Finally, by analyzing the ChIP-Seq data in detail, we revealed that Nuclear Factor I transcription factors mainly act as activators of transcription, and that they are associated with specific chromatin modifications that are markers of open chromatin. We speculate that NFI factors only interact with the DNA wrapped around the nucleosome. We also found multiple loci that indicate possible chromatin barrier activity of NFI proteins, which could suggest the use of NFI binding sequences as chromatin insulators in biotechnology applications.
RÉSUMÉ: Eukaryotic DNA interacts with nuclear proteins through non-covalent ionic interactions. Proteins can recognize specific nucleotide sequences based on steric interactions with the DNA, and these specific interactions control many nuclear processes, e.g. gene transcription, chromosomal replication, and recombination. A new technology called ChIP-Seq has recently been developed for the analysis of protein-DNA interactions at the whole-genome scale; this approach is based on chromatin immunoprecipitation and a high-throughput DNA sequencing procedure. The new ChIP-Seq approach therefore has strong potential to replace older techniques for mapping protein-DNA interactions. In this thesis, we bring new perspectives to the analysis of ChIP-Seq data. First, we identified very common artifacts associated with this method that had so far gone unsuspected. The distribution of sequences in the genome does not follow a uniform distribution, and we found extreme positions of sequence accumulation at specific regions of the human and mouse genomes. These artifactual sequence accumulations will create false peaks in all ChIP-Seq data, and we propose different filtering methods to reduce the number of false positives. Next, we propose random sampling as a powerful tool for the analysis of ChIP-Seq data, which could increase the biological knowledge gained from ChIP-Seq data. We created a random sampling algorithm and used this method to reveal some of the important biological properties of the DNA-binding proteins named Nuclear Factor I (NFI). Finally, by analyzing in detail the ChIP-Seq data for the family of transcription factors named Nuclear Factor I, we revealed that these proteins act mainly as transcriptional activators, and that they are associated with specific chromatin modifications that are markers of open chromatin. We believe that NFI factors interact only with the DNA wrapped around the nucleosome. We also found several genomic regions that indicate a possible chromatin barrier activity of NFI proteins, which could suggest the use of NFI binding sequences as insulator sequences in biotechnology applications.
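Two of the ideas above, flagging hot-spot bins whose tag counts far exceed the genome-wide expectation and drawing an unbiased random subsample of the remaining tags, can be sketched in a few lines. Tag positions are simulated here (in practice they would come from aligned reads), and the bin size and threshold are illustrative, not the thesis' filtering criteria.

```python
# Sketch: hot-spot bin detection and random subsampling of ChIP-Seq tags.
# Simulated positions; thresholds and bin size are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
genome_len, bin_size = 10_000_000, 1_000
tags = rng.integers(0, genome_len, size=500_000)                  # simulated tag positions
tags = np.concatenate([tags, rng.integers(4_000_000, 4_000_500, size=20_000)])  # injected hot-spot

# Count tags per bin and flag bins grossly above the median occupied bin
counts = np.bincount(tags // bin_size, minlength=genome_len // bin_size)
hotspots = np.where(counts > 20 * np.median(counts[counts > 0]))[0]
print("hot-spot bins (bp):", hotspots * bin_size)

# Filter hot-spot tags, then take an unbiased random subsample of fixed depth
clean = tags[~np.isin(tags // bin_size, hotspots)]
subsample = rng.choice(clean, size=200_000, replace=False)
print("tags kept:", clean.size, "subsampled:", subsample.size)
```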
Abstract:
This paper presents general problems and approaches for spatial data analysis using machine learning algorithms. Machine learning is a very powerful approach to adaptive data analysis, modelling and visualisation. The key feature of machine learning algorithms is that they learn from empirical data and can be used in cases when the modelled environmental phenomena are hidden, nonlinear, noisy and highly variable in space and in time. Most machine learning algorithms are universal and adaptive modelling tools developed to solve basic problems of learning from data: classification/pattern recognition, regression/mapping and probability density modelling. In the present report some of the widely used machine learning algorithms, namely artificial neural networks (ANN) of different architectures and Support Vector Machines (SVM), are adapted to the problems of the analysis and modelling of geo-spatial data. Machine learning algorithms have an important advantage over traditional models of spatial statistics when problems are considered in high-dimensional geo-feature spaces, when the dimension of the space exceeds 5. Such features are usually generated, for example, from digital elevation models, remote sensing images, etc. An important extension of the models concerns the incorporation of real-space constraints such as geomorphology, networks, and other natural structures. Recent developments in semi-supervised learning can improve the modelling of environmental phenomena by taking geo-manifolds into account. An important part of the study deals with the analysis of relevant variables and models' inputs. This problem is approached by using different nonlinear feature selection/feature extraction tools. To demonstrate the application of machine learning algorithms several interesting case studies are considered: digital soil mapping using SVM, automatic mapping of soil and water system pollution using ANN, natural hazards risk analysis (avalanches, landslides), assessments of renewable resources (wind fields) with SVM and ANN models, etc. The dimensionality of the spaces considered varies from 2 to more than 30. Figures 1, 2 and 3 demonstrate some results of the studies and their outputs. Finally, the results of environmental mapping are discussed and compared with traditional models of geostatistics.
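One of the case-study types mentioned (e.g. pollution mapping with SVM) can be sketched as support vector regression on geo-features, cross-validated and then applied to a prediction grid. File names, feature names and hyperparameters below are hypothetical, not those of the report.

```python
# Sketch: SVM regression on geo-feature inputs for environmental mapping.
# File names, features and hyperparameters are hypothetical.
import pandas as pd
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

samples = pd.read_csv("soil_samples.csv")         # hypothetical georeferenced samples
features = ["x", "y", "elevation", "slope", "ndvi"]  # coordinates + derived geo-features
X, y = samples[features], samples["pollutant_ppm"]

model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.1))
print("CV R^2:", cross_val_score(model, X, y, cv=5, scoring="r2").mean())

grid = pd.read_csv("prediction_grid.csv")         # hypothetical regular grid of the same features
grid["prediction"] = model.fit(X, y).predict(grid[features])
```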
Abstract:
The geographic information system approach has permitted integration of demographic, socio-economic and environmental data, providing correlation between information from several data banks. In the current work, occurrences of human and canine visceral leishmaniases and of the insect vector (Lutzomyia longipalpis), as well as biogeographic information related to the 9 areas that comprise the city of Belo Horizonte, Brazil, between April 2001 and March 2002, were correlated and georeferenced. By using this technique it was possible to define concentration loci of canine leishmaniasis in the following regions: East, Northeast, Northwest, West, and Venda Nova. However, for human leishmaniasis it was not possible to perform the same analysis. Data analysis has also shown that 84.2% of the human leishmaniasis cases were related to canine leishmaniasis cases. Concerning the biogeographic factors analysed (altitude, area of vegetation influence, hydrography, and areas of poverty), only altitude was shown to influence the emergence of leishmaniasis cases. A total of 4673 canine leishmaniasis cases and 64 human leishmaniasis cases were georeferenced, of which 67.5% and 71.9%, respectively, were living between 780 and 880 m above sea level. At these same altitudes, a large number of phlebotomine sand flies were collected. Therefore, we suggest control measures for leishmaniasis in the city of Belo Horizonte, giving priority to canine leishmaniasis foci and regions at altitudes between 780 and 880 m.
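The altitude summary reported above amounts to a simple share computation over the georeferenced cases; a small sketch, with a hypothetical input file and column names, is shown below.

```python
# Sketch: share of georeferenced canine and human cases in the 780-880 m band.
# Input file and column names are hypothetical.
import pandas as pd

cases = pd.read_csv("georeferenced_cases.csv")    # hypothetical: one row per case
in_band = cases["altitude_m"].between(780, 880)
share = cases.assign(in_band=in_band).groupby("host")["in_band"].mean() * 100
print(share.round(1))                             # host = "canine" or "human", in percent
```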
Abstract:
Dysregulation of intestinal epithelial cell performance is associated with an array of pathologies whose onset mechanisms are incompletely understood. While whole-genomics approaches have been valuable for studying the molecular basis of several intestinal diseases, a thorough analysis of gene expression along the healthy gastrointestinal tract is still lacking. The aim of this study was to map gene expression in gastrointestinal regions of healthy human adults and to implement a procedure for microarray data analysis that would allow its use as a reference when screening for pathological deviations. We analyzed the gene expression signature of antrum, duodenum, jejunum, ileum, and transverse colon biopsies using a biostatistical method based on a multivariate and univariate approach to identify region-selective genes. One hundred sixty-six genes were found responsible for distinguishing the five regions considered. Nineteen had never been described in the GI tract, including a semaphorin probably implicated in pathogen invasion and six novel genes. Moreover, by crossing these genes with those retrieved from an existing data set of gene expression in the intestine of ulcerative colitis and Crohn's disease patients, we identified genes that might be biomarkers of Crohn's and/or ulcerative colitis in ileum and/or colon. These include CLCA4 and SLC26A2, both implicated in ion transport. This study furnishes the first map of gene expression along the healthy human gastrointestinal tract. Furthermore, the approach implemented here, and validated by retrieving known gene profiles, allowed the identification of promising new leads in both healthy and disease states.
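A combined univariate/multivariate screen of the kind described can be sketched as a per-gene one-way ANOVA across the five regions with FDR correction, plus a PCA of the samples as the multivariate view. The expression matrix below is simulated and the paper's actual biostatistical procedure may differ.

```python
# Sketch: per-gene ANOVA across regions with FDR correction, plus PCA of samples.
# Simulated expression matrix; illustrative only.
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multitest import multipletests
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
regions = np.repeat(["antrum", "duodenum", "jejunum", "ileum", "colon"], 6)
expr = rng.normal(size=(len(regions), 2000))      # samples x genes (simulated)
expr[regions == "ileum", :40] += 2.0              # a few artificial region-selective genes

pvals = np.array([f_oneway(*(expr[regions == r, g] for r in np.unique(regions))).pvalue
                  for g in range(expr.shape[1])])
selective = multipletests(pvals, alpha=0.05, method="fdr_bh")[0]
print("region-selective genes:", int(selective.sum()))

scores = PCA(n_components=2).fit_transform(expr[:, selective])
print(scores[:3])                                 # samples projected onto the first two PCs
```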
Abstract:
In this paper we look at how web-based social software can be used to perform qualitative data analysis of online peer-to-peer learning experiences. Specifically, we propose to use Cohere, a web-based social sense-making tool, to observe, track, annotate and visualize discussion group activities in online courses. We define a specific methodology for data observation and structuring, and present the results of an analysis of peer interactions conducted in a discussion forum in a real case study of a P2PU course. Finally, we discuss how network visualization and analysis can be used to gain a better understanding of the peer-to-peer learning experience. To do so, we provide preliminary insights on the social, dialogical and conceptual connections that have been generated within one online discussion group.
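The network view described can be sketched by building a directed reply graph from annotated forum posts and computing simple structural indicators; the post records below are invented for illustration and are not data from the P2PU case study.

```python
# Sketch: directed reply graph of a discussion group with simple indicators.
# Post records are invented placeholders.
import networkx as nx

posts = [
    {"author": "ana",   "replies_to": None},
    {"author": "ben",   "replies_to": "ana"},
    {"author": "carla", "replies_to": "ana"},
    {"author": "ana",   "replies_to": "ben"},
]

G = nx.DiGraph()
for post in posts:
    if post["replies_to"]:
        G.add_edge(post["author"], post["replies_to"])   # edge: who replied to whom

print("in-degree centrality:", nx.in_degree_centrality(G))
print("graph density:", nx.density(G))
```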
Abstract:
In contrast to some extensively examined food mutagens, for example, aflatoxins, N-nitrosamines and heterocyclic amines, some other food contaminants, in particular polycyclic aromatic hydrocarbons (PAH) and other aromatic compounds, have received less attention. Therefore, exploring the relationships between dietary habits and the levels of biomarkers related to exposure to aromatic compounds is highly relevant. We have investigated in the European Prospective Investigation into Cancer and Nutrition (EPIC) cohort the association between dietary items (food groups and nutrients) and aromatic DNA adducts and 4-aminobiphenyl-Hb adducts. Both types of adducts are biomarkers of carcinogen exposure and possibly of cancer risk, and were measured, respectively, in leucocytes and erythrocytes of 1086 (DNA adducts) and 190 (Hb adducts) non-smokers. An inverse, statistically significant association was found between DNA adduct levels and dietary fibre intake (P = 0·02), vitamin E (P = 0·04) and alcohol (P = 0·03), but not with other nutrients or food groups. An inverse association was also observed between fibre intake, fruit intake, and BMI and 4-aminobiphenyl-Hb adducts (P = 0·03, 0·04, and 0·03 respectively). After multivariate regression analysis these inverse correlations remained statistically significant, except for the correlation between adducts and fruit intake. The present study suggests that fibre intake in the usual range can modify the level of DNA or Hb aromatic adducts, but such a role seems to be quantitatively modest. Fibres could reduce the formation of DNA adducts in different ways: by diluting potential food mutagens and carcinogens in the gastrointestinal tract, by speeding their transit through the colon, and by binding carcinogenic substances.
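The kind of multivariate regression reported here can be sketched as a log-transformed adduct level regressed on the dietary variables with adjustment for basic covariates; the data file, column names and covariate set are hypothetical stand-ins for the EPIC analysis dataset.

```python
# Sketch: adjusted regression of log adduct level on dietary intakes.
# File, column names and covariates are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

epic = pd.read_csv("epic_nonsmokers.csv")          # hypothetical analysis file
fit = smf.ols("log_dna_adducts ~ fibre_g_day + vitamin_e_mg_day + alcohol_g_day"
              " + age + sex + bmi + C(centre)", data=epic).fit()
print(fit.params)
```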
Abstract:
The theory of compositional data analysis is often focused on the composition only. However, in practical applications we often treat a composition together with covariables on some other scale. This contribution systematically gathers and develops statistical tools for this situation. For instance, for the graphical display of the dependence of a composition on a categorical variable, a colored set of ternary diagrams might be a good idea for a first look at the data, but it will quickly hide important aspects if the composition has many parts or takes extreme values. On the other hand, colored scatterplots of ilr components may not be very instructive for the analyst if the conventional, black-box ilr is used. Thinking in terms of the Euclidean structure of the simplex, we suggest setting up appropriate projections which, on one side, show the compositional geometry and, on the other side, are still comprehensible to a non-expert analyst and readable for all locations and scales of the data. This is done, e.g., by defining special balance displays with carefully selected axes. Following this idea, we need to ask systematically how to display, explore, describe, and test the relation to complementary or explanatory data of categorical, real, ratio or again compositional scales. This contribution shows that it is sufficient to use some basic concepts and very few advanced tools from multivariate statistics (principal covariances, multivariate linear models, trellis or parallel plots, etc.) to build appropriate procedures for all these combinations of scales. This has some fundamental implications for their software implementation, and for how they might be taught to analysts who are not already experts in multivariate analysis.
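The balance coordinates behind the "balance displays" mentioned above can be sketched directly: each balance contrasts the geometric means of two groups of parts from a sequential binary partition. The 4-part compositions and the partition chosen below are made up for illustration.

```python
# Sketch: ilr-type balance of two groups of parts of a composition.
# Compositions and partition are illustrative.
import numpy as np

def balance(x, num, den):
    """Balance of the parts indexed by `num` against those indexed by `den`."""
    r, s = len(num), len(den)
    g_num = np.exp(np.mean(np.log(x[..., num]), axis=-1))
    g_den = np.exp(np.mean(np.log(x[..., den]), axis=-1))
    return np.sqrt(r * s / (r + s)) * np.log(g_num / g_den)

comp = np.array([[0.40, 0.30, 0.20, 0.10],
                 [0.05, 0.15, 0.30, 0.50]])        # rows are compositions summing to 1
print(balance(comp, num=[0, 1], den=[2, 3]))       # one carefully chosen axis
```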
Abstract:
Self-organizing maps (Kohonen, 1997) are a type of artificial neural network developed to explore patterns in high-dimensional multivariate data. The conventional version of the algorithm uses the Euclidean metric in the adaptation of the model vectors, thus rendering, in theory, the whole methodology incompatible with non-Euclidean geometries. In this contribution we explore the two main aspects of the problem:
1. Whether the conventional approach using the Euclidean metric can yield valid results with compositional data.
2. Whether a modification of the conventional approach, replacing vectorial sum and scalar multiplication by the canonical operators in the simplex (i.e. perturbation and powering), can converge to an adequate solution.
Preliminary tests showed that both methodologies can be used on compositional data. However, the modified version of the algorithm performs worse than the conventional version, in particular when the data are pathological. Moreover, the conventional approach converges faster to a solution when the data are "well-behaved".
Key words: Self Organizing Map; Artificial Neural Networks; Compositional Data
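The modified adaptation step can be sketched by replacing the Euclidean update m + alpha * (x - m) with its simplex counterpart built from perturbation and powering. Only the single model-vector update is shown, not a full SOM, and the numbers are illustrative.

```python
# Sketch: SOM adaptation step in the simplex using perturbation and powering.
# Single update only; values are illustrative.
import numpy as np

def closure(x):
    return x / x.sum(axis=-1, keepdims=True)

def perturb(x, y):                 # x (+) y : componentwise product, re-closed
    return closure(x * y)

def power(alpha, x):               # alpha (.) x : componentwise power, re-closed
    return closure(x ** alpha)

def som_update_simplex(m, x, alpha):
    """Move the model vector m towards the observation x inside the simplex."""
    diff = closure(x / m)          # x (-) m
    return perturb(m, power(alpha, diff))

m = np.array([0.25, 0.25, 0.25, 0.25])
x = np.array([0.60, 0.20, 0.15, 0.05])
print(som_update_simplex(m, x, alpha=0.3))
```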
Abstract:
Functional Data Analysis (FDA) deals with samples where a whole function is observed for each individual. A particular case of FDA arises when the observed functions are density functions, which are also an example of infinite-dimensional compositional data. In this work we compare several methods of dimensionality reduction for this particular type of data: functional principal component analysis (PCA), with or without a previous data transformation, and multidimensional scaling (MDS) for different inter-density distances, one of them taking into account the compositional nature of density functions. The different methods are applied to both artificial and real data (household income distributions).
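One of the compared approaches can be sketched as follows: discretise each density on a common grid, apply a centred log-ratio type transformation, and run an ordinary PCA on the transformed values. The densities below are simulated stand-ins for the household income distributions, and the clr transformation is an assumed choice for illustration.

```python
# Sketch: PCA of clr-transformed discretised densities.
# Simulated densities; transformation choice is an illustrative assumption.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
grid = np.linspace(0.1, 10.0, 200)                 # common evaluation grid
mus = rng.normal(1.0, 0.3, size=30)                # one density per sampled unit
dens = np.array([np.exp(-(np.log(grid) - m) ** 2 / 0.5) / grid for m in mus])
dens /= dens.sum(axis=1, keepdims=True)            # renormalise on the grid

clr = np.log(dens) - np.log(dens).mean(axis=1, keepdims=True)  # clr-type transform
pca = PCA(n_components=2).fit(clr)
print("explained variance ratio:", pca.explained_variance_ratio_)
print(pca.transform(clr)[:3])                      # first sample scores
```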