58 results for exploratory spatial data analysis
Abstract:
Quantile normalization (QN) is a technique for microarray data processing and is the default normalization method in the Robust Multi-array Average (RMA) procedure, which was primarily designed for analysing gene expression data from Affymetrix arrays. Given the abundance of Affymetrix microarrays and the popularity of the RMA method, it is crucially important that the normalization procedure is applied appropriately. In this study we carried out simulation experiments and also analysed real microarray data to investigate the suitability of RMA when it is applied to datasets with different groups of biological samples. Our experiments showed that RMA with QN does not preserve the biological signal within each group, but rather mixes the signals between the groups. We also showed that the Median Polish method used in the summarization step of RMA has a similar mixing effect. RMA is one of the most widely used methods in microarray data processing and has been applied to a vast volume of data in biomedical research. The problematic behaviour of this method suggests that the conclusions of previous studies employing RMA could have been adversely affected. We therefore think it is crucially important that the research community recognizes the issue and starts to address it. The two core elements of the RMA method, quantile normalization and Median Polish, both have the undesirable effect of mixing biological signals between different sample groups, which can be detrimental to drawing valid biological conclusions and to any subsequent analyses. Based on the evidence presented here and that in the literature, we recommend exercising caution when using RMA to process microarray gene expression data, particularly in situations where there are likely to be unknown subgroups of samples.
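For illustration, below is a minimal sketch of quantile normalization on a genes-by-samples matrix using synthetic data (not the study's datasets); it shows how QN forces every sample onto a common distribution, which is the mechanism behind the signal mixing described above.

```python
# A minimal quantile-normalization sketch; data are simulated, ties handled crudely.
import numpy as np

def quantile_normalize(X):
    """Quantile-normalize the columns of X (genes x samples)."""
    # Rank values within each sample (column).
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)
    # Mean across samples at each rank defines the common reference distribution.
    reference = np.sort(X, axis=0).mean(axis=1)
    # Replace every value by the reference value at its rank,
    # forcing all samples onto an identical distribution.
    return reference[ranks]

# Two groups with genuinely different expression distributions end up identical,
# illustrating how QN can mix biological signal between groups.
rng = np.random.default_rng(0)
group_a = rng.normal(6.0, 1.0, size=(1000, 3))
group_b = rng.normal(8.0, 1.5, size=(1000, 3))
Xn = quantile_normalize(np.hstack([group_a, group_b]))
print(Xn[:, :3].mean(), Xn[:, 3:].mean())  # group means coincide after QN
```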
Abstract:
Statistics are regularly used to make some form of comparison between trace evidence or to deploy the exclusionary principle (Morgan and Bull, 2007) in forensic investigations. Trace evidence routinely takes the form of particle size, chemical or modal analyses and as such constitutes compositional data. The issue is that compositional data, including percentages, parts per million and the like, carry only relative information. This may be problematic where a comparison of percentages and other constrained/closed data is deemed a statistically valid and appropriate way to present trace evidence in a court of law. Notwithstanding awareness of the constant sum problem since the seminal works of Pearson (1896) and Chayes (1960), and the subsequent introduction of log-ratio techniques (Aitchison, 1986; Pawlowsky-Glahn and Egozcue, 2001; Pawlowsky-Glahn and Buccianti, 2011; Tolosana-Delgado and van den Boogaart, 2013), the fact that a constant sum destroys the potential independence of variances and covariances required for correlation and regression analysis and for empirical multivariate methods (principal component analysis, cluster analysis, discriminant analysis, canonical correlation) is all too often not acknowledged in the statistical treatment of trace evidence. Yet the need for a robust treatment of forensic trace evidence analyses is obvious. This research examines the issues and potential pitfalls for forensic investigators if the constant sum constraint is ignored in the analysis and presentation of forensic trace evidence. Forensic case studies involving particle size and mineral analyses as trace evidence are used to demonstrate a compositional data approach using a centred log-ratio (clr) transformation and multivariate statistical analyses.
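As a concrete illustration, a small sketch of the centred log-ratio (clr) transformation mentioned above, applied to made-up particle-size fractions (not the case-study data):

```python
# Centred log-ratio (clr) transformation for compositional data; values illustrative.
import numpy as np

def clr(composition):
    """clr: log of each part divided by the geometric mean of its row."""
    log_x = np.log(np.asarray(composition, dtype=float))
    return log_x - log_x.mean(axis=1, keepdims=True)

# Two soil samples described by four particle-size fractions summing to 100%.
samples = np.array([
    [40.0, 30.0, 20.0, 10.0],
    [35.0, 35.0, 20.0, 10.0],
])
clr_samples = clr(samples)
print(clr_samples.sum(axis=1))  # each row sums to ~0, removing the constant-sum constraint
```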
Abstract:
This paper is part of a special issue of Applied Geochemistry focusing on reliable applications of compositional multivariate statistical methods. This study outlines the application of compositional data analysis (CoDa) to the calibration of geochemical data and to multivariate statistical modelling of geochemistry and grain-size data from a set of Holocene sedimentary cores from the Ganges-Brahmaputra (G-B) delta. Over the last two decades, understanding near-continuous records of sedimentary sequences has required the use of core-scanning X-ray fluorescence (XRF) spectrometry, for both terrestrial and marine sedimentary sequences. Initial XRF data are generally unusable in ‘raw’ format, requiring data processing to remove instrument bias, as well as informed sequence interpretation. The applicability of conventional calibration equations to core-scanning XRF data is further limited by the constraints posed by unknown measurement geometry and specimen homogeneity, as well as matrix effects. Log-ratio based calibration schemes have been developed and applied to clastic sedimentary sequences, focusing mainly on energy dispersive-XRF (ED-XRF) core-scanning. This study applied high-resolution core-scanning XRF to Holocene sedimentary sequences from the tide-dominated Indian Sundarbans (Ganges-Brahmaputra delta plain). The Log-Ratio Calibration Equation (LRCE) was applied to a sub-set of core-scan and conventional ED-XRF data to quantify elemental composition, providing a robust calibration scheme based on reduced major axis regression of log-ratio transformed geochemical data. Through partial least squares (PLS) modelling of the geochemical and grain-size data, it is possible to derive robust proxy information for the Sundarbans depositional environment. The application of these techniques to Holocene sedimentary data offers an improved methodological framework for unravelling Holocene sedimentation patterns.
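A hedged sketch of the kind of calibration step described above: reduced major axis regression of conventional-XRF log-ratios against core-scanner log-ratios. The element ratio, data and coefficients are invented for illustration and do not reproduce the LRCE implementation used in the study.

```python
# Reduced major axis (geometric mean) regression on simulated log-ratio data.
import numpy as np

def reduced_major_axis(x, y):
    """Return slope and intercept of the RMA regression line of y on x."""
    slope = np.sign(np.corrcoef(x, y)[0, 1]) * (np.std(y, ddof=1) / np.std(x, ddof=1))
    intercept = np.mean(y) - slope * np.mean(x)
    return slope, intercept

rng = np.random.default_rng(1)
# Hypothetical log-ratios of element counts relative to a normalising element, e.g. ln(Fe/Ti).
scanner_logratio = rng.normal(1.0, 0.3, 200)
conventional_logratio = 0.9 * scanner_logratio + 0.2 + rng.normal(0, 0.05, 200)

slope, intercept = reduced_major_axis(scanner_logratio, conventional_logratio)
print(f"calibration: conventional ~ {slope:.2f} * scanner + {intercept:.2f}")
```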
Abstract:
The problem of detecting spatially coherent groups of data that exhibit anomalous behavior has started to attract attention due to applications across areas such as epidemic analysis and weather forecasting. Earlier efforts from the data mining community have largely focused on finding outliers, individual data objects that display deviant behavior. Such point-based methods are not easy to extend to find groups of data that exhibit anomalous behavior. Scan statistics are methods from the statistics community that address the problem of identifying regions where data objects exhibit behavior that is atypical of the general dataset. The spatial scan statistic and methods that build upon it mostly adopt the framework of defining a shape for regions of objects (e.g., circular or elliptical), repeatedly sampling regions of that shape, and applying a statistical test for anomaly detection. In the past decade, there have been efforts from the statistics community to enhance the efficiency of scan statistics as well as to enable discovery of arbitrarily shaped anomalous regions. On the other hand, the data mining community has started to look at determining anomalous regions whose behavior diverges from that of their neighborhood. In this chapter, we survey the space of techniques for detecting anomalous regions in spatial data from across the data mining and statistics communities, while outlining connections to well-studied problems in clustering and image segmentation. We analyze the techniques systematically, categorizing them appropriately to provide a structured bird's-eye view of the work on anomalous region detection; we hope that this will encourage better cross-pollination of ideas across communities and help advance the frontier in anomaly detection.
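A compact, illustrative sketch of a circular spatial scan in the spirit of Kulldorff's Poisson scan statistic (not any specific method surveyed in the chapter); the locations, radii and planted hotspot are synthetic.

```python
# Circular scan over point data, scoring each window with a Poisson log-likelihood ratio.
import numpy as np

def poisson_llr(cases_in, expected_in, total_cases):
    """Log-likelihood ratio for an excess of cases inside the scanning window."""
    c, mu, C = cases_in, expected_in, total_cases
    if c <= mu or mu <= 0 or c >= C:
        return 0.0
    return c * np.log(c / mu) + (C - c) * np.log((C - c) / (C - mu))

rng = np.random.default_rng(2)
coords = rng.uniform(0, 10, size=(300, 2))                 # locations
population = np.ones(300)                                   # population at risk per location
cases = rng.poisson(0.2, 300)                               # baseline case counts
cases[np.linalg.norm(coords - [7, 7], axis=1) < 1] += 3     # planted hotspot near (7, 7)

C = cases.sum()
expected_rate = C / population.sum()
best = (0.0, None)
for centre in coords:                                       # windows centred on data points
    for radius in (0.5, 1.0, 1.5):
        inside = np.linalg.norm(coords - centre, axis=1) <= radius
        llr = poisson_llr(cases[inside].sum(), expected_rate * population[inside].sum(), C)
        best = max(best, (llr, (tuple(centre), radius)), key=lambda t: t[0])
print("highest-scoring window:", best)
```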
Abstract:
A compositional multivariate approach is used to analyse regional-scale soil geochemical data obtained as part of the Tellus Project, generated by the Geological Survey of Northern Ireland (GSNI). The multi-element total concentration data comprise XRF analyses of 6862 rural soil samples collected at 20 cm depth on a non-aligned grid at one site per 2 km². Censored data were imputed using published detection limits. Using these imputed values for 46 elements (including LOI), each soil sample site was assigned to the regional geology map provided by GSNI, initially using the dominant lithology for the map polygon. Northern Ireland includes a diversity of geology representing a stratigraphic record from the Mesoproterozoic up to and including the Palaeogene. However, the advance of ice sheets and their meltwaters over the last 100,000 years has left at least 80% of the bedrock covered by superficial deposits, including glacial till and post-glacial alluvium and peat. The question is to what extent the soil geochemistry reflects the underlying geology or the superficial deposits. To address this, the geochemical data were transformed using centered log ratios (clr) to observe the requirements of compositional data analysis and avoid closure issues. Compositional multivariate techniques, including compositional Principal Component Analysis (PCA) and minimum/maximum autocorrelation factor (MAF) analysis, were then used to determine the influence of underlying geology on the soil geochemistry signature. PCA showed that 72% of the variation was captured by the first four principal components (PCs), implying “significant” structure in the data. Analysis of variance showed that only 10 PCs were necessary to classify the soil geochemical data. To improve on PCA by using the spatial relationships in the data, a classification based on MAF analysis was undertaken using the first 6 dominant factors. Understanding the relationship between soil geochemistry and superficial deposits is important for environmental monitoring of fragile ecosystems such as peat. To explore whether peat cover could be predicted from the classification, the lithology designation was adapted to include the presence of peat, based on GSNI superficial deposit polygons, and linear discriminant analysis (LDA) was undertaken. Prediction accuracy for the LDA classification improved from 60.98% based on PCA using 10 principal components to 64.73% using MAF based on the 6 most dominant factors. The misclassification of peat may reflect degradation of peat-covered areas since the creation of the superficial deposit classification. Further work will examine the influence of underlying lithologies on elemental concentrations in peat composition and the effect of this on the classification analysis.
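A hedged sketch of the clr-then-PCA workflow on simulated multi-element data (not the Tellus dataset); the element names and the resulting proportion of variance are illustrative only.

```python
# Compositional PCA: close simulated element data, clr-transform, then ordinary PCA.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
elements = ["SiO2", "Al2O3", "Fe2O3", "CaO", "MgO", "K2O"]   # hypothetical element suite
raw = rng.lognormal(mean=2.0, sigma=0.5, size=(500, len(elements)))
composition = raw / raw.sum(axis=1, keepdims=True)           # constant-sum (closed) data

# clr transform removes the closure constraint before PCA.
clr_data = np.log(composition) - np.log(composition).mean(axis=1, keepdims=True)

pca = PCA()
scores = pca.fit_transform(clr_data)
explained = np.cumsum(pca.explained_variance_ratio_)
print("variance explained by first four PCs:", round(explained[3], 3))
```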
Abstract:
A new method for automated coronal loop tracking, in both the spatial and temporal domains, is presented. Applying this technique to TRACE data obtained using the 171 Å filter on 1998 July 14, we detect a coronal loop undergoing a 270 s kink-mode oscillation, as previously found by Aschwanden et al. However, we also detect flare-induced, and previously unnoticed, spatial periodicities on a scale of 3500 km which occur along the coronal loop edge. Furthermore, we establish a 45% reduction in oscillatory power for these spatial periodicities over a 222 s interval. We relate the reduction in detected oscillatory power to the physical damping of these loop-top oscillations.
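A schematic sketch of how a dominant spatial periodicity along a loop edge, and its drop in oscillatory power between two times, might be estimated with an FFT; the signal, sampling interval and amplitudes are synthetic assumptions, not TRACE measurements.

```python
# FFT-based estimate of the dominant spatial scale and its power change; data simulated.
import numpy as np

dx_km = 350.0                                   # assumed spatial sampling along the loop edge
x = np.arange(256) * dx_km
period_km = 3500.0

def edge_profile(amplitude):
    noise = np.random.default_rng(4).normal(0, 0.1, x.size)
    return amplitude * np.sin(2 * np.pi * x / period_km) + noise

def dominant_power(signal):
    power = np.abs(np.fft.rfft(signal - signal.mean())) ** 2
    freqs = np.fft.rfftfreq(signal.size, d=dx_km)
    k = power[1:].argmax() + 1                  # skip the zero-frequency bin
    return 1.0 / freqs[k], power[k]             # wavelength in km, power at that scale

scale_before, power_before = dominant_power(edge_profile(1.0))
scale_after, power_after = dominant_power(edge_profile(0.74))   # amplitude chosen so power drops by ~45%
print(f"detected scale ~ {scale_before:.0f} km, power drop ~ {1 - power_after / power_before:.0%}")
```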
Abstract:
Assessment of elevated concentrations of potentially toxic elements (PTEs) in soils and their association with specific soil parent materials have been the focus of research for a number of years. Risk-based assessment of potential exposure scenarios for identified elevated PTE concentrations has led to the derivation of site- and contaminant-specific soil guideline values (SGVs), which serve as generic assessment criteria (GACs) for identifying concentrations that may reflect an unacceptable risk to human health. A better understanding of the ‘bioavailable’ or ‘bioaccessible’ contaminant concentrations offers an opportunity to refine contaminant exposure assessments. Utilizing a comprehensive soil geochemical dataset for Northern Ireland provided by the Tellus Survey (GSNI), in conjunction with supplementary bioaccessibility testing of selected soil samples following the Unified BARGE Method (UBM), this paper uses exploratory data analysis and geostatistical analysis to investigate the spatial variability of pseudo-total and bioaccessible concentrations of As, Cd, Co, Cr, Cu, Ni, Pb, U, V and Zn. The paper investigates variations in individual element concentrations as well as cross-element correlations and observed lithological/pedological associations. The analysis of PTE concentrations highlighted exceedances of GAC values for V and Cr and of SGV/GAC values for Cd, Cu, Ni, Pb, and Zn. UBM testing showed that for some soil parent materials associated with elevated PTE concentrations, e.g. the Antrim Lava Group with high Ni concentrations, the measured oral bioaccessible fraction was relatively low. For other soil parent materials with relatively moderate PTE concentrations, the measured oral bioaccessible fraction was relatively high (e.g. the Gala Sandstone Group of the Southern Uplands-Down Longford Terrain). These findings have implications for regional human health risk assessments for specific PTEs.
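By way of a toy example, the oral bioaccessible fraction and a simple exceedance flag can be computed as below; the element values and criteria are placeholders, not Tellus/UBM measurements or regulatory SGV/GAC values.

```python
# Bioaccessible fraction (BAF) and exceedance flag; all numbers are hypothetical.
import pandas as pd

soils = pd.DataFrame({
    "sample_id": ["S1", "S2", "S3"],
    "element": ["Ni", "Ni", "V"],
    "pseudo_total_mg_kg": [210.0, 95.0, 180.0],
    "bioaccessible_mg_kg": [32.0, 40.0, 25.0],
})
gac_mg_kg = {"Ni": 130.0, "V": 91.0}   # placeholder criteria, not regulatory values

soils["baf_percent"] = 100 * soils["bioaccessible_mg_kg"] / soils["pseudo_total_mg_kg"]
soils["exceeds_gac"] = soils["pseudo_total_mg_kg"] > soils["element"].map(gac_mg_kg)
print(soils)
```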
Abstract:
Identifying differential expression of genes in psoriatic and healthy skin by microarray data analysis is a key approach to understanding the pathogenesis of psoriasis. Analysing more than one dataset to identify commonly upregulated genes reduces the likelihood of false positives and narrows down the possible signature genes. Genes controlling the critical balance between T helper 17 and regulatory T cells are of special interest in psoriasis. Our objective was to identify genes that are consistently upregulated in lesional skin across three published microarray datasets. We carried out a reanalysis of gene expression data extracted from three experiments on samples from psoriatic and nonlesional skin using the same stringency threshold and software, and further compared the expression levels of 92 genes related to the T helper 17 and regulatory T cell signaling pathways. We found 73 probe sets representing 57 genes commonly upregulated in lesional skin in all datasets. These included 26 probe sets representing 20 genes with no previous link to the etiopathogenesis of psoriasis. These genes may represent novel therapeutic targets, although they will require more rigorous experimental testing to be validated. Our analysis also identified 12 of the 92 genes known to be related to the T helper 17 and regulatory T cell signaling pathways as differentially expressed in the lesional skin samples.
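A minimal sketch of the cross-dataset intersection step described above: probe sets flagged as upregulated in each reanalysed dataset are intersected to find those common to all. The identifiers below are arbitrary examples, not the study's results.

```python
# Intersect per-dataset sets of upregulated probe-set IDs (IDs are illustrative).
upregulated = {
    "dataset_1": {"201650_at", "205916_at", "207356_at", "210413_x_at"},
    "dataset_2": {"201650_at", "205916_at", "210413_x_at", "219995_s_at"},
    "dataset_3": {"201650_at", "205916_at", "210413_x_at", "204470_at"},
}

common = set.intersection(*upregulated.values())
print(f"{len(common)} probe sets upregulated in all datasets:", sorted(common))
```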
Abstract:
This research investigates the relationship between elevated trace elements in soils, stream sediments and stream water and the prevalence of Chronic Kidney Disease (CKD). The study combines datasets from the UK Renal Registry (UKRR) report on patients with renal diseases requiring treatment, including Renal Replacement Therapy (RRT); the soil geochemical dataset for Northern Ireland provided by the Tellus Survey, Geological Survey of Northern Ireland (GSNI); and bioaccessibility data for Potentially Toxic Elements (PTEs) in soil samples obtained using the Unified BARGE Method (UBM). The motivation for relating these factors derives from the UKRR report, which highlights regional variation in the incidence of renally impaired patients, with many cases of unknown aetiology. Studies suggest that a potential cause of the large variation and uncertain aetiology is underlying environmental factors, such as the oral bioaccessibility of trace elements in the gastrointestinal tract.
As previous research indicates that long-term exposure is related to environmental factors, Northern Ireland is ideally placed for this research because people traditionally live in the same location for long periods of time. Exploratory data analysis and multivariate analyses are used to examine the soil, stream sediment and stream water geochemistry data for a range of key elements, including arsenic, lead, cadmium and mercury, identified from a review of previous renal disease literature. The spatial prevalence of patients with long-term CKD is analysed on an area basis. Further work includes cluster analysis to detect areas of low or high CKD incidence that are significantly correlated in space, together with Geographically Weighted Regression (GWR) and Poisson kriging to examine locally varying relationships between elevated concentrations of PTEs and the prevalence of CKD.
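As a hedged sketch of the kind of area-level screening that might precede GWR or Poisson kriging, the following fits a Poisson regression of simulated CKD case counts on a soil PTE concentration with a population offset; none of the numbers are UKRR or Tellus values.

```python
# Area-level Poisson regression of case counts on a soil element; data are simulated.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n_areas = 120
population = rng.integers(2_000, 30_000, n_areas)
soil_as_mg_kg = rng.lognormal(2.0, 0.4, n_areas)            # hypothetical soil arsenic
baseline_rate = 8e-4
cases = rng.poisson(population * baseline_rate
                    * np.exp(0.02 * (soil_as_mg_kg - soil_as_mg_kg.mean())))

X = sm.add_constant(soil_as_mg_kg)
model = sm.GLM(cases, X, family=sm.families.Poisson(),
               offset=np.log(population)).fit()
print(model.summary().tables[1])
```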
Abstract:
Retrospective clinical datasets are often characterized by a relatively small sample size and many missing values. In this case, a common way of handling the missingness is to discard from the analysis patients with missing covariates, further reducing the sample size. Alternatively, if the mechanism that generated the missingness allows, incomplete data can be imputed on the basis of the observed data, avoiding the reduction in sample size and allowing methods for complete data to be applied subsequently. Moreover, methodologies for data imputation might depend on the particular purpose and might achieve better results by considering specific characteristics of the domain. The problem of missing data treatment is studied in the context of survival tree analysis for the estimation of a prognostic patient stratification. Survival tree methods usually address this problem by using surrogate splits, that is, splitting rules that use other variables yielding results similar to the original ones. Instead, our methodology consists in modeling the dependencies among the clinical variables with a Bayesian network, which is then used to perform data imputation, thus allowing the survival tree to be applied to the completed dataset. The Bayesian network is learned directly from the incomplete data using a structural expectation–maximization (EM) procedure in which the maximization step is performed with an exact anytime method, so that the only source of approximation is the EM formulation itself. On both simulated and real data, our proposed methodology usually outperformed several existing methods for data imputation, and the imputation so obtained improved the stratification estimated by the survival tree (especially with respect to using surrogate splits).
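The general impute-then-analyse workflow can be sketched as below; for brevity it uses scikit-learn's IterativeImputer purely as a stand-in for the paper's Bayesian-network imputation, and stops at the completed dataset on which a survival tree would then be grown. Variable names and data are invented.

```python
# Generic impute-then-analyse sketch; IterativeImputer is NOT the paper's method.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(8)
n = 200
clinical = pd.DataFrame({
    "age": rng.normal(65, 10, n),
    "creatinine": rng.lognormal(0.1, 0.3, n),
    "tumour_size_mm": rng.normal(30, 8, n),
})
# Introduce missingness in one covariate.
mask = rng.random(n) < 0.25
clinical.loc[mask, "creatinine"] = np.nan

# Impute missing covariates from the observed ones; the completed dataset can then
# be passed to a survival tree (or any other complete-data method).
completed = pd.DataFrame(
    IterativeImputer(random_state=0).fit_transform(clinical),
    columns=clinical.columns,
)
print(completed.isna().sum())   # no missing values remain
```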
Abstract:
The predominant fear in capital markets is that of a price spike. Commodity markets differ in that there is a fear of both upward and downward jumps; this results in implied volatility curves displaying distinct shapes compared with equity markets. A novel functional data analysis (FDA) approach provides a framework to produce and interpret functional objects that characterise the underlying dynamics of oil futures options. We use the FDA framework to examine implied volatility, jump risk, and pricing dynamics within crude oil markets. Examining a WTI crude oil sample for the 2007–2013 period, which includes the global financial crisis and the Arab Spring, we find strong evidence of converse jump dynamics during periods of demand- and supply-side weakness. This is used as the basis for an FDA-derived, Merton (1976) jump-diffusion-optimised delta hedging strategy, which exhibits superior portfolio management results relative to traditional methods.
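A hedged sketch of Merton (1976) jump-diffusion call pricing, the model underpinning the delta-hedging strategy mentioned above, written as a Poisson-weighted sum of Black-Scholes prices; all parameter values are illustrative.

```python
# Merton (1976) jump-diffusion call price; parameters are illustrative only.
import math
from scipy.stats import norm

def black_scholes_call(S, K, T, r, sigma):
    d1 = (math.log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    return S * norm.cdf(d1) - K * math.exp(-r * T) * norm.cdf(d2)

def merton_jump_call(S, K, T, r, sigma, lam, mu_j, sigma_j, n_terms=40):
    """Poisson-weighted sum of Black-Scholes prices, conditioning on the jump count."""
    k = math.exp(mu_j + 0.5 * sigma_j**2) - 1          # expected relative jump size
    lam_prime = lam * (1 + k)
    price = 0.0
    for n in range(n_terms):
        sigma_n = math.sqrt(sigma**2 + n * sigma_j**2 / T)
        r_n = r - lam * k + n * (mu_j + 0.5 * sigma_j**2) / T
        weight = math.exp(-lam_prime * T) * (lam_prime * T) ** n / math.factorial(n)
        price += weight * black_scholes_call(S, K, T, r_n, sigma_n)
    return price

# Example: a WTI-style call with downward-biased jumps (supply-side fear).
print(round(merton_jump_call(S=90, K=95, T=0.5, r=0.02, sigma=0.25,
                             lam=0.8, mu_j=-0.05, sigma_j=0.15), 4))
```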
Abstract:
The Antrim Coast Road, stretching from the seaport of Larne in the East of Northern Ireland to the famous Giant’s Causeway in the North, has a well-deserved reputation for being one of the most spectacular roads in Europe (Day, 2006). At various locations along the route, fluid interactions between the problematic geology (Jurassic Lias Clay and Triassic Mudstone overlain by Cretaceous Limestone and Tertiary Basalt) and environmental variables result in frequent instances of slope instability within the vadose zone. During such instances of instability, debris flows and composite mudflows encroach on the carriageway, posing a hazard to road users. This paper examines the site investigation, geotechnical and spatial analysis techniques currently being implemented to monitor slope stability for one site at Straidkilly Point, Glenarm, Northern Ireland. An in-depth understanding of the geology was obtained via boreholes, resistivity surveys and laboratory testing. Environmental variables recorded by an on-site weather station were correlated with measured pore water pressure and soil moisture infiltration dynamic data.
Terrestrial LiDAR (TLS) was applied to the slope for the monitoring of failures, with surveys carried out on a bi-monthly basis. TLS monitoring allowed for the generation of Digital Elevation Models (DEMs) of difference, highlighting areas of recent movement, erosion and deposition. Morphology parameters, including slope, curvature and multiple measures of roughness, were generated from the DEMs. Changes in the structure of the slope, coupled with these morphological parameters, are characterised and linked to progressive failures identified through the temporal monitoring. In addition to TLS monitoring, aerial LiDAR datasets were used for the spatio-morphological characterisation of the slope at the macro scale. Results from the geotechnical and environmental monitoring were compared with spatial data obtained through terrestrial and airborne LiDAR, providing a multi-faceted approach to slope stability characterisation, which facilitates more informed management of geotechnical risk by the Northern Ireland Roads Service.
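A simple sketch of the DEM-of-difference calculation used to highlight erosion and deposition between repeat surveys; the elevation grids and the level-of-detection threshold are synthetic assumptions, not the Straidkilly Point data.

```python
# DEM of difference between two repeat elevation grids; data are simulated.
import numpy as np

rng = np.random.default_rng(6)
dem_epoch_1 = rng.normal(50.0, 2.0, size=(200, 200))          # elevations in metres
dem_epoch_2 = dem_epoch_1.copy()
dem_epoch_2[80:100, 80:100] -= 0.5                            # simulated erosion scar
dem_epoch_2 += rng.normal(0, 0.02, dem_epoch_2.shape)          # survey noise

dod = dem_epoch_2 - dem_epoch_1                                # DEM of difference
level_of_detection = 0.1                                       # metres; assumed uncertainty threshold
erosion = dod < -level_of_detection
deposition = dod > level_of_detection
print(f"eroded cells: {erosion.sum()}, deposited cells: {deposition.sum()}")
```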
Abstract:
Summary statistics continue to play an important role in identifying and monitoring patterns and trends in educational inequalities between differing groups of pupils over time. However, this article argues that their uncritical use can also encourage the labelling of whole groups of pupils as ‘underachievers’ or ‘overachievers’, as findings from group-level data are simply applied to individual group members, a practice commonly termed the ‘ecological fallacy’. Some of the adverse consequences of this will be outlined in relation to current debates concerning gender and ethnic differences in educational attainment. It will be argued that one way of countering this uncritical use of summary statistics, and the ecological fallacy it tends to encourage, is to make much more use of the principles and methods of what has been termed ‘exploratory data analysis’. Such an approach is illustrated through a secondary analysis of data from the Youth Cohort Study of England and Wales, focusing on gender and ethnic differences in educational attainment. It will be shown that, by placing an emphasis on the graphical display of data and on encouraging researchers to describe those data more qualitatively, such an approach represents an essential addition to the use of simple summary statistics and helps to avoid the limitations associated with them.
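A tiny numerical illustration of the article's argument, using simulated scores rather than Youth Cohort Study data: a group-level gap can be real while a large share of the "lower" group sits above the other group's median, which is exactly what group-level labels obscure.

```python
# Group means differ, yet individual distributions overlap heavily; scores simulated.
import numpy as np

rng = np.random.default_rng(7)
group_a = rng.normal(52, 12, 5000)          # simulated attainment scores
group_b = rng.normal(49, 12, 5000)

print(f"mean gap: {group_a.mean() - group_b.mean():.1f} points")
overlap = (group_b > np.median(group_a)).mean()
print(f"share of group B above group A's median: {overlap:.0%}")
```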
Abstract:
Purpose – The purpose of this paper is to present an analysis of media representation of business ethics within 62 international newspapers to explore the longitudinal and contextual evolution of business ethics and associated terminology. Levels of coverage and contextual analysis of the content of the articles are used as surrogate measures of the penetration of business ethics concepts into society. Design/methodology/approach – This paper uses a text mining application based on two samples of data: analysis of 62 national newspapers in 21 countries from 1990 to 2008, and analysis of the content of two samples of articles containing the term business ethics (comprising 100 newspaper articles spread over an 18-year period from a sample of US and UK newspapers). Findings – The paper demonstrates increased coverage of sustainability topics within the media over the last 18 years, associated with events such as the Rio Summit. Whilst some peaks are associated with business ethics scandals, the overall coverage remains steady. There is little apparent use in the media of concepts such as corporate citizenship. The academic community and company ethical codes appear to adopt a wider definition of business ethics, more akin to that associated with sustainability, than the focus taken by the media, especially in the USA. Coverage demonstrates clear regional bias, and contextual analysis of the articles in the UK and USA also shows interesting parallels and divergences in the media representation of business ethics. Originality/value – The paper offers a promising avenue for exploring how the evolution of sustainability issues, including business ethics, can be tracked within a societal context.