906 resultados para Exploratory statistical data analysis
Resumo:
Background: Expression microarrays are increasingly used to obtain large scale transcriptomic information on a wide range of biological samples. Nevertheless, there is still much debate on the best ways to process data, to design experiments and analyse the output. Furthermore, many of the more sophisticated mathematical approaches to data analysis in the literature remain inaccessible to much of the biological research community. In this study we examine ways of extracting and analysing a large data set obtained using the Agilent long oligonucleotide transcriptomics platform, applied to a set of human macrophage and dendritic cell samples. Results: We describe and validate a series of data extraction, transformation and normalisation steps which are implemented via a new R function. Analysis of replicate normalised reference data demonstrate that intrarray variability is small (only around 2 of the mean log signal), while interarray variability from replicate array measurements has a standard deviation (SD) of around 0.5 log(2) units (6 of mean). The common practise of working with ratios of Cy5/Cy3 signal offers little further improvement in terms of reducing error. Comparison to expression data obtained using Arabidopsis samples demonstrates that the large number of genes in each sample showing a low level of transcription reflect the real complexity of the cellular transcriptome. Multidimensional scaling is used to show that the processed data identifies an underlying structure which reflect some of the key biological variables which define the data set. This structure is robust, allowing reliable comparison of samples collected over a number of years and collected by a variety of operators. Conclusions: This study outlines a robust and easily implemented pipeline for extracting, transforming normalising and visualising transcriptomic array data from Agilent expression platform. The analysis is used to obtain quantitative estimates of the SD arising from experimental (non biological) intra- and interarray variability, and for a lower threshold for determining whether an individual gene is expressed. The study provides a reliable basis for further more extensive studies of the systems biology of eukaryotic cells.
Resumo:
JASMIN is a super-data-cluster designed to provide a high-performance high-volume data analysis environment for the UK environmental science community. Thus far JASMIN has been used primarily by the atmospheric science and earth observation communities, both to support their direct scientific workflow, and the curation of data products in the STFC Centre for Environmental Data Archival (CEDA). Initial JASMIN configuration and first experiences are reported here. Useful improvements in scientific workflow are presented. It is clear from the explosive growth in stored data and use that there was a pent up demand for a suitable big-data analysis environment. This demand is not yet satisfied, in part because JASMIN does not yet have enough compute, the storage is fully allocated, and not all software needs are met. Plans to address these constraints are introduced.
Resumo:
Streamwater nitrate dynamics in the River Hafren, Plynlimon, mid-Wales were investigated over decadal to sub-daily timescales using a range of statistical techniques. Long-term data were derived from weekly grab samples (1984–2010) and high-frequency data from 7-hourly samples (2007–2009) both measured at two sites: a headwater stream draining moorland and a downstream site below plantation forest. This study is one of the first to analyse upland streamwater nitrate dynamics across such a wide range of timescales and report on the principal mechanisms identified. The data analysis provided no clear evidence that the long-term decline in streamwater nitrate concentrations was related to a decline in atmospheric deposition alone, because nitrogen deposition first increased and then decreased during the study period. Increased streamwater temperature and denitrification may also have contributed to the decline in stream nitrate concentrations, the former through increased N uptake rates and the latter resultant from increased dissolved organic carbon concentrations. Strong seasonal cycles, with concentration minimums in the summer, were driven by seasonal flow minimums and seasonal biological activity enhancing nitrate uptake. Complex diurnal dynamics were observed, with seasonal changes in phase and amplitude of the cycling, and the diurnal dynamics were variable along the river. At the moorland site, a regular daily cycle, with minimum concentrations in the early afternoon, corresponding with peak air temperatures, indicated the importance of instream biological processing. At the downstream site, the diurnal dynamics were a composite signal, resultant from advection, dispersion and nitrate processing in the soils of the lower catchment. The diurnal streamwater nitrate dynamics were also affected by drought conditions. Enhanced diurnal cycling in Spring 2007 was attributed to increased nitrate availability in the post-drought period as well as low flow rates and high temperatures over this period. The combination of high-frequency short-term measurements and long-term monitoring provides a powerful tool for increasing understanding of the controls of element fluxes and concentrations in surface waters.
Resumo:
There are now many reports of imaging experiments with small cohorts of typical participants that precede large-scale, often multicentre studies of psychiatric and neurological disorders. Data from these calibration experiments are sufficient to make estimates of statistical power and predictions of sample size and minimum observable effect sizes. In this technical note, we suggest how previously reported voxel-based power calculations can support decision making in the design, execution and analysis of cross-sectional multicentre imaging studies. The choice of MRI acquisition sequence, distribution of recruitment across acquisition centres, and changes to the registration method applied during data analysis are considered as examples. The consequences of modification are explored in quantitative terms by assessing the impact on sample size for a fixed effect size and detectable effect size for a fixed sample size. The calibration experiment dataset used for illustration was a precursor to the now complete Medical Research Council Autism Imaging Multicentre Study (MRC-AIMS). Validation of the voxel-based power calculations is made by comparing the predicted values from the calibration experiment with those observed in MRC-AIMS. The effect of non-linear mappings during image registration to a standard stereotactic space on the prediction is explored with reference to the amount of local deformation. In summary, power calculations offer a validated, quantitative means of making informed choices on important factors that influence the outcome of studies that consume significant resources.
Resumo:
Owing to continuous advances in the computational power of handheld devices like smartphones and tablet computers, it has become possible to perform Big Data operations including modern data mining processes onboard these small devices. A decade of research has proved the feasibility of what has been termed as Mobile Data Mining, with a focus on one mobile device running data mining processes. However, it is not before 2010 until the authors of this book initiated the Pocket Data Mining (PDM) project exploiting the seamless communication among handheld devices performing data analysis tasks that were infeasible until recently. PDM is the process of collaboratively extracting knowledge from distributed data streams in a mobile computing environment. This book provides the reader with an in-depth treatment on this emerging area of research. Details of techniques used and thorough experimental studies are given. More importantly and exclusive to this book, the authors provide detailed practical guide on the deployment of PDM in the mobile environment. An important extension to the basic implementation of PDM dealing with concept drift is also reported. In the era of Big Data, potential applications of paramount importance offered by PDM in a variety of domains including security, business and telemedicine are discussed.
Resumo:
Much UK research and market practice on portfolio strategy and performance benchmarking relies on a sector‐geography subdivision of properties. Prior tests of the appropriateness of such divisions have generally relied on aggregated or hypothetical return data. However, the results found in aggregate may not hold when individual buildings are considered. This paper makes use of a dataset of individual UK property returns. A series of multivariate exploratory statistical techniques are utilised to test whether the return behaviour of individual properties conforms to their a priori grouping. The results suggest strongly that neither standard sector nor regional classifications provide a clear demarcation of individual building performance. This has important implications for both portfolio strategy and performance measurement and benchmarking. However, there do appear to be size and yield effects that help explain return behaviour at the property level.
Resumo:
Smart healthcare is a complex domain for systems integration due to human and technical factors and heterogeneous data sources involved. As a part of smart city, it is such a complex area where clinical functions require smartness of multi-systems collaborations for effective communications among departments, and radiology is one of the areas highly relies on intelligent information integration and communication. Therefore, it faces many challenges regarding integration and its interoperability such as information collision, heterogeneous data sources, policy obstacles, and procedure mismanagement. The purpose of this study is to conduct an analysis of data, semantic, and pragmatic interoperability of systems integration in radiology department, and to develop a pragmatic interoperability framework for guiding the integration. We select an on-going project at a local hospital for undertaking our case study. The project is to achieve data sharing and interoperability among Radiology Information Systems (RIS), Electronic Patient Record (EPR), and Picture Archiving and Communication Systems (PACS). Qualitative data collection and analysis methods are used. The data sources consisted of documentation including publications and internal working papers, one year of non-participant observations and 37 interviews with radiologists, clinicians, directors of IT services, referring clinicians, radiographers, receptionists and secretary. We identified four primary phases of data analysis process for the case study: requirements and barriers identification, integration approach, interoperability measurements, and knowledge foundations. Each phase is discussed and supported by qualitative data. Through the analysis we also develop a pragmatic interoperability framework that summaries the empirical findings and proposes recommendations for guiding the integration in the radiology context.
Resumo:
Traditionally, spoor (tracks, pug marks) have been used as a cost effective tool to assess the presence of larger mammals. Automated camera traps are now increasingly utilized to monitor wildlife, primarily as the cost has greatly declined and statistical approaches to data analysis have improved. While camera traps have become ubiquitous, we have little understanding of their effectiveness when compared to traditional approaches using spoor in the field. Here, we a) test the success of camera traps in recording a range of carnivore species against spoor; b) ask if simple measures of spoor size taken by amateur volunteers is likely to allow individual identification of leopards and c) for a trained tracker, ask if this approach may allow individual leopards to be followed with confidence in savannah habitat. We found that camera traps significantly under-recorded mammalian top and meso-carnivores, with camera traps more likely under-record the presence of smaller carnivores (civet 64%; genet 46%, Meller’s mongoose 45%) than larger (jackal sp. 30%, brown hyena 22%), while leopard was more likely to be recorded by camera trap (all recorded by camera trap only). We found that amateur trackers could be beneficial in regards to collecting presence data; however the large variance in measurements of spoor taken in the field by volunteers suggests that this approach is unlikely to add further data. Nevertheless, the use of simple spoor measurements in the field by a trained field researcher increases their ability to reliably follow a leopard trail in difficult terrain. This allows researchers to glean further data on leopard behaviour and habitat utilisation without the need for complex analysis.
Resumo:
The dynamical processes that lead to open cluster disruption cause its mass to decrease. To investigate such processes from the observational point of view, it is important to identify open cluster remnants (OCRs), which are intrinsically poorly populated. Due to their nature, distinguishing them from field star fluctuations is still an unresolved issue. In this work, we developed a statistical diagnostic tool to distinguish poorly populated star concentrations from background field fluctuations. We use 2MASS photometry to explore one of the conditions required for a stellar group to be a physical group: to produce distinct sequences in a colour-magnitude diagram (CMD). We use automated tools to (i) derive the limiting radius; (ii) decontaminate the field and assign membership probabilities; (iii) fit isochrones; and (iv) compare object and field CMDs, considering the isochrone solution, in order to verify the similarity. If the object cannot be statistically considered as a field fluctuation, we derive its probable age, distance modulus, reddening and uncertainties in a self-consistent way. As a test, we apply the tool to open clusters and comparison fields. Finally, we study the OCR candidates DoDz 6, NGC 272, ESO 435 SC48 and ESO 325 SC15. The tool is optimized to treat these low-statistic objects and to separate the best OCR candidates for studies on kinematics and chemical composition. The study of the possible OCRs will certainly provide a deep understanding of OCR properties and constraints for theoretical models, including insights into the evolution of open clusters and dissolution rates.
Resumo:
Evidence of jet precession in many galactic and extragalactic sources has been reported in the literature. Much of this evidence is based on studies of the kinematics of the jet knots, which depends on the correct identification of the components to determine their respective proper motions and position angles on the plane of the sky. Identification problems related to fitting procedures, as well as observations poorly sampled in time, may influence the follow-up of the components in time, which consequently might contribute to a misinterpretation of the data. In order to deal with these limitations, we introduce a very powerful statistical tool to analyse jet precession: the cross-entropy method for continuous multi-extremal optimization. Only based on the raw data of the jet components (right ascension and declination offsets from the core), the cross-entropy method searches for the precession model parameters that better represent the data. In this work we present a large number of tests to validate this technique, using synthetic precessing jets built from a given set of precession parameters. With the aim of recovering these parameters, we applied the cross-entropy method to our precession model, varying exhaustively the quantities associated with the method. Our results have shown that even in the most challenging tests, the cross-entropy method was able to find the correct parameters within a 1 per cent level. Even for a non-precessing jet, our optimization method could point out successfully the lack of precession.
Resumo:
We present a new technique for obtaining model fittings to very long baseline interferometric images of astrophysical jets. The method minimizes a performance function proportional to the sum of the squared difference between the model and observed images. The model image is constructed by summing N(s) elliptical Gaussian sources characterized by six parameters: two-dimensional peak position, peak intensity, eccentricity, amplitude, and orientation angle of the major axis. We present results for the fitting of two main benchmark jets: the first constructed from three individual Gaussian sources, the second formed by five Gaussian sources. Both jets were analyzed by our cross-entropy technique in finite and infinite signal-to-noise regimes, the background noise chosen to mimic that found in interferometric radio maps. Those images were constructed to simulate most of the conditions encountered in interferometric images of active galactic nuclei. We show that the cross-entropy technique is capable of recovering the parameters of the sources with a similar accuracy to that obtained from the very traditional Astronomical Image Processing System Package task IMFIT when the image is relatively simple (e. g., few components). For more complex interferometric maps, our method displays superior performance in recovering the parameters of the jet components. Our methodology is also able to show quantitatively the number of individual components present in an image. An additional application of the cross-entropy technique to a real image of a BL Lac object is shown and discussed. Our results indicate that our cross-entropy model-fitting technique must be used in situations involving the analysis of complex emission regions having more than three sources, even though it is substantially slower than current model-fitting tasks (at least 10,000 times slower for a single processor, depending on the number of sources to be optimized). As in the case of any model fitting performed in the image plane, caution is required in analyzing images constructed from a poorly sampled (u, v) plane.
Resumo:
Non-linear methods for estimating variability in time-series are currently of widespread use. Among such methods are approximate entropy (ApEn) and sample approximate entropy (SampEn). The applicability of ApEn and SampEn in analyzing data is evident and their use is increasing. However, consistency is a point of concern in these tools, i.e., the classification of the temporal organization of a data set might indicate a relative less ordered series in relation to another when the opposite is true. As highlighted by their proponents themselves, ApEn and SampEn might present incorrect results due to this lack of consistency. In this study, we present a method which gains consistency by using ApEn repeatedly in a wide range of combinations of window lengths and matching error tolerance. The tool is called volumetric approximate entropy, vApEn. We analyze nine artificially generated prototypical time-series with different degrees of temporal order (combinations of sine waves, logistic maps with different control parameter values, random noises). While ApEn/SampEn clearly fail to consistently identify the temporal order of the sequences, vApEn correctly do. In order to validate the tool we performed shuffled and surrogate data analysis. Statistical analysis confirmed the consistency of the method. (C) 2008 Elsevier Ltd. All rights reserved.
Resumo:
A large amount of biological data has been produced in the last years. Important knowledge can be extracted from these data by the use of data analysis techniques. Clustering plays an important role in data analysis, by organizing similar objects from a dataset into meaningful groups. Several clustering algorithms have been proposed in the literature. However, each algorithm has its bias, being more adequate for particular datasets. This paper presents a mathematical formulation to support the creation of consistent clusters for biological data. Moreover. it shows a clustering algorithm to solve this formulation that uses GRASP (Greedy Randomized Adaptive Search Procedure). We compared the proposed algorithm with three known other algorithms. The proposed algorithm presented the best clustering results confirmed statistically. (C) 2009 Elsevier Ltd. All rights reserved.
Resumo:
Accessibility has become a serious issue to be considered by various sectors of the society. However, what are the differences between the perception of accessibility by academy, government and industry? In this paper, we present an analysis of this issue based on a large survey carried out with 613 participants involved with Web development, from all of the 27 Brazilian states. The paper presents results from the data analysis for each sector, along with statistical tests regarding the main different issues related to each of the sectors, such as: government and law, industry and techniques, academy and education. The concern about accessibility law is poor even amongst people from government sector. The analyses have also pointed out that the academy has not been addressing accessibility training accordingly. The knowledge about proper techniques to produce accessible contents is better than other sectors`, but still limited in industry. Stronger investments in training and in the promotion of consciousness about the law may be pointed as the most important tools to help a more effective policy on Web accessibility in Brazil.
Resumo:
In interval-censored survival data, the event of interest is not observed exactly but is only known to occur within some time interval. Such data appear very frequently. In this paper, we are concerned only with parametric forms, and so a location-scale regression model based on the exponentiated Weibull distribution is proposed for modeling interval-censored data. We show that the proposed log-exponentiated Weibull regression model for interval-censored data represents a parametric family of models that include other regression models that are broadly used in lifetime data analysis. Assuming the use of interval-censored data, we employ a frequentist analysis, a jackknife estimator, a parametric bootstrap and a Bayesian analysis for the parameters of the proposed model. We derive the appropriate matrices for assessing local influences on the parameter estimates under different perturbation schemes and present some ways to assess global influences. Furthermore, for different parameter settings, sample sizes and censoring percentages, various simulations are performed; in addition, the empirical distribution of some modified residuals are displayed and compared with the standard normal distribution. These studies suggest that the residual analysis usually performed in normal linear regression models can be straightforwardly extended to a modified deviance residual in log-exponentiated Weibull regression models for interval-censored data. (C) 2009 Elsevier B.V. All rights reserved.