44 results for statistical data analysis


Relevance: 90.00%

Abstract:

Adaptive methods which “equidistribute” a given positive weight function are now used fairly widely for selecting discrete meshes. The disadvantage of such schemes is that the resulting mesh may not be smoothly varying. In this paper a technique is developed for equidistributing a function subject to constraints on the ratios of adjacent steps in the mesh. Given a weight function $f \geq 0$ on an interval $[a,b]$ and constants $c$ and $K$, the method produces a mesh with points $x_0 = a$, $x_{j+1} = x_j + h_j$ for $j = 0, 1, \cdots, n-1$, and $x_n = b$ such that
\[
\int_{x_j}^{x_{j+1}} f \leq c \quad\text{and}\quad \frac{1}{K} \leq \frac{h_{j+1}}{h_j} \leq K \quad\text{for } j = 0, 1, \cdots, n-1 .
\]
A theoretical analysis of the procedure is presented, and numerical algorithms for implementing the method are given. Examples show that the procedure is effective in practice. Other types of constraints on equidistributing meshes are also discussed. The principal application of the procedure is to the solution of boundary value problems, where the weight function is generally some error indicator, and accuracy and convergence properties may depend on the smoothness of the mesh. Other practical applications include the regrading of statistical data.
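Although the paper's own algorithms are not reproduced here, the construction can be illustrated with a rough sketch. The following Python snippet is our own illustration, not the authors' method; `build_mesh`, the weight function and all parameters are hypothetical. It greedily grows a mesh whose steps keep the weight integral below c while respecting the ratio bound K; a full method would instead revisit earlier steps when the lower ratio bound cannot be met.

```python
# Illustrative sketch only (not the paper's algorithm): greedily build a mesh on
# [a, b] so that the integral of a weight f over each step stays below c while
# adjacent step ratios stay within [1/K, K].  Assumes f >= 0; build_mesh and its
# parameters are our own hypothetical names.
import numpy as np
from scipy.integrate import quad

def build_mesh(weight, a, b, c, K, h0):
    """h0 seeds the ratio constraint for the first step."""
    xs, h_prev = [a], h0
    while xs[-1] < b:
        x = xs[-1]
        # Largest step not exceeding the ratio cap whose weight integral <= c.
        lo, hi = 0.0, min(K * h_prev, b - x)
        for _ in range(60):                      # simple bisection
            mid = 0.5 * (lo + hi)
            if quad(weight, x, x + mid)[0] <= c:
                lo = mid
            else:
                hi = mid
        h = lo
        if h < h_prev / K:
            # Lower ratio bound cannot be met greedily; a full method would go
            # back and shrink earlier steps instead of overriding like this.
            h = h_prev / K
        xs.append(min(x + h, b))                 # final step is clipped to b
        h_prev = xs[-1] - x
    return np.array(xs)

mesh = build_mesh(lambda t: 1.0 + 50.0 * np.exp(-50.0 * t), 0.0, 1.0,
                  c=0.25, K=1.5, h0=0.004)
steps = np.diff(mesh)
print(f"{len(steps)} steps; adjacent ratios range "
      f"{(steps[1:] / steps[:-1]).min():.3f} to {(steps[1:] / steps[:-1]).max():.3f}")
```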

Relevance: 90.00%

Abstract:

Background: Microarray-based comparative genomic hybridisation (CGH) experiments have been used to study numerous biological problems, including understanding genome plasticity in pathogenic bacteria. Typically such experiments produce large data sets that are difficult for biologists to handle. Although there are some programmes available for interpreting bacterial transcriptomics data and CGH microarray data for examining genetic stability in oncogenes, there are none designed specifically for understanding the mosaic nature of bacterial genomes. Consequently a bottleneck still persists in the accurate processing and mathematical analysis of these data. To address this shortfall we have produced a simple and robust CGH microarray data analysis process, which may be automated in the future, for understanding bacterial genomic diversity. Results: The process involves five steps: cleaning, normalisation, estimating gene presence and absence or divergence, validation, and analysis of data from test strains against three reference strains simultaneously. Each stage of the process is described. We have compared a number of methods for characterising bacterial genomic diversity and for calculating the cut-off between gene presence and absence or divergence, and shown that a simple dynamic approach using a kernel density estimator performed better than both the established methods and a more sophisticated mixture modelling technique. We have also shown that current methods commonly used for CGH microarray analysis in tumour and cancer cell lines are not appropriate for analysing our data. Conclusion: After carrying out the analysis and validation for three sequenced Escherichia coli strains, CGH microarray data from 19 E. coli O157 pathogenic test strains were used to demonstrate the benefits of applying this simple and robust process to CGH microarray studies using bacterial genomes.
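As a hedged illustration of the kernel-density idea mentioned in the Results (a generic sketch, not the paper's exact procedure), one can fit a KDE to per-gene log2(test/reference) ratios and take the antimode between the "present" and "absent/divergent" peaks as the cut-off; the simulated ratios below are invented for demonstration.

```python
# Generic sketch of a KDE-based presence/absence cut-off: fit a kernel density
# estimate to per-gene log2(test/reference) ratios and take the local minimum
# between the two peaks as the threshold.  Not the paper's exact procedure;
# the simulated ratios are invented.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
ratios = np.concatenate([rng.normal(0.0, 0.3, 3500),    # genes present
                         rng.normal(-2.5, 0.6, 500)])   # absent or divergent

kde = gaussian_kde(ratios)
grid = np.linspace(ratios.min(), ratios.max(), 1000)
density = kde(grid)

# Interior local minima of the density; the cut-off is the deepest one.
interior = np.where((density[1:-1] < density[:-2]) &
                    (density[1:-1] < density[2:]))[0] + 1
cutoff = grid[interior[np.argmin(density[interior])]]
absent = ratios < cutoff
print(f"cut-off = {cutoff:.2f}; {absent.sum()} genes called absent/divergent")
```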

Relevance: 90.00%

Abstract:

Nitrogen flows from European watersheds to coastal marine waters

Executive summary

Nature of the problem
• Most regional watersheds in Europe constitute managed human territories importing large amounts of new reactive nitrogen.
• As a consequence, groundwater, surface freshwater and coastal seawater are undergoing severe nitrogen contamination and/or eutrophication problems.

Approaches
• A comprehensive evaluation of net anthropogenic inputs of reactive nitrogen (NANI) through atmospheric deposition, crop N fixation, fertiliser use and import of food and feed has been carried out for all European watersheds. A database on N, P and Si fluxes delivered at the basin outlets has been assembled.
• A number of modelling approaches based on either statistical regression analysis or mechanistic description of the processes involved in nitrogen transfer and transformations have been developed for relating N inputs to watersheds to outputs into coastal marine ecosystems.

Key findings/state of knowledge
• Throughout Europe, NANI represents 3700 kgN/km2/yr (range, 0–8400 depending on the watershed), i.e. five times the background rate of natural N2 fixation.
• A mean of approximately 78% of NANI does not reach the basin outlet, but instead is stored (in soils, sediments or ground water) or eliminated to the atmosphere as reactive N forms or as N2.
• N delivery to the European marine coastal zone totals 810 kgN/km2/yr (range, 200–4000 depending on the watershed), about four times the natural background. In areas of limited availability of silica, these inputs cause harmful algal blooms.

Major uncertainties/challenges
• The exact dimension of anthropogenic N inputs to watersheds is still imperfectly known and requires pursuing monitoring programmes and data integration at the international level.
• The exact nature of ‘retention’ processes, which potentially represent a major management lever for reducing N contamination of water resources, is still poorly understood.
• Coastal marine eutrophication depends to a large degree on local morphological and hydrographic conditions as well as on estuarine processes, which are also imperfectly known.

Recommendations
• Better control and management of the nitrogen cascade at the watershed scale is required to reduce N contamination of ground- and surface water, as well as coastal eutrophication.
• In spite of the potential of these management measures, there is no choice at the European scale but to reduce the primary inputs of reactive nitrogen to watersheds, through changes in agriculture, human diet and other N flows related to human activity.
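As a rough consistency check on the figures quoted under "Key findings" (our own arithmetic, not part of the summary), the reported coastal delivery of about 810 kgN/km2/yr is close to what one obtains by applying the ~78% retention to the mean NANI:

```python
# Rough consistency check of the quoted figures (our arithmetic, not the study's):
# coastal delivery ≈ NANI × (1 − fraction retained or lost within the basin).
nani = 3700        # mean net anthropogenic N input, kgN/km2/yr
retained = 0.78    # fraction stored or eliminated before the basin outlet
print(f"estimated delivery ≈ {nani * (1 - retained):.0f} kgN/km2/yr (reported ≈ 810)")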

Relevance: 90.00%

Abstract:

In recent years, the area of data mining has experienced considerable demand for technologies that extract knowledge from large and complex data sources. There has been substantial commercial interest, as well as active research in the area, aimed at developing new and improved approaches for extracting information, relationships, and patterns from large datasets. Artificial neural networks (NNs) are popular biologically inspired intelligent methodologies, whose classification, prediction, and pattern recognition capabilities have been utilized successfully in many areas, including science, engineering, medicine, business, banking, and telecommunication, among other fields. This paper highlights, from a data mining perspective, the implementation of NNs, using supervised and unsupervised learning, for pattern recognition, classification, prediction, and cluster analysis, and focuses the discussion on their usage in bioinformatics and financial data analysis tasks. © 2012 Wiley Periodicals, Inc.
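As a generic illustration of the supervised usage the paper surveys (not its experiments; the synthetic data and layer sizes are arbitrary choices of ours), a small feed-forward NN can be trained for classification as follows; the unsupervised usages discussed, such as clustering, follow the same fit-then-apply pattern.

```python
# A generic illustration (not the paper's experiments) of the supervised NN usage
# described above: a small feed-forward network trained to classify patterns.
# The synthetic tabular data and layer sizes below are arbitrary.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_informative=6,
                           random_state=0)           # stand-in for real data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0)
clf.fit(X_tr, y_tr)                                  # supervised pattern learning
print("held-out classification accuracy:", round(clf.score(X_te, y_te), 3))
```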

Relevance: 90.00%

Abstract:

Advances in hardware and software technology enable us to collect, store and distribute large quantities of data on a very large scale. Automatically discovering and extracting hidden knowledge in the form of patterns from these large data volumes is known as data mining. Data mining technology is not only a part of business intelligence, but is also used in many other application areas such as research, marketing and financial analytics. For example, medical scientists can use patterns extracted from historic patient data in order to determine if a new patient is likely to respond positively to a particular treatment or not; marketing analysts can use extracted patterns from customer data for future advertisement campaigns; finance experts have an interest in patterns that forecast the development of certain stock market shares for investment recommendations. However, extracting knowledge in the form of patterns from massive data volumes imposes a number of computational challenges in terms of processing time, memory, bandwidth and power consumption. These challenges have led to the development of parallel and distributed data analysis approaches and the utilisation of Grid and Cloud computing. This chapter gives an overview of parallel and distributed computing approaches and how they can be used to scale up data mining to large datasets.
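A minimal sketch of the data-parallel idea described above (a generic example of ours, not the chapter's code): partition the data, mine each partition on a separate worker, then merge the partial results. Here the "mining" is reduced to frequent-item counting; in a Grid or Cloud setting the workers would be remote nodes rather than local processes.

```python
# Generic data-parallel mining sketch: partition a dataset, process the
# partitions on separate workers, and merge the partial results.
from multiprocessing import Pool
from collections import Counter
import random

def mine_partition(records):
    """Local step: count item occurrences in one data partition."""
    counts = Counter()
    for transaction in records:
        counts.update(transaction)
    return counts

if __name__ == "__main__":
    random.seed(0)
    data = [[random.choice("ABCDE") for _ in range(4)] for _ in range(100_000)]
    partitions = [data[i::4] for i in range(4)]      # 4 roughly equal chunks

    with Pool(processes=4) as pool:
        partials = pool.map(mine_partition, partitions)

    merged = sum(partials, Counter())                # global aggregation step
    print(merged.most_common(3))
```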

Relevance: 90.00%

Abstract:

The purpose of this lecture is to review recent developments in data analysis, initialization and data assimilation. The development of three-dimensional multivariate schemes has been very timely because of their suitability for handling the many different types of observations available during FGGE. Great progress has taken place in the initialization of global models with the aid of the non-linear normal mode technique. However, in spite of this progress, several fundamental problems remain unsatisfactorily solved. Of particular importance are the initialization of the divergent wind fields in the Tropics and the need to find proper ways to initialize weather systems driven by non-adiabatic processes. The unsatisfactory ways in which such processes are currently initialized lead to excessively long spin-up times.

Relevance: 90.00%

Abstract:

This chapter introduces the latest practices and technologies in the interactive interpretation of environmental data. With environmental data becoming ever larger, more diverse and more complex, there is a need for a new generation of tools that provides new capabilities over and above those of the standard workhorses of science. These new tools aid the scientist in discovering interesting new features (and also problems) in large datasets by allowing the data to be explored interactively using simple, intuitive graphical tools. In this way, new discoveries are made that are commonly missed by automated batch data processing. This chapter discusses the characteristics of environmental science data, common current practice in data analysis and the supporting tools and infrastructure. New approaches are introduced and illustrated from the points of view of both the end user and the underlying technology. We conclude by speculating as to future developments in the field and what must be achieved to fulfil this vision.
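The style of exploration the chapter advocates can be sketched very simply (an illustration of ours, with a synthetic series standing in for real environmental data): rather than summarising the data in one batch pass, the series is put on screen so that spikes and gaps can be inspected by panning and zooming in an interactive figure.

```python
# Minimal sketch of interactive visual inspection of an environmental series.
# The synthetic "sensor" data below merely stand in for real observations.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
times = pd.date_range("2020-01-01", periods=5000, freq="h")
temps = (10
         + 5 * np.sin(2 * np.pi * np.arange(5000) / (24 * 365))
         + rng.normal(0, 0.4, 5000))
temps[3200:3210] += 15                      # an artificial suspect spike

fig, ax = plt.subplots()
ax.plot(times, temps, ".", markersize=2)
ax.set_xlabel("time")
ax.set_ylabel("temperature (°C)")
ax.set_title("Pan/zoom interactively to inspect the suspect spike")
plt.show()                                  # interactive backend provides pan/zoom
```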

Relevance: 90.00%

Abstract:

Background: Expression microarrays are increasingly used to obtain large-scale transcriptomic information on a wide range of biological samples. Nevertheless, there is still much debate on the best ways to process data, to design experiments and to analyse the output. Furthermore, many of the more sophisticated mathematical approaches to data analysis in the literature remain inaccessible to much of the biological research community. In this study we examine ways of extracting and analysing a large data set obtained using the Agilent long oligonucleotide transcriptomics platform, applied to a set of human macrophage and dendritic cell samples. Results: We describe and validate a series of data extraction, transformation and normalisation steps which are implemented via a new R function. Analysis of replicate normalised reference data demonstrates that intra-array variability is small (only around 2% of the mean log signal), while inter-array variability from replicate array measurements has a standard deviation (SD) of around 0.5 log2 units (6% of mean). The common practice of working with ratios of Cy5/Cy3 signal offers little further improvement in terms of reducing error. Comparison to expression data obtained using Arabidopsis samples demonstrates that the large number of genes in each sample showing a low level of transcription reflects the real complexity of the cellular transcriptome. Multidimensional scaling is used to show that the processed data identify an underlying structure which reflects some of the key biological variables that define the data set. This structure is robust, allowing reliable comparison of samples collected over a number of years by a variety of operators. Conclusions: This study outlines a robust and easily implemented pipeline for extracting, transforming, normalising and visualising transcriptomic array data from the Agilent expression platform. The analysis is used to obtain quantitative estimates of the SD arising from experimental (non-biological) intra- and inter-array variability, and a lower threshold for determining whether an individual gene is expressed. The study provides a reliable basis for further, more extensive studies of the systems biology of eukaryotic cells.
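The flavour of the processing steps can be sketched as follows (a generic illustration, not the paper's R function; the simulated intensities, replicate count and noise level are invented): log2-scale signals from replicate arrays are median-centred, and the inter-array SD is estimated from the replicates.

```python
# A hedged sketch (not the paper's R function) of the kind of processing the
# abstract describes: put replicate arrays on a log2 scale, median-centre them,
# and estimate the SD attributable to inter-array variability from replicates.
# The simulated intensities and noise level are invented.
import numpy as np

rng = np.random.default_rng(1)
true_log2 = rng.normal(8, 2, size=5000)                   # "true" log2 expression
arrays = np.stack([true_log2 + rng.normal(0, 0.5, 5000)   # per-array noise
                   for _ in range(4)])                    # 4 replicate arrays

arrays -= np.median(arrays, axis=1, keepdims=True)        # median-centre each array

inter_sd = arrays.std(axis=0, ddof=1).mean()              # per-gene SD across arrays
print(f"mean inter-array SD ≈ {inter_sd:.2f} log2 units")
```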

Relevance: 90.00%

Abstract:

JASMIN is a super-data-cluster designed to provide a high-performance, high-volume data analysis environment for the UK environmental science community. Thus far JASMIN has been used primarily by the atmospheric science and earth observation communities, both to support their direct scientific workflow and to curate data products in the STFC Centre for Environmental Data Archival (CEDA). Initial JASMIN configuration and first experiences are reported here. Useful improvements in scientific workflow are presented. It is clear from the explosive growth in stored data and use that there was pent-up demand for a suitable big-data analysis environment. This demand is not yet satisfied, in part because JASMIN does not yet have enough compute, the storage is fully allocated, and not all software needs are met. Plans to address these constraints are introduced.

Relevance: 90.00%

Abstract:

Streamwater nitrate dynamics in the River Hafren, Plynlimon, mid-Wales were investigated over decadal to sub-daily timescales using a range of statistical techniques. Long-term data were derived from weekly grab samples (1984–2010) and high-frequency data from 7-hourly samples (2007–2009), both measured at two sites: a headwater stream draining moorland and a downstream site below plantation forest. This study is one of the first to analyse upland streamwater nitrate dynamics across such a wide range of timescales and report on the principal mechanisms identified. The data analysis provided no clear evidence that the long-term decline in streamwater nitrate concentrations was related to a decline in atmospheric deposition alone, because nitrogen deposition first increased and then decreased during the study period. Increased streamwater temperature and denitrification may also have contributed to the decline in stream nitrate concentrations, the former through increased N uptake rates and the latter resulting from increased dissolved organic carbon concentrations. Strong seasonal cycles, with concentration minima in the summer, were driven by seasonal flow minima and seasonal biological activity enhancing nitrate uptake. Complex diurnal dynamics were observed, with seasonal changes in the phase and amplitude of the cycling, and the diurnal dynamics were variable along the river. At the moorland site, a regular daily cycle, with minimum concentrations in the early afternoon corresponding with peak air temperatures, indicated the importance of instream biological processing. At the downstream site, the diurnal dynamics were a composite signal, resulting from advection, dispersion and nitrate processing in the soils of the lower catchment. The diurnal streamwater nitrate dynamics were also affected by drought conditions. Enhanced diurnal cycling in spring 2007 was attributed to increased nitrate availability in the post-drought period as well as low flow rates and high temperatures over this period. The combination of high-frequency short-term measurements and long-term monitoring provides a powerful tool for increasing understanding of the controls of element fluxes and concentrations in surface waters.
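A generic sketch of how such seasonal and diurnal cycles can be extracted from a high-frequency record (our own illustration, not the paper's statistical analysis; the synthetic 7-hourly series merely stands in for the Plynlimon data):

```python
# Generic extraction of seasonal and diurnal cycles from a high-frequency series:
# group concentrations by month and by hour of day.  The synthetic 7-hourly
# series below is invented to stand in for the real data.
import numpy as np
import pandas as pd

idx = pd.date_range("2007-01-01", "2009-12-31", freq="7h")
nitrate = (0.5
           - 0.2 * np.cos(2 * np.pi * (idx.dayofyear - 200) / 365)   # summer minimum
           - 0.05 * np.cos(2 * np.pi * (idx.hour - 14) / 24)         # afternoon minimum
           + np.random.default_rng(3).normal(0, 0.02, len(idx)))
ts = pd.Series(nitrate, index=idx, name="nitrate_mg_l")

seasonal_cycle = ts.groupby(ts.index.month).mean()
diurnal_cycle = ts.groupby(ts.index.hour).mean()
print("month of minimum concentration:", seasonal_cycle.idxmin())
print("hour of minimum concentration:", diurnal_cycle.idxmin())
```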

Relevance: 90.00%

Abstract:

There are now many reports of imaging experiments with small cohorts of typical participants that precede large-scale, often multicentre studies of psychiatric and neurological disorders. Data from these calibration experiments are sufficient to make estimates of statistical power and predictions of sample size and minimum observable effect sizes. In this technical note, we suggest how previously reported voxel-based power calculations can support decision making in the design, execution and analysis of cross-sectional multicentre imaging studies. The choice of MRI acquisition sequence, distribution of recruitment across acquisition centres, and changes to the registration method applied during data analysis are considered as examples. The consequences of modification are explored in quantitative terms by assessing the impact on sample size for a fixed effect size and detectable effect size for a fixed sample size. The calibration experiment dataset used for illustration was a precursor to the now complete Medical Research Council Autism Imaging Multicentre Study (MRC-AIMS). Validation of the voxel-based power calculations is made by comparing the predicted values from the calibration experiment with those observed in MRC-AIMS. The effect of non-linear mappings during image registration to a standard stereotactic space on the prediction is explored with reference to the amount of local deformation. In summary, power calculations offer a validated, quantitative means of making informed choices on important factors that influence the outcome of studies that consume significant resources.
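The kind of calculation such power analyses build on can be illustrated with the standard two-sample formula for the required group size at a given standardised effect size, significance level and power (a textbook approximation, not the study's voxel-based method):

```python
# Standard two-sample, two-sided power calculation (textbook approximation):
# participants per group needed to detect a standardised effect size d at
# significance level alpha with power 1 - beta.
from scipy.stats import norm

def n_per_group(d, alpha=0.05, power=0.80):
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return 2 * (z / d) ** 2

for d in (0.3, 0.5, 0.8):   # small, medium, large effects
    print(f"effect size {d}: about {n_per_group(d):.0f} participants per group")
```

For a medium effect (d = 0.5) this gives roughly 63 participants per group, illustrating how the detectable effect size trades off against sample size when choices such as acquisition sequence or recruitment distribution change the effective noise level.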

Relevance: 90.00%

Abstract:

Owing to continuous advances in the computational power of handheld devices like smartphones and tablet computers, it has become possible to perform Big Data operations, including modern data mining processes, onboard these small devices. A decade of research has proved the feasibility of what has been termed Mobile Data Mining, with a focus on one mobile device running data mining processes. However, it was not until 2010 that the authors of this book initiated the Pocket Data Mining (PDM) project, exploiting the seamless communication among handheld devices performing data analysis tasks that were infeasible until recently. PDM is the process of collaboratively extracting knowledge from distributed data streams in a mobile computing environment. This book provides the reader with an in-depth treatment of this emerging area of research. Details of the techniques used and thorough experimental studies are given. More importantly, and exclusive to this book, the authors provide a detailed practical guide on the deployment of PDM in the mobile environment. An important extension to the basic implementation of PDM dealing with concept drift is also reported. In the era of Big Data, potential applications of paramount importance offered by PDM in a variety of domains, including security, business and telemedicine, are discussed.

Relevance: 90.00%

Abstract:

Smart healthcare is a complex domain for systems integration because of the human and technical factors and heterogeneous data sources involved. As part of the smart city, it is a complex area in which clinical functions require smart multi-system collaboration for effective communication among departments, and radiology is one of the areas that relies most heavily on intelligent information integration and communication. It therefore faces many challenges regarding integration and interoperability, such as information collision, heterogeneous data sources, policy obstacles, and procedure mismanagement. The purpose of this study is to analyse the data, semantic, and pragmatic interoperability of systems integration in a radiology department, and to develop a pragmatic interoperability framework for guiding the integration. We selected an ongoing project at a local hospital as our case study. The project aims to achieve data sharing and interoperability among Radiology Information Systems (RIS), Electronic Patient Record (EPR), and Picture Archiving and Communication Systems (PACS). Qualitative data collection and analysis methods were used. The data sources consisted of documentation, including publications and internal working papers, one year of non-participant observations, and 37 interviews with radiologists, clinicians, directors of IT services, referring clinicians, radiographers, receptionists and a secretary. We identified four primary phases of the data analysis process for the case study: requirements and barriers identification, integration approach, interoperability measurements, and knowledge foundations. Each phase is discussed and supported by qualitative data. Through the analysis we also develop a pragmatic interoperability framework that summarises the empirical findings and proposes recommendations for guiding integration in the radiology context.

Relevance: 90.00%

Abstract:

Traditionally, spoor (tracks, pug marks) has been used as a cost-effective tool to assess the presence of larger mammals. Automated camera traps are now increasingly utilized to monitor wildlife, primarily because their cost has greatly declined and statistical approaches to data analysis have improved. While camera traps have become ubiquitous, we have little understanding of their effectiveness compared with traditional approaches using spoor in the field. Here, we (a) test the success of camera traps in recording a range of carnivore species against spoor; (b) ask whether simple measures of spoor size taken by amateur volunteers are likely to allow individual identification of leopards; and (c) ask whether, for a trained tracker, this approach may allow individual leopards to be followed with confidence in savannah habitat. We found that camera traps significantly under-recorded mammalian top and meso-carnivores, and were more likely to under-record the presence of smaller carnivores (civet 64%, genet 46%, Meller’s mongoose 45%) than larger ones (jackal sp. 30%, brown hyena 22%), while leopard was more likely to be recorded by camera trap (all leopard records came from camera traps only). We found that amateur trackers could be beneficial with regard to collecting presence data; however, the large variance in measurements of spoor taken in the field by volunteers suggests that this approach is unlikely to add further data. Nevertheless, the use of simple spoor measurements in the field by a trained field researcher increases their ability to reliably follow a leopard trail in difficult terrain. This allows researchers to glean further data on leopard behaviour and habitat utilisation without the need for complex analysis.
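The per-species comparison can be illustrated with a small calculation of our own (the event counts below are made up, chosen only to reproduce the quoted percentages): the fraction of spoor-confirmed detection events that the camera traps missed.

```python
# Illustrative calculation with made-up counts chosen to reproduce the quoted
# percentages: per-species fraction of spoor-confirmed events missed by cameras.
detections = {   # species: (events confirmed by spoor, of which also on camera)
    "civet": (50, 18),
    "genet": (48, 26),
    "Meller's mongoose": (40, 22),
    "jackal sp.": (60, 42),
    "brown hyena": (45, 35),
}
for species, (spoor_events, on_camera) in detections.items():
    missed = 1 - on_camera / spoor_events
    print(f"{species}: camera traps missed {missed:.0%} of spoor-confirmed events")
```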