882 resultados para large scale data gathering


Relevância:

100.00% 100.00%

Publicador:

Resumo:

A methodology is presented for the development of a combined seasonal weather and crop productivity forecasting system. The first stage of the methodology is the determination of the spatial scale(s) on which the system could operate; this determination has been made for the case of groundnut production in India. Rainfall is a dominant climatic determinant of groundnut yield in India. The relationship between yield and rainfall has been explored using data from 1966 to 1995. On the all-India scale, seasonal rainfall explains 52% of the variance in yield. On the subdivisional scale, correlations vary between variance r(2) = 0.62 (significance level p < 10(-4)) and a negative correlation with r(2) = 0.1 (p = 0.13). The spatial structure of the relationship between rainfall and groundnut yield has been explored using empirical orthogonal function (EOF) analysis. A coherent, large-scale pattern emerges for both rainfall and yield. On the subdivisional scale (similar to 300 km), the first principal component (PC) of rainfall is correlated well with the first PC of yield (r(2) = 0.53, p < 10(-4)), demonstrating that the large-scale patterns picked out by the EOFs are related. The physical significance of this result is demonstrated. Use of larger averaging areas for the EOF analysis resulted in lower and (over time) less robust correlations. Because of this loss of detail when using larger spatial scales, the subdivisional scale is suggested as an upper limit on the spatial scale for the proposed forecasting system. Further, district-level EOFs of the yield data demonstrate the validity of upscaling these data to the subdivisional scale. Similar patterns have been produced using data on both of these scales, and the first PCs are very highly correlated (r(2) = 0.96). Hence, a working spatial scale has been identified, typical of that used in seasonal weather forecasting, that can form the basis of crop modeling work for the case of groundnut production in India. Last, the change in correlation between yield and seasonal rainfall during the study period has been examined using seasonal totals and monthly EOFs. A further link between yield and subseasonal variability is demonstrated via analysis of dynamical data.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

1.There is concern over the possibility of unwanted environmental change following transgene movement from genetically modified (GM) rapeseed Brassica napus to its wild and weedy relatives. 2. The aim of this research was to develop a remote sensing-assisted methodology to help quantify gene flow from crops to their wild relatives over wide areas. Emphasis was placed on locating sites of sympatry, where the frequency of gene flow is likely to be highest, and on measuring the size of rapeseed fields to allow spatially explicit modelling of wind-mediated pollen-dispersal patterns. 3. Remote sensing was used as a tool to locate rapeseed fields, and a variety of image-processing techniques was adopted to facilitate the compilation of a spatially explicit profile of sympatry between the crop and Brassica rapa. 4. Classified satellite images containing rapeseed fields were first used to infer the spatial relationship between donor rapeseed fields and recipient riverside B. rapa populations. Such images also have utility for improving the efficiency of ground surveys by identifying probable sites of sympatry. The same data were then also used for the calculation of mean field size. 5. This paper forms a companion paper to Wilkinson et al. (2003), in which these elements were combined to produce a spatially explicit profile of hybrid formation over the UK. The current paper demonstrates the value of remote sensing and image processing for large-scale studies of gene flow, and describes a generic method that could be applied to a variety of crops in many countries. 6.Synthesis and applications. The decision to approve or prevent the release of a GM cultivar is made at a national rather than regional level. It is highly desirable that data relating to the decision-making process are collected at the same scale, rather than relying on extrapolation from smaller experiments designed at the plot, field or even regional scale. It would be extremely difficult and labour intensive to attempt to carry out such large-scale investigations without the use of remote-sensing technology. This study used rapeseed in the UK as a model to demonstrate the value of remote sensing in assembling empirical information at a national level.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

This paper describes a method that employs Earth Observation (EO) data to calculate spatiotemporal estimates of soil heat flux, G, using a physically-based method (the Analytical Method). The method involves a harmonic analysis of land surface temperature (LST) data. It also requires an estimate of near-surface soil thermal inertia; this property depends on soil textural composition and varies as a function of soil moisture content. The EO data needed to drive the model equations, and the ground-based data required to provide verification of the method, were obtained over the Fakara domain within the African Monsoon Multidisciplinary Analysis (AMMA) program. LST estimates (3 km × 3 km, one image 15 min−1) were derived from MSG-SEVIRI data. Soil moisture estimates were obtained from ENVISAT-ASAR data, while estimates of leaf area index, LAI, (to calculate the effect of the canopy on G, largely due to radiation extinction) were obtained from SPOT-HRV images. The variation of these variables over the Fakara domain, and implications for values of G derived from them, were discussed. Results showed that this method provides reliable large-scale spatiotemporal estimates of G. Variations in G could largely be explained by the variability in the model input variables. Furthermore, it was shown that this method is relatively insensitive to model parameters related to the vegetation or soil texture. However, the strong sensitivity of thermal inertia to soil moisture content at low values of relative saturation (<0.2) means that in arid or semi-arid climates accurate estimates of surface soil moisture content are of utmost importance, if reliable estimates of G are to be obtained. This method has the potential to improve large-scale evaporation estimates, to aid land surface model prediction and to advance research that aims to explain failure in energy balance closure of meteorological field studies.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

We present a new approach that allows the determination of force-field parameters for the description of disordered macromolecular systems from experimental neutron diffraction data obtained over a large Q range. The procedure is based on a tight coupling between experimentally derived structure factors and computer modelling. We separate the molecular potential into non-interacting terms representing respectively bond stretching, angle bending and torsional rotation. The parameters for each of the potentials are extracted directly from experimental data through comparison of the experimental structure factor and those derived from atomistic level molecular models. The viability of these force fields is assessed by comparison of predicted large-scale features such as the characteristic ratio. The procedure is illustrated on molten poly(ethylene) and poly(tetrafluoroethylene).

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Recently major processor manufacturers have announced a dramatic shift in their paradigm to increase computing power over the coming years. Instead of focusing on faster clock speeds and more powerful single core CPUs, the trend clearly goes towards multi core systems. This will also result in a paradigm shift for the development of algorithms for computationally expensive tasks, such as data mining applications. Obviously, work on parallel algorithms is not new per se but concentrated efforts in the many application domains are still missing. Multi-core systems, but also clusters of workstations and even large-scale distributed computing infrastructures provide new opportunities and pose new challenges for the design of parallel and distributed algorithms. Since data mining and machine learning systems rely on high performance computing systems, research on the corresponding algorithms must be on the forefront of parallel algorithm research in order to keep pushing data mining and machine learning applications to be more powerful and, especially for the former, interactive. To bring together researchers and practitioners working in this exciting field, a workshop on parallel data mining was organized as part of PKDD/ECML 2006 (Berlin, Germany). The six contributions selected for the program describe various aspects of data mining and machine learning approaches featuring low to high degrees of parallelism: The first contribution focuses the classic problem of distributed association rule mining and focuses on communication efficiency to improve the state of the art. After this a parallelization technique for speeding up decision tree construction by means of thread-level parallelism for shared memory systems is presented. The next paper discusses the design of a parallel approach for dis- tributed memory systems of the frequent subgraphs mining problem. This approach is based on a hierarchical communication topology to solve issues related to multi-domain computational envi- ronments. The forth paper describes the combined use and the customization of software packages to facilitate a top down parallelism in the tuning of Support Vector Machines (SVM) and the next contribution presents an interesting idea concerning parallel training of Conditional Random Fields (CRFs) and motivates their use in labeling sequential data. The last contribution finally focuses on very efficient feature selection. It describes a parallel algorithm for feature selection from random subsets. Selecting the papers included in this volume would not have been possible without the help of an international Program Committee that has provided detailed reviews for each paper. We would like to also thank Matthew Otey who helped with publicity for the workshop.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

This paper reports the findings from two large scale national on-line surveys carried out in 2009 and 2010, which explored the state of history teaching in English secondary schools. Large variation in provision was identified within comprehensive schools in response to national policy decisions and initiatives. Using the data from the surveys and school level data that is publicly available, this study examines situated factors, particularly the nature of the school intake, the numbers of pupils with special educational needs and the socio-economic status of the area surrounding the school, and the impact these have on the provision of history education. The findings show that there is a growing divide between those students that have access to the ‘powerful knowledge’, provided by subjects like history, and those that do not.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Providing homeowners with real-time feedback on their electricity consumption through a dedicated display device has been shown to reduce consumption by approximately 6-10%. However, recent advances in smart grid technology have enabled larger sample sizes and more representative sample selection and recruitment methods for display trials. By analyzing these factors using data from current studies, this paper argues that a realistic, large-scale conservation effect from feedback is in the range of 3-5%. Subsequent analysis shows that providing real-time feedback may not be a cost effective strategy for reducing carbon emissions in Australia, but that it may enable additional benefits such as customer retention and peak-load shift.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Advances in hardware and software in the past decade allow to capture, record and process fast data streams at a large scale. The research area of data stream mining has emerged as a consequence from these advances in order to cope with the real time analysis of potentially large and changing data streams. Examples of data streams include Google searches, credit card transactions, telemetric data and data of continuous chemical production processes. In some cases the data can be processed in batches by traditional data mining approaches. However, in some applications it is required to analyse the data in real time as soon as it is being captured. Such cases are for example if the data stream is infinite, fast changing, or simply too large in size to be stored. One of the most important data mining techniques on data streams is classification. This involves training the classifier on the data stream in real time and adapting it to concept drifts. Most data stream classifiers are based on decision trees. However, it is well known in the data mining community that there is no single optimal algorithm. An algorithm may work well on one or several datasets but badly on others. This paper introduces eRules, a new rule based adaptive classifier for data streams, based on an evolving set of Rules. eRules induces a set of rules that is constantly evaluated and adapted to changes in the data stream by adding new and removing old rules. It is different from the more popular decision tree based classifiers as it tends to leave data instances rather unclassified than forcing a classification that could be wrong. The ongoing development of eRules aims to improve its accuracy further through dynamic parameter setting which will also address the problem of changing feature domain values.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

There is large uncertainty about the magnitude of warming and how rainfall patterns will change in response to any given scenario of future changes in atmospheric composition and land use. The models used for future climate projections were developed and calibrated using climate observations from the past 40 years. The geologic record of environmental responses to climate changes provides a unique opportunity to test model performance outside this limited climate range. Evaluation of model simulations against palaeodata shows that models reproduce the direction and large-scale patterns of past changes in climate, but tend to underestimate the magnitude of regional changes. As part of the effort to reduce model-related uncertainty and produce more reliable estimates of twenty-first century climate, the Palaeoclimate Modelling Intercomparison Project is systematically applying palaeoevaluation techniques to simulations of the past run with the models used to make future projections. This evaluation will provide assessments of model performance, including whether a model is sufficiently sensitive to changes in atmospheric composition, as well as providing estimates of the strength of biosphere and other feedbacks that could amplify the model response to these changes and modify the characteristics of climate variability.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Global communicationrequirements andloadimbalanceof someparalleldataminingalgorithms arethe major obstacles to exploitthe computational power of large-scale systems. This work investigates how non-uniform data distributions can be exploited to remove the global communication requirement and to reduce the communication costin parallel data mining algorithms and, in particular, in the k-means algorithm for cluster analysis. In the straightforward parallel formulation of the k-means algorithm, data and computation loads are uniformly distributed over the processing nodes. This approach has excellent load balancing characteristics that may suggest it could scale up to large and extreme-scale parallel computing systems. However, at each iteration step the algorithm requires a global reduction operationwhichhinders thescalabilityoftheapproach.Thisworkstudiesadifferentparallelformulation of the algorithm where the requirement of global communication is removed, while maintaining the same deterministic nature ofthe centralised algorithm. The proposed approach exploits a non-uniform data distribution which can be either found in real-world distributed applications or can be induced by means ofmulti-dimensional binary searchtrees. The approachcanalso be extended to accommodate an approximation error which allows a further reduction ofthe communication costs. The effectiveness of the exact and approximate methods has been tested in a parallel computing system with 64 processors and in simulations with 1024 processing element

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The large scale urban consumption of energy (LUCY) model simulates all components of anthropogenic heat flux (QF) from the global to individual city scale at 2.5 × 2.5 arc-minute resolution. This includes a database of different working patterns and public holidays, vehicle use and energy consumption in each country. The databases can be edited to include specific diurnal and seasonal vehicle and energy consumption patterns, local holidays and flows of people within a city. If better information about individual cities is available within this (open-source) database, then the accuracy of this model can only improve, to provide the community data from global-scale climate modelling or the individual city scale in the future. The results show that QF varied widely through the year, through the day, between countries and urban areas. An assessment of the heat emissions estimated revealed that they are reasonably close to those produced by a global model and a number of small-scale city models, so results from LUCY can be used with a degree of confidence. From LUCY, the global mean urban QF has a diurnal range of 0.7–3.6 W m−2, and is greater on weekdays than weekends. The heat release from building is the largest contributor (89–96%), to heat emissions globally. Differences between months are greatest in the middle of the day (up to 1 W m−2 at 1 pm). December to February, the coldest months in the Northern Hemisphere, have the highest heat emissions. July and August are at the higher end. The least QF is emitted in May. The highest individual grid cell heat fluxes in urban areas were located in New York (577), Paris (261.5), Tokyo (178), San Francisco (173.6), Vancouver (119) and London (106.7). Copyright © 2010 Royal Meteorological Society

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Background: Expression microarrays are increasingly used to obtain large scale transcriptomic information on a wide range of biological samples. Nevertheless, there is still much debate on the best ways to process data, to design experiments and analyse the output. Furthermore, many of the more sophisticated mathematical approaches to data analysis in the literature remain inaccessible to much of the biological research community. In this study we examine ways of extracting and analysing a large data set obtained using the Agilent long oligonucleotide transcriptomics platform, applied to a set of human macrophage and dendritic cell samples. Results: We describe and validate a series of data extraction, transformation and normalisation steps which are implemented via a new R function. Analysis of replicate normalised reference data demonstrate that intrarray variability is small (only around 2 of the mean log signal), while interarray variability from replicate array measurements has a standard deviation (SD) of around 0.5 log(2) units (6 of mean). The common practise of working with ratios of Cy5/Cy3 signal offers little further improvement in terms of reducing error. Comparison to expression data obtained using Arabidopsis samples demonstrates that the large number of genes in each sample showing a low level of transcription reflect the real complexity of the cellular transcriptome. Multidimensional scaling is used to show that the processed data identifies an underlying structure which reflect some of the key biological variables which define the data set. This structure is robust, allowing reliable comparison of samples collected over a number of years and collected by a variety of operators. Conclusions: This study outlines a robust and easily implemented pipeline for extracting, transforming normalising and visualising transcriptomic array data from Agilent expression platform. The analysis is used to obtain quantitative estimates of the SD arising from experimental (non biological) intra- and interarray variability, and for a lower threshold for determining whether an individual gene is expressed. The study provides a reliable basis for further more extensive studies of the systems biology of eukaryotic cells.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Before the advent of genome-wide association studies (GWASs), hundreds of candidate genes for obesity-susceptibility had been identified through a variety of approaches. We examined whether those obesity candidate genes are enriched for associations with body mass index (BMI) compared with non-candidate genes by using data from a large-scale GWAS. A thorough literature search identified 547 candidate genes for obesity-susceptibility based on evidence from animal studies, Mendelian syndromes, linkage studies, genetic association studies and expression studies. Genomic regions were defined to include the genes ±10 kb of flanking sequence around candidate and non-candidate genes. We used summary statistics publicly available from the discovery stage of the genome-wide meta-analysis for BMI performed by the genetic investigation of anthropometric traits consortium in 123 564 individuals. Hypergeometric, rank tail-strength and gene-set enrichment analysis tests were used to test for the enrichment of association in candidate compared with non-candidate genes. The hypergeometric test of enrichment was not significant at the 5% P-value quantile (P = 0.35), but was nominally significant at the 25% quantile (P = 0.015). The rank tail-strength and gene-set enrichment tests were nominally significant for the full set of genes and borderline significant for the subset without SNPs at P < 10(-7). Taken together, the observed evidence for enrichment suggests that the candidate gene approach retains some value. However, the degree of enrichment is small despite the extensive number of candidate genes and the large sample size. Studies that focus on candidate genes have only slightly increased chances of detecting associations, and are likely to miss many true effects in non-candidate genes, at least for obesity-related traits.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

A potential problem with Ensemble Kalman Filter is the implicit Gaussian assumption at analysis times. Here we explore the performance of a recently proposed fully nonlinear particle filter on a high-dimensional but simplified ocean model, in which the Gaussian assumption is not made. The model simulates the evolution of the vorticity field in time, described by the barotropic vorticity equation, in a highly nonlinear flow regime. While common knowledge is that particle filters are inefficient and need large numbers of model runs to avoid degeneracy, the newly developed particle filter needs only of the order of 10-100 particles on large scale problems. The crucial new ingredient is that the proposal density cannot only be used to ensure all particles end up in high-probability regions of state space as defined by the observations, but also to ensure that most of the particles have similar weights. Using identical twin experiments we found that the ensemble mean follows the truth reliably, and the difference from the truth is captured by the ensemble spread. A rank histogram is used to show that the truth run is indistinguishable from any of the particles, showing statistical consistency of the method.