96 results for Data-Intensive Science
Abstract:
In recent years, the area of data mining has been experiencing considerable demand for technologies that extract knowledge from large and complex data sources. There has been substantial commercial interest, as well as active research, aimed at developing new and improved approaches for extracting information, relationships, and patterns from large datasets. Artificial neural networks (NNs) are popular biologically inspired intelligent methodologies, whose classification, prediction, and pattern recognition capabilities have been used successfully in many areas, including science, engineering, medicine, business, banking, and telecommunications. This paper highlights, from a data mining perspective, the implementation of NNs using supervised and unsupervised learning for pattern recognition, classification, prediction, and cluster analysis, and focuses the discussion on their use in bioinformatics and financial data analysis tasks. © 2012 Wiley Periodicals, Inc.
Abstract:
The P-found protein folding and unfolding simulation repository is designed to allow scientists to perform data mining and other analyses across large, distributed simulation data sets. There are two storage components in P-found: a primary repository of simulation data that is used to populate the second component, and a data warehouse that contains important molecular properties. These properties may be used for data mining studies. Here we demonstrate how grid technologies can support multiple, distributed P-found installations. In particular, we look at two aspects: firstly, how grid data management technologies can be used to access the distributed data warehouses; and secondly, how the grid can be used to transfer analysis programs to the primary repositories — this is an important and challenging aspect of P-found, due to the large data volumes involved and the desire of scientists to maintain control of their own data. The grid technologies we are developing with the P-found system will allow new large data sets of protein folding simulations to be accessed and analysed in novel ways, with significant potential for enabling scientific discovery.
Abstract:
We describe ncWMS, an implementation of the Open Geospatial Consortium’s Web Map Service (WMS) specification for multidimensional gridded environmental data. ncWMS can read data in a large number of common scientific data formats – notably the NetCDF format with the Climate and Forecast conventions – then efficiently generate map imagery in thousands of different coordinate reference systems. It is designed to require minimal configuration from the system administrator and, when used in conjunction with a suitable client tool, provides end users with an interactive means for visualizing data without the need to download large files or interpret complex metadata. It is also used as a “bridging” tool providing interoperability between the environmental science community and users of geographic information systems. ncWMS implements a number of extensions to the WMS standard in order to fulfil some common scientific requirements, including the ability to generate plots representing timeseries and vertical sections. We discuss these extensions and their impact upon present and future interoperability. We discuss the conceptual mapping between the WMS data model and the data models used by gridded data formats, highlighting areas in which the mapping is incomplete or ambiguous. We discuss the architecture of the system and particular technical innovations of note, including the algorithms used for fast data reading and image generation. ncWMS has been widely adopted within the environmental data community and we discuss some of the ways in which the software is integrated within data infrastructures and portals.
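The map requests that ncWMS answers follow the standard WMS GetMap pattern. The sketch below builds such a request in Python; the server URL and layer name are hypothetical placeholders, and the TIME and ELEVATION parameters illustrate the extra dimensions that multidimensional gridded data require.

```python
# A minimal sketch of a WMS 1.3.0 GetMap request such as an ncWMS server might
# answer; the endpoint and layer name below are hypothetical placeholders.
from urllib.parse import urlencode

base_url = "https://example.org/ncWMS/wms"  # hypothetical endpoint
params = {
    "SERVICE": "WMS",
    "VERSION": "1.3.0",
    "REQUEST": "GetMap",
    "LAYERS": "ocean/sea_surface_temperature",  # hypothetical layer
    "CRS": "EPSG:4326",                 # one of many supported reference systems
    "BBOX": "-90,-180,90,180",          # lat/lon axis order for EPSG:4326 in WMS 1.3.0
    "WIDTH": "1024",
    "HEIGHT": "512",
    "FORMAT": "image/png",
    "TIME": "2010-01-01T00:00:00Z",     # extra dimension for time-varying grids
    "ELEVATION": "0",                   # extra dimension for vertical levels
}
print(f"{base_url}?{urlencode(params)}")
```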
Abstract:
This Editorial presents the focus, scope and policies of the inaugural issue of Nature Conservation, a new open access, peer-reviewed journal bridging natural sciences, social sciences and hands-on applications in conservation management. The journal covers all aspects of nature conservation and aims particularly at facilitating better interaction between scientists and practitioners. The journal will impose no restrictions on manuscript size or the use of colour. We will use an XML-based editorial workflow and several cutting-edge innovations in publishing and information dissemination. These include semantic mark-up of, and enhancements to, published text and data, as well as extensive cross-linking within the journal and to external sources. We believe the journal will make an important contribution to better linking science and practice, offering rapid, peer-reviewed and flexible publication for authors and unrestricted access to content.
Abstract:
A programmable data acquisition system that allows novel use of meteorological radiosondes for atmospheric science measurements is described. In its basic form it supports four analogue inputs at 16-bit resolution, and up to two further inputs at lower resolution that can instead be configured for digital instruments. It also provides multiple instrument power supplies (+8 V, +16 V, +5 V and -8 V) from the 9 V radiosonde battery. During a balloon flight encountering air temperatures from +17°C to -66°C, the worst-case voltage drift in the 5 V unipolar digitisation circuitry was 20 mV. The system opens up a new range of low-cost atmospheric research measurements by utilising radiosondes routinely launched internationally for weather forecasting purposes. No additional receiving equipment is required. Comparisons between the specially instrumented and standard meteorological radiosondes show a negligible effect of the additional instrumentation on the standard meteorological data.
Abstract:
Global communication requirements and load imbalance of some parallel data mining algorithms are the major obstacles to exploiting the computational power of large-scale systems. This work investigates how non-uniform data distributions can be exploited to remove the global communication requirement and to reduce the communication cost in parallel data mining algorithms and, in particular, in the k-means algorithm for cluster analysis. In the straightforward parallel formulation of the k-means algorithm, data and computation loads are uniformly distributed over the processing nodes. This approach has excellent load balancing characteristics that may suggest it could scale up to large and extreme-scale parallel computing systems. However, at each iteration step the algorithm requires a global reduction operation, which hinders the scalability of the approach. This work studies a different parallel formulation of the algorithm in which the requirement for global communication is removed, while maintaining the same deterministic nature as the centralised algorithm. The proposed approach exploits a non-uniform data distribution which can either be found in real-world distributed applications or be induced by means of multi-dimensional binary search trees. The approach can also be extended to accommodate an approximation error, which allows a further reduction of the communication costs. The effectiveness of the exact and approximate methods has been tested in a parallel computing system with 64 processors and in simulations with 1024 processing elements.
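The scalability argument above hinges on the per-iteration global reduction in the conventional parallel k-means. The sketch below (using mpi4py and NumPy, which are our assumptions rather than the paper's implementation) shows one iteration of that conventional formulation; the allreduce calls mark the global communication step that the proposed non-uniform, tree-based formulation removes.

```python
# One iteration of the conventional parallel k-means formulation: data are
# spread uniformly over processes and every iteration ends in a global
# reduction of partial sums. Illustrative sketch only.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rng = np.random.default_rng(comm.Get_rank())

k, dim = 8, 3
local_points = rng.normal(size=(10_000, dim))  # this process's share of the data
centroids = comm.bcast(
    rng.normal(size=(k, dim)) if comm.Get_rank() == 0 else None, root=0)

# Local assignment step: nearest centroid for each local point.
dists = np.linalg.norm(local_points[:, None, :] - centroids[None, :, :], axis=2)
labels = dists.argmin(axis=1)

# Local partial sums and counts per cluster.
local_sums = np.zeros((k, dim))
local_counts = np.zeros(k)
for j in range(k):
    mask = labels == j
    local_sums[j] = local_points[mask].sum(axis=0)
    local_counts[j] = mask.sum()

# The global reduction that limits scalability at extreme scale.
global_sums = comm.allreduce(local_sums, op=MPI.SUM)
global_counts = comm.allreduce(local_counts, op=MPI.SUM)
centroids = global_sums / np.maximum(global_counts, 1)[:, None]
```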
Abstract:
This chapter introduces the latest practices and technologies in the interactive interpretation of environmental data. With environmental data becoming ever larger, more diverse and more complex, there is a need for a new generation of tools that provides new capabilities over and above those of the standard workhorses of science. These new tools aid the scientist in discovering interesting new features (and also problems) in large datasets by allowing the data to be explored interactively using simple, intuitive graphical tools. In this way, new discoveries are made that are commonly missed by automated batch data processing. This chapter discusses the characteristics of environmental science data, common current practice in data analysis and the supporting tools and infrastructure. New approaches are introduced and illustrated from the points of view of both the end user and the underlying technology. We conclude by speculating as to future developments in the field and what must be achieved to fulfil this vision.
Abstract:
The Advanced Along-Track Scanning Radiometer (AATSR) was launched on Envisat in March 2002. The AATSR instrument is designed to retrieve precise and accurate global sea surface temperature (SST) that, combined with the large data set collected by its predecessors, ATSR and ATSR-2, will provide a long-term record of SST data spanning more than 15 years. This record can be used for independent monitoring and detection of climate change. The AATSR validation programme has successfully completed its initial phase. The programme involves validation of the AATSR-derived SST values using in situ radiometers, in situ buoys and global SST fields from other data sets. The results of the initial programme presented here demonstrate that the AATSR instrument is currently close to meeting its scientific objective of determining global SST to an accuracy of 0.3 K (one sigma). For night-time data, the analysis gives warm biases of between +0.04 K (0.28 K) for buoys and +0.06 K (0.20 K) for radiometers, with slightly higher errors observed for daytime data, which shows warm biases of between +0.02 K (0.39 K) for buoys and +0.11 K (0.33 K) for radiometers. These results show that the ATSR series of instruments continues to be the world leader in delivering accurate space-based observations of SST, which is a key climate parameter.
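The validation figures quoted above are mean differences (biases) and one-sigma scatters of matched satellite-minus-in-situ SST values. A minimal sketch of that calculation, using hypothetical match-up values rather than AATSR data:

```python
# Bias / one-sigma statistics of satellite-minus-in-situ SST match-ups.
# The arrays are hypothetical illustrations, not AATSR validation results.
import numpy as np

aatsr_sst   = np.array([288.41, 290.12, 285.90, 287.33])  # satellite SST (K)
in_situ_sst = np.array([288.35, 290.05, 285.95, 287.20])   # buoy/radiometer SST (K)

diff  = aatsr_sst - in_situ_sst
bias  = diff.mean()          # positive value indicates a warm bias
sigma = diff.std(ddof=1)     # one-sigma scatter of the match-ups
print(f"bias = {bias:+.2f} K, sigma = {sigma:.2f} K")
```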
Abstract:
Traditionally, the formal scientific output in most fields of natural science has been limited to peer-reviewed academic journal publications, with less attention paid to the chain of intermediate data results and their associated metadata, including provenance. In effect, this has constrained the representation and verification of data provenance to the confines of the related publications. Detailed knowledge of a dataset’s provenance is essential to establish the pedigree of the data for its effective re-use, and to avoid redundant re-enactment of the experiment or computation involved. Determining the authenticity and quality of open-access data is increasingly important, especially considering the growing volumes of datasets appearing in the public domain. To address these issues, we present an approach that combines the Digital Object Identifier (DOI) – a widely adopted citation technique – with existing, widely adopted climate science data standards to formally publish detailed provenance of a climate research dataset as an associated scientific workflow. This is integrated with linked-data-compliant data re-use standards (e.g. OAI-ORE) to enable a seamless link between a publication and the complete trail of lineage of the corresponding dataset, including the dataset itself.
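One way to express such a link in linked-data terms is an OAI-ORE aggregation that groups the DOI-identified dataset with its workflow description. The sketch below uses rdflib in Python; the DOI, aggregation and workflow URIs are hypothetical, and the vocabulary choices are illustrative rather than those of the paper.

```python
# Hypothetical OAI-ORE aggregation linking a DOI-identified dataset to its
# provenance workflow; all URIs below are placeholders for illustration only.
from rdflib import Graph, Namespace, URIRef, RDF

ORE = Namespace("http://www.openarchives.org/ore/terms/")
DCTERMS = Namespace("http://purl.org/dc/terms/")

g = Graph()
g.bind("ore", ORE)
g.bind("dcterms", DCTERMS)

dataset     = URIRef("https://doi.org/10.5555/example-climate-dataset")   # hypothetical DOI
workflow    = URIRef("https://example.org/provenance/workflow-123")       # hypothetical
aggregation = URIRef("https://example.org/aggregations/dataset-123")      # hypothetical

g.add((aggregation, RDF.type, ORE.Aggregation))
g.add((aggregation, ORE.aggregates, dataset))
g.add((aggregation, ORE.aggregates, workflow))
g.add((dataset, DCTERMS.provenance, workflow))

print(g.serialize(format="turtle"))
```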
Abstract:
We describe the CHARMe project, which aims to link climate datasets with publications, user feedback and other items of "commentary metadata". The system will help users learn from previous community experience and select datasets that best suit their needs, as well as providing direct traceability between conclusions and the data that supported them. The project applies the principles of Linked Data and adopts the Open Annotation standard to record and publish commentary information. CHARMe contributes to the emerging landscape of "climate services", which will provide climate data and information to influence policy and decision-making. Although the project focuses on climate science, the technologies and concepts are very general and could be applied to other fields.
Abstract:
Tests of the new Rossby wave theories that have been developed over the past decade to account for discrepancies between theoretical wave speeds and those observed by satellite altimeters have focused primarily on the surface signature of such waves. It appears, however, that the surface signature of the waves acts only as a rather weak constraint, and that information on the vertical structure of the waves is required to better discriminate between competing theories. Due to the lack of 3-D observations, this paper uses high-resolution model data to construct realistic vertical structures of Rossby waves and compares these to structures predicted by theory. The meridional velocity of a section at 24° S in the Atlantic Ocean is pre-processed using the Radon transform to select the dominant westward signal. Normalized profiles are then constructed using three complementary methods based respectively on: (1) averaging vertical profiles of velocity, (2) diagnosing the amplitude of the Radon transform of the westward propagating signal at different depths, and (3) EOF analysis. These profiles are compared to profiles calculated using four different Rossby wave theories: standard linear theory (SLT), SLT plus mean flow, SLT plus topographic effects, and theory including both mean flow and topographic effects. Our results support the classical theoretical assumption that westward propagating signals have a well-defined vertical modal structure associated with a phase speed independent of depth, in contrast with the conclusions of a recent study using the same model but for different locations in the North Atlantic. The model structures are in general surface intensified, with a sign reversal at depth in some regions, notably occurring at shallower depths in the East Atlantic. SLT provides a good fit to the model structures in the top 300 m, but grossly overestimates the sign reversal at depth. The addition of mean flow slightly improves the latter issue, but the resulting structure is too surface intensified. SLT plus topography rectifies the overestimation of the sign reversal, but overestimates the amplitude of the structure for much of the layer above the sign reversal. Combining the effects of mean flow and topography provides the best fit to the mean model profiles, although small errors at the surface and mid-depths are carried over from the individual effects of mean flow and topography respectively. Across the section the best-fitting theory varies between SLT plus topography and the theory including both mean flow and topography, with, in general, SLT plus topography performing better in the east where the sign reversal is less pronounced. None of the theories could accurately reproduce the deeper sign reversals in the west. All theories performed badly at the boundaries. The generalization of this method to other latitudes, oceans, models and baroclinic modes would provide greater insight into the variability in the ocean, while better observational data would allow verification of the model findings.
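For orientation, the "standard linear theory" vertical structures referred to above are conventionally the eigenfunctions of the flat-bottom, resting-ocean vertical mode problem. The LaTeX sketch below states that problem in common notation (our notation, not necessarily the paper's):

```latex
% Vertical mode problem of standard linear theory (flat bottom, no mean flow),
% as conventionally written; notation is illustrative, not taken from the paper.
\begin{equation}
  \frac{d}{dz}\!\left(\frac{f_0^{2}}{N^{2}(z)}\,\frac{d\phi_m}{dz}\right)
  + \frac{1}{R_m^{2}}\,\phi_m = 0,
  \qquad
  \left.\frac{d\phi_m}{dz}\right|_{z=0} = \left.\frac{d\phi_m}{dz}\right|_{z=-H} = 0,
\end{equation}
% where N(z) is the buoyancy frequency, f_0 the Coriolis parameter, H the ocean
% depth, \phi_m(z) the vertical structure of mode m and R_m its deformation
% radius. Long baroclinic Rossby waves of mode m then propagate westward at
\begin{equation}
  c_m = -\beta R_m^{2},
\end{equation}
% a phase speed independent of depth, which is the assumption tested against
% the model sections in the abstract above.
```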
Abstract:
JASMIN is a super-data-cluster designed to provide a high-performance, high-volume data analysis environment for the UK environmental science community. Thus far JASMIN has been used primarily by the atmospheric science and earth observation communities, both to support their direct scientific workflows and to support the curation of data products in the STFC Centre for Environmental Data Archival (CEDA). The initial JASMIN configuration and first experiences are reported here. Useful improvements in scientific workflow are presented. It is clear from the explosive growth in stored data and use that there was pent-up demand for a suitable big-data analysis environment. This demand is not yet satisfied, in part because JASMIN does not yet have enough compute, the storage is fully allocated, and not all software needs are met. Plans to address these constraints are introduced.
Abstract:
There is a growing need for massive computational resources for the analysis of new astronomical datasets. To tackle this problem, we present here our first steps towards marrying two new and emerging technologies: the Virtual Observatory (e.g., AstroGrid) and the computational grid (e.g., TeraGrid, COSMOS, etc.). We discuss the construction of VOTechBroker, which is a modular software tool designed to abstract the tasks of submission and management of a large number of computational jobs to a distributed computer system. The broker will also interact with the AstroGrid workflow and MySpace environments. We discuss our planned usages of the VOTechBroker in computing a huge number of n-point correlation functions from the SDSS data and massive model-fitting of millions of CMBfast models to WMAP data. We also discuss other applications, including the determination of the XMM Cluster Survey selection function and the construction of new WMAP maps.
Abstract:
We qualitatively describe the condition of communally managed rangelands in the Transkei, South Africa, using GIS and high-resolution near-infrared imagery. Using livestock census data from 28 magisterial districts in the Transkei, we explored the trends in livestock biomass from 1923 to 1998. The area had been subjected to intensive herbivory by domestic livestock during that period, and the high livestock biomass had been blamed for the perceived degradation or ‘overgrazing’ of the region. Our assessment used the concept of rain-use efficiency (RUE) (kg dry matter ha⁻¹ mm⁻¹) to determine whether there is evidence of change in the efficiency with which the system produces domestic livestock. We calculated RUE from annual livestock numbers and the mean annual rainfall for each district. We found no evidence of a decline in rain-use efficiency between the two assessment periods (1923–1944, 1945–1998). There was evidence of a shift in the ratio of sheep to goats between 1923 and 1998, with goat numbers increasing (greater than twofold) relative to sheep in eight districts. This trend may be associated with changes in the structure of vegetation. We conclude that this region is not showing evidence of system run-down that affects domestic livestock production.
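A minimal sketch of the rain-use efficiency index used above, with the district figures and the head-to-biomass conversion treated as hypothetical illustrations:

```python
# Rain-use efficiency (RUE): livestock biomass per hectare divided by mean
# annual rainfall (kg ha^-1 mm^-1). All figures below are hypothetical,
# including the conversion from head of livestock to biomass.
def rain_use_efficiency(livestock_biomass_kg, district_area_ha, mean_annual_rainfall_mm):
    """Return RUE in kg ha^-1 mm^-1 for one district and year."""
    return (livestock_biomass_kg / district_area_ha) / mean_annual_rainfall_mm

# Hypothetical district: 40,000 cattle at ~450 kg each on 150,000 ha with 650 mm rainfall.
rue = rain_use_efficiency(40_000 * 450, 150_000, 650)
print(f"RUE = {rue:.3f} kg ha^-1 mm^-1")
```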
Abstract:
We present a novel approach to assessing the attentiveness of professional forecasters to news about the macroeconomy. We find evidence that professional forecasters, taken as a group, do not always update their estimates of the current state of the economy to reflect the latest releases of revised estimates of key data.