939 resultados para Scientific Data
Resumo:
Analyzing statistical dependencies is a fundamental problem in all empirical science. Dependencies help us understand causes and effects, create new scientific theories, and invent cures to problems. Nowadays, large amounts of data is available, but efficient computational tools for analyzing the data are missing. In this research, we develop efficient algorithms for a commonly occurring search problem - searching for the statistically most significant dependency rules in binary data. We consider dependency rules of the form X->A or X->not A, where X is a set of positive-valued attributes and A is a single attribute. Such rules describe which factors either increase or decrease the probability of the consequent A. A classical example are genetic and environmental factors, which can either cause or prevent a disease. The emphasis in this research is that the discovered dependencies should be genuine - i.e. they should also hold in future data. This is an important distinction from the traditional association rules, which - in spite of their name and a similar appearance to dependency rules - do not necessarily represent statistical dependencies at all or represent only spurious connections, which occur by chance. Therefore, the principal objective is to search for the rules with statistical significance measures. Another important objective is to search for only non-redundant rules, which express the real causes of dependence, without any occasional extra factors. The extra factors do not add any new information on the dependence, but can only blur it and make it less accurate in future data. The problem is computationally very demanding, because the number of all possible rules increases exponentially with the number of attributes. In addition, neither the statistical dependency nor the statistical significance are monotonic properties, which means that the traditional pruning techniques do not work. As a solution, we first derive the mathematical basis for pruning the search space with any well-behaving statistical significance measures. The mathematical theory is complemented by a new algorithmic invention, which enables an efficient search without any heuristic restrictions. The resulting algorithm can be used to search for both positive and negative dependencies with any commonly used statistical measures, like Fisher's exact test, the chi-squared measure, mutual information, and z scores. According to our experiments, the algorithm is well-scalable, especially with Fisher's exact test. It can easily handle even the densest data sets with 10000-20000 attributes. Still, the results are globally optimal, which is a remarkable improvement over the existing solutions. In practice, this means that the user does not have to worry whether the dependencies hold in future data or if the data still contains better, but undiscovered dependencies.
Resumo:
Understanding the functioning of a neural system in terms of its underlying circuitry is an important problem in neuroscience. Recent d evelopments in electrophysiology and imaging allow one to simultaneously record activities of hundreds of neurons. Inferring the underlying neuronal connectivity patterns from such multi-neuronal spike train data streams is a challenging statistical and computational problem. This task involves finding significant temporal patterns from vast amounts of symbolic time series data. In this paper we show that the frequent episode mining methods from the field of temporal data mining can be very useful in this context. In the frequent episode discovery framework, the data is viewed as a sequence of events, each of which is characterized by an event type and its time of occurrence and episodes are certain types of temporal patterns in such data. Here we show that, using the set of discovered frequent episodes from multi-neuronal data, one can infer different types of connectivity patterns in the neural system that generated it. For this purpose, we introduce the notion of mining for frequent episodes under certain temporal constraints; the structure of these temporal constraints is motivated by the application. We present algorithms for discovering serial and parallel episodes under these temporal constraints. Through extensive simulation studies we demonstrate that these methods are useful for unearthing patterns of neuronal network connectivity.
Resumo:
The Earth's ecosystems are protected from the dangerous part of the solar ultraviolet (UV) radiation by stratospheric ozone, which absorbs most of the harmful UV wavelengths. Severe depletion of stratospheric ozone has been observed in the Antarctic region, and to a lesser extent in the Arctic and midlatitudes. Concern about the effects of increasing UV radiation on human beings and the natural environment has led to ground based monitoring of UV radiation. In order to achieve high-quality UV time series for scientific analyses, proper quality control (QC) and quality assurance (QA) procedures have to be followed. In this work, practices of QC and QA are developed for Brewer spectroradiometers and NILU-UV multifilter radiometers, which measure in the Arctic and Antarctic regions, respectively. These practices are applicable to other UV instruments as well. The spectral features and the effect of different factors affecting UV radiation were studied for the spectral UV time series at Sodankylä. The QA of the Finnish Meteorological Institute's (FMI) two Brewer spectroradiometers included daily maintenance, laboratory characterizations, the calculation of long-term spectral responsivity, data processing and quality assessment. New methods for the cosine correction, the temperature correction and the calculation of long-term changes of spectral responsivity were developed. Reconstructed UV irradiances were used as a QA tool for spectroradiometer data. The actual cosine correction factor was found to vary between 1.08-1.12 and 1.08-1.13. The temperature characterization showed a linear temperature dependence between the instrument's internal temperature and the photon counts per cycle. Both Brewers have participated in international spectroradiometer comparisons and have shown good stability. The differences between the Brewers and the portable reference spectroradiometer QASUME have been within 5% during 2002-2010. The features of the spectral UV radiation time series at Sodankylä were analysed for the time period 1990-2001. No statistically significant long-term changes in UV irradiances were found, and the results were strongly dependent on the time period studied. Ozone was the dominant factor affecting UV radiation during the springtime, whereas clouds played a more important role during the summertime. During this work, the Antarctic NILU-UV multifilter radiometer network was established by the Instituto Nacional de Meteorogía (INM) as a joint Spanish-Argentinian-Finnish cooperation project. As part of this work, the QC/QA practices of the network were developed. They included training of the operators, daily maintenance, regular lamp tests and solar comparisons with the travelling reference instrument. Drifts of up to 35% in the sensitivity of the channels of the NILU-UV multifilter radiometers were found during the first four years of operation. This work emphasized the importance of proper QC/QA, including regular lamp tests, for the multifilter radiometers also. The effect of the drifts were corrected by a method scaling the site NILU-UV channels to those of the travelling reference NILU-UV. After correction, the mean ratios of erythemally-weighted UV dose rates measured during solar comparisons between the reference NILU-UV and the site NILU-UVs were 1.007±0.011 and 1.012±0.012 for Ushuaia and Marambio, respectively, when the solar zenith angle varied up to 80°. Solar comparisons between the NILU-UVs and spectroradiometers showed a ±5% difference near local noon time, which can be seen as proof of successful QC/QA procedures and transfer of irradiance scales. This work also showed that UV measurements made in the Arctic and Antarctic can be comparable with each other.
Resumo:
A miniature furnace suitable for routine collection of x-ray data up to 1000°C from single crystals on the Hilger and Watts linear diffractometer, without restricting the normally allowed region of reciprocal space on the diffractometer, is described. The crystal is heated primarily by radiation from a surrounding current-heated, stationary platinum coil wound on a silica bracket. The coil is split at its middle to provide a 4 mm gap for crystal mounting and x-irradiation. The crystal, mounted on a standard goniometer head, can be rotated and centred freely, as in the room temperature case. There is no need for any radiation shields or water-cooling arrangement. Investigations up to 1500°C are possible with slight modifications of the furnace.
Resumo:
Introduction. We estimate the total yearly volume of peer-reviewed scientific journal articles published world-wide as well as the share of these articles available openly on the Web either directly or as copies in e-print repositories. Method. We rely on data from two commercial databases (ISI and Ulrich's Periodicals Directory) supplemented by sampling and Google searches. Analysis. A central issue is the finding that ISI-indexed journals publish far more articles per year (111) than non ISI-indexed journals (26), which means that the total figure we obtain is much lower than many earlier estimates. Our method of analysing the number of repository copies (green open access) differs from several earlier studies which have studied the number of copies in identified repositories, since we start from a random sample of articles and then test if copies can be found by a Web search engine. Results. We estimate that in 2006 the total number of articles published was approximately 1,350,000. Of this number 4.6% became immediately openly available and an additional 3.5% after an embargo period of, typically, one year. Furthermore, usable copies of 11.3% could be found in subject-specific or institutional repositories or on the home pages of the authors. Conclusions. We believe our results are the most reliable so far published and, therefore, should be useful in the on-going debate about Open Access among both academics and science policy makers. The method is replicable and also lends itself to longitudinal studies in the future.
Resumo:
In this work we explore the application of wireless sensor technologies for the benefit of small and marginal farmers in semi-arid regions. The focus in this paper is to discuss the merits and demerits of data gathering & relay paradigms that collect localized data over a wide area. The data gathered includes soil moisture, temperature, pressure, rain data and humidity. The challenge to technology intervention comes mainly due to two reasons: (a) Farmers in general are interested in crop yield specific to their piece of land. This is because soil texture can vary rapidly over small regions. (b) Due to a high run-off, the soil moisture retention can vary from region to region depending on the topology of the farm. Both these reasons alter the needs drastically. Additionally, small and marginal farms can be sandwiched between rich farm lands. The village has very little access to grid power. Power cuts can extend up to 12 hours in a day and upto 3 or 4 days during some months in the year. In this paper, we discuss 3 technology paradigms for data relaying. These include Wi-Fi (Wireless Fidelity), GPRS (General Packet Radio Service) and DTN (Delay and Disruption Tolerant Network) technologies. We detail the merits and demerits of each of these solutions and provide our final recommendations. The project site is a village called Chennakesavapura in the state of Karnataka, India.
Resumo:
We propose an iterative data reconstruction technique specifically designed for multi-dimensional multi-color fluorescence imaging. Markov random field is employed (for modeling the multi-color image field) in conjunction with the classical maximum likelihood method. It is noted that, ill-posed nature of the inverse problem associated with multi-color fluorescence imaging forces iterative data reconstruction. Reconstruction of three-dimensional (3D) two-color images (obtained from nanobeads and cultured cell samples) show significant reduction in the background noise (improved signal-to-noise ratio) with an impressive overall improvement in the spatial resolution (approximate to 250 nm) of the imaging system. Proposed data reconstruction technique may find immediate application in 3D in vivo and in vitro multi-color fluorescence imaging of biological specimens. (C) 2012 American Institute of Physics. http://dx.doi.org/10.1063/1.4769058]
Resumo:
It is increasingly being recognized that resting state brain connectivity derived from functional magnetic resonance imaging (fMRI) data is an important marker of brain function both in healthy and clinical populations. Though linear correlation has been extensively used to characterize brain connectivity, it is limited to detecting first order dependencies. In this study, we propose a framework where in phase synchronization (PS) between brain regions is characterized using a new metric ``correlation between probabilities of recurrence'' (CPR) and subsequent graph-theoretic analysis of the ensuing networks. We applied this method to resting state fMRI data obtained from human subjects with and without administration of propofol anesthetic. Our results showed decreased PS during anesthesia and a biologically more plausible community structure using CPR rather than linear correlation. We conclude that CPR provides an attractive nonparametric method for modeling interactions in brain networks as compared to standard correlation for obtaining physiologically meaningful insights about brain function.
Missing (in-situ) snow cover data hampers climate change and runoff studies in the Greater Himalayas
Resumo:
The Himalayas are presently holding the largest ice masses outside the polar regions and thus (temporarily) store important freshwater resources. In contrast to the contemplation of glaciers, the role of runoff from snow cover has received comparably little attention in the past, although (i) its contribution is thought to be at least equally or even more important than that of ice melt in many Himalayan catchments and (ii) climate change is expected to have widespread and significant consequences on snowmelt runoff. Here, we show that change assessment of snowmelt runoff and its timing is not as straightforward as often postulated, mainly as larger partial pressure of H2O, CO2, CH4, and other greenhouse gases might increase net long-wave input for snowmelt quite significantly in a future atmosphere. In addition, changes in the short-wave energy balance such as the pollution of the snow cover through black carbon or the sensible or latent heat contribution to snowmelt are likely to alter future snowmelt and runoff characteristics as well. For the assessment of snow cover extent and depletion, but also for its monitoring over the extremely large areas of the Himalayas, remote sensing has been used in the past and is likely to become even more important in the future. However, for the calibration and validation of remotely-sensed data, and even-more so in light of possible changes in snow-cover energy balance, we strongly call for more in-situ measurements across the Himalayas, in particular for daily data on new snow and snow cover water equivalent, or the respective energy balance components. Moreover, data should be made accessible to the scientific community, so that the latter can more accurately estimate climate change impacts on Himalayan snow cover and possible consequences thereof on runoff. (C) 2013 Elsevier B.V. All rights reserved.
Resumo:
Multi-GPU machines are being increasingly used in high-performance computing. Each GPU in such a machine has its own memory and does not share the address space either with the host CPU or other GPUs. Hence, applications utilizing multiple GPUs have to manually allocate and manage data on each GPU. Existing works that propose to automate data allocations for GPUs have limitations and inefficiencies in terms of allocation sizes, exploiting reuse, transfer costs, and scalability. We propose a scalable and fully automatic data allocation and buffer management scheme for affine loop nests on multi-GPU machines. We call it the Bounding-Box-based Memory Manager (BBMM). BBMM can perform at runtime, during standard set operations like union, intersection, and difference, finding subset and superset relations on hyperrectangular regions of array data (bounding boxes). It uses these operations along with some compiler assistance to identify, allocate, and manage data required by applications in terms of disjoint bounding boxes. This allows it to (1) allocate exactly or nearly as much data as is required by computations running on each GPU, (2) efficiently track buffer allocations and hence maximize data reuse across tiles and minimize data transfer overhead, and (3) and as a result, maximize utilization of the combined memory on multi-GPU machines. BBMM can work with any choice of parallelizing transformations, computation placement, and scheduling schemes, whether static or dynamic. Experiments run on a four-GPU machine with various scientific programs showed that BBMM reduces data allocations on each GPU by up to 75% compared to current allocation schemes, yields performance of at least 88% of manually written code, and allows excellent weak scaling.
Resumo:
Table of Contents [pdf, 0.13 Mb] Section I - Practical Workshop Description [pdf, 21.22 Mb] Section II - Site Description and Oceanography [pdf, 0.40 Mb] Section III - Extended Abstracts Contaminant Concentrations in Sediment and Biota [pdf, 1.36 Mb] Biochemical and Physiological Studies [pdf, 0.77 Mb] Community Studies [pdf, 1.01 Mb] Harmful Algae Studies [pdf, 0.67 Mb] Section IV - Comprehensive Data Tables Site Locations [pdf, 0.10 Mb] Sediment Chemistry [pdf, 0.54 Mb] Tissue Chemistry – Fish [pdf, 1.20 Mb] Tissue Chemistry – Bivalves [pdf, 0.49 Mb] Lipid and Fatty Acids in Mytilus trossulus [pdf, 0.15 Mb] Biochemical, Physiological and Histopathological Parameters [pdf, 1.20 Mb] Biological Community Data – Fish and Mussels [pdf, 0.87 Mb] Biological Community Data – Macrobenthos [pdf, 0.85 Mb] Harmful Algae [pdf, 0.07 Mb] (Document contains 205 pages)
Resumo:
The mapping and geospatial analysis of benthic environments are multidisciplinary tasks that have become more accessible in recent years because of advances in technology and cost reductions in survey systems. The complex relationships that exist among physical, biological, and chemical seafloor components require advanced, integrated analysis techniques to enable scientists and others to visualize patterns and, in so doing, allow inferences to be made about benthic processes. Effective mapping, analysis, and visualization of marine habitats are particularly important because the subtidal seafloor environment is not readily viewed directly by eye. Research in benthic environments relies heavily, therefore, on remote sensing techniques to collect effective data. Because many benthic scientists are not mapping professionals, they may not adequately consider the links between data collection, data analysis, and data visualization. Projects often start with clear goals, but may be hampered by the technical details and skills required for maintaining data quality through the entire process from collection through analysis and presentation. The lack of technical understanding of the entire data handling process can represent a significant impediment to success. While many benthic mapping efforts have detailed their methodology as it relates to the overall scientific goals of a project, only a few published papers and reports focus on the analysis and visualization components (Paton et al. 1997, Weihe et al. 1999, Basu and Saxena 1999, Bruce et al. 1997). In particular, the benthic mapping literature often briefly describes data collection and analysis methods, but fails to provide sufficiently detailed explanation of particular analysis techniques or display methodologies so that others can employ them. In general, such techniques are in large part guided by the data acquisition methods, which can include both aerial and water-based remote sensing methods to map the seafloor without physical disturbance, as well as physical sampling methodologies (e.g., grab or core sampling). The terms benthic mapping and benthic habitat mapping are often used synonymously to describe seafloor mapping conducted for the purpose of benthic habitat identification. There is a subtle yet important difference, however, between general benthic mapping and benthic habitat mapping. The distinction is important because it dictates the sequential analysis and visualization techniques that are employed following data collection. In this paper general seafloor mapping for identification of regional geologic features and morphology is defined as benthic mapping. Benthic habitat mapping incorporates the regional scale geologic information but also includes higher resolution surveys and analysis of biological communities to identify the biological habitats. In addition, this paper adopts the definition of habitats established by Kostylev et al. (2001) as a “spatially defined area where the physical, chemical, and biological environment is distinctly different from the surrounding environment.” (PDF contains 31 pages)
Resumo:
Scientific research revolves around the production, analysis, storage, management, and re-use of data. Data sharing offers important benefits for scientific progress and advancement of knowledge. However, several limitations and barriers in the general adoption of data sharing are still in place. Probably the most important challenge is that data sharing is not yet very common among scholars and is not yet seen as a regular activity among scientists, although important efforts are being invested in promoting data sharing. In addition, there is a relatively low commitment of scholars to cite data. The most important problems and challenges regarding data metrics are closely tied to the more general problems related to data sharing. The development of data metrics is dependent on the growth of data sharing practices, after all it is nothing more than the registration of researchers’ behaviour. At the same time, the availability of proper metrics can help researchers to make their data work more visible. This may subsequently act as an incentive for more data sharing and in this way a virtuous circle may be set in motion. This report seeks to further explore the possibilities of metrics for datasets (i.e. the creation of reliable data metrics) and an effective reward system that aligns the main interests of the main stakeholders involved in the process. The report reviews the current literature on data sharing and data metrics. It presents interviews with the main stakeholders on data sharing and data metrics. It also analyses the existing repositories and tools in the field of data sharing that have special relevance for the promotion and development of data metrics. On the basis of these three pillars, the report presents a number of solutions and necessary developments, as well as a set of recommendations regarding data metrics. The most important recommendations include the general adoption of data sharing and data publication among scholars; the development of a reward system for scientists that includes data metrics; reducing the costs of data publication; reducing existing negative cultural perceptions of researchers regarding data publication; developing standards for preservation, publication, identification and citation of datasets; more coordination of data repository initiatives; and further development of interoperability protocols across different actors.