902 results for Large Data Sets
Abstract:
Novel imaging techniques are playing an increasingly important role in drug development, providing insight into the mechanism of action of new chemical entities. The data sets obtained by these methods can be large with complex inter-relationships, but the most appropriate statistical analysis for handling this data is often uncertain - precisely because of the exploratory nature of the way the data are collected. We present an example from a clinical trial using magnetic resonance imaging to assess changes in atherosclerotic plaques following treatment with a tool compound with established clinical benefit. We compared two specific approaches to handle the correlations due to physical location and repeated measurements: two-level and four-level multilevel models. The two methods identified similar structural variables, but higher level multilevel models had the advantage of explaining a greater proportion of variation, and the modeling assumptions appeared to be better satisfied.
Abstract:
Advances in hardware and software technology enable us to collect, store and distribute large quantities of data on a very large scale. Automatically discovering and extracting hidden knowledge in the form of patterns from these large data volumes is known as data mining. Data mining technology is not only a part of business intelligence, but is also used in many other application areas such as research, marketing and financial analytics. For example medical scientists can use patterns extracted from historic patient data in order to determine if a new patient is likely to respond positively to a particular treatment or not; marketing analysts can use extracted patterns from customer data for future advertisement campaigns; finance experts have an interest in patterns that forecast the development of certain stock market shares for investment recommendations. However, extracting knowledge in the form of patterns from massive data volumes imposes a number of computational challenges in terms of processing time, memory, bandwidth and power consumption. These challenges have led to the development of parallel and distributed data analysis approaches and the utilisation of Grid and Cloud computing. This chapter gives an overview of parallel and distributed computing approaches and how they can be used to scale up data mining to large datasets.
Abstract:
One of the most pervasive assumptions about human brain evolution is that it involved relative enlargement of the frontal lobes. We show that this assumption is without foundation. Analysis of five independent data sets using correctly scaled measures and phylogenetic methods reveals that the size of human frontal lobes, and of specific frontal regions, is as expected relative to the size of other brain structures. Recent claims for relative enlargement of human frontal white matter volume, and for relative enlargement shared by all great apes, seem to be mistaken. Furthermore, using a recently developed method for detecting shifts in evolutionary rates, we find that the rate of change in relative frontal cortex volume along the phylogenetic branch leading to humans was unremarkable and that other branches showed significantly faster rates of change. Although absolute and proportional frontal region size increased rapidly in humans, this change was tightly correlated with corresponding size increases in other areas and whole brain size, and with decreases in frontal neuron densities. The search for the neural basis of human cognitive uniqueness should therefore focus less on the frontal lobes in isolation and more on distributed neural networks.
Abstract:
Within the SPARC Data Initiative, the first comprehensive assessment of the quality of 13 water vapor products from 11 limb-viewing satellite instruments (LIMS, SAGE II, UARS-MLS, HALOE, POAM III, SMR, SAGE III, MIPAS, SCIAMACHY, ACE-FTS, and Aura-MLS) obtained within the time period 1978-2010 has been performed. Each instrument's water vapor profile measurements were compiled into monthly zonal mean time series on a common latitude-pressure grid. These time series serve as basis for the "climatological" validation approach used within the project. The evaluations include comparisons of monthly or annual zonal mean cross sections and seasonal cycles in the tropical and extratropical upper troposphere and lower stratosphere averaged over one or more years, comparisons of interannual variability, and a study of the time evolution of physical features in water vapor such as the tropical tape recorder and polar vortex dehydration. Our knowledge of the atmospheric mean state in water vapor is best in the lower and middle stratosphere of the tropics and midlatitudes, with a relative uncertainty of +/- 2-6% (as quantified by the standard deviation of the instruments' multiannual means). The uncertainty increases toward the polar regions (+/- 10-15%), the mesosphere (+/- 15%), and the upper troposphere/lower stratosphere below 100 hPa (+/- 30-50%), where sampling issues add uncertainty due to large gradients and high natural variability in water vapor. The minimum found in multiannual (1998-2008) mean water vapor in the tropical lower stratosphere is 3.5 ppmv (+/- 14%), with slightly larger uncertainties for monthly mean values. The frequently used HALOE water vapor data set shows consistently lower values than most other data sets throughout the atmosphere, with increasing deviations from the multi-instrument mean below 100 hPa in both the tropics and extratropics.
The knowledge gained from these comparisons and regarding the quality of the individual data sets in different regions of the atmosphere will help to improve model-measurement comparisons (e.g., for diagnostics such as the tropical tape recorder or seasonal cycles), data merging activities, and studies of climate variability.
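The abstract above quantifies uncertainty as the standard deviation of the instruments' multiannual means. A minimal Python sketch of that kind of calculation follows. The function name and the input values are hypothetical, and the use of the sample (n - 1) standard deviation expressed relative to the multi-instrument mean is an assumption; the abstract does not specify either choice.

```python
from statistics import mean, stdev


def relative_spread_percent(instrument_means):
    """1-sigma spread of per-instrument means, as a percentage of their mean.

    Assumption: sample standard deviation (n - 1 denominator), normalised by
    the multi-instrument mean. The source abstract does not state this detail.
    """
    multi_instrument_mean = mean(instrument_means)
    return 100.0 * stdev(instrument_means) / multi_instrument_mean


# Hypothetical multiannual mean water vapor values (ppmv), one per instrument.
# These numbers are illustrative only and are not taken from the study.
example_means_ppmv = [3.3, 3.5, 3.6, 3.4, 3.7]

print(f"relative spread: +/- {relative_spread_percent(example_means_ppmv):.1f}%")
```

In practice such a calculation would be repeated for every latitude-pressure grid cell, which is how region-dependent uncertainty ranges like those quoted above could arise.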
Abstract:
A comprehensive quality assessment of the ozone products from 18 limb-viewing satellite instruments is provided by means of a detailed intercomparison. The ozone climatologies in form of monthly zonal mean time series covering the upper troposphere to lower mesosphere are obtained from LIMS, SAGE I/II/III, UARS-MLS, HALOE, POAM II/III, SMR, OSIRIS, MIPAS, GOMOS, SCIAMACHY, ACE-FTS, ACE-MAESTRO, Aura-MLS, HIRDLS, and SMILES within 1978–2010. The intercomparisons focus on mean biases of annual zonal mean fields, interannual variability, and seasonal cycles. Additionally, the physical consistency of the data is tested through diagnostics of the quasi-biennial oscillation and Antarctic ozone hole. The comprehensive evaluations reveal that the uncertainty in our knowledge of the atmospheric ozone mean state is smallest in the tropical and midlatitude middle stratosphere with a 1σ multi-instrument spread of less than ±5%. While the overall agreement among the climatological data sets is very good for large parts of the stratosphere, individual discrepancies have been identified, including unrealistic month-to-month fluctuations, large biases in particular atmospheric regions, or inconsistencies in the seasonal cycle. Notable differences between the data sets exist in the tropical lower stratosphere (with a spread of ±30%) and at high latitudes (±15%). In particular, large relative differences are identified in the Antarctic during the time of the ozone hole, with a spread between the monthly zonal mean fields of ±50%. The evaluations provide guidance on what data sets are the most reliable for applications such as studies of ozone variability, model-measurement comparisons, detection of long-term trends, and data-merging activities.
Abstract:
We present the first comprehensive intercomparison of currently available satellite ozone climatologies in the upper troposphere/lower stratosphere (UTLS) (300–70 hPa) as part of the Stratosphere-troposphere Processes and their Role in Climate (SPARC) Data Initiative. The Tropospheric Emission Spectrometer (TES) instrument is the only nadir-viewing instrument in this initiative, as well as the only instrument with a focus on tropospheric composition. We apply the TES observational operator to ozone climatologies from the more highly vertically resolved limb-viewing instruments. This minimizes the impact of differences in vertical resolution among the instruments and allows identification of systematic differences in the large-scale structure and variability of UTLS ozone. We find that the climatologies from most of the limb-viewing instruments show positive differences (ranging from 5 to 75%) with respect to TES in the tropical UTLS, and comparison to a “zonal mean” ozonesonde climatology indicates that these differences likely represent a positive bias for p ≤ 100 hPa. In the extratropics, there is good agreement among the climatologies regarding the timing and magnitude of the ozone seasonal cycle (differences in the peak-to-peak amplitude of <15%) when the TES observational operator is applied, as well as very consistent midlatitude interannual variability. The discrepancies in ozone temporal variability are larger in the tropics, with differences between the data sets of up to 55% in the seasonal cycle amplitude. However, the differences among the climatologies are everywhere much smaller than the range produced by current chemistry-climate models, indicating that the multiple-instrument ensemble is useful for quantitatively evaluating these models.
Abstract:
Wild bird feeding is popular in domestic gardens across the world. Nevertheless, there is surprisingly little empirical information on certain aspects of the activity and no year-round quantitative records of the amounts and nature of the different foods provided in individual gardens. We sought to characterise garden bird feeding in a large UK urban area in two ways. First, we conducted face-to-face questionnaires with a representative cross-section of residents. Just over half fed birds, the majority doing so year round and at least weekly. Second, a two-year study recorded all foodstuffs put out by households on every provisioning occasion. A median of 628 kcal/garden/day was given. Provisioning level was not significantly influenced by weather or season. Comparisons between the data sets revealed significantly less frequent feeding amongst these ‘keen’ feeders than the face-to-face questionnaire respondents, suggesting that one-off questionnaires may overestimate provisioning frequency. Assuming 100% uptake, the median provisioning level equates to sufficient supplementary resources across the UK to support 196 million individuals of a hypothetical average garden-feeding bird species (based on 10 common UK garden-feeding birds’ energy requirements). Taking the lowest provisioning level recorded (101 kcal/day) as a conservative measure, 31 million of these average individuals could theoretically be supported.
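The two population figures in the abstract above can be related by simple proportional scaling. The sketch below assumes that the number of birds supported scales linearly with the provisioning level; the abstract does not describe the authors' actual calculation, and the function name is hypothetical. Only the three quoted figures (628 kcal/garden/day, 196 million birds, 101 kcal/day) come from the source.

```python
# Figures quoted in the abstract.
MEDIAN_KCAL_PER_GARDEN_DAY = 628.0
BIRDS_SUPPORTED_AT_MEDIAN = 196e6
LOWEST_KCAL_PER_GARDEN_DAY = 101.0


def birds_supported(kcal_per_garden_day):
    """Scale the median-based estimate to another provisioning level.

    Assumption: the supported population is directly proportional to the
    energy provided per garden per day.
    """
    scale = kcal_per_garden_day / MEDIAN_KCAL_PER_GARDEN_DAY
    return BIRDS_SUPPORTED_AT_MEDIAN * scale


estimate = birds_supported(LOWEST_KCAL_PER_GARDEN_DAY)
print(f"{estimate / 1e6:.1f} million birds")
```

Under this assumption the lowest provisioning level gives roughly 31.5 million birds, which is in line with the 31 million reported.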
Abstract:
With a rapidly increasing fraction of electricity generation being sourced from wind, extreme wind power generation events such as prolonged periods of low (or high) generation and ramps in generation, are a growing concern for the efficient and secure operation of national power systems. As extreme events occur infrequently, long and reliable meteorological records are required to accurately estimate their characteristics. Recent publications have begun to investigate the use of global meteorological “reanalysis” data sets for power system applications, many of which focus on long-term average statistics such as monthly-mean generation. Here we demonstrate that reanalysis data can also be used to estimate the frequency of relatively short-lived extreme events (including ramping on sub-daily time scales). Verification against 328 surface observation stations across the United Kingdom suggests that near-surface wind variability over spatiotemporal scales greater than around 300 km and 6 h can be faithfully reproduced using reanalysis, with no need for costly dynamical downscaling. A case study is presented in which a state-of-the-art, 33 year reanalysis data set (MERRA, from NASA-GMAO), is used to construct an hourly time series of nationally-aggregated wind power generation in Great Britain (GB), assuming a fixed, modern distribution of wind farms. The resultant generation estimates are highly correlated with recorded data from National Grid in the recent period, both for instantaneous hourly values and for variability over time intervals greater than around 6 h. This 33 year time series is then used to quantify the frequency with which different extreme GB-wide wind power generation events occur, as well as their seasonal and inter-annual variability. 
Several novel insights into the nature of extreme wind power generation events are described, including (i) that the number of prolonged low or high generation events is well approximated by a Poisson-like random process, and (ii) whilst in general there is large seasonal variability, the magnitude of the most extreme ramps is similar in both summer and winter. An up-to-date version of the GB case study data as well as the underlying model are freely available for download from our website: http://www.met.reading.ac.uk/~energymet/data/Cannon2014/.
Abstract:
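One simple way to check whether yearly event counts look Poisson-like is the index of dispersion (variance divided by mean), which is close to 1 for a Poisson process. The sketch below illustrates that check only; it is not the authors' method, and the function name and annual counts are hypothetical.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Return the sample variance divided by the mean of event counts.

    Values near 1 are consistent with a Poisson process; values well above 1
    suggest clustering, and values well below 1 suggest regular spacing.
    """
    return variance(counts) / mean(counts)


# Hypothetical number of prolonged low-generation events in each of six years.
annual_event_counts = [2, 4, 3, 5, 1, 3]

print(f"index of dispersion: {dispersion_index(annual_event_counts):.2f}")
```

With only a handful of years the index is noisy, so a long record such as the 33 year series described above would be needed before drawing conclusions.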
Catastrophe risk models used by the insurance industry are likely subject to significant uncertainty, but due to their proprietary nature and strict licensing conditions they are not available for experimentation. In addition, even if such experiments were conducted, these would not be repeatable by other researchers because commercial confidentiality issues prevent the details of proprietary catastrophe model structures from being described in public domain documents. However, such experimentation is urgently required to improve decision making in both insurance and reinsurance markets. In this paper we therefore construct our own catastrophe risk model for flooding in Dublin, Ireland, in order to assess the impact of typical precipitation data uncertainty on loss predictions. As we consider only a city region rather than a whole territory and have access to detailed data and computing resources typically unavailable to industry modellers, our model is significantly more detailed than most commercial products. The model consists of four components, a stochastic rainfall module, a hydrological and hydraulic flood hazard module, a vulnerability module, and a financial loss module. Using these we undertake a series of simulations to test the impact of driving the stochastic event generator with four different rainfall data sets: ground gauge data, gauge-corrected rainfall radar, meteorological reanalysis data (European Centre for Medium-Range Weather Forecasts Reanalysis-Interim; ERA-Interim) and a satellite rainfall product (The Climate Prediction Center morphing method; CMORPH). Catastrophe models are unusual because they use the upper three components of the modelling chain to generate a large synthetic database of unobserved and severe loss-driving events for which estimated losses are calculated. 
We find the loss estimates to be more sensitive to uncertainties propagated from the driving precipitation data sets than to other uncertainties in the hazard and vulnerability modules, suggesting that the range of uncertainty within catastrophe model structures may be greater than commonly believed.
Abstract:
BACKGROUND AND OBJECTIVE: Given the role of uncoupling protein 2 (UCP2) in the accumulation of fat in the hepatocytes and in the enhancement of protective mechanisms in acute ethanol intake, we hypothesised that UCP2 polymorphisms are likely to cause liver disease through their interactions with obesity and alcohol intake. To test this hypothesis, we investigated the interaction between tagging polymorphisms in the UCP2 gene (rs2306819, rs599277 and rs659366), alcohol intake and obesity traits such as BMI and waist circumference (WC) on alanine aminotransferase (ALT) and gamma glutamyl transferase (GGT) in a large meta-analysis of data sets from three populations (n=20 242). DESIGN AND METHODS: The study populations included the Northern Finland Birth Cohort 1966 (n=4996), Netherlands Study of Depression and Anxiety (n=1883) and LifeLines Cohort Study (n=13 363). Interactions between the polymorphisms and obesity and alcohol intake on dichotomised ALT and GGT levels were assessed using logistic regression and the likelihood ratio test. RESULTS: In the meta-analysis of the three cohorts, none of the three UCP2 polymorphisms were associated with GGT or ALT levels. There was no evidence for interaction between the polymorphisms and alcohol intake on GGT and ALT levels. In contrast, the association of WC and BMI with GGT levels varied by rs659366 genotype (Pinteraction=0.03 and 0.007, respectively; adjusted for age, gender, high alcohol intake, diabetes, hypertension and serum lipid concentrations). CONCLUSION: In conclusion, our findings in 20 242 individuals suggest that UCP2 gene polymorphisms may cause liver dysfunction through the interaction with body fat rather than alcohol intake.
Abstract:
A quality assessment of the CFC-11 (CCl3F), CFC-12 (CCl2F2), HF, and SF6 products from limb-viewing satellite instruments is provided by means of a detailed intercomparison. The climatologies in the form of monthly zonal mean time series are obtained from HALOE, MIPAS, ACE-FTS, and HIRDLS within the time period 1991–2010. The intercomparisons focus on the mean biases of the monthly and annual zonal mean fields and aim to identify their vertical, latitudinal and temporal structure. The CFC evaluations (based on MIPAS, ACE-FTS and HIRDLS) reveal that the uncertainty in our knowledge of the atmospheric CFC-11 and CFC-12 mean state, as given by satellite data sets, is smallest in the tropics and mid-latitudes at altitudes below 50 and 20 hPa, respectively, with a 1σ multi-instrument spread of up to ±5 %. For HF, the situation is reversed. The two available data sets (HALOE and ACE-FTS) agree well above 100 hPa, with a spread in this region of ±5 to ±10 %, while at altitudes below 100 hPa the HF annual mean state is less well known, with a spread ±30 % and larger. The atmospheric SF6 annual mean states derived from two satellite data sets (MIPAS and ACE-FTS) show only very small differences with a spread of less than ±5 % and often below ±2.5 %. While the overall agreement among the climatological data sets is very good for large parts of the upper troposphere and lower stratosphere (CFCs, SF6) or middle stratosphere (HF), individual discrepancies have been identified. Pronounced deviations between the instrument climatologies exist for particular atmospheric regions which differ from gas to gas. Notable features are differently shaped isopleths in the subtropics, deviations in the vertical gradients in the lower stratosphere and in the meridional gradients in the upper troposphere, and inconsistencies in the seasonal cycle. Additionally, long-term drifts between the instruments have been identified for the CFC-11 and CFC-12 time series. 
The evaluations as a whole provide guidance on what data sets are the most reliable for applications such as studies of atmospheric transport and variability, model–measurement comparisons and detection of long-term trends. The data sets will be publicly available from the SPARC Data Centre and through PANGAEA (doi:10.1594/PANGAEA.849223).
Abstract:
Precipitation and temperature climate indices are calculated using the National Centers for Environmental Prediction/National Center for Atmospheric Research (NCEP/NCAR) reanalysis and validated against observational data from some stations over Brazil and other data sources. The spatial patterns of the climate indices trends are analyzed for the period 1961-1990 over South America. In addition, the correlation and linear regression coefficients for some specific stations were also obtained in order to compare with the reanalysis data. In general, the results suggest that NCEP/NCAR reanalysis can provide useful information about minimum temperature and consecutive dry days indices at individual grid cells in Brazil. However, some regional differences in the climate indices trends are observed when different data sets are compared. For instance, the NCEP/NCAR reanalysis shows a reversal signal for all rainfall annual indices and the cold night index over Argentina. Despite these differences, maps of the trends for most of the annual climate indices obtained from the NCEP/NCAR reanalysis and BRANT analysis are generally in good agreement with other available data sources and previous findings in the literature for large areas of southern South America. The pattern of trends for the precipitation annual indices over the 30 years analyzed indicates a change to wetter conditions over southern and southeastern parts of Brazil, Paraguay, Uruguay, central and northern Argentina, and parts of Chile and a decrease over southwestern South America. All over South America, the climate indices related to the minimum temperature (warm or cold nights) have clearly shown a warming tendency; however, no consistent changes in maximum temperature extremes (warm and cold days) have been observed. Therefore, one must be careful before suggesting any trends for warm or cold days.
Abstract:
A detailed genome mapping analysis of 213,636 expressed sequence tags (EST) derived from nontumor and tumor tissues of the oral cavity, larynx, pharynx, and thyroid was done. Transcripts matching known human genes were identified; potential new splice variants were flagged and subjected to manual curation, pointing to 788 putatively new alternative splicing isoforms, the majority (75%) being insertion events. A subset of 34 new splicing isoforms (5% of 788 events) was selected and 23 (68%) were confirmed by reverse transcription-PCR and DNA sequencing. Putative new genes were revealed, including six transcripts mapped to well-studied chromosomes such as 22, as well as transcripts that mapped to 253 intergenic regions. In addition, 2,251 noncoding intronic RNAs, eventually involved in transcriptional regulation, were found. A set of 250 candidate markers for loss of heterozygosis or gene amplification was selected by identifying transcripts that mapped to genomic regions previously known to be frequently amplified or deleted in head, neck, and thyroid tumors. Three of these markers were evaluated by quantitative reverse transcription-PCR in an independent set of individual samples. Along with detailed clinical data about tumor origin, the information reported here is now publicly available on a dedicated Web site as a resource for further biological investigation. This first in silico reconstruction of the head, neck, and thyroid transcriptomes points to a wealth of new candidate markers that can be used for future studies on the molecular basis of these tumors. Similar analysis is warranted for a number of other tumors for which large EST data sets are available.
Abstract:
This article introduces the software program called EthoSeq, which is designed to extract probabilistic behavioral sequences (tree-generated sequences, or TGSs) from observational data and to prepare a TGS-species matrix for phylogenetic analysis. The program uses Graph Theory algorithms to automatically detect behavioral patterns within the observational sessions. It includes filtering tools to adjust the search procedure to user-specified statistical needs. Preliminary analyses of data sets, such as grooming sequences in birds and foraging tactics in spiders, uncover a large number of TGSs which together yield single phylogenetic trees. An example of the use of the program is our analysis of felid grooming sequences, in which we have obtained 1,386 felid grooming TGSs for seven species, resulting in a single phylogeny. These results show that behavior is definitely useful in phylogenetic analysis. EthoSeq simplifies and automates such analyses, uncovers many of the hidden patterns of long behavioral sequences, and prepares these data for further analysis with standard phylogenetic programs. We hope it will encourage many empirical studies on the evolution of behavior.
Abstract:
Semi-supervised learning is applied to classification problems where only a small portion of the data items is labeled. In these cases, the reliability of the labels is a crucial factor, because mislabeled items may propagate wrong labels to a large portion or even the entire data set. This paper aims to address this problem by presenting a graph-based (network-based) semi-supervised learning method, specifically designed to handle data sets with mislabeled samples. The method uses teams of walking particles, with competitive and cooperative behavior, for label propagation in the network constructed from the input data set. The proposed model is nature-inspired and it incorporates some features to make it robust to a considerable amount of mislabeled data items. Computer simulations show the performance of the method in the presence of different percentages of mislabeled data, in networks of different sizes and average node degree. Importantly, these simulations reveal the existence of critical points of the mislabeled subset size, below which the network is free of wrong label contamination, but above which the mislabeled samples start to propagate their labels to the rest of the network. Moreover, numerical comparisons have been made among the proposed method and other representative graph-based semi-supervised learning methods using both artificial and real-world data sets. Interestingly, the proposed method performs increasingly better than the others as the percentage of mislabeled samples grows. © 2012 IEEE.
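To make the contamination problem concrete, the sketch below shows a plain majority-vote label propagation on a small graph. This is a deliberately simple baseline, not the particle competition and cooperation method proposed in the paper; the graph, the seed labels, the function name, and the alphabetical tie-breaking rule are all hypothetical choices made for illustration.

```python
def propagate_labels(adjacency, seeds, max_iters=100):
    """Spread seed labels through a graph by repeated neighbour majority vote.

    adjacency: dict mapping each node to a list of neighbouring nodes.
    seeds: dict mapping pre-labelled nodes to labels (kept fixed throughout).
    Ties are broken alphabetically, an arbitrary choice for this sketch.
    """
    labels = dict(seeds)
    for _ in range(max_iters):
        changed = False
        for node in sorted(adjacency):
            if node in seeds:
                continue  # seed labels are never overwritten
            votes = {}
            for neighbour in adjacency[node]:
                if neighbour in labels:
                    label = labels[neighbour]
                    votes[label] = votes.get(label, 0) + 1
            if not votes:
                continue  # no labelled neighbour yet
            best = max(sorted(votes), key=lambda label: votes[label])
            if labels.get(node) != best:
                labels[node] = best
                changed = True
        if not changed:
            break
    return labels


# Two triangles (0-1-2 and 3-4-5) joined by the edge 2-3.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}

clean = propagate_labels(graph, {0: "A", 4: "B", 5: "B"})
noisy = propagate_labels(graph, {0: "A", 1: "B", 4: "B", 5: "B"})  # node 1 mislabeled
print(clean)
print(noisy)
```

With clean seeds each triangle keeps its own label, whereas a single mislabeled seed at node 1 is enough for label "B" to take over node 2 as well. This is the kind of wrong-label propagation that the paper's method is designed to resist.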