136 results for "categorical and mix datasets"
Abstract:
Ensemble learning techniques generate multiple classifiers, so-called base classifiers, whose combined classification results are used to increase the overall classification accuracy. In most ensemble classifiers the base classifiers are based on the Top Down Induction of Decision Trees (TDIDT) approach. However, an alternative approach for the induction of rule-based classifiers is the Prism family of algorithms. Prism algorithms produce modular classification rules that do not necessarily fit into a decision tree structure. Prism classification rulesets achieve accuracy comparable to, and sometimes higher than, that of decision tree classifiers when the data is noisy and large. Yet Prism still suffers from overfitting on noisy and large datasets. In practice, ensemble techniques tend to reduce overfitting; however, no ensemble learner exists for modular classification rule inducers such as the Prism family of algorithms. This article describes the first development of an ensemble learner based on the Prism family of algorithms, intended to enhance Prism's classification accuracy by reducing overfitting.
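As background to the voting scheme this abstract relies on, the sketch below shows a minimal bagging-style ensemble with majority voting. It uses decision trees as stand-in base classifiers (no public Prism implementation is assumed here), so it illustrates the combination step only, not the authors' Prism-based learner.

```python
# Minimal bagging-style ensemble with majority voting (illustrative only;
# the article's learner uses Prism rule inducers, not decision trees).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)

base_classifiers = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample
    clf = DecisionTreeClassifier(max_depth=5).fit(X[idx], y[idx])
    base_classifiers.append(clf)

# Majority vote across the base classifiers (binary labels 0/1).
votes = np.stack([clf.predict(X) for clf in base_classifiers])
ensemble_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("ensemble training accuracy:", (ensemble_pred == y).mean())
```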
Abstract:
A number of tests exist to check for statistical significance of phase synchronisation within the Electroencephalogram (EEG); however, the majority suffer from a lack of generality and applicability. They may also fail to account for temporal dynamics in the phase synchronisation, regarding synchronisation as a constant state instead of a dynamical process. Therefore, a novel test is developed for identifying the statistical significance of phase synchronisation based upon a combination of work characterising temporal dynamics of multivariate time-series and Markov modelling. We show how this method is better able to assess the significance of phase synchronisation than a range of commonly used significance tests. We also show how the method may be applied to identify and classify significantly different phase synchronisation dynamics in both univariate and multivariate datasets.
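For readers unfamiliar with the quantity under test, the sketch below computes the phase-locking value (PLV), a standard phase-synchronisation statistic derived from instantaneous Hilbert phases. It is generic background, not the Markov-based significance test the paper proposes; the sampling rate and signals are made up.

```python
# Phase-locking value (PLV) between two noisy oscillations at 10 Hz.
import numpy as np
from scipy.signal import hilbert

fs = 250.0                        # sampling rate (Hz), illustrative
t = np.arange(0, 10, 1 / fs)
x = np.sin(2 * np.pi * 10 * t) + 0.5 * np.random.randn(t.size)
y = np.sin(2 * np.pi * 10 * t + 0.3) + 0.5 * np.random.randn(t.size)

phase_x = np.angle(hilbert(x))    # instantaneous phase of each signal
phase_y = np.angle(hilbert(y))
plv = np.abs(np.mean(np.exp(1j * (phase_x - phase_y))))
print(f"PLV = {plv:.3f}")         # 1 = perfect locking, ~0 = no locking
```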
Abstract:
In the past decade, the analysis of data has faced the challenge of dealing with very large and complex datasets and the real-time generation of data. Technologies to store and access these complex and large datasets are in place. However, robust and scalable analysis technologies are needed to extract meaningful information from them. The research field of Information Visualization and Visual Data Analytics addresses this need. Information visualization and data mining are often used to complement each other. Their common goal is the extraction of meaningful information from complex and possibly large data. However, whereas data mining focuses on the use of silicon hardware, visualization techniques also aim to harness the powerful image-processing capabilities of the human brain. This article highlights research on data visualization and visual analytics techniques. Furthermore, we highlight existing visual analytics techniques, systems, and applications, including a perspective on the field from the chemical process industry.
Abstract:
We describe the CHARMe project, which aims to link climate datasets with publications, user feedback and other items of "commentary metadata". The system will help users learn from previous community experience and select datasets that best suit their needs, as well as providing direct traceability between conclusions and the data that supported them. The project applies the principles of Linked Data and adopts the Open Annotation standard to record and publish commentary information. CHARMe contributes to the emerging landscape of "climate services", which will provide climate data and information to influence policy and decision-making. Although the project focuses on climate science, the technologies and concepts are very general and could be applied to other fields.
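As a rough illustration of what a "commentary metadata" record can look like, the sketch below builds an annotation linking a climate dataset to a publication using the W3C Web Annotation JSON-LD vocabulary (the standardised successor of Open Annotation). All URIs are hypothetical placeholders, not real CHARMe identifiers.

```python
# Sketch of an Open Annotation-style commentary record; every URI below
# is a hypothetical placeholder, not a real CHARMe or dataset identifier.
import json

annotation = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "type": "Annotation",
    "motivation": "linking",
    "body": "https://doi.org/10.xxxx/example-paper",    # the publication
    "target": "https://example.org/datasets/sst-v2",    # the climate dataset
    "creator": {"type": "Person", "name": "A. Researcher"},
}
print(json.dumps(annotation, indent=2))
```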
Abstract:
Understanding observed changes to the global water cycle is key to predicting future climate changes and their impacts. While many datasets document crucial variables such as precipitation, ocean salinity, runoff, and humidity, most are too uncertain to reliably determine long-term changes. In situ networks provide long time series over land but are sparse in many regions, particularly the tropics. Satellite and reanalysis datasets provide global coverage, but their long-term stability is lacking. However, comparisons of changes among related variables can give insights into the robustness of observed changes. For example, ocean salinity, interpreted with an understanding of ocean processes, can help cross-validate precipitation. Observational evidence for human influences on the water cycle is emerging, but uncertainties resulting from internal variability and observational errors are too large to determine whether the observed and simulated changes are consistent. Improvements to the in situ and satellite observing networks that monitor the changing water cycle are required, yet continued data coverage is threatened by funding reductions. Uncertainty, both in the role of anthropogenic aerosols and due to large climate variability, presently limits confidence in the attribution of observed changes.
Abstract:
Studies that use prolonged periods of sensory stimulation report associations between regional reductions in neural activity and negative blood oxygenation level-dependent (BOLD) signaling. However, the neural generators of the negative BOLD response remain to be characterized. Here, we use single-impulse electrical stimulation of the whisker pad in the anesthetized rat to identify components of the neural response that are related to “negative” hemodynamic changes in the brain. Laminar multiunit activity and local field potential recordings of neural activity were performed concurrently with two-dimensional optical imaging spectroscopy measuring hemodynamic changes. Repeated measurements over multiple stimulation trials revealed significant variations in neural responses across session and animal datasets. Within this variation, we found robust long-latency decreases (300–2000 ms after stimulus presentation) in gamma-band power (30–80 Hz) in the middle-superficial cortical layers in regions surrounding the activated whisker barrel cortex. This reduction in gamma-frequency activity was associated with corresponding decreases in the hemodynamic responses that drive the negative BOLD signal. These findings suggest a close relationship between BOLD responses and neural events that operate over time scales that outlast the initiating sensory stimulus, and provide important insights into the neurophysiological basis of negative neuroimaging signals.
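The quantity at the heart of this result is band-limited power. The sketch below estimates gamma-band (30–80 Hz) power from a synthetic trace using a zero-phase band-pass filter and a Hilbert envelope; the sampling rate, filter order, and data are assumptions, not the study's actual pipeline.

```python
# Estimating gamma-band (30-80 Hz) power from a field-potential trace.
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

fs = 1000.0                                  # sampling rate (Hz), assumed
t = np.arange(0, 3, 1 / fs)
lfp = np.random.randn(t.size)                # stand-in for a recorded LFP

b, a = butter(4, [30 / (fs / 2), 80 / (fs / 2)], btype="band")
gamma = filtfilt(b, a, lfp)                  # zero-phase band-pass filter
gamma_power = np.abs(hilbert(gamma)) ** 2    # instantaneous power envelope
print("mean gamma power:", gamma_power.mean())
```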
Abstract:
Heterosis refers to the phenomenon in which an F1 hybrid exhibits enhanced growth or agronomic performance relative to its parents. However, previous theoretical studies on heterosis have been based on bi-parental segregating populations instead of F1 hybrids. To understand the genetic basis of heterosis, here we used a subset of F1 hybrids, named a partial North Carolina II design, to perform association mapping for four dependent variables: the original trait value, general combining ability (GCA), specific combining ability (SCA) and mid-parental heterosis (MPH). Our models jointly fitted all the additive, dominance and epistatic effects. The analyses yielded several important findings: 1) the main components are additive and additive-by-additive effects for GCA, and dominance-related effects for SCA and MPH; additive-by-dominance effects for MPH were partly identified as additive effects; 2) the factors affecting heterosis ranked as dominance > dominance-by-dominance > over-dominance > complete dominance; and 3) increasing the proportion of F1 hybrids in the population could significantly increase the power to detect dominance-related effects, while slightly reducing the power to detect additive and additive-by-additive effects. Analyses of cotton and rapeseed datasets showed that more additive-by-additive QTL were detected from GCA than from the trait phenotype, and fewer QTL were detected from MPH than from the other dependent variables.
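For reference, the textbook definitions behind two of the dependent variables are shown below; these are the standard combining-ability forms, not the authors' full joint model with epistatic terms.

```latex
\[
  \mathrm{MPH}_{ij} = \bar{y}_{F_1(ij)} - \tfrac{1}{2}\bigl(\bar{y}_{P_i} + \bar{y}_{P_j}\bigr),
  \qquad
  \bar{y}_{ij} = \mu + \mathrm{GCA}_i + \mathrm{GCA}_j + \mathrm{SCA}_{ij}.
\]
```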
Abstract:
Initializing the ocean for decadal predictability studies is a challenge, as it requires reconstructing the little-observed subsurface trajectory of ocean variability. In this study we explore to what extent surface nudging using well-observed sea surface temperature (SST) can reconstruct the deeper ocean variations for the 1949–2005 period. An ensemble is made with a nudged version of the IPSLCM5A model and compared to ocean reanalyses and reconstructed datasets. The SST is restored to observations using a physically based relaxation coefficient, in contrast to earlier studies, which used a much larger value. The assessment is restricted to the regions where the ocean reanalyses agree, i.e. in the upper 500 m of the ocean, although this can be latitude and basin dependent. Significant reconstruction of the subsurface is achieved in specific regions, namely regions of subduction in the subtropical Atlantic, below the thermocline in the equatorial Pacific and, in some cases, in the North Atlantic deep convection regions. Beyond the mean correlations, ocean integrals are used to explore the time evolution of the correlation over 20-year windows. Classical fixed-depth heat content diagnostics do not exhibit any significant reconstruction between the different existing observation-based references and can therefore not be used to assess global average time-varying correlations in the nudged simulations. Using the physically based average temperature above an isotherm (14 °C) alleviates this issue in the tropics and subtropics and shows significant reconstruction of these quantities in the nudged simulations for several decades. This skill is attributed to the wind stress reconstruction in the tropics, as already demonstrated in a perfect model study using the same model. Thus, we also show here the robustness of this result in a historical and observational context.
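The isotherm-based diagnostic can be illustrated on a single profile: the sketch below averages temperature over the layer warmer than 14 °C, the alternative to fixed-depth heat content used in the study. The profile is synthetic, and the real analysis operates on gridded ensemble output rather than one column.

```python
# Depth-weighted average temperature above the 14 degC isotherm for one
# synthetic profile (illustrative stand-in for the study's diagnostic).
import numpy as np

depth = np.arange(0, 800, 10.0)     # metres, illustrative grid
temp = 25.0 - 0.03 * depth          # synthetic temperature profile

above = temp >= 14.0                # water warmer than the isotherm
avg_T = np.average(temp[above], weights=np.gradient(depth)[above])
print(f"average T above 14 degC isotherm: {avg_T:.2f} degC")
```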
Abstract:
Approximate Bayesian computation (ABC) is a popular family of algorithms which perform approximate parameter inference when numerical evaluation of the likelihood function is not possible but data can be simulated from the model. They return a sample of parameter values which produce simulations close to the observed dataset. A standard approach is to reduce the simulated and observed datasets to vectors of summary statistics and accept when the difference between these is below a specified threshold. ABC can also be adapted to perform model choice. In this article, we present a new software package for R, abctools, which provides methods for tuning ABC algorithms. This includes recent dimension-reduction algorithms to tune the choice of summary statistics, and coverage methods to tune the choice of threshold. We provide several illustrations of these routines on applications taken from the ABC literature.
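The accept/reject scheme summarised above is easy to sketch. The following is a generic rejection-ABC toy example in Python (abctools itself is an R package; none of its functions are used here, and the model, prior, and threshold are made up).

```python
# Rejection ABC: simulate, reduce to summary statistics, accept when the
# distance to the observed summaries is below a threshold.
import numpy as np

rng = np.random.default_rng(1)
observed = rng.normal(3.0, 1.0, size=100)     # pretend this is the data
s_obs = np.array([observed.mean(), observed.std()])

accepted = []
for _ in range(20000):
    theta = rng.uniform(-10, 10)              # draw from the prior
    sim = rng.normal(theta, 1.0, size=100)    # simulate from the model
    s_sim = np.array([sim.mean(), sim.std()])
    if np.linalg.norm(s_sim - s_obs) < 0.2:   # threshold on summary distance
        accepted.append(theta)

print(len(accepted), "accepted; posterior mean ~", np.mean(accepted))
```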
Abstract:
Data from four recent reanalysis projects [ECMWF, NCEP-NCAR, NCEP-Department of Energy (DOE), NASA] have been diagnosed at the scale of synoptic weather systems using an objective feature tracking method. The tracking statistics indicate that, overall, the reanalyses correspond very well in the Northern Hemisphere (NH) lower troposphere, although differences for the spatial distribution of mean intensities show that the ECMWF reanalysis is systematically stronger in the main storm track regions but weaker around major orographic features. A direct comparison of the track ensembles indicates a number of systems with a broad range of intensities that compare well among the reanalyses. In addition, a number of small-scale weak systems are found that have no correspondence among the reanalyses or that only correspond upon relaxing the matching criteria, indicating possible differences in location and/or temporal coherence. These are distributed throughout the storm tracks, particularly in the regions known for small-scale activity, such as secondary development regions and the Mediterranean. For the Southern Hemisphere (SH), agreement is found to be generally less consistent in the lower troposphere with significant differences in both track density and mean intensity. The systems that correspond between the various reanalyses are considerably reduced and those that do not match span a broad range of storm intensities. Relaxing the matching criteria indicates that there is a larger degree of uncertainty in both the location of systems and their intensities compared with the NH. At upper-tropospheric levels, significant differences in the level of activity occur between the ECMWF reanalysis and the other reanalyses in both the NH and SH winters. This occurs due to a lack of coherence in the apparent propagation of the systems in ERA15 and appears most acute above 500 hPa. This is probably due to the use of optimal interpolation data assimilation in ERA15. Also shown are results based on using the same techniques to diagnose the tropical easterly wave activity. Results indicate that the wave activity is sensitive not only to the resolution and assimilation methods used but also to the model formulation.
Abstract:
Genetic association analyses of family-based studies with ordered categorical phenotypes are often conducted using methods either for quantitative or for binary traits, which can lead to suboptimal analyses. Here we present an alternative likelihood-based method of analysis for single nucleotide polymorphism (SNP) genotypes and ordered categorical phenotypes in nuclear families of any size. Our approach, which extends our previous work for binary phenotypes, permits straightforward inclusion of covariate, gene-gene and gene-covariate interaction terms in the likelihood, incorporates a simple model for ascertainment and allows for family-specific effects in the hypothesis test. Additionally, our method produces interpretable parameter estimates and valid confidence intervals. We assess the proposed method using simulated data, and apply it to a polymorphism in the C-reactive protein (CRP) gene typed in families collected to investigate human systemic lupus erythematosus. By including sex interactions in the analysis, we show that the polymorphism is associated with anti-nuclear autoantibody (ANA) production in females, while there appears to be no effect in males.
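A common starting point for ordered categorical phenotypes is the proportional-odds (ordered logit) model shown below; the authors' family-based likelihood with ascertainment and interaction terms is considerably richer, so this is context only.

```latex
\[
  \Pr(Y \le c \mid \mathbf{x})
  = \frac{1}{1 + \exp\{-(\theta_c - \mathbf{x}^\top \boldsymbol{\beta})\}},
  \qquad \theta_1 < \theta_2 < \dots < \theta_{C-1},
\]
```

where the thresholds \(\theta_c\) order the phenotype categories and \(\boldsymbol{\beta}\) carries the genotype and covariate effects.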
Abstract:
In this paper we estimate a Translog output distance function for a balanced panel of state-level data for the Australian dairy processing sector. We estimate a fixed effects specification employing Bayesian methods, with and without the imposition of monotonicity and curvature restrictions. Our results indicate that Tasmania and Victoria are the most technically efficient states, with New South Wales being the least efficient. The imposition of theoretical restrictions marginally affects the results, especially with respect to estimates of technical change and industry deregulation. Importantly, our bias estimates show changes in both input use and output mix that result from deregulation. Specifically, we find that deregulation has positively biased the production of butter, cheese and powders.
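For orientation, a generic translog output distance function takes the form below, in standard notation with outputs \(y_m\) and inputs \(x_k\); the paper's specification additionally includes state fixed effects and time terms, which are omitted here.

```latex
\[
  \ln D_O = \alpha_0
  + \sum_{m} \alpha_m \ln y_m
  + \tfrac{1}{2}\sum_{m}\sum_{n} \alpha_{mn} \ln y_m \ln y_n
  + \sum_{k} \beta_k \ln x_k
  + \tfrac{1}{2}\sum_{k}\sum_{l} \beta_{kl} \ln x_k \ln x_l
  + \sum_{k}\sum_{m} \delta_{km} \ln x_k \ln y_m .
\]
```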
Abstract:
Buffer strips are refuges for a variety of plants, providing resources, such as pollen, nectar and seeds, for higher trophic levels, including invertebrates, mammals and birds. Margins can also harbour plant species that are potentially injurious to the adjacent arable crop (undesirable species). Sowing perennial species in non-cropped buffer strips can reduce weed incidence, but limits the abundance of annuals with the potential to support wider biodiversity (desirable species). We investigated the responses of unsown plant species in buffer strips established with three different seed mixes and managed annually under three contrasting regimes (cutting, sward scarification and selective graminicide). Sward scarification had the strongest influence on the unsown desirable (e.g. Sonchus spp.) and unsown pernicious (e.g. Elytrigia repens) species, and was generally associated with higher cover values of these species. However, the abundances of several desirable weed species, in particular Poa annua, were not promoted by scarification. The cutting and graminicide treatments tended to have negative impacts on the unsown species, except for Cirsium vulgare, which increased with graminicide application. Differences in unsown species cover between seed mixes were minimal, although the grass-only mix was more susceptible to establishment by C. vulgare and Galium aparine than the two grass-and-forb mixes. Annual scarification can enable desirable annuals and sown perennials to co-exist; however, this practice can also promote pernicious species, and so it is unlikely to be widely adopted as a management tool in its current form.
Abstract:
Liquid chromatography-mass spectrometry (LC-MS) datasets can be compared or combined following chromatographic alignment. Here we describe a simple solution to the specific problem of aligning one LC-MS dataset and one LC-MS/MS dataset, acquired on separate instruments from an enzymatic digest of a protein mixture, using feature extraction and a genetic algorithm. First, the LC-MS dataset is searched within a few ppm of the calculated theoretical masses of peptides confidently identified by LC-MS/MS. A piecewise linear function is then fitted to these matched peptides using a genetic algorithm with a fitness function that is insensitive to incorrect matches but sufficiently flexible to adapt to the discrete shifts common when comparing LC datasets. We demonstrate the utility of this method by aligning ion trap LC-MS/MS data with accurate LC-MS data from an FTICR mass spectrometer and show how hybrid datasets can improve peptide and protein identification by combining the speed of the ion trap with the mass accuracy of the FTICR, similar to using a hybrid ion trap-FTICR instrument. We also show that the high resolving power of FTICR can improve precision and linear dynamic range in quantitative proteomics. The alignment software, msalign, is freely available as open source.
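The first step described above, matching LC-MS features to the theoretical masses of peptides identified by LC-MS/MS within a few ppm, can be sketched in a few lines. The masses below are made up, and the genetic-algorithm piecewise-linear fit that msalign implements is not shown.

```python
# Match observed LC-MS feature masses to theoretical peptide masses
# within a ppm tolerance (toy values; real matching feeds the GA fit).
import numpy as np

theoretical = np.array([800.4123, 1203.5891, 1567.7012])  # from LC-MS/MS IDs
observed = np.array([800.4101, 1203.5950, 1420.0001])     # LC-MS features
tol_ppm = 5.0

for m_obs in observed:
    ppm = (m_obs - theoretical) / theoretical * 1e6
    for h in np.where(np.abs(ppm) <= tol_ppm)[0]:
        print(f"{m_obs:.4f} matches {theoretical[h]:.4f} ({ppm[h]:+.1f} ppm)")
```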