221 results for Imbalanced datasets
Abstract:
The CHARMe project enables the annotation of climate data with key pieces of supporting information that we term “commentary”. Commentary reflects the experience that has built up in the user community, and can help new or less-expert users (such as consultants, SMEs, experts in other fields) to understand and interpret complex data. In the context of global climate services, the CHARMe system will record, retain and disseminate this commentary on climate datasets, and provide a means for feeding back this experience to the data providers. Based on novel linked data techniques and standards, the project has developed a core system, data model and suite of open-source tools to enable this information to be shared, discovered and exploited by the community.
Abstract:
The weekly dependence of pollutant aerosols in the urban environment of Lisbon (Portugal) is inferred from the records of atmospheric electric field at Portela meteorological station (38°47′N, 9°08′W). Measurements were made with a Benndorf electrograph. The data set exists from 1955 to 1990, but due to the contaminating effect of the radioactive fallout during the 1960s and 1970s, only the period between 1980 and 1990 is considered here. Using a relative difference method, a weekly dependence of the atmospheric electric field is found in these records, which shows an increasing trend between 1980 and 1990. This is consistent with a growth of population in the Lisbon metropolitan area and consequently urban activity, mainly traffic. Complementarily, using a Lomb–Scargle periodogram technique, the presence of a daily and weekly cycle is also found. Moreover, to follow the evolution of these cycles over the period considered, a colour surface plot of the annual periodograms is presented. Further, a noise analysis of the periodograms is made, which validates the results found. Two datasets were considered: all days in the period, and fair-weather days only.
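The Lomb–Scargle periodogram mentioned in this abstract can be illustrated with a minimal sketch. It uses the classical textbook formula on a synthetic, unevenly sampled series with an artificial 7-day cycle. The data, the function names and the grid of trial periods are all assumptions made for illustration; none of this is the authors' code or the Portela record.

```python
import math
import random


def lomb_scargle(times, values, period):
    """Classical Lomb-Scargle power at one trial period (same units as times)."""
    omega = 2.0 * math.pi / period
    mean = sum(values) / len(values)
    centred = [v - mean for v in values]
    # Time offset tau that makes the sine and cosine terms orthogonal.
    s2 = sum(math.sin(2.0 * omega * t) for t in times)
    c2 = sum(math.cos(2.0 * omega * t) for t in times)
    tau = math.atan2(s2, c2) / (2.0 * omega)
    cos_terms = [math.cos(omega * (t - tau)) for t in times]
    sin_terms = [math.sin(omega * (t - tau)) for t in times]
    yc = sum(y * c for y, c in zip(centred, cos_terms))
    ys = sum(y * s for y, s in zip(centred, sin_terms))
    cc = sum(c * c for c in cos_terms)
    ss = sum(s * s for s in sin_terms)
    return 0.5 * (yc * yc / cc + ys * ys / ss)


def make_synthetic_series(n=300, span_days=365.0, seed=0):
    """Hypothetical data: irregular sample times with a noisy 7-day sinusoid."""
    rng = random.Random(seed)
    times = sorted(rng.uniform(0.0, span_days) for _ in range(n))
    values = [math.sin(2.0 * math.pi * t / 7.0) + rng.gauss(0.0, 0.3) for t in times]
    return times, values


def strongest_period(times, values, periods):
    """Return the trial period with the highest periodogram power."""
    return max(periods, key=lambda p: lomb_scargle(times, values, p))


times, values = make_synthetic_series()
candidate_periods = [float(p) for p in range(2, 15)]  # trial periods in days
best = strongest_period(times, values, candidate_periods)
print("strongest trial period (days):", best)
```

Because the method does not require evenly spaced samples, it suits records with gaps, such as a series restricted to fair-weather days.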
Abstract:
It is becoming increasingly important that we can understand and model flow processes in urban areas. Applications such as weather forecasting, air quality and sustainable urban development rely on accurate modelling of the interface between an urban surface and the atmosphere above. This review gives an overview of current understanding of turbulence generated by an urban surface up to a few building heights, the layer called the roughness sublayer (RSL). High quality datasets are also identified which can be used in the development of suitable parameterisations of the urban RSL. Datasets derived from physical and numerical modelling, and full-scale observations in urban areas now exist across a range of urban-type morphologies (e.g. street canyons, cubes, idealised and realistic building layouts). Results show that the urban RSL depth falls within 2 – 5 times mean building height and is not easily related to morphology. Systematic perturbations away from uniform layouts (e.g. varying building heights) have a significant impact on RSL structure and depth. Considerable fetch is required to develop an overlying inertial sublayer, where turbulence is more homogeneous, and some authors have suggested that the “patchiness” of urban areas may prevent inertial sublayers from developing at all. Turbulence statistics suggest similarities between vegetation and urban canopies but key differences are emerging. There is no consensus as to suitable scaling variables, e.g. friction velocity above canopy vs. square root of maximum Reynolds stress, mean vs. maximum building height. The review includes a summary of existing modelling practices and highlights research priorities.
Abstract:
This study examines the atmospheric circulation patterns and surface features associated with the seven coldest winters in the U.K. since 1870, using the 20th Century Reanalysis. Six of these winters are outside the scope of previous reanalysis datasets; we examine them here for the first time. All winters show a marked lack of the climatological southwesterly flow over the UK, displaying easterly and northeasterly anomalies. Six of the seven winters (all except 1890) were associated with a negative phase of the North Atlantic Oscillation; 1890 was characterised by a blocking anticyclone over and northeast of the UK.
Abstract:
ESA’s first multi-satellite mission Cluster is unique in its concept of 4 satellites orbiting in controlled formations. This will give an unprecedented opportunity to study the structure and dynamics of the magnetosphere. In this paper we discuss ways in which ground-based remote-sensing observations of the ionosphere can be used to support the multipoint in-situ satellite measurements. There are a very large number of potentially useful configurations between the satellites and any one ground-based observatory; however, the number of ideal occurrences for any one configuration is low. Many of the ground-based instruments cannot operate continuously and Cluster will take data only for a part of each orbit, depending on how much high-resolution (‘burst-mode’) data are acquired. In addition, there are a great many instrument modes, as well as the formation, size and shape of the cluster of the four satellites, to consider. These circumstances create a clear and pressing need for careful planning to ensure that the scientific return from Cluster is maximised by additional coordinated ground-based observations. For this reason, ESA established a working group to coordinate the observations on the ground with Cluster. We will give a number of examples of how the combined spacecraft and ground-based observations can address outstanding questions in magnetospheric physics. An online computer tool has been prepared to allow for the planning of conjunctions and advantageous constellations between the Cluster spacecraft and individual or combined ground-based systems. During the mission, a ground-based database containing index and summary data will help to identify interesting datasets and allow intervals to be selected for coordinated studies. We illustrate the philosophy of our approach, using a few important examples of the many possible configurations between the satellite and the ground-based instruments.
Abstract:
Advances in hardware technologies allow data to be captured and processed in real time, and the resulting high-throughput data streams require novel data mining approaches. The research area of Data Stream Mining (DSM) is developing data mining algorithms that allow us to analyse these continuous streams of data in real time. The creation and real-time adaptation of classification models from data streams is one of the most challenging DSM tasks. Current classifiers for streaming data address this problem by using incremental learning algorithms. However, even though these algorithms are fast, they are challenged by high-velocity data streams, where data instances arrive at a fast rate. This is problematic if the application requires little or no delay between changes in the patterns of the stream and absorption of these patterns by the classifier. Problems of scalability to Big Data of traditional data mining algorithms for static (non-streaming) datasets have been addressed through the development of parallel classifiers. However, there is very little work on the parallelisation of data stream classification techniques. In this paper we investigate K-Nearest Neighbours (KNN) as the basis for a real-time adaptive and parallel methodology for scalable data stream classification tasks.
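To make the idea of an adaptive KNN classifier on a stream concrete, here is a minimal sequential sketch. It assumes a fixed-size sliding window as the adaptation mechanism, which the abstract does not specify. The class name, the parameters and the toy stream are hypothetical, and the parallel aspect of the paper is not shown.

```python
import math
from collections import Counter, deque


class SlidingWindowKNN:
    """KNN over the most recent instances only (illustrative assumption)."""

    def __init__(self, k=3, window_size=100):
        self.k = k
        # deque with maxlen drops the oldest instance when a new one arrives.
        self.window = deque(maxlen=window_size)

    def learn(self, features, label):
        """Absorb one labelled instance from the stream."""
        self.window.append((tuple(features), label))

    def predict(self, features):
        """Majority vote among the k nearest instances still in the window."""
        if not self.window:
            return None
        nearest = sorted(
            self.window, key=lambda item: math.dist(item[0], features)
        )[: self.k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]


# Toy stream: two groups of points whose labels swap halfway (concept drift).
clf = SlidingWindowKNN(k=3, window_size=10)
for i in range(10):
    clf.learn((0.01 * i, 0.0), "a")
    clf.learn((5.0 + 0.01 * i, 5.0), "b")
before_drift = clf.predict((0.1, 0.1))

for i in range(10):
    clf.learn((0.01 * i, 0.0), "b")
    clf.learn((5.0 + 0.01 * i, 5.0), "a")
after_drift = clf.predict((0.1, 0.1))

print(before_drift, after_drift)
```

KNN keeps no fitted model, so dropping old instances from the window is enough for the prediction to follow the new pattern. The cost is a distance computation against every stored instance at prediction time, which is the part a parallel design would need to distribute.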
Abstract:
The Environmental Data Abstraction Library provides a modular data management library for bringing new and diverse datatypes together for visualisation within numerous software packages, including the ncWMS viewing service, which already has very wide international uptake. The structure of EDAL is presented along with examples of its use to compare satellite, model and in situ data types within the same visualisation framework. We emphasize the value of this capability for cross calibration of datasets and evaluation of model products against observations, including preparation for data assimilation.
Abstract:
Retreating ice fronts (as a result of a warming climate) expose large expanses of deglaciated forefield, which become colonized by microbes and plants. There has been increasing interest in characterizing the biogeochemical development of these ecosystems using a chronosequence approach. Prior to the establishment of plants, microbes use autochthonously produced and allochthonously delivered nutrients for growth. The microbial community composition is largely made up of heterotrophic microbes (both bacteria and fungi), autotrophic microbes and nitrogen-fixing diazotrophs. Microbial activity is thought to be responsible for the initial build-up of labile nutrient pools, facilitating the growth of higher order plant life in developed soils. However, it is unclear to what extent these ecosystems rely on external sources of nutrients such as ancient carbon pools and periodic nitrogen deposition. Furthermore, the seasonal variation of chronosequence dynamics and the effect of winter are largely unexplored. Modelling this ecosystem will provide a quantitative evaluation of the key processes and could guide the focus of future research. Year-round datasets combined with novel metagenomic techniques will help answer some of the pressing questions in this relatively new but rapidly expanding field, which is of growing interest in the context of future large-scale ice retreat.
Abstract:
Background: In many experimental pipelines, clustering of multidimensional biological datasets is used to detect hidden structures in unlabelled input data. Taverna is a popular workflow management system that is used to design and execute scientific workflows and aid in silico experimentation. The availability of fast unsupervised methods for clustering and visualization in the Taverna platform is important to support data-driven scientific discovery in complex and explorative bioinformatics applications. Results: This work presents a Taverna plugin, the Biological Data Interactive Clustering Explorer (BioDICE), that performs clustering of high-dimensional biological data and provides a nonlinear, topology-preserving projection for the visualization of the input data and their similarities. The core algorithm in the BioDICE plugin is Fast Learning Self Organizing Map (FLSOM), which is an improved variant of the Self Organizing Map (SOM) algorithm. The plugin generates an interactive 2D map that allows the visual exploration of multidimensional data and the identification of groups of similar objects. The effectiveness of the plugin is demonstrated on a case study related to chemical compounds. Conclusions: The number and variety of available tools and its extensibility have made Taverna a popular choice for the development of scientific data workflows. This work presents a novel plugin, BioDICE, which adds a data-driven knowledge discovery component to Taverna. BioDICE provides an effective and powerful clustering tool, which can be adopted for the explorative analysis of biological datasets.
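As a rough illustration of the Self Organizing Map idea behind this abstract, the sketch below trains a plain one-dimensional SOM on toy 2D points. It is the standard SOM update rule, not FLSOM and not the BioDICE plugin. The function names, the decay schedules and the data are assumptions chosen to keep the example small.

```python
import math
import random


def best_matching_unit(weights, x):
    """Index of the map node whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda j: math.dist(weights[j], x))


def train_som(data, n_nodes=4, epochs=50, lr0=0.5, sigma0=1.5, seed=0):
    """Train a 1D chain of nodes with the standard SOM update rule."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Start the nodes evenly spaced along the diagonal of the unit hypercube.
    weights = [[(i + 1) / (n_nodes + 1)] * dim for i in range(n_nodes)]
    samples = list(data)
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                  # learning rate decays linearly
        sigma = max(sigma0 * (1.0 - frac), 0.3)  # neighbourhood width shrinks
        rng.shuffle(samples)
        for x in samples:
            bmu = best_matching_unit(weights, x)
            for j, w in enumerate(weights):
                # Gaussian neighbourhood measured along the chain of nodes.
                h = math.exp(-((j - bmu) ** 2) / (2.0 * sigma ** 2))
                for d in range(dim):
                    w[d] += lr * h * (x[d] - w[d])
    return weights


# Hypothetical toy data: two well-separated groups of 2D points.
cluster_a = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05)]
cluster_b = [(0.9, 1.0), (1.0, 0.9), (0.95, 0.95)]
weights = train_som(cluster_a + cluster_b)
nodes_a = {best_matching_unit(weights, x) for x in cluster_a}
nodes_b = {best_matching_unit(weights, x) for x in cluster_b}
print("nodes for cluster A:", nodes_a, "nodes for cluster B:", nodes_b)
```

Similar inputs end up on the same or adjacent nodes, which is what makes the resulting map usable as a projection for visual exploration.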
Abstract:
Background: The electroencephalogram (EEG) may be described by a large number of different feature types and automated feature selection methods are needed in order to reliably identify features which correlate with continuous independent variables. New method: A method is presented for the automated identification of features that differentiate two or more groups in neurological datasets based upon a spectral decomposition of the feature set. Furthermore, the method is able to identify features that relate to continuous independent variables. Results: The proposed method is first evaluated on synthetic EEG datasets and observed to reliably identify the correct features. The method is then applied to EEG recorded during a music listening task and is observed to automatically identify neural correlates of music tempo changes similar to neural correlates identified in a previous study. Finally, the method is applied to identify neural correlates of music-induced affective states. The identified neural correlates reside primarily over the frontal cortex and are consistent with widely reported neural correlates of emotions. Comparison with existing methods: The proposed method is compared to the state-of-the-art methods of canonical correlation analysis and common spatial patterns, in order to identify features differentiating synthetic event-related potentials of different amplitudes and is observed to exhibit greater performance as the number of unique groups in the dataset increases. Conclusions: The proposed method is able to identify neural correlates of continuous variables in EEG datasets and is shown to outperform canonical correlation analysis and common spatial patterns.
Abstract:
Studies that use prolonged periods of sensory stimulation report associations between regional reductions in neural activity and negative blood oxygenation level-dependent (BOLD) signaling. However, the neural generators of the negative BOLD response remain to be characterized. Here, we use single-impulse electrical stimulation of the whisker pad in the anesthetized rat to identify components of the neural response that are related to “negative” hemodynamic changes in the brain. Laminar multiunit activity and local field potential recordings of neural activity were performed concurrently with two-dimensional optical imaging spectroscopy measuring hemodynamic changes. Repeated measurements over multiple stimulation trials revealed significant variations in neural responses across session and animal datasets. Within this variation, we found robust long-latency decreases (300 and 2000 ms after stimulus presentation) in gamma-band power (30–80 Hz) in the middle-superficial cortical layers in regions surrounding the activated whisker barrel cortex. This reduction in gamma frequency activity was associated with corresponding decreases in the hemodynamic responses that drive the negative BOLD signal. These findings suggest a close relationship between BOLD responses and neural events that operate over time scales that outlast the initiating sensory stimulus, and provide important insights into the neurophysiological basis of negative neuroimaging signals.
Abstract:
For general home monitoring, a system should automatically interpret people’s actions. The system should be non-intrusive, and able to deal with a cluttered background, and loose clothes. An approach based on spatio-temporal local features and a Bag-of-Words (BoW) model is proposed for single-person action recognition from combined intensity and depth images. To restore the temporal structure lost in the traditional BoW method, a dynamic time alignment technique with temporal binning is applied in this work, which has not been previously implemented in the literature for human action recognition on depth imagery. A novel human action dataset with depth data has been created using two Microsoft Kinect sensors. The ReadingAct dataset contains 20 subjects and 19 actions for a total of 2340 videos. To investigate the effect of using depth images and the proposed method, testing was conducted on three depth datasets, and the proposed method was compared to traditional Bag-of-Words methods. Results showed that the proposed method improves recognition accuracy when adding depth to the conventional intensity data, and has advantages when dealing with long actions.
Abstract:
In the last decade, several research results have presented formulations for the auto-calibration problem. Most of these have relied on the evaluation of vanishing points to extract the camera parameters. Normally vanishing points are evaluated using pedestrians or the Manhattan World assumption i.e. it is assumed that the scene is necessarily composed of orthogonal planar surfaces. In this work, we present a robust framework for auto-calibration, with improved results and generalisability for real-life situations. This framework is capable of handling problems such as occlusions and the presence of unexpected objects in the scene. In our tests, we compare our formulation with the state-of-the-art in auto-calibration using pedestrians and Manhattan World-based assumptions. This paper reports on the experiments conducted using publicly available datasets; the results have shown that our formulation represents an improvement over the state-of-the-art.
Abstract:
Background: Somatic embryogenesis (SE) in plants is a process by which embryos are generated directly from somatic cells, rather than from the fused products of male and female gametes. Despite the detailed expression analysis of several somatic-to-embryonic marker genes, a comprehensive understanding of SE at a molecular level is still lacking. The present study was designed to generate high-resolution transcriptome datasets for early SE, paving the way for future research to understand the underlying molecular mechanisms that regulate this process. We sequenced Arabidopsis thaliana somatic embryos collected from three distinct developmental time-points (5, 10 and 15 d after in vitro culture) using the Illumina HiSeq 2000 platform. Results: This study yielded a total of 426,001,826 sequence reads mapped to 26,520 genes in the A. thaliana reference genome. Analysis of embryonic cultures after 5 and 10 d showed differential expression of 1,195 genes; these included 778 genes that were more highly expressed after 5 d as compared to 10 d. Moreover, 1,718 genes were differentially expressed in embryonic cultures between 10 and 15 d. Our data also showed at least eight different expression patterns during early SE; the majority of genes are transcriptionally more active in embryos after 5 d. Comparison of transcriptomes derived from somatic embryos and leaf tissues revealed that at least 4,951 genes are transcriptionally more active in embryos than in the leaf; increased expression of genes involved in DNA cytosine methylation and histone deacetylation was noted in embryogenic tissues. In silico expression analysis based on microarray data found that approximately 5% of these genes are transcriptionally more active in somatic embryos than in actively dividing callus and non-dividing leaf tissues. Moreover, this identified 49 genes expressed at a higher level in somatic embryos than in other tissues. This included several genes with unknown function, as well as others related to oxidative and osmotic stress, and auxin signalling. Conclusions: The transcriptome information provided here will form the foundation for future research on genetic and epigenetic control of plant embryogenesis at a molecular level. In follow-up studies, these data could be used to construct a regulatory network for SE; the genes more highly expressed in somatic embryos than in vegetative tissues can be considered as potential candidates to validate these networks.