43 resultados para data-driven Stochastic Subspace Identification (SSI-data)
Resumo:
Background: In many experimental pipelines, clustering of multidimensional biological datasets is used to detect hidden structures in unlabelled input data. Taverna is a popular workflow management system that is used to design and execute scientific workflows and aid in silico experimentation. The availability of fast unsupervised methods for clustering and visualization in the Taverna platform is important to support a data-driven scientific discovery in complex and explorative bioinformatics applications. Results: This work presents a Taverna plugin, the Biological Data Interactive Clustering Explorer (BioDICE), that performs clustering of high-dimensional biological data and provides a nonlinear, topology preserving projection for the visualization of the input data and their similarities. The core algorithm in the BioDICE plugin is Fast Learning Self Organizing Map (FLSOM), which is an improved variant of the Self Organizing Map (SOM) algorithm. The plugin generates an interactive 2D map that allows the visual exploration of multidimensional data and the identification of groups of similar objects. The effectiveness of the plugin is demonstrated on a case study related to chemical compounds. Conclusions: The number and variety of available tools and its extensibility have made Taverna a popular choice for the development of scientific data workflows. This work presents a novel plugin, BioDICE, which adds a data-driven knowledge discovery component to Taverna. BioDICE provides an effective and powerful clustering tool, which can be adopted for the explorative analysis of biological datasets.
Resumo:
ISO19156 Observations and Measurements (O&M) provides a standardised framework for organising information about the collection of information about the environment. Here we describe the implementation of a specialisation of O&M for environmental data, the Metadata Objects for Linking Environmental Sciences (MOLES3). MOLES3 provides support for organising information about data, and for user navigation around data holdings. The implementation described here, “CEDA-MOLES”, also supports data management functions for the Centre for Environmental Data Archival, CEDA. The previous iteration of MOLES (MOLES2) saw active use over five years, being replaced by CEDA-MOLES in late 2014. During that period important lessons were learnt both about the information needed, as well as how to design and maintain the necessary information systems. In this paper we review the problems encountered in MOLES2; how and why CEDA-MOLES was developed and engineered; the migration of information holdings from MOLES2 to CEDA-MOLES; and, finally, provide an early assessment of MOLES3 (as implemented in CEDA-MOLES) and its limitations. Key drivers for the MOLES3 development included the necessity for improved data provenance, for further structured information to support ISO19115 discovery metadata export (for EU INSPIRE compliance), and to provide appropriate fixed landing pages for Digital Object Identifiers (DOIs) in the presence of evolving datasets. Key lessons learned included the importance of minimising information structure in free text fields, and the necessity to support as much agility in the information infrastructure as possible without compromising on maintainability both by those using the systems internally and externally (e.g. citing in to the information infrastructure), and those responsible for the systems themselves. The migration itself needed to ensure continuity of service and traceability of archived assets.
Resumo:
Several global quantities are computed from the ERA40 reanalysis for the period 1958-2001 and explored for trends. These are discussed in the context of changes to the global observing system. Temperature, integrated water vapor (IWV), and kinetic energy are considered. The ERA40 global mean temperature in the lower troposphere has a trend of +0.11 K per decade over the period of 1979-2001, which is slightly higher than the MSU measurements, but within the estimated error limit. For the period 1958 2001 the warming trend is 0.14 K per decade but this is likely to be an artifact of changes in the observing system. When this is corrected for, the warming trend is reduced to 0.10 K per decade. The global trend in IWV for the period 1979-2001 is +0.36 mm per decade. This is about twice as high as the trend determined from the Clausius-Clapeyron relation assuming conservation of relative humidity. It is also larger than results from free climate model integrations driven by the same observed sea surface temperature as used in ERA40. It is suggested that the large trend in IWV does not represent a genuine climate trend but an artifact caused by changes in the global observing system such as the use of SSM/I and more satellite soundings in later years. Recent results are in good agreement with GPS measurements. The IWV trend for the period 1958-2001 is still higher but reduced to +0.16 mm per decade when corrected for changes in the observing systems. Total kinetic energy shows an increasing global trend. Results from data assimilation experiments strongly suggest that this trend is also incorrect and mainly caused by the huge changes in the global observing system in 1979. When this is corrected for, no significant change in global kinetic energy from 1958 onward can be found.
Resumo:
Recent interest in the validation of general circulation models (GCMs) has been devoted to objective methods. A small number of authors have used the direct synoptic identification of phenomena together with a statistical analysis to perform the objective comparison between various datasets. This paper describes a general method for performing the synoptic identification of phenomena that can be used for an objective analysis of atmospheric, or oceanographic, datasets obtained from numerical models and remote sensing. Methods usually associated with image processing have been used to segment the scene and to identify suitable feature points to represent the phenomena of interest. This is performed for each time level. A technique from dynamic scene analysis is then used to link the feature points to form trajectories. The method is fully automatic and should be applicable to a wide range of geophysical fields. An example will be shown of results obtained from this method using data obtained from a run of the Universities Global Atmospheric Modelling Project GCM.
Resumo:
Convectively coupled equatorial waves are fundamental components of the interaction between the physics and dynamics of the tropical atmosphere. A new methodology, which isolates individual equatorial wave modes, has been developed and applied to observational data. The methodology assumes that the horizontal structures given by equatorial wave theory can be used to project upper- and lower-tropospheric data onto equatorial wave modes. The dynamical fields are first separated into eastward- and westward-moving components with a specified domain of frequency–zonal wavenumber. Each of the components for each field is then projected onto the different equatorial modes using the y structures of these modes given by the theory. The latitudinal scale yo of the modes is predetermined by data to fit the equatorial trapping in a suitable latitude belt y = ±Y. The extent to which the different dynamical fields are consistent with one another in their depiction of each equatorial wave structure determines the confidence in the reality of that structure. Comparison of the analyzed modes with the eastward- and westward-moving components in the convection field enables the identification of the dynamical structure and nature of convectively coupled equatorial waves. In a case study, the methodology is applied to two independent data sources, ECMWF Reanalysis and satellite-observed window brightness temperature (Tb) data for the summer of 1992. Various convectively coupled equatorial Kelvin, mixed Rossby–gravity, and Rossby waves have been detected. The results indicate a robust consistency between the two independent data sources. Different vertical structures for different wave modes and a significant Doppler shifting effect of the background zonal winds on wave structures are found and discussed. It is found that in addition to low-level convergence, anomalous fluxes induced by strong equatorial zonal winds associated with equatorial waves are important for inducing equatorial convection. There is evidence that equatorial convection associated with Rossby waves leads to a change in structure involving a horizontal structure similar to that of a Kelvin wave moving westward with it. The vertical structure may also be radically changed. The analysis method should make a very powerful diagnostic tool for investigating convectively coupled equatorial waves and the interaction of equatorial dynamics and physics in the real atmosphere. The results from application of the analysis method for a reanalysis dataset should provide a benchmark against which model studies can be compared.
Resumo:
The Global Ocean Data Assimilation Experiment (GODAE [http:// www.godae.org]) has spanned a decade of rapid technological development. The ever-increasing volume and diversity of oceanographic data produced by in situ instruments, remote-sensing platforms, and computer simulations have driven the development of a number of innovative technologies that are essential for connecting scientists with the data that they need. This paper gives an overview of the technologies that have been developed and applied in the course of GODAE, which now provide users of oceanographic data with the capability to discover, evaluate, visualize, download, and analyze data from all over the world. The key to this capability is the ability to reduce the inherent complexity of oceanographic data by providing a consistent, harmonized view of the various data products. The challenges of data serving have been addressed over the last 10 years through the cooperative skills and energies of many individuals.
Resumo:
Standardisation of microsatellite allele profiles between laboratories is of fundamental importance to the transferability of genetic fingerprint data and the identification of clonal individuals held at multiple sites. Here we describe two methods of standardisation applied to the microsatellite fingerprinting of 429 Theobroma cacao L. trees representing 345 accessions held in the worlds largest Cocoa Intermediate Quarantine facility: the use of a partial allelic ladder through the production of 46 cloned and sequenced allelic standards (AJ748464 to AJ48509), and the use of standard genotypes selected to display a diverse allelic range. Until now a lack of accurate and transferable identification information has impeded efforts to genetically improve the cocoa crop. To address this need, a global initiative to fingerprint all international cocoa germplasm collections using a common set of 15 microsatellite markers is in progress. Data reported here have been deposited with the International Cocoa Germplasm Database and form the basis of a searchable resource for clonal identification. To our knowledge, this is the first quarantine facility to be completely genotyped using microsatellite markers for the purpose of quality control and clonal identification. Implications of the results for retrospective tracking of labelling errors are briefly explored.
Recent developments in genetic data analysis: what can they tell us about human demographic history?
Resumo:
Over the last decade, a number of new methods of population genetic analysis based on likelihood have been introduced. This review describes and explains the general statistical techniques that have recently been used, and discusses the underlying population genetic models. Experimental papers that use these methods to infer human demographic and phylogeographic history are reviewed. It appears that the use of likelihood has hitherto had little impact in the field of human population genetics, which is still primarily driven by more traditional approaches. However, with the current uncertainty about the effects of natural selection, population structure and ascertainment of single-nucleotide polymorphism markers, it is suggested that likelihood-based methods may have a greater impact in the future.
Resumo:
Many kernel classifier construction algorithms adopt classification accuracy as performance metrics in model evaluation. Moreover, equal weighting is often applied to each data sample in parameter estimation. These modeling practices often become problematic if the data sets are imbalanced. We present a kernel classifier construction algorithm using orthogonal forward selection (OFS) in order to optimize the model generalization for imbalanced two-class data sets. This kernel classifier identification algorithm is based on a new regularized orthogonal weighted least squares (ROWLS) estimator and the model selection criterion of maximal leave-one-out area under curve (LOO-AUC) of the receiver operating characteristics (ROCs). It is shown that, owing to the orthogonalization procedure, the LOO-AUC can be calculated via an analytic formula based on the new regularized orthogonal weighted least squares parameter estimator, without actually splitting the estimation data set. The proposed algorithm can achieve minimal computational expense via a set of forward recursive updating formula in searching model terms with maximal incremental LOO-AUC value. Numerical examples are used to demonstrate the efficacy of the algorithm.
Resumo:
The use of data reconciliation techniques can considerably reduce the inaccuracy of process data due to measurement errors. This in turn results in improved control system performance and process knowledge. Dynamic data reconciliation techniques are applied to a model-based predictive control scheme. It is shown through simulations on a chemical reactor system that the overall performance of the model-based predictive controller is enhanced considerably when data reconciliation is applied. The dynamic data reconciliation techniques used include a combined strategy for the simultaneous identification of outliers and systematic bias.
Resumo:
The background error covariance matrix, B, is often used in variational data assimilation for numerical weather prediction as a static and hence poor approximation to the fully dynamic forecast error covariance matrix, Pf. In this paper the concept of an Ensemble Reduced Rank Kalman Filter (EnRRKF) is outlined. In the EnRRKF the forecast error statistics in a subspace defined by an ensemble of states forecast by the dynamic model are found. These statistics are merged in a formal way with the static statistics, which apply in the remainder of the space. The combined statistics may then be used in a variational data assimilation setting. It is hoped that the nonlinear error growth of small-scale weather systems will be accurately captured by the EnRRKF, to produce accurate analyses and ultimately improved forecasts of extreme events.
Resumo:
In this study, we systematically compare a wide range of observational and numerical precipitation datasets for Central Asia. Data considered include two re-analyses, three datasets based on direct observations, and the output of a regional climate model simulation driven by a global re-analysis. These are validated and intercompared with respect to their ability to represent the Central Asian precipitation climate. In each of the datasets, we consider the mean spatial distribution and the seasonal cycle of precipitation, the amplitude of interannual variability, the representation of individual yearly anomalies, the precipitation sensitivity (i.e. the response to wet and dry conditions), and the temporal homogeneity of precipitation. Additionally, we carried out part of these analyses for datasets available in real time. The mutual agreement between the observations is used as an indication of how far these data can be used for validating precipitation data from other sources. In particular, we show that the observations usually agree qualitatively on anomalies in individual years while it is not always possible to use them for the quantitative validation of the amplitude of interannual variability. The regional climate model is capable of improving the spatial distribution of precipitation. At the same time, it strongly underestimates summer precipitation and its variability, while interannual variations are well represented during the other seasons, in particular in the Central Asian mountains during winter and spring
Resumo:
This paper investigates whether using natural logarithms (logs) of price indices for forecasting inflation rates is preferable to employing the original series. Univariate forecasts for annual inflation rates for a number of European countries and the USA based on monthly seasonal consumer price indices are considered. Stochastic seasonality and deterministic seasonality models are used. In many cases, the forecasts based on the original variables result in substantially smaller root mean squared errors than models based on logs. In turn, if forecasts based on logs are superior, the gains are typically small. This outcome sheds doubt on the common practice in the academic literature to forecast inflation rates based on differences of logs.
Resumo:
The purpose of this lecture is to review recent development in data analysis, initialization and data assimilation. The development of 3-dimensional multivariate schemes has been very timely because of its suitability to handle the many different types of observations during FGGE. Great progress has taken place in the initialization of global models by the aid of non-linear normal mode technique. However, in spite of great progress, several fundamental problems are still unsatisfactorily solved. Of particular importance is the question of the initialization of the divergent wind fields in the Tropics and to find proper ways to initialize weather systems driven by non-adiabatic processes. The unsatisfactory ways in which such processes are being initialized are leading to excessively long spin-up times.
Resumo:
It is well known that there is a dynamic relationship between cerebral blood flow (CBF) and cerebral blood volume (CBV). With increasing applications of functional MRI, where the blood oxygen-level-dependent signals are recorded, the understanding and accurate modeling of the hemodynamic relationship between CBF and CBV becomes increasingly important. This study presents an empirical and data-based modeling framework for model identification from CBF and CBV experimental data. It is shown that the relationship between the changes in CBF and CBV can be described using a parsimonious autoregressive with exogenous input model structure. It is observed that neither the ordinary least-squares (LS) method nor the classical total least-squares (TLS) method can produce accurate estimates from the original noisy CBF and CBV data. A regularized total least-squares (RTLS) method is thus introduced and extended to solve such an error-in-the-variables problem. Quantitative results show that the RTLS method works very well on the noisy CBF and CBV data. Finally, a combination of RTLS with a filtering method can lead to a parsimonious but very effective model that can characterize the relationship between the changes in CBF and CBV.