10 results for multidimensional data

in CentAUR: Central Archive University of Reading - UK


Relevance:

70.00%

Publisher:

Abstract:

Background: In many experimental pipelines, clustering of multidimensional biological datasets is used to detect hidden structures in unlabelled input data. Taverna is a popular workflow management system that is used to design and execute scientific workflows and aid in silico experimentation. The availability of fast unsupervised methods for clustering and visualization in the Taverna platform is important to support data-driven scientific discovery in complex and explorative bioinformatics applications. Results: This work presents a Taverna plugin, the Biological Data Interactive Clustering Explorer (BioDICE), that performs clustering of high-dimensional biological data and provides a nonlinear, topology-preserving projection for the visualization of the input data and their similarities. The core algorithm in the BioDICE plugin is the Fast Learning Self Organizing Map (FLSOM), an improved variant of the Self Organizing Map (SOM) algorithm. The plugin generates an interactive 2D map that allows the visual exploration of multidimensional data and the identification of groups of similar objects. The effectiveness of the plugin is demonstrated on a case study related to chemical compounds. Conclusions: The number and variety of available tools, together with its extensibility, have made Taverna a popular choice for the development of scientific data workflows. This work presents a novel plugin, BioDICE, which adds a data-driven knowledge discovery component to Taverna. BioDICE provides an effective and powerful clustering tool, which can be adopted for the explorative analysis of biological datasets.
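As a rough illustration of the mechanism FLSOM builds on, the sketch below implements a plain SOM training loop: each input is matched to its closest map node, and a shrinking Gaussian neighbourhood of nodes is pulled towards it. The grid size, decay schedules and random initialisation are illustrative assumptions, not BioDICE's actual implementation.

```python
# Minimal sketch of the classic SOM update that FLSOM refines.
# All parameters are illustrative assumptions, not BioDICE's.
import numpy as np

def train_som(data, rows=10, cols=10, epochs=20, lr0=0.5, sigma0=3.0):
    n_features = data.shape[1]
    rng = np.random.default_rng(0)
    weights = rng.random((rows, cols, n_features))
    # (row, col) coordinate of every node on the 2D map
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                indexing="ij"), axis=-1)
    for epoch in range(epochs):
        lr = lr0 * (1 - epoch / epochs)               # learning rate decays
        sigma = sigma0 * (1 - epoch / epochs) + 1e-3  # neighbourhood shrinks
        for x in data:
            # best-matching unit: node whose weight vector is closest to x
            dists = np.linalg.norm(weights - x, axis=-1)
            bmu = np.unravel_index(np.argmin(dists), dists.shape)
            # Gaussian neighbourhood pulls nodes near the BMU towards x
            d2 = np.sum((grid - np.array(bmu)) ** 2, axis=-1)
            h = np.exp(-d2 / (2 * sigma ** 2))[..., None]
            weights += lr * h * (x - weights)
    return weights
```

After training, similar input vectors map to nearby nodes, which is what makes the 2D map usable for visual exploration of groups of similar objects.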

Relevance:

60.00%

Publisher:

Abstract:

Visual exploration of scientific data in the life sciences is a growing research field due to the large amount of available data. Kohonen's Self Organizing Map (SOM) is a widely used tool for the visualization of multidimensional data. In this paper we present a fast learning algorithm for SOMs that uses a simulated annealing method to adapt the learning parameters. The algorithm has been adopted in a data analysis framework for the generation of similarity maps. Such maps provide an effective tool for the visual exploration of large and multidimensional input spaces. The approach has been applied to data generated during the high-throughput screening of molecular compounds; the generated maps allow a visual exploration of molecules with similar topological properties. The experimental analysis on real-world data from the National Cancer Institute shows the speed-up of the proposed SOM training process in comparison to a traditional approach. The resulting visual landscape groups molecules with similar chemical properties in densely connected regions.
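The abstract does not give the paper's precise annealing rule; the sketch below shows one plausible reading, an exponential cooling schedule that drives both the SOM learning rate and the neighbourhood radius. All constants are illustrative assumptions.

```python
# Hedged sketch: an exponential "cooling" schedule, in the spirit of the
# simulated-annealing adaptation described above, applied to the two SOM
# learning parameters. The paper's actual update rule may differ.
import math

def annealed_parameters(step, total_steps, lr0=0.9, sigma0=5.0, T0=1.0):
    T = T0 * math.exp(-5.0 * step / total_steps)  # temperature cools exponentially
    lr = lr0 * T                                  # learning rate tracks the temperature
    sigma = max(sigma0 * T, 0.5)                  # radius never collapses to zero
    return lr, sigma

# Example: parameters at the start, middle and end of training
for s in (0, 500, 999):
    print(s, annealed_parameters(s, 1000))
```

The practical point is that an aggressive early schedule lets the map reorganise quickly, which is where the claimed speed-up over a traditional fixed-schedule SOM comes from.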

Relevance:

60.00%

Publisher:

Abstract:

This paper describes the novel use of cluster analysis in the field of industrial process control. The severe multivariable process problems encountered in manufacturing have often led to machine shutdowns, where the need for corrective actions arises in order to resume operation. Production faults caused by processes running in less efficient regions may be prevented or diagnosed using reasoning based on cluster analysis. Indeed, the internal complexity of production machinery may be depicted in clusters of multidimensional data points which characterise the manufacturing process. The application of a Mean-Tracking cluster algorithm (developed in Reading) to field data acquired from high-speed machinery will be discussed. The objective of such an application is to illustrate how machine behaviour can be studied, in particular how regions of erroneous and stable running behaviour can be identified.
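The abstract does not specify the Mean-Tracking algorithm itself, so the sketch below shows only the generic idea of tracking local means (as in mean-shift clustering): each point repeatedly moves to the mean of its neighbours, and points that converge to the same mode form one cluster. All parameters are illustrative, not Reading's actual algorithm.

```python
# Illustrative only: generic "track the local mean" clustering.
# Each point moves to the mean of its neighbours within a radius;
# points that settle on (almost) the same mode form one cluster.
import numpy as np

def mean_seek(points, radius=1.0, iters=50, tol=1e-4):
    modes = points.copy()
    for _ in range(iters):
        shifted = np.empty_like(modes)
        for i, m in enumerate(modes):
            near = points[np.linalg.norm(points - m, axis=1) <= radius]
            shifted[i] = near.mean(axis=0) if len(near) else m
        if np.max(np.linalg.norm(shifted - modes, axis=1)) < tol:
            break  # all modes have converged
        modes = shifted
    return modes
```

In the process-control setting, the dense clusters would correspond to stable operating regions of the machinery, and outlying modes to erroneous running behaviour.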

Relevance:

40.00%

Publisher:

Abstract:

We describe ncWMS, an implementation of the Open Geospatial Consortium’s Web Map Service (WMS) specification for multidimensional gridded environmental data. ncWMS can read data in a large number of common scientific data formats – notably the NetCDF format with the Climate and Forecast conventions – then efficiently generate map imagery in thousands of different coordinate reference systems. It is designed to require minimal configuration from the system administrator and, when used in conjunction with a suitable client tool, provides end users with an interactive means for visualizing data without the need to download large files or interpret complex metadata. It is also used as a “bridging” tool providing interoperability between the environmental science community and users of geographic information systems. ncWMS implements a number of extensions to the WMS standard in order to fulfil some common scientific requirements, including the ability to generate plots representing timeseries and vertical sections. We discuss these extensions and their impact upon present and future interoperability. We discuss the conceptual mapping between the WMS data model and the data models used by gridded data formats, highlighting areas in which the mapping is incomplete or ambiguous. We discuss the architecture of the system and particular technical innovations of note, including the algorithms used for fast data reading and image generation. ncWMS has been widely adopted within the environmental data community and we discuss some of the ways in which the software is integrated within data infrastructures and portals.
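For readers unfamiliar with WMS, a GetMap request of the kind ncWMS serves looks like the sketch below. The host and layer name are hypothetical; the query parameters are standard WMS 1.3.0, with the TIME and ELEVATION dimensions selecting slices of a multidimensional dataset.

```python
# Constructing a standard WMS 1.3.0 GetMap request. Host and layer name
# are hypothetical; TIME and ELEVATION pick slices of a 4D dataset.
from urllib.parse import urlencode

params = {
    "SERVICE": "WMS",
    "VERSION": "1.3.0",
    "REQUEST": "GetMap",
    "LAYERS": "ocean/sea_water_temperature",  # hypothetical layer name
    "STYLES": "",
    "CRS": "EPSG:4326",
    "BBOX": "-90,-180,90,180",  # lat/lon axis order for EPSG:4326 in WMS 1.3.0
    "WIDTH": 1024,
    "HEIGHT": 512,
    "FORMAT": "image/png",
    "TIME": "2010-06-01T00:00:00Z",  # extra dimension: time slice
    "ELEVATION": "-5.0",             # extra dimension: depth level
}
url = "https://example.org/ncWMS/wms?" + urlencode(params)
print(url)
```

Any WMS-capable GIS client can issue such requests, which is what makes the service a bridge between environmental data formats like NetCDF and the GIS community.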

Relevance:

30.00%

Publisher:

Abstract:

A program is provided to determine structural parameters of atoms in or adsorbed on surfaces by refinement of atomistic models towards experimentally determined data generated by the normal incidence X-ray standing wave (NIXSW) technique. The method employs a combination of Differential Evolution genetic algorithms and steepest descent line minimisations to provide a fast, reliable and user-friendly tool for experimentalists to interpret complex multidimensional NIXSW data sets.
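As a hedged sketch of the global-search stage, the code below implements a textbook DE/rand/1/bin differential-evolution loop; the objective function, bounds and control parameters are illustrative stand-ins, not those of the NIXSW refinement program, and the steepest descent polishing stage is omitted.

```python
# Textbook DE/rand/1/bin differential evolution; illustrative only.
import numpy as np

def differential_evolution(objective, bounds, pop_size=20, F=0.8, CR=0.9, gens=200):
    rng = np.random.default_rng(1)
    dim = len(bounds)
    lo, hi = np.array(bounds).T
    pop = lo + rng.random((pop_size, dim)) * (hi - lo)
    cost = np.array([objective(x) for x in pop])
    for _ in range(gens):
        for i in range(pop_size):
            # pick three distinct members other than i
            a, b, c = pop[rng.choice([j for j in range(pop_size) if j != i],
                                     3, replace=False)]
            mutant = np.clip(a + F * (b - c), lo, hi)  # DE/rand/1 mutation
            cross = rng.random(dim) < CR               # binomial crossover
            cross[rng.integers(dim)] = True            # keep at least one mutant gene
            trial = np.where(cross, mutant, pop[i])
            t_cost = objective(trial)
            if t_cost < cost[i]:                       # greedy selection
                pop[i], cost[i] = trial, t_cost
    return pop[np.argmin(cost)]

# Example: minimise a toy two-parameter misfit function
best = differential_evolution(lambda x: (x[0] - 0.3) ** 2 + (x[1] + 1.2) ** 2,
                              bounds=[(-2, 2), (-2, 2)])
print(best)
```

In a refinement of this kind, the objective would be the misfit between the NIXSW data predicted by the atomistic model and the measured data, with DE providing the global search and steepest descent refining the best candidate.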

Relevance:

30.00%

Publisher:

Abstract:

Background: Expression microarrays are increasingly used to obtain large-scale transcriptomic information on a wide range of biological samples. Nevertheless, there is still much debate on the best ways to process data, to design experiments and to analyse the output. Furthermore, many of the more sophisticated mathematical approaches to data analysis in the literature remain inaccessible to much of the biological research community. In this study we examine ways of extracting and analysing a large data set obtained using the Agilent long oligonucleotide transcriptomics platform, applied to a set of human macrophage and dendritic cell samples. Results: We describe and validate a series of data extraction, transformation and normalisation steps which are implemented via a new R function. Analysis of replicate normalised reference data demonstrates that intra-array variability is small (only around 2% of the mean log signal), while inter-array variability from replicate array measurements has a standard deviation (SD) of around 0.5 log2 units (6% of the mean). The common practice of working with ratios of Cy5/Cy3 signal offers little further improvement in terms of reducing error. Comparison to expression data obtained using Arabidopsis samples demonstrates that the large number of genes in each sample showing a low level of transcription reflects the real complexity of the cellular transcriptome. Multidimensional scaling is used to show that the processed data identify an underlying structure which reflects some of the key biological variables that define the data set. This structure is robust, allowing reliable comparison of samples collected over a number of years and by a variety of operators. Conclusions: This study outlines a robust and easily implemented pipeline for extracting, transforming, normalising and visualising transcriptomic array data from the Agilent expression platform. The analysis is used to obtain quantitative estimates of the SD arising from experimental (non-biological) intra- and inter-array variability, and a lower threshold for determining whether an individual gene is expressed. The study provides a reliable basis for further, more extensive studies of the systems biology of eukaryotic cells.
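As an illustration of the multidimensional scaling step, the sketch below projects per-sample expression profiles into two dimensions with scikit-learn. The data are random stand-ins, not the Agilent measurements, and the paper's R pipeline is not reproduced here.

```python
# Hedged sketch: MDS of per-sample expression profiles. Synthetic data
# stand in for the normalised Agilent measurements described above.
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(2)
# 12 samples x 500 genes of log2 expression values (synthetic)
expression = rng.normal(loc=8.0, scale=0.5, size=(12, 500))
expression[:6] += 1.0  # pretend the first six samples share a condition

coords = MDS(n_components=2, random_state=0).fit_transform(expression)
print(coords)  # samples from the same condition should land near each other
```

A plot of these coordinates is the kind of view the study uses to show that key biological variables, rather than collection year or operator, dominate the structure of the data set.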

Relevance:

30.00%

Publisher:

Abstract:

Exascale systems are the next frontier in high-performance computing and are expected to deliver a performance of the order of 10^18 operations per second using massive multicore processors. Very large- and extreme-scale parallel systems pose critical algorithmic challenges, especially related to concurrency, locality and the need to avoid global communication patterns. This work investigates a novel protocol for dynamic group communication that can be used to remove the global communication requirement and to reduce the communication cost in parallel formulations of iterative data mining algorithms. The protocol is used to provide a communication-efficient parallel formulation of the k-means algorithm for cluster analysis. The approach is based on a collective communication operation for dynamic groups of processes and exploits non-uniform data distributions. Non-uniform data distributions can either be found in real-world distributed applications or be induced by means of multidimensional binary search trees. The analysis of the proposed dynamic group communication protocol has shown that it does not introduce significant communication overhead. The parallel clustering algorithm has also been extended to accommodate an approximation error, which allows a further reduction of the communication costs. The effectiveness of the exact and approximate methods has been tested on a parallel computing system with 64 processors and in simulations with 1024 processing elements.
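For context, the sketch below shows the baseline communication pattern the protocol improves on: in a standard parallel k-means step, each process computes local per-centroid sums and counts, then a global allreduce combines them. The paper's dynamic group protocol, which replaces this global operation, is not reproduced here.

```python
# Baseline parallel k-means step with a GLOBAL reduction (the pattern
# the paper's dynamic group communication protocol seeks to avoid).
import numpy as np
from mpi4py import MPI

def parallel_kmeans_step(local_points, centroids):
    comm = MPI.COMM_WORLD
    k, dim = centroids.shape
    # assign each local point to its nearest centroid
    d = np.linalg.norm(local_points[:, None, :] - centroids[None, :, :], axis=2)
    labels = np.argmin(d, axis=1)
    sums = np.zeros((k, dim))
    counts = np.zeros(k)
    for j in range(k):
        mask = labels == j
        sums[j] = local_points[mask].sum(axis=0)
        counts[j] = mask.sum()
    # global communication step involving ALL processes
    global_sums = comm.allreduce(sums, op=MPI.SUM)
    global_counts = comm.allreduce(counts, op=MPI.SUM)
    return global_sums / np.maximum(global_counts, 1)[:, None]
```

With non-uniform data distributions, most processes contribute nothing to most centroids, which is why restricting each reduction to a dynamic group of relevant processes can cut the communication cost.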

Relevance:

30.00%

Publisher:

Abstract:

Three methodological limitations of English-Chinese contrastive rhetoric research have been identified in previous work, namely: the failure to control for the quality of L1 data; an inference-based approach to interpreting the relationship between L1 and L2 writing; and a focus on national cultural factors in interpreting rhetorical differences. Addressing these limitations, the current study examined the presence or absence, and the placement, of thesis statements and topic sentences in four sets of argumentative texts produced by three groups of university students. We found that Chinese students tended to favour a direct/deductive approach in both their English and Chinese writing, while native English writers typically adopted an indirect/inductive approach. This study argues for a dynamic and ecological interpretation of rhetorical practices in different languages and cultures.

Relevance:

30.00%

Publisher:

Abstract:

Purpose – This paper aims to address the gaps in service recovery strategy assessment. An effective service recovery strategy that prevents customer defection after a service failure is a powerful managerial instrument. The literature to date does not present a comprehensive assessment of service recovery strategy. It also lacks a clear picture of the service recovery actions at managers’ disposal in case of failure and the effectiveness of individual strategies on customer outcomes. Design/methodology/approach – Based on service recovery theory, this paper proposes a formative index of service recovery strategy and empirically validates this measure using partial least-squares path modelling with survey data from 437 complainants in the telecommunications industry in Egypt. Findings – The CURE scale (CUstomer REcovery scale) presents evidence of reliability as well as convergent, discriminant and nomological validity. Findings also reveal that problem-solving, speed of response, effort, facilitation and apology are the actions that have an impact on the customer’s satisfaction with service recovery. Practical implications – This new formative index is of potential value in investigating links between strategy and customer evaluations of service by helping managers identify which actions contribute most to changes in the overall service recovery strategy as well as satisfaction with service recovery. Ultimately, the CURE scale facilitates the long-term planning of effective complaint management. Originality/value – This is the first study in the service marketing literature to propose a comprehensive assessment of service recovery strategy and clearly identify the service recovery actions that contribute most to changes in the overall service recovery strategy.

Relevance:

30.00%

Publisher:

Abstract:

Bloom filters are a data structure for storing data in a compressed form. They offer excellent space and time efficiency at the cost of some loss of accuracy (so-called lossy compression). This work presents a yes-no Bloom filter, which is a data structure consisting of two parts: the yes-filter, which is a standard Bloom filter, and the no-filter, which is another Bloom filter whose purpose is to represent those objects that were recognised incorrectly by the yes-filter (that is, to recognise the false positives of the yes-filter). By querying the no-filter after an object has been recognised by the yes-filter, we get a chance of rejecting it, which improves the accuracy of data recognition in comparison with a standard Bloom filter of the same total length. A further increase in accuracy is possible if one chooses the objects to include in the no-filter so that it recognises as many false positives as possible but no true positives, thus producing the most accurate yes-no Bloom filter among all yes-no Bloom filters. This paper studies how optimization techniques can be used to maximize the number of false positives recognised by the no-filter, under the constraint that it recognises no true positives. To achieve this aim, an Integer Linear Program (ILP) is proposed for the optimal selection of false positives. In practice the problem size is normally large, making the optimal solution intractable. Exploiting the similarity of the ILP to the Multidimensional Knapsack Problem, an Approximate Dynamic Programming (ADP) model is developed, making use of a reduced ILP for the value function approximation. Numerical results show that the ADP model performs best in comparison with a number of heuristics as well as the CPLEX built-in branch-and-bound solver, and it is the method we can recommend for use in yes-no Bloom filters. In the wider context of the study of lossy compression algorithms, our research is an example of how the arsenal of optimization methods can be applied to improving the accuracy of compressed data.
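A minimal sketch of the yes-no query logic described above, assuming simple SHA-256-based hashing: an item is accepted only if the yes-filter matches it and the no-filter does not. Filter sizes and hash counts are illustrative, and the ILP/ADP selection of which false positives to store is not shown.

```python
# Minimal yes-no Bloom filter: the yes-filter stores the set, the
# no-filter stores selected false positives of the yes-filter.
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k        # m bits, k hash functions
        self.bits = bytearray(m)

    def _positions(self, item):
        # k derived hash positions (illustrative SHA-256 construction)
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        return all(self.bits[p] for p in self._positions(item))

class YesNoBloomFilter:
    def __init__(self, m_yes, m_no, k=4):
        self.yes = BloomFilter(m_yes, k)
        self.no = BloomFilter(m_no, k)

    def add(self, item):
        self.yes.add(item)

    def add_false_positive(self, item):
        # only items NOT in the stored set may go here; the paper's
        # ILP/ADP decides which false positives are most worth storing
        self.no.add(item)

    def __contains__(self, item):
        # accept only if the yes-filter matches and the no-filter rejects
        return item in self.yes and item not in self.no
```

Because the no-filter can itself produce false positives, storing a true positive in it would wrongly reject a stored object; that is exactly the "no true positives" constraint the ILP enforces.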