983 results for scalable analysis
Abstract:
Real-time geoparsing of social media streams (e.g. Twitter, YouTube, Instagram, Flickr, FourSquare) is providing a new 'virtual sensor' capability to end users such as emergency response agencies (e.g. Tsunami early warning centres, Civil protection authorities) and news agencies (e.g. Deutsche Welle, BBC News). Challenges in this area include scaling up natural language processing (NLP) and information retrieval (IR) approaches to handle real-time traffic volumes, reducing false positives, creating real-time infographic displays useful for effective decision support and providing support for trust and credibility analysis using geosemantics. In this seminar I will present ongoing work by the IT Innovation Centre over the last 4 years (TRIDEC and REVEAL FP7 projects) in building such systems, and highlight our research towards improving the trustworthiness and credibility of crisis map displays and real-time analytics for trending topics and influential social networks during major newsworthy events.
Abstract:
Accurately and reliably identifying the actual number of clusters present within a dataset of gene expression profiles, when no additional information on cluster structure is available, is a problem addressed by few algorithms. GeneMCL transforms microarray analysis data into a graph consisting of nodes connected by edges, where the nodes represent genes, and the edges represent the similarity in expression of those genes, as given by a proximity measurement. This measurement is taken to be the Pearson correlation coefficient combined with a local non-linear rescaling step. The resulting graph is input to the Markov Cluster (MCL) algorithm, an elegant, deterministic, non-specific and scalable method that models stochastic flow through the graph. The algorithm is inherently affected by any cluster structure present, and rapidly decomposes a graph into cohesive clusters. The potential of the GeneMCL algorithm is demonstrated with a 5730 gene subset (IGS) of the Van't Veer breast cancer database, for which the clusterings are shown to reflect underlying biological mechanisms. (c) 2005 Elsevier Ltd. All rights reserved.
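As a rough illustration of the graph-construction step described above, the following Python sketch builds a gene similarity graph from an expression matrix using the Pearson correlation; the threshold value and the omission of the non-linear rescaling step and of the MCL clustering itself are assumptions made for the example, not details taken from the paper.

```python
# Minimal sketch (not the published GeneMCL implementation): build a gene
# similarity graph from an expression matrix via the Pearson correlation.
import numpy as np

def expression_graph(expr, threshold=0.7):
    """expr: (n_genes, n_samples) array of expression profiles."""
    corr = np.corrcoef(expr)              # pairwise Pearson correlation of rows
    edges = []
    n = corr.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            if corr[i, j] >= threshold:   # keep sufficiently similar gene pairs
                edges.append((i, j, corr[i, j]))
    return edges

# toy usage: 5 genes, 4 samples of random data
rng = np.random.default_rng(0)
print(expression_graph(rng.normal(size=(5, 4))))
```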
Abstract:
Automatic indexing and retrieval of digital data poses major challenges. The main problem arises from the ever-increasing mass of digital media and the lack of efficient methods for indexing and retrieval of such data based on the semantic content rather than keywords. To enable intelligent web interactions, or even web filtering, we need to be capable of interpreting the information base in an intelligent manner. For a number of years research has been ongoing in the field of ontological engineering with the aim of using ontologies to add such (meta) knowledge to information. In this paper, we describe the architecture of a system (Dynamic REtrieval Analysis and semantic metadata Management (DREAM)) designed to automatically and intelligently index huge repositories of special effects video clips, based on their semantic content, using a network of scalable ontologies to enable intelligent retrieval. The DREAM Demonstrator has been evaluated as deployed in the film post-production phase to support the process of storage, indexing and retrieval of large data sets of special effects video clips as an exemplar application domain. This paper provides its performance and usability results and highlights the scope for future enhancements of the DREAM architecture, which has proven successful in its first and possibly most challenging proving ground, namely film production, where it is already in routine use within our testbed partners' creative processes. (C) 2009 Published by Elsevier B.V.
Abstract:
K-Means is a popular clustering algorithm which adopts an iterative refinement procedure to determine data partitions and to compute their associated centres of mass, called centroids. The straightforward implementation of the algorithm is often referred to as `brute force' since it computes a proximity measure from each data point to each centroid at every iteration of the K-Means process. Efficient implementations of the K-Means algorithm have been predominantly based on multi-dimensional binary search trees (KD-Trees). A combination of an efficient data structure and geometrical constraints allows the number of distance computations required at each iteration to be reduced. In this work we present a general space partitioning approach for improving the efficiency and the scalability of the K-Means algorithm. We propose to adopt approximate hierarchical clustering methods to generate binary space partitioning trees, in contrast to KD-Trees. In the experimental analysis, we have tested the performance of the proposed Binary Space Partitioning K-Means (BSP-KM) when a divisive clustering algorithm is used. We have carried out extensive experimental tests to compare the proposed approach to the one based on KD-Trees (KD-KM) in a wide range of the parameter space. BSP-KM is more scalable than KD-KM, while keeping the deterministic nature of the `brute force' algorithm. In particular, the proposed space partitioning approach has been shown to overcome the well-known limitation of KD-Trees in high-dimensional spaces and can also be adopted to improve the efficiency of other algorithms in which KD-Trees have been used.
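For concreteness, a minimal Python sketch of the `brute force' iteration described above is given below; it is not the authors' implementation, and the KD-Tree and BSP-Tree accelerated variants are deliberately not reproduced.

```python
# Brute-force K-Means: every point is compared against every centroid
# in every iteration, which is exactly the cost the tree-based variants avoid.
import numpy as np

def kmeans_brute_force(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # distances from every point to every centroid: shape (n_points, k)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute centres of mass; keep the old centroid if a cluster empties
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# toy usage: two well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])
print(kmeans_brute_force(X, k=2)[1])
```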
Abstract:
The induction of classification rules from previously unseen examples is one of the most important data mining tasks in science as well as commercial applications. In order to reduce the influence of noise in the data, ensemble learners are often applied. However, most ensemble learners are based on decision tree classifiers, which are affected by noise. The Random Prism classifier has recently been proposed as an alternative to the popular Random Forests classifier, which is based on decision trees. Random Prism is based on the Prism family of algorithms, which is more robust to noise. However, like most ensemble classification approaches, Random Prism also does not scale well on large training data. This paper presents a thorough discussion of Random Prism and a recently proposed parallel version of it called Parallel Random Prism. Parallel Random Prism is based on the MapReduce programming paradigm. The paper provides, for the first time, a novel theoretical analysis of the proposed technique and an in-depth experimental study, which show that Parallel Random Prism scales well on a large number of training examples, a large number of data features and a large number of processors. The expressiveness of the decision rules that our technique produces makes it a natural choice for Big Data applications where informed decision making increases the user's trust in the system.
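The following Python fragment is only a schematic sketch of the map/reduce ensemble structure discussed above: the Prism rule inducer is replaced by a trivial placeholder learner, so it shows the parallel training and voting structure rather than the actual Parallel Random Prism algorithm.

```python
# Schematic map/reduce-style ensemble: map = train one base classifier per
# data sample in parallel, reduce = combine the classifiers by majority vote.
# The base learner below is a placeholder, NOT the Prism algorithm.
from collections import Counter
from multiprocessing import Pool
import random

def train_base_classifier(sample):
    """Placeholder base learner: just returns the majority class of the sample."""
    labels = [label for _, label in sample]
    return Counter(labels).most_common(1)[0][0]

def map_phase(dataset, n_classifiers=10, workers=4):
    samples = [random.choices(dataset, k=len(dataset)) for _ in range(n_classifiers)]
    with Pool(workers) as pool:
        return pool.map(train_base_classifier, samples)   # one model per sample

def reduce_phase(models):
    """Majority vote over the base classifiers (here: over constant predictions)."""
    return Counter(models).most_common(1)[0][0]

if __name__ == "__main__":
    data = [((0, 1), "yes"), ((1, 1), "yes"), ((1, 0), "no"), ((0, 0), "no")]
    print(reduce_phase(map_phase(data, n_classifiers=5, workers=2)))
```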
Abstract:
Security administrators face the challenge of designing, deploying and maintaining a variety of configuration files related to security systems, especially in large-scale networks. These files have heterogeneous syntaxes and follow differing semantic concepts. Nevertheless, they are interdependent, since security services have to cooperate and their configurations have to be consistent with each other, so that global security policies are completely and correctly enforced. To tackle this problem, our approach supports a convenient definition of an abstract high-level security policy and provides an automated derivation of the desired configuration files. It is an extension of policy-based management and policy hierarchies, combining model-based management (MBM) with system modularization. MBM employs an object-oriented model of the managed system to obtain the details needed for automated policy refinement. The modularization into abstract subsystems (ASs) segments the system, and the model, into units which more closely encapsulate related system components and provide focused abstract views. As a result, scalability is achieved and even comprehensive IT systems can be modelled in a unified manner. The associated tool MoBaSeC (Model-Based-Service-Configuration) supports interactive graphical modelling, automated model analysis and policy refinement with the derivation of configuration files. We describe the MBM and AS approaches, outline the tool functions and exemplify their applications and the results obtained. Copyright (C) 2010 John Wiley & Sons, Ltd.
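Purely as an illustration of the refinement idea (an abstract policy over an object-oriented system model being turned into a concrete configuration), a small Python sketch follows; the class names and the service-to-port mapping are invented for the example and are not part of MoBaSeC.

```python
# Illustrative policy refinement: an abstract "allow web traffic" policy over
# an object model is refined into a concrete firewall-style rule string.
from dataclasses import dataclass

@dataclass
class Host:
    name: str
    ip: str

@dataclass
class Policy:
    source: Host
    destination: Host
    service: str               # abstract service name, e.g. "web"

SERVICE_PORTS = {"web": 80, "mail": 25}    # assumed mapping for the example

def derive_firewall_rule(policy: Policy) -> str:
    """Refine one abstract policy into one concrete (iptables-like) rule."""
    port = SERVICE_PORTS[policy.service]
    return (f"ACCEPT src={policy.source.ip} dst={policy.destination.ip} "
            f"proto=tcp dport={port}")

lan_client = Host("client1", "10.0.0.5")
web_server = Host("www", "192.168.1.10")
print(derive_firewall_rule(Policy(lan_client, web_server, "web")))
```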
Abstract:
In this work, we propose a Geographical Information System that can be used as a tool for the treatment and study of problems related to environmental and city management issues. It is based on the Scalable Vector Graphics (SVG) standard for Web development of graphics. The project uses the concept of remote and real-time map creation through database access, with instructions executed by browsers on the Internet. As a way of proving the system's effectiveness, we present two case studies: the first on a region named Maracajaú Coral Reefs, located on the Rio Grande do Norte coast, and the second in northeastern Switzerland, in which we intended to promote the substitution of MapServer by the system proposed here. We also show some results that demonstrate the greater geographical data handling capability achieved by the use of standardized codes and open source tools, such as the Extensible Markup Language (XML), the Document Object Model (DOM), the script languages ECMAScript/JavaScript, the Hypertext Preprocessor (PHP) and PostgreSQL with its extension PostGIS.
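The sketch below, written in Python rather than the PHP/ECMAScript stack used by the project, only illustrates the basic idea of generating an SVG element server-side from projected coordinates; the coordinate values and sizes are invented for the example.

```python
# Turn a projected polygon into an SVG fragment, as a stand-in for the kind of
# real-time, database-driven SVG map generation described in the abstract.
def polygon_to_svg(points, width=400, height=400):
    """points: list of (x, y) pairs already projected to pixel space."""
    path = " ".join(f"{x},{y}" for x, y in points)
    return (f'<svg xmlns="http://www.w3.org/2000/svg" '
            f'width="{width}" height="{height}">'
            f'<polygon points="{path}" fill="none" stroke="black"/></svg>')

# toy outline standing in for a geometry fetched from a spatial database
reef_outline = [(10, 10), (120, 40), (200, 180), (60, 220)]
print(polygon_to_svg(reef_outline))
```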
Abstract:
Although cluster environments have enormous potential processing power, real applications that take advantage of this power remain an elusive goal. This is due, in part, to the lack of understanding about the characteristics of the applications best suited for these environments. This paper focuses on Master/Slave applications for large heterogeneous clusters. It defines application, cluster and execution models to derive an analytic expression for the execution time. It defines speedup and derives speedup bounds based on the inherent parallelism of the application and the aggregated computing power of the cluster. The paper derives an analytical expression for efficiency and uses it to define scalability of the algorithm-cluster combination based on the isoefficiency metric. Furthermore, the paper establishes necessary and sufficient conditions for an algorithm-cluster combination to be scalable, which are easy to verify and use in practice. Finally, it covers the impact of network contention as the number of processors grows. (C) 2007 Elsevier B.V. All rights reserved.
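For reference, the standard textbook definitions behind the quantities mentioned above are recalled below; the paper's own expressions for heterogeneous Master/Slave clusters are not reproduced here.

```latex
% Standard definitions of speedup and efficiency on p processors:
S(p) = \frac{T(1)}{T(p)}, \qquad
E(p) = \frac{S(p)}{p} = \frac{T(1)}{p\,T(p)}
% Isoefficiency: keep E constant as p grows by growing the problem size W so that
W = K\,T_o(W, p), \qquad K = \frac{E}{1-E},
% where T_o(W, p) denotes the total overhead of the parallel execution.
```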
Abstract:
Cloud computing provides a promising solution to the genomics data deluge problem resulting from the advent of next-generation sequencing (NGS) technology. Based on the concepts of “resources-on-demand” and “pay-as-you-go”, scientists with no or limited infrastructure can have access to scalable and cost-effective computational resources. However, the large size of NGS data causes a significant data transfer latency from the client’s site to the cloud, which presents a bottleneck for using cloud computing services. In this paper, we provide a streaming-based scheme to overcome this problem, where the NGS data is processed while being transferred to the cloud. Our scheme targets the wide class of NGS data analysis tasks, where the NGS sequences can be processed independently from one another. We also provide the elastream package that supports the use of this scheme with individual analysis programs or with workflow systems. Experiments presented in this paper show that our solution mitigates the effect of data transfer latency and saves both time and cost of computation.
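The following self-contained Python sketch illustrates the general streaming idea (processing reads as they arrive rather than after the full transfer completes); it is not the elastream package, and the per-read analysis is a stand-in chosen only so the example runs.

```python
# Stream a FASTQ source record by record and process each read immediately,
# instead of waiting for the whole dataset to be transferred first.
import io

def process_read(read_id, sequence):
    """Stand-in per-read analysis (here: GC content), applied independently."""
    gc = sum(base in "GC" for base in sequence) / max(len(sequence), 1)
    return read_id, gc

def stream_fastq(handle):
    """Consume a FASTQ stream, yielding one result per read as it arrives."""
    while True:
        header = handle.readline()
        if not header:
            break
        seq = handle.readline().strip()
        handle.readline()          # '+' separator line
        handle.readline()          # quality line
        yield process_read(header.strip().lstrip("@"), seq)

# toy in-memory stream standing in for data arriving over the network
fastq = io.StringIO("@read1\nACGTGC\n+\nIIIIII\n@read2\nGGGCCC\n+\nIIIIII\n")
for result in stream_fastq(fastq):
    print(result)
```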
Abstract:
Here we report the first study on the electrochemical energy storage application of a surface-immobilized ruthenium complex multilayer thin film with anion storage capability. We employed a novel dinuclear ruthenium complex with tetrapodal anchoring groups to build well-ordered redox-active multilayer coatings on an indium tin oxide (ITO) surface using a layer-by-layer self-assembly process. Cyclic voltammetry (CV), UV-Visible (UV-Vis) and Raman spectroscopy showed a linear increase of peak current, absorbance and Raman intensities, respectively, with the number of layers. These results indicate the formation of well-ordered multilayers of the ruthenium complex on ITO, which is further supported by the X-ray photoelectron spectroscopy analysis. The thickness of the layers can be controlled with nanometer precision. In particular, the thickest layer studied (65 molecular layers and approx. 120 nm thick) demonstrated fast electrochemical oxidation/reduction, indicating a very low attenuation of the charge transfer within the multilayer. In situ UV-Vis and resonance Raman spectroscopy results demonstrated the reversible electrochromic/redox behavior of the ruthenium complex multilayered films on ITO with respect to the electrode potential, which is an ideal prerequisite for e.g. smart electrochemical energy storage applications. Galvanostatic charge-discharge experiments demonstrated a pseudocapacitor behavior of the multilayer film, with a good specific capacitance of 92.2 F g−1 at a current density of 10 μA cm−2 and an excellent cycling stability. As demonstrated in our prototypical experiments, the fine control of physicochemical properties at the nanometer scale and the relatively good stability of the layers under ambient conditions make multilayer coatings of this type an excellent material for e.g. electrochemical energy storage, as interlayers in inverted bulk heterojunction solar cell applications and as functional components in molecular electronics applications.
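The specific capacitance quoted above is typically extracted from galvanostatic discharge curves via the standard relation below; the discharge parameters themselves are not given in the abstract.

```latex
% Standard relation for specific capacitance from a galvanostatic discharge curve
% (I: discharge current, \Delta t: discharge time, m: active mass,
%  \Delta V: potential window):
C_{sp} = \frac{I\,\Delta t}{m\,\Delta V}
```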
Abstract:
Static analyses of object-oriented programs usually rely on intermediate representations that respect the original semantics while having a more uniform and basic syntax. Most of the work involving object-oriented languages and abstract interpretation usually omits the description of that language or just refers to the Control Flow Graph (CFG) it represents. However, this lack of formalization results, on the one hand, in an absence of assurances regarding the correctness of the transformation and, on the other, typically couples the analysis strongly to the source language. In this work we present a framework for the analysis of object-oriented languages in which, in a first phase, we transform the input program into a representation based on Horn clauses. This allows, on the one hand, proving the transformation correct subject to a simple condition and, on the other, applying an existing analyzer for (constraint) logic programming to automatically derive a safe approximation of the semantics of the original program. The approach is flexible in the sense that the first phase decouples the analyzer from most language-dependent features, and correct because the set of Horn clauses returned by the transformation phase safely approximates the standard semantics of the input program. The resulting analysis is also reasonably scalable due to the use of mature, modular (C)LP-based analyzers. The overall approach allows us to report results for medium-sized programs.
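As a purely illustrative example of the kind of Horn-clause representation mentioned above (not taken from the paper), a simple counting loop can be encoded as follows, after which a (C)LP analyzer can compute a safe approximation of its semantics.

```latex
% Illustrative only: the imperative loop
%   while (i < n) { i := i + 1; }
% encoded as (constrained) Horn clauses:
\mathit{loop}(I, N, R) \leftarrow I \geq N \wedge R = I
\mathit{loop}(I, N, R) \leftarrow I < N \wedge I' = I + 1 \wedge \mathit{loop}(I', N, R)
```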
Abstract:
We study the problem of efficient, scalable set-sharing analysis of logic programs. We use the idea of representing sharing information as a pair of abstract substitutions, one of which is a worst-case sharing representation called a clique set, which was previously proposed for the case of inferring pair-sharing. We use the clique-set representation for (1) inferring actual set-sharing information, and (2) analysis within a top-down framework. In particular, we define the new abstract functions required by standard top-down analyses, both for sharing alone and also for the case of including freeness in addition to sharing. We use cliques both as an alternative representation and as widening, defining several widening operators. Our experimental evaluation supports the conclusion that, for inferring set-sharing, as it was the case for inferring pair-sharing, precision losses are limited, while useful efficiency gains are obtained. We also derive useful conclusions regarding the interactions between thresholds, precision, efficiency and cost of widening. At the limit, the clique-set representation allowed analyzing some programs that exceeded memory capacity using classical sharing representations.
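A small sketch of the pair representation follows, under the assumption (ours, for the example) that a clique stands for every non-empty subset of its variables; it only shows how explicit sharing groups already covered by a clique can be dropped, which is where the compact representation saves space.

```python
# Pair representation sketch: (cliques, explicit sharing groups), where a clique
# C is assumed to denote every non-empty subset of C. Groups already implied by
# some clique need not be stored explicitly.
def normalize(cliques, sharing_groups):
    """cliques, sharing_groups: sets of frozensets of variable names."""
    kept = {g for g in sharing_groups
            if not any(g <= c for c in cliques)}   # g already covered by a clique
    return cliques, kept

cliques = {frozenset({"X", "Y", "Z"})}
groups = {frozenset({"X"}), frozenset({"X", "Y"}), frozenset({"W"})}
print(normalize(cliques, groups))   # only {W} remains explicit
```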
Abstract:
Set-Sharing analysis, based on the classic Jacobs and Langen domain, has been widely used to infer several interesting properties of programs at compile-time, such as occurs-check reduction, automatic parallelization, finite-tree analysis, etc. However, performing abstract unification over this domain implies the use of a closure operation which makes the number of sharing groups grow exponentially. Much attention has been given in the literature to mitigating this key inefficiency in this otherwise very useful domain. In this paper we present two novel alternative representations for the traditional set-sharing domain, tSH and tNSH, which efficiently compress the elements into fewer elements, enabling more efficient abstract operations, including abstract unification, without any loss of accuracy. Our experimental evaluation supports that both representations can dramatically reduce the number of sharing groups, showing they can be more practical solutions towards scalable set-sharing.
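To see why the closure step is the source of the exponential growth mentioned above, the following Python sketch computes the star-union (all unions of non-empty subsets) of a small set of sharing groups; the compressed tSH/tNSH representations proposed in the paper are not reproduced here.

```python
# Star-union closure over sharing groups: the set of all unions of non-empty
# subsets, which in the worst case yields up to 2^n - 1 groups from n groups.
from itertools import combinations

def star_union(sharing_groups):
    """sharing_groups: iterable of frozensets of variable names."""
    groups = list(sharing_groups)
    closure = set()
    for r in range(1, len(groups) + 1):
        for subset in combinations(groups, r):
            closure.add(frozenset().union(*subset))
    return closure

groups = [frozenset({"X"}), frozenset({"X", "Y"}), frozenset({"Z"})]
print(sorted(map(sorted, star_union(groups))))
```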
Abstract:
We study the problem of efficient, scalable set-sharing analysis of logic programs. We use the idea of representing sharing information as a pair of abstract substitutions, one of which is a worst-case sharing representation called a clique set, which was previously proposed for the case of inferring pair-sharing. We use the clique-set representation for (1) inferring actual set-sharing information, and (2) analysis within a top-down framework. In particular, we define the abstract functions required by standard top-down analyses, both for sharing alone and also for the case of including freeness in addition to sharing. Our experimental evaluation supports the conclusion that, for inferring set-sharing, as it was the case for inferring pair-sharing, precision losses are limited, while useful efficiency gains are obtained. At the limit, the clique-set representation allowed analyzing some programs that exceeded memory capacity using classical sharing representations.