36 results for Scientific Workflows
at Universidad Politécnica de Madrid
Abstract:
Virtualized Infrastructures are a promising way to provide flexible and dynamic computing solutions for resource-consuming tasks. Scientific Workflows are one such kind of task, as they need a large amount of computational resources during certain periods of time. To provide the best infrastructure configuration for a workflow, it is necessary to explore as many providers as possible, taking into account criteria such as Quality of Service, pricing, response time and network latency. Moreover, each of these new resources must be tuned to provide the tools and dependencies required by each step of the workflow. Working with different infrastructure providers, whether public or private, each using its own concepts and terms, and with a set of heterogeneous applications requires a framework for integrating all the information about these elements. This work proposes semantic technologies for describing and integrating all the information about the different components of the overall system and a set of policies created by the user. Based on this information, a scheduling process is performed to generate an infrastructure configuration that defines the set of virtual machines that must be run and the tools that must be deployed on them.
Abstract:
While workflow technology has gained momentum in the last decade as a means for specifying and enacting computational experiments in modern science, reusing and repurposing existing workflows to build new scientific experiments is still a daunting task. This is partly due to the difficulty that scientists experience when attempting to understand existing workflows, which contain several data preparation and adaptation steps in addition to the scientifically significant analysis steps. One way to tackle the understandability problem is to provide abstractions that give a high-level view of the activities undertaken within workflows. As a first step towards abstractions, we report in this paper on the results of a manual analysis performed over a set of real-world scientific workflows from the Taverna and Wings systems. Our analysis has resulted in a set of scientific workflow motifs that outline (i) the kinds of data-intensive activities that are observed in workflows (data-oriented motifs), and (ii) the different manners in which activities are implemented within workflows (workflow-oriented motifs). These motifs can be used to inform workflow designers about good and bad practices for workflow development, and to inform the design of automated tools for generating workflow abstractions.
Abstract:
Workflow technology continues to play an important role as a means for specifying and enacting computational experiments in modern science. Reusing and repurposing workflows allows scientists to do new experiments faster, since the workflows capture useful expertise from others. As workflow libraries grow, scientists face the challenge of finding workflows appropriate for their task, understanding what each workflow does, and reusing relevant portions of a given workflow. We believe that workflows would be easier to understand and reuse if high-level views (abstractions) of their activities were available in workflow libraries. As a first step towards obtaining these abstractions, we report in this paper on the results of a manual analysis performed over a set of real-world scientific workflows from Taverna, Wings, Galaxy and Vistrails. Our analysis has resulted in a set of scientific workflow motifs that outline (i) the kinds of data-intensive activities that are observed in workflows (Data-Operation motifs), and (ii) the different manners in which activities are implemented within workflows (Workflow-Oriented motifs). These motifs help to identify the functionality of the steps in a given workflow, to develop best practices for workflow design, and to develop approaches for the automated generation of workflow abstractions.
Abstract:
Reproducibility of published results is a cornerstone in scientific publishing and progress. Therefore, the scientific community has been encouraging authors and editors to publish their contributions in a verifiable and understandable way. Efforts such as the Reproducibility Initiative [1], or the Reproducibility Projects on Biology [2] and Psychology [3] domains, have been defining standards and patterns to assess whether an experimental result is reproducible.
Abstract:
Scientific workflows provide the means to define, execute and reproduce computational experiments. However, reusing existing workflows still poses challenges for workflow designers. Workflows are often too large and too specific to reuse in their entirety, so reuse is more likely to happen for fragments of workflows. These fragments may be identified manually by users as sub-workflows, or detected automatically. In this paper we present the FragFlow approach, which detects workflow fragments automatically by analyzing existing workflow corpora with graph mining algorithms. FragFlow detects the most common workflow fragments, links them to the original workflows and visualizes them. We evaluate our approach by comparing FragFlow results against user-defined sub-workflows from three different corpora of the LONI Pipeline system. Based on this evaluation, we discuss how automated workflow fragment detection could facilitate workflow reuse.
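As a rough illustration of what detecting common workflow fragments can mean, the sketch below counts short chains of step types that recur across a toy corpus. It is a simplified stand-in, not the FragFlow algorithm: the abstract only says that graph mining algorithms are used, and the corpus, step names, and the functions `step_chains` and `frequent_fragments` are all hypothetical.

```python
from collections import Counter


def step_chains(workflow, length):
    """Yield every directed chain of `length` step types in one workflow.

    `workflow` maps a step type to the step types that consume its output.
    """
    def walk(chain):
        if len(chain) == length:
            yield tuple(chain)
            return
        for nxt in workflow.get(chain[-1], ()):
            if nxt not in chain:  # guard against cycles
                yield from walk(chain + [nxt])

    for start in workflow:
        yield from walk([start])


def frequent_fragments(corpus, length=2, min_support=2):
    """Return the chains that occur in at least `min_support` workflows."""
    support = Counter()
    for workflow in corpus.values():
        # set(): count a chain once per workflow (support, not raw frequency)
        support.update(set(step_chains(workflow, length)))
    return {chain: n for chain, n in support.items() if n >= min_support}


# Hypothetical corpus of three small workflows.
corpus = {
    "wf1": {"align": ["smooth"], "smooth": ["segment"], "segment": []},
    "wf2": {"align": ["smooth"], "smooth": ["register"], "register": []},
    "wf3": {"convert": ["align"], "align": ["smooth"], "smooth": []},
}
print(frequent_fragments(corpus))
```

Here the only chain shared by at least two workflows is `align` followed by `smooth`. Real graph mining approaches also handle branching fragments and larger patterns, which this chain-based sketch ignores.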
Abstract:
Reproducible research in scientific workflows is often addressed by tracking the provenance of the produced results. While this approach allows inspecting intermediate and final results, improves understanding, and permits replaying a workflow execution, it does not ensure that the computational environment is available for subsequent executions to reproduce the experiment. In this work, we propose describing the resources involved in the execution of an experiment using a set of semantic vocabularies, so as to conserve the computational environment. We define a process for documenting the workflow application, management system, and their dependencies based on 4 domain ontologies. We then conduct an experimental evaluation using a real workflow application on an academic and a public Cloud platform. Results show that our approach can reproduce an equivalent execution environment of a predefined virtual machine image on both computing platforms.
Abstract:
Provenance models are crucial for describing experimental results in science. The W3C Provenance Working Group has recently released the PROV family of specifications for provenance on the Web. While provenance focuses on what is executed, it is important in science to publish the general methods that describe scientific processes at a more abstract and general level. In this paper, we propose P-PLAN, an extension of PROV to represent the plans that guided the execution and their correspondence to the provenance records that describe the execution itself. We motivate and discuss the use of P-PLAN and PROV to publish scientific workflows as Linked Data.
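The correspondence between a plan and its execution records can be pictured with a handful of triples. The sketch below is illustrative only: the P-PLAN namespace and the property names `isStepOfPlan` and `correspondsToStep` are quoted from memory and should be checked against the published vocabulary, while the `EX` namespace, the resources, and the helper `executions_of_plan` are hypothetical.

```python
# PROV is the W3C namespace; the P-PLAN URI is an assumption quoted from
# memory, and EX is a hypothetical example namespace.
PROV = "http://www.w3.org/ns/prov#"
PPLAN = "http://purl.org/net/p-plan#"
EX = "http://example.org/"

# Each triple is (subject, predicate, object).
TRIPLES = [
    # The plan: an abstract method made of two steps.
    (EX + "alignStep", PPLAN + "isStepOfPlan", EX + "plan"),
    (EX + "filterStep", PPLAN + "isStepOfPlan", EX + "plan"),
    # The execution: activities linked back to the steps they enacted.
    (EX + "run1_align", PPLAN + "correspondsToStep", EX + "alignStep"),
    (EX + "run1_filter", PPLAN + "correspondsToStep", EX + "filterStep"),
    # Plain PROV statements about the data that flowed between activities.
    (EX + "run1_filter", PROV + "used", EX + "aligned.dat"),
    (EX + "aligned.dat", PROV + "wasGeneratedBy", EX + "run1_align"),
]


def executions_of_plan(triples, plan):
    """Map each step of `plan` to the activities that executed it."""
    runs = {s: [] for s, p, o in triples
            if p == PPLAN + "isStepOfPlan" and o == plan}
    for s, p, o in triples:
        if p == PPLAN + "correspondsToStep" and o in runs:
            runs[o].append(s)
    return runs


for step, activities in executions_of_plan(TRIPLES, EX + "plan").items():
    print(step, "->", activities)
```

Keeping the plan and the execution as separate resources joined by a correspondence property means the same plan can be linked to many runs without duplicating the method description.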
Abstract:
Scientific workflows have been adopted in the last decade to represent the computational methods used in in silico scientific experiments and their associated research products. Scientific workflows have proven useful for sharing and reproducing scientific experiments, allowing scientists to visualize, debug and save time when re-executing previous work. However, scientific workflows may be difficult to understand and reuse. The large number of workflows available in repositories, together with their heterogeneity and the general lack of documentation and usage examples, may become an obstacle for a scientist aiming to reuse the work of other scientists. Furthermore, given that it is often possible to implement a method using different algorithms or techniques, seemingly disparate workflows may be related at a higher level of abstraction, based on their common functionality. In this thesis we address the issue of reusability and abstraction by exploring how workflows relate to one another in a workflow repository, mining abstractions that may be helpful for workflow reuse. To do so, we propose a simple model for representing and relating workflows and their executions, we analyze the typical common abstractions that can be found in workflow repositories, we explore the current practices of users regarding workflow reuse, and we describe a method for discovering useful abstractions for workflows based on existing graph mining techniques. Our results expose the common abstractions and practices of users in terms of workflow reuse, and show how the automatically extracted abstractions have the potential to be reused by users designing new workflows.
Abstract:
Reproducibility of scientific studies and results is a goal that every scientist must pursue when announcing research outcomes. The rise of computational science, as a way of conducting empirical studies by using mathematical models and simulations, has opened a new range of challenges in this context. The adoption of workflows as a way of detailing the scientific procedure of these experiments, along with the experimental data conservation initiatives undertaken during the last decades, has partially eased this problem. However, in order to fully address it, the conservation and reproducibility of the computational equipment associated with scientific workflows must also be considered. The wide range of software and hardware resources required to execute a scientific workflow implies that a comprehensive description detailing what those resources are and how they are configured is necessary. In this thesis we address the issue of reproducibility of execution environments for scientific workflows by documenting them in a formalized way, which can later be used to obtain an equivalent one. To do so, we propose a set of semantic models for representing and relating the relevant information of those environments, a set of tools that use these models to generate a description of the infrastructure, and an algorithmic process that consumes these descriptions to derive a new execution environment specification, which can be enacted into a new equivalent environment using virtualization solutions. We apply these three contributions to a set of representative scientific experiments, belonging to different scientific domains and exposing different software and hardware requirements. The results obtained demonstrate the feasibility of the proposed approach by successfully reproducing the target experiments under different virtualization environments.
Abstract:
We describe a corpus of provenance traces that we have collected by executing 120 real-world scientific workflows. The workflows come from two different workflow systems, Taverna [5] and Wings [3], and from 12 different application domains (see Figure 1). Table 1 provides a summary of this PROV-corpus.
Abstract:
New digital artifacts are emerging in data-intensive science. For example, scientific workflows are executable descriptions of scientific procedures that define the sequence of computational steps in an automated data analysis, supporting reproducible research and the sharing and replication of best-practice and know-how through reuse. Workflows are specified at design time and interpreted through their execution in a variety of situations, environments, and domains. Hence it is essential to preserve both their static and dynamic aspects, along with the research context in which they are used. To achieve this, we propose the use of multidimensional digital objects (Research Objects) that aggregate the resources used and/or produced in scientific investigations, including workflow models, provenance of their executions, and links to the relevant associated resources, along with the provision of technological support for their preservation and efficient retrieval and reuse. In this direction, we specified a software architecture for the design and implementation of a Research Object preservation system, and realized this architecture with a set of services and clients, drawing together practices in digital libraries, preservation systems, workflow management, social networking and Semantic Web technologies. In this paper, we describe the backbone system of this realization, a digital library system built on top of dLibra.
Abstract:
Workflows are increasingly used to manage and share scientific computations and methods. Workflow tools can be used to design, validate, execute and visualize scientific workflows and their execution results. Other tools manage workflow libraries or mine their contents. There has been a lot of recent work on workflow system integration as well as common workflow interlinguas, but the interoperability among workflow systems remains a challenge. Ideally, these tools would form a workflow ecosystem such that it should be possible to create a workflow with a tool, execute it with another, visualize it with another, and use yet another tool to mine a repository of such workflows or their executions. In this paper, we describe our approach to create a workflow ecosystem through the use of standard models for provenance (OPM and W3C PROV) and extensions (P-PLAN and OPMW) to represent workflows. The ecosystem integrates different workflow tools with diverse functions (workflow generation, execution, browsing, mining, and visualization) created by a variety of research groups. This is, to our knowledge, the first time that such a variety of workflow systems and functions are integrated.
Abstract:
Two important characteristics of science are "reproducibility" and "clarity". Through rigorous practices, scientists explore aspects of the world that they can reproduce under carefully controlled experimental conditions. Clarity, complementing reproducibility, provides unambiguous descriptions of results in a mechanical or mathematical form. Both pillars depend on well-structured and accurate descriptions of scientific practices, which are normally recorded in experimental protocols, scientific workflows, etc. Here we present SMART Protocols (SP), our ontology-based approach for representing experimental protocols and our contribution to clarity and reproducibility. SP delivers an unambiguous description of the processes by means of which data is produced; by doing so, we argue, it facilitates reproducibility. Moreover, SP is intended to be part of e-science infrastructures. SP results from the analysis of 175 protocols, from which we extracted common elements. From our analysis, we identified document, workflow and domain-specific aspects in the representation of experimental protocols. The ontology is available at http://purl.org/net/SMARTprotocol
Abstract:
Provenance plays a major role when understanding and reusing the methods applied in a scientific experiment, as it provides a record of the inputs, the processes carried out, and the use and generation of intermediate and final results. In the specific case of in silico scientific experiments, a large variety of scientific workflow systems (e.g., Wings, Taverna, Galaxy, Vistrails) have been created to support scientists. All of these systems produce some sort of provenance about the executions of the workflows that encode scientific experiments. However, provenance is normally recorded at a very low level of detail, which complicates the understanding of what happened during execution. In this paper we propose an approach to automatically obtain abstractions from low-level provenance data by finding common workflow fragments in workflow execution provenance and relating them to templates. We have tested our approach with a dataset of workflows published by the Wings workflow system. Our results show that by using these kinds of abstractions we can highlight the most common abstract methods used in the executions of a repository, relating different runs and workflow templates to each other.
Abstract:
In recent years, a variety of systems have been developed that export the workflows used to analyze data and make them part of published articles. We argue that the workflows that are published in current approaches are dependent on the specific codes used for execution, the specific workflow system used, and the specific workflow catalogs where they are published. In this paper, we describe a new approach that addresses these shortcomings and makes workflows more reusable through: 1) the use of abstract workflows to complement executable workflows to make them reusable when the execution environment is different, 2) the publication of both abstract and executable workflows using standards such as the Open Provenance Model that can be imported by other workflow systems, 3) the publication of workflows as Linked Data that results in open web accessible workflow repositories. We illustrate this approach using a complex workflow that we re-created from an influential publication that describes the generation of 'drugomes'.