Posted to commits@taverna.apache.org by br...@apache.org on 2015/02/10 17:00:20 UTC
svn commit: r1658743 -
/incubator/taverna/site/trunk/content/documentation/provenance/index.md
Author: brenninc
Date: Tue Feb 10 16:00:20 2015
New Revision: 1658743
URL: http://svn.apache.org/r1658743
Log:
Provenance management
Added:
incubator/taverna/site/trunk/content/documentation/provenance/index.md
Added: incubator/taverna/site/trunk/content/documentation/provenance/index.md
URL: http://svn.apache.org/viewvc/incubator/taverna/site/trunk/content/documentation/provenance/index.md?rev=1658743&view=auto
==============================================================================
--- incubator/taverna/site/trunk/content/documentation/provenance/index.md (added)
+++ incubator/taverna/site/trunk/content/documentation/provenance/index.md Tue Feb 10 16:00:20 2015
@@ -0,0 +1,151 @@
+Title: Provenance management
+Notice: Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements. See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership. The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License. You may obtain a copy of the License at
+ .
+ http://www.apache.org/licenses/LICENSE-2.0
+ .
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied. See the License for the
+ specific language governing permissions and limitations
+ under the License.
+
+> Provenance is information about entities, activities, and people
+> involved in producing a piece of data or thing, which can be used to
+> form assessments about its quality, reliability or trustworthiness.
+- <small><a href="http://www.w3.org/TR/prov-overview/">W3C PROV-Overview</a></small>
+
+For a scientific workflow system, provenance can have several aspects:
+
+1. Provenance of the workflow definition
+2. Provenance of a workflow run
+3. Provenance of data
+
+##Provenance of workflow definitions
+
+Taverna does not capture provenance of editing a *workflow definition*,
+ but assumes that the scientist manages the evolution of workflow definitions through existing
+ means for versioning files, such as filenames and folders,
+ version control systems like [git](https://help.github.com/articles/set-up-git),
+ or workflow sharing websites like [myExperiment](http://www.myexperiment.org/).
+
+Within Taverna, a
+ [workflow can be annotated](http://dev.mygrid.org.uk/wiki/display/taverna/Annotations)
+ to give *attribution* to the **Authors** of a workflow (or nested workflow).
+We recommend separating multiple authors with commas or line breaks.
+
+Taverna's workflow file format has an internal workflow identifier (UUID) which is updated for
+ every workflow change.
+A log of previous workflow identifiers is included within the workflow definition formats
+ [t2flow](http://taverna.googlecode.com/svn/taverna/dev/xsd/trunk/t2flow/t2flow.xsd) and
+ [Taverna 3 workflow bundle](http://dev.mygrid.org.uk/wiki/display/developer/Taverna+Workflow+Bundle),
+ allowing
+ [detection of workflows with common ancestry](http://www.myexperiment.org/workflows/2899).
+
+##Provenance of workflow runs
+
+Taverna can
+ [capture provenance of workflow runs](http://dev.mygrid.org.uk/wiki/display/taverna/Data+and+provenance+preferences),
+ including individual processor iterations and their inputs and outputs.
+This provenance is kept in an internal database,
+ which is used to populate *Previous runs* and *Intermediate results* in the
+ [Results perspective](http://dev.mygrid.org.uk/wiki/display/taverna/Result+Perspective)
+ in the Taverna Workbench.
+
+The provenance trace can be used by the
+ [Taverna-PROV plugin](https://github.com/wf4ever/taverna-prov)
+ to export the workflow run, including the output and intermediate values,
+ and the provenance trace as a [PROV-O](http://www.w3.org/TR/prov-o/) RDF graph which can
+ be queried using [SPARQL](http://www.w3.org/TR/sparql11-overview/) and processed with other
+ PROV tools, such as the [PROV Toolbox](https://github.com/lucmoreau/ProvToolbox/).
+
+We are planning to extend myExperiment to handle uploading of such provenance traces,
+ which would give a mechanism to present and browse values and details of workflow runs
+ within the browser.
+
+This [presentation about Taverna's provenance support](http://www.slideshare.net/soilandreyes/20130529-taverna-provenance)
+ gives an overview of the model and software architecture.
+
+##Provenance of data
+
+Scientists using Taverna to perform analysis are often less concerned with the detailed provenance of a workflow run, which semantically just describes inputs and outputs to a chain of processes, and more interested in the *derivation* and *attribution* of the data that is involved in a workflow. For instance, a workflow might perform text-mining on a biomedical article to extract gene names, and then retrieve the genome sequences for those genes by looking them up in a database. The sequences can then be said to be derived from that database and should (according to the license of the web service) also be attributed to its maintainers. The *list* of sequences can be said to be derived from the biomedical article.
+
+Taverna workflows typically combine web services “in the wild” (say, found on [BioCatalogue](http://www.biocatalogue.org/)) with local tools. Neither of these will typically have any facility to provide such “science-level provenance”. myGrid is planning to support such data provenance in several ways:
+
+1. Merging and propagation of [PROV-AQ](http://www.w3.org/TR/prov-aq/) provided provenance
+ traces for [REST services](http://dev.mygrid.org.uk/wiki/display/taverna/REST)
+ (including matching data identity) -- “white-box service”
+2. A provenance “backchannel” for [Components](developers/work-in-progress/components/),
+ which can be populated either by the underlying service directly or by shims within the
+ component.
+ This allows higher level provenance that is meaningful for a set of components instead of
+ service-specific execution details.
+3. Annotation of workflow fragments by
+ [common motifs](http://www.slideshare.net/dgarijo/common-motifs-in-scientific-workflows-an-empirical-analysis),
+ which can provide higher-level provenance for data generated by the workflow
+
+The paper [Enhancing and Abstracting Scientific Workflow Provenance for Data
+ Publishing](http://www.edbt.org/Proceedings/2013-Genova/papers/workshops/a45-alper.pdf)
+ (doi [10.1145/2457317.2457370](http://dx.doi.org/10.1145/2457317.2457370)) details these
+ approaches.
+
+##Collaborations
+
+myGrid actively participated in the
+ [W3C Provenance Working Group](http://www.w3.org/2011/prov/wiki/Main_Page)
+ which developed the [PROV family of standards](http://www.w3.org/TR/prov-overview/).
+The [Taverna-PROV plugin](https://github.com/wf4ever/taverna-prov) has been developed for
+ Taverna that allows the export of workflow run provenance as
+ [PROV-O RDF](http://www.w3.org/TR/prov-o/).
+
+The [wf4ever project](http://www.wf4ever-project.org) is investigating the sharing of workflows
+ and workflow runs as [research objects](http://www.researchobject.org/). Of particular relevance to
+ Taverna is the development of the [Research Object Bundle](https://w3id.org/bundle),
+ which will form a single archive of a workflow run, including run *provenance*, *inputs*,
+ *outputs*, *intermediate values*, *workflow definition* and (for Taverna 3)
+ information about the *run environment*.
+
+##Past collaborations
+
+Since early 2010, we have been invited partners of the [NSF DataONE project](https://dataone.org/),
+ dedicated to large-scale preservation of scientific data, and founding members of the
+ Workflow and Provenance Working Group promoted by the project, along with Prof. Ludaescher
+ at UC Davis, USA and Juliana Freire at University of Utah, USA.
+
+Historically, work on provenance within the myGrid consortium and Taverna team has
+ focused on multiple aspects, beginning with the design and implementation of *Janus*,
+ a data model and software component for provenance capture and analysis for Taverna.
+Our research in this area is often pursued in collaboration with external partners:
+
+ - A model and architecture for capturing provenance.
+ We have designed a data model for *Janus* that is specific to Taverna
+ but can also be exported to other models,
+ notably the [Open Provenance Model](http://openprovenance.org/) (OPM),
+ to enable interoperability with third party provenance-generating systems.
+ Taverna has been retrofitted with provenance generation capabilities.
+ - An expressive provenance query language and efficient query processing model for large
+ provenance graphs.
+ - Investigation into provenance interoperability and exchange, using the OPM.
+ The Taverna provenance component now exports data as OPM graphs,
+ and can also import OPM graphs (with basic features) received from third parties.
+ We have also been working with the Kepler group on a project to promote provenance
+ interoperability, in collaboration with Prof. Ludaescher at UC Davis, CA, and
+ Ilkay Altintas at UCSD, CA.
+ - Investigation into the role of semantics and of Linked Open Data (LOD) in provenance
+ modelling and management, in collaboration with the Knoesis Centre at Wright State University,
+ Ohio (Prof. Amit Sheth, Dr. Satya Sahoo) and with Jun Zhao of Oxford University.
+
+Other past collaborations on the topic of provenance include:
+
+ - Participation in the
+ [Third Provenance Challenge](http://twiki.ipaw.info/bin/view/Challenge/ThirdProvenanceChallenge)
+
+ - [The Semantic Provenance project](http://www.mygrid.org.uk/projects/semantic-provenance-project/),
+ with Eli Lilly and IU
+
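Illustrative note on the "Provenance of data" section of the added page: the text-mining example there (sequences derived from a database and attributed to its maintainers, the sequence list derived from an article) can be sketched as a handful of PROV-style statements. The sketch below is only an illustration in plain Python, not Taverna or Taverna-PROV code. The `ex:` identifiers and the helper functions are hypothetical; only the `prov:wasDerivedFrom` and `prov:wasAttributedTo` property names come from PROV-O. Crediting the agents of an entity's sources is an assumption that mirrors the wording of the example, not a rule defined by PROV.

```python
# PROV-O property names (real); everything prefixed "ex:" is made up for illustration.
PROV = "http://www.w3.org/ns/prov#"
WAS_DERIVED_FROM = PROV + "wasDerivedFrom"
WAS_ATTRIBUTED_TO = PROV + "wasAttributedTo"

# (subject, predicate, object) statements for the text-mining example.
TRIPLES = [
    ("ex:sequence-list", WAS_DERIVED_FROM, "ex:article"),
    ("ex:sequence-1", WAS_DERIVED_FROM, "ex:genome-database"),
    ("ex:sequence-2", WAS_DERIVED_FROM, "ex:genome-database"),
    ("ex:genome-database", WAS_ATTRIBUTED_TO, "ex:database-maintainers"),
]


def objects(triples, subject, predicate):
    """Return every object stated for the given subject and predicate."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def derivation_sources(triples, entity):
    """Return all entities that `entity` is directly or transitively derived from."""
    seen = set()
    stack = [entity]
    while stack:
        current = stack.pop()
        for source in objects(triples, current, WAS_DERIVED_FROM):
            if source not in seen:
                seen.add(source)
                stack.append(source)
    return seen


def attribution_for(triples, entity):
    """Return agents attributed on the entity itself or on any of its sources.

    Assumption: credit is carried along derivation links, as in the example text.
    """
    agents = set()
    for item in {entity} | derivation_sources(triples, entity):
        agents |= objects(triples, item, WAS_ATTRIBUTED_TO)
    return agents


if __name__ == "__main__":
    for entity in ("ex:sequence-1", "ex:sequence-list"):
        print(entity, sorted(derivation_sources(TRIPLES, entity)),
              sorted(attribution_for(TRIPLES, entity)))
```

In a real export the same questions would more likely be asked with SPARQL over the PROV-O graph; the plain-Python version is used here only to keep the sketch free of external dependencies.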