Posted to dev@taverna.apache.org by Paul Brack <pa...@manchester.ac.uk> on 2016/06/02 12:37:27 UTC

Taverna hashes

Hi guys,

I've been building Taverna workflows for a while now, and there's a piece of functionality that would be really helpful - hashes as a data type on input/output ports, and an option on string merges.

We have built a generic tool to submit a job to an HPC cluster (Hydra) and monitor its status. This job can be anything, so we pass in two lists (one of keys, one of values) so we can do variable substitution in a script further down the line. To produce a list of values we can just do a merge, but I have to hardcode the key list in a Beanshell script. This is not very nice and prone to human error.

I don't know the overall implications of this, but if I could do a merge that preserved the variable names and paired them with the values, and then pass the result through to the input port of a subworkflow, that would be very useful indeed. We are trying to avoid serialising the data into JSON/XML or similar.
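In plain Java terms, what we'd like the merge to produce is roughly this (just a sketch; the %{...} placeholder syntax and all names here are illustrative, not anything Taverna provides today):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class KeyValueZip {

    // Pair a key list with a value list into an ordered map,
    // instead of hardcoding the keys in a Beanshell script.
    public static Map<String, String> zip(List<String> keys, List<String> values) {
        if (keys.size() != values.size()) {
            throw new IllegalArgumentException("key/value lists differ in length");
        }
        Map<String, String> map = new LinkedHashMap<>();
        for (int i = 0; i < keys.size(); i++) {
            map.put(keys.get(i), values.get(i));
        }
        return map;
    }

    // Substitute %{name} placeholders in a script template from the map.
    public static String substitute(String script, Map<String, String> vars) {
        String result = script;
        for (Map.Entry<String, String> e : vars.entrySet()) {
            result = result.replace("%{" + e.getKey() + "}", e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> vars = zip(
                Arrays.asList("queue", "cores"),
                Arrays.asList("short", "8"));
        // prints: qsub -q short -n 8
        System.out.println(substitute("qsub -q %{queue} -n %{cores}", vars));
    }
}
```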

Is this possible? Is this significant development effort?

Best Regards,
Paul Brack

Re: Taverna hashes

Posted by Stian Soiland-Reyes <st...@apache.org>.
On 2 June 2016 at 13:37, Paul Brack <pa...@manchester.ac.uk> wrote:
> I've been building Taverna workflows for a while now, and there's a piece of functionality that would be really helpful - hashes as a data type on input/output ports, and an option on string merges.

Hi! Incidentally, this also came up when discussing support for Common
Workflow Language datatypes, which have the notion of a "record" (like
a hash with pre-defined types).

JSON support has also been discussed earlier:
https://issues.apache.org/jira/browse/TAVERNA-345

And Table support:
https://issues.apache.org/jira/browse/TAVERNA-389


Somewhat related, a JSON construction and selection activity:
https://github.com/markborkum/t2-json-activity




Adding general support for hashes, with a couple of local workers to do
hash modifications, would be ideal!

This would add a new Reference type that your activity could pick up,
with a native mapping to JSON as the default String/InputStream
serialization.
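For a flat String-to-String hash, that default JSON mapping could look something like this sketch (hand-rolled here purely for illustration; a real implementation would use a JSON library, and this only escapes quotes and backslashes in flat string maps):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class HashToJson {

    // Render a flat String-to-String map as a JSON object.
    public static String toJson(Map<String, String> hash) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : hash.entrySet()) {
            if (!first) sb.append(",");
            first = false;
            sb.append(quote(e.getKey())).append(":").append(quote(e.getValue()));
        }
        return sb.append("}").toString();
    }

    // Minimal JSON string quoting: backslash and double quote only.
    private static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    public static void main(String[] args) {
        Map<String, String> hash = new LinkedHashMap<>();
        hash.put("key1", "value1");
        hash.put("key2", "value2");
        // prints: {"key1":"value1","key2":"value2"}
        System.out.println(toJson(hash));
    }
}
```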

Would you be interested in working with us on making that?





> We have built a generic tool to submit a job to a HPC (Hydra) and monitor its status. This job can be anything, so we pass in two lists (one of keys, one of values) so we can do variable substitution in a script further down the line. To produce a list of values, we can just do a merge, but I have to hardcode the key list in a beanscript. This is not very nice and prone to human error.
> I don't know the overall implications of this, but if I could do a merge that preserved the variable names, and paired these with the values, then pass these through to the input port of a subworkflow, this would be very useful indeed. We are trying to avoid serialising the data into JSON/XML or similar.

So what you are describing would be like a shim activity that has two
input ports with keys and values in order, and returns a new hash
(e.g. a java.util.Map in some form) which can then be picked up
directly by the next Hydra-submit activity?

I guess other hash manipulation activities would be useful as well?
E.g. for a configured set of keys, provide multiple inputs that are
then returned as a single map (similar to an XML input splitter), or
another activity that can do Map.* operations to look up by key, add a
value for a particular key, etc.
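Those manipulation workers could be as small as this sketch (names hypothetical; the add operation returns a copy so upstream data stays untouched):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class HashWorkers {

    // Look up a value by key, failing loudly on a missing key
    // so the workflow surfaces the error early.
    public static String lookup(Map<String, String> hash, String key) {
        if (!hash.containsKey(key)) {
            throw new IllegalArgumentException("no such key: " + key);
        }
        return hash.get(key);
    }

    // Add (or replace) an entry, returning a new map rather than
    // mutating the input.
    public static Map<String, String> withEntry(Map<String, String> hash,
                                                String key, String value) {
        Map<String, String> copy = new LinkedHashMap<>(hash);
        copy.put(key, value);
        return copy;
    }
}
```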


> Is this possible? Is this significant development effort?

I think it should be straightforward if you use

(for 2.5)
https://github.com/apache/incubator-taverna-engine/blob/old/reference-core-extensions-1.5/src/main/java/net/sf/taverna/t2/reference/impl/external/object/VMObjectReference.java
(for 3.x)
https://github.com/apache/incubator-taverna-engine/blob/master/taverna-reference-types/src/main/java/org/apache/taverna/reference/impl/external/object/VMObjectReference.java

In the shim activity you do something like:


// import the VMObjectReference class matching your Taverna version,
// e.g. net.sf.taverna.t2.reference.impl.external.object.VMObjectReference

hash = new HashMap();
hash.put("key1", "value1");
hash.put("key2", "value2");
vmref = new VMObjectReference();
vmref.setObject(hash);
outputs.put("hash", referenceService.register(vmref));


Then in the receiving Hydra activity you can dereference the
T2Reference to a VMObjectReference and call getObject().

The actual hash is kept in a static field of VMObjectReference for the
duration of the JVM execution and only the UUID of the object is
passed through the workflow or stored in the provenance.

(That is, the hash would be transient, but memory consuming).
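In other words, VMObjectReference behaves roughly like this simplified model (not the actual Taverna class, just an illustration of the static-map-plus-UUID mechanism described above):

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

public class VMObjectRefSketch {

    // Objects live here for the lifetime of the JVM; only the UUID
    // travels through the workflow and into the provenance.
    private static final Map<String, Object> OBJECTS = new ConcurrentHashMap<>();

    private String uuid;

    public void setObject(Object o) {
        uuid = UUID.randomUUID().toString();
        OBJECTS.put(uuid, o);
    }

    public Object getObject() {
        return OBJECTS.get(uuid);
    }

    public String getUuid() {
        return uuid;
    }
}
```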


As we have
https://github.com/apache/incubator-taverna-engine/blob/master/taverna-reference-types/src/main/java/org/apache/taverna/reference/impl/external/object/StreamToVMObjectReferenceConverter.java
then it's possible to construct a VMObjectReference from any
InputStream (think: a byte[] saved from a java.io.ObjectOutputStream),

so you could in theory do the shim today in a Beanshell:

import java.io.ObjectOutputStream;
import java.io.ByteArrayOutputStream;

hash = new HashMap();
hash.put("key1", "value1");
hash.put("key2", "value2");

os = new ByteArrayOutputStream();
objectStream = new ObjectOutputStream(os);
objectStream.writeObject(hash);
objectStream.close();

// the output port "bytes" picks up this variable
bytes = os.toByteArray();

Such a Beanshell script could then be added as a Local Worker (all
local workers are Beanshell scripts), and only the HydraActivity needs
to understand the VMObject.
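Both ends of that serialization can be sketched in plain Java (hypothetical helper names; the deserialize side is effectively what the receiving activity would do once it has the bytes):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.Map;

public class HashRoundTrip {

    // Serialize the hash exactly as the Beanshell shim above does.
    public static byte[] serialize(Map<String, String> hash) {
        try {
            ByteArrayOutputStream os = new ByteArrayOutputStream();
            ObjectOutputStream objectStream = new ObjectOutputStream(os);
            objectStream.writeObject(hash);
            objectStream.close();
            return os.toByteArray();
        } catch (java.io.IOException e) {
            throw new RuntimeException(e);
        }
    }

    // Read the hash back from the byte stream on the receiving side.
    @SuppressWarnings("unchecked")
    public static Map<String, String> deserialize(byte[] bytes) {
        try {
            ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes));
            return (Map<String, String>) in.readObject();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```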


I think the above still looks like a hack, and probably doesn't
perform much better than JSON serialization. :)
Perhaps adding native hash support in Taverna is the better approach.


-- 
Stian Soiland-Reyes
Apache Taverna (incubating), Apache Commons
http://orcid.org/0000-0001-9842-9718