Posted to dev@mahout.apache.org by Ted Dunning <te...@gmail.com> on 2010/08/03 20:24:08 UTC

Fwd: cascading + riffle + ?

---------- Forwarded message ----------
From: Chris K Wensel <ch...@wensel.net>
Date: Tue, Aug 3, 2010 at 11:19 AM
Subject: cascading + riffle + ?
To: cascading-user@googlegroups.com, user@mahout.apache.org,
common-user@hadoop.apache.org



Sorry, cross posting to save time.

I now have a WIP of Cascading 1.2 that includes support for Riffle
annotations.

Riffle is an Apache licensed library that includes Java annotations for
marking lifecycle and dependency methods on a 'process' object.

That is, you can create custom objects with 'start' and 'stop' methods, as
well as with getters for incoming/outgoing resources (input files, and
output files).

With a collection of such objects, each one for a particular task like
running a copy job or a Mahout process, you can have either Riffle or
Cascading chain and execute all the processes in dependency order.
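
To make the idea concrete, here is a minimal sketch of an annotated process
object and a runner that executes a collection of them in dependency order.
The annotations (Start, Incoming, Outgoing) and the CopyJob and ProcessRunner
classes are hypothetical stand-ins defined in the snippet itself; they are not
Riffle's actual annotation names or its API, so check the Riffle source for
the real ones.

```java
import java.lang.annotation.*;
import java.lang.reflect.Method;
import java.util.*;

// Hypothetical stand-ins for lifecycle/dependency annotations; NOT Riffle's actual API.
@Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD) @interface Start {}
@Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD) @interface Incoming {}
@Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD) @interface Outgoing {}

// A 'process' object: a copy job that reads one path and writes another.
class CopyJob {
    final String from, to;
    CopyJob(String from, String to) { this.from = from; this.to = to; }
    @Incoming public String getFrom() { return from; }
    @Outgoing public String getTo() { return to; }
    @Start public void start() { System.out.println("copy " + from + " -> " + to); }
    @Override public String toString() { return from + "->" + to; }
}

class ProcessRunner {
    // Invoke the first public method on 'process' that carries the given annotation.
    static Object call(Object process, Class<? extends Annotation> mark) throws Exception {
        for (Method m : process.getClass().getMethods()) {
            if (m.isAnnotationPresent(mark)) {
                m.setAccessible(true); // the declaring class may not be public
                return m.invoke(process);
            }
        }
        throw new IllegalArgumentException("no @" + mark.getSimpleName() + " on " + process);
    }

    // Order processes so each one comes after the process that produces its input.
    static List<Object> order(List<Object> processes) throws Exception {
        Map<Object, Object> producerOf = new HashMap<>();
        for (Object p : processes) producerOf.put(call(p, Outgoing.class), p);
        List<Object> ordered = new ArrayList<>();
        Set<Object> visiting = new HashSet<>();
        for (Object p : processes) visit(p, producerOf, visiting, ordered);
        return ordered;
    }

    // Depth-first visit: place the upstream producer (if any) before 'p'.
    static void visit(Object p, Map<Object, Object> producerOf,
                      Set<Object> visiting, List<Object> ordered) throws Exception {
        if (ordered.contains(p)) return;
        if (!visiting.add(p)) throw new IllegalStateException("cycle at " + p);
        Object upstream = producerOf.get(call(p, Incoming.class));
        if (upstream != null) visit(upstream, producerOf, visiting, ordered);
        ordered.add(p);
    }

    // Start every process, upstream producers first.
    static void runAll(List<Object> processes) throws Exception {
        for (Object p : order(processes)) call(p, Start.class);
    }

    public static void main(String[] args) throws Exception {
        List<Object> jobs = new ArrayList<>();
        jobs.add(new CopyJob("staging", "final")); // listed first, but depends on the next job
        jobs.add(new CopyJob("raw", "staging"));
        runAll(jobs);
    }
}
```

The runner only looks at annotated methods, so any object exposing a start
method and incoming/outgoing getters could be mixed into the same collection.
This sketch assumes one input and one output per process; a real tool would
need to handle several of each.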

You can see more about Riffle here (which includes a tool to run a
collection of processes):
http://github.com/cwensel/riffle

You can download WIP builds for Cascading 1.2 (1.1 is the current stable
version) here:
http://www.concurrentinc.com/downloads/

Note that Riffle is very early stage (and likely naive), and the Cascading
support is likely to evolve before the 1.2 final release (sometime this
fall).

The long term goal here is to allow Mahout and other projects to apply the
annotations, and then third party tools can be used to run the processes.

For you Cascading users, writing a simple DistCp wrapper (or putting the
annotations directly on the Hadoop DistCp object) would allow an efficient copy
to run inside of a Cascade process alongside your Flow instances.

Or more importantly, you can write iterative processes (e.g. PageRank) that
act like a single process even though internally there is an unknown number
of Flows being created on the fly. (I'm running a connected components
algorithm that requires multiple Flows/passes in production now as a Riffle
object.)
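
As a rough illustration of that pattern, the sketch below wraps an iterative
connected components computation in one object with a single start method.
Each call to pass() stands in for one Flow, and the number of passes is not
known until the labels stop changing. The ConnectedComponents class and its
method names are hypothetical, and nothing here uses the Cascading or Riffle
APIs; it is plain in-memory label propagation.

```java
import java.util.*;

// Hypothetical sketch: one 'process' object whose start() runs as many passes as needed.
// Each pass stands in for one Flow; no Cascading or Riffle API is used here.
class ConnectedComponents {
    private final int[][] edges; // undirected edges as {a, b} vertex pairs
    private final int[] label;   // label[v] = smallest vertex id seen so far for v's component
    private int passes = 0;

    ConnectedComponents(int vertexCount, int[][] edges) {
        this.edges = edges;
        this.label = new int[vertexCount];
        for (int v = 0; v < vertexCount; v++) label[v] = v; // every vertex starts alone
    }

    // One pass: push the smaller label across every edge. Returns true if anything changed.
    private boolean pass() {
        boolean changed = false;
        for (int[] e : edges) {
            int min = Math.min(label[e[0]], label[e[1]]);
            if (label[e[0]] != min) { label[e[0]] = min; changed = true; }
            if (label[e[1]] != min) { label[e[1]] = min; changed = true; }
        }
        return changed;
    }

    // Lifecycle entry point: callers see a single process, however many passes it takes.
    public void start() {
        do { passes++; } while (pass());
    }

    int[] labels() { return label.clone(); }
    int passes() { return passes; }

    public static void main(String[] args) {
        int[][] edges = { {3, 4}, {1, 2}, {0, 1} };
        ConnectedComponents cc = new ConnectedComponents(5, edges);
        cc.start();
        System.out.println("labels=" + Arrays.toString(cc.labels()) + " passes=" + cc.passes());
    }
}
```

From the outside this looks like any other process with a start method, which
is what lets a scheduler treat it as one unit alongside fixed, single-pass
jobs.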

Please feel free to fork and tweak.

ckw

--
Chris K Wensel
chris@concurrentinc.com
http://www.concurrentinc.com
