Posted to user@crunch.apache.org by "Stadin, Benjamin" <Be...@heidelberg-mobil.com> on 2014/12/01 18:33:01 UTC

Crunch, workflow management and user interaction

I have a mixed bag of requirements, ranging from parallel data processing to local file updates (single / same node) and "reactive" filter interaction. I'm undecided which frameworks to settle on.

It’s probably best explained by an example usage scenario:

 *   A web site user uploads small files (typically 1-200 files, file size typically 2-10MB per file)
 *   Files should be converted in parallel and on available nodes. The conversion is actually done via native tools, but I'm considering using Crunch to dynamically parallelize the conversion according to the number of uploaded files (see the sketch below this list). The conversion will likely take between several minutes and a few hours.
 *   The converted files are gathered and stored in a single *SQLite* (!) database (containing geometries for rendering). This needs to be done on one node only (file locking, etc.). You may say I should not use SQLite, but believe me, I really do =).
 *   Once the SQLite db is ready, a web map server is (re-)configured on the very same server as the one where the db job was started, and the user can interact with a web application and make small updates to the data set via a web map editing UI. This is a temporary service. After a few minutes, when user interaction is done, the server is "shut down" (it isn't really; just the data source is removed from it and the server is reconfigured).
 *   When the user is done and hits the save button, the workflow triggers another parallelizable job which does some post-processing on the data.
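
To make the conversion step concrete, here is roughly what I have in mind -- just a sketch, assuming the uploaded file paths are listed one per line in a text file on HDFS, and where runNativeConverter() is a placeholder for shelling out to the actual native tool:

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;

public class ConvertUploads {
  public static void main(String[] args) throws Exception {
    // args[0]: text file listing the uploaded file paths, one per line.
    Pipeline pipeline = new MRPipeline(ConvertUploads.class);
    PCollection<String> uploads = pipeline.readTextFile(args[0]);

    // Crunch fans the per-file conversions out across the cluster.
    PCollection<String> converted = uploads.parallelDo(
        new DoFn<String, String>() {
          @Override
          public void process(String inputPath, Emitter<String> emitter) {
            // Placeholder: real code would invoke the native converter
            // for this file and emit the path of the converted output.
            emitter.emit(runNativeConverter(inputPath));
          }
        },
        Writables.strings());

    pipeline.writeTextFile(converted, args[1]);
    pipeline.done();
  }

  // Stand-in for shelling out to the native conversion tool.
  static String runNativeConverter(String inputPath) {
    return inputPath + ".converted";
  }
}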

The two main things causing me headaches:

 *   I'm not sure how to implement "reactivity", as it's called in Haskell Arrows, for my filters. How should I design a Crunch job as a long-running job which accepts input and, in addition, runs only on a single node? In Spark one could call coalesce(1, true), but either way I'm not sure how to cleanly implement a reactive filter in Crunch or Spark (a rough sketch of the single-node idea follows this list).
 *   Workflow management: In my scenario, there are n user sessions, and each can start different workflows in parallel (the above outlines just one of the workflows). What should I use to chain my pipelines into workflows? Oozie? Crunch jobs? Could you point me to an example of how to do this?
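
For the single-node gather step, the best I have come up with so far is something along these lines in plain Spark (not Crunch). This is only a sketch: writeIntoSqlite() is a hypothetical stand-in for my SQLite import logic, and the master is assumed to come from spark-submit:

import java.util.Iterator;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.VoidFunction;

public class GatherIntoSqlite {
  public static void main(String[] args) {
    // Master URL is expected to be supplied via spark-submit.
    JavaSparkContext sc =
        new JavaSparkContext(new SparkConf().setAppName("gather-into-sqlite"));

    // args[0]: text file listing the converted file paths, one per line.
    JavaRDD<String> convertedPaths = sc.textFile(args[0]);

    // coalesce(1, true) forces everything into a single partition, so the
    // SQLite import runs inside exactly one task on one node.
    convertedPaths.coalesce(1, true).foreachPartition(
        new VoidFunction<Iterator<String>>() {
          @Override
          public void call(Iterator<String> paths) {
            while (paths.hasNext()) {
              writeIntoSqlite(paths.next()); // hypothetical SQLite import
            }
          }
        });

    sc.stop();
  }

  // Placeholder for the real geometry import into the single SQLite db.
  static void writeIntoSqlite(String path) {
  }
}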

~Ben


Re: Crunch, workflow management and user interaction

Posted by Josh Wills <jw...@cloudera.com>.
Hey Ben,

Have you had a look at Spark Streaming? It seems like a better choice for
the "reactive" part of the application. In the last release of Crunch, I
added a bunch of "SFunctions" that allow you to re-use logic you write
using Spark's Java APIs with Crunch if it makes sense for your use case:

http://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/fn/SFunctions.html
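
As a rough, unchecked sketch of the idea (the wrap() overloads are from
memory, so double-check the javadoc above; the normalize function and the
file paths are just placeholders): you write the logic once against Spark's
Java function interfaces and then reuse it inside a Crunch pipeline:

import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.fn.SFunctions;
import org.apache.crunch.impl.spark.SparkPipeline;
import org.apache.crunch.types.writable.Writables;
import org.apache.spark.api.java.function.Function;

public class WrapExample {
  public static void main(String[] args) throws Exception {
    // Logic written once against Spark's Java API...
    Function<String, String> normalize = new Function<String, String>() {
      @Override
      public String call(String line) {
        return line.trim().toLowerCase();
      }
    };

    // ...and reused in a Crunch pipeline by wrapping it as a Crunch MapFn.
    Pipeline pipeline = new SparkPipeline("local", "wrap-example");
    PCollection<String> lines = pipeline.readTextFile(args[0]);
    PCollection<String> normalized =
        lines.parallelDo(SFunctions.wrap(normalize), Writables.strings());
    pipeline.writeTextFile(normalized, args[1]);
    pipeline.done();
  }
}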

My suspicion, based on what I read above, is that you're more gated on CPU
than I/O for most of the steps in your workflow -- is that true? If so, I'd
be inclined to recommend an app architecture built on something like Go
(golang) rather than the JVM-based Hadoop/Spark world.

Best,
Josh



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>