Posted to user@crunch.apache.org by "Stadin, Benjamin" <Be...@heidelberg-mobil.com> on 2014/12/01 18:33:01 UTC
Crunch, workflow management and user interaction
I have a mixed bag of requirements, ranging from parallel data processing to local file updates (single/same node) and "reactive" filter interaction. I'm undecided about which frameworks to settle on.
It’s probably best explained by an example usage scenario:
* A web site user uploads small files (typically 1-200 files, file size typically 2-10MB per file)
* Files should be converted in parallel on the available nodes. The conversion is actually done via native tools, but I'm considering Crunch for dynamic parallelization of the conversion according to the number of uploaded files. The conversion will likely take between several minutes and a few hours.
* The converted files are gathered and stored in a single *SQLite* (!) database (containing geometries for rendering). This needs to be done on one node only (file locking etc.). You may say I should not use SQLite, but believe me, I really do =).
* Once the SQLite db is ready, a web map server is (re-)configured on the very same server as the one where the db job was started, and the user can interact with a web application and make small updates to the data set via a web map editing UI. This is a temporary service. After a few minutes, when user interaction is done, the server is "shut down" (it isn't really; just the data source is removed from it and reconfigured).
* When the user is done and hits the save button, the workflow triggers another parallelizable job which does some post-processing on the data.
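The fan-out/fan-in shape of the convert-then-gather steps can be sketched on a single node with plain JDK concurrency (the converter lambda below is a hypothetical stand-in for the native tool; with Crunch, the same shape would be a parallelDo over the uploaded files followed by a gather step on one node):

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.function.Function;

// Sketch only: fan conversions out across worker threads, then gather the
// results in one place (the analogue of the single-node SQLite step).
public class ConvertAndGather {
    public static List<String> run(List<String> files,
                                   Function<String, String> convert,
                                   int workers) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            // Parallel step: one conversion task per uploaded file.
            List<Future<String>> futures = new ArrayList<>();
            for (String f : files) {
                futures.add(pool.submit(() -> convert.apply(f)));
            }
            // Gather step: collect results on this one node, in upload order.
            List<String> gathered = new ArrayList<>();
            for (Future<String> fut : futures) {
                gathered.add(fut.get());
            }
            return gathered;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Dummy converter standing in for the native tool.
        List<String> out = run(Arrays.asList("a.dwg", "b.dwg"),
                               f -> f + ".geom", 4);
        System.out.println(out);
    }
}
```

A distributed engine mainly replaces the thread pool here; the single-writer gather stage stays serialized either way because of SQLite's file locking.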
The two main things giving me a headache:
* I'm not sure how to implement "reactivity" (as it's called with Haskell Arrows) with my filters. How should I design a Crunch job as a long-running job which accepts input and, in addition, runs only on a single node? In Spark one could call coalesce(1, true), but in either case I'm not sure how to cleanly implement a reactive filter in Crunch or Spark.
* Workflow management: In my scenario there are n user sessions, and each can start different workflows in parallel (the above outlines just one of the workflows). What should I use to chain my pipes into workflows? Oozie? Crunch jobs? Could you point me to an example of how to do this?
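To make the two questions concrete, one session's stage chain can be sketched with CompletableFuture (stage names and the dummy transforms are hypothetical, not any particular framework's API). The single-node constraint is modeled as a one-thread executor, the in-process analogue of Spark's coalesce(1, true); each user session could run its own instance of this chain concurrently:

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.stream.Collectors;

// Sketch of one user session's workflow: parallel convert -> single-node
// db build -> (user saves) -> parallel post-processing.
public class SessionWorkflow {
    static final ExecutorService PARALLEL = Executors.newFixedThreadPool(4);
    // One thread: everything scheduled here runs serialized, on "one node".
    static final ExecutorService SINGLE_NODE = Executors.newSingleThreadExecutor();

    public static String run(List<String> uploads) throws Exception {
        // Stage 1: convert uploads in parallel (dummy transform).
        CompletableFuture<List<String>> converted =
            CompletableFuture.supplyAsync(() ->
                uploads.stream().map(f -> f + ".geom")
                       .collect(Collectors.toList()), PARALLEL);

        // Stage 2: gather into the single SQLite db, pinned to one thread.
        CompletableFuture<String> db = converted.thenApplyAsync(
            parts -> "db[" + String.join(",", parts) + "]", SINGLE_NODE);

        // Stage 3: after the user hits save, run the parallel post-processing.
        CompletableFuture<String> post = db.thenApplyAsync(
            d -> d + " post-processed", PARALLEL);
        return post.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run(Arrays.asList("a", "b")));
        PARALLEL.shutdown();
        SINGLE_NODE.shutdown();
    }
}
```

An Oozie or Crunch-based answer would replace each thenApplyAsync hop with a job transition, but the dependency graph per session is the same.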
~Ben
Re: Crunch, workflow management and user interaction
Posted by Josh Wills <jw...@cloudera.com>.
Hey Ben,
Have you had a look at Spark Streaming? It seems like a better choice for
the "reactive" part of the application. In the last release of Crunch, I
added a bunch of "SFunctions" that allow you to re-use logic you write
using Spark's Java APIs with Crunch if it makes sense for your use case:
http://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/fn/SFunctions.html
My suspicion, based on what I read above, is that you're more gated on CPU
than on IO for most of the steps in your workflow; is that true? If so, I'd
be inclined to recommend an app architecture built on something like golang
over the JVM-based Hadoop/Spark world.
Best,
Josh
--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>