Posted to dev@bigtop.apache.org by "Jörn Franke (JIRA)" <ji...@apache.org> on 2014/08/22 19:48:11 UTC

[jira] [Commented] (BIGTOP-1414) Add Apache Spark implementation to BigPetStore

    [ https://issues.apache.org/jira/browse/BIGTOP-1414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107171#comment-14107171 ] 

Jörn Franke commented on BIGTOP-1414:
-------------------------------------

You get the most benefit from Spark if you run an incremental and/or iterative job which relies on cached data.
Of course, we can do something similar to the Hadoop Map/Reduce job, but if it is only executed once we do not gain much from it - it will probably have similar performance.
I think trend analysis could be one example, e.g.
* a batch job which is executed every week (or on some other schedule)
* The batch job generates some trend data from BigPetStore, e.g. dog food with turkey has been bought 5 times more in December week 4 than in the previous week(s)
** This means we need to keep the trend data for each week as an RDD in memory, because we compare the current week against the x previous week(s)
The difference from the Hadoop Map/Reduce job will be that we leverage the cached results of previous jobs. 
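
To make this concrete, here is a minimal sketch of the week-over-week comparison, assuming a simple "product,quantity" CSV per week - the file paths, record layout and 5x threshold are placeholders for illustration, not the actual BigPetStore data model:

{code:scala}
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

object TrendSketch {
  // Aggregate one week of purchases into (product, totalQuantity) pairs.
  def weeklyCounts(sc: SparkContext, path: String): RDD[(String, Long)] =
    sc.textFile(path)                      // one "product,quantity" line per purchase
      .map(_.split(","))
      .map(f => (f(0), f(1).toLong))
      .reduceByKey(_ + _)
      .cache()                             // keep the weekly aggregate in memory for reuse

  def main(args: Array[String]) {
    val sc = new SparkContext("local", "BigPetStoreTrends")
    val previousWeek = weeklyCounts(sc, "week3.csv")   // placeholder paths
    val currentWeek  = weeklyCounts(sc, "week4.csv")

    // Join the cached aggregates and compute the week-over-week ratio per product.
    val trends = currentWeek.join(previousWeek).map {
      case (product, (now, before)) => (product, now.toDouble / before)
    }
    trends.filter(_._2 >= 5.0).collect().foreach(println)  // e.g. turkey dog food bought 5x more
    sc.stop()
  }
}
{code}

Caching each week's aggregate is exactly what distinguishes this from a plain Map/Reduce run: the week 4 job can reuse the week 3 RDD instead of recomputing it from the raw data.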

This is just a simple example. We need to think about whether it really makes sense or whether we should have a more sophisticated example. I would also like to include shared variables. Finally, I would like to extend it to Spark Streaming, e.g. complex event processing in combination with a Spark batch job.
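
On shared variables, a minimal sketch of what I mean - broadcast a small read-only lookup table to the workers and use an accumulator as a side counter (the product-to-category map and the file name are made up for illustration):

{code:scala}
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object SharedVariablesSketch {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "SharedVariablesSketch")
    val categories = sc.broadcast(Map("dog-food-turkey" -> "dog food"))  // read-only on workers
    val badLines = sc.accumulator(0)                                     // written from workers, read on the driver

    val byCategory = sc.textFile("transactions.csv").flatMap { line =>   // placeholder path
      val fields = line.split(",")
      if (fields.length < 2) { badLines += 1; None }                     // count malformed records on the side
      else Some((categories.value.getOrElse(fields(0), "unknown"), 1L))
    }.reduceByKey(_ + _)

    byCategory.collect().foreach(println)
    println("malformed lines: " + badLines.value)
  }
}
{code}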

What do you think?

> Add Apache Spark implementation to BigPetStore
> ----------------------------------------------
>
>                 Key: BIGTOP-1414
>                 URL: https://issues.apache.org/jira/browse/BIGTOP-1414
>             Project: Bigtop
>          Issue Type: Improvement
>          Components: blueprints
>    Affects Versions: backlog
>            Reporter: jay vyas
>             Fix For: 0.9.0
>
>
> Currently we only process data with Hadoop.  Now it's time to add Spark to the BigPetStore application.  This will demonstrate the difference between a MapReduce-based Hadoop implementation of a big data app and a Spark one.   
> *We will need to*
> - update the graphviz arch.dot to diagram Spark as a new path.
> - add a Spark job, in a new package, to the existing code; it uses the existing Scala-based generator, but inside a Spark job rather than in a Hadoop InputSplit.
> - The job should output to an RDD, which can then be serialized to disk or fed into the next Spark job... 
> *So, the next spark job should*
> - group the data and write product summaries to a local file
> - run a product recommender against the input data set.
> We want the jobs to be runnable modularly or as a single job, to leverage the RDD paradigm.  
> So it will be interesting to see how the code is architected.    Let's start the planning in this JIRA.  I have some stuff I've informally hacked together; maybe I can attach an initial patch just to start a dialog. 
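
To illustrate the modular vs. single-job structure from the description above, a rough sketch of two chainable steps sharing one RDD - the generate() body is a stand-in for the existing Scala-based generator, not its real API, and the output path is a placeholder:

{code:scala}
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

object BigPetStoreSparkSketch {
  // Step 1: produce transactions as an RDD (stand-in for the Scala-based generator).
  def generate(sc: SparkContext): RDD[(String, Double)] =
    sc.parallelize(Seq(("dog-food-turkey", 9.99), ("cat-litter", 4.49)))

  // Step 2: group by product and write summaries; works equally on an RDD
  // loaded from disk (modular run) or on the in-memory RDD from step 1 (single job).
  def summarize(transactions: RDD[(String, Double)], out: String) {
    transactions
      .mapValues(price => (1L, price))
      .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))   // (count, revenue) per product
      .map { case (p, (n, rev)) => p + "\t" + n + "\t" + rev }
      .saveAsTextFile(out)
  }

  def main(args: Array[String]) {
    val sc = new SparkContext("local", "BigPetStoreSpark")
    val data = generate(sc).cache()        // reusable by every downstream step
    summarize(data, "product-summaries")
    // A product recommender step could consume the same cached RDD here.
    sc.stop()
  }
}
{code}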



--
This message was sent by Atlassian JIRA
(v6.2#6252)