Posted to dev@bigtop.apache.org by "jay vyas (JIRA)" <ji...@apache.org> on 2014/01/03 16:38:51 UTC

[jira] [Comment Edited] (BIGTOP-1089) BigPetStore: A polyglot big data processing blueprint inside of bigtop for comparing and learning about the tools in the bigtop packaged hadoop ecosystem.

    [ https://issues.apache.org/jira/browse/BIGTOP-1089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861600#comment-13861600 ] 

jay vyas edited comment on BIGTOP-1089 at 1/3/14 3:37 PM:
----------------------------------------------------------

Update: 

It's still under development here: https://github.com/jayunit100/bigpetstore . Once it's stable I'll put in the first patch, and move development to bigtop if it's approved.

- Generates gaussian-distributed data in the DFS for pet store transactions (i.e. the number of purchases by each individual is smeared on a gaussian distribution).
- Currently it does basic aggregations of the raw data using pig and hive, and verifies that the two approaches produce identical outputs.
- It also does some classification of US "states" based on similar transaction profiles using mahout.
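The gaussian smearing of per-customer purchase counts can be sketched as follows. This is a minimal illustration, not the actual BigPetStore generator; the class name, seed, and distribution parameters are assumptions for the example.

```java
import java.util.Random;

// Hypothetical sketch (names and parameters are illustrative, not from the
// BigPetStore source): draw the number of purchases per customer from a
// gaussian distribution, as described above.
public class TransactionCountGenerator {
    private final Random rng;
    private final double mean;
    private final double stdDev;

    public TransactionCountGenerator(long seed, double mean, double stdDev) {
        this.rng = new Random(seed);
        this.mean = mean;
        this.stdDev = stdDev;
    }

    /** Draw a per-customer purchase count from N(mean, stdDev), floored at 1. */
    public int nextPurchaseCount() {
        long count = Math.round(mean + stdDev * rng.nextGaussian());
        return (int) Math.max(1, count);
    }

    public static void main(String[] args) {
        TransactionCountGenerator gen = new TransactionCountGenerator(42L, 10.0, 3.0);
        for (int i = 0; i < 5; i++) {
            System.out.println("customer-" + i + " purchases=" + gen.nextPurchaseCount());
        }
    }
}
```

A seeded Random keeps the generated data reproducible, which matters when pig and hive runs over the same input are being compared.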

I need some help on some things:

- more unit tests.
- adding crunch, datafu, and any other bigtop packages: more examples means more usefulness as a big data sandbox app.
- documentation
- deployment on a cluster

I'd like to work with the bigtop community directly on this; maybe we can open it up to the broader hadoop community for feedback this week.
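The pig/hive cross-check mentioned in the feature list above can be sketched as a simple order-insensitive file comparison. This is an assumed illustration, not the project's actual verification code; the file contents and paths are made up for the example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the cross-check described above: the pig and hive
// jobs each write their aggregates to a text file, and the two result sets
// are compared after sorting (since job output order is not guaranteed).
public class OutputComparator {
    /** Return true if both files contain the same lines, ignoring order. */
    public static boolean sameResults(Path pigOut, Path hiveOut) throws IOException {
        List<String> pig = Files.readAllLines(pigOut);
        List<String> hive = Files.readAllLines(hiveOut);
        Collections.sort(pig);
        Collections.sort(hive);
        return pig.equals(hive);
    }

    public static void main(String[] args) throws IOException {
        Path pig = Files.createTempFile("pig", ".txt");
        Path hive = Files.createTempFile("hive", ".txt");
        Files.write(pig, List.of("CT\t12", "CA\t40"));
        Files.write(hive, List.of("CA\t40", "CT\t12"));
        System.out.println("identical: " + sameResults(pig, hive));
    }
}
```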



> BigPetStore: A polyglot big data processing blueprint inside of bigtop for comparing and learning about the tools in the bigtop packaged hadoop ecosystem.
> ----------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: BIGTOP-1089
>                 URL: https://issues.apache.org/jira/browse/BIGTOP-1089
>             Project: Bigtop
>          Issue Type: New Feature
>          Components: Blueprints
>            Reporter: jay vyas
>            Assignee: jay vyas
>
> The need for templates for processing big data pipelines is obvious - and also, given the increasing amount of overlap across different big data and nosql projects, this will provide a ground truth in the future for comparing the behaviour and approach of different tools in solving a common, easily comprehended problem. 
> This ticket formalizes the conversation in mailing list archives regarding the BigPetStore proposal. 
> At the moment (with the exception of word count), there are very few examples of big data problems that have been solved by a variety of different technologies.  And even with wordcount, there aren't a lot of templates which can be customized for applications. 
> Comparatively, other application developer communities (i.e. the Rails folks, those using maven archetypes, etc.) have a plethora of template applications which can be used to kickstart their applications and use cases.   
> This big pet store JIRA thus aims to do the following: 
> 0) Curate a single, central, standard input data set (modified: generate a large input data set on the fly).
> 1) Define a big data processing pipeline (using the pet store theme - except morphing it to be analytics rather than transaction oriented), and implement basic aggregations in hive, pig, etc.
> 2) Sink the results of 1) into some kind of NoSQL store or search engine.
>  
> Some implementation details (open to change these; please comment/review):
> - initial data source will be raw text or (better yet) some kind of automatically generated data.
> - the source will initially go in bigtop/blueprints
> - the application sources can be in any modern JVM language (java, scala, groovy, clojure), since bigtop supports scala, java, and groovy natively already, and clojure is easy to support with the right jars.  
> - each "job" will be named according to the corresponding DAG of the big data pipeline. 
> - all jobs should (not sure if requirement?) be controlled by a global program (maybe oozie?) which runs the tasks in order, and can easily be customized to use different tools at different stages. 
> - for now, all outputs will be to files: so that users don't require servers to run the app. 
> - final data sinks will be into a highly available transaction oriented store (solr/hbase/...)
> This ticket will be completed once a first iteration of BigPetStore is complete using 3 ecosystem components, along with a depiction of the pipeline which can be used for development.
> I've assigned this to myself :) I hope that's okay? Seems like at the moment I'm the only one working on it. 
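The "global program" idea in the implementation details above (run the named stages of the DAG in order, with each stage swappable for a different tool) can be sketched as a small sequential driver. This is an assumed illustration of the concept, not oozie and not BigPetStore code; the stage names and bodies are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the global-driver idea above: each named stage of
// the pipeline runs in insertion order, and a stage can be swapped for a
// different tool (pig vs. hive, etc.) without touching the driver itself.
public class PipelineDriver {
    private final Map<String, Runnable> stages = new LinkedHashMap<>();

    public PipelineDriver addStage(String name, Runnable stage) {
        stages.put(name, stage);
        return this;
    }

    public void run() {
        for (Map.Entry<String, Runnable> e : stages.entrySet()) {
            System.out.println("running stage: " + e.getKey());
            e.getValue().run();
        }
    }

    public static void main(String[] args) {
        new PipelineDriver()
            .addStage("generate", () -> System.out.println("  generate raw transactions"))
            .addStage("aggregate", () -> System.out.println("  aggregate with pig or hive"))
            .addStage("sink", () -> System.out.println("  write results to files"))
            .run();
    }
}
```

In a real deployment this role would likely fall to a workflow engine such as oozie, as the ticket suggests; the sketch only shows the ordering-and-substitution contract a driver would need.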



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)