Posted to dev@bigtop.apache.org by "RJ Nowling (JIRA)" <ji...@apache.org> on 2014/11/07 21:02:34 UTC
[jira] [Commented] (BIGTOP-1366) Updated, Richer Model for Generating Data for BigPetStore

    [ https://issues.apache.org/jira/browse/BIGTOP-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14202566#comment-14202566 ] 

RJ Nowling commented on BIGTOP-1366:
------------------------------------

Hi all,

Just an update.  I have an initial Spark driver for the data generator:

https://github.com/rnowling/bigpetstore-data-generator/blob/javaport/src/java/bps-data-generator/spark_driver/src/main/scala/com/github/rnowling/bps/datagenerator/spark/Driver.scala

I'm using the Spark driver to test out the API.  My goal is to have a handful of high-level generator classes that need to be called in each parallel step.  These will be supported by high-level data readers and data models.  This way, the data generator can easily be used in MapReduce, Spark, or CLI drivers without the drivers needing to know the details of the underlying methods.  I'm almost there and just need a few more cosmetic changes.
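
To illustrate the idea of a driver-agnostic generator, here is a minimal sketch in Python.  It is not the actual API from the javaport branch: the names Customer, CustomerGenerator, cli_driver and mapped_driver are hypothetical, and the list of names stands in for whatever a data reader would load.  The only point is that each parallel step makes one high-level call, and a per-partition seed keeps the output reproducible regardless of which driver runs it.

```python
import random
from dataclasses import dataclass
from typing import Iterator, List


@dataclass(frozen=True)
class Customer:
    # Hypothetical record type; the real data model is richer.
    customer_id: int
    name: str


class CustomerGenerator:
    """Hypothetical high-level generator: one call per parallel step."""

    def __init__(self, names: List[str], seed: int) -> None:
        self.names = names  # assumed to come from a data reader
        self.seed = seed

    def generate(self, partition: int, count: int) -> Iterator[Customer]:
        # Derive a reproducible RNG for each partition so that every
        # driver produces the same records for the same inputs.
        rng = random.Random(self.seed * 1_000_003 + partition)
        for i in range(count):
            yield Customer(partition * count + i, rng.choice(self.names))


def cli_driver(gen: CustomerGenerator, partitions: int, count: int) -> List[Customer]:
    # Sequential driver: loop over the partitions in a single process.
    return [c for p in range(partitions) for c in gen.generate(p, count)]


def mapped_driver(gen: CustomerGenerator, partitions: int, count: int) -> List[Customer]:
    # Stand-in for a parallel driver: the same per-partition call is
    # applied through map() over the partition indices.
    batches = map(lambda p: list(gen.generate(p, count)), range(partitions))
    return [c for batch in batches for c in batch]
```

In this sketch neither driver knows how a customer is built; both only choose how the partition indices are scheduled.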

Note that I'm using the javaport branch for my current work -- eventually I'll merge this into the master branch and mark it as a v0.2 release.  I should be able to release it and make it available in Bigtop once I clean up the Spark driver and API a bit.

> Updated, Richer Model for Generating Data for BigPetStore 
> ----------------------------------------------------------
>
>                 Key: BIGTOP-1366
>                 URL: https://issues.apache.org/jira/browse/BIGTOP-1366
>             Project: Bigtop
>          Issue Type: Improvement
>          Components: blueprints
>    Affects Versions: backlog
>            Reporter: RJ Nowling
>            Assignee: RJ Nowling
>            Priority: Minor
>   Original Estimate: 8,736h
>  Remaining Estimate: 8,736h
>
> BigPetStore uses synthetic data as the basis for its workflow.  BPS's current model for generating customer data is sufficient for basic testing of the Hadoop ecosystem, **but the model is very basic and lacks sufficient complexity for embedding interesting patterns into the data**.  
> As a result, **more complex, scalable testing such as testing clustering algorithms in Mahout on non-trivial data or multidimensional data with factors influencing it** is not currently possible.
> Efforts are currently underway to incrementally improve the current model (see BIGTOP-1271 and BIGTOP-1272).  
> Creating a model that can incorporate **realistic, non-hierarchical patterns** and input data to generate rich customer/transaction data with interesting correlations will require a re-imagining of the current model and its framework.
> To support the improvements to the model in BigPetStore, I have been working on an **alternative ab initio model, developed from scratch**. Since the development of a new model involves substantial R&D work with more specialized tools (mathematical and plotting libraries), I'm doing the current work outside of BPS using the IPython Notebook environment.  Due to the long time frame, the model will be developed on a separate timeline to prevent slowing the development of BPS.  
> Once the model has stabilized, I will begin incorporating the model into BPS itself.  One option is to implement the model in Scala for clean integration with **Spark**, which is likely to play an increasingly important role in the Hadoop ecosystem and thus will be an important part of BigPetStore as a test/blueprint app.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)