Posted to dev@bigtop.apache.org by "jay vyas (JIRA)" <ji...@apache.org> on 2014/07/07 16:18:34 UTC
[jira] [Comment Edited] (BIGTOP-1366) Updated, Richer Model for Generating Data for BigPetStore

    [ https://issues.apache.org/jira/browse/BIGTOP-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053688#comment-14053688 ] 

jay vyas edited comment on BIGTOP-1366 at 7/7/14 2:18 PM:
----------------------------------------------------------

Thanks RJ.  TL;DR:

* RJ is working on making the dataset generation much more sophisticated, and plans to port it to Scala some day.  This is mostly *theoretical work* at the moment.

* A requirement is that this new model can be used in any paradigm, so *we will want to decouple the model implementation, if possible, from Spark*.

* This new model will (or at least, *can*) take into account *everything* when generating transactions:  product inventories, customer preferences, possibly even the temperature in each state, etc.  Thus it can be used to benchmark machine learning tools in very sparse environments.
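
To make the decoupling point concrete, here is a rough sketch of what an engine-agnostic model could look like. It is in Python only because the prototype lives in the iPython Notebook. Every name in it (Customer, Transaction, TransactionModel, run_locally) is hypothetical and not taken from the BigPetStore code, and the uniform random product choice is just a stand-in for the real model.

```python
import random
from dataclasses import dataclass
from typing import Iterator, List


@dataclass(frozen=True)
class Customer:
    customer_id: int
    state: str


@dataclass(frozen=True)
class Transaction:
    customer_id: int
    product: str


class TransactionModel:
    """Hypothetical model: plain Python objects in, plain Python objects out.

    Nothing here imports or references a cluster framework.
    """

    def __init__(self, products: List[str], seed: int) -> None:
        if not products:
            raise ValueError("need at least one product")
        self.products = list(products)
        self.seed = seed

    def generate(self, customer: Customer, count: int) -> Iterator[Transaction]:
        # Seed per customer so the output does not depend on which worker
        # (or which framework) happens to process this customer.
        rng = random.Random(self.seed * 1_000_003 + customer.customer_id)
        for _ in range(count):
            # Placeholder logic: a uniform pick stands in for the real model.
            yield Transaction(customer.customer_id, rng.choice(self.products))


def run_locally(
    model: TransactionModel, customers: List[Customer], per_customer: int
) -> List[Transaction]:
    """Driver that uses no cluster framework at all."""
    return [t for c in customers for t in model.generate(c, per_customer)]
```

Under these assumptions, a Spark driver would only need a thin adapter, for example mapping an RDD of customers through model.generate with flatMap, while a MapReduce or single-machine driver could reuse the same class unchanged.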

Thanks again for doing this.  In the interim, I think it would be great if you could chime in on the primitive models we are currently using.  They aren't as advanced as this, but with your feedback we will at least be able to keep placeholders in the code wherever possible to pave the way for things to come.



was (Author: jayunit100):
Thanks RJ.  Tl:DR 

* RJ is working on making the dataset generation much more sophisticated, and plans to port it to scala some day.    This is mostly theoretical work at the moment. 

* A requirement is that this new model can be used in any paradigm : So *we will want to decouple the model implementation, if possible, from spark*.

* This new model will take into account everything:  product inventories, customer preferences, possibly even temperature of states etc when generating transactions.  Thus it can be used to benchmark machine learning tools in very sparse environments.  

Thanks again for doing this.  In the interim, I think it would be great if you could chime in on the primitive models which we are currently using - although they aren't as advanced as this - if we have feedback we will at least be able to keep placeholders in the code wherever possible to pave the way for things to come.


> Updated, Richer Model for Generating Data for BigPetStore 
> ----------------------------------------------------------
>
>                 Key: BIGTOP-1366
>                 URL: https://issues.apache.org/jira/browse/BIGTOP-1366
>             Project: Bigtop
>          Issue Type: Improvement
>          Components: Blueprints
>    Affects Versions: backlog
>            Reporter: RJ Nowling
>            Priority: Minor
>   Original Estimate: 8,736h
>  Remaining Estimate: 8,736h
>
> BigPetStore uses synthetic data as the basis for its workflow.  BPS's current model for generating customer data is sufficient for basic testing of the Hadoop ecosystem, but the model is very basic and lacks sufficient complexity for embedding interesting patterns into the data.  As a result, more complex testing such as testing clustering algorithms in Mahout on non-trivial data is not currently possible.
> Efforts are currently underway to incrementally improve the current model (see BIGTOP-1271 and BIGTOP-1272).  However, creating a model that can incorporate realistic patterns and input data to generate rich customer/transaction data with interesting correlations will require a re-imagining of the current model and its framework.
> To support the improvements to the model in BigPetStore, I have been working on an alternative ab initio model, developed from scratch. Since the development of a new model involves substantial R&D work with more specialized tools (mathematical and plotting libraries), I'm doing the current work outside of BPS using the iPython Notebook environment.  Due to the long time frame, the model will be developed on a separate timeline to prevent slowing the development of BPS.  
> Once the model has stabilized, I will begin incorporating the model into BPS itself.  One option is to implement the model in Spark using Scala as a foundation for Spark support in BPS.



--
This message was sent by Atlassian JIRA
(v6.2#6252)