Posted to dev@bigtop.apache.org by "bhashit parikh (JIRA)" <ji...@apache.org> on 2014/06/09 13:27:02 UTC

[jira] [Comment Edited] (BIGTOP-1272) BigPetStore: Productionize the Mahout recommender

    [ https://issues.apache.org/jira/browse/BIGTOP-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14021928#comment-14021928 ] 

bhashit parikh edited comment on BIGTOP-1272 at 6/9/14 11:26 AM:
-----------------------------------------------------------------

So far, I have finished up to step 2, and I have been working on steps 3 and 4. Here are some thoughts on how we can go about that:
# Instead of hashing users and products, we assign unique ids to both. This more closely resembles a real-life scenario, where both types of data are generally stored in relational databases. It also saves us from depending on the hashes to decode the user and product information after Mahout is done processing.
# Since we want to use the output of a Pig script as the input for Mahout, we need to change the current data generation code to include user ids and product ids. If we change the data generation part, we also need to change the current Pig-related code to handle the new format.
# As discussed with [~jayunit100], I'm also working on making the association between states and products more modular.
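
To illustrate the first point, here is a minimal sketch of assigning unique ids instead of hashing. It is not the BigPetStore code: the IdRegistry class, the to_recommender_rows helper, and the sample names are all hypothetical, and the "userId,productId,1" row layout is only an assumption about what the recommender input could look like.

```python
from typing import Dict, Hashable, Iterable, List, Tuple


class IdRegistry:
    """Hands out sequential integer ids and remembers the reverse mapping.

    Hypothetical helper for illustration only; not part of BigPetStore.
    """

    def __init__(self) -> None:
        self._id_by_key: Dict[Hashable, int] = {}
        self._key_by_id: List[Hashable] = []

    def id_for(self, key: Hashable) -> int:
        # Reuse the id if the key was seen before, otherwise assign the next one.
        if key not in self._id_by_key:
            self._id_by_key[key] = len(self._key_by_id)
            self._key_by_id.append(key)
        return self._id_by_key[key]

    def key_for(self, assigned_id: int) -> Hashable:
        # Direct lookup: no hash needs to be reversed to recover the original key.
        return self._key_by_id[assigned_id]


def to_recommender_rows(
    purchases: Iterable[Tuple[str, str]],
    users: IdRegistry,
    products: IdRegistry,
) -> List[str]:
    # Assumed output layout: "userId,productId,1", where 1 marks a purchase.
    return [f"{users.id_for(u)},{products.id_for(p)},1" for u, p in purchases]
```

Because each registry keeps the id-to-key list, a recommended product id can be turned back into a product name with products.key_for(some_id) once processing is done, which is the decoding step that hashing would otherwise complicate.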



was (Author: bhashit):
So far, I have finished up to step 2, and I have been working on steps 3 and 4. Here are some thoughts on how we can go about that:
# Instead of hashing users and products, we assign unique ids to both. This more closely resembles a real-life scenario, where both types of data are generally stored in relational databases. It also saves us from depending on the hashes to decode the user and product information after Mahout is done processing.
# Since we want to use the output of a Pig script as the input for Mahout, we need to change the current data generation code to include user ids and product ids. If we change the data generation part, we also need to change the current Pig-related code to handle the new format.
# As discussed with [~jayunit100], I'm also working on making the association between states and products more modular.

About assigning ids to users and products: there

> BigPetStore: Productionize the Mahout recommender
> -------------------------------------------------
>
>                 Key: BIGTOP-1272
>                 URL: https://issues.apache.org/jira/browse/BIGTOP-1272
>             Project: Bigtop
>          Issue Type: New Feature
>          Components: Blueprints
>    Affects Versions: backlog
>            Reporter: jay vyas
>         Attachments: arch.jpeg
>
>
> BIGTOP-1271 adds patterns into the data that guarantee that a meaningful type of product recommendation can be given for at least *some* customers, since we know that there are going to be many customers who only bought 1 product, and also customers who bought 2 or more products, even in a dataset of size 10, due to the Gaussian distribution of purchases that is also in the dataset generator.
> The current Mahout recommender code is statically valid: it runs to completion in local unit tests if a Hadoop 1x tarball is present, but otherwise it hasn't been tested at scale. So, let's get it working. This JIRA will also comprise:
> - deciding whether to use Mahout 2x for unit tests (the default on the Mahout Maven repo is the 1x impl) and whether or not Bigtop should host a Mahout 2x jar. After all, Bigtop builds a Mahout 2x jar as part of its packaging process, and BigPetStore might thus need a Mahout 2x jar in order to test against the right set of Bigtop releases.



--
This message was sent by Atlassian JIRA
(v6.2#6252)