Posted to user@mahout.apache.org by Onur Kuru <ku...@gmail.com> on 2012/11/15 16:23:45 UTC

A problem using MongoDBDataModel

Hello!

I exported the Netflix data to a MongoDB database and then tried to build a MongoDBDataModel over it, but it is taking too long. Inspecting the MongoDBDataModel class, I found that it converts each string ID to a long, because MongoDB stores user_id and item_id as strings while Mahout uses long IDs.

MongoDBDataModel stores these conversions in another collection: as it iterates over all the documents in the ratings collection, it checks this conversion collection to see whether a long ID has already been assigned to each string ID (user and item). I think this check-and-create-if-necessary lookup becomes significant overhead when the data set is large.
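For illustration, the kind of string-to-long mapping described above can be sketched as an in-memory dictionary that hands out sequential longs, caching assignments locally so each lookup is a hash-map hit rather than a query against a conversion collection. This is a hypothetical sketch of the technique, not Mahout's actual API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory ID translator: assigns each string ID a
// sequential long the first time it is seen, mimicking the mapping
// MongoDBDataModel keeps in its conversion collection, but cached
// locally so repeat lookups avoid a database round trip.
public class IdTranslator {
    private final Map<String, Long> stringToLong = new HashMap<>();
    private long nextId = 0L;

    public long toLong(String stringId) {
        // computeIfAbsent assigns a new long only on first sight of the key
        return stringToLong.computeIfAbsent(stringId, k -> nextId++);
    }

    public static void main(String[] args) {
        IdTranslator users = new IdTranslator();
        long a = users.toLong("user_abc"); // first sight -> 0
        long b = users.toLong("user_xyz"); // first sight -> 1
        long c = users.toLong("user_abc"); // cached      -> 0
        System.out.println(a + " " + b + " " + c); // prints "0 1 0"
    }
}
```

Preloading such a map in one pass over the conversion collection would replace the per-document query with an O(1) lookup, at the cost of holding the whole mapping in memory.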

Is there a solution to this included in Mahout, or do I have to write my own optimized code?

Regards,
Onur

Re: A problem using MongoDBDataModel

Posted by Sean Owen <sr...@gmail.com>.
I don't think this implementation is going to be practical at any
significant scale; it's more of a toy implementation that reads into
memory. You're welcome to propose a speedup patch if it doesn't break the
semantics. I would not use Mongo this way, nor would I probably use the
Netflix data set as-is in a non-distributed setup.
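One direction such a speedup patch could take (a sketch only, under the assumption that the conversion collection exists solely to map strings to longs): derive the long deterministically by hashing the string ID, so no conversion collection needs to be consulted at all. The trade-off is that hash collisions become possible and the original string cannot be recovered from the long:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical stateless alternative: hash each string ID straight to a
// long. No lookup collection or in-memory map is needed, and the result
// is stable across runs, but collisions are possible and the mapping is
// not reversible.
public class HashedIds {
    public static long toLong(String stringId) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(stringId.getBytes(StandardCharsets.UTF_8));
            long h = 0L;
            for (int i = 0; i < 8; i++) {       // fold the first 8 digest
                h = (h << 8) | (d[i] & 0xFFL);  // bytes into one 64-bit value
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always available
        }
    }

    public static void main(String[] args) {
        // Same input always yields the same long; no state is kept anywhere.
        System.out.println(toLong("user_abc") == toLong("user_abc"));
        System.out.println(toLong("user_abc") != toLong("user_xyz"));
    }
}
```

Whether this preserves the model's semantics (for example, anywhere the reverse string lookup is needed) is exactly the kind of question a patch would have to answer.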

