Posted to dev@mahout.apache.org by Peng Cheng <pc...@uowmail.edu.au> on 2013/10/21 04:22:19 UTC

Sebastian: On the subject of an efficient in-memory DataModel for the recommendation engine.

Hi Sebastian,

Sorry I dropped out of the Hangout for a few minutes; when I got back 
it was already over :<

Well, let's continue the conversation on the DataModel improvement:

I was looking into your KDDCupFactorizablePreferences and found that 
it doesn't load any data into memory; the only data structure in that 
class is the dataFile used to generate a stream of preferences from 
disk. I think this is why you can run it in 1 GB of memory without a 
heap-space overflow.
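
Roughly, I understand the streaming approach like this (just a sketch, 
not your actual code -- I'm assuming a plain "userID,itemID,rating" 
line format rather than the real KDD Cup parser, and process() is a 
made-up placeholder):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class StreamingPreferenceReader {

      // Stream preferences one at a time from disk; heap usage stays
      // tiny regardless of dataset size, but there is no random access.
      public static void stream(String dataFile) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader(dataFile))) {
          String line;
          while ((line = in.readLine()) != null) {
            String[] tokens = line.split(",");
            long userID = Long.parseLong(tokens[0]);
            long itemID = Long.parseLong(tokens[1]);
            float value = Float.parseFloat(tokens[2]);
            process(userID, itemID, value);  // hand off, then forget
          }
        }
      }

      private static void process(long userID, long itemID, float value) {
        // placeholder: whatever the factorizer does with a single rating
      }
    }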

However, I think it only saves memory at the expense of a lot of 
things (e.g. random access; random insert, delete and update; 
concurrency). That justifies the necessity of loading things into 
memory. Theoretically, a preference array of Netflix size will cost at 
least:

[8 bytes (userID : long) + 8 bytes (itemID : long) + 4 bytes (value : 
float)] * 100,480,507 = 2,009,610,140 bytes ≈ 1,916.51 MB ≈ 1.87 GB

...plus overhead. I would rather it be a bit bigger in exchange for 
O(1) random access/update, though not as big as the current row/column 
sparse-matrix-style implementation, which duplicates everything.
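
Concretely, what I have in mind is something like three parallel 
primitive arrays (again just a sketch with made-up names; the real 
thing would still need a per-user index on top of this, which is part 
of the overhead I mean):

    // Parallel primitive arrays cost exactly 8 + 8 + 4 = 20 bytes per
    // preference (plus array headers), matching the estimate above,
    // and give O(1) access/update by index.
    public class PackedPreferenceArray {

      private final long[] userIDs;
      private final long[] itemIDs;
      private final float[] values;

      public PackedPreferenceArray(int numPreferences) {
        this.userIDs = new long[numPreferences];  // 8 bytes each
        this.itemIDs = new long[numPreferences];  // 8 bytes each
        this.values = new float[numPreferences];  // 4 bytes each
      }

      public void set(int index, long userID, long itemID, float value) {
        userIDs[index] = userID;
        itemIDs[index] = itemID;
        values[index] = value;
      }

      public float getValue(int index) {  // O(1) random access
        return values[index];
      }

      public void setValue(int index, float value) {  // O(1) update
        values[index] = value;
      }
    }

For Netflix-sized data that is two long[100,480,507] plus one 
float[100,480,507], i.e. the ~1.87 GB computed above plus a few bytes 
of array header.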

That's my concern. I have several ideas for optimizing my in-memory 
DataModel but never had time to implement them :< Please give me a few 
more weeks; once the code is optimized to the teeth and supports 
concurrent access, I'll submit it again for review. Gokhan has also 
done a lot of work on this part, so it's good to have many options.

Yours, Peng