Posted to dev@mahout.apache.org by Peng Cheng <pc...@uowmail.edu.au> on 2013/10/21 04:22:19 UTC
Sebastian: On the subject of efficient In-memory DataModel for recommendation
engine.
Hi Sebastian,
Sorry I dropped out of the Hangout for a few minutes; by the time I got
back it was already over :<
Well, let's continue the conversation on the DataModel improvement:
I was looking into your KDDCupFactorizablePreferences and found that it
doesn't load any data into memory; the only data structure in that class
is the dataFile used to generate a stream of preferences from the hard
disk. I think this is why you can load it into 1GB of memory without a
heap-space overflow.
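To make the streaming idea concrete, here is a minimal sketch of a disk-backed preference stream. This is an illustration only, not the actual KDDCupFactorizablePreferences code; the class name PreferenceStream and the "userID,itemID,value" CSV layout are assumptions for the example. The point is that Files.lines reads lazily, so only a small read buffer is resident at any time, regardless of file size:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical sketch: stream preferences from disk one line at a time,
// trading random access for a near-constant memory footprint.
public class PreferenceStream {

    static final class Preference {
        final long userID;
        final long itemID;
        final float value;
        Preference(long u, long i, float v) { userID = u; itemID = i; value = v; }
    }

    // Parse one assumed "userID,itemID,value" CSV line.
    static Preference parse(String line) {
        String[] f = line.split(",");
        return new Preference(Long.parseLong(f[0]),
                              Long.parseLong(f[1]),
                              Float.parseFloat(f[2]));
    }

    // Files.lines is lazy: lines are read from disk as the stream is consumed,
    // so the whole file is never held in memory at once.
    static Stream<Preference> stream(Path file) throws IOException {
        return Files.lines(file).map(PreferenceStream::parse);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("prefs", ".csv");
        Files.write(tmp, java.util.Arrays.asList("1,10,3.5", "2,20,4.0"));
        try (Stream<Preference> s = stream(tmp)) {
            System.out.println(s.count()); // 2
        }
        Files.delete(tmp);
    }
}
```

The try-with-resources block matters: Files.lines keeps the file handle open until the stream is closed.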
However, I think it saves memory only at the expense of many other
things (e.g. random access; random insert, delete and update;
concurrency). That justifies the necessity of loading things into
memory. Theoretically, a preference array of Netflix size will cost at
least:
[8 bytes (userID : long) + 8 bytes (itemID : long) + 4 bytes (value :
float)] * 100,480,507 = 2,009,610,140 bytes ≈ 1.87 GB (1,916.5 MB)
...plus overhead. I would rather it be a bit bigger to trade for O(1)
random access/update, but not too big like the current row/column
sparse matrix-ish implementation that duplicates everything.
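One way to get close to that 20-bytes-per-preference floor with O(1) indexed access is to pack the preferences into parallel primitive arrays instead of per-preference objects. The sketch below (the class name PackedPreferences is my own, not an existing Mahout class) shows the layout and the arithmetic above:

```java
// Hypothetical sketch: store preferences column-wise in three parallel
// primitive arrays. Each slot costs 8 + 8 + 4 = 20 bytes of payload, plus
// only a fixed per-array overhead -- no per-preference object headers.
public class PackedPreferences {

    private final long[] userIDs;
    private final long[] itemIDs;
    private final float[] values;

    public PackedPreferences(int capacity) {
        userIDs = new long[capacity];
        itemIDs = new long[capacity];
        values = new float[capacity];
    }

    // O(1) random update by index.
    public void set(int i, long userID, long itemID, float value) {
        userIDs[i] = userID;
        itemIDs[i] = itemID;
        values[i] = value;
    }

    // O(1) random access by index.
    public float getValue(int i) {
        return values[i];
    }

    // Payload estimate from the email: (8 + 8 + 4) bytes per preference.
    public static long estimatedBytes(long numPreferences) {
        return (8L + 8L + 4L) * numPreferences;
    }

    public static void main(String[] args) {
        PackedPreferences p = new PackedPreferences(2);
        p.set(0, 42L, 7L, 4.5f);
        System.out.println(p.getValue(0));                  // 4.5
        System.out.println(estimatedBytes(100_480_507L));   // 2009610140
    }
}
```

Sorting the arrays by userID would additionally allow O(log n) lookup by user via binary search, at the cost of making inserts more expensive.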
That's my concern. I have several ideas for optimizing my in-memory
DataModel but never had time to implement them :< Please give me a few
more weeks; once the code is optimized to the teeth and supports
concurrent access, I'll submit it again for review. Gokhan has also
done a lot of work on this part, so it's good to have many options.
Yours, Peng