Posted to user@mahout.apache.org by Otis Gospodnetic <ot...@yahoo.com> on 2008/10/22 18:52:41 UTC

Trimming Taste input (memory consumption)

Hi,

I've finally fed Taste some real data (in terms of volume, users, and item preference distribution) and quickly hit the memory limits of my development laptop. :) Now I'm trying to see what, if anything, I can trim from the input set (the user,item,rating triplets) to lower memory consumption. N.B. I don't actually have rating information: my ratings are all just "1.0", indicating that the item has been seen/read/consumed.

I ran one of these to see the item popularity distribution:
$ cut -d, -f2 input.txt | sort | uniq -c | sort -rn | less

And quickly saw the expected Zipfian distribution: a big head of several very popular items and a loooong tail of items that have been seen/read/consumed only a few times.

So here are my questions:
- Is there a point in keeping and loading very unpopular items (e.g.
the ones read only once)? I think keeping those might help very few
people discover very obscure items, so removing them will hurt this
small subset of people a bit, but this will not affect the majority of
people. Is this thinking correct?

- I'm dealing with items whose freshness matters. I don't want to recommend items older than N days - think news stories. Assume I have the age of each item. I could certainly remove old items, since I never want to recommend them, but if I remove them, won't that hurt the quality of recommendations, simply because I'll lose users' "item consumption history"?

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

Re: Trimming Taste input (memory consumption)

Posted by Sean Owen <sr...@gmail.com>.

On Wed, Oct 22, 2008 at 5:52 PM, Otis Gospodnetic
<ot...@yahoo.com> wrote:
> So here are my questions:
> - Is there a point in keeping and loading very unpopular items (e.g.
> the ones read only once)?  I think keeping those might help very few
> people discover very obscure items, so removing them will hurt this
> small subset of people a bit, but this will not affect the majority of
> people.  Is this thinking correct?

I agree, it makes sense to trim data in this way. I tried to build in
"levers" of this sort in several places in the code. If you mention
which implementation you are using, I can recommend some parameters to
look at.
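
As a rough illustration of trimming the tail before loading, here is a minimal Python sketch (not part of Taste). It assumes the input is the user,item,rating triplet format described above; the function name trim_unpopular and the min_count threshold are hypothetical, and the whole input is held in memory, which is fine for a sketch but not for very large files.

```python
from collections import Counter

def trim_unpopular(lines, min_count=2):
    """Drop 'user,item,rating' lines whose item occurs fewer than min_count times."""
    # Parse non-empty lines into [user, item, rating] fields.
    rows = [line.strip().split(",") for line in lines if line.strip()]
    # Count how many times each item (second field) appears.
    counts = Counter(row[1] for row in rows)
    # Keep only rows whose item meets the popularity threshold.
    return [",".join(row) for row in rows if counts[row[1]] >= min_count]

# Hypothetical sample data: item "a" seen 3 times, "b" once, "c" twice.
sample = [
    "u1,a,1.0",
    "u2,a,1.0",
    "u2,b,1.0",
    "u3,a,1.0",
    "u3,c,1.0",
    "u4,c,1.0",
]
trimmed = trim_unpopular(sample, min_count=2)
```

With min_count=2 the single line for item "b" is dropped and the other five lines are kept in their original order.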

> - I'm dealing with items whose freshness matters.  I don't want to recommend items older than N days - think news stories.  Assume I have the age of each item.  I could certainly remove old items, since I never want to recommend them, but if I remove them, won't that hurt the quality of recommendations, simply because I'll lose users' "item consumption history"?

Yes, they are still valuable data points even if they are not
recommendable items. You can use a Rescorer to exclude items from
recommendations according to any criteria you like. This is easier and
more efficient than filtering after the fact.
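
To make the idea concrete, here is a small Python sketch of the general pattern only. It does not use Taste's actual Rescorer interface: the toy co-occurrence scoring, the recommend function, the too_old predicate, and all sample data are assumptions made up for illustration. The point it shows is that old items stay in the user histories as evidence, while the filter keeps them out of the results during scoring.

```python
from collections import Counter

def recommend(histories, user, is_filtered, how_many=3):
    """Toy co-occurrence recommender (hypothetical, not Taste's algorithm)."""
    seen = histories[user]
    scores = Counter()
    for other, items in histories.items():
        if other == user:
            continue
        # Old items still count as evidence of shared taste.
        overlap = len(seen & items)
        if overlap == 0:
            continue
        for item in items - seen:
            # Exclude unwanted items while scoring, not after the fact.
            if not is_filtered(item):
                scores[item] += overlap
    # Highest score first; ties broken by item name for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:how_many]]

# Hypothetical data: item ages in days and per-user consumption histories.
age_days = {"old_story": 30, "older_story": 45,
            "fresh_a": 1, "fresh_b": 2, "fresh_c": 3}
max_age_days = 7

def too_old(item):
    return age_days[item] > max_age_days

histories = {
    "u1": {"old_story", "fresh_a"},
    "u2": {"old_story", "fresh_b", "older_story"},
    "u3": {"fresh_a", "fresh_c"},
}
recs = recommend(histories, "u1", too_old)
```

Here "fresh_b" is reachable for u1 only through the shared "old_story", which is why dropping old items from the input would lose that recommendation, while "older_story" itself is never returned.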

Sean