Posted to user@mahout.apache.org by Josh Devins <hi...@joshdevins.com> on 2013/03/05 22:15:43 UTC

Top-N recommendations from SVD

Hi all,

I have a conceptually simple problem. A user-item matrix, A, whose
dimensions are ~2.6M rows x ~2.8M cols (~65M non-zeros). Running ALS with
20 features factorizes this in the usual way into A ≈ UM'. Generating
top-n (where n=100) recommendations for all users in U is quite a long
process though. Essentially, for every user, it generates a prediction for
every unrated item in M and then takes the top-n (all in memory). I'm using
the standard ALS `RecommenderJob` for this.
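
For concreteness, the per-user work amounts to roughly the following (a
minimal sketch of the brute-force scoring, not the actual `RecommenderJob`
code; the array layout and names are illustrative):

  // For one user: predict a rating for every item, i.e. a dot product of two
  // length-20 feature vectors per item. With ~2.8M items that is roughly 56M
  // multiply-adds per user, repeated for ~2.6M users; the top n of these
  // predictions become the user's recommendations.
  static double[] predictAll(double[] userFeatures, double[][] itemFeatures) {
    double[] predictions = new double[itemFeatures.length];
    for (int i = 0; i < itemFeatures.length; i++) {
      double p = 0;
      for (int f = 0; f < userFeatures.length; f++) {
        p += userFeatures[f] * itemFeatures[i][f];
      }
      predictions[i] = p;
    }
    return predictions;
  }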

Considering that there are ~2.6M users and ~2.8M items, this is a really,
really time-consuming way to find the top-n recommendations for all users
in U. I feel like there could be a tricky way to avoid having to compute
all item predictions for a user though. I can't find any reference in
papers about improving this, but at the moment the estimate (with 10
mappers running the `RecommenderJob`) is ~80 hours. When I think about this
problem, I wonder if applying kNN or locality-sensitive (min-)hashing would
somehow help me: basically, find the nearest neighbours directly and
calculate predictions on those items only, not on every item in M. On the
flip side, I could start to reduce the item space, since it's quite large,
by removing items that have low in-degrees, since these probably don't
contribute much to the final recommendations. I don't like this so much
though, as it could remove some of the long-tail recommendations. At least,
that is my intuition :)

Thoughts anyone?

Thanks in advance,

Josh

Re: Top-N recommendations from SVD

Posted by Sebastian Schelter <ss...@apache.org>.
Hi Josh,

The factorization should be quite a bit faster with the current trunk,
as we reworked the QR decomposition used for solving the least squares
problems of ALS.
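
To make concrete what gets solved there: each half-iteration solves one
small regularized least-squares system per user (and analogously per item).
A rough sketch of the per-user solve with Mahout's math classes
(illustrative only, not the actual ParallelALSFactorizationJob code; the
weighted-lambda regularization on the diagonal is an assumption):

  import org.apache.mahout.math.Matrix;
  import org.apache.mahout.math.QRDecomposition;

  // Solve (Mu' Mu + lambda * nu * I) x = Mu' ru for one user, where Mu (nu x k)
  // stacks the feature vectors of the nu items the user rated and ru (nu x 1)
  // holds the corresponding ratings.
  static Matrix solveUserFeatures(Matrix Mu, Matrix ru, double lambda) {
    int k = Mu.numCols();                           // number of latent features
    int nu = Mu.numRows();                          // number of observed ratings
    Matrix lhs = Mu.transpose().times(Mu);          // k x k
    for (int i = 0; i < k; i++) {
      lhs.setQuick(i, i, lhs.getQuick(i, i) + lambda * nu);   // regularization
    }
    Matrix rhs = Mu.transpose().times(ru);          // k x 1
    return new QRDecomposition(lhs).solve(rhs);     // k x 1 user feature vector
  }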

I think we can also remove a lot of object instantiations in
ParallelALSFactorizationJob.

/s

On 06.03.2013 11:25, Josh Devins wrote:
> So the 80 hour estimate is _only_ for the U*M', top-n calculation and not
> the factorization. Factorization is on the order of 2-hours. For the
> interested, here's the pertinent code from the ALS `RecommenderJob`:
> 
> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.mahout/mahout-core/0.7/org/apache/mahout/cf/taste/hadoop/als/RecommenderJob.java?av=f#148
> 
> I'm sure this can be optimised, but by an order of magnitude? Something to
> try out, I'll report back if I find anything concrete.
> 
> 
> 
> On 6 March 2013 11:13, Ted Dunning <te...@gmail.com> wrote:
> 
>> Well, it would definitely not be the first time I counted incorrectly.
>>  Anytime I do arithmetic the result should be considered suspect.  I do
>> think my numbers are correct, but then again, I always do.
>>
>> But the OP did say 20 dimensions which gives me back 5x.
>>
>> Inclusion of learning time is a good suspect.  On the other side of the
>> ledger, if the multiply is doing any column wise access it is a likely
>> performance bug.  The computation is AB'. Perhaps you refer to rows of B
>> which are the columns of B'.
>>
>> Sent from my sleepy thumbs set to typing on my iPhone.
>>
>> On Mar 6, 2013, at 4:16 AM, Sean Owen <sr...@gmail.com> wrote:
>>
>>> If there are 100 features, it's more like 2.6M * 2.8M * 100 = 728 Tflops
>> --
>>> I think you're missing an "M", and the features by an order of magnitude.
>>> That's still 1 day on an 8-core machine by this rule of thumb.
>>>
>>> The 80 hours is the model building time too (right?), not the time to
>>> multiply U*M'. This is dominated by iterations when building from
>> scratch,
>>> and I expect took 75% of that 80 hours. So if the multiply was 20 hours
>> --
>>> on 10 machines -- on Hadoop, then that's still slow but not out of the
>>> question for Hadoop, given it's usually a 3-6x slowdown over a parallel
>>> in-core implementation.
>>>
>>> I'm pretty sure what exists in Mahout here can be optimized further at
>> the
>>> Hadoop level; I don't know that it's doing the multiply badly though. In
>>> fact I'm pretty sure it's caching cols in memory, which is a bit of
>>> 'cheating' to speed up by taking a lot of memory.
>>>
>>>
>>> On Wed, Mar 6, 2013 at 3:47 AM, Ted Dunning <te...@gmail.com>
>> wrote:
>>>
>>>> Hmm... each users recommendations seems to be about 2.8 x 20M Flops =
>> 60M
>>>> Flops.  You should get about a Gflop per core in Java so this should
>> about
>>>> 60 ms.  You can make this faster with more cores or by using ATLAS.
>>>>
>>>> Are you expecting 3 million unique people every 80 hours?  If no, then
>> it
>>>> is probably more efficient to compute the recommendations on the fly.
>>>>
>>>> How many recommendations per second are you expecting?  If you have 1
>>>> million uniques per day (just for grins) and we assume 20,000 s/day to
>>>> allow for peak loading, you have to do 50 queries per second peak.  This
>>>> seems to require 3 cores.  Use 16 to be safe.
>>>>
>>>> Regarding the 80 hours, 3 million x 60ms = 180,000 seconds = 50 hours.
>>  I
>>>> think that your map-reduce is under performing by about a factor of 10.
>>>> This is quite plausible with bad arrangement of the inner loops.  I
>> think
>>>> that you would have highest performance computing the recommendations
>> for a
>>>> few thousand items by a few thousand users at a time.  It might be just
>>>> about as fast to do all items against a few users at a time.  The reason
>>>> for this is that dense matrix multiply requires c n x k + m x k memory
>> ops,
>>>> but n x k x m arithmetic ops.  If you can re-use data many times, you
>> can
>>>> balance memory channel bandwidth against CPU speed.  Typically you need
>> 20
>>>> or more re-uses to really make this fly.
>>>>
>>>>
>>
> 


Re: Top-N recommendations from SVD

Posted by Sebastian Schelter <ss...@apache.org>.
Looking forward to your debugging results. Could you have a look at the
GC behavior? Maybe we should remove the per-tuple object instantiation
in line 153:

topKItems.offer(new GenericRecommendedItem(itemID, (float) predictedRating));
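
One cheap way to get rid of most of those allocations (a sketch only;
peek() is assumed to expose the queue's smallest element and k its
capacity, neither of which the 0.7 code has in exactly this form) would be
to check the candidate score first and only construct the item when it
qualifies:

  // Only allocate when the candidate actually beats the current worst of the top k.
  if (topKItems.size() < k || predictedRating > topKItems.peek().getValue()) {
    topKItems.offer(new GenericRecommendedItem(itemID, (float) predictedRating));
  }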

/s

On 06.03.2013 11:54, Josh Devins wrote:
> The factorization at 2-hours is kind of a non-issue (certainly fast
> enough). It was run with (if I recall correctly) 30 reducers across a 35
> node cluster, with 10 iterations.
> 
> I was a bit shocked at how long the recommendation step took and will throw
> some timing debug in to see where the problem lies exactly. There were no
> other jobs running on the cluster during these attempts, but it's certainly
> possible that something is swapping or the like. I'll be looking more
> closely today before I start to consider other options for calculating the
> recommendations.
> 
> 
> 
> On 6 March 2013 11:41, Sean Owen <sr...@gmail.com> wrote:
> 
>> Yeah that's right, he said 20 features, oops. And yes he says he's talking
>> about the recs only too, so that's not right either. That seems way too
>> long relative to factorization. And the factorization seems quite fast; how
>> many machines, and how many iterations?
>>
>> I thought the shape of the computation was to cache B' (yes whose columns
>> are B rows) and multiply against the rows of A. There again probably wrong
>> given the latest timing info.
>>
>>
>> On Wed, Mar 6, 2013 at 10:25 AM, Josh Devins <hi...@joshdevins.com> wrote:
>>
>>> So the 80 hour estimate is _only_ for the U*M', top-n calculation and not
>>> the factorization. Factorization is on the order of 2-hours. For the
>>> interested, here's the pertinent code from the ALS `RecommenderJob`:
>>>
>>>
>>>
>> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.mahout/mahout-core/0.7/org/apache/mahout/cf/taste/hadoop/als/RecommenderJob.java?av=f#148
>>>
>>> I'm sure this can be optimised, but by an order of magnitude? Something
>> to
>>> try out, I'll report back if I find anything concrete.
>>>
>>>
>>>
>>> On 6 March 2013 11:13, Ted Dunning <te...@gmail.com> wrote:
>>>
>>>> Well, it would definitely not be the first time I counted incorrectly.
>>>>  Anytime I do arithmetic the result should be considered suspect.  I do
>>>> think my numbers are correct, but then again, I always do.
>>>>
>>>> But the OP did say 20 dimensions which gives me back 5x.
>>>>
>>>> Inclusion of learning time is a good suspect.  On the other side of the
>>>> ledger, if the multiply is doing any column wise access it is a likely
>>>> performance bug.  The computation is AB'. Perhaps you refer to rows of
>> B
>>>> which are the columns of B'.
>>>>
>>>> Sent from my sleepy thumbs set to typing on my iPhone.
>>>>
>>>> On Mar 6, 2013, at 4:16 AM, Sean Owen <sr...@gmail.com> wrote:
>>>>
>>>>> If there are 100 features, it's more like 2.6M * 2.8M * 100 = 728
>>> Tflops
>>>> --
>>>>> I think you're missing an "M", and the features by an order of
>>> magnitude.
>>>>> That's still 1 day on an 8-core machine by this rule of thumb.
>>>>>
>>>>> The 80 hours is the model building time too (right?), not the time to
>>>>> multiply U*M'. This is dominated by iterations when building from
>>>> scratch,
>>>>> and I expect took 75% of that 80 hours. So if the multiply was 20
>> hours
>>>> --
>>>>> on 10 machines -- on Hadoop, then that's still slow but not out of
>> the
>>>>> question for Hadoop, given it's usually a 3-6x slowdown over a
>> parallel
>>>>> in-core implementation.
>>>>>
>>>>> I'm pretty sure what exists in Mahout here can be optimized further
>> at
>>>> the
>>>>> Hadoop level; I don't know that it's doing the multiply badly though.
>>> In
>>>>> fact I'm pretty sure it's caching cols in memory, which is a bit of
>>>>> 'cheating' to speed up by taking a lot of memory.
>>>>>
>>>>>
>>>>> On Wed, Mar 6, 2013 at 3:47 AM, Ted Dunning <te...@gmail.com>
>>>> wrote:
>>>>>
>>>>>> Hmm... each users recommendations seems to be about 2.8 x 20M Flops
>> =
>>>> 60M
>>>>>> Flops.  You should get about a Gflop per core in Java so this should
>>>> about
>>>>>> 60 ms.  You can make this faster with more cores or by using ATLAS.
>>>>>>
>>>>>> Are you expecting 3 million unique people every 80 hours?  If no,
>> then
>>>> it
>>>>>> is probably more efficient to compute the recommendations on the
>> fly.
>>>>>>
>>>>>> How many recommendations per second are you expecting?  If you have
>> 1
>>>>>> million uniques per day (just for grins) and we assume 20,000 s/day
>> to
>>>>>> allow for peak loading, you have to do 50 queries per second peak.
>>>  This
>>>>>> seems to require 3 cores.  Use 16 to be safe.
>>>>>>
>>>>>> Regarding the 80 hours, 3 million x 60ms = 180,000 seconds = 50
>> hours.
>>>>  I
>>>>>> think that your map-reduce is under performing by about a factor of
>>> 10.
>>>>>> This is quite plausible with bad arrangement of the inner loops.  I
>>>> think
>>>>>> that you would have highest performance computing the
>> recommendations
>>>> for a
>>>>>> few thousand items by a few thousand users at a time.  It might be
>>> just
>>>>>> about as fast to do all items against a few users at a time.  The
>>> reason
>>>>>> for this is that dense matrix multiply requires c n x k + m x k
>> memory
>>>> ops,
>>>>>> but n x k x m arithmetic ops.  If you can re-use data many times,
>> you
>>>> can
>>>>>> balance memory channel bandwidth against CPU speed.  Typically you
>>> need
>>>> 20
>>>>>> or more re-uses to really make this fly.
>>>>>>
>>>>>>
>>>>
>>>
>>
> 


Re: Top-N recommendations from SVD

Posted by Josh Devins <hi...@joshdevins.com>.
I'm running a job right now that uses your static `dot` method from your
previous post, on top of v0.7 (nothing from trunk). This has cut the time
down to about a third of what it was, but it's still around 500ms per user.
I'll give your latest patch a go hopefully tomorrow and report back.

We're working on another approach too. Will email you off thread if it
proves fruitful, perhaps whip up a patch as well.

Josh



On 7 March 2013 21:37, Sebastian Schelter <ss...@apache.org> wrote:

> Hi Josh,
>
> I made another attempt today. It directly computes the dot products,
> introduces a mutable version of RecommendedItem and uses Lucene's
> PriorityQueue to keep the top k.
>
> I hope this gives you some improvements.
>
> Here's the patch (must be applied against trunk):
>
>
> https://issues.apache.org/jira/secure/attachment/12572605/MAHOUT-1151-2.patch
>
> Best,
> Sebastian
>
> On 07.03.2013 16:00, Josh Devins wrote:
> > I ran from what's in trunk as of this morning. I didn't dig in further to
> > see where that extra time was coming from but can do so when I get some
> > time soon.
> >
> >
> > On 7 March 2013 15:56, Sebastian Schelter <ss...@apache.org> wrote:
> >
> >> Hi Josh,
> >>
> >> Did you run the patch from the jira issue or did you run the trunk? I
> >> made some follow up changes after uploading the patch. I can't imagine
> >> why those small changes would lead to an increase of 50% in the runtime.
> >>
> >> /s
> >>
> >>
> >>
> >> On 07.03.2013 15:02, Josh Devins wrote:
> >>> So the good news is that the patch runs ;)  The bad news is that it's
> >>> slower, going from 1600-1800ms to ~2500ms to calculate a single users'
> >> topK
> >>> recommendations. For kicks, I ran a couple other experiments,
> >> progressively
> >>> removing code to isolate the problem area. Results are detailed here:
> >>> https://gist.github.com/joshdevins/5106930
> >>>
> >>> Conclusions thus far:
> >>>  * the patch is not helpful (for performance) and should be reverted or
> >>> fixed again (sorry Sebastian)
> >>>  * the dot product operation in `Vector` is not efficient enough for
> >> large
> >>> vectors/matrices, when used as it is in the ALS `RecommenderJob`,
> inside
> >> a
> >>> loop over `M`
> >>>
> >>> I've tried a few other experiments with Colt (for example) but there
> was
> >> no
> >>> noticeable gain. Parallelizing inside the map task (manually or with
> >>> Parallel Colt) is possible but obviously is not ideal in an environment
> >>> like Hadoop -- this would save memory since you only need a few map
> tasks
> >>> loading the matrices, but isn't playing very nicely within a shared
> >> cluster
> >>> :)
> >>>
> >>> Next step at this point is to look at either reducing the number of
> items
> >>> to recommend over, LSH or a third secret plan that "the PhD's" are
> >> thinking
> >>> about. Paper forthcoming, no doubt :D
> >>>
> >>> @Sebastian, happy to run any patches on our cluster/dataset before
> making
> >>> more commits.
> >>>
> >>>
> >>>
> >>> On 6 March 2013 20:58, Josh Devins <hi...@joshdevins.com> wrote:
> >>>
> >>>> Got sidetracked today but I'll run Sebastian's version in trunk
> tomorrow
> >>>> and report back.
> >>>>
> >>>>
> >>>> On 6 March 2013 17:07, Sebastian Schelter <ss...@apache.org> wrote:
> >>>>
> >>>>> I already committed a fix in that direction. I modified our
> >>>>> FixedSizePriorityQueue to allow inspection of its head for direct
> >>>>> comparison. This obviates the need to instantiate a Comparable and
> >> offer
> >>>>> it to the queue.
> >>>>>
> >>>>> /s
> >>>>>
> >>>>>
> >>>>> On 06.03.2013 17:01, Ted Dunning wrote:
> >>>>>> I would recommend against a mutable object on maintenance grounds.
> >>>>>>
> >>>>>> Better is to keep the threshold that a new score must meet and only
> >>>>>> construct the object on need.  That cuts the allocation down to
> >>>>> negligible
> >>>>>> levels.
> >>>>>>
> >>>>>> On Wed, Mar 6, 2013 at 6:11 AM, Sean Owen <sr...@gmail.com> wrote:
> >>>>>>
> >>>>>>> OK, that's reasonable on 35 machines. (You can turn up to 70
> >> reducers,
> >>>>>>> probably, as most machines can handle 2 reducers at once).
> >>>>>>> I think the recommendation step loads one whole matrix into memory.
> >>>>> You're
> >>>>>>> not running out of memory but if you're turning up the heap size to
> >>>>>>> accommodate, you might be hitting swapping, yes. I think (?) the
> >>>>>>> conventional wisdom is to turn off swap for Hadoop.
> >>>>>>>
> >>>>>>> Sebastian yes that is probably a good optimization; I've had good
> >>>>> results
> >>>>>>> reusing a mutable object in this context.
> >>>>>>>
> >>>>>>>
> >>>>>>> On Wed, Mar 6, 2013 at 10:54 AM, Josh Devins <hi...@joshdevins.com>
> >>>>> wrote:
> >>>>>>>
> >>>>>>>> The factorization at 2-hours is kind of a non-issue (certainly
> fast
> >>>>>>>> enough). It was run with (if I recall correctly) 30 reducers
> across
> >> a
> >>>>> 35
> >>>>>>>> node cluster, with 10 iterations.
> >>>>>>>>
> >>>>>>>> I was a bit shocked at how long the recommendation step took and
> >> will
> >>>>>>> throw
> >>>>>>>> some timing debug in to see where the problem lies exactly. There
> >>>>> were no
> >>>>>>>> other jobs running on the cluster during these attempts, but it's
> >>>>>>> certainly
> >>>>>>>> possible that something is swapping or the like. I'll be looking
> >> more
> >>>>>>>> closely today before I start to consider other options for
> >> calculating
> >>>>>>> the
> >>>>>>>> recommendations.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>
> >>>
> >>
> >>
> >
>
>

Re: Top-N recommendations from SVD

Posted by Sebastian Schelter <ss...@apache.org>.
Hi Josh,

I made another attempt today. It directly computes the dot products,
introduces a mutable version of RecommendedItem and uses Lucene's
PriorityQueue to keep the top k.
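
In spirit, the scoring side looks roughly like this (a self-contained
sketch, not the patch itself; java.util.PriorityQueue stands in for
Lucene's PriorityQueue here, and MutableItem is an illustrative stand-in
for the mutable RecommendedItem):

  import java.util.Comparator;
  import java.util.PriorityQueue;

  // Compute the dot products directly and keep the top k in a bounded min-heap,
  // recycling the evicted head object instead of allocating a new item per candidate.
  static final class MutableItem {
    long itemID;
    float value;
  }

  static PriorityQueue<MutableItem> topK(double[] user, double[][] items, long[] itemIDs, int k) {
    PriorityQueue<MutableItem> queue =
        new PriorityQueue<MutableItem>(k, new Comparator<MutableItem>() {
          @Override
          public int compare(MutableItem a, MutableItem b) {
            return Float.compare(a.value, b.value);          // min-heap on predicted rating
          }
        });
    for (int i = 0; i < items.length; i++) {
      double dot = 0;
      for (int f = 0; f < user.length; f++) {
        dot += user[f] * items[i][f];                        // direct dot product, no Vector.dot()
      }
      if (queue.size() < k) {
        MutableItem item = new MutableItem();
        item.itemID = itemIDs[i];
        item.value = (float) dot;
        queue.add(item);
      } else if (dot > queue.peek().value) {
        MutableItem recycled = queue.poll();                 // reuse the evicted object
        recycled.itemID = itemIDs[i];
        recycled.value = (float) dot;
        queue.add(recycled);
      }
    }
    return queue;                                            // top-k items, unordered
  }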

I hope this gives you some improvements.

Here's the patch (must be applied against trunk):

https://issues.apache.org/jira/secure/attachment/12572605/MAHOUT-1151-2.patch

Best,
Sebastian

On 07.03.2013 16:00, Josh Devins wrote:
> I ran from what's in trunk as of this morning. I didn't dig in further to
> see where that extra time was coming from but can do so when I get some
> time soon.
> 
> 
> On 7 March 2013 15:56, Sebastian Schelter <ss...@apache.org> wrote:
> 
>> Hi Josh,
>>
>> Did you run the patch from the jira issue or did you run the trunk? I
>> made some follow up changes after uploading the patch. I can't imagine
>> why those small changes would lead to an increase of 50% in the runtime.
>>
>> /s
>>
>>
>>
>> On 07.03.2013 15:02, Josh Devins wrote:
>>> So the good news is that the patch runs ;)  The bad news is that it's
>>> slower, going from 1600-1800ms to ~2500ms to calculate a single users'
>> topK
>>> recommendations. For kicks, I ran a couple other experiments,
>> progressively
>>> removing code to isolate the problem area. Results are detailed here:
>>> https://gist.github.com/joshdevins/5106930
>>>
>>> Conclusions thus far:
>>>  * the patch is not helpful (for performance) and should be reverted or
>>> fixed again (sorry Sebastian)
>>>  * the dot product operation in `Vector` is not efficient enough for
>> large
>>> vectors/matrices, when used as it is in the ALS `RecommenderJob`, inside
>> a
>>> loop over `M`
>>>
>>> I've tried a few other experiments with Colt (for example) but there was
>> no
>>> noticeable gain. Parallelizing inside the map task (manually or with
>>> Parallel Colt) is possible but obviously is not ideal in an environment
>>> like Hadoop -- this would save memory since you only need a few map tasks
>>> loading the matrices, but isn't playing very nicely within a shared
>> cluster
>>> :)
>>>
>>> Next step at this point is to look at either reducing the number of items
>>> to recommend over, LSH or a third secret plan that "the PhD's" are
>> thinking
>>> about. Paper forthcoming, no doubt :D
>>>
>>> @Sebastian, happy to run any patches on our cluster/dataset before making
>>> more commits.
>>>
>>>
>>>
>>> On 6 March 2013 20:58, Josh Devins <hi...@joshdevins.com> wrote:
>>>
>>>> Got sidetracked today but I'll run Sebastian's version in trunk tomorrow
>>>> and report back.
>>>>
>>>>
>>>> On 6 March 2013 17:07, Sebastian Schelter <ss...@apache.org> wrote:
>>>>
>>>>> I already committed a fix in that direction. I modified our
>>>>> FixedSizePriorityQueue to allow inspection of its head for direct
>>>>> comparison. This obviates the need to instantiate a Comparable and
>> offer
>>>>> it to the queue.
>>>>>
>>>>> /s
>>>>>
>>>>>
>>>>> On 06.03.2013 17:01, Ted Dunning wrote:
>>>>>> I would recommend against a mutable object on maintenance grounds.
>>>>>>
>>>>>> Better is to keep the threshold that a new score must meet and only
>>>>>> construct the object on need.  That cuts the allocation down to
>>>>> negligible
>>>>>> levels.
>>>>>>
>>>>>> On Wed, Mar 6, 2013 at 6:11 AM, Sean Owen <sr...@gmail.com> wrote:
>>>>>>
>>>>>>> OK, that's reasonable on 35 machines. (You can turn up to 70
>> reducers,
>>>>>>> probably, as most machines can handle 2 reducers at once).
>>>>>>> I think the recommendation step loads one whole matrix into memory.
>>>>> You're
>>>>>>> not running out of memory but if you're turning up the heap size to
>>>>>>> accommodate, you might be hitting swapping, yes. I think (?) the
>>>>>>> conventional wisdom is to turn off swap for Hadoop.
>>>>>>>
>>>>>>> Sebastian yes that is probably a good optimization; I've had good
>>>>> results
>>>>>>> reusing a mutable object in this context.
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Mar 6, 2013 at 10:54 AM, Josh Devins <hi...@joshdevins.com>
>>>>> wrote:
>>>>>>>
>>>>>>>> The factorization at 2-hours is kind of a non-issue (certainly fast
>>>>>>>> enough). It was run with (if I recall correctly) 30 reducers across
>> a
>>>>> 35
>>>>>>>> node cluster, with 10 iterations.
>>>>>>>>
>>>>>>>> I was a bit shocked at how long the recommendation step took and
>> will
>>>>>>> throw
>>>>>>>> some timing debug in to see where the problem lies exactly. There
>>>>> were no
>>>>>>>> other jobs running on the cluster during these attempts, but it's
>>>>>>> certainly
>>>>>>>> possible that something is swapping or the like. I'll be looking
>> more
>>>>>>>> closely today before I start to consider other options for
>> calculating
>>>>>>> the
>>>>>>>> recommendations.
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>>
> 


Re: Top-N recommendations from SVD

Posted by Sebastian Schelter <ss...@apache.org>.
I've created a small benchmark to play around with the way the dot
product is computed. It tries to mimic Josh's use case: multiplying 2.5M
dense item vectors of size 20 by a dense user vector of size 20.

https://gist.github.com/sscdotopen/5108988

I compared using the .dot() method from the Vector class to simply
computing the dot product like this:

  // Simple hand-rolled dot product over the first NUM_FEATURES entries of both vectors.
  static double dot(Vector one, Vector two) {
    double sum = 0;
    for (int n = 0; n < NUM_FEATURES; n++) {
      sum += one.getQuick(n) * two.getQuick(n);
    }
    return sum;
  }

The results indicate that using Vector.dot() incurs a 6x to 7x overhead
compared to the simple method.
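
For reference, a minimal harness along those lines (not the actual gist;
the constants and the checksum, which only keeps the JIT from dropping the
loops, are illustrative):

  import java.util.Random;
  import org.apache.mahout.math.DenseVector;
  import org.apache.mahout.math.Vector;

  static final int NUM_FEATURES = 20;
  static final int NUM_ITEMS = 2500000;

  public static void main(String[] args) {
    Random random = new Random(42);
    Vector user = randomVector(random);
    Vector[] items = new Vector[NUM_ITEMS];          // ~2.5M small vectors, needs a big heap
    for (int i = 0; i < NUM_ITEMS; i++) {
      items[i] = randomVector(random);
    }

    long start = System.currentTimeMillis();
    double checksum = 0;
    for (Vector item : items) {
      checksum += item.dot(user);                    // Vector.dot()
    }
    System.out.println("Vector.dot(): " + (System.currentTimeMillis() - start) + " ms (" + checksum + ')');

    start = System.currentTimeMillis();
    checksum = 0;
    for (Vector item : items) {
      checksum += dot(item, user);                   // simple dot() from above
    }
    System.out.println("simple dot(): " + (System.currentTimeMillis() - start) + " ms (" + checksum + ')');
  }

  static Vector randomVector(Random random) {
    double[] values = new double[NUM_FEATURES];
    for (int n = 0; n < NUM_FEATURES; n++) {
      values[n] = random.nextDouble();
    }
    return new DenseVector(values);
  }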

On 07.03.2013 16:00, Josh Devins wrote:
> I ran from what's in trunk as of this morning. I didn't dig in further to
> see where that extra time was coming from but can do so when I get some
> time soon.
> 
> 
> On 7 March 2013 15:56, Sebastian Schelter <ss...@apache.org> wrote:
> 
>> Hi Josh,
>>
>> Did you run the patch from the jira issue or did you run the trunk? I
>> made some follow up changes after uploading the patch. I can't imagine
>> why those small changes would lead to an increase of 50% in the runtime.
>>
>> /s
>>
>>
>>
>> On 07.03.2013 15:02, Josh Devins wrote:
>>> So the good news is that the patch runs ;)  The bad news is that it's
>>> slower, going from 1600-1800ms to ~2500ms to calculate a single users'
>> topK
>>> recommendations. For kicks, I ran a couple other experiments,
>> progressively
>>> removing code to isolate the problem area. Results are detailed here:
>>> https://gist.github.com/joshdevins/5106930
>>>
>>> Conclusions thus far:
>>>  * the patch is not helpful (for performance) and should be reverted or
>>> fixed again (sorry Sebastian)
>>>  * the dot product operation in `Vector` is not efficient enough for
>> large
>>> vectors/matrices, when used as it is in the ALS `RecommenderJob`, inside
>> a
>>> loop over `M`
>>>
>>> I've tried a few other experiments with Colt (for example) but there was
>> no
>>> noticeable gain. Parallelizing inside the map task (manually or with
>>> Parallel Colt) is possible but obviously is not ideal in an environment
>>> like Hadoop -- this would save memory since you only need a few map tasks
>>> loading the matrices, but isn't playing very nicely within a shared
>> cluster
>>> :)
>>>
>>> Next step at this point is to look at either reducing the number of items
>>> to recommend over, LSH or a third secret plan that "the PhD's" are
>> thinking
>>> about. Paper forthcoming, no doubt :D
>>>
>>> @Sebastian, happy to run any patches on our cluster/dataset before making
>>> more commits.
>>>
>>>
>>>
>>> On 6 March 2013 20:58, Josh Devins <hi...@joshdevins.com> wrote:
>>>
>>>> Got sidetracked today but I'll run Sebastian's version in trunk tomorrow
>>>> and report back.
>>>>
>>>>
>>>> On 6 March 2013 17:07, Sebastian Schelter <ss...@apache.org> wrote:
>>>>
>>>>> I already committed a fix in that direction. I modified our
>>>>> FixedSizePriorityQueue to allow inspection of its head for direct
>>>>> comparison. This obviates the need to instantiate a Comparable and
>> offer
>>>>> it to the queue.
>>>>>
>>>>> /s
>>>>>
>>>>>
>>>>> On 06.03.2013 17:01, Ted Dunning wrote:
>>>>>> I would recommend against a mutable object on maintenance grounds.
>>>>>>
>>>>>> Better is to keep the threshold that a new score must meet and only
>>>>>> construct the object on need.  That cuts the allocation down to
>>>>> negligible
>>>>>> levels.
>>>>>>
>>>>>> On Wed, Mar 6, 2013 at 6:11 AM, Sean Owen <sr...@gmail.com> wrote:
>>>>>>
>>>>>>> OK, that's reasonable on 35 machines. (You can turn up to 70
>> reducers,
>>>>>>> probably, as most machines can handle 2 reducers at once).
>>>>>>> I think the recommendation step loads one whole matrix into memory.
>>>>> You're
>>>>>>> not running out of memory but if you're turning up the heap size to
>>>>>>> accommodate, you might be hitting swapping, yes. I think (?) the
>>>>>>> conventional wisdom is to turn off swap for Hadoop.
>>>>>>>
>>>>>>> Sebastian yes that is probably a good optimization; I've had good
>>>>> results
>>>>>>> reusing a mutable object in this context.
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Mar 6, 2013 at 10:54 AM, Josh Devins <hi...@joshdevins.com>
>>>>> wrote:
>>>>>>>
>>>>>>>> The factorization at 2-hours is kind of a non-issue (certainly fast
>>>>>>>> enough). It was run with (if I recall correctly) 30 reducers across
>> a
>>>>> 35
>>>>>>>> node cluster, with 10 iterations.
>>>>>>>>
>>>>>>>> I was a bit shocked at how long the recommendation step took and
>> will
>>>>>>> throw
>>>>>>>> some timing debug in to see where the problem lies exactly. There
>>>>> were no
>>>>>>>> other jobs running on the cluster during these attempts, but it's
>>>>>>> certainly
>>>>>>>> possible that something is swapping or the like. I'll be looking
>> more
>>>>>>>> closely today before I start to consider other options for
>> calculating
>>>>>>> the
>>>>>>>> recommendations.
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>>
> 


Re: Top-N recommendations from SVD

Posted by Josh Devins <hi...@joshdevins.com>.
I ran from what's in trunk as of this morning. I didn't dig in further to
see where that extra time was coming from but can do so when I get some
time soon.


On 7 March 2013 15:56, Sebastian Schelter <ss...@apache.org> wrote:

> Hi Josh,
>
> Did you run the patch from the jira issue or did you run the trunk? I
> made some follow up changes after uploading the patch. I can't imagine
> why those small changes would lead to an increase of 50% in the runtime.
>
> /s
>
>
>
> On 07.03.2013 15:02, Josh Devins wrote:
> > So the good news is that the patch runs ;)  The bad news is that it's
> > slower, going from 1600-1800ms to ~2500ms to calculate a single users'
> topK
> > recommendations. For kicks, I ran a couple other experiments,
> progressively
> > removing code to isolate the problem area. Results are detailed here:
> > https://gist.github.com/joshdevins/5106930
> >
> > Conclusions thus far:
> >  * the patch is not helpful (for performance) and should be reverted or
> > fixed again (sorry Sebastian)
> >  * the dot product operation in `Vector` is not efficient enough for
> large
> > vectors/matrices, when used as it is in the ALS `RecommenderJob`, inside
> a
> > loop over `M`
> >
> > I've tried a few other experiments with Colt (for example) but there was
> no
> > noticeable gain. Parallelizing inside the map task (manually or with
> > Parallel Colt) is possible but obviously is not ideal in an environment
> > like Hadoop -- this would save memory since you only need a few map tasks
> > loading the matrices, but isn't playing very nicely within a shared
> cluster
> > :)
> >
> > Next step at this point is to look at either reducing the number of items
> > to recommend over, LSH or a third secret plan that "the PhD's" are
> thinking
> > about. Paper forthcoming, no doubt :D
> >
> > @Sebastian, happy to run any patches on our cluster/dataset before making
> > more commits.
> >
> >
> >
> > On 6 March 2013 20:58, Josh Devins <hi...@joshdevins.com> wrote:
> >
> >> Got sidetracked today but I'll run Sebastian's version in trunk tomorrow
> >> and report back.
> >>
> >>
> >> On 6 March 2013 17:07, Sebastian Schelter <ss...@apache.org> wrote:
> >>
> >>> I already committed a fix in that direction. I modified our
> >>> FixedSizePriorityQueue to allow inspection of its head for direct
> >>> comparison. This obviates the need to instantiate a Comparable and
> offer
> >>> it to the queue.
> >>>
> >>> /s
> >>>
> >>>
> >>> On 06.03.2013 17:01, Ted Dunning wrote:
> >>>> I would recommend against a mutable object on maintenance grounds.
> >>>>
> >>>> Better is to keep the threshold that a new score must meet and only
> >>>> construct the object on need.  That cuts the allocation down to
> >>> negligible
> >>>> levels.
> >>>>
> >>>> On Wed, Mar 6, 2013 at 6:11 AM, Sean Owen <sr...@gmail.com> wrote:
> >>>>
> >>>>> OK, that's reasonable on 35 machines. (You can turn up to 70
> reducers,
> >>>>> probably, as most machines can handle 2 reducers at once).
> >>>>> I think the recommendation step loads one whole matrix into memory.
> >>> You're
> >>>>> not running out of memory but if you're turning up the heap size to
> >>>>> accommodate, you might be hitting swapping, yes. I think (?) the
> >>>>> conventional wisdom is to turn off swap for Hadoop.
> >>>>>
> >>>>> Sebastian yes that is probably a good optimization; I've had good
> >>> results
> >>>>> reusing a mutable object in this context.
> >>>>>
> >>>>>
> >>>>> On Wed, Mar 6, 2013 at 10:54 AM, Josh Devins <hi...@joshdevins.com>
> >>> wrote:
> >>>>>
> >>>>>> The factorization at 2-hours is kind of a non-issue (certainly fast
> >>>>>> enough). It was run with (if I recall correctly) 30 reducers across
> a
> >>> 35
> >>>>>> node cluster, with 10 iterations.
> >>>>>>
> >>>>>> I was a bit shocked at how long the recommendation step took and
> will
> >>>>> throw
> >>>>>> some timing debug in to see where the problem lies exactly. There
> >>> were no
> >>>>>> other jobs running on the cluster during these attempts, but it's
> >>>>> certainly
> >>>>>> possible that something is swapping or the like. I'll be looking
> more
> >>>>>> closely today before I start to consider other options for
> calculating
> >>>>> the
> >>>>>> recommendations.
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>>
> >>
> >
>
>

Re: Top-N recommendations from SVD

Posted by Sebastian Schelter <ss...@apache.org>.
Hi Josh,

Did you run the patch from the jira issue or did you run the trunk? I
made some follow up changes after uploading the patch. I can't imagine
why those small changes would lead to an increase of 50% in the runtime.

/s



On 07.03.2013 15:02, Josh Devins wrote:
> So the good news is that the patch runs ;)  The bad news is that it's
> slower, going from 1600-1800ms to ~2500ms to calculate a single users' topK
> recommendations. For kicks, I ran a couple other experiments, progressively
> removing code to isolate the problem area. Results are detailed here:
> https://gist.github.com/joshdevins/5106930
> 
> Conclusions thus far:
>  * the patch is not helpful (for performance) and should be reverted or
> fixed again (sorry Sebastian)
>  * the dot product operation in `Vector` is not efficient enough for large
> vectors/matrices, when used as it is in the ALS `RecommenderJob`, inside a
> loop over `M`
> 
> I've tried a few other experiments with Colt (for example) but there was no
> noticeable gain. Parallelizing inside the map task (manually or with
> Parallel Colt) is possible but obviously is not ideal in an environment
> like Hadoop -- this would save memory since you only need a few map tasks
> loading the matrices, but isn't playing very nicely within a shared cluster
> :)
> 
> Next step at this point is to look at either reducing the number of items
> to recommend over, LSH or a third secret plan that "the PhD's" are thinking
> about. Paper forthcoming, no doubt :D
> 
> @Sebastian, happy to run any patches on our cluster/dataset before making
> more commits.
> 
> 
> 
> On 6 March 2013 20:58, Josh Devins <hi...@joshdevins.com> wrote:
> 
>> Got sidetracked today but I'll run Sebastian's version in trunk tomorrow
>> and report back.
>>
>>
>> On 6 March 2013 17:07, Sebastian Schelter <ss...@apache.org> wrote:
>>
>>> I already committed a fix in that direction. I modified our
>>> FixedSizePriorityQueue to allow inspection of its head for direct
>>> comparison. This obviates the need to instantiate a Comparable and offer
>>> it to the queue.
>>>
>>> /s
>>>
>>>
>>> On 06.03.2013 17:01, Ted Dunning wrote:
>>>> I would recommend against a mutable object on maintenance grounds.
>>>>
>>>> Better is to keep the threshold that a new score must meet and only
>>>> construct the object on need.  That cuts the allocation down to
>>> negligible
>>>> levels.
>>>>
>>>> On Wed, Mar 6, 2013 at 6:11 AM, Sean Owen <sr...@gmail.com> wrote:
>>>>
>>>>> OK, that's reasonable on 35 machines. (You can turn up to 70 reducers,
>>>>> probably, as most machines can handle 2 reducers at once).
>>>>> I think the recommendation step loads one whole matrix into memory.
>>> You're
>>>>> not running out of memory but if you're turning up the heap size to
>>>>> accommodate, you might be hitting swapping, yes. I think (?) the
>>>>> conventional wisdom is to turn off swap for Hadoop.
>>>>>
>>>>> Sebastian yes that is probably a good optimization; I've had good
>>> results
>>>>> reusing a mutable object in this context.
>>>>>
>>>>>
>>>>> On Wed, Mar 6, 2013 at 10:54 AM, Josh Devins <hi...@joshdevins.com>
>>> wrote:
>>>>>
>>>>>> The factorization at 2-hours is kind of a non-issue (certainly fast
>>>>>> enough). It was run with (if I recall correctly) 30 reducers across a
>>> 35
>>>>>> node cluster, with 10 iterations.
>>>>>>
>>>>>> I was a bit shocked at how long the recommendation step took and will
>>>>> throw
>>>>>> some timing debug in to see where the problem lies exactly. There
>>> were no
>>>>>> other jobs running on the cluster during these attempts, but it's
>>>>> certainly
>>>>>> possible that something is swapping or the like. I'll be looking more
>>>>>> closely today before I start to consider other options for calculating
>>>>> the
>>>>>> recommendations.
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>
> 


Re: Top-N recommendations from SVD

Posted by Josh Devins <hi...@joshdevins.com>.
So the good news is that the patch runs ;)  The bad news is that it's
slower, going from 1600-1800ms to ~2500ms to calculate a single user's topK
recommendations. For kicks, I ran a couple other experiments, progressively
removing code to isolate the problem area. Results are detailed here:
https://gist.github.com/joshdevins/5106930

Conclusions thus far:
 * the patch is not helpful (for performance) and should be reverted or
fixed again (sorry Sebastian)
 * the dot product operation in `Vector` is not efficient enough for large
vectors/matrices, when used as it is in the ALS `RecommenderJob`, inside a
loop over `M`

I've tried a few other experiments with Colt (for example) but there was no
noticeable gain. Parallelizing inside the map task (manually or with
Parallel Colt) is possible but obviously is not ideal in an environment
like Hadoop -- this would save memory since you only need a few map tasks
loading the matrices, but isn't playing very nicely within a shared cluster
:)

Next step at this point is to look at either reducing the number of items
to recommend over, LSH or a third secret plan that "the PhD's" are thinking
about. Paper forthcoming, no doubt :D
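
The LSH route could look roughly like the following (a sketch only,
assuming random-hyperplane hashing of the item feature vectors; nothing
like this exists in the ALS code as-is): hash every item vector once into a
short bit signature, hash the user's feature vector at query time, and
score only the items that land in the same bucket instead of all ~2.8M.

  import java.util.ArrayList;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;
  import java.util.Random;

  // Draw numBits random hyperplanes in feature space; each one contributes one bit
  // to an item's signature, so items pointing in similar directions tend to collide.
  static double[][] randomHyperplanes(int numBits, int numFeatures, long seed) {
    Random random = new Random(seed);
    double[][] planes = new double[numBits][numFeatures];
    for (int b = 0; b < numBits; b++) {
      for (int f = 0; f < numFeatures; f++) {
        planes[b][f] = random.nextGaussian();
      }
    }
    return planes;
  }

  static int signature(double[] vector, double[][] planes) {
    int sig = 0;
    for (int b = 0; b < planes.length; b++) {
      double dot = 0;
      for (int f = 0; f < vector.length; f++) {
        dot += vector[f] * planes[b][f];
      }
      if (dot >= 0) {
        sig |= 1 << b;                               // one bit per hyperplane
      }
    }
    return sig;
  }

  // Build buckets once over all items; per user, score only buildBuckets(...).get(signature(user)).
  static Map<Integer, List<Integer>> buildBuckets(double[][] items, double[][] planes) {
    Map<Integer, List<Integer>> buckets = new HashMap<Integer, List<Integer>>();
    for (int i = 0; i < items.length; i++) {
      int sig = signature(items[i], planes);
      List<Integer> bucket = buckets.get(sig);
      if (bucket == null) {
        bucket = new ArrayList<Integer>();
        buckets.put(sig, bucket);
      }
      bucket.add(i);
    }
    return buckets;
  }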

@Sebastian, happy to run any patches on our cluster/dataset before making
more commits.



On 6 March 2013 20:58, Josh Devins <hi...@joshdevins.com> wrote:

> Got sidetracked today but I'll run Sebastian's version in trunk tomorrow
> and report back.
>
>
> On 6 March 2013 17:07, Sebastian Schelter <ss...@apache.org> wrote:
>
>> I already committed a fix in that direction. I modified our
>> FixedSizePriorityQueue to allow inspection of its head for direct
>> comparison. This obviates the need to instantiate a Comparable and offer
>> it to the queue.
>>
>> /s
>>
>>
>> On 06.03.2013 17:01, Ted Dunning wrote:
>> > I would recommend against a mutable object on maintenance grounds.
>> >
>> > Better is to keep the threshold that a new score must meet and only
>> > construct the object on need.  That cuts the allocation down to
>> negligible
>> > levels.
>> >
>> > On Wed, Mar 6, 2013 at 6:11 AM, Sean Owen <sr...@gmail.com> wrote:
>> >
>> >> OK, that's reasonable on 35 machines. (You can turn up to 70 reducers,
>> >> probably, as most machines can handle 2 reducers at once).
>> >> I think the recommendation step loads one whole matrix into memory.
>> You're
>> >> not running out of memory but if you're turning up the heap size to
>> >> accommodate, you might be hitting swapping, yes. I think (?) the
>> >> conventional wisdom is to turn off swap for Hadoop.
>> >>
>> >> Sebastian yes that is probably a good optimization; I've had good
>> results
>> >> reusing a mutable object in this context.
>> >>
>> >>
>> >> On Wed, Mar 6, 2013 at 10:54 AM, Josh Devins <hi...@joshdevins.com>
>> wrote:
>> >>
>> >>> The factorization at 2-hours is kind of a non-issue (certainly fast
>> >>> enough). It was run with (if I recall correctly) 30 reducers across a
>> 35
>> >>> node cluster, with 10 iterations.
>> >>>
>> >>> I was a bit shocked at how long the recommendation step took and will
>> >> throw
>> >>> some timing debug in to see where the problem lies exactly. There
>> were no
>> >>> other jobs running on the cluster during these attempts, but it's
>> >> certainly
>> >>> possible that something is swapping or the like. I'll be looking more
>> >>> closely today before I start to consider other options for calculating
>> >> the
>> >>> recommendations.
>> >>>
>> >>>
>> >>
>> >
>>
>>
>

Re: Top-N recommendations from SVD

Posted by Josh Devins <hi...@joshdevins.com>.
Got sidetracked today but I'll run Sebastian's version in trunk tomorrow
and report back.


On 6 March 2013 17:07, Sebastian Schelter <ss...@apache.org> wrote:

> I already committed a fix in that direction. I modified our
> FixedSizePriorityQueue to allow inspection of its head for direct
> comparison. This obviates the need to instantiate a Comparable and offer
> it to the queue.
>
> /s
>
>
> On 06.03.2013 17:01, Ted Dunning wrote:
> > I would recommend against a mutable object on maintenance grounds.
> >
> > Better is to keep the threshold that a new score must meet and only
> > construct the object on need.  That cuts the allocation down to
> negligible
> > levels.
> >
> > On Wed, Mar 6, 2013 at 6:11 AM, Sean Owen <sr...@gmail.com> wrote:
> >
> >> OK, that's reasonable on 35 machines. (You can turn up to 70 reducers,
> >> probably, as most machines can handle 2 reducers at once).
> >> I think the recommendation step loads one whole matrix into memory.
> You're
> >> not running out of memory but if you're turning up the heap size to
> >> accommodate, you might be hitting swapping, yes. I think (?) the
> >> conventional wisdom is to turn off swap for Hadoop.
> >>
> >> Sebastian yes that is probably a good optimization; I've had good
> results
> >> reusing a mutable object in this context.
> >>
> >>
> >> On Wed, Mar 6, 2013 at 10:54 AM, Josh Devins <hi...@joshdevins.com> wrote:
> >>
> >>> The factorization at 2-hours is kind of a non-issue (certainly fast
> >>> enough). It was run with (if I recall correctly) 30 reducers across a
> 35
> >>> node cluster, with 10 iterations.
> >>>
> >>> I was a bit shocked at how long the recommendation step took and will
> >> throw
> >>> some timing debug in to see where the problem lies exactly. There were
> no
> >>> other jobs running on the cluster during these attempts, but it's
> >> certainly
> >>> possible that something is swapping or the like. I'll be looking more
> >>> closely today before I start to consider other options for calculating
> >> the
> >>> recommendations.
> >>>
> >>>
> >>
> >
>
>

Re: Top-N recommendations from SVD

Posted by Sebastian Schelter <ss...@apache.org>.
I already committed a fix in that direction. I modified our
FixedSizePriorityQueue to allow inspection of its head for direct
comparison. This obviates the need to instantiate a Comparable and offer
it to the queue.

/s


On 06.03.2013 17:01, Ted Dunning wrote:
> I would recommend against a mutable object on maintenance grounds.
> 
> Better is to keep the threshold that a new score must meet and only
> construct the object on need.  That cuts the allocation down to negligible
> levels.
> 
> On Wed, Mar 6, 2013 at 6:11 AM, Sean Owen <sr...@gmail.com> wrote:
> 
>> OK, that's reasonable on 35 machines. (You can turn up to 70 reducers,
>> probably, as most machines can handle 2 reducers at once).
>> I think the recommendation step loads one whole matrix into memory. You're
>> not running out of memory but if you're turning up the heap size to
>> accommodate, you might be hitting swapping, yes. I think (?) the
>> conventional wisdom is to turn off swap for Hadoop.
>>
>> Sebastian yes that is probably a good optimization; I've had good results
>> reusing a mutable object in this context.
>>
>>
>> On Wed, Mar 6, 2013 at 10:54 AM, Josh Devins <hi...@joshdevins.com> wrote:
>>
>>> The factorization at 2-hours is kind of a non-issue (certainly fast
>>> enough). It was run with (if I recall correctly) 30 reducers across a 35
>>> node cluster, with 10 iterations.
>>>
>>> I was a bit shocked at how long the recommendation step took and will
>> throw
>>> some timing debug in to see where the problem lies exactly. There were no
>>> other jobs running on the cluster during these attempts, but it's
>> certainly
>>> possible that something is swapping or the like. I'll be looking more
>>> closely today before I start to consider other options for calculating
>> the
>>> recommendations.
>>>
>>>
>>
> 


Re: Top-N recommendations from SVD

Posted by Sean Owen <sr...@gmail.com>.
That too, even better. Isn't that already done? Could be in one place but
not another. IIRC there were also cases where it was a lot easier to pass
around an object internally and mutability solved the performance issue,
without much risk since it was only internal. You can (nay, must) always
copy the objects before they are returned.



On Wed, Mar 6, 2013 at 4:01 PM, Ted Dunning <te...@gmail.com> wrote:

> I would recommend against a mutable object on maintenance grounds.
>
> Better is to keep the threshold that a new score must meet and only
> construct the object on need.  That cuts the allocation down to negligible
> levels.
>
> On Wed, Mar 6, 2013 at 6:11 AM, Sean Owen <sr...@gmail.com> wrote:
>
> > OK, that's reasonable on 35 machines. (You can turn up to 70 reducers,
> > probably, as most machines can handle 2 reducers at once).
> > I think the recommendation step loads one whole matrix into memory.
> You're
> > not running out of memory but if you're turning up the heap size to
> > accommodate, you might be hitting swapping, yes. I think (?) the
> > conventional wisdom is to turn off swap for Hadoop.
> >
> > Sebastian yes that is probably a good optimization; I've had good results
> > reusing a mutable object in this context.
> >
> >
> > On Wed, Mar 6, 2013 at 10:54 AM, Josh Devins <hi...@joshdevins.com> wrote:
> >
> > > The factorization at 2-hours is kind of a non-issue (certainly fast
> > > enough). It was run with (if I recall correctly) 30 reducers across a
> 35
> > > node cluster, with 10 iterations.
> > >
> > > I was a bit shocked at how long the recommendation step took and will
> > throw
> > > some timing debug in to see where the problem lies exactly. There were
> no
> > > other jobs running on the cluster during these attempts, but it's
> > certainly
> > > possible that something is swapping or the like. I'll be looking more
> > > closely today before I start to consider other options for calculating
> > the
> > > recommendations.
> > >
> > >
> >
>

Re: Top-N recommendations from SVD

Posted by Ted Dunning <te...@gmail.com>.
I would recommend against a mutable object on maintenance grounds.

Better is to keep the threshold that a new score must meet and only
construct the object on need.  That cuts the allocation down to negligible
levels.

On Wed, Mar 6, 2013 at 6:11 AM, Sean Owen <sr...@gmail.com> wrote:

> OK, that's reasonable on 35 machines. (You can turn up to 70 reducers,
> probably, as most machines can handle 2 reducers at once).
> I think the recommendation step loads one whole matrix into memory. You're
> not running out of memory but if you're turning up the heap size to
> accommodate, you might be hitting swapping, yes. I think (?) the
> conventional wisdom is to turn off swap for Hadoop.
>
> Sebastian yes that is probably a good optimization; I've had good results
> reusing a mutable object in this context.
>
>
> On Wed, Mar 6, 2013 at 10:54 AM, Josh Devins <hi...@joshdevins.com> wrote:
>
> > The factorization at 2-hours is kind of a non-issue (certainly fast
> > enough). It was run with (if I recall correctly) 30 reducers across a 35
> > node cluster, with 10 iterations.
> >
> > I was a bit shocked at how long the recommendation step took and will
> throw
> > some timing debug in to see where the problem lies exactly. There were no
> > other jobs running on the cluster during these attempts, but it's
> certainly
> > possible that something is swapping or the like. I'll be looking more
> > closely today before I start to consider other options for calculating
> the
> > recommendations.
> >
> >
>

Re: Top-N recommendations from SVD

Posted by Sean Owen <sr...@gmail.com>.
OK, and he mentioned that only 10 mappers were running, when the job ought
to be able to use several per machine. The # of mappers is really a
function of the input size, so he probably needs to turn down the max file
split size to induce more mappers?
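
For example, on the job configuration (the property name is the Hadoop 1.x
one and 64 MB is just an illustrative value):

  // Cap the input split size so FileInputFormat produces more, smaller splits and
  // therefore more map tasks; tune the value to the cluster and input size.
  Configuration conf = new Configuration();
  conf.setLong("mapred.max.split.size", 64L * 1024 * 1024);
  // equivalently, via the new-API helper on the Job:
  // FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);

(org.apache.hadoop.conf.Configuration and
org.apache.hadoop.mapreduce.lib.input.FileInputFormat assumed.)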


On Wed, Mar 6, 2013 at 11:16 AM, Sebastian Schelter <ssc.open@googlemail.com
> wrote:

> Btw, all important jobs in ALS are map-only, so it's the number of map
> slots that counts.
>
>

Re: Top-N recommendations from SVD

Posted by Sebastian Schelter <ss...@apache.org>.
I tried to rework and optimize the code a little in trunk today,
https://issues.apache.org/jira/browse/MAHOUT-1151

You could use this as a basis for further optimization.

Best,
Sebastian


On 06.03.2013 12:44, Josh Devins wrote:
> First bit of feedback. The `M.forEachPair` loop is about 1600-1800 millis
> per user (recall the size is ~2.6M users x ~2.8M items). There doesn't
> appear to be any out of the ordinary GC going on (yet). Going to look at
> optimising this loop a bit and see where I can get. Definitely time-boxing
> this though ;)
> 
> 
> On 6 March 2013 12:16, Sebastian Schelter <ss...@googlemail.com> wrote:
> 
>> Btw, all important jobs in ALS are map-only, so it's the number of map
>> slots that counts.
>>
>> On 06.03.2013 12:11, Sean Owen wrote:
>>> OK, that's reasonable on 35 machines. (You can turn up to 70 reducers,
>>> probably, as most machines can handle 2 reducers at once).
>>> I think the recommendation step loads one whole matrix into memory.
>> You're
>>> not running out of memory but if you're turning up the heap size to
>>> accommodate, you might be hitting swapping, yes. I think (?) the
>>> conventional wisdom is to turn off swap for Hadoop.
>>>
>>> Sebastian yes that is probably a good optimization; I've had good results
>>> reusing a mutable object in this context.
>>>
>>>
>>> On Wed, Mar 6, 2013 at 10:54 AM, Josh Devins <hi...@joshdevins.com> wrote:
>>>
>>>> The factorization at 2-hours is kind of a non-issue (certainly fast
>>>> enough). It was run with (if I recall correctly) 30 reducers across a 35
>>>> node cluster, with 10 iterations.
>>>>
>>>> I was a bit shocked at how long the recommendation step took and will
>> throw
>>>> some timing debug in to see where the problem lies exactly. There were
>> no
>>>> other jobs running on the cluster during these attempts, but it's
>> certainly
>>>> possible that something is swapping or the like. I'll be looking more
>>>> closely today before I start to consider other options for calculating
>> the
>>>> recommendations.
>>>>
>>>>
>>>
>>
>>
> 


Re: Top-N recommendations from SVD

Posted by Ted Dunning <te...@gmail.com>.
That sounds way too long.  How is the U matrix stored?  What type?

On Wed, Mar 6, 2013 at 6:44 AM, Josh Devins <hi...@joshdevins.com> wrote:

> First bit of feedback. The `M.forEachPair` loop is about 1600-1800 millis
> per user (recall the size is ~2.6M users x ~2.8M items). There doesn't
> appear to be any out of the ordinary GC going on (yet). Going to look at
> optimising this loop a bit and see where I can get. Definitely time-boxing
> this though ;)
>
>
> On 6 March 2013 12:16, Sebastian Schelter <ss...@googlemail.com> wrote:
>
> > Btw, all important jobs in ALS are map-only, so it's the number of map
> > slots that counts.
> >
> > On 06.03.2013 12:11, Sean Owen wrote:
> > > OK, that's reasonable on 35 machines. (You can turn up to 70 reducers,
> > > probably, as most machines can handle 2 reducers at once).
> > > I think the recommendation step loads one whole matrix into memory.
> > You're
> > > not running out of memory but if you're turning up the heap size to
> > > accommodate, you might be hitting swapping, yes. I think (?) the
> > > conventional wisdom is to turn off swap for Hadoop.
> > >
> > > Sebastian yes that is probably a good optimization; I've had good
> results
> > > reusing a mutable object in this context.
> > >
> > >
> > > On Wed, Mar 6, 2013 at 10:54 AM, Josh Devins <hi...@joshdevins.com>
> wrote:
> > >
> > >> The factorization at 2-hours is kind of a non-issue (certainly fast
> > >> enough). It was run with (if I recall correctly) 30 reducers across a
> 35
> > >> node cluster, with 10 iterations.
> > >>
> > >> I was a bit shocked at how long the recommendation step took and will
> > throw
> > >> some timing debug in to see where the problem lies exactly. There were
> > no
> > >> other jobs running on the cluster during these attempts, but it's
> > certainly
> > >> possible that something is swapping or the like. I'll be looking more
> > >> closely today before I start to consider other options for calculating
> > the
> > >> recommendations.
> > >>
> > >>
> > >
> >
> >
>

Re: Top-N recommendations from SVD

Posted by Josh Devins <hi...@joshdevins.com>.
First bit of feedback. The `M.forEachPair` loop is about 1600-1800 millis
per user (recall the size is ~2.6M users x ~2.8M items). There doesn't
appear to be any out of the ordinary GC going on (yet). Going to look at
optimising this loop a bit and see where I can get. Definitely time-boxing
this though ;)


On 6 March 2013 12:16, Sebastian Schelter <ss...@googlemail.com> wrote:

> Btw, all important jobs in ALS are map-only, so it's the number of map
> slots that counts.
>
> On 06.03.2013 12:11, Sean Owen wrote:
> > OK, that's reasonable on 35 machines. (You can turn up to 70 reducers,
> > probably, as most machines can handle 2 reducers at once).
> > I think the recommendation step loads one whole matrix into memory.
> You're
> > not running out of memory but if you're turning up the heap size to
> > accommodate, you might be hitting swapping, yes. I think (?) the
> > conventional wisdom is to turn off swap for Hadoop.
> >
> > Sebastian yes that is probably a good optimization; I've had good results
> > reusing a mutable object in this context.
> >
> >
> > On Wed, Mar 6, 2013 at 10:54 AM, Josh Devins <hi...@joshdevins.com> wrote:
> >
> >> The factorization at 2-hours is kind of a non-issue (certainly fast
> >> enough). It was run with (if I recall correctly) 30 reducers across a 35
> >> node cluster, with 10 iterations.
> >>
> >> I was a bit shocked at how long the recommendation step took and will
> throw
> >> some timing debug in to see where the problem lies exactly. There were
> no
> >> other jobs running on the cluster during these attempts, but it's
> certainly
> >> possible that something is swapping or the like. I'll be looking more
> >> closely today before I start to consider other options for calculating
> the
> >> recommendations.
> >>
> >>
> >
>
>

Re: Top-N recommendations from SVD

Posted by Sebastian Schelter <ss...@googlemail.com>.
Btw, all important jobs in ALS are map-only, so it's the number of map
slots that counts.

On 06.03.2013 12:11, Sean Owen wrote:
> OK, that's reasonable on 35 machines. (You can turn up to 70 reducers,
> probably, as most machines can handle 2 reducers at once).
> I think the recommendation step loads one whole matrix into memory. You're
> not running out of memory but if you're turning up the heap size to
> accommodate, you might be hitting swapping, yes. I think (?) the
> conventional wisdom is to turn off swap for Hadoop.
> 
> Sebastian yes that is probably a good optimization; I've had good results
> reusing a mutable object in this context.
> 
> 
> On Wed, Mar 6, 2013 at 10:54 AM, Josh Devins <hi...@joshdevins.com> wrote:
> 
>> The factorization at 2-hours is kind of a non-issue (certainly fast
>> enough). It was run with (if I recall correctly) 30 reducers across a 35
>> node cluster, with 10 iterations.
>>
>> I was a bit shocked at how long the recommendation step took and will throw
>> some timing debug in to see where the problem lies exactly. There were no
>> other jobs running on the cluster during these attempts, but it's certainly
>> possible that something is swapping or the like. I'll be looking more
>> closely today before I start to consider other options for calculating the
>> recommendations.
>>
>>
> 


Re: Top-N recommendations from SVD

Posted by Sean Owen <sr...@gmail.com>.
OK, that's reasonable on 35 machines. (You can turn up to 70 reducers,
probably, as most machines can handle 2 reducers at once).
I think the recommendation step loads one whole matrix into memory. You're
not running out of memory but if you're turning up the heap size to
accommodate, you might be hitting swapping, yes. I think (?) the
conventional wisdom is to turn off swap for Hadoop.

Sebastian yes that is probably a good optimization; I've had good results
reusing a mutable object in this context.


On Wed, Mar 6, 2013 at 10:54 AM, Josh Devins <hi...@joshdevins.com> wrote:

> The factorization at 2-hours is kind of a non-issue (certainly fast
> enough). It was run with (if I recall correctly) 30 reducers across a 35
> node cluster, with 10 iterations.
>
> I was a bit shocked at how long the recommendation step took and will throw
> some timing debug in to see where the problem lies exactly. There were no
> other jobs running on the cluster during these attempts, but it's certainly
> possible that something is swapping or the like. I'll be looking more
> closely today before I start to consider other options for calculating the
> recommendations.
>
>

Re: Top-N recommendations from SVD

Posted by Josh Devins <hi...@joshdevins.com>.
The factorization at 2-hours is kind of a non-issue (certainly fast
enough). It was run with (if I recall correctly) 30 reducers across a 35
node cluster, with 10 iterations.

I was a bit shocked at how long the recommendation step took and will throw
some timing debug in to see where the problem lies exactly. There were no
other jobs running on the cluster during these attempts, but it's certainly
possible that something is swapping or the like. I'll be looking more
closely today before I start to consider other options for calculating the
recommendations.



On 6 March 2013 11:41, Sean Owen <sr...@gmail.com> wrote:

> Yeah that's right, he said 20 features, oops. And yes he says he's talking
> about the recs only too, so that's not right either. That seems way too
> long relative to factorization. And the factorization seems quite fast; how
> many machines, and how many iterations?
>
> I thought the shape of the computation was to cache B' (yes whose columns
> are B rows) and multiply against the rows of A. There again probably wrong
> given the latest timing info.
>
>
> On Wed, Mar 6, 2013 at 10:25 AM, Josh Devins <hi...@joshdevins.com> wrote:
>
> > So the 80 hour estimate is _only_ for the U*M', top-n calculation and not
> > the factorization. Factorization is on the order of 2-hours. For the
> > interested, here's the pertinent code from the ALS `RecommenderJob`:
> >
> >
> >
> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.mahout/mahout-core/0.7/org/apache/mahout/cf/taste/hadoop/als/RecommenderJob.java?av=f#148
> >
> > I'm sure this can be optimised, but by an order of magnitude? Something
> to
> > try out, I'll report back if I find anything concrete.
> >
> >
> >
> > On 6 March 2013 11:13, Ted Dunning <te...@gmail.com> wrote:
> >
> > > Well, it would definitely not be the for time I counted incorrectly.
> > >  Anytime I do arithmetic the result should be considered suspect.  I do
> > > think my numbers are correct, but then again, I always do.
> > >
> > > But the OP did say 20 dimensions which gives me back 5x.
> > >
> > > Inclusion of learning time is a good suspect.  In the other side of the
> > > ledger, if the multiply is doing any column wise access it is a likely
> > > performance bug.  The computation is AB'. Perhaps you refer to rows of
> B
> > > which are the columns of B'.
> > >
> > > Sent from my sleepy thumbs set to typing on my iPhone.
> > >
> > > On Mar 6, 2013, at 4:16 AM, Sean Owen <sr...@gmail.com> wrote:
> > >
> > > > If there are 100 features, it's more like 2.6M * 2.8M * 100 = 728
> > Tflops
> > > --
> > > > I think you're missing an "M", and the features by an order of
> > magnitude.
> > > > That's still 1 day on an 8-core machine by this rule of thumb.
> > > >
> > > > The 80 hours is the model building time too (right?), not the time to
> > > > multiply U*M'. This is dominated by iterations when building from
> > > scratch,
> > > > and I expect took 75% of that 80 hours. So if the multiply was 20
> hours
> > > --
> > > > on 10 machines -- on Hadoop, then that's still slow but not out of
> the
> > > > question for Hadoop, given it's usually a 3-6x slowdown over a
> parallel
> > > > in-core implementation.
> > > >
> > > > I'm pretty sure what exists in Mahout here can be optimized further
> at
> > > the
> > > > Hadoop level; I don't know that it's doing the multiply badly though.
> > In
> > > > fact I'm pretty sure it's caching cols in memory, which is a bit of
> > > > 'cheating' to speed up by taking a lot of memory.
> > > >
> > > >
> > > > On Wed, Mar 6, 2013 at 3:47 AM, Ted Dunning <te...@gmail.com>
> > > wrote:
> > > >
> > > >> Hmm... each users recommendations seems to be about 2.8 x 20M Flops
> =
> > > 60M
> > > >> Flops.  You should get about a Gflop per core in Java so this should
> > > about
> > > >> 60 ms.  You can make this faster with more cores or by using ATLAS.
> > > >>
> > > >> Are you expecting 3 million unique people every 80 hours?  If no,
> then
> > > it
> > > >> is probably more efficient to compute the recommendations on the
> fly.
> > > >>
> > > >> How many recommendations per second are you expecting?  If you have
> 1
> > > >> million uniques per day (just for grins) and we assume 20,000 s/day
> to
> > > >> allow for peak loading, you have to do 50 queries per second peak.
> >  This
> > > >> seems to require 3 cores.  Use 16 to be safe.
> > > >>
> > > >> Regarding the 80 hours, 3 million x 60ms = 180,000 seconds = 50
> hours.
> > >  I
> > > >> think that your map-reduce is under performing by about a factor of
> > 10.
> > > >> This is quite plausible with bad arrangement of the inner loops.  I
> > > think
> > > >> that you would have highest performance computing the
> recommendations
> > > for a
> > > >> few thousand items by a few thousand users at a time.  It might be
> > just
> > > >> about as fast to do all items against a few users at a time.  The
> > reason
> > > >> for this is that dense matrix multiply requires c n x k + m x k
> memory
> > > ops,
> > > >> but n x k x m arithmetic ops.  If you can re-use data many times,
> you
> > > can
> > > >> balance memory channel bandwidth against CPU speed.  Typically you
> > need
> > > 20
> > > >> or more re-uses to really make this fly.
> > > >>
> > > >>
> > >
> >
>

Re: Top-N recommendations from SVD

Posted by Sean Owen <sr...@gmail.com>.
Yeah that's right, he said 20 features, oops. And yes he says he's talking
about the recs only too, so that's not right either. That seems way too
long relative to factorization. And the factorization seems quite fast; how
many machines, and how many iterations?

I thought the shape of the computation was to cache B' (yes, whose columns
are the rows of B) and multiply against the rows of A. There again, that's
probably wrong given the latest timing info.


On Wed, Mar 6, 2013 at 10:25 AM, Josh Devins <hi...@joshdevins.com> wrote:

> So the 80 hour estimate is _only_ for the U*M', top-n calculation and not
> the factorization. Factorization is on the order of 2-hours. For the
> interested, here's the pertinent code from the ALS `RecommenderJob`:
>
>
> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.mahout/mahout-core/0.7/org/apache/mahout/cf/taste/hadoop/als/RecommenderJob.java?av=f#148
>
> I'm sure this can be optimised, but by an order of magnitude? Something to
> try out, I'll report back if I find anything concrete.
>
>
>
> On 6 March 2013 11:13, Ted Dunning <te...@gmail.com> wrote:
>
> > Well, it would definitely not be the for time I counted incorrectly.
> >  Anytime I do arithmetic the result should be considered suspect.  I do
> > think my numbers are correct, but then again, I always do.
> >
> > But the OP did say 20 dimensions which gives me back 5x.
> >
> > Inclusion of learning time is a good suspect.  In the other side of the
> > ledger, if the multiply is doing any column wise access it is a likely
> > performance bug.  The computation is AB'. Perhaps you refer to rows of B
> > which are the columns of B'.
> >
> > Sent from my sleepy thumbs set to typing on my iPhone.
> >
> > On Mar 6, 2013, at 4:16 AM, Sean Owen <sr...@gmail.com> wrote:
> >
> > > If there are 100 features, it's more like 2.6M * 2.8M * 100 = 728
> Tflops
> > --
> > > I think you're missing an "M", and the features by an order of
> magnitude.
> > > That's still 1 day on an 8-core machine by this rule of thumb.
> > >
> > > The 80 hours is the model building time too (right?), not the time to
> > > multiply U*M'. This is dominated by iterations when building from
> > scratch,
> > > and I expect took 75% of that 80 hours. So if the multiply was 20 hours
> > --
> > > on 10 machines -- on Hadoop, then that's still slow but not out of the
> > > question for Hadoop, given it's usually a 3-6x slowdown over a parallel
> > > in-core implementation.
> > >
> > > I'm pretty sure what exists in Mahout here can be optimized further at
> > the
> > > Hadoop level; I don't know that it's doing the multiply badly though.
> In
> > > fact I'm pretty sure it's caching cols in memory, which is a bit of
> > > 'cheating' to speed up by taking a lot of memory.
> > >
> > >
> > > On Wed, Mar 6, 2013 at 3:47 AM, Ted Dunning <te...@gmail.com>
> > wrote:
> > >
> > >> Hmm... each users recommendations seems to be about 2.8 x 20M Flops =
> > 60M
> > >> Flops.  You should get about a Gflop per core in Java so this should
> > about
> > >> 60 ms.  You can make this faster with more cores or by using ATLAS.
> > >>
> > >> Are you expecting 3 million unique people every 80 hours?  If no, then
> > it
> > >> is probably more efficient to compute the recommendations on the fly.
> > >>
> > >> How many recommendations per second are you expecting?  If you have 1
> > >> million uniques per day (just for grins) and we assume 20,000 s/day to
> > >> allow for peak loading, you have to do 50 queries per second peak.
>  This
> > >> seems to require 3 cores.  Use 16 to be safe.
> > >>
> > >> Regarding the 80 hours, 3 million x 60ms = 180,000 seconds = 50 hours.
> >  I
> > >> think that your map-reduce is under performing by about a factor of
> 10.
> > >> This is quite plausible with bad arrangement of the inner loops.  I
> > think
> > >> that you would have highest performance computing the recommendations
> > for a
> > >> few thousand items by a few thousand users at a time.  It might be
> just
> > >> about as fast to do all items against a few users at a time.  The
> reason
> > >> for this is that dense matrix multiply requires c n x k + m x k memory
> > ops,
> > >> but n x k x m arithmetic ops.  If you can re-use data many times, you
> > can
> > >> balance memory channel bandwidth against CPU speed.  Typically you
> need
> > 20
> > >> or more re-uses to really make this fly.
> > >>
> > >>
> >
>

Re: Top-N recommendations from SVD

Posted by Josh Devins <hi...@joshdevins.com>.
So the 80 hour estimate is _only_ for the U*M', top-n calculation and not
the factorization. Factorization is on the order of 2-hours. For the
interested, here's the pertinent code from the ALS `RecommenderJob`:

http://grepcode.com/file/repo1.maven.org/maven2/org.apache.mahout/mahout-core/0.7/org/apache/mahout/cf/taste/hadoop/als/RecommenderJob.java?av=f#148

I'm sure this can be optimised, but by an order of magnitude? Something to
try out, I'll report back if I find anything concrete.



On 6 March 2013 11:13, Ted Dunning <te...@gmail.com> wrote:

> Well, it would definitely not be the for time I counted incorrectly.
>  Anytime I do arithmetic the result should be considered suspect.  I do
> think my numbers are correct, but then again, I always do.
>
> But the OP did say 20 dimensions which gives me back 5x.
>
> Inclusion of learning time is a good suspect.  In the other side of the
> ledger, if the multiply is doing any column wise access it is a likely
> performance bug.  The computation is AB'. Perhaps you refer to rows of B
> which are the columns of B'.
>
> Sent from my sleepy thumbs set to typing on my iPhone.
>
> On Mar 6, 2013, at 4:16 AM, Sean Owen <sr...@gmail.com> wrote:
>
> > If there are 100 features, it's more like 2.6M * 2.8M * 100 = 728 Tflops
> --
> > I think you're missing an "M", and the features by an order of magnitude.
> > That's still 1 day on an 8-core machine by this rule of thumb.
> >
> > The 80 hours is the model building time too (right?), not the time to
> > multiply U*M'. This is dominated by iterations when building from
> scratch,
> > and I expect took 75% of that 80 hours. So if the multiply was 20 hours
> --
> > on 10 machines -- on Hadoop, then that's still slow but not out of the
> > question for Hadoop, given it's usually a 3-6x slowdown over a parallel
> > in-core implementation.
> >
> > I'm pretty sure what exists in Mahout here can be optimized further at
> the
> > Hadoop level; I don't know that it's doing the multiply badly though. In
> > fact I'm pretty sure it's caching cols in memory, which is a bit of
> > 'cheating' to speed up by taking a lot of memory.
> >
> >
> > On Wed, Mar 6, 2013 at 3:47 AM, Ted Dunning <te...@gmail.com>
> wrote:
> >
> >> Hmm... each users recommendations seems to be about 2.8 x 20M Flops =
> 60M
> >> Flops.  You should get about a Gflop per core in Java so this should
> about
> >> 60 ms.  You can make this faster with more cores or by using ATLAS.
> >>
> >> Are you expecting 3 million unique people every 80 hours?  If no, then
> it
> >> is probably more efficient to compute the recommendations on the fly.
> >>
> >> How many recommendations per second are you expecting?  If you have 1
> >> million uniques per day (just for grins) and we assume 20,000 s/day to
> >> allow for peak loading, you have to do 50 queries per second peak.  This
> >> seems to require 3 cores.  Use 16 to be safe.
> >>
> >> Regarding the 80 hours, 3 million x 60ms = 180,000 seconds = 50 hours.
>  I
> >> think that your map-reduce is under performing by about a factor of 10.
> >> This is quite plausible with bad arrangement of the inner loops.  I
> think
> >> that you would have highest performance computing the recommendations
> for a
> >> few thousand items by a few thousand users at a time.  It might be just
> >> about as fast to do all items against a few users at a time.  The reason
> >> for this is that dense matrix multiply requires c n x k + m x k memory
> ops,
> >> but n x k x m arithmetic ops.  If you can re-use data many times, you
> can
> >> balance memory channel bandwidth against CPU speed.  Typically you need
> 20
> >> or more re-uses to really make this fly.
> >>
> >>
>

Re: Top-N recommendations from SVD

Posted by Ted Dunning <te...@gmail.com>.
Well, it would definitely not be the first time I counted incorrectly.  Anytime I do arithmetic the result should be considered suspect.  I do think my numbers are correct, but then again, I always do.

But the OP did say 20 dimensions, which gives me back 5x.

Inclusion of learning time is a good suspect.  On the other side of the ledger, if the multiply is doing any column-wise access, it is a likely performance bug.  The computation is AB'. Perhaps you refer to rows of B, which are the columns of B'.
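
To make the access-pattern point concrete, a hedged sketch with plain
primitive arrays (illustrative only, not Mahout's Matrix/Vector types): keep
each item's factor vector (a row of B, equivalently a column of B') as one
contiguous array, so the inner dot-product loop reads memory sequentially
instead of striding down a column of B'.

    final class RowMajorScoring {
      // itemFactorsRowMajor holds one contiguous double[numFeatures] per item,
      // i.e. the rows of B. The slow variant would index a transposed layout as
      // bTranspose[k][item] for every k, striding across memory on each step.
      static double[] scoreOneUser(double[] userVec, double[][] itemFactorsRowMajor) {
        int numItems = itemFactorsRowMajor.length;
        double[] scores = new double[numItems];
        for (int item = 0; item < numItems; item++) {
          double[] itemVec = itemFactorsRowMajor[item];
          double dot = 0.0;
          for (int k = 0; k < userVec.length; k++) {
            dot += userVec[k] * itemVec[k];   // contiguous reads of both operands
          }
          scores[item] = dot;
        }
        return scores;
      }
    }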

Sent from my sleepy thumbs set to typing on my iPhone.  

On Mar 6, 2013, at 4:16 AM, Sean Owen <sr...@gmail.com> wrote:

> If there are 100 features, it's more like 2.6M * 2.8M * 100 = 728 Tflops --
> I think you're missing an "M", and the features by an order of magnitude.
> That's still 1 day on an 8-core machine by this rule of thumb.
> 
> The 80 hours is the model building time too (right?), not the time to
> multiply U*M'. This is dominated by iterations when building from scratch,
> and I expect took 75% of that 80 hours. So if the multiply was 20 hours --
> on 10 machines -- on Hadoop, then that's still slow but not out of the
> question for Hadoop, given it's usually a 3-6x slowdown over a parallel
> in-core implementation.
> 
> I'm pretty sure what exists in Mahout here can be optimized further at the
> Hadoop level; I don't know that it's doing the multiply badly though. In
> fact I'm pretty sure it's caching cols in memory, which is a bit of
> 'cheating' to speed up by taking a lot of memory.
> 
> 
> On Wed, Mar 6, 2013 at 3:47 AM, Ted Dunning <te...@gmail.com> wrote:
> 
>> Hmm... each users recommendations seems to be about 2.8 x 20M Flops = 60M
>> Flops.  You should get about a Gflop per core in Java so this should about
>> 60 ms.  You can make this faster with more cores or by using ATLAS.
>> 
>> Are you expecting 3 million unique people every 80 hours?  If no, then it
>> is probably more efficient to compute the recommendations on the fly.
>> 
>> How many recommendations per second are you expecting?  If you have 1
>> million uniques per day (just for grins) and we assume 20,000 s/day to
>> allow for peak loading, you have to do 50 queries per second peak.  This
>> seems to require 3 cores.  Use 16 to be safe.
>> 
>> Regarding the 80 hours, 3 million x 60ms = 180,000 seconds = 50 hours.  I
>> think that your map-reduce is under performing by about a factor of 10.
>> This is quite plausible with bad arrangement of the inner loops.  I think
>> that you would have highest performance computing the recommendations for a
>> few thousand items by a few thousand users at a time.  It might be just
>> about as fast to do all items against a few users at a time.  The reason
>> for this is that dense matrix multiply requires c n x k + m x k memory ops,
>> but n x k x m arithmetic ops.  If you can re-use data many times, you can
>> balance memory channel bandwidth against CPU speed.  Typically you need 20
>> or more re-uses to really make this fly.
>> 
>> 

Re: Top-N recommendations from SVD

Posted by Sean Owen <sr...@gmail.com>.
If there are 100 features, it's more like 2.6M * 2.8M * 100 = 728 Tflops --
I think you're missing an "M", and the features by an order of magnitude.
That's still 1 day on an 8-core machine by this rule of thumb.

The 80 hours is the model building time too (right?), not the time to
multiply U*M'. This is dominated by iterations when building from scratch,
and I expect took 75% of that 80 hours. So if the multiply was 20 hours --
on 10 machines -- on Hadoop, then that's still slow but not out of the
question for Hadoop, given it's usually a 3-6x slowdown over a parallel
in-core implementation.

I'm pretty sure what exists in Mahout here can be optimized further at the
Hadoop level; I don't know that it's doing the multiply badly though. In
fact I'm pretty sure it's caching cols in memory, which is a bit of
'cheating' to speed up by taking a lot of memory.


On Wed, Mar 6, 2013 at 3:47 AM, Ted Dunning <te...@gmail.com> wrote:

> Hmm... each users recommendations seems to be about 2.8 x 20M Flops = 60M
> Flops.  You should get about a Gflop per core in Java so this should about
> 60 ms.  You can make this faster with more cores or by using ATLAS.
>
> Are you expecting 3 million unique people every 80 hours?  If no, then it
> is probably more efficient to compute the recommendations on the fly.
>
> How many recommendations per second are you expecting?  If you have 1
> million uniques per day (just for grins) and we assume 20,000 s/day to
> allow for peak loading, you have to do 50 queries per second peak.  This
> seems to require 3 cores.  Use 16 to be safe.
>
> Regarding the 80 hours, 3 million x 60ms = 180,000 seconds = 50 hours.  I
> think that your map-reduce is under performing by about a factor of 10.
>  This is quite plausible with bad arrangement of the inner loops.  I think
> that you would have highest performance computing the recommendations for a
> few thousand items by a few thousand users at a time.  It might be just
> about as fast to do all items against a few users at a time.  The reason
> for this is that dense matrix multiply requires c n x k + m x k memory ops,
> but n x k x m arithmetic ops.  If you can re-use data many times, you can
> balance memory channel bandwidth against CPU speed.  Typically you need 20
> or more re-uses to really make this fly.
>
>

Re: Top-N recommendations from SVD

Posted by Ted Dunning <te...@gmail.com>.
Hmm... each user's recommendations seem to be about 2.8M x 20 Flops = ~60M
Flops.  You should get about a Gflop per core in Java, so this should take
about 60 ms.  You can make this faster with more cores or by using ATLAS.

Are you expecting 3 million unique people every 80 hours?  If no, then it
is probably more efficient to compute the recommendations on the fly.

How many recommendations per second are you expecting?  If you have 1
million uniques per day (just for grins) and we assume 20,000 s/day to
allow for peak loading, you have to do 50 queries per second peak.  This
seems to require 3 cores.  Use 16 to be safe.

Regarding the 80 hours, 3 million x 60ms = 180,000 seconds = 50 hours.  I
think that your map-reduce is underperforming by about a factor of 10.
This is quite plausible with bad arrangement of the inner loops.  I think
that you would have highest performance computing the recommendations for a
few thousand items by a few thousand users at a time.  It might be just
about as fast to do all items against a few users at a time.  The reason
for this is that dense matrix multiply requires c (n x k + m x k) memory ops,
but n x k x m arithmetic ops.  If you can re-use data many times, you can
balance memory channel bandwidth against CPU speed.  Typically you need 20
or more re-uses to really make this fly.
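
As a sketch of the blocking idea (plain Java arrays, illustrative only, not
the Mahout job): score a block of a few thousand users against a block of a
few thousand items at a time, so each block of item factors is pulled into
cache once and then re-used by every user in the user block.

    final class BlockedScoring {
      // userFactors: rows of A, one double[k] per user.
      // itemFactors: rows of B, one double[k] per item.
      // out[user][item] is assumed preallocated by the caller.
      static void scoreBlocked(double[][] userFactors, double[][] itemFactors,
                               int blockSize, double[][] out) {
        int numUsers = userFactors.length;
        int numItems = itemFactors.length;
        for (int u0 = 0; u0 < numUsers; u0 += blockSize) {
          int uEnd = Math.min(u0 + blockSize, numUsers);
          for (int i0 = 0; i0 < numItems; i0 += blockSize) {
            int iEnd = Math.min(i0 + blockSize, numItems);
            // The item block stays hot in cache for every user in the user block.
            for (int u = u0; u < uEnd; u++) {
              double[] userVec = userFactors[u];
              for (int i = i0; i < iEnd; i++) {
                double[] itemVec = itemFactors[i];
                double dot = 0.0;
                for (int k = 0; k < userVec.length; k++) {
                  dot += userVec[k] * itemVec[k];
                }
                out[u][i] = dot;
              }
            }
          }
        }
      }
    }

In practice you would fold the top-n selection into the inner loop instead of
materializing all the scores, but the loop order is the point: each item block
is read once per user block instead of once per user.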


On Tue, Mar 5, 2013 at 4:15 PM, Josh Devins <hi...@joshdevins.com> wrote:

> Hi all,
>
> I have a conceptually simple problem. A user-item matrix, A, whose
> dimensions are ~2.6M rows x ~2.8M cols (~65M non-zeros). Running ALS with
> 20 features reduces this in the usual way to A = UM'. Trying to generate
> top-n (where n=100) recommendations for all users in U is quite a long
> process though. Essentially, for every user, it's generating a prediction
> for all unrated items in M then taking the top-n (all in-memory). I'm using
> the standard ALS `RecommenderJob` for this.
>
> Considering that there are ~2.6M users and ~2.8M items, this is a really,
> really, time consuming way to find the top-n recommendations for all users
> in U. I feel like there could be a tricky way to avoid having to compute
> all item predictions of a user though. I can't find any reference in papers
> about improving this but at the moment, the estimate (with 10 mappers
> running the `RecommenderJob`) is ~80 hours. When I think about this problem
> I wonder if applying kNN or local sensitive min-hashing would somehow help
> me. Basically find the nearest neighbours directly and calculate
> predictions on those items only and not every item in M. On the flip side,
> I could start to reduce the item space, since it's quite large, basically
> start removing items that have low in-degrees since these probably don't
> contribute too much to the final recommendations. I don't like this so much
> though as it could remove some of the long-tail recommendations. At least,
> that is my intuition :)
>
> Thoughts anyone?
>
> Thanks in advance,
>
> Josh
>

Re: Top-N recommendations from SVD

Posted by Sean Owen <sr...@gmail.com>.
Ah OK, so this is quite a big problem. Still, it is quite useful to be able
to make recommendations in real-time, or near-real-time. It saves the
relatively large cost of precomputing, and lets you respond
immediately to new data. If the site has a lot of occasional or new users,
that can make a huge difference -- if I visit once, or once a month,
precomputing recommendations every day from tomorrow doesn't help much.

Of course, that can be difficult to reconcile with <100ms response times,
but with some tricks like LSH and some reasonable hardware I think you'd
find it possible at this scale. It does take a lot of engineering.



On Tue, Mar 5, 2013 at 9:43 PM, Josh Devins <hi...@joshdevins.com> wrote:

> Thanks Sean, at least I know I'm mostly on the right track ;)
>
> So in our case (a large, social, consumer website), this is already a small
> subset of all users (and items for that matter) and is really only the
> active users. In fact, in future iterations, the number of users will
> likely grow by around 3x (or at least, that's my optimistic target). So
> it's not very likely to be able to calculate recommendations for fewer
> users, but I like the idea of leaving all items in the matrix but not
> computing preference predictions for all of them. I will think on this and
> see if it fits for our domain (probably will work), and maybe a pull
> request to Mahout if I can make this generic in some way! LSH was my
> instinctual approach also but wasn't totally sure if this was sane! I'll
> have a look into this as well if needed.
>
> Thanks for the advice!
>
> Josh
>
>
>
> On 5 March 2013 22:23, Sean Owen <sr...@gmail.com> wrote:
>
> > Without any tricks, yes you have to do this much work to really know
> which
> > are the largest values in UM' for every row. There's not an obvious twist
> > that speeds it up.
> >
> > (Do you really want to compute all user recommendations? how many of the
> > 2.6M are likely to be active soon, or, ever?)
> >
> > First, usually it's only a subset of all items that are recommendable
> > anyway. You don't want them out of the model but don't need to consider
> > them. This is domain specific of course, but, if 90% of the items are
> "out
> > of stock" or something, of course you can not bother to score them in the
> > first place
> >
> > Yes, LSH is exactly what I do as well. You hash the item feature vectors
> > into buckets and then only iterate over nearby buckets to find
> candidates.
> > You can avoid looking at 90+% of candidates this way without much if any
> > impact on top N.
> >
> > Pruning is indeed third on the list but usually you get the problem to a
> > pretty good size from the points above.
> >
> >
> >
> > On Tue, Mar 5, 2013 at 9:15 PM, Josh Devins <hi...@joshdevins.com> wrote:
> >
> > > Hi all,
> > >
> > > I have a conceptually simple problem. A user-item matrix, A, whose
> > > dimensions are ~2.6M rows x ~2.8M cols (~65M non-zeros). Running ALS
> with
> > > 20 features reduces this in the usual way to A = UM'. Trying to
> generate
> > > top-n (where n=100) recommendations for all users in U is quite a long
> > > process though. Essentially, for every user, it's generating a
> prediction
> > > for all unrated items in M then taking the top-n (all in-memory). I'm
> > using
> > > the standard ALS `RecommenderJob` for this.
> > >
> > > Considering that there are ~2.6M users and ~2.8M items, this is a
> really,
> > > really, time consuming way to find the top-n recommendations for all
> > users
> > > in U. I feel like there could be a tricky way to avoid having to
> compute
> > > all item predictions of a user though. I can't find any reference in
> > papers
> > > about improving this but at the moment, the estimate (with 10 mappers
> > > running the `RecommenderJob`) is ~80 hours. When I think about this
> > problem
> > > I wonder if applying kNN or local sensitive min-hashing would somehow
> > help
> > > me. Basically find the nearest neighbours directly and calculate
> > > predictions on those items only and not every item in M. On the flip
> > side,
> > > I could start to reduce the item space, since it's quite large,
> basically
> > > start removing items that have low in-degrees since these probably
> don't
> > > contribute too much to the final recommendations. I don't like this so
> > much
> > > though as it could remove some of the long-tail recommendations. At
> > least,
> > > that is my intuition :)
> > >
> > > Thoughts anyone?
> > >
> > > Thanks in advance,
> > >
> > > Josh
> > >
> >
>

Re: Top-N recommendations from SVD

Posted by Josh Devins <hi...@joshdevins.com>.
Thanks Sean, at least I know I'm mostly on the right track ;)

So in our case (a large, social, consumer website), this is already a small
subset of all users (and items for that matter) and is really only the
active users. In fact, in future iterations, the number of users will
likely grow by around 3x (or at least, that's my optimistic target). So
it's not very likely to be able to calculate recommendations for fewer
users, but I like the idea of leaving all items in the matrix but not
computing preference predictions for all of them. I will think on this and
see if it fits for our domain (probably will work), and maybe a pull
request to Mahout if I can make this generic in some way! LSH was my
instinctual approach also, but I wasn't totally sure if this was sane! I'll
have a look into this as well if needed.

Thanks for the advice!

Josh



On 5 March 2013 22:23, Sean Owen <sr...@gmail.com> wrote:

> Without any tricks, yes you have to do this much work to really know which
> are the largest values in UM' for every row. There's not an obvious twist
> that speeds it up.
>
> (Do you really want to compute all user recommendations? how many of the
> 2.6M are likely to be active soon, or, ever?)
>
> First, usually it's only a subset of all items that are recommendable
> anyway. You don't want them out of the model but don't need to consider
> them. This is domain specific of course, but, if 90% of the items are "out
> of stock" or something, of course you can not bother to score them in the
> first place
>
> Yes, LSH is exactly what I do as well. You hash the item feature vectors
> into buckets and then only iterate over nearby buckets to find candidates.
> You can avoid looking at 90+% of candidates this way without much if any
> impact on top N.
>
> Pruning is indeed third on the list but usually you get the problem to a
> pretty good size from the points above.
>
>
>
> On Tue, Mar 5, 2013 at 9:15 PM, Josh Devins <hi...@joshdevins.com> wrote:
>
> > Hi all,
> >
> > I have a conceptually simple problem. A user-item matrix, A, whose
> > dimensions are ~2.6M rows x ~2.8M cols (~65M non-zeros). Running ALS with
> > 20 features reduces this in the usual way to A = UM'. Trying to generate
> > top-n (where n=100) recommendations for all users in U is quite a long
> > process though. Essentially, for every user, it's generating a prediction
> > for all unrated items in M then taking the top-n (all in-memory). I'm
> using
> > the standard ALS `RecommenderJob` for this.
> >
> > Considering that there are ~2.6M users and ~2.8M items, this is a really,
> > really, time consuming way to find the top-n recommendations for all
> users
> > in U. I feel like there could be a tricky way to avoid having to compute
> > all item predictions of a user though. I can't find any reference in
> papers
> > about improving this but at the moment, the estimate (with 10 mappers
> > running the `RecommenderJob`) is ~80 hours. When I think about this
> problem
> > I wonder if applying kNN or local sensitive min-hashing would somehow
> help
> > me. Basically find the nearest neighbours directly and calculate
> > predictions on those items only and not every item in M. On the flip
> side,
> > I could start to reduce the item space, since it's quite large, basically
> > start removing items that have low in-degrees since these probably don't
> > contribute too much to the final recommendations. I don't like this so
> much
> > though as it could remove some of the long-tail recommendations. At
> least,
> > that is my intuition :)
> >
> > Thoughts anyone?
> >
> > Thanks in advance,
> >
> > Josh
> >
>

Re: Top-N recommendations from SVD

Posted by Sean Owen <sr...@gmail.com>.
Without any tricks, yes you have to do this much work to really know which
are the largest values in UM' for every row. There's not an obvious twist
that speeds it up.

(Do you really want to compute all user recommendations? how many of the
2.6M are likely to be active soon, or, ever?)

First, usually it's only a subset of all items that are recommendable
anyway. You don't want them out of the model, but you don't need to consider
them. This is domain-specific of course, but if 90% of the items are "out
of stock" or something, you simply don't need to score them in the
first place.

Yes, LSH is exactly what I do as well. You hash the item feature vectors
into buckets and then only iterate over nearby buckets to find candidates.
You can avoid looking at 90+% of candidates this way without much if any
impact on top N.
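
A hedged sketch of that bucketing using random-hyperplane hashing over the
item factor vectors (the class is made up, not an existing Mahout API): each
item's factor vector gets a small bit signature, items with the same signature
share a bucket, and a user is only scored against the items in the bucket
matching the user's own signature (plus, in practice, a few neighboring
buckets).

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Random;

    final class ItemBuckets {
      private final double[][] hyperplanes;   // numBits random vectors of length numFeatures
      private final Map<Integer, List<Integer>> buckets = new HashMap<Integer, List<Integer>>();

      // numBits should be small (e.g. 10-16) so the signature fits in an int key.
      ItemBuckets(double[][] itemFactors, int numBits, long seed) {
        int numFeatures = itemFactors[0].length;
        Random random = new Random(seed);
        hyperplanes = new double[numBits][numFeatures];
        for (int b = 0; b < numBits; b++) {
          for (int k = 0; k < numFeatures; k++) {
            hyperplanes[b][k] = random.nextGaussian();
          }
        }
        for (int item = 0; item < itemFactors.length; item++) {
          int key = signature(itemFactors[item]);
          List<Integer> bucket = buckets.get(key);
          if (bucket == null) {
            bucket = new ArrayList<Integer>();
            buckets.put(key, bucket);
          }
          bucket.add(item);
        }
      }

      // One bit per hyperplane: the sign of the dot product with the vector.
      int signature(double[] vector) {
        int bits = 0;
        for (int b = 0; b < hyperplanes.length; b++) {
          double dot = 0.0;
          for (int k = 0; k < vector.length; k++) {
            dot += hyperplanes[b][k] * vector[k];
          }
          if (dot > 0.0) {
            bits |= 1 << b;
          }
        }
        return bits;
      }

      // Candidate items for a user: those sharing the user's signature.
      List<Integer> candidates(double[] userVector) {
        List<Integer> bucket = buckets.get(signature(userVector));
        return bucket != null ? bucket : new ArrayList<Integer>();
      }
    }

The usual knobs are the number of bits and how many nearby buckets to probe:
more bits mean smaller candidate sets, but a higher chance of missing a true
top-N item.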

Pruning is indeed third on the list but usually you get the problem to a
pretty good size from the points above.



On Tue, Mar 5, 2013 at 9:15 PM, Josh Devins <hi...@joshdevins.com> wrote:

> Hi all,
>
> I have a conceptually simple problem. A user-item matrix, A, whose
> dimensions are ~2.6M rows x ~2.8M cols (~65M non-zeros). Running ALS with
> 20 features reduces this in the usual way to A = UM'. Trying to generate
> top-n (where n=100) recommendations for all users in U is quite a long
> process though. Essentially, for every user, it's generating a prediction
> for all unrated items in M then taking the top-n (all in-memory). I'm using
> the standard ALS `RecommenderJob` for this.
>
> Considering that there are ~2.6M users and ~2.8M items, this is a really,
> really, time consuming way to find the top-n recommendations for all users
> in U. I feel like there could be a tricky way to avoid having to compute
> all item predictions of a user though. I can't find any reference in papers
> about improving this but at the moment, the estimate (with 10 mappers
> running the `RecommenderJob`) is ~80 hours. When I think about this problem
> I wonder if applying kNN or local sensitive min-hashing would somehow help
> me. Basically find the nearest neighbours directly and calculate
> predictions on those items only and not every item in M. On the flip side,
> I could start to reduce the item space, since it's quite large, basically
> start removing items that have low in-degrees since these probably don't
> contribute too much to the final recommendations. I don't like this so much
> though as it could remove some of the long-tail recommendations. At least,
> that is my intuition :)
>
> Thoughts anyone?
>
> Thanks in advance,
>
> Josh
>