Posted to user@mahout.apache.org by Sean Owen <sr...@gmail.com> on 2010/05/26 17:09:22 UTC

Crude distributed recommender performance / cost stats

Hi all, thought the list might be interested in some recent numbers I
collected on the distributed recommenders, running on Hadoop. I just
finished running a set of recommendations based on the Wikipedia link
graph, for book purposes (yeah, it's unconventional). I ran on my
laptop, but it ought to be crudely representative of how it runs in a
real cluster.

The input is 1058MB as a text file, and contains 130M article-article
associations, from 5.7M articles to 3.8M distinct articles ("users"
and "items", respectively). I estimate cost based on Amazon's North
American small Linux-based instance pricing of $0.085/hour. I ran on a
dual-core laptop with plenty of RAM, allowing 1GB per worker, so
treating one local worker-hour as one small-instance-hour is a fair
approximation.

In this run, I compute recommendations for all 5.7M "users"; you can
of course run for any subset of users instead.

Phase 1 (Item ID to item index mapping)
29 minutes CPU time
$0.05
Output: 60MB

Phase 2 (Create user vectors)
88 minutes CPU time
$0.13
Output: 1159MB

Phase 3 (Count co-occurrence)
77 hours CPU time
$6.54
Output: 23.6GB

Phase 4 (Partial multiply prep)
636 minutes CPU time
$0.90
Output: 24.6GB

Phase 5 (Aggregate and recommend)
about 600 hours CPU time
about $51.00
Output: about 10GB
(I estimated these rather than let it run at home for days!)
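
Roughly speaking, phases 3 through 5 amount to building an item-item
co-occurrence matrix and combining it with each user's (boolean)
preference vector, then keeping the top-scoring unseen items. A minimal
in-memory sketch of that idea, on made-up toy data with plain Java
collections (nothing below is Mahout or Hadoop code):

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class CooccurrenceSketch {

  public static void main(String[] args) {
    // Toy co-occurrence counts: cooc.get(i).get(j) = # of "users" containing both i and j.
    Map<Long,Map<Long,Integer>> cooc = new HashMap<Long,Map<Long,Integer>>();
    put(cooc, 1L, 2L, 3);
    put(cooc, 1L, 3L, 1);
    put(cooc, 2L, 3L, 2);

    // One user's boolean preferences: the articles this "user" links to.
    List<Long> userItems = Arrays.asList(1L, 2L);

    // Score each unseen item by summing its co-occurrence with the user's items.
    Map<Long,Integer> scores = new HashMap<Long,Integer>();
    for (long seen : userItems) {
      Map<Long,Integer> row = cooc.get(seen);
      if (row == null) {
        continue;
      }
      for (Map.Entry<Long,Integer> e : row.entrySet()) {
        if (!userItems.contains(e.getKey())) {
          Integer old = scores.get(e.getKey());
          scores.put(e.getKey(), (old == null ? 0 : old) + e.getValue());
        }
      }
    }
    System.out.println(scores);  // {3=3}: item 3 co-occurs 1 + 2 times with items 1 and 2
  }

  private static void put(Map<Long,Map<Long,Integer>> cooc, long a, long b, int count) {
    // Co-occurrence is symmetric; record both directions.
    if (cooc.get(a) == null) { cooc.put(a, new HashMap<Long,Integer>()); }
    if (cooc.get(b) == null) { cooc.put(b, new HashMap<Long,Integer>()); }
    cooc.get(a).put(b, count);
    cooc.get(b).put(a, count);
  }
}

The distributed job is doing the same kind of thing, just with the
matrix and vectors spread across several map/reduce passes.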


Note that phases 1 and 3 may be run less frequently, and need not be
run every time. The cost is dominated by the last phase anyway, which
is most of the work. I've ignored storage costs in these figures.

This implies a cost of $0.01 (or about 8 instance-minutes) per 1,000
user recommendations. That's not bad if, say, you want to update recs
for your site's 100,000 daily active users for a dollar.
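
The arithmetic behind that figure, using just the phase numbers above
(a sanity check, not new measurements):

public final class CostPerThousand {
  public static void main(String[] args) {
    double hourlyRate = 0.085;            // small instance, $/hour
    double totalHours = 29.0 / 60         // phase 1
                      + 88.0 / 60         // phase 2
                      + 77                // phase 3
                      + 636.0 / 60        // phase 4
                      + 600;              // phase 5 (estimate)
    double users = 5700000.0;
    double totalCost = totalHours * hourlyRate;
    // roughly 690 instance-hours and $59 total
    System.out.printf("total: %.0f instance-hours, $%.2f%n", totalHours, totalCost);
    // roughly 7-8 instance-minutes and $0.01 per 1,000 users
    System.out.printf("per 1,000 users: %.1f instance-minutes, $%.4f%n",
        totalHours * 60.0 / (users / 1000), totalCost / (users / 1000));
  }
}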

There are several levers one could pull internally to sacrifice
accuracy for speed, but they're currently set to pretty normal values,
so this is just one possibility.

Now that's not terrible, but it is about 8x more computing than would
be needed by a non-distributed implementation *if* you could fit the
whole data set into a very large instance's memory, which is still
possible at this scale but needs a pretty big instance. That's a very
apples-to-oranges comparison of course; different algorithms, entirely
different environments. This is about the amount of overhead I'd
expect from distributing -- interesting to note how non-trivial it is.
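
For comparison, the non-distributed route at this scale would look
something like the sketch below, using the in-memory Taste classes.
The file name is made up, the choice of LogLikelihoodSimilarity is just
one reasonable option (the distributed job above effectively uses raw
co-occurrence counts), and you'd need a heap big enough to hold the
whole data model:

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.ItemBasedRecommender;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public final class InMemoryRecommenderSketch {
  public static void main(String[] args) throws Exception {
    // "links.csv" is hypothetical: one "userID,itemID" pair per line.
    DataModel model = new FileDataModel(new File("links.csv"));
    ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
    ItemBasedRecommender recommender =
        new GenericBooleanPrefItemBasedRecommender(model, similarity);
    // Top 10 recommendations for one "user" (source article) ID.
    List<RecommendedItem> recs = recommender.recommend(12345L, 10);
    for (RecommendedItem rec : recs) {
      System.out.println(rec.getItemID() + "\t" + rec.getValue());
    }
  }
}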


Still to do: actually run this on EMR or a real cluster at some point
to see how well these estimates hold up. And, of course, make it
faster.

Re: Crude distributed recommender performance / cost stats

Posted by Sean Owen <sr...@gmail.com>.
Could do... converting the input or modifying the mapper is trivial
enough that it's almost not worth committing and maintaining. It's all
just the stock algorithm otherwise.
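
The "convert the input" route would be something like the sketch below,
assuming each line of the links file starts with the source article ID
followed by the IDs it links to (the same format the custom mappers
later in this thread parse); the file names are made up. It just writes
the "userID,itemID" pairs the stock RecommenderJob expects with
--booleanData:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.PrintWriter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class LinksToCsv {

  private static final Pattern NUMBERS = Pattern.compile("(\\d+)");

  public static void main(String[] args) throws Exception {
    // File names are illustrative only.
    BufferedReader in = new BufferedReader(new FileReader("links-input.txt"));
    PrintWriter out = new PrintWriter(new FileWriter("input.txt"));
    String line;
    while ((line = in.readLine()) != null) {
      Matcher m = NUMBERS.matcher(line);
      if (!m.find()) {
        continue;                    // no IDs on this line
      }
      String userID = m.group();     // first number: source article ("user")
      while (m.find()) {
        // remaining numbers: linked-to articles ("items"); boolean data, no pref value
        out.println(userID + ',' + m.group());
      }
    }
    out.close();
    in.close();
  }
}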

I will dump this into the wiki with some other thoughts.

On Thu, May 27, 2010 at 3:57 PM, Grant Ingersoll <gs...@apache.org> wrote:
> Great stuff, Sean!  Sounds like the code could go into examples with some of the other Wikipedia stuff?
>
> Also, how about c-n-p to https://cwiki.apache.org/confluence/display/MAHOUT/MahoutBenchmarks?
>
> -Grant

Re: Crude distributed recommender performance / cost stats

Posted by Grant Ingersoll <gs...@apache.org>.
Great stuff, Sean!  Sounds like the code could go into examples with some of the other Wikipedia stuff?

Also, how about c-n-p to https://cwiki.apache.org/confluence/display/MAHOUT/MahoutBenchmarks?

-Grant 

Re: Crude distributed recommender performance / cost stats

Posted by Sean Owen <sr...@gmail.com>.
The only customization needed was in the first mapper/reducer to parse
the particular format of the input:

http://users.on.net/~henry/home/wikipedia.htm

I can post the code somewhere... it's in the book too. Oh why not do
it here, it's pasted later.

The rest is just the stock code from HEAD in SVN. The command line is
something like:

hadoop jar mahout-core-0.4-SNAPSHOT.job \
  org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
  -Dmapred.input.dir=input/input.txt -Dmapred.output.dir=output \
  --booleanData true

Your mileage may vary depending on the machine you run it on of course.



import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.VLongWritable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public final class WikipediaItemIDIndexMapper extends MapReduceBase implements
    Mapper<LongWritable,Text,IntWritable,VLongWritable> {

  private static final Pattern NUMBERS = Pattern.compile("(\\d+)");

  @Override
  public void map(LongWritable key,
                  Text value,
                  OutputCollector<IntWritable,VLongWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    Matcher m = NUMBERS.matcher(line);
    // Skip the first number: it's the source article ID (the "user");
    // only the linked-to articles (the "items") need an index entry.
    m.find();
    IntWritable index = new IntWritable();
    VLongWritable itemID = new VLongWritable();
    while (m.find()) {
      long item = Long.parseLong(m.group());
      itemID.set(item);
      index.set(idToIndex(item));
      output.collect(index, itemID);
    }
  }

  // Hash a 64-bit article ID down to a non-negative int index.
  static int idToIndex(long itemID) {
    return 0x7FFFFFFF & ((int) itemID ^ (int) (itemID >>> 32));
  }

}


import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.VLongWritable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public final class WikipediaToItemPrefsMapper extends MapReduceBase implements
    Mapper<LongWritable,Text,VLongWritable,VLongWritable> {

  private static final Pattern NUMBERS = Pattern.compile("(\\d+)");

  @Override
  public void map(LongWritable key,
                  Text value,
                  OutputCollector<VLongWritable,VLongWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    Matcher m = NUMBERS.matcher(line);
    // The first number on the line is the source article ID; treat it as the "user".
    if (!m.find()) {
      return;  // no IDs on this line
    }
    VLongWritable userID = new VLongWritable(Long.parseLong(m.group()));
    VLongWritable itemID = new VLongWritable();
    // Each remaining number is a linked-to article: emit one boolean
    // (user, item) preference per outgoing link. The Writables are reused
    // since collect() serializes them immediately.
    while (m.find()) {
      itemID.set(Long.parseLong(m.group()));
      output.collect(userID, itemID);
    }
  }

}
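
If you want to sanity-check the parsing without starting Hadoop,
something like this little harness (the sample line is made up) prints
what WikipediaToItemPrefsMapper emits:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.VLongWritable;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public final class MapperSmokeTest {
  public static void main(String[] args) throws Exception {
    // Minimal collector that just prints each (user, item) pair.
    OutputCollector<VLongWritable,VLongWritable> collector =
        new OutputCollector<VLongWritable,VLongWritable>() {
          @Override
          public void collect(VLongWritable user, VLongWritable item) {
            System.out.println(user.get() + " -> " + item.get());
          }
        };
    // Made-up line: article 42 links to articles 7, 9 and 11.
    new WikipediaToItemPrefsMapper().map(
        new LongWritable(0L), new Text("42: 7 9 11"), collector, Reporter.NULL);
    // Expected output: 42 -> 7, 42 -> 9, 42 -> 11
  }
}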


On Wed, May 26, 2010 at 4:45 PM, Jake Mannix <ja...@gmail.com> wrote:
> Hey Sean,
>
>  Very cool!  Is there any custom code you used to import the link data /
> instructions on how to reproduce this?
>
>  -jake

Re: Crude distributed recommender performance / cost stats

Posted by Jake Mannix <ja...@gmail.com>.
Hey Sean,

  Very cool!  Is there any custom code you used to import the link data /
instructions on how to reproduce this?

  -jake
