Posted to dev@mahout.apache.org by DB Tsai <db...@dbtsai.com> on 2013/06/12 02:20:42 UTC

In-Mapper combiner design pattern

Hi,

We recently started using the in-mapper combiner design pattern in
our Hadoop-based algorithms at Alpine Data Labs. Those algorithms
include variable selection using information gain, decision trees,
naive Bayes, and SVM, and we found a 20-40% performance speedup
without much extra work.

The idea is simple: use an in-mapper LRU cache to combine results
first, instead of relying on the combiner directly. When the cache is
full, emit the eldest entry to the combiner or reducer. The details
are discussed in Data-Intensive Text Processing with MapReduce
(http://lintool.github.io/MapReduceAlgorithms/MapReduce-book-final.pdf)
by Jimmy Lin and Chris Dyer at the University of Maryland, College Park.

We would like to contribute the API to Mahout and work more closely
with the open source community. I'm now working on random forest using
information gain, which we plan to contribute to the Mahout
community. We also have a scalable kernel SVM implementation that we
intend to contribute to Mahout as well. We just presented a talk
about our SVM at the SF machine learning meetup and got great feedback; see

http://www.meetup.com/sfmachinelearning/events/116497192/?_af_eid=116497192&a=uc1_te&_af=event

The API is pretty simple: just change context.write to combiner.write,
and remember to flush the cache in the cleanup method.

This is an example of implementing the classic Hadoop word count using
the in-mapper combiner:
https://github.com/dbtsai/mahout/blob/trunk/core/src/test/java/org/apache/mahout/common/mapreduce/InMapperCombinerExampleTest.java

All we need to do is change context.write to combiner.write. The test
code for this example is in
https://github.com/dbtsai/mahout/blob/trunk/core/src/test/java/org/apache/mahout/common/mapreduce/InMapperCombinerTest.java

This is the actual implementation of the in-mapper combiner using an LRU cache:
https://github.com/dbtsai/mahout/blob/trunk/core/src/main/java/org/apache/mahout/common/mapreduce/InMapperCombiner.java

This implementation is well tested:
https://github.com/dbtsai/mahout/blob/trunk/core/src/test/java/org/apache/mahout/common/mapreduce/InMapperCombinerTest.java

I'm wondering which part of Mahout would be the best candidate for this
kind of in-mapper combiner, to demonstrate that the idea works. I'll
focus on that particular use case and benchmark it.

Thanks.

Sincerely,

DB Tsai
-----------------------------------
Web: http://www.dbtsai.com
Phone : +1-650-383-8392

Re: In-Mapper combiner design pattern

Posted by DB Tsai <db...@dbtsai.com>.

I'm more than willing to start porting the SVM to Mahout once I finish
the customer POC request.

In the meantime, Michael Yang, my colleague at Alpine Data Labs, will
start working on a "single-pass algorithm for penalized linear
regression with cross validation" first.

We are looking forward to working closely with the community.

Thanks.

Sincerely,

DB Tsai
-----------------------------------
Web: http://www.dbtsai.com
Phone : +1-650-383-8392


On Sun, Jun 30, 2013 at 10:48 AM, Ted Dunning <te...@gmail.com> wrote:
> +1 to what Grant says.
>
> If you have code and tests, porting to Mahout is not that terribly hard.
>
>
>
> On Sun, Jun 30, 2013 at 4:32 AM, Grant Ingersoll <gs...@apache.org>wrote:
>
>> Just  coming back to this...
>>
>> On Jun 12, 2013, at 5:38 PM, DB Tsai <db...@dbtsai.com> wrote:
>>
>> > Hi,
>> >
>> > For scalable SVM, since our codebase is quite different from mahout,
>> > it may take some time to refactorize it to work in mahout.
>>
>> Note, the community may be able to help, here, if you put up a patch, then
>> others likely will jump on and help.   Your call, of course.
>>
>> Food for thought,
>> Grant

Re: In-Mapper combiner design pattern

Posted by Ted Dunning <te...@gmail.com>.

+1 to what Grant says.

If you have code and tests, porting to Mahout is not that terribly hard.



On Sun, Jun 30, 2013 at 4:32 AM, Grant Ingersoll <gs...@apache.org>wrote:

> Just  coming back to this...
>
> On Jun 12, 2013, at 5:38 PM, DB Tsai <db...@dbtsai.com> wrote:
>
> > Hi,
> >
> > For scalable SVM, since our codebase is quite different from mahout,
> > it may take some time to refactorize it to work in mahout.
>
> Note, the community may be able to help, here, if you put up a patch, then
> others likely will jump on and help.   Your call, of course.
>
> Food for thought,
> Grant

Re: In-Mapper combiner design pattern

Posted by Grant Ingersoll <gs...@apache.org>.

Just  coming back to this...

On Jun 12, 2013, at 5:38 PM, DB Tsai <db...@dbtsai.com> wrote:

> Hi,
> 
> For scalable SVM, since our codebase is quite different from mahout,
> it may take some time to refactorize it to work in mahout.

Note, the community may be able to help here: if you put up a patch, others will likely jump on and help. Your call, of course.

Food for thought,
Grant

Re: In-Mapper combiner design pattern

Posted by DB Tsai <db...@dbtsai.com>.

Hi,

For scalable SVM, since our codebase is quite different from Mahout's,
it may take some time to refactor it to work in Mahout. However, we
are trying to integrate Mahout PCA now, so as we get more familiar
with the Mahout codebase, it may become easier for us to port our code
to Mahout. These slides describe our technical implementation of kernel SVM:
http://www.slideshare.net/SaraAsher/svm-map-reduceslides

In our company, we still leave the traditional combiner in place since it
doesn't hurt. I'm going to create a ticket in the JIRA issue tracker about
this tonight. I'm also very curious to see the benchmark results.

Which algorithm do you think I should start with? Originally, I wanted
to start with naive Bayes, since we saw a great performance improvement
there. However, I don't understand the whole logic in Mahout's code yet.
Is IndexInstancesMapper.java the right place to look?
https://github.com/dbtsai/mahout/blob/15c30350635ef26593f26c13be19736531778bed/core/src/main/java/org/apache/mahout/classifier/naivebayes/training/IndexInstancesMapper.java

As Jake said, CachingCV0Driver already uses a form of this pattern, but
it only flushes the cache in the cleanup phase, which may cause memory
issues when the key-value pairs held in memory exceed the memory
available to the mapper.

The idea is to keep a fixed-size LRU cache in the mapper. Instead of
being emitted to the reducer right away, key-value pairs are stored in
the LRU cache. When a new key-value pair is added, the cache tries to
combine it with the existing pair for that key using a user-defined
combining function.

If the key is new and the LRU cache is full, the cache emits the eldest
entry to the reducer to make room for the new key-value pair.
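
To make the eviction behavior described above concrete, here is a minimal,
Hadoop-free sketch. It is not the actual InMapperCombiner code: the class
name LruCombiningCache, the BiConsumer emitter (standing in for
context.write), and the use of an access-ordered LinkedHashMap are all
assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;
import java.util.function.BinaryOperator;

/** Hypothetical fixed-capacity LRU combining cache; not the real InMapperCombiner. */
final class LruCombiningCache<K, V> {
    private final int capacity;
    private final BinaryOperator<V> combiner;
    private final BiConsumer<K, V> emitter; // stands in for context.write
    // accessOrder = true: iteration runs from least to most recently used.
    private final LinkedHashMap<K, V> cache = new LinkedHashMap<>(16, 0.75f, true);

    LruCombiningCache(int capacity, BinaryOperator<V> combiner, BiConsumer<K, V> emitter) {
        this.capacity = capacity;
        this.combiner = combiner;
        this.emitter = emitter;
    }

    void write(K key, V value) {
        V existing = cache.get(key); // a hit also marks the key as recently used
        if (existing != null) {
            cache.put(key, combiner.apply(existing, value));
            return;
        }
        if (cache.size() >= capacity) {
            // New key and the cache is full: emit the eldest entry to make room.
            Iterator<Map.Entry<K, V>> it = cache.entrySet().iterator();
            Map.Entry<K, V> eldest = it.next();
            emitter.accept(eldest.getKey(), eldest.getValue());
            it.remove();
        }
        cache.put(key, value);
    }

    /** Emits everything still cached; call this from the mapper's cleanup phase. */
    void flush() {
        for (Map.Entry<K, V> e : cache.entrySet()) {
            emitter.accept(e.getKey(), e.getValue());
        }
        cache.clear();
    }

    public static void main(String[] args) {
        List<String> out = new ArrayList<>();
        LruCombiningCache<String, Long> c =
                new LruCombiningCache<>(2, Long::sum, (k, v) -> out.add(k + "=" + v));
        for (String w : "a b a c a b".split(" ")) {
            c.write(w, 1L);
        }
        c.flush();
        System.out.println(out); // [b=1, c=1, a=3, b=1]
    }
}
```

In this run the key "b" is emitted twice because it was evicted before its
second occurrence arrived, so the reducer (or a regular combiner) still has
to merge partial results. A version operating on Hadoop Writables would
presumably also need to copy keys and values before caching them, since
mappers such as the word count below reuse the same Text instance.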

The following is a short example. Users can customize the cache size
and the combining function. All users have to do is replace
context.write with combiner.write, and remember to flush the data
still in the cache in the cleanup phase.

    public static class WordCountMapperWithInMapperCombiner
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private final static LongWritable one = new LongWritable(1);
        private final Text word = new Text();
        private final InMapperCombiner combiner = new InMapperCombiner<Text, LongWritable>(
                2048, // cacheCapacity; the default is 65536
                new CombiningFunction<LongWritable>() {
                    @Override
                    public LongWritable combine(LongWritable value1, LongWritable value2) {
                        value1.set(value1.get() + value2.get());
                        return value1;
                    }
                });

        @Override
        @SuppressWarnings("unchecked")
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                // Goes through the LRU cache instead of context.write.
                combiner.write(word, one, context);
            }
        }

        @Override
        protected void cleanup(Mapper.Context context)
                throws IOException, InterruptedException {
            // Emit whatever is still held in the cache.
            combiner.flush(context);
        }
    }
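
One consequence of evicting entries before the mapper has finished is that
the same key can reach the reducer as several partial results, so the
combining function should be associative and commutative. Summing counts,
as in the word count above, satisfies this; averaging does not unless the
value carries a sum and a count. The sketch below illustrates that point
with a hypothetical SumCount value type that is not part of the proposed
API and uses plain Java types rather than Writables.

```java
import java.util.function.BinaryOperator;

/** Hypothetical value type for averaging: carries a partial sum and a count. */
final class SumCount {
    final double sum;
    final long count;

    SumCount(double sum, long count) {
        this.sum = sum;
        this.count = count;
    }

    /** Associative and commutative, so partial results can be merged in any grouping. */
    static final BinaryOperator<SumCount> COMBINE =
            (a, b) -> new SumCount(a.sum + b.sum, a.count + b.count);

    double mean() {
        return sum / count;
    }

    public static void main(String[] args) {
        SumCount x = new SumCount(1.0, 1);
        SumCount y = new SumCount(2.0, 1);
        SumCount z = new SumCount(6.0, 1);

        // Grouping 1: x and y combined in the cache, z merged later.
        SumCount left = COMBINE.apply(COMBINE.apply(x, y), z);
        // Grouping 2: x emitted alone, y and z combined in the cache.
        SumCount right = COMBINE.apply(x, COMBINE.apply(y, z));
        System.out.println(left.mean() + " " + right.mean()); // 3.0 3.0

        // Averaging averages directly depends on the grouping and is wrong here.
        double naive = ((1.0 + 2.0) / 2 + 6.0) / 2;
        System.out.println(naive); // 3.75
    }
}
```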

Sincerely,

DB Tsai
-----------------------------------
Web: http://www.dbtsai.com
Phone : +1-650-383-8392


On Wed, Jun 12, 2013 at 9:20 AM, Andy Schlaikjer
<an...@gmail.com> wrote:
> This is perhaps tangential, but pig 0.10+ does this automatically with
> option pig.exec.mapPartAgg = true:
>
> http://pig.apache.org/docs/r0.10.0/perf.html, section "Hash-based
> Aggregation in Map Task"
> https://issues.apache.org/jira/browse/PIG-2228
> https://cwiki.apache.org/PIG/pig-performance-optimization.html
> http://wiki.apache.org/pig/PigHashBasedAggInMap
>
>
>
>
> On Wed, Jun 12, 2013 at 8:59 AM, Jake Mannix <ja...@gmail.com> wrote:
>
>> In fact, I think we're doing exactly this "design pattern" in a few places
>> already.  In particular, the CachingCV0Driver is effectively an in-memory
>> mapside cache of topic/term counts, and it only flushes them all out in the
>> cleanup phase of the mapper execution.
>>
>> I'd certainly like to see what sort of API this would look like, a
>> relatively general form of this could be quite useful, especially if the
>> LRU cache can be tuned and controlled (sometimes you might want to control
>> it's flushing, as there may be business/algorithm logic which needs to be
>> executed at flush time).
>>
>>
>> On Wed, Jun 12, 2013 at 8:45 AM, Sebastian Schelter <ss...@apache.org>
>> wrote:
>>
>> > Regarding the in-memory combiner: It would be good if you showcase the
>> > benefits on one specific implementation in Mahout, by replacing its
>> > normal combiner with the in-memory one and benchmarking it.
>> >
>> > I'm curious to see the results.
>> >
>> > Best,
>> > Sebastian
>> >
>> >
>> > On 12.06.2013 17:06, Grant Ingersoll wrote:
>> > > Hi DB,
>> > >
>> > > This all sounds rather interesting.  I see a number of places where we
>> > use combiners, so perhaps focus on those first?
>> > >
>> > > Also, any thoughts on when the scalable SVM would be ready?  We are
>> > trying to get 1.0 out in the next few months and I personally think it
>> > would be good to have SVM in.
>> > >
>> > > -Grant
>> > >
>> > > On Jun 11, 2013, at 8:20 PM, DB Tsai <db...@dbtsai.com> wrote:
>> > >
>> > >> Hi,
>> > >>
>> > >> Recently we started to use the in-mapper combiner design patterns in
>> > >> our hadoop based algorithms at Alpine Data Labs; those algorithms
>> > >> include variable selection using info gain, decision tree, naive bayes
>> > >> model and SVM, and we found that we can have 20~40% performance
>> > >> speedup without doing too much work.
>> > >>
>> > >> The whole idea is really simple, just use a in-mapper LRU cache to
>> > >> combine the result first instead of using combiner directly. If the
>> > >> cache is full, just emit the result to combiner or reducer. The detail
>> > >> is discussed in Data-Intensive Text Processing with MapReduce
>> > >> (
>> http://lintool.github.io/MapReduceAlgorithms/MapReduce-book-final.pdf)
>> > >> by Jimmy Lin and Chris Dyer at University of Maryland, College Park.
>> > >>
>> > >> We would like to contribute the api to mahout, and work closer with
>> > >> open source community. I'm now working on random forest using
>> > >> information gain, and we have the plan to contribute to mahout
>> > >> community. We also have a scalable kernel SVM implementation which
>> > >> intends to contribute to mahout as well. We just presented a talk
>> > >> about our SVM in SF machine learning meetup with great feedback, see
>> > >>
>> > >>
>> >
>> http://www.meetup.com/sfmachinelearning/events/116497192/?_af_eid=116497192&a=uc1_te&_af=event
>> > >>
>> > >> The api is pretty simple, just change context.write to combiner.write,
>> > >> and remember to flush the cache in the clean up method.
>> > >>
>> > >> This is the example of implementing hadoop classical word count using
>> > >> in-mapper combiner,
>> > >>
>> >
>> https://github.com/dbtsai/mahout/blob/trunk/core/src/test/java/org/apache/mahout/common/mapreduce/InMapperCombinerExampleTest.java
>> > >>
>> > >> , and all we need to do is just change from context.write to
>> > >> combiner.write. The test code for this example is in
>> > >>
>> >
>> https://github.com/dbtsai/mahout/blob/trunk/core/src/test/java/org/apache/mahout/common/mapreduce/InMapperCombinerTest.java
>> > >>
>> > >> This is the actually implementation of in-mapper combiner using LRU
>> > cache,
>> > >>
>> >
>> https://github.com/dbtsai/mahout/blob/trunk/core/src/main/java/org/apache/mahout/common/mapreduce/InMapperCombiner.java
>> > >>
>> > >> and this implementation is well tested.
>> > >>
>> >
>> https://github.com/dbtsai/mahout/blob/trunk/core/src/test/java/org/apache/mahout/common/mapreduce/InMapperCombinerTest.java
>> > >>
>> > >> I'm wondering what is the best candidate in mahout to use this kind of
>> > >> in-mapper combiner now to demonstrate this idea works, and I'll focus
>> > >> on that particular use case, and do benchmark.
>> > >>
>> > >> Thanks.
>> > >>
>> > >> Sincerely,
>> > >>
>> > >> DB Tsai
>> > >> -----------------------------------
>> > >> Web: http://www.dbtsai.com
>> > >> Phone : +1-650-383-8392
>> > >
>> > > --------------------------------------------
>> > > Grant Ingersoll | @gsingers
>> > > http://www.lucidworks.com
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> >
>> >
>>
>>
>> --
>>
>>   -jake
>>

Re: In-Mapper combiner design pattern

Posted by Andy Schlaikjer <an...@gmail.com>.

This is perhaps tangential, but pig 0.10+ does this automatically with
option pig.exec.mapPartAgg = true:

http://pig.apache.org/docs/r0.10.0/perf.html, section "Hash-based
Aggregation in Map Task"
https://issues.apache.org/jira/browse/PIG-2228
https://cwiki.apache.org/PIG/pig-performance-optimization.html
http://wiki.apache.org/pig/PigHashBasedAggInMap




On Wed, Jun 12, 2013 at 8:59 AM, Jake Mannix <ja...@gmail.com> wrote:

> In fact, I think we're doing exactly this "design pattern" in a few places
> already.  In particular, the CachingCV0Driver is effectively an in-memory
> mapside cache of topic/term counts, and it only flushes them all out in the
> cleanup phase of the mapper execution.
>
> I'd certainly like to see what sort of API this would look like, a
> relatively general form of this could be quite useful, especially if the
> LRU cache can be tuned and controlled (sometimes you might want to control
> it's flushing, as there may be business/algorithm logic which needs to be
> executed at flush time).
>
>
> On Wed, Jun 12, 2013 at 8:45 AM, Sebastian Schelter <ss...@apache.org>
> wrote:
>
> > Regarding the in-memory combiner: It would be good if you showcase the
> > benefits on one specific implementation in Mahout, by replacing its
> > normal combiner with the in-memory one and benchmarking it.
> >
> > I'm curious to see the results.
> >
> > Best,
> > Sebastian
> >
> >
> > On 12.06.2013 17:06, Grant Ingersoll wrote:
> > > Hi DB,
> > >
> > > This all sounds rather interesting.  I see a number of places where we
> > use combiners, so perhaps focus on those first?
> > >
> > > Also, any thoughts on when the scalable SVM would be ready?  We are
> > trying to get 1.0 out in the next few months and I personally think it
> > would be good to have SVM in.
> > >
> > > -Grant
> > >
> > > On Jun 11, 2013, at 8:20 PM, DB Tsai <db...@dbtsai.com> wrote:
> > >
> > >> Hi,
> > >>
> > >> Recently we started to use the in-mapper combiner design patterns in
> > >> our hadoop based algorithms at Alpine Data Labs; those algorithms
> > >> include variable selection using info gain, decision tree, naive bayes
> > >> model and SVM, and we found that we can have 20~40% performance
> > >> speedup without doing too much work.
> > >>
> > >> The whole idea is really simple, just use a in-mapper LRU cache to
> > >> combine the result first instead of using combiner directly. If the
> > >> cache is full, just emit the result to combiner or reducer. The detail
> > >> is discussed in Data-Intensive Text Processing with MapReduce
> > >> (
> http://lintool.github.io/MapReduceAlgorithms/MapReduce-book-final.pdf)
> > >> by Jimmy Lin and Chris Dyer at University of Maryland, College Park.
> > >>
> > >> We would like to contribute the api to mahout, and work closer with
> > >> open source community. I'm now working on random forest using
> > >> information gain, and we have the plan to contribute to mahout
> > >> community. We also have a scalable kernel SVM implementation which
> > >> intends to contribute to mahout as well. We just presented a talk
> > >> about our SVM in SF machine learning meetup with great feedback, see
> > >>
> > >>
> >
> http://www.meetup.com/sfmachinelearning/events/116497192/?_af_eid=116497192&a=uc1_te&_af=event
> > >>
> > >> The api is pretty simple, just change context.write to combiner.write,
> > >> and remember to flush the cache in the clean up method.
> > >>
> > >> This is the example of implementing hadoop classical word count using
> > >> in-mapper combiner,
> > >>
> >
> https://github.com/dbtsai/mahout/blob/trunk/core/src/test/java/org/apache/mahout/common/mapreduce/InMapperCombinerExampleTest.java
> > >>
> > >> , and all we need to do is just change from context.write to
> > >> combiner.write. The test code for this example is in
> > >>
> >
> https://github.com/dbtsai/mahout/blob/trunk/core/src/test/java/org/apache/mahout/common/mapreduce/InMapperCombinerTest.java
> > >>
> > >> This is the actually implementation of in-mapper combiner using LRU
> > cache,
> > >>
> >
> https://github.com/dbtsai/mahout/blob/trunk/core/src/main/java/org/apache/mahout/common/mapreduce/InMapperCombiner.java
> > >>
> > >> and this implementation is well tested.
> > >>
> >
> https://github.com/dbtsai/mahout/blob/trunk/core/src/test/java/org/apache/mahout/common/mapreduce/InMapperCombinerTest.java
> > >>
> > >> I'm wondering what is the best candidate in mahout to use this kind of
> > >> in-mapper combiner now to demonstrate this idea works, and I'll focus
> > >> on that particular use case, and do benchmark.
> > >>
> > >> Thanks.
> > >>
> > >> Sincerely,
> > >>
> > >> DB Tsai
> > >> -----------------------------------
> > >> Web: http://www.dbtsai.com
> > >> Phone : +1-650-383-8392
> > >
> > > --------------------------------------------
> > > Grant Ingersoll | @gsingers
> > > http://www.lucidworks.com
> > >
> > >
> > >
> > >
> > >
> > >
> >
> >
>
>
> --
>
>   -jake
>

Re: In-Mapper combiner design pattern

Posted by Jake Mannix <ja...@gmail.com>.

In fact, I think we're doing exactly this "design pattern" in a few places
already.  In particular, the CachingCV0Driver is effectively an in-memory
mapside cache of topic/term counts, and it only flushes them all out in the
cleanup phase of the mapper execution.

I'd certainly like to see what sort of API this would look like. A
relatively general form of this could be quite useful, especially if the
LRU cache can be tuned and controlled (sometimes you might want to control
its flushing, as there may be business/algorithm logic which needs to be
executed at flush time).


On Wed, Jun 12, 2013 at 8:45 AM, Sebastian Schelter <ss...@apache.org> wrote:

> Regarding the in-memory combiner: It would be good if you showcase the
> benefits on one specific implementation in Mahout, by replacing its
> normal combiner with the in-memory one and benchmarking it.
>
> I'm curious to see the results.
>
> Best,
> Sebastian
>
>
> On 12.06.2013 17:06, Grant Ingersoll wrote:
> > Hi DB,
> >
> > This all sounds rather interesting.  I see a number of places where we
> use combiners, so perhaps focus on those first?
> >
> > Also, any thoughts on when the scalable SVM would be ready?  We are
> trying to get 1.0 out in the next few months and I personally think it
> would be good to have SVM in.
> >
> > -Grant
> >
> > On Jun 11, 2013, at 8:20 PM, DB Tsai <db...@dbtsai.com> wrote:
> >
> >> Hi,
> >>
> >> Recently we started to use the in-mapper combiner design patterns in
> >> our hadoop based algorithms at Alpine Data Labs; those algorithms
> >> include variable selection using info gain, decision tree, naive bayes
> >> model and SVM, and we found that we can have 20~40% performance
> >> speedup without doing too much work.
> >>
> >> The whole idea is really simple, just use a in-mapper LRU cache to
> >> combine the result first instead of using combiner directly. If the
> >> cache is full, just emit the result to combiner or reducer. The detail
> >> is discussed in Data-Intensive Text Processing with MapReduce
> >> (http://lintool.github.io/MapReduceAlgorithms/MapReduce-book-final.pdf)
> >> by Jimmy Lin and Chris Dyer at University of Maryland, College Park.
> >>
> >> We would like to contribute the api to mahout, and work closer with
> >> open source community. I'm now working on random forest using
> >> information gain, and we have the plan to contribute to mahout
> >> community. We also have a scalable kernel SVM implementation which
> >> intends to contribute to mahout as well. We just presented a talk
> >> about our SVM in SF machine learning meetup with great feedback, see
> >>
> >>
> http://www.meetup.com/sfmachinelearning/events/116497192/?_af_eid=116497192&a=uc1_te&_af=event
> >>
> >> The api is pretty simple, just change context.write to combiner.write,
> >> and remember to flush the cache in the clean up method.
> >>
> >> This is the example of implementing hadoop classical word count using
> >> in-mapper combiner,
> >>
> https://github.com/dbtsai/mahout/blob/trunk/core/src/test/java/org/apache/mahout/common/mapreduce/InMapperCombinerExampleTest.java
> >>
> >> , and all we need to do is just change from context.write to
> >> combiner.write. The test code for this example is in
> >>
> https://github.com/dbtsai/mahout/blob/trunk/core/src/test/java/org/apache/mahout/common/mapreduce/InMapperCombinerTest.java
> >>
> >> This is the actually implementation of in-mapper combiner using LRU
> cache,
> >>
> https://github.com/dbtsai/mahout/blob/trunk/core/src/main/java/org/apache/mahout/common/mapreduce/InMapperCombiner.java
> >>
> >> and this implementation is well tested.
> >>
> https://github.com/dbtsai/mahout/blob/trunk/core/src/test/java/org/apache/mahout/common/mapreduce/InMapperCombinerTest.java
> >>
> >> I'm wondering what is the best candidate in mahout to use this kind of
> >> in-mapper combiner now to demonstrate this idea works, and I'll focus
> >> on that particular use case, and do benchmark.
> >>
> >> Thanks.
> >>
> >> Sincerely,
> >>
> >> DB Tsai
> >> -----------------------------------
> >> Web: http://www.dbtsai.com
> >> Phone : +1-650-383-8392
> >
> > --------------------------------------------
> > Grant Ingersoll | @gsingers
> > http://www.lucidworks.com
> >
> >
> >
> >
> >
> >
>
>


-- 

  -jake

Re: In-Mapper combiner design pattern

Posted by Sebastian Schelter <ss...@apache.org>.

Regarding the in-memory combiner: It would be good if you showcase the
benefits on one specific implementation in Mahout, by replacing its
normal combiner with the in-memory one and benchmarking it.

I'm curious to see the results.

Best,
Sebastian


On 12.06.2013 17:06, Grant Ingersoll wrote:
> Hi DB,
> 
> This all sounds rather interesting.  I see a number of places where we use combiners, so perhaps focus on those first?
> 
> Also, any thoughts on when the scalable SVM would be ready?  We are trying to get 1.0 out in the next few months and I personally think it would be good to have SVM in.
> 
> -Grant
> 
> On Jun 11, 2013, at 8:20 PM, DB Tsai <db...@dbtsai.com> wrote:
> 
>> Hi,
>>
>> Recently we started to use the in-mapper combiner design patterns in
>> our hadoop based algorithms at Alpine Data Labs; those algorithms
>> include variable selection using info gain, decision tree, naive bayes
>> model and SVM, and we found that we can have 20~40% performance
>> speedup without doing too much work.
>>
>> The whole idea is really simple, just use a in-mapper LRU cache to
>> combine the result first instead of using combiner directly. If the
>> cache is full, just emit the result to combiner or reducer. The detail
>> is discussed in Data-Intensive Text Processing with MapReduce
>> (http://lintool.github.io/MapReduceAlgorithms/MapReduce-book-final.pdf)
>> by Jimmy Lin and Chris Dyer at University of Maryland, College Park.
>>
>> We would like to contribute the api to mahout, and work closer with
>> open source community. I'm now working on random forest using
>> information gain, and we have the plan to contribute to mahout
>> community. We also have a scalable kernel SVM implementation which
>> intends to contribute to mahout as well. We just presented a talk
>> about our SVM in SF machine learning meetup with great feedback, see
>>
>> http://www.meetup.com/sfmachinelearning/events/116497192/?_af_eid=116497192&a=uc1_te&_af=event
>>
>> The api is pretty simple, just change context.write to combiner.write,
>> and remember to flush the cache in the clean up method.
>>
>> This is the example of implementing hadoop classical word count using
>> in-mapper combiner,
>> https://github.com/dbtsai/mahout/blob/trunk/core/src/test/java/org/apache/mahout/common/mapreduce/InMapperCombinerExampleTest.java
>>
>> , and all we need to do is just change from context.write to
>> combiner.write. The test code for this example is in
>> https://github.com/dbtsai/mahout/blob/trunk/core/src/test/java/org/apache/mahout/common/mapreduce/InMapperCombinerTest.java
>>
>> This is the actually implementation of in-mapper combiner using LRU cache,
>> https://github.com/dbtsai/mahout/blob/trunk/core/src/main/java/org/apache/mahout/common/mapreduce/InMapperCombiner.java
>>
>> and this implementation is well tested.
>> https://github.com/dbtsai/mahout/blob/trunk/core/src/test/java/org/apache/mahout/common/mapreduce/InMapperCombinerTest.java
>>
>> I'm wondering what is the best candidate in mahout to use this kind of
>> in-mapper combiner now to demonstrate this idea works, and I'll focus
>> on that particular use case, and do benchmark.
>>
>> Thanks.
>>
>> Sincerely,
>>
>> DB Tsai
>> -----------------------------------
>> Web: http://www.dbtsai.com
>> Phone : +1-650-383-8392
> 
> --------------------------------------------
> Grant Ingersoll | @gsingers
> http://www.lucidworks.com
> 
> 
> 
> 
> 
>

Re: In-Mapper combiner design pattern

Posted by Grant Ingersoll <gs...@apache.org>.

Hi DB,

This all sounds rather interesting.  I see a number of places where we use combiners, so perhaps focus on those first?

Also, any thoughts on when the scalable SVM would be ready?  We are trying to get 1.0 out in the next few months and I personally think it would be good to have SVM in.

-Grant

On Jun 11, 2013, at 8:20 PM, DB Tsai <db...@dbtsai.com> wrote:

> Hi,
> 
> Recently we started to use the in-mapper combiner design patterns in
> our hadoop based algorithms at Alpine Data Labs; those algorithms
> include variable selection using info gain, decision tree, naive bayes
> model and SVM, and we found that we can have 20~40% performance
> speedup without doing too much work.
> 
> The whole idea is really simple, just use a in-mapper LRU cache to
> combine the result first instead of using combiner directly. If the
> cache is full, just emit the result to combiner or reducer. The detail
> is discussed in Data-Intensive Text Processing with MapReduce
> (http://lintool.github.io/MapReduceAlgorithms/MapReduce-book-final.pdf)
> by Jimmy Lin and Chris Dyer at University of Maryland, College Park.
> 
> We would like to contribute the api to mahout, and work closer with
> open source community. I'm now working on random forest using
> information gain, and we have the plan to contribute to mahout
> community. We also have a scalable kernel SVM implementation which
> intends to contribute to mahout as well. We just presented a talk
> about our SVM in SF machine learning meetup with great feedback, see
> 
> http://www.meetup.com/sfmachinelearning/events/116497192/?_af_eid=116497192&a=uc1_te&_af=event
> 
> The api is pretty simple, just change context.write to combiner.write,
> and remember to flush the cache in the clean up method.
> 
> This is the example of implementing hadoop classical word count using
> in-mapper combiner,
> https://github.com/dbtsai/mahout/blob/trunk/core/src/test/java/org/apache/mahout/common/mapreduce/InMapperCombinerExampleTest.java
> 
> , and all we need to do is just change from context.write to
> combiner.write. The test code for this example is in
> https://github.com/dbtsai/mahout/blob/trunk/core/src/test/java/org/apache/mahout/common/mapreduce/InMapperCombinerTest.java
> 
> This is the actually implementation of in-mapper combiner using LRU cache,
> https://github.com/dbtsai/mahout/blob/trunk/core/src/main/java/org/apache/mahout/common/mapreduce/InMapperCombiner.java
> 
> and this implementation is well tested.
> https://github.com/dbtsai/mahout/blob/trunk/core/src/test/java/org/apache/mahout/common/mapreduce/InMapperCombinerTest.java
> 
> I'm wondering what is the best candidate in mahout to use this kind of
> in-mapper combiner now to demonstrate this idea works, and I'll focus
> on that particular use case, and do benchmark.
> 
> Thanks.
> 
> Sincerely,
> 
> DB Tsai
> -----------------------------------
> Web: http://www.dbtsai.com
> Phone : +1-650-383-8392

--------------------------------------------
Grant Ingersoll | @gsingers
http://www.lucidworks.com