Posted to dev@mahout.apache.org by "Dhruv Kumar (JIRA)" <ji...@apache.org> on 2011/03/22 03:50:05 UTC

[jira] [Commented] (MAHOUT-627) Parallelization of Baum-Welch Algorithm for HMM Training

    [ https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13009526#comment-13009526 ] 

Dhruv Kumar commented on MAHOUT-627:
------------------------------------

How are the GSoC proposals discussed in Mahout? 

I was wondering whether I should continue with the proposal here by editing the JIRA, or email it to the dev list.


> Parallelization of Baum-Welch Algorithm for HMM Training 
> ---------------------------------------------------------
>
>                 Key: MAHOUT-627
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-627
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>    Affects Versions: 0.4
>            Reporter: Dhruv Kumar
>            Priority: Minor
>
> Among the unsupervised learning methods available in the current sequential implementation of HMM training (MAHOUT-396), the Baum-Welch (BW) algorithm is an attractive candidate for a parallel, MapReduce implementation. Although slower than Viterbi training, BW is more numerically stable and is guaranteed to converge towards a maximum-likelihood estimate (MLE) of the model parameters; Viterbi training approximates the MLE in order to reduce computation time.
> A parallel, MapReduce implementation of BW will allow scalable model learning over large data sets. The resulting model can then be used for prediction with the current sequential implementation of the Viterbi decoding algorithm. 
> BW is a special case of the Expectation-Maximization (EM) algorithm, and it has been shown that EM algorithms lend themselves to MapReduce implementations: http://www-cs.stanford.edu/~ang/papers/nips06-mapreducemulticore.pdf. Also, k-means clustering is an EM-style algorithm with an existing MapReduce implementation in Mahout, which hints at the possibility of code reuse to aid parallel BW development. Another possibility for a parallel implementation would be to use the notion of matrix probability as shown by Turin: http://tagh.de/tom/wp-content/uploads/turin-1998.pdf. 
> Although non-trivial, a parallel BW implementation would greatly add to Mahout's existing set of HMM functionality. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
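
For concreteness, here is a minimal single-machine sketch (in Java, with purely illustrative names, not existing Mahout code) of one Baum-Welch iteration arranged the way the proposed MapReduce split would work: the per-sequence E-step is the natural map task (each mapper processes one observation sequence and emits expected counts), and the M-step is the natural reduce task (sum the counts from all sequences, then renormalize them into the new model). A real implementation would also need scaled or log-space forward/backward passes to avoid numerical underflow on long sequences.

public class BaumWelchSketch {

  // E-step ("map"): expected transition, emission and initial-state counts
  // for one observation sequence under the current model (pi, a, b).
  // counts[0] = transition counts (n x n), counts[1] = emission counts (n x m),
  // counts[2] = initial-state counts (1 x n).
  static double[][][] expectedCounts(double[] pi, double[][] a, double[][] b, int[] obs) {
    int n = pi.length, m = b[0].length, T = obs.length;
    double[][] alpha = new double[T][n];
    double[][] beta = new double[T][n];

    // Forward pass.
    for (int i = 0; i < n; i++) alpha[0][i] = pi[i] * b[i][obs[0]];
    for (int t = 1; t < T; t++)
      for (int i = 0; i < n; i++) {
        double s = 0;
        for (int j = 0; j < n; j++) s += alpha[t - 1][j] * a[j][i];
        alpha[t][i] = s * b[i][obs[t]];
      }

    // Backward pass.
    for (int i = 0; i < n; i++) beta[T - 1][i] = 1.0;
    for (int t = T - 2; t >= 0; t--)
      for (int i = 0; i < n; i++) {
        double s = 0;
        for (int j = 0; j < n; j++) s += a[i][j] * b[j][obs[t + 1]] * beta[t + 1][j];
        beta[t][i] = s;
      }

    // Likelihood of the sequence, used to normalize the posteriors.
    double likelihood = 0;
    for (int i = 0; i < n; i++) likelihood += alpha[T - 1][i];

    double[][][] counts = {new double[n][n], new double[n][m], new double[1][n]};
    for (int t = 0; t < T; t++)
      for (int i = 0; i < n; i++) {
        double gamma = alpha[t][i] * beta[t][i] / likelihood;  // P(state i at time t | obs)
        counts[1][i][obs[t]] += gamma;
        if (t == 0) counts[2][0][i] += gamma;
        if (t < T - 1)
          for (int j = 0; j < n; j++)  // expected i -> j transitions at time t
            counts[0][i][j] +=
                alpha[t][i] * a[i][j] * b[j][obs[t + 1]] * beta[t + 1][j] / likelihood;
      }
    return counts;
  }

  // M-step ("reduce"): after summing the count matrices from all sequences,
  // each row is renormalized into a probability distribution to give the new
  // transition, emission and initial-state parameters.
  static double[] normalize(double[] row) {
    double sum = 0;
    for (double v : row) sum += v;
    double[] out = new double[row.length];
    for (int k = 0; k < row.length; k++) out[k] = sum == 0 ? row[k] : row[k] / sum;
    return out;
  }
}

In the Hadoop job, each mapper would emit its count rows keyed by state (or by model component), a combiner/reducer would sum the rows and apply the normalization above, and a driver loop would run one such job per iteration until convergence.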

Re: [jira] [Commented] (MAHOUT-627) Parallelization of Baum-Welch Algorithm for HMM Training

Posted by Ted Dunning <te...@gmail.com>.
I would invert the order of 1 and 2 and estimate the times as 30%, 30% and 40%.
You can look at this from a few points of view:

a) writing good tests is hard

b) writing good tests makes writing code much easier

c) writing any kind of decent documentation is VASTLY harder than most
people think.  It is also very important for user adoption, mentor
satisfaction and personal marketing (for instance, a resume or portfolio).

On Thu, Mar 24, 2011 at 4:41 PM, Dhruv Kumar <dk...@ecs.umass.edu> wrote:

> Thanks Ted, I'll start working on a proposal having the following subtasks
> (I have given rudimentary percentage time estimates, please feel free to
> suggest alterations):
>
> 1. Implementing BW on MapReduce along the lines of k-means, focusing on
> re-using as much of the existing k-means code as possible. (60%)
>
> 2. Unit testing the Mapper, Combiner and Reducer, and testing the integration
> in local and pseudo-distributed modes. I may be able to get access to a
> small cluster at UMass for testing in fully distributed mode. (35%)
>
> 3. Writing clear documentation showing users how to apply the implemented
> library code to their needs. (5%)
>
>
>
> On Thu, Mar 24, 2011 at 6:45 PM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > On Thu, Mar 24, 2011 at 3:34 PM, Dhruv Kumar <dk...@ecs.umass.edu>
> wrote:
> >
> > > 2. Another interesting possibility is to express BW as a recursive join.
> > > There's an offshoot of Hadoop called HaLoop (http://code.google.com/p/haloop/)
> > > which supports loop control and caching of intermediate results on the
> > > mapper inputs, reducer inputs and reducer outputs to improve performance.
> > > The HaLoop paper [1] describes this in more detail; they have implemented
> > > k-means as a recursive join.
> > >
> >
> > Until there is flexibility around the execution model, such as the recent
> > MapReduce 2.0 announcement from Yahoo, and until that flexibility is pretty
> > much standard, it is hard to justify this.
> >
> > The exception is where such extended capabilities fit into standard
> > Hadoop 0.20 environments.
> >
>
> >
> > > In either case, I want to clearly define the scope and task list. BW will
> > > be the core of the project, but:
> > >
> > > 1. Does it make sense to implement the "counting method" for model
> > > discovery as well? It is clearly inferior, but would it be a good
> > > reference for comparison against BW? Any added benefit?
> > >
> >
> > No opinion here except that increased scope decreases probability of even
> > partial success.
> >
> >
> > > 2. What has been the standard in the past GSoC Mahout projects regarding
> > > unit testing and documentation?
> > >
> >
> > Do it.
> >
> > Seriously.
> >
> > We use JUnit 4+ and very much prefer strong unit tests.  Nothing in what
> > you are proposing should require anything interesting in this regard.
> > Testing the mapper, combiner and reducer in isolation is good.  Testing the
> > integrated program in local mode or pseudo-distributed mode should suffice
> > beyond that.  It is best if you can separate command-line argument parsing
> > from the execution path so that you can test them separately.
> >
> > >
> > > In the meantime, I've been learning more about Mahout, MapReduce and
> > > Hadoop's internals. One of my course projects this semester is to
> > > implement the Bellman iteration algorithm on MapReduce and so far it has
> > > been coming along well.
> > >
> > > Any feedback is much appreciated.
> > >
> > > Dhruv
> > >
> >
>

Re: [jira] [Commented] (MAHOUT-627) Parallelization of Baum-Welch Algorithm for HMM Training

Posted by Dhruv Kumar <dk...@ecs.umass.edu>.
Thanks Ted, I'll start working on a proposal having the following subtasks
(I have given rudimentary percentage time estimates, please feel free to
suggest alterations):

1. Implementing BW on MapReduce along the lines of k-means, focusing on
re-using as much of the existing k-means code as possible. (60%)

2. Unit testing the Mapper, Combiner and Reducer in isolation, and testing the
integration in local and pseudo-distributed modes (a rough sketch of such a test
follows this list). I may be able to get access to a small cluster at UMass for
testing in fully distributed mode. (35%)

3. Writing clear documentation showing users how to apply the implemented
library code to their needs. (5%)
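
As a rough sketch of what the isolated tests in task 2 could look like (all names
here are hypothetical placeholders, not existing Mahout code), the idea is to keep
the per-record logic free of Hadoop types so plain JUnit 4 can exercise it without
a cluster:

import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class TransitionRowReestimationTest {

  // Stand-in for the reducer-side M-step: turn a summed count row into a
  // probability distribution. In the real job this helper would live in the
  // Reducer class but take no Hadoop arguments, so it can be tested directly.
  static double[] toDistribution(double[] countRow) {
    double sum = 0;
    for (double c : countRow) sum += c;
    double[] p = new double[countRow.length];
    for (int k = 0; k < countRow.length; k++) p[k] = sum == 0 ? 0 : countRow[k] / sum;
    return p;
  }

  @Test
  public void rowSumsToOneAndPreservesRatios() {
    double[] p = toDistribution(new double[] {2.0, 6.0, 2.0});
    assertEquals(0.2, p[0], 1e-12);
    assertEquals(0.6, p[1], 1e-12);
    assertEquals(0.2, p[2], 1e-12);
    assertEquals(1.0, p[0] + p[1] + p[2], 1e-12);
  }
}

The mapper and combiner logic would get the same treatment, with the Hadoop wiring
covered separately by a small local-mode or pseudo-distributed integration test.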



On Thu, Mar 24, 2011 at 6:45 PM, Ted Dunning <te...@gmail.com> wrote:

> On Thu, Mar 24, 2011 at 3:34 PM, Dhruv Kumar <dk...@ecs.umass.edu> wrote:
>
> > 2. Another interesting possibility is to express BW as a recursive join.
> > There's an offshoot of Hadoop called HaLoop (http://code.google.com/p/haloop/)
> > which supports loop control and caching of intermediate results on the
> > mapper inputs, reducer inputs and reducer outputs to improve performance.
> > The HaLoop paper [1] describes this in more detail; they have implemented
> > k-means as a recursive join.
> >
>
> Until there is flexibility around the execution model, such as the recent
> MapReduce 2.0 announcement from Yahoo, and until that flexibility is pretty
> much standard, it is hard to justify this.
>
> The exception is where such extended capabilities fit into standard Hadoop
> 0.20 environments.
>

>
> > In either case, I want to clearly define the scope and task list. BW will
> > be the core of the project, but:
> >
> > 1. Does it make sense to implement the "counting method" for model
> > discovery as well? It is clearly inferior, but would it be a good reference
> > for comparison against BW? Any added benefit?
> >
>
> No opinion here except that increased scope decreases probability of even
> partial success.
>
>
> > 2. What has been the standard in the past GSoC Mahout projects regarding
> > unit testing and documentation?
> >
>
> Do it.
>
> Seriously.
>
> We use JUnit 4+ and very much prefer strong unit tests.  Nothing in what
> you are proposing should require anything interesting in this regard.
> Testing the mapper, combiner and reducer in isolation is good.  Testing the
> integrated program in local mode or pseudo-distributed mode should suffice
> beyond that.  It is best if you can separate command-line argument parsing
> from the execution path so that you can test them separately.
>
> >
> > In the meantime, I've been learning more about Mahout, MapReduce and
> > Hadoop's internals. One of my course projects this semester is to
> > implement the Bellman iteration algorithm on MapReduce and so far it has
> > been coming along well.
> >
> > Any feedback is much appreciated.
> >
> > Dhruv
> >
>

Re: [jira] [Commented] (MAHOUT-627) Parallelization of Baum-Welch Algorithm for HMM Training

Posted by Ted Dunning <te...@gmail.com>.
On Thu, Mar 24, 2011 at 3:34 PM, Dhruv Kumar <dk...@ecs.umass.edu> wrote:

> 2. Another interesting possibility is to express BW as a recursive join.
> There's an offshoot of Hadoop called HaLoop (http://code.google.com/p/haloop/)
> which supports loop control and caching of intermediate results on the
> mapper inputs, reducer inputs and reducer outputs to improve performance.
> The HaLoop paper [1] describes this in more detail; they have implemented
> k-means as a recursive join.
>

Until there is flexibility around the execution model, such as the recent
MapReduce 2.0 announcement from Yahoo, and until that flexibility is pretty
much standard, it is hard to justify this.

The exception is where such extended capabilities fit into standard Hadoop
0.20 environments.


> In either case, I want to clearly define the scope and task list. BW will
> be the core of the project, but:
>
> 1. Does it make sense to implement the "counting method" for model
> discovery as well? It is clearly inferior, but would it be a good reference
> for comparison against BW? Any added benefit?
>

No opinion here except that increased scope decreases probability of even
partial success.


> 2. What has been the standard in the past GSoC Mahout projects regarding
> unit testing and documentation?
>

Do it.

Seriously.

We use JUnit 4+ and very much prefer strong unit tests.  Nothing in what you
are proposing should require anything interesting in this regard.  Testing the
mapper, combiner and reducer in isolation is good.  Testing the integrated
program in local mode or pseudo-distributed mode should suffice beyond that.
It is best if you can separate command-line argument parsing from the execution
path so that you can test them separately.
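
For example, a hypothetical (non-Mahout) sketch of that separation: the parsing
step is a pure function that a plain JUnit test can cover on its own, and the
execution path only ever sees the already-parsed options.

import java.util.HashMap;
import java.util.Map;

public class BaumWelchDriverSketch {

  // Pure parsing step: turns "--key value" pairs into a map. No side effects
  // and no Hadoop objects, so it can be unit tested without running a job.
  static Map<String, String> parseArgs(String[] args) {
    Map<String, String> opts = new HashMap<String, String>();
    for (int i = 0; i + 1 < args.length; i += 2) {
      if (!args[i].startsWith("--")) {
        throw new IllegalArgumentException("Expected an option, got: " + args[i]);
      }
      opts.put(args[i].substring(2), args[i + 1]);
    }
    return opts;
  }

  // Execution path: consumes parsed options only. In the real job this is
  // where the per-iteration MapReduce jobs would be configured and submitted.
  static void run(Map<String, String> opts) {
    int maxIterations = Integer.parseInt(opts.get("maxIterations"));
    System.out.println("Would run " + maxIterations + " iterations on " + opts.get("input"));
  }

  public static void main(String[] args) {
    run(parseArgs(args));
  }
}

A test can then call parseArgs(new String[] {"--input", "hdfs:///data",
"--maxIterations", "10"}) and assert on the returned map without touching the
execution path at all.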

>
> In the meantime, I've been learning more about Mahout, MapReduce and
> Hadoop's internals. One of my course projects this semester is to implement
> the Bellman iteration algorithm on MapReduce and so far it has been coming
> along well.
>
> Any feedback is much appreciated.
>
> Dhruv
>

Re: [jira] [Commented] (MAHOUT-627) Parallelization of Baum-Welch Algorithm for HMM Training

Posted by Dhruv Kumar <dk...@ecs.umass.edu>.
On Tue, Mar 22, 2011 at 1:15 AM, Robin Anil <ro...@gmail.com> wrote:

> You can discuss on the group and update the JIRA with concrete plans
>

As far as parallelizing Baum-Welch (BW) goes, I have two broad approaches
which I'm interested in:

1. As Ted had suggested, a lot of code in the existing k-means
implementation could be re-used to implement BW, since both are
Expectation-Maximization (EM) algorithms.

2. Another interesting possibility is to express BW as a recursive join.
There's an offshoot of Hadoop called HaLoop (http://code.google.com/p/haloop/)
which supports loop control and caching of intermediate results on the mapper
inputs, reducer inputs and reducer outputs to improve performance. The HaLoop
paper [1] describes this in more detail; they have implemented k-means as a
recursive join.

In either case, I want to clearly define the scope and task list. BW will be
the core of the project, but:

1. Does it make sense to implement the "counting method" for model
discovery as well? It is clearly inferior, but would it be a good reference
for comparison against BW? Any added benefit?

2. What has been the standard in the past GSoC Mahout projects regarding
unit testing and documentation?

In the meantime, I've been learning more about Mahout, MapReduce and
Hadoop's internals. One of my course projects this semester is to implement
the Bellman iteration algorithm on MapReduce and so far it has been coming
along well.

Any feedback is much appreciated.

Dhruv

Re: [jira] [Commented] (MAHOUT-627) Parallelization of Baum-Welch Algorithm for HMM Training

Posted by Robin Anil <ro...@gmail.com>.
You can discuss on the group and update the JIRA with concrete plans