Posted to user@mahout.apache.org by Keary Cavin <kc...@renci.org> on 2012/01/25 21:47:20 UTC

status of hadoop hidden markov model in mahout

Hello,

We are investigating Mahout as a scalable solution for a hidden Markov model genetic imputation problem we'd like to run on our Hadoop cluster.

Does the infrastructure exist to run the Mahout HMM code through Hadoop?

Thanks very much,

Keary

Re: status of hadoop hidden markov model in mahout

Posted by Grant Ingersoll <gs...@apache.org>.

On Jan 31, 2012, at 2:14 PM, Keary Cavin wrote:

> 
> Dhruv, I downloaded the MAHOUT-627 patch and applied the files to the current Mahout release.  I'll let you know when I have questions.

Note, the plan is to put this patch into 0.7 once the remaining test issue is fixed.

-Grant

RE: status of hadoop hidden markov model in mahout

Posted by Keary Cavin <kc...@renci.org>.

Hi Dhruv and Manuel,

Thank you for your responses.  I apologize for my late reply.

One of the immediate goals of our project is to perform imputation over several hundred genomes.  In our next round of imputations, we anticipate incoming data for several thousand genomes.

I don't have a ready answer for the question about the number of observed and hidden states in the HMM.  We do know the best imputation window size our current code supports is between 1 and 5 million base pairs.

We have a meeting scheduled with the authors of the imputation code we are using, where we hope to learn the parameters and implementation details of its Hidden Markov Model.

When I have more information about the algorithm, I will send it to you.

Dhruv, I downloaded the MAHOUT-627 patch and applied the files to the current Mahout release.  I'll let you know when I have questions.

Thank you very much,

Keary

-----Original Message-----
From: dhruv21@gmail.com [mailto:dhruv21@gmail.com] On Behalf Of Dhruv Kumar
Sent: Wednesday, January 25, 2012 5:15 PM
To: user@mahout.apache.org
Subject: Re: status of hadoop hidden markov model in mahout

MAHOUT-627 provides a patch for training the HMM using the Baum-Welch algorithm on MapReduce. If you would like to give it a spin, I will be glad to help you with any questions you may have about its use.

Can you provide more details about your project? To understand your scalability requirements, I am particularly interested in the number of observed and hidden states in your HMM and the total training set size.

On Wed, Jan 25, 2012 at 1:08 PM, Manuel Blechschmidt <Manuel.Blechschmidt@gmx.de> wrote:

> Hi Keary,
>
> On 25.01.2012, at 21:47, Keary Cavin wrote:
>
> > Hello,
> >
> > We are investigating Mahout as a scalable solution for a hidden
> > Markov model genetic imputation problem we'd like to run on our
> > Hadoop cluster.
> >
> > Does the infrastructure exist to run the Mahout HMM code through Hadoop?
>
> Actually, in the core Mahout SVN there is no parallel implementation
> of any aspect of the HMM approach yet. Nevertheless, Dhruv Kumar
> implemented a parallelized Baum-Welch algorithm during a Google Summer of Code.
>
> There was a discussion 2 days ago about HMM:
>
> http://mail-archives.apache.org/mod_mbox/mahout-user/201201.mbox/%3C5D1FC0E56861D84086FE034AFD1B223D3CD101E9@mbx1.hosted.exchange-login.net%3E
>
> Here is the patch:
> https://issues.apache.org/jira/browse/MAHOUT-627
>
> Keep in mind that executing machine learning in parallel is ongoing
> research.
>
> >
> > Thanks very much,
> >
> > Keary
>
> /Manuel
>
> --
> Manuel Blechschmidt
> Dortustr. 57
> 14467 Potsdam
> Mobil: 0173/6322621
> Twitter: http://twitter.com/Manuel_B
>
>

Re: status of hadoop hidden markov model in mahout

Posted by Dhruv Kumar <dk...@ecs.umass.edu>.

MAHOUT-627 provides a patch for training the HMM using the Baum-Welch
algorithm on MapReduce. If you would like to give it a spin, I will be glad
to help you with any questions you may have about its use.

Can you provide more details about your project? To understand your
scalability requirements, I am particularly interested in the number of
observed and hidden states in your HMM and the total training set size.
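
For readers wondering how Baum-Welch training can be split into map and reduce phases, the following is a minimal Python sketch of one common decomposition. It is not the MAHOUT-627 code and makes no claim about how that patch is structured. All function names (expected_counts, add_counts, baum_welch_iteration) are hypothetical, observations are assumed to be integer symbol indices, and all model probabilities are assumed to be strictly positive.

```python
from functools import reduce

# Illustrative sketch only; NOT the MAHOUT-627 implementation.
# All names here are hypothetical.

def expected_counts(obs, pi, A, B):
    """'Map' step: scaled forward-backward over one observation sequence.

    Returns expected initial-state, transition and emission counts.
    """
    n, m, T = len(pi), len(B[0]), len(obs)
    # Forward pass, rescaled at every step to avoid numeric underflow.
    alpha, scale = [], []
    for t in range(T):
        if t == 0:
            row = [pi[i] * B[i][obs[0]] for i in range(n)]
        else:
            row = [sum(alpha[t - 1][i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                   for j in range(n)]
        c = sum(row)
        scale.append(c)
        alpha.append([x / c for x in row])
    # Backward pass, reusing the forward scaling factors.
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                       for j in range(n)) / scale[t + 1]
                   for i in range(n)]
    # Accumulate posterior (expected) counts.
    init = [alpha[0][i] * beta[0][i] for i in range(n)]
    trans = [[0.0] * n for _ in range(n)]
    emit = [[0.0] * m for _ in range(n)]
    for t in range(T):
        for i in range(n):
            emit[i][obs[t]] += alpha[t][i] * beta[t][i]
            if t < T - 1:
                for j in range(n):
                    trans[i][j] += (alpha[t][i] * A[i][j] * B[j][obs[t + 1]]
                                    * beta[t + 1][j] / scale[t + 1])
    return init, trans, emit

def add_counts(x, y):
    """'Reduce' step: element-wise sum of two count triples."""
    init = [a + b for a, b in zip(x[0], y[0])]
    trans = [[a + b for a, b in zip(r, s)] for r, s in zip(x[1], y[1])]
    emit = [[a + b for a, b in zip(r, s)] for r, s in zip(x[2], y[2])]
    return init, trans, emit

def normalize(row):
    total = sum(row)
    return [v / total for v in row]

def baum_welch_iteration(sequences, pi, A, B):
    """One EM iteration: map over sequences, reduce the counts, re-estimate."""
    init, trans, emit = reduce(
        add_counts, (expected_counts(s, pi, A, B) for s in sequences))
    return normalize(init), [normalize(r) for r in trans], [normalize(r) for r in emit]
```

In this sketch the per-sequence work grows roughly with the sequence length times the square of the number of hidden states, which is one reason the state counts and training set size matter when judging scalability.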

On Wed, Jan 25, 2012 at 1:08 PM, Manuel Blechschmidt <
Manuel.Blechschmidt@gmx.de> wrote:

> Hi Keary,
>
> On 25.01.2012, at 21:47, Keary Cavin wrote:
>
> > Hello,
> >
> > We are investigating Mahout as a scalable solution for a hidden Markov
> > model genetic imputation problem we'd like to run on our Hadoop cluster.
> >
> > Does the infrastructure exist to run the Mahout HMM code through Hadoop?
>
> Actually, in the core Mahout SVN there is no parallel implementation of
> any aspect of the HMM approach yet. Nevertheless, Dhruv Kumar implemented
> a parallelized Baum-Welch algorithm during a Google Summer of Code.
>
> There was a discussion 2 days ago about HMM:
>
> http://mail-archives.apache.org/mod_mbox/mahout-user/201201.mbox/%3C5D1FC0E56861D84086FE034AFD1B223D3CD101E9@mbx1.hosted.exchange-login.net%3E
>
> Here is the patch:
> https://issues.apache.org/jira/browse/MAHOUT-627
>
> Keep in mind that executing machine learning in parallel is ongoing
> research.
>
> >
> > Thanks very much,
> >
> > Keary
>
> /Manuel
>
> --
> Manuel Blechschmidt
> Dortustr. 57
> 14467 Potsdam
> Mobil: 0173/6322621
> Twitter: http://twitter.com/Manuel_B
>
>

Re: status of hadoop hidden markov model in mahout

Posted by Manuel Blechschmidt <Ma...@gmx.de>.

Hi Keary,

On 25.01.2012, at 21:47, Keary Cavin wrote:

> Hello,
> 
> We are investigating Mahout as a scalable solution for a hidden Markov model genetic imputation problem we'd like to run on our Hadoop cluster.
> 
> Does the infrastructure exist to run the Mahout HMM code through Hadoop?

Actually, in the core Mahout SVN there is no parallel implementation of any aspect of the HMM approach yet. Nevertheless, Dhruv Kumar implemented a parallelized Baum-Welch algorithm during a Google Summer of Code.

There was a discussion 2 days ago about HMM: 
http://mail-archives.apache.org/mod_mbox/mahout-user/201201.mbox/%3C5D1FC0E56861D84086FE034AFD1B223D3CD101E9@mbx1.hosted.exchange-login.net%3E

Here is the patch:
https://issues.apache.org/jira/browse/MAHOUT-627

Keep in mind that executing machine learning in parallel is ongoing research.

> 
> Thanks very much,
> 
> Keary

/Manuel

-- 
Manuel Blechschmidt
Dortustr. 57
14467 Potsdam
Mobil: 0173/6322621
Twitter: http://twitter.com/Manuel_B