Posted to dev@mahout.apache.org by "Isabel Drost (JIRA)" <ji...@apache.org> on 2010/08/13 17:46:15 UTC

[jira] Commented: (MAHOUT-396) Proposal for Implementing Hidden Markov Model

    [ https://issues.apache.org/jira/browse/MAHOUT-396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898284#action_12898284 ] 

Isabel Drost commented on MAHOUT-396:
-------------------------------------

Patch applies cleanly with "-p1" (was generated from a git clone of Mahout), builds, all tests green after applying it.

Some comments after taking an initial look at the code:

* Please don't use "System.out.println" anywhere in the code - use loggers instead (see the sketch below this list).
* I still see quite a few int arrays in the example code. Could you please explain why you did not use o.a.m.math.list.IntArrayList, as proposed by Robin?
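For illustration, the kind of change meant by the first point - a minimal sketch assuming SLF4J, which Mahout uses elsewhere; the class name here is made up:

{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// hypothetical class name, for illustration only
public final class HmmTrainerExample {

  private static final Logger log = LoggerFactory.getLogger(HmmTrainerExample.class);

  void reportIteration(int iteration, double logLikelihood) {
    // instead of: System.out.println("iteration " + iteration + ": " + logLikelihood);
    log.info("Iteration {}: log-likelihood {}", iteration, logLikelihood);
  }
}
{code}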

The rest of the code looks fine to me otherwise - though I am to be considered highly subjective in that matter. I would be glad if a second Mahout developer had time to take a look.

Just for the record: The current implementation is sequential-only. During the past few weeks I have come across a few publications that might be interesting for follow-up work: "Scaling the iHMM: Parallelization versus Hadoop". In addition, there seems to have been work in the direction of spectral learning algorithms for HMMs that might be interesting: "Hilbert Space Embeddings of Hidden Markov Models" (L. Song, B. Boots, S. Saddiqi, G. Gordon, A. Smola) and "A spectral algorithm for learning hidden Markov models" (D. Hsu, S. Kakade, T. Zhang).

> Proposal for Implementing Hidden Markov Model
> ---------------------------------------------
>
>                 Key: MAHOUT-396
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-396
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.4
>            Reporter: Max Heimel
>            Assignee: Max Heimel
>            Priority: Minor
>             Fix For: 0.4
>
>         Attachments: MAHOUT-396.diff, MAHOUT-396.diff, MAHOUT-396.diff, MAHOUT-396.diff, MAHOUT-396.diff
>
>
> h4. Overview
> This is a project proposal for a summer-term university project to write a (sequential) HMM implementation for Mahout. Five students will work on this project as part of a course mentored by Isabel Drost.
> h4. Abstract:
> Hidden Markov Models are used in multiple areas of Machine Learning, such as speech recognition, handwritten letter recognition or natural language processing. A Hidden Markov Model (HMM) is a statistical model of a process consisting of two (in our case discrete) random variables O and Y, which change their state sequentially. The variable Y with states {y_1, ... , y_n} is called the "hidden variable", since its state is not directly observable. The state of Y changes sequentially with a so-called - in our case first-order - Markov property. This means that the state change probability of Y depends only on its current state and does not change over time. Formally we write: P(Y(t+1)=y_i|Y(0)...Y(t)) = P(Y(t+1)=y_i|Y(t)) = P(Y(2)=y_i|Y(1)). The variable O with states {o_1, ... , o_m} is called the "observable variable", since its state can be directly observed. O does not have a Markov property, but its state probability depends only on the current state of Y.
> Formally, an HMM is defined as a tuple M=(n,m,P,A,B), where n is the number of hidden states, m is the number of observable states, P is an n-dimensional vector containing the initial hidden state probabilities, A is the nxn-dimensional "transition matrix" containing the transition probabilities such that A[i,j]=P(Y(t)=y_i|Y(t-1)=y_j), and B is the mxn-dimensional "observation matrix" containing the observation probabilities such that B[i,j]=P(O=o_i|Y=y_j).
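> To make the definition concrete, a hypothetical sketch of how the tuple M=(n,m,P,A,B) could be held in the Mahout math types; the class and field names here are illustrative only, not the API proposed by the patch:
> {code:java}
> import org.apache.mahout.math.DenseMatrix;
> import org.apache.mahout.math.DenseVector;
> import org.apache.mahout.math.Matrix;
> import org.apache.mahout.math.Vector;
> 
> // illustrative sketch only, not the API of the attached patch
> public class HmmModelSketch {
> 
>   private final Vector initialProbabilities;  // P, length n
>   private final Matrix transitionMatrix;      // A, n x n, A[i][j] = P(Y(t)=y_i | Y(t-1)=y_j)
>   private final Matrix observationMatrix;     // B, m x n, B[i][j] = P(O=o_i | Y=y_j)
> 
>   public HmmModelSketch(int numHiddenStates, int numObservableStates) {
>     this.initialProbabilities = new DenseVector(numHiddenStates);
>     this.transitionMatrix = new DenseMatrix(numHiddenStates, numHiddenStates);
>     this.observationMatrix = new DenseMatrix(numObservableStates, numHiddenStates);
>   }
> }
> {code}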
> Rabiner [[1|#reference1]] defined three main problems for HMMs:
> # Evaluation: Given a sequence O of observations and a model M, what is the probability P(O|M) that sequence O was generated by model M? The Evaluation problem can be efficiently solved using the Forward algorithm (a sketch follows this list).
> # Decoding: Given a sequence O of observations and a model M, what is the most likely sequence Y*=argmax(Y) P(O|M,Y) of hidden variables to generate this sequence? The Decoding problem can be efficiently solved using the Viterbi algorithm.
> # Learning: Given a sequence O of observations, what is the most likely model M*=argmax(M) P(O|M) to generate this sequence? The Learning problem can be efficiently solved using the Baum-Welch algorithm.
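> For illustration, a minimal, unscaled sketch of the Forward algorithm for problem 1, using plain double arrays for brevity; the actual patch would use the Mahout math types, and a production version needs scaling or log-space arithmetic to avoid underflow on long sequences:
> {code:java}
> // illustrative sketch of the forward recursion, not the code from the attached patch
> public final class ForwardSketch {
> 
>   // pi[i] = P(Y(1)=y_i), a[i][j] = P(Y(t)=y_i | Y(t-1)=y_j), b[o][j] = P(O=o | Y=y_j)
>   static double sequenceProbability(double[] pi, double[][] a, double[][] b, int[] observations) {
>     int n = pi.length;
>     double[] alpha = new double[n];
>     // initialization: alpha_1(i) = pi_i * b[O_1][i]
>     for (int i = 0; i < n; i++) {
>       alpha[i] = pi[i] * b[observations[0]][i];
>     }
>     // induction over the remaining observations
>     for (int t = 1; t < observations.length; t++) {
>       double[] next = new double[n];
>       for (int i = 0; i < n; i++) {
>         double sum = 0.0;
>         for (int j = 0; j < n; j++) {
>           sum += alpha[j] * a[i][j];
>         }
>         next[i] = sum * b[observations[t]][i];
>       }
>       alpha = next;
>     }
>     // termination: P(O|M) = sum_i alpha_T(i)
>     double total = 0.0;
>     for (double v : alpha) {
>       total += v;
>     }
>     return total;
>   }
> }
> {code}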
> The target of each milestone is the implementation of the given goals, together with the respective documentation and unit tests.
> h4. Timeline
> Mid of May 2010 - Mid of July 2010
> h4. Milestones
> I) Define an HMM class based on Apache Mahout Math package offering interfaces to set model parameters, perform consistency checks, perform output prediction.
> 1 week from May 18th till May 25th.
> II) Write sequential implementations of the forward (cf. problem 1 [[1|#reference1]]) and backward algorithms.
> 2 weeks from May 25th till June 8th.
> III) Write a sequential implementation of the Viterbi algorithm (cf. problem 2 [[1|#reference1]]), based on the existing forward algorithm implementation.
> 2 weeks from June 8th till June 22nd
> IV) Have a running sequential implementation of Baum-Welch algorithm for model parameter learning (application II [ref]), based on existing forward/backward algorithm implementation.
> 2 weeks from June 8th till June 22nd
> V) Provide a usage example of HMM implementation, demonstrating all three problems.
> 2 weeks from June 22nd till July 6th
> VI) Finalize documentation and implementation, clean up open ends.
> 1 week from July 6th till July 13th
> h4. References:
> {anchor:reference1}[[1|http://www.cs.ubc.ca/~murphyk/Bayes/rabiner.pdf]]    Lawrence R. Rabiner (February 1989). "A tutorial on Hidden Markov Models and selected applications in speech recognition". Proceedings of the IEEE 77 (2): 257-286. doi:10.1109/5.18626.
> Potential test data sets:
> [http://www.cnts.ua.ac.be/conll2000/chunking/]
> [http://archive.ics.uci.edu/ml/datasets/Character+Trajectories]

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.