Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:01:12 UTC

[jira] [Updated] (SPARK-17716) Hidden Markov Model (HMM)

     [ https://issues.apache.org/jira/browse/SPARK-17716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-17716:
---------------------------------
    Labels: bulk-closed  (was: )

> Hidden Markov Model (HMM)
> -------------------------
>
>                 Key: SPARK-17716
>                 URL: https://issues.apache.org/jira/browse/SPARK-17716
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Xiangrui Meng
>            Assignee: Runxin Li
>            Priority: Major
>              Labels: bulk-closed
>
> Had an offline chat with [~Lil'Rex], who implemented HMM on Spark at https://github.com/apache/spark/compare/master...lilrex:sequence. I asked him to list popular HMM applications, describe the public API (params, input/output schemas), and compare it with existing HMM implementations.
> h1. Hidden Markov Model (HMM) Design Doc
> h2. Overview
> h3. Introduction to HMM
> A Hidden Markov Model is a statistical machine learning model that assumes a sequence of observations is generated by a Markov process with hidden states. The model has three main components (or two, depending on the implementation):
> * *Transition Probability*: describes the probability distribution of transitions from each state to other states (including self) in the Markov process
> * *Emission Probability*: describes the probability distribution of observations given each hidden state
> * *Initial/Start Probability* (optional): represents the prior probability of each state at the beginning of the observation sequence
> _Note: some implementations merge the Initial Probability into Transition Probability by adding an arbitrary Start state before the first observation point._
> h3. HMM Models and Algorithms
> Given a finite number of states, most HMM models share the same form of Transition Probability: a matrix in which element _(i, j)_ represents the probability of transitioning from state _i_ to state _j_. The Initial Probability usually takes the simple form of a probability vector.
> The Emission Probability, on the other hand, can be represented in many different ways, depending on the nature of the observations (continuous vs. discrete) and on the model assumptions (e.g. a single Gaussian vs. a Gaussian mixture).
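> To make these parameter forms concrete, here is a small, hedged sketch (not part of the proposal; the {{HMMParams}} name is hypothetical) of a discrete-observation HMM with _n_ states and _m_ symbols, using the spark.ml linear algebra types:
> {code:title=Illustration: HMM parameters}
> import org.apache.spark.ml.linalg.{DenseMatrix, DenseVector}
> 
> // Hypothetical container for the three components described above,
> // for a discrete-observation HMM with n states and m symbols.
> case class HMMParams(
>     initialProb: DenseVector,    // length n: prior probability of each state
>     transitionProb: DenseMatrix, // n x n: row i = P(next state | state i)
>     emissionProb: DenseMatrix)   // n x m: row i = P(symbol | state i)
> 
> // A 2-state, 3-symbol example; every row sums to 1.
> // DenseMatrix takes its values in column-major order.
> val params = HMMParams(
>   new DenseVector(Array(0.6, 0.4)),
>   new DenseMatrix(2, 2, Array(0.7, 0.4, 0.3, 0.6)),
>   new DenseMatrix(2, 3, Array(0.1, 0.6, 0.4, 0.3, 0.5, 0.1)))
> {code}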
> There are three main problems associated with HMMs, each with a canonical algorithm:
> # *Evaluation*: What is the probability of a given observation sequence under the model? This is usually solved with the *Forward* or *Backward* algorithm (see the sketch after this list)
> # *Decoding*: What is the most likely state sequence, given the observation sequence and the model? This is usually solved with *Viterbi* decoding
> # *Learning*: How are the model parameters trained from observation sequences? *Baum-Welch* (Forward-Backward), an instance of the *EM* algorithm, is the usual choice for unsupervised training
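> As a concrete illustration of the Evaluation problem, here is a minimal, self-contained sketch of the Forward algorithm for a discrete-observation HMM (plain Scala, not the proposed API; all names are illustrative):
> {code:title=Illustration: Forward algorithm}
> // P(observation sequence | model) for a discrete HMM.
> // pi(i): initial probability of state i; a(i)(j): transition i -> j;
> // b(i)(k): probability of emitting symbol k from state i.
> def forwardProb(pi: Array[Double], a: Array[Array[Double]],
>                 b: Array[Array[Double]], obs: Array[Int]): Double = {
>   val n = pi.length
>   // alpha(i) = P(observations so far, current state = i)
>   var alpha = Array.tabulate(n)(i => pi(i) * b(i)(obs(0)))
>   for (t <- 1 until obs.length) {
>     val prev = alpha
>     alpha = Array.tabulate(n) { j =>
>       (0 until n).map(i => prev(i) * a(i)(j)).sum * b(j)(obs(t))
>     }
>   }
>   alpha.sum
> }
> {code}
> (A production implementation would scale the alphas or work in log space to avoid underflow on long sequences.)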
> h3. Popular Applications of HMM
> * Speech Recognition
> * Part-of-speech Tagging
> * Named Entity Recognition
> * Machine Translation
> * Gene Prediction
> h2. Alternate Libraries
> [Mahout|https://mahout.apache.org/users/classification/hidden-markov-models.html]
> * Assumes the emission probability to be an m-by-n matrix
> * Uses the Baum-Welch algorithm for training and the Viterbi algorithm for prediction
> * API (command line)
> ** training
> {{$ echo "0 1 2 2 2 1 1 0 0 3 3 3 2 1 2 1 1 1 1 2 2 2 0 0 0 0 0 0 2 2 2 0 0 0 0 0 0 2 2 2 3 3 3 3 3 3 2 3 2 3 2 3 2 1 3 0 0 0 1 0 1 0 2 1 2 1 2 1 2 3 3 3 3 2 2 3 2 1 1 0" > hmm-input}}
> {{$ export MAHOUT_LOCAL=true}}
> {{$ $MAHOUT_HOME/bin/mahout baumwelch -i hmm-input -o hmm-model -nh 3 -no 4 -e .0001 -m 1000}}
> ** prediction
> {{$ $MAHOUT_HOME/bin/mahout hmmpredict -m hmm-model -o hmm-predictions -l 10}}
> {{$ cat hmm-predictions}}
> [Mallet|http://mallet.cs.umass.edu/api/cc/mallet/fst/HMM.html]
> * Treats HMM as a Finite State Transducer (FST)
> * Can, in principle, go beyond the first-order Markov assumption by setting an arbitrary order
> * Limited to text data, i.e. discrete observation sequences with a Multinomial emission model
> * Supervised training only
> * API:
> ** Training:
> {{HMM hmm = new HMM(pipe, null);}}
> {{hmm.addStatesForLabelsConnectedAsIn(trainingInstances);}}
> {{HMMTrainerByLikelihood trainer = new HMMTrainerByLikelihood(hmm);}}
> {{trainer.train(trainingInstances, 10);}}
> ** Testing:
> {{evaluator.evaluate(trainer);}}
> [HMMLearn|https://github.com/hmmlearn/hmmlearn]
> * Previously part of scikit-learn
> * Algorithms:
> ** Standard HMM unsupervised training algorithm
> ** Three types of emission models: GMM, Gaussian and Multinomial
> * API:
> ** Training: 
> {{model = hmm.GaussianHMM(n_components=3, covariance_type="full")}}
> {{model.fit(X)}}
> ** Testing: 
> {{hidden_states = model.predict(X)}}
> h2. API
> h3. Design Goals
> * Build the foundation for general Sequential Tagging models (HMM, CRF, etc.)
> * Support multiple Emission Probability models such as “Multinomial” and “Gaussian Mixture”
> * Keep both supervised and unsupervised learning for HMM in mind
> h3. Proposed API
> _Note: This is written for the spark.ml API._
> Decoder API
> {code:title=Decoder.scala}
> trait DecoderParams extends Params {
>   def featureCol: DataType // column of sequential features, e.g. MatrixUDT
>   def predictionCol: DoubleType // column of prediction
>   def labelCol: DataType // column of sequential labels, e.g. VectorUDT
> }
> abstract class Decoder extends Estimator with DecoderParams {
>   def extractLabeledSequences(dataset: Dataset[_]): RDD[LabeledSequence]
> }
> abstract class DecodingModel extends Model with DecoderParams {
>   def numFeatures: Int
>   def decode(features: FeaturesType): Vector
> }
> {code}
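> The {{LabeledSequence}} type referenced by {{extractLabeledSequences}} is not defined in this document; by analogy with {{LabeledPoint}}, one plausible shape (an assumption, not part of the proposal) would be:
> {code:title=Illustration: a possible LabeledSequence}
> import org.apache.spark.ml.linalg.{Matrix, Vector}
> 
> // Hypothetical sequence analogue of LabeledPoint:
> // one row of `features` per frame, one entry of `labels` per frame.
> case class LabeledSequence(labels: Vector, features: Matrix)
> {code}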
> Tagger API
> {code:title=Tagger.scala}
> trait TaggerParams extends DecoderParams with HasRawPredictionCol {
>   def rawPredictionCol: MatrixUDT // column for all predicted label sequences
> }
> abstract class Tagger extends Decoder with TaggerParams
> abstract class TaggingModel extends DecodingModel with TaggerParams {
>   def numClasses: Int
>   def decodeRaw(features: FeaturesType): Array[(Double, Vector)]
>   def raw2prediction(rawPrediction: Array[(Double, Vector)]): Vector
> }
> {code}
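> By analogy with the raw-prediction convention in spark.ml classifiers, a default {{raw2prediction}} could simply pick the best-scoring sequence from the n-best list produced by {{decodeRaw}}; a hedged sketch:
> {code:title=Illustration: a possible default raw2prediction}
> import org.apache.spark.ml.linalg.Vector
> 
> // Given n-best (score, stateSequence) pairs from decodeRaw,
> // return the state sequence with the highest score.
> def raw2prediction(rawPrediction: Array[(Double, Vector)]): Vector =
>   rawPrediction.maxBy(_._1)._2
> {code}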
> MarginalTagger (Probabilistic Tagger) API
> {code:title=MarginalTagger.scala}
> trait MarginalTaggerParams extends TaggerParams with HasProbabilityCol with HasThreshold {
>   def probabilityCol: DoubleType // column for probability
> }
> abstract class MarginalTagger extends Tagger with MarginalTaggerParams
> abstract class MarginalTaggingModel extends TaggingModel with MarginalTaggerParams {
>   def getMargin(features: FeaturesType): Double
> }
> {code}
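> For context, the "marginal" here refers to per-frame posterior state probabilities (the gammas consumed by {{getEmissionStats}} below), conventionally computed from the forward and backward passes. An illustrative (non-API) computation:
> {code:title=Illustration: per-frame posteriors (gammas)}
> // gamma(t)(i) = P(state i at frame t | full observation sequence)
> //             = alpha(t)(i) * beta(t)(i) / sum_j alpha(t)(j) * beta(t)(j)
> def gammas(alphas: List[Array[Double]],
>            betas: List[Array[Double]]): List[Array[Double]] =
>   alphas.zip(betas).map { case (alpha, beta) =>
>     val prod = alpha.zip(beta).map { case (a, b) => a * b }
>     val z = prod.sum
>     prod.map(_ / z)
>   }
> {code}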
> HiddenMarkovModel API
> {code:title=HiddenMarkovModel.scala}
> trait HiddenMarkovModelParams extends MarginalTaggerParams with HasMaxIter with HasTol with HasStandardization with HasThreshold {
>   def smoothing: DoubleParam
>   def emissionType: Param[String] // can be either Multinomial or Gaussian
> }
> class HiddenMarkovModel extends MarginalTagger with HiddenMarkovModelParams {
>   def initialModel: Option[HMMModel] //initial model before training
>   def trainSupervised(dataset: Dataset[_]): Option[HMMModel]
>   def trainUnsupervised(dataset: Dataset[_]): HMMModel
> }
> abstract class HMMModel extends MarginalTaggingModel with HiddenMarkovModelParams {
>   def initialProb: Vector // initial probability for states
>   def transitionProb: DenseMatrix // transition probability between states
>   // compute feature scores for all states in all frames of a sequence; differs by emission model, e.g. Multinomial vs. GMM
>   def precomputeEmission(features: Matrix): List[Array[Double]] 
>   // accumulate sufficient statistics for emission model  
>   def getEmissionStats(features: Matrix, gammas: List[Array[Double]]): DenseMatrix
>   // forward algorithm
>   def forward(emissions: Traversable[Array[Double]]): List[Array[Double]] 
>   // backward algorithm
>   def backward(emissions: Traversable[Array[Double]]): List[Array[Double]]
> }
> class MultinomialHMMModel extends HMMModel {
>   def emissionProb: Matrix // emission probability from states to observations
> }
> object MultinomialHMMModel extends MLReadable[MultinomialHMMModel]
> class HMMModelReader extends MLReader[MultinomialHMMModel]
> {code}
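> Putting the pieces together, a hedged sketch of how the proposed estimator might be used; the setters are assumptions derived from the params above and spark.ml conventions, and {{trainingDf}}/{{testDf}} stand for DataFrames of sequences:
> {code:title=Illustration: possible usage}
> // Train an unsupervised multinomial HMM and tag new sequences.
> val hmm = new HiddenMarkovModel()
>   .setEmissionType("Multinomial") // hypothetical setter for emissionType
>   .setMaxIter(100)                // hypothetical setter from HasMaxIter
>   .setTol(1e-4)                   // hypothetical setter from HasTol
> 
> val model = hmm.fit(trainingDf)      // standard Estimator.fit
> val tagged = model.transform(testDf) // adds predictionCol with decoded states
> {code}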


