Posted to dev@uima.apache.org by "Nicolas Hernandez (JIRA)" <de...@uima.apache.org> on 2011/03/31 20:18:05 UTC

[jira] [Created] (UIMA-2106) Handling tokens not present in the language model (and also with no suffix present in the model) causes a null pointer exception in the tagger process

Handling tokens not present in the language model (and also with no suffix present in the model) causes a null pointer exception in the tagger process
------------------------------------------------------------------------------------------------------------------------------------------------------

                 Key: UIMA-2106
                 URL: https://issues.apache.org/jira/browse/UIMA-2106
             Project: UIMA
          Issue Type: Bug
          Components: Sandbox-Tagger
    Affects Versions: 2.3
         Environment: OS
Linux version 2.6.32-30-generic (buildd@vernadsky) (gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5) ) #59-Ubuntu SMP Tue Mar 1 21:30:21 UTC 2011

JVM
java version "1.6.0_17"
Java(TM) SE Runtime Environment (build 1.6.0_17-b04)
Java HotSpot(TM) Server VM (build 14.3-b01, mixed mode)

            Reporter: Nicolas Hernandez
            Priority: Minor
             Fix For: 2.3


The HMMTagger Analysis Engine class uses the org.apache.uima.examples.tagger.Viterbi.java implementation to determine the POS tag list of a given sentence.
In practice this implementation is partially dependent on the part-of-speech tagging task (as are the remaining HMMTagger classes).
For example, it makes strong assumptions about the kind of tokens it can take as input: it assumes there is no restriction on the tokens' covered-text values.
As a result, it uses the probabilities of specific covered texts for initialization or as a default value when the tagger processes an unknown token.

As a consequence, if the covered text used for setting the default value is not present in the trained model, an error occurs. Roughly speaking, the process first looks for the probability associated with the current token's covered text; if the covered text is not present in the model, it looks in the model for the probability of its longest suffix; and finally, if it does not find a match, it assigns to the unknown covered text the probability of the arbitrary covered text "(".
The problem is that if the probability of this covered text is not available in the model, the probability of the unknown token is not defined, and a null pointer exception occurs later when the variable is used.

Why would the probability of the "(" text not be available in the model? In a large training corpus, if we consider all the tokens, there is little chance of not finding at least one occurrence of "(".
Nevertheless, if we use the HMM training AE to build a model for predicting noun gender and number, verb tense and person, or membership in a named entity, the training tokens will not include the "(" covered text, and consequently an error will occur when tagging is performed.
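
The lookup order described above can be sketched as follows. This is a minimal illustration, not the actual Viterbi.java code: the class name, the method `lookup`, and the two map parameters are hypothetical, and only the fallback token "(" comes from the report.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the lookup chain described in the report;
// names and data layout are assumptions, not the real Viterbi.java.
public class UnknownTokenLookupSketch {

    // The hard-coded fallback token named in the report.
    static final String DEFAULT_TOKEN = "(";

    // Returns the tag -> probability map for a token, or null when neither the
    // token, nor any of its suffixes, nor the fallback token is in the model.
    static Map<String, Double> lookup(Map<String, Map<String, Double>> wordProbs,
                                      Map<String, Map<String, Double>> suffixProbs,
                                      String coveredText) {
        // 1. Exact match on the covered text.
        Map<String, Double> tags = wordProbs.get(coveredText);
        if (tags != null) {
            return tags;
        }
        // 2. Longest proper suffix first, then shorter ones.
        for (int start = 1; start < coveredText.length(); start++) {
            tags = suffixProbs.get(coveredText.substring(start));
            if (tags != null) {
                return tags;
            }
        }
        // 3. Fallback: this is null if "(" never occurred in the training data.
        return wordProbs.get(DEFAULT_TOKEN);
    }

    public static void main(String[] args) {
        // A toy model trained on gender/number labels: it contains no "(" token.
        Map<String, Map<String, Double>> wordProbs = new HashMap<String, Map<String, Double>>();
        Map<String, Double> chat = new HashMap<String, Double>();
        chat.put("masc-sing", 1.0);
        wordProbs.put("chat", chat);
        Map<String, Map<String, Double>> suffixProbs = new HashMap<String, Map<String, Double>>();

        Map<String, Double> tags = lookup(wordProbs, suffixProbs, "xyz");
        // tags is null here; any later call such as tags.get(...) would throw
        // a NullPointerException, which is the failure the report describes.
        System.out.println(tags == null);
    }
}
```

With a model that does contain "(", step 3 returns that token's map and the null case never arises, which is presumably why the problem goes unnoticed on ordinary POS-tagging corpora.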

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (UIMA-2106) Handling tokens not present in the language model (and also with no suffix present in the model) causes a null pointer exception in the tagger process

Posted by "Nicolas Hernandez (JIRA)" <de...@uima.apache.org>.

     [ https://issues.apache.org/jira/browse/UIMA-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicolas Hernandez updated UIMA-2106:
------------------------------------

    Attachment: TaggerHandlingTokensNotPresentInTheLanguageModel.patch

As the default probability value for an unknown token, the algorithm used the probability of a token assumed to be known, namely "(".
Unfortunately, that token can be absent from the language model too.
We propose to keep this default value when its token exists in the model and to set it to Double.MIN_VALUE if not. Strictly speaking, it is not only the value that is set but the pair made of a token and its value. A question remains: would it be better for the algorithm to take as its default a token that is absent from the training data but present in the testing data, or a token that is improbable in both the training and the testing data?
The current solution aims to stay as close as possible to the previous results.
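
The guarded default described in this comment might look like the sketch below. The attached patch is not reproduced in this thread, so this is an assumption about its shape: the class name, the method `defaultTags`, and the `wordProbs` parameter are hypothetical. The "(" token and `Double.MIN_VALUE` come from the comment, and the placeholder tag "null" comes from the two-line change proposed elsewhere in this thread.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the guarded default; not the committed patch.
public class GuardedDefaultSketch {

    static final String DEFAULT_TOKEN = "(";

    // Returns the tag -> probability map to use for an unknown token.
    static Map<String, Double> defaultTags(Map<String, Map<String, Double>> wordProbs) {
        Map<String, Double> tags = wordProbs.get(DEFAULT_TOKEN);
        if (tags != null) {
            // "(" is in the model: keep the previous behaviour unchanged.
            return tags;
        }
        // "(" is absent: build a (tag, value) pair with a tiny positive
        // probability instead of handing back null.
        Map<String, Double> fallback = new HashMap<String, Double>();
        fallback.put("null", Double.MIN_VALUE);
        return fallback;
    }
}
```

Returning the existing map whenever "(" is present is what keeps results identical to the earlier behaviour on models that already worked.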

> Handling tokens not present in the language model (and also with no suffix present in the model) causes a null pointer exception in the tagger process
> ------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: UIMA-2106
>                 URL: https://issues.apache.org/jira/browse/UIMA-2106
>             Project: UIMA
>          Issue Type: Bug
>          Components: Sandbox-Tagger
>    Affects Versions: 2.3
>         Environment: OS
> Linux version 2.6.32-30-generic (buildd@vernadsky) (gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5) ) #59-Ubuntu SMP Tue Mar 1 21:30:21 UTC 2011
> JVM
> java version "1.6.0_17"
> Java(TM) SE Runtime Environment (build 1.6.0_17-b04)
> Java HotSpot(TM) Server VM (build 14.3-b01, mixed mode)
>            Reporter: Nicolas Hernandez
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: TaggerHandlingTokensNotPresentInTheLanguageModel.patch
>
>   Original Estimate: 5m
>  Remaining Estimate: 5m

[jira] [Commented] (UIMA-2106) Handling tokens not present in the language model (and also with no suffix present in the model) causes a null pointer exception in the tagger process

Posted by "Nicolas Hernandez (JIRA)" <de...@uima.apache.org>.

    [ https://issues.apache.org/jira/browse/UIMA-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13014482#comment-13014482 ] 

Nicolas Hernandez commented on UIMA-2106:
-----------------------------------------

Thanks.

I will do that.

[jira] [Updated] (UIMA-2106) Handling tokens not present in the language model (and also with no suffix present in the model) causes a null pointer exception in the tagger process

Posted by "Nicolas Hernandez (JIRA)" <de...@uima.apache.org>.

     [ https://issues.apache.org/jira/browse/UIMA-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicolas Hernandez updated UIMA-2106:
------------------------------------

    Description: 
The HMMTagger Analysis Engine class uses the org.apache.uima.examples.tagger.Viterbi.java implementation to determine the POS tag list of a given sentence.
In practice this implementation is partially dependent on the part-of-speech tagging task (as are the remaining HMMTagger classes).
For example, it makes strong assumptions about the kind of tokens it can take as input: it assumes there is no restriction on the tokens' covered-text values.
As a result, it uses the probabilities of specific covered texts for initialization or as a default value when the tagger processes an unknown token.

As a consequence, if the covered text used for setting the default value is not present in the trained model, an error occurs. Roughly speaking, the process first looks for the probability associated with the current token's covered text; if the covered text is not present in the model, it looks in the model for the probability of its longest suffix; and finally, if it does not find a match, it assigns to the unknown covered text the probability of the arbitrary covered text "(".
The problem is that if the probability of this covered text is not available in the model, the probability of the unknown token is not defined, and a null pointer exception occurs later when the variable is used.

Why would the probability of the "(" text not be available in the model? In a large training corpus, if we consider all the tokens, there is little chance of not finding at least one occurrence of "(".
Nevertheless, if we use the HMM training AE to build a model for predicting noun gender and number, verb tense and person, or membership in a named entity, the training tokens will not include the "(" covered text, and consequently an error will occur when tagging is performed.

A patch has been proposed.


[jira] [Commented] (UIMA-2106) Handling tokens not present in the language model (and also with no suffix present in the model) causes a null pointer exception in the tagger process

Posted by "Marshall Schor (JIRA)" <de...@uima.apache.org>.

    [ https://issues.apache.org/jira/browse/UIMA-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13098226#comment-13098226 ] 

Marshall Schor commented on UIMA-2106:
--------------------------------------

Should this issue be closed?

[jira] [Commented] (UIMA-2106) Handling tokens not present in the language model (and also with no suffix present in the model) causes a null pointer exception in the tagger process

Posted by "Jerry Cwiklik (JIRA)" <de...@uima.apache.org>.

    [ https://issues.apache.org/jira/browse/UIMA-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015412#comment-13015412 ] 

Jerry Cwiklik commented on UIMA-2106:
-------------------------------------

Nicolas, I just committed your patch. 

[jira] [Commented] (UIMA-2106) Handling tokens not present in the language model (and also with no suffix present in the model) causes a null pointer exception in the tagger process

Posted by "Nicolas Hernandez (JIRA)" <de...@uima.apache.org>.

    [ https://issues.apache.org/jira/browse/UIMA-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13014118#comment-13014118 ] 

Nicolas Hernandez commented on UIMA-2106:
-----------------------------------------

As soon as I find out how to assign the task to myself, I can submit a patch. There are two lines to change in org.apache.uima.examples.tagger.Viterbi.java:
available_pos = word_probs.get("("); 
->
available_pos.put("null", Double.MIN_VALUE);

possible_pos_next =  word_probs.get("(");
->
possible_pos_next.put("null", Double.MIN_VALUE);
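
A note on the constant used in this change: `Double.MIN_VALUE` is the smallest positive `double`, not the most negative one, so it acts as a tiny non-zero probability. The sketch below checks that and also shows a consequence that depends on how the tagger combines scores, which this thread does not state: multiplying the constant by another probability underflows to zero, while its logarithm stays finite. The class name is hypothetical.

```java
// Hypothetical demo class; it only exercises standard java.lang behaviour.
public class MinValueSketch {
    public static void main(String[] args) {
        double p = Double.MIN_VALUE;

        // Smallest positive double, so strictly greater than zero.
        System.out.println(p > 0.0);

        // A product with any ordinary probability underflows to zero.
        double product = p * 0.25;
        System.out.println(product == 0.0);

        // In log space the value remains finite, unlike log(0).
        System.out.println(Double.isInfinite(Math.log(p)));
        System.out.println(Double.isInfinite(Math.log(0.0)));
    }
}
```

If the Viterbi scores are products of raw probabilities, a path through this default will score zero; if they are sums of logarithms, it will score very low but remain comparable with other paths.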

[jira] [Commented] (UIMA-2106) Handling tokens not present in the language model (and also with no suffix present in the model) causes a null pointer exception in the tagger process

Posted by "Richard Eckart de Castilho (JIRA)" <de...@uima.apache.org>.

    [ https://issues.apache.org/jira/browse/UIMA-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13014124#comment-13014124 ] 

Richard Eckart de Castilho commented on UIMA-2106:
--------------------------------------------------

I believe only users with the role "developer" can assign issues. But you can already attach a patch.
