Posted to issues@opennlp.apache.org by "William Colen (JIRA)" <ji...@apache.org> on 2011/07/26 05:13:14 UTC

[jira] [Created] (OPENNLP-238) BestSequence method in BeamSearch can cause NullPointerException if it can not find a valid sequence

BestSequence method in BeamSearch can cause NullPointerException if it can not find a valid sequence
----------------------------------------------------------------------------------------------------

                 Key: OPENNLP-238
                 URL: https://issues.apache.org/jira/browse/OPENNLP-238
             Project: OpenNLP
          Issue Type: Bug
          Components: POS Tagger
    Affects Versions: tools-1.5.2-incubating
            Reporter: William Colen
            Assignee: William Colen
             Fix For: tools-1.5.2-incubating


I am using the standard sequence validator of the POS Tagger with a TagDictionary. Sometimes no outcome matches the tags in the dictionary, which causes a NullPointerException in the bestSequence method.
I think we should add an extra validation: if the heap 'next' is still empty after advancing all valid sequences (line 159), we should let it advance invalid sequences. It would make the POS Tagger more robust.
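
The proposed fallback can be sketched roughly as follows. This is a simplified, hypothetical model of one beam expansion step, not the actual BeamSearch code: the class BeamStepFallbackSketch, the Candidate type and the advance method are names invented for illustration, and the validator is modelled as a plain BiPredicate.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.BiPredicate;

// Hypothetical sketch of one beam expansion step; not OpenNLP's BeamSearch code.
class BeamStepFallbackSketch {

    /** A partial tag sequence with its accumulated log probability. */
    static final class Candidate {
        final List<String> tags;
        final double score;

        Candidate(List<String> tags, double score) {
            this.tags = tags;
            this.score = score;
        }

        Candidate extend(String tag, double logProb) {
            List<String> extended = new ArrayList<>(tags);
            extended.add(tag);
            return new Candidate(extended, score + logProb);
        }
    }

    /** Advances the beam by one token and keeps the beamSize best candidates. */
    static List<Candidate> advance(List<Candidate> prev, String[] outcomes, double[] logProbs,
                                   BiPredicate<List<String>, String> validator, int beamSize) {
        List<Candidate> next = expand(prev, outcomes, logProbs, validator);
        if (next.isEmpty()) {
            // Proposed fallback (assumption): no outcome was valid, so accept every
            // outcome instead of leaving 'next' empty.
            next = expand(prev, outcomes, logProbs, (tags, outcome) -> true);
        }
        next.sort(Comparator.comparingDouble((Candidate c) -> c.score).reversed());
        return new ArrayList<>(next.subList(0, Math.min(beamSize, next.size())));
    }

    /** Extends every candidate with every outcome the given validator accepts. */
    private static List<Candidate> expand(List<Candidate> prev, String[] outcomes, double[] logProbs,
                                          BiPredicate<List<String>, String> validator) {
        List<Candidate> next = new ArrayList<>();
        for (Candidate c : prev) {
            for (int i = 0; i < outcomes.length; i++) {
                if (validator.test(c.tags, outcomes[i])) {
                    next.add(c.extend(outcomes[i], logProbs[i]));
                }
            }
        }
        return next;
    }
}
```

With a fallback like this the search always has something to advance, at the cost of letting sequences through that the validator rejected.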

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (OPENNLP-238) BestSequence method in BeamSearch can cause NullPointerException if it can not find a valid sequence

Posted by "Jörn Kottmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072293#comment-13072293 ] 

Jörn Kottmann commented on OPENNLP-238:
---------------------------------------

What is the reason for it to return no sequence? Is there no possible valid sequence? If so the tagdict might just be incorrect.

Or is there a possible sequence, but beam search doesn't find it?


[jira] [Commented] (OPENNLP-238) BestSequence method in BeamSearch can cause NullPointerException if it can not find a valid sequence

Posted by "Jörn Kottmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072902#comment-13072902 ] 

Jörn Kottmann commented on OPENNLP-238:
---------------------------------------

I still suspect that there might be something wrong in the beam search implementation. When the tagdict restricts the possible sequences, it should still find a valid sequence if one exists.


[jira] [Commented] (OPENNLP-238) BestSequence method in BeamSearch can cause NullPointerException if it can not find a valid sequence

Posted by "Jörn Kottmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072815#comment-13072815 ] 

Jörn Kottmann commented on OPENNLP-238:
---------------------------------------

I observed a similar issue once, where the tagdict restricted the outcomes to ones which are not in the model. Therefore the sequence validator returned false for all possible sequences, and the beam search's bestSequence then returned null.

I am not sure whether we now have a check in the POS Model that fails if the tagdict contains such invalid outcomes, but if I remember correctly there is at least a jira for that.


[jira] [Commented] (OPENNLP-238) BestSequence method in BeamSearch can cause NullPointerException if it can not find a valid sequence

Posted by "William Colen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072280#comment-13072280 ] 

William Colen commented on OPENNLP-238:
---------------------------------------

Yes, you are right. That would be bad for tools like the name finder. I'll remove it later (I can't now because I'm away from my computer).
I'm having issues with the POS Tagger. Often the BeamSearch can't find a valid sequence if I'm using a dictionary. Dictionaries help a lot, but they shouldn't restrict the tagger so much. Exceptions happen here, for example, while tagging a noun that in context should be tagged as an adjective, but the dictionary doesn't include this noun as an adjective and the POS Tagger didn't add noun to the outcomes.


[jira] [Commented] (OPENNLP-238) BestSequence method in BeamSearch can cause NullPointerException if it can not find a valid sequence

Posted by "Jörn Kottmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13073008#comment-13073008 ] 

Jörn Kottmann commented on OPENNLP-238:
---------------------------------------

Oops, we should fix the typo in this error message.

That explains your issue: the tag dictionary denies all sequences beam search could advance, so it can only return null, which causes the NPE you see.

Why isn't cross validation checking the dictionary? Does it not support it?


[jira] [Commented] (OPENNLP-238) BestSequence method in BeamSearch can cause NullPointerException if it can not find a valid sequence

Posted by "William Colen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072998#comment-13072998 ] 

William Colen commented on OPENNLP-238:
---------------------------------------

Cross validation is not checking the dictionary.  That is why I don't have the error.

I tried to train the model using the Train tool and it failed to load:

Loading POS Tagger model ... failed
Model has invalid format: Tag dictioinary contains tags which are unkown by the model!



[jira] [Commented] (OPENNLP-238) BestSequence method in BeamSearch can cause NullPointerException if it can not find a valid sequence

Posted by "William Colen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072854#comment-13072854 ] 

William Colen commented on OPENNLP-238:
---------------------------------------

I verified that the tagset of the corpus and that of the dictionary are the same. But I found some issues:

- the training data is small (4k sentences);
- the tagset is big: more than 200 tags;
- the corpus annotation can combine different tags, and it would be difficult to add those combinations to the dictionary unless I create the dictionary from the corpus, and I don't know if that is a good idea.

examples of combinations:
- when a noun (n) is used as an adjective (adj) the annotation is "n-adj", and I don't have that in the dictionary
- sometimes it is not clear from the context whether something is singular (S) or plural (P), and the person/computer who annotated the corpus added the tag S/P. I also don't have it in the dictionary.
- the same happens for person: we have 0/1/3 when the person of a verb couldn't be decided from the corpus or the word morphology.

What I'm trying to do is to define my own sequence validator that can handle these cases.
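
One way such a validator could compare a dictionary tag with a combined corpus tag is sketched below. It is only an illustration under assumptions: the class CombinedTagMatcherSketch and its methods are invented names, tags are assumed to be "="-separated fields, "n-adj" is treated as "n or adj", and "/" is treated as a list of alternatives within a field. Abbreviated forms such as "3S/P" (meaning 3S or 3P) are not expanded here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for a custom sequence validator; not part of OpenNLP.
class CombinedTagMatcherSketch {

    /** Returns the simple values a (possibly combined) tag field stands for. */
    static Set<String> alternatives(String field) {
        if (field.equals("n-adj")) {
            // Noun used as adjective: accept either reading.
            return new HashSet<>(Arrays.asList("n", "adj"));
        }
        // "M/F" or "S/P" list alternatives; a plain field yields itself.
        return new HashSet<>(Arrays.asList(field.split("/")));
    }

    /** True if the dictionary tag is compatible with the model outcome. */
    static boolean matches(String dictionaryTag, String outcome) {
        if (dictionaryTag.equals(outcome)) {
            return true; // also covers punctuation tags such as "/" itself
        }
        String[] dictFields = dictionaryTag.split("=");
        String[] outcomeFields = outcome.split("=");
        if (dictFields.length != outcomeFields.length) {
            return false;
        }
        for (int i = 0; i < dictFields.length; i++) {
            if (!alternatives(outcomeFields[i]).contains(dictFields[i])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(matches("n=M=P", "n-adj=M=P"));
        System.out.println(matches("adj=M=S", "adj=M/F=S"));
        System.out.println(matches("v-fin=PR=2S=IND", "v-pcp=M=P"));
    }
}
```

A validator built on this would accept an outcome if any dictionary tag of the word matches it, so the dictionary itself would not need entries for every combined tag.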


[jira] [Commented] (OPENNLP-238) BestSequence method in BeamSearch can cause NullPointerException if it can not find a valid sequence

Posted by "Jörn Kottmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13073014#comment-13073014 ] 

Jörn Kottmann commented on OPENNLP-238:
---------------------------------------

I see, the POS Model does not validate the dictionary when it is instantiated with the constructor that the cross validator or the training code uses.
That is an issue we have across our models: the validation is only performed when a model is loaded from an InputStream.
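
A minimal sketch of the idea of validating on every construction path is shown below. It does not reflect the real POS Model API: ValidatedModelSketch, its fields and its exception message are assumptions made for illustration, with the model reduced to a set of outcomes and the tag dictionary to a map from word to allowed tags.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model class; stands in for a model that carries a tag dictionary.
class ValidatedModelSketch {

    private final Set<String> outcomes;
    private final Map<String, Set<String>> tagDictionary;

    /**
     * Every way of creating the model goes through this constructor, so the
     * check runs for in-memory construction as well as for deserialization.
     */
    ValidatedModelSketch(Set<String> outcomes, Map<String, Set<String>> tagDictionary) {
        this.outcomes = new HashSet<>(outcomes);
        this.tagDictionary = new HashMap<>(tagDictionary);
        validate();
    }

    /** Rejects dictionaries that use tags the model can never predict. */
    private void validate() {
        for (Map.Entry<String, Set<String>> entry : tagDictionary.entrySet()) {
            for (String tag : entry.getValue()) {
                if (!outcomes.contains(tag)) {
                    throw new IllegalArgumentException("Tag dictionary entry '" + entry.getKey()
                        + "' uses tag '" + tag + "' which is unknown to the model");
                }
            }
        }
    }
}
```

Failing at construction time would surface a bad dictionary during training or cross validation rather than as a null result deep inside the search.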


[jira] [Commented] (OPENNLP-238) BestSequence method in BeamSearch can cause NullPointerException if it can not find a valid sequence

Posted by "Jörn Kottmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072294#comment-13072294 ] 

Jörn Kottmann commented on OPENNLP-238:
---------------------------------------

What is the reason for it to return no sequence? Is there no possible valid sequence? If so the tagdict might just be incorrect.

Or is there a possible sequence, but beam search doesn't find it?


[jira] [Commented] (OPENNLP-238) BestSequence method in BeamSearch can cause NullPointerException if it can not find a valid sequence

Posted by "Jörn Kottmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072278#comment-13072278 ] 

Jörn Kottmann commented on OPENNLP-238:
---------------------------------------

What do you mean by advancing invalid sequences? That might be an issue for the name finder, because there we really do not want a sequence like "other cont cont other".


[jira] [Commented] (OPENNLP-238) BestSequence method in BeamSearch can cause NullPointerException if it can not find a valid sequence

Posted by "William Colen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072806#comment-13072806 ] 

William Colen commented on OPENNLP-238:
---------------------------------------

I reverted my changes. I will investigate if the issue is caused by some specificity of the Portuguese data I have.


[jira] [Commented] (OPENNLP-238) BestSequence method in BeamSearch can cause NullPointerException if it can not find a valid sequence

Posted by "Jörn Kottmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072917#comment-13072917 ] 

Jörn Kottmann commented on OPENNLP-238:
---------------------------------------

I had a short look at the beam search code and believe the following happens:

- Your beam search size is smaller than your number of outcomes
- Beam search only advances the n best sequences
- If the n best sequences are all invalid, it does not advance anything (even though it could advance the best valid sequences)

I believe it should be changed, and should always advance the n best possible sequences.

Anyway, before we start changing things here we should demonstrate these issues in a solid unit test.
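
The difference between the two orderings described above can be shown with a small standalone sketch. It only models the hypothesis in this comment and is not taken from the beam search code: BeamOrderSketch and its methods are invented names, and a sequence is reduced to a single outcome per step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration of "cut then validate" versus "validate then cut".
class BeamOrderSketch {

    /** Returns the outcomes ordered from most to least probable. */
    static List<String> rank(String[] outcomes, double[] probs) {
        Integer[] idx = new Integer[outcomes.length];
        for (int i = 0; i < idx.length; i++) {
            idx[i] = i;
        }
        Arrays.sort(idx, (a, b) -> Double.compare(probs[b], probs[a]));
        List<String> ranked = new ArrayList<>();
        for (int i : idx) {
            ranked.add(outcomes[i]);
        }
        return ranked;
    }

    /** Suspected behaviour: keep the n best outcomes, then drop the invalid ones. */
    static List<String> cutThenValidate(String[] outcomes, double[] probs, int n,
                                        Predicate<String> valid) {
        List<String> ranked = rank(outcomes, probs);
        List<String> result = new ArrayList<>();
        for (String outcome : ranked.subList(0, Math.min(n, ranked.size()))) {
            if (valid.test(outcome)) {
                result.add(outcome);
            }
        }
        return result;
    }

    /** Suggested behaviour: drop the invalid outcomes, then keep the n best. */
    static List<String> validateThenCut(String[] outcomes, double[] probs, int n,
                                        Predicate<String> valid) {
        List<String> result = new ArrayList<>();
        for (String outcome : rank(outcomes, probs)) {
            if (result.size() < n && valid.test(outcome)) {
                result.add(outcome);
            }
        }
        return result;
    }
}
```

With outcomes n, adj and v-pcp at probabilities 0.6, 0.3 and 0.1, a beam size of 2 and a validator that accepts only v-pcp, the first method returns nothing while the second returns v-pcp. A unit test along these lines against the real class would show whether the suspected behaviour actually occurs.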


[jira] [Commented] (OPENNLP-238) BestSequence method in BeamSearch can cause NullPointerException if it can not find a valid sequence

Posted by "Jörn Kottmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072276#comment-13072276 ] 

Jörn Kottmann commented on OPENNLP-238:
---------------------------------------

Which exception do you get?

We have similar issues here and there. Usually bestSequence returns something, but in certain cases it might return null to indicate that no sequence could be found, which results in NPEs in various places in our code base.



[jira] [Commented] (OPENNLP-238) BestSequence method in BeamSearch can cause NullPointerException if it can not find a valid sequence

Posted by "William Colen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072292#comment-13072292 ] 

William Colen commented on OPENNLP-238:
---------------------------------------

Maybe we could add a 'weak' flag to the sequence validator. If the implementer of the validator sets the flag to true, the BeamSearch would know that the validator can be used to filter valid sequences, but if none was found it would let all pass. I don't know if we can add this flag now. I think we can't, because the validators implement an interface and changing it would break API compatibility.
Another option would be to override the bestSequence method only for the POS tagger, which doesn't have strong sequence constraints.
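
One conceivable way to get a 'weak' flag without touching the existing interface is an optional sub-interface, sketched below. Everything here is hypothetical: Validator is a simplified stand-in and not the actual sequence validator interface, and WeakValidator, isWeak and permitted are names made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of an opt-in 'weak' validator; not OpenNLP API.
class WeakValidatorSketch {

    /** Simplified stand-in for the existing validator interface, left unchanged. */
    interface Validator {
        boolean valid(List<String> previousTags, String outcome);
    }

    /** Optional extension; validators that do not implement it stay strict. */
    interface WeakValidator extends Validator {
        boolean isWeak();
    }

    /** Returns the accepted outcomes, or all outcomes if a weak validator accepts none. */
    static List<String> permitted(Validator validator, List<String> previousTags,
                                  List<String> outcomes) {
        List<String> kept = new ArrayList<>();
        for (String outcome : outcomes) {
            if (validator.valid(previousTags, outcome)) {
                kept.add(outcome);
            }
        }
        boolean weak = validator instanceof WeakValidator && ((WeakValidator) validator).isWeak();
        if (kept.isEmpty() && weak) {
            // Weak mode: the validator is a preference, not a hard constraint.
            return new ArrayList<>(outcomes);
        }
        return kept;
    }
}
```

Existing strict validators, such as one used for name finding, would keep their behaviour because they never implement the extra interface.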


[jira] [Closed] (OPENNLP-238) BestSequence method in BeamSearch can cause NullPointerException if it can not find a valid sequence

Posted by "Jörn Kottmann (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/OPENNLP-238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jörn Kottmann closed OPENNLP-238.
---------------------------------

       Resolution: Not A Problem
    Fix Version/s:     (was: tools-1.5.2-incubating)

It turned out that the described problem is not caused by beam search, but instead is a problem with the provided tag dictionary. 


[jira] [Commented] (OPENNLP-238) BestSequence method in BeamSearch can cause NullPointerException if it can not find a valid sequence

Posted by "William Colen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072966#comment-13072966 ] 

William Colen commented on OPENNLP-238:
---------------------------------------

I was using the standard sequence validator. But now I am using one with some hacks, for example to handle "n-adj" and tags with "/".

Here is an example I found while running the cross validator on the best corpus I have (Bosque, a newspaper-based, human-reviewed corpus of 4k sentences).

At some point we have the following sentence:

(...) rios, lagos, cachoeiras, montanhas (acompanhas da altitude), parques nacionais, reservas (...)
that can be translated to "rivers, lakes, waterfalls, mountains (altitude track), national parks, reserves".

The word "acompanhas" makes no sense here, although it is not misspelled. "acompanhas" is the verb "to follow" in the present tense, second person singular (v-fin=PR=2S=IND), but I think the right word here should be "acompanhadas", which is the same verb in the past participle (v-pcp=M=P).

The annotated sentence from corpus is:
(...) rios_n=M=P ,_, lagos_n=M=P ,_, cachoeiras_n=F=P ,_, montanhas_n=F=P (_( acompanhas_v-pcp=M=P de_prp a_art=F=S altitude_n=F=S )_) ,_, parques_n=M=P nacionais_adj=M=P ,_, reservas_n=F=P (...)

So the software that originally created the corpus or the person who reviewed it used the POS tag according to the context, not restricting it to the morphology of "acompanhas".

When OpenNLP runs on this phrase, it evaluates a huge list of outcomes, but none of them is "v-fin=PR=2S=IND" (I include all outcomes below). That makes sense, because we shouldn't have that tag in this context. Since the default sequence validator performs a dictionary lookup and the dictionary tag of "acompanhas" is not in the outcome list, it will not validate any outcome and the list ends up empty, causing an exception later.

--- outcomes
prop=F=S, n=F=S, v-pcp=F=S, adv, v-fin=PR=3S=IND, art=M=S, n=M=S, adj=M=S, :, v-ger, art=F=S, adj=F=S, ,, (, num=M=P, n=M=P, ), prp, art=M=P, prop=M=S, ., conj-s, pron-pers=M=3P=NOM, pron-pers=M=3P=ACC, v-fin=PR=3P=IND, conj-c, v-fin=PS/MQP=3P=IND, pron-indp=M=S, v-inf, «, », v-fin=PS=3S=IND, v-fin=FUT=3S=IND, n=F=P, adj=F=P, v-pcp=M=P, v-pcp=M=S, v-inf=3S, pron-det=M=S, v-fin=IMPF=3S=IND, ec, adj=M=P, pron-det=F=P, pron-indp=F=P, v-fin=IMPF=3P=IND, v-pcp=F=P, num=M=S, pron-indp=M/F=S, pron-pers=M=3S=NOM, --, pron-det=M=P, n-adj=M=P, v-fin=COND=3P, art=F=P, num=F=P, pron-indp=F=S, v-fin=PR=1S=IND, pron-pers=M/F=3S/P=ACC, v-fin=COND=3S, n-adj=M=S, n-adj=F=P, prop=M=P, pron-det=F=S, v-fin=PR=3S=SUBJ, pron-pers=M=3S=ACC, v-fin=IMPF=3S=SUBJ, num=F=S, conj-c=<co-postnom, pron-indp=M=P, v-fin=IMPF=3P=SUBJ, adj, pron-pers=M=3S/P=ACC, v-fin=PR=3P=SUBJ, v-fin=PS=1/3S=IND, pron-pers=F=3S=ACC, pron-pers=M=3S=NOM/PIV, pron-pers=M/F=1S=DAT, v-fin=PS=1S=IND, pron-pers=M=3S=DAT, v-pcp, v-fin=FUT=3P=IND, v-inf=3P, pron-pers=F=3S=NOM/PIV, ;, ', prop=F=P, v-fin=PS=1P=IND, art=N=S, ?, v-fin=PR=1P=IND, !, pron-pers=F=3S=NOM, pron-pers=M/F=3S=ACC, prp=N<ARG, v-fin=FUT=3S=SUBJ, pron-pers=M=1P=NOM, pron-pers=M/F=1P=NOM/PIV, v-fin=MQP=3S=IND, v-fin=PS=2S=IND, pron-pers=M=3P=NOM/PIV, P.vp, pron-pers=M=1S=DAT, pron-pers=M=1S=ACC, pron-pers=F=1S=ACC, adj=M/F=S, pron-pers=F=3P=ACC, v-fin=IMP=2S, intj, n=M/F=S, pron-pers=M/F=3S=NOM, v-fin=PR=1P=SUBJ, pron-pers=F=3P=NOM/PIV, v-fin=FUT=1P=IND, pron-pers=M/F=1P=ACC, prop=M/F=S, pron-pers=M/F=3S=NOM/PIV, v-fin=PR=1/3S=SUBJ, pron-pers=M/F=1S=NOM, v-fin=IMPF=1S=SUBJ, v-fin=IMPF=1S=IND, pron-pers=F=3P=NOM, ..., pron-pers=M=1S=NOM, pron-pers=F=3S=DAT, v-fin=FUT=1/3S=SUBJ, num=M/F=P, n-adj=F=S, n=M=R, conj-c=<co-prparg, pron-pers=M/F=1P=NOM, v-inf=M=S, v-inf=1P, v-fin=IMPF=1P=IND, -, pron-pers=M=3P=DAT, pron-pers=M/F=1S=ACC, pron-indp=M/F=S/P, v-fin=MQP=3P=IND, pron-pers=F=1S=DAT, pron-pers=F=1S=PIV, v-fin=PR=1S=SUBJ, /, v-fin=PR=2P=IND, 
pron-pers=M/F=2P=NOM, v-fin=COND=1S, pron-pers=F=1S=NOM, v-fin=FUT=3P=SUBJ, pron-indp=M=S/P, n=M/F=P, pron-pers=M=3S=PIV, v-fin=FUT=1S=IND, v-inf=1S, pron-pers=M/F=3S=DAT, v-fin=FUT=1P=SUBJ, pron-pers=M=1P=DAT, v-fin=MQP=1S=IND, v-ger=F=S, n=N=M/F=S, v-fin=IMP=3P, intj=PS=3S=IND, pron-indp=F=F, pron-pers=F=1P=NOM/PIV, pron-pers=M/F=1P=DAT, vp=V=PCP=F=S, n=S=S, v-fin=PR=3S, pron-pers=M=1S=PIV, pron-pers=M/F=3S/P=DAT, v-fin=PS=3P=IND, v-fin=PR=3S=IND=VFIN
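
The situation can be reproduced in isolation with a minimal sketch. The names here (DictValidatorSketch, validOutcomes) are hypothetical and are not the OpenNLP API; the check only approximates a dictionary-based validator, and it assumes that a word missing from the dictionary accepts any outcome.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DictValidatorSketch {

  // Hypothetical stand-in for a tag dictionary: word -> allowed tags.
  private final Map<String, List<String>> tagDict = new HashMap<>();

  void put(String word, String... tags) {
    tagDict.put(word, Arrays.asList(tags));
  }

  // Keeps only the outcomes the dictionary allows for the word.
  // Assumption: a word missing from the dictionary accepts every outcome.
  List<String> validOutcomes(String word, List<String> outcomes) {
    List<String> allowed = tagDict.get(word);
    if (allowed == null) {
      return new ArrayList<>(outcomes);
    }
    List<String> valid = new ArrayList<>();
    for (String outcome : outcomes) {
      if (allowed.contains(outcome)) {
        valid.add(outcome);
      }
    }
    return valid;
  }

  public static void main(String[] args) {
    DictValidatorSketch sketch = new DictValidatorSketch();
    sketch.put("acompanhas", "v-fin=PR=2S=IND");
    // A small excerpt of the outcome list; none of them is v-fin=PR=2S=IND.
    List<String> outcomes = Arrays.asList("v-pcp=M=P", "v-fin=PR=3S=IND", "n=F=P");
    System.out.println(sketch.validOutcomes("acompanhas", outcomes)); // prints []
  }
}
```

With an empty result for "acompanhas", no sequence can be extended at that token, which matches the empty list described in this comment.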


[jira] [Commented] (OPENNLP-238) BestSequence method in BeamSearch can cause NullPointerException if it can not find a valid sequence

Posted by "William Colen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13071742#comment-13071742 ] 

William Colen commented on OPENNLP-238:
---------------------------------------

Any comment on this? I tried it here and it looks ok, but I am not a specialist on the BeamSearch class.


[jira] [Commented] (OPENNLP-238) BestSequence method in BeamSearch can cause NullPointerException if it can not find a valid sequence

Posted by "William Colen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13073010#comment-13073010 ] 

William Colen commented on OPENNLP-238:
---------------------------------------

I don't know. I added a breakpoint to the method that validates the model, and it was never hit while running the cross validator. I'll investigate that and open a new Jira.

Also, I will open a Jira for a new tool to create POS tag dictionaries that optionally checks whether the tagset is valid, maybe by looking at the training corpus to extract the tagset using a cutoff.
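
One possible shape for such a tool, as a rough sketch only: it assumes the corpus is available as "word_tag" tokens separated by whitespace, and that the cutoff applies to how often a tag occurs in the whole corpus. The class and method names are hypothetical and nothing here is existing OpenNLP code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class TagDictBuilderSketch {

  // Builds word -> tags, keeping only tags seen at least `cutoff` times.
  static Map<String, Set<String>> build(String[] sentences, int cutoff) {
    // Pass 1: count how often each tag occurs in the corpus.
    Map<String, Integer> tagCounts = new HashMap<>();
    for (String sentence : sentences) {
      for (String token : sentence.trim().split("\\s+")) {
        // Split at the last underscore so that tokens like ",_," work.
        int split = token.lastIndexOf('_');
        if (split > 0) {
          tagCounts.merge(token.substring(split + 1), 1, Integer::sum);
        }
      }
    }
    // Pass 2: keep a word/tag pair only if the tag reached the cutoff.
    Map<String, Set<String>> dict = new HashMap<>();
    for (String sentence : sentences) {
      for (String token : sentence.trim().split("\\s+")) {
        int split = token.lastIndexOf('_');
        if (split <= 0) {
          continue; // not a word_tag token
        }
        String word = token.substring(0, split);
        String tag = token.substring(split + 1);
        if (tagCounts.get(tag) >= cutoff) {
          dict.computeIfAbsent(word, w -> new TreeSet<>()).add(tag);
        }
      }
    }
    return dict;
  }
}
```

A dictionary built this way can only contain tags that the training corpus actually uses, so it would not reference a tag the model has never seen as an outcome.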


[jira] [Commented] (OPENNLP-238) BestSequence method in BeamSearch can cause NullPointerException if it can not find a valid sequence

Posted by "Jörn Kottmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072986#comment-13072986 ] 

Jörn Kottmann commented on OPENNLP-238:
---------------------------------------

The model has an outcome list, which contains all outcomes observed in your training data. On each prediction it calculates the probability of each outcome. The tag dictionary then reduces the allowed outcomes to a smaller set. This makes things more accurate and speeds the whole process up.

When the POS tagger comes to "acompanhas", it should advance the existing sequences with one of the best predicted outcomes or, if that fails, just advance all valid sequences. For some reason the latter fails and it does not advance anything, right? That is strange and indicates that we have a bug somewhere.

I believe it should not be a problem with the tagdict itself, because it is validated when the POS Model is loaded. I am not sure what exactly is going wrong here.
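
The two steps described in this comment, plus the fallback proposed in the issue description, can be sketched for a single token as follows. This is not the BeamSearch source: the real class works on a heap of sequences, and the names used here (AdvanceSketch, advance, isValid) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Predicate;

final class AdvanceSketch {

  // Returns the outcome indexes chosen to extend a sequence at one token.
  static List<Integer> advance(double[] probs, int beamSize, Predicate<Integer> isValid) {
    // Sort outcome indexes by descending probability.
    List<Integer> byProb = new ArrayList<>();
    for (int i = 0; i < probs.length; i++) {
      byProb.add(i);
    }
    byProb.sort(Comparator.comparingDouble((Integer i) -> -probs[i]));
    int top = Math.min(beamSize, byProb.size());

    List<Integer> next = new ArrayList<>();
    // Pass 1: only the beamSize most probable outcomes, if valid.
    for (int k = 0; k < top; k++) {
      if (isValid.test(byProb.get(k))) {
        next.add(byProb.get(k));
      }
    }
    // Pass 2: nothing advanced, so try every valid outcome.
    if (next.isEmpty()) {
      for (int i : byProb) {
        if (isValid.test(i)) {
          next.add(i);
        }
      }
    }
    // Pass 3 (the proposal in this issue): still nothing, so ignore the
    // validator and keep the most probable outcomes.
    if (next.isEmpty()) {
      next.addAll(byProb.subList(0, top));
    }
    return next;
  }
}
```

Without the third pass, a validator that rejects every outcome leaves the result empty, which is the case reported for "acompanhas".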


[jira] [Commented] (OPENNLP-238) BestSequence method in BeamSearch can cause NullPointerException if it can not find a valid sequence

Posted by "William Colen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072973#comment-13072973 ] 

William Colen commented on OPENNLP-238:
---------------------------------------

Oops, I was confused about the meaning of the outcome list. Is it in fact all possible outcomes, or is it restricted to the context?
We don't have "v-fin=PR=2S=IND" in the context. Does that mean it never appeared in the corpus? That would make sense, since the corpus was extracted from newspapers.


[jira] [Commented] (OPENNLP-238) BestSequence method in BeamSearch can cause NullPointerException if it can not find a valid sequence

Posted by "Jörn Kottmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072818#comment-13072818 ] 

Jörn Kottmann commented on OPENNLP-238:
---------------------------------------

The Jira I meant is this one: OPENNLP-127

It is fixed, so it should not be possible to load an invalid tagdict with the current version.


[jira] [Commented] (OPENNLP-238) BestSequence method in BeamSearch can cause NullPointerException if it can not find a valid sequence

Posted by "Jörn Kottmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072937#comment-13072937 ] 

Jörn Kottmann commented on OPENNLP-238:
---------------------------------------

Oops, I should have read the method to the end: if it cannot advance at least one of the n best model outcomes, it simply advances all valid sequences. I am not sure why it is not working in your case.

Do you use the standard sequence validator for the pos tagger?


[jira] [Commented] (OPENNLP-238) BestSequence method in BeamSearch can cause NullPointerException if it can not find a valid sequence

Posted by "Jörn Kottmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072923#comment-13072923 ] 

Jörn Kottmann commented on OPENNLP-238:
---------------------------------------

So it seems that this is always an issue for the POS Tagger when the beam size is smaller than the number of outcomes.
