Posted to issues@opennlp.apache.org by "Joern Kottmann (JIRA)" <ji...@apache.org> on 2014/12/17 13:29:13 UTC

[jira] [Commented] (OPENNLP-738) AbstractDataIndexer#sortAndMerge sets up callers for a NullPointerException

    [ https://issues.apache.org/jira/browse/OPENNLP-738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14249781#comment-14249781 ] 

Joern Kottmann commented on OPENNLP-738:
----------------------------------------

Thanks for tracking down that bug! Many people run into it when they train with too little data. I am not sure which solution we should choose.

If a user trains with only 1 or 2 events, the produced result is probably meaningless, so he might be better off receiving an error. On the other hand, if he trains with 3 events the result is still meaningless, yet the error would go away.
The indexer can only detect some training data problems, e.g. only one outcome or not enough data. The amount of data needed to train a model also varies with the task the user is trying to perform. It would definitely be useful to have a system in place that could warn and inform a user about the training data he is using on a per-task level.
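
To make the idea of such a warning system concrete, here is a rough sketch of a per-task check. It is not existing OpenNLP code: the class name {{TrainingDataCheckSketch}}, the method {{check}} and the parameter {{minEventsForTask}} are all made up for illustration, and the threshold would have to be chosen per task.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper, not part of OpenNLP: collects warnings about training data. */
class TrainingDataCheckSketch {

    /**
     * @param outcomes         one outcome label per training event
     * @param minEventsForTask assumed task-specific minimum number of events
     * @return human-readable warnings; an empty list means nothing was detected
     */
    static List<String> check(List<String> outcomes, int minEventsForTask) {
        List<String> warnings = new ArrayList<>();

        // "Not enough data": compare against a threshold chosen per task.
        if (outcomes.size() < minEventsForTask) {
            warnings.add("Only " + outcomes.size() + " events; this task expects at least "
                    + minEventsForTask + ".");
        }

        // "Only one outcome": count the distinct outcome labels.
        Set<String> distinct = new HashSet<>(outcomes);
        if (distinct.size() < 2) {
            warnings.add("Training data has fewer than 2 distinct outcomes.");
        }
        return warnings;
    }
}
```

A trainer could log these warnings and still go on to train, which matches the idea of informing the user rather than failing.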

+1 to let the user train with 1 or 2 events and later add a system to warn and inform about the training data he is using.

We are currently building RCs for the 1.6.0 release. As part of that, the poms are set to the next version after 1.6.0. We use Subversion, where you can always give the revision number to point to a specific version.

Are there any objections to pulling this into 1.6.0?


> AbstractDataIndexer#sortAndMerge sets up callers for a NullPointerException
> ---------------------------------------------------------------------------
>
>                 Key: OPENNLP-738
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-738
>             Project: OpenNLP
>          Issue Type: Bug
>            Reporter: Chris Lewis
>         Attachments: AbstractDataIndexer.java-NPE.patch
>
>
> In its constructor, the {{OnePassDataIndexer}} calls {{sortAndMerge}} of its parent class, {{AbstractDataIndexer}} (source file {{opennlp-tools/src/main/java/opennlp/tools/ml/model/AbstractDataIndexer.java}}). A quick read through the source of these two classes shows that the member variable {{contexts}} is only initialized by this method; otherwise it remains {{null}}. Note that when {{sort}} is {{true}} (which it is as called) and there are fewer than two events, the method returns early, leaving {{contexts}} uninitialized. Note also that {{getContexts}} exposes this variable, and that {{GIS.trainModel}} delegates to the {{trainModel}} method of {{GISTrainer}}. Line 263 attempts to read {{contexts.length}}, but {{contexts}} will be {{null}} when the stream holds fewer than two events, which results in a {{NullPointerException}}.
> I'm not an expert in the algorithms relying on this code, but [some|http://comments.gmane.org/gmane.comp.apache.opennlp.user/564] [googling|http://blog.gmane.org/gmane.comp.apache.opennlp.user/month=20140501] shows a few incidents that lead back to this behavior, including at least the tickets OPENNLP-316 and OPENNLP-488. It may be the case that all uses of this code cannot possibly function correctly without >= 2 events, but I don't know that. As such, being the non-expert on the natural constraints of the inputs to {{sortAndMerge}}, I'd like to suggest 2 possible improvements: 1) default the {{contexts}} and other private arrays that are set in the >= 2 path of this code to non-null defaults or 2) throw an explicit {{IllegalArgumentException}} that states >= 2 events are required for the calculation.
> The latter is not as desirable as the former (for which I've attached a patch), but at least it provides a targeted, unambiguous reason for why an exception is being thrown.
> Also, I apologize for not specifying the version or component, as I'm not clear on how the project source is organized with respect to the published artifacts. This issue is present in trunk, whose parent pom claims a version of {{1.6.1-SNAPSHOT}}.
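
The failure mode and the two suggested fixes from the description can be sketched in isolation. This is a simplified stand-in, not the OpenNLP source and not the attached patch: {{IndexerSketch}} and its {{Fix}} switch are invented for illustration, and the real sorting and merging is replaced by a plain copy.

```java
import java.util.List;

/** Hypothetical stand-in for the indexer described above; not the OpenNLP source. */
class IndexerSketch {

    /** Which of the two suggested fixes to apply (NONE reproduces the problem). */
    enum Fix { NONE, EMPTY_DEFAULTS, FAIL_FAST }

    private int[][] contexts;   // stays null unless sortAndMerge assigns it
    private final Fix fix;

    IndexerSketch(List<int[]> events, Fix fix) {
        this.fix = fix;
        if (fix == Fix.EMPTY_DEFAULTS) {
            contexts = new int[0][];   // suggestion 1: non-null default
        }
        sortAndMerge(events, true);    // sort is true, as in the reported call
    }

    private void sortAndMerge(List<int[]> events, boolean sort) {
        if (sort && events.size() < 2) {
            if (fix == Fix.FAIL_FAST) {
                // suggestion 2: explicit, targeted error instead of a later NPE
                throw new IllegalArgumentException(
                        "At least 2 events are required, got " + events.size());
            }
            return;                    // early return: contexts is never assigned here
        }
        // Placeholder for the real sorting and merging: just copy the events.
        contexts = events.toArray(new int[0][]);
    }

    int[][] getContexts() {
        return contexts;
    }
}
```

With {{Fix.NONE}} and a single event, {{getContexts()}} returns {{null}}, so a caller reading {{getContexts().length}} fails with a {{NullPointerException}}. With {{Fix.EMPTY_DEFAULTS}} the same call sees an empty array, and with {{Fix.FAIL_FAST}} the constructor itself reports the problem.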



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)