Posted to dev@lucene.apache.org by "Michael Gibney (JIRA)" <ji...@apache.org> on 2018/12/15 06:51:00 UTC
[jira] [Commented] (LUCENE-8610) NPE in TermsHashPerField.add() for TokenStreams with lazily instantiated token Attributes

    [ https://issues.apache.org/jira/browse/LUCENE-8610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16722043#comment-16722043 ] 

Michael Gibney commented on LUCENE-8610:
----------------------------------------

Changed to "minor wish". This patch still might be a good idea, but I only hit the NPE in practice because I was using {{solr.PreAnalyzedField$PreAnalyzedTokenizer}} incorrectly. {{PreAnalyzedTokenizer}} (and its token Attributes) is not designed to be reused _at all_ at index time. The caching concerns only apply when a TokenStream is reused, so now that I've corrected my use of {{PreAnalyzedTokenizer}}, this patch could be viewed as a solution in search of a problem.

If this patch still has any merit, it would be because:
 1. there might be TokenStreams that lazily instantiate token Attributes and _are_ reused, or
 2. this change would be a prerequisite for potentially modifying {{PreAnalyzedTokenizer}} to enable reuse, thus avoiding creation of a {{PreAnalyzedTokenizer}} (and all associated token Attributes) for every field value.

I'm fine with just closing this issue, but again, it's a pretty minor change that won't hurt anything and could in some cases make indexing more robust. At the very least, it would be good to clarify whether it's acceptable for index-time TokenStreams to lazily instantiate token Attributes ...

> NPE in TermsHashPerField.add() for TokenStreams with lazily instantiated token Attributes
> -----------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8610
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8610
>             Project: Lucene - Core
>          Issue Type: Wish
>          Components: core/index
>    Affects Versions: 7.4, master (8.0)
>            Reporter: Michael Gibney
>            Priority: Minor
>         Attachments: LUCENE-8610.patch
>
>
> {{DefaultIndexingChain.invert(...)}} calls {{invertState.setAttributeSource(stream)}} before {{stream.incrementToken()}} is called.
> For instances of {{stream}} that lazily instantiate token attributes (e.g., as {{solr.PreAnalyzedField$PreAnalyzedTokenizer}} does upon the first call to {{incrementToken()}} that returns {{true}}), this can result in caching a {{null}} value in {{invertState.termAttribute}} for a given {{stream}} instance. 
> Subsequent calls that reuse the same {{stream}} instance (reusing {{TokenStreamComponents}}) for field values with at least 1 token will call {{termsHashPerField.start(...)}}, which sets {{termsHashPerField.termAtt}} from the {{null}} value cached in {{FieldInvertState.termAttribute}}. An NPE is thrown when {{termsHashPerField.add()}} reasonably but incorrectly assumes a non-null value for {{termAtt}}.
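
To make the sequence in the description easier to follow, here is a minimal toy model in plain Java. It does not use Lucene: {{LazyStream}}, {{InvertState}}, {{TermAttr}}, and the {{guard}} flag are all hypothetical stand-ins invented for illustration, and the guarded branch is just one conceivable null re-check, not necessarily what the attached patch does. The sketch assumes the first field value yields no tokens, so a {{null}} attribute is cached, and that the same stream instance is then reused for a value with one token.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: every class here is a hypothetical stand-in, not Lucene's API.
final class LazyAttributeDemo {

    /** Stand-in for a term attribute holding the current token text. */
    static final class TermAttr {
        String term;
    }

    /** Stand-in for a reusable TokenStream that creates its attribute lazily. */
    static final class LazyStream {
        private final Map<Class<?>, Object> attributes = new HashMap<>();
        private String[] tokens = new String[0];
        private int pos;

        /** Prepares the same instance for a new field value (reuse). */
        void reset(String... newTokens) {
            tokens = newTokens;
            pos = 0;
        }

        /** Returns the attribute if it exists yet, otherwise null. */
        TermAttr getTermAttr() {
            return (TermAttr) attributes.get(TermAttr.class);
        }

        boolean incrementToken() {
            if (pos >= tokens.length) {
                return false;
            }
            // Lazy instantiation: the attribute first appears here.
            TermAttr att = (TermAttr) attributes.computeIfAbsent(TermAttr.class, k -> new TermAttr());
            att.term = tokens[pos++];
            return true;
        }
    }

    /** Stand-in for the per-field state that caches attribute lookups. */
    static final class InvertState {
        private LazyStream source;
        TermAttr termAttribute;

        void setAttributeSource(LazyStream stream) {
            if (source != stream) { // look up only when the instance changes
                source = stream;
                termAttribute = stream.getTermAttr(); // null if not created yet
            }
        }
    }

    /** Consumes one field value; returns the number of tokens seen. */
    static int invert(InvertState state, LazyStream stream, boolean guard) {
        state.setAttributeSource(stream); // runs before incrementToken()
        int count = 0;
        while (stream.incrementToken()) {
            if (guard && state.termAttribute == null) {
                state.termAttribute = stream.getTermAttr(); // hypothetical re-check
            }
            state.termAttribute.term.length(); // NPE if null was cached
            count++;
        }
        return count;
    }

    /** Empty value first, then reuse with one token; true if an NPE occurs. */
    static boolean throwsOnReuse(boolean guard) {
        InvertState state = new InvertState();
        LazyStream stream = new LazyStream();
        stream.reset();            // first value: no tokens, null gets cached
        invert(state, stream, guard);
        stream.reset("lucene");    // same instance, now with a token
        try {
            invert(state, stream, guard);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("unguarded NPE: " + throwsOnReuse(false));
        System.out.println("guarded NPE: " + throwsOnReuse(true));
    }
}
```

In this toy model, {{throwsOnReuse(false)}} returns {{true}} because the cached {{null}} is never refreshed for the reused instance, while {{throwsOnReuse(true)}} returns {{false}}. Creating the attribute eagerly in the stream's constructor would avoid the problem as well, which is the question the comment above raises about lazy instantiation.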


