Posted to dev@lucene.apache.org by "Michael Gibney (JIRA)" <ji...@apache.org> on 2018/12/15 06:51:00 UTC
[jira] [Commented] (LUCENE-8610) NPE in TermsHashPerField.add() for
TokenStreams with lazily instantiated token Attributes
[ https://issues.apache.org/jira/browse/LUCENE-8610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16722043#comment-16722043 ]
Michael Gibney commented on LUCENE-8610:
----------------------------------------
Changed to "minor wish". This patch still might be a good idea, but I only hit the NPE in practice because I was using {{solr.PreAnalyzedField$PreAnalyzedTokenizer}} incorrectly. {{PreAnalyzedTokenizer}} (and its token Attributes) is not designed to be reused _at all_ at index time. The caching problem only arises when a TokenStream is reused, so now that I've corrected my use of {{PreAnalyzedTokenizer}}, this patch could be viewed as a solution in search of a problem.
If this patch still has any merit, it would be because:
1. there might be TokenStreams that lazily instantiate token Attributes and _are_ reused, or
2. this change would be a prerequisite for potentially modifying {{PreAnalyzedTokenizer}} to enable reuse, thus avoiding creation of a {{PreAnalyzedTokenizer}} (and all associated token Attributes) for every field value.
I'm fine with just closing this issue, but again, it's a pretty minor change that won't hurt anything and could in some cases make indexing more robust. At the very least, it would be good to clarify whether it's acceptable for index-time TokenStreams to lazily instantiate token Attributes ...
> NPE in TermsHashPerField.add() for TokenStreams with lazily instantiated token Attributes
> -----------------------------------------------------------------------------------------
>
> Key: LUCENE-8610
> URL: https://issues.apache.org/jira/browse/LUCENE-8610
> Project: Lucene - Core
> Issue Type: Wish
> Components: core/index
> Affects Versions: 7.4, master (8.0)
> Reporter: Michael Gibney
> Priority: Minor
> Attachments: LUCENE-8610.patch
>
>
> {{DefaultIndexingChain.invert(...)}} calls {{invertState.setAttributeSource(stream)}} before {{stream.incrementToken()}} is called.
> For instances of {{stream}} that lazily instantiate token attributes (e.g., as {{solr.PreAnalyzedField$PreAnalyzedTokenizer}} does upon the first call to {{incrementToken()}} that returns {{true}}), this can result in caching a {{null}} value in {{invertState.termAttribute}} for a given {{stream}} instance.
> Subsequent calls that reuse the same {{stream}} instance (reusing {{TokenStreamComponents}}) for field values with at least 1 token will call {{termsHashPerField.start(...)}}, which sets {{termsHashPerField.termAtt}} from the {{null}} value cached in {{FieldInvertState.termAttribute}}. An NPE is thrown when {{termsHashPerField.add()}} reasonably but incorrectly assumes a non-null value for {{termAtt}}.
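
The failure mode described above can be sketched in isolation. The following is a simplified model, not Lucene code: {{TermAttr}}, {{LazyStream}}, {{CachingConsumer}} and {{LazyAttributeDemo}} are hypothetical stand-ins for a token attribute, a lazily instantiating TokenStream, and the per-stream attribute caching done by the indexing chain. It assumes the consumer only looks up the attribute when the stream instance changes, and that the first field value produces no tokens.

```java
// Hypothetical stand-in for a token attribute (not Lucene's CharTermAttribute).
final class TermAttr {
    String term = "";
}

// Hypothetical stream that creates its attribute lazily, on the first
// incrementToken() call that returns true.
final class LazyStream {
    private TermAttr attr;                  // stays null until a token is produced
    private String[] tokens = new String[0];
    private int pos;

    void setInput(String... tokens) {       // models reuse for a new field value
        this.tokens = tokens;
        this.pos = 0;
    }

    TermAttr getAttribute() {
        return attr;                        // may be null before the first token
    }

    boolean incrementToken() {
        if (pos >= tokens.length) {
            return false;
        }
        if (attr == null) {
            attr = new TermAttr();          // lazy instantiation
        }
        attr.term = tokens[pos++];
        return true;
    }
}

// Hypothetical consumer that caches the attribute per stream instance.
final class CachingConsumer {
    private LazyStream lastStream;
    private TermAttr termAtt;

    void setAttributeSource(LazyStream stream) {
        if (stream != lastStream) {         // lookup only when the instance changes
            lastStream = stream;
            termAtt = stream.getAttribute();
        }
    }

    int consume(LazyStream stream) {
        setAttributeSource(stream);         // runs before any incrementToken()
        int totalChars = 0;
        while (stream.incrementToken()) {
            totalChars += termAtt.term.length();  // NPE if null was cached
        }
        return totalChars;
    }
}

final class LazyAttributeDemo {
    public static void main(String[] args) {
        LazyStream stream = new LazyStream();
        CachingConsumer consumer = new CachingConsumer();

        stream.setInput();                  // first value: no tokens, null gets cached
        consumer.consume(stream);

        stream.setInput("foo");             // same instance reused, cache not refreshed
        boolean npe = false;
        try {
            consumer.consume(stream);
        } catch (NullPointerException e) {
            npe = true;
        }
        System.out.println("NPE on reuse: " + npe);
    }
}
```

In this model the NPE disappears if the stream creates its attribute eagerly (for example in its constructor), or if the consumer looks the attribute up again after tokens are available. Whether the attached patch takes either approach is not stated here.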