You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by "Michael McCandless (Jira)" <ji...@apache.org> on 2021/01/16 12:15:00 UTC
[jira] [Resolved] (LUCENE-8947) Indexing fails with "too many tokens for field" when using custom term frequencies

     [ https://issues.apache.org/jira/browse/LUCENE-8947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-8947.
----------------------------------------
    Resolution: Won't Fix

It turns out we cannot find a safe way to fix this, so, users must not try to write too many, too large custom term frequencies such that their sum overflows a Java `int` in a single field / document.

> Indexing fails with "too many tokens for field" when using custom term frequencies
> ----------------------------------------------------------------------------------
>
>                 Key: LUCENE-8947
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8947
>             Project: Lucene - Core
>          Issue Type: Improvement
>    Affects Versions: 7.5
>            Reporter: Michael McCandless
>            Priority: Major
>          Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> We are using custom term frequencies (LUCENE-7854) to index per-token scoring signals, however for one document that had many tokens and those tokens had fairly large (~998,000) scoring signals, we hit this exception:
> {noformat}
> 2019-08-05T21:32:37,048 [ERROR] (LuceneIndexing-3-thread-3) com.amazon.lucene.index.IndexGCRDocument: Failed to index doc: java.lang.IllegalArgumentException: too many tokens for field "foobar"
> at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:825)
> at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430)
> at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:394)
> at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:297)
> at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:450)
> at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1291)
> at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1264)
> {noformat}
> This is happening in this code in {{DefaultIndexingChain.java}}:
> {noformat}
>   try {
>     invertState.length = Math.addExact(invertState.length, invertState.termFreqAttribute.getTermFrequency());
>   } catch (ArithmeticException ae) {
>     throw new IllegalArgumentException("too many tokens for field \"" + field.name() + "\"");
>   }{noformat}
> Where Lucene is accumulating the total length (number of tokens) for the field.  But total length doesn't really make sense if you are using custom term frequencies to hold arbitrary scoring signals?  Or, maybe it does make sense, if user is using this as simple boosting, but maybe we should allow this length to be a {{long}}?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org