Posted to issues@lucene.apache.org by "Ilan Ginzburg (Jira)" <ji...@apache.org> on 2019/11/08 22:48:00 UTC

[jira] [Comment Edited] (LUCENE-9037) ArrayIndexOutOfBoundsException due to repeated IOException during indexing

    [ https://issues.apache.org/jira/browse/LUCENE-9037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16970622#comment-16970622 ] 

Ilan Ginzburg edited comment on LUCENE-9037 at 11/8/19 10:47 PM:
-----------------------------------------------------------------

Thanks [~mikemccand].

What about moving up the call to {{DocumentsWriterFlushControl.doAfterDocument()}} into the {{finally}} of the block calling {{DocumentsWriterPerThread.updateDocument/s()}} in {{DocumentsWriter.updateDocument/s()}}?
 Basically, consider {{DocumentsWriterFlushControl.doAfterDocument()}} as a "do after _successful or failed_ document".
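
A minimal sketch of that restructuring, assuming the 7.x shape of {{DocumentsWriter.updateDocument()}} (names simplified; not the actual patch):

{code:java}
// Hedged sketch, not the actual change: field and method names here are
// assumptions based on the 7.x sources. The point is that the "after
// document" accounting runs whether indexing succeeded or failed, so a
// repeatedly failing doc can still trigger a flush/reset of the DWPT.
try {
  dwpt.updateDocument(doc, analyzer, delNode); // may throw IOException
} finally {
  flushingDWPT = flushControl.doAfterDocument(perThread, isUpdate);
}
{code}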

Exploring that path to see if I can make it work (and existing tests pass).

Your suggestion of throwing a meaningful exception upon reaching the limit would not help my use case if there's no flush happening as a consequence.


was (Author: murblanc):
Thanks [~mikemccand].

What about moving up the call to {{DocumentsWriterFlushControl.doAfterDocument()}} into the {{finally}} of the block calling {{DocumentsWriterPerThread.updateDocument/s()}} in {{DocumentsWriter.updateDocument/s()}}?
Basically, consider {{DocumentsWriterFlushControl.doAfterDocument()}} as a "do after _successful or failed_ document".

Exploring that path to see if I can make it work (and existing tests pass).

> ArrayIndexOutOfBoundsException due to repeated IOException during indexing
> --------------------------------------------------------------------------
>
>                 Key: LUCENE-9037
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9037
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/index
>    Affects Versions: 7.1
>            Reporter: Ilan Ginzburg
>            Priority: Minor
>         Attachments: TestIndexWriterTermsHashOverflow.java
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> There is a limit to the number of tokens Lucene can hold in memory when docs are indexed using DocumentsWriter; once that limit is exceeded, bad things happen. The limit can be reached by submitting a really large document, by submitting a large number of documents without doing a commit (see LUCENE-8118), or by repeatedly submitting documents that fail to get indexed in some specific ways, leading to Lucene not cleaning up the in-memory data structures that eventually overflow.
> The overflow is due to a 32-bit (signed) integer wrapping around into negative territory, which then causes an ArrayIndexOutOfBoundsException.
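> A standalone illustration of the mechanism (not Lucene code):
> {code:java}
> // An int offset that keeps growing eventually wraps around to a negative
> // value; using it as an array index then throws.
> public class OverflowDemo {
>   public static void main(String[] args) {
>     int offset = Integer.MAX_VALUE;
>     offset += 1;                // wraps to Integer.MIN_VALUE (-2147483648)
>     byte[] pool = new byte[16];
>     byte b = pool[offset];      // java.lang.ArrayIndexOutOfBoundsException
>   }
> }
> {code}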
> The failure path that we are reliably hitting is due to an IOException during doc tokenization. A tokenizer implementing TokenStream throws an exception from incrementToken(), which causes indexing of that doc to fail.
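> A hedged sketch of such a failing tokenizer (an illustration, not the attached TestIndexWriterTermsHashOverflow.java):
> {code:java}
> import java.io.IOException;
> import org.apache.lucene.analysis.TokenFilter;
> import org.apache.lucene.analysis.TokenStream;
>
> // Illustration only: a filter whose incrementToken() always throws,
> // reproducing the "IOException during doc tokenization" failure path.
> final class FailingTokenFilter extends TokenFilter {
>   FailingTokenFilter(TokenStream in) {
>     super(in);
>   }
>
>   @Override
>   public boolean incrementToken() throws IOException {
>     throw new IOException("simulated tokenization failure");
>   }
> }
> {code}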
> The IOException bubbles back up to DocumentsWriter.updateDocument() (or DocumentsWriter.updateDocuments() in some other cases), where it is not treated as an AbortingException and therefore does not cause a reset of the DocumentsWriterPerThread. On repeated failures (without any successful indexing in between), e.g. when the upper layer (a client via Solr) keeps resubmitting a doc that fails every time, the TermsHashPerField data structures of the DocumentsWriterPerThread eventually grow and overflow, leading to an exception stack similar to the one in LUCENE-8118 (stack trace below copied from a test run repro on 7.1):
> java.lang.ArrayIndexOutOfBoundsException: -65536
>  at __randomizedtesting.SeedInfo.seed([394FAB2B91B1D90A:C86FB3F3CE001AA8]:0)
>  at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:198)
>  at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:221)
>  at org.apache.lucene.index.FreqProxTermsWriterPerField.writeProx(FreqProxTermsWriterPerField.java:80)
>  at org.apache.lucene.index.FreqProxTermsWriterPerField.addTerm(FreqProxTermsWriterPerField.java:171)
>  at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:185)
>  at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:792)
>  at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430)
>  at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:392)
>  at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:239)
>  at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:481)
>  at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1717)
>  at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1462)
> Using tokens composed only of lowercase letters, it takes less than 130,000,000 different tokens (the shortest ones) to overflow TermsHashPerField.
> Using a single document (composed of the 20,000 shortest lowercase tokens) submitted repeatedly for indexing, it takes 6352 submissions, all failing with an IOException on incrementToken(), to trigger the ArrayIndexOutOfBoundsException (6352 x 20,000 = 127,040,000 tokens, consistent with the figure above).
> A proposed fix is to treat an IOException in DocumentsWriter.updateDocument() and DocumentsWriter.updateDocuments() the same way we treat an AbortingException.
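> A hedged sketch of that fix, with assumed/simplified names (not an actual patch):
> {code:java}
> // Illustration only: widen the existing abort handling so a plain
> // IOException also resets the DocumentsWriterPerThread, discarding its
> // TermsHash buffers instead of letting them grow across failed docs.
> try {
>   dwpt.updateDocument(doc, analyzer, delNode);
> } catch (AbortingException | IOException e) {
>   abortDocumentsWriterPerThread(dwpt); // assumed helper: reset per-thread state
>   throw e;
> }
> {code}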


