Posted to java-user@lucene.apache.org by Trejkaz <tr...@trypticon.org> on 2014/10/14 07:29:32 UTC

Re: ArrayIndexOutOfBoundsException: -65536

Bit of thread necromancy here, but I figured it was relevant because
we get exactly the same error.

On Thu, Jan 19, 2012 at 12:47 AM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> Hmm, are you certain your RAM buffer is 3 MB?
>
> Is it possible you are indexing an absurdly enormous document...?

We're seeing a case here where the document certainly could qualify as
"absurdly enormous". The doc itself is 2GB in size and the
tokenisation is per-character, not per-word, so the number of
generated terms must be enormous. Probably enough to fill 2GB...

So I'm wondering if there is more info somewhere on why this is (or
was? We're still on 3.6.x) a limit, and whether it can be detected
up-front. A large amount of indexing time (~30 minutes) could be
avoided if we could detect ahead of time that indexing would fail.
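To make that concrete, something like this hypothetical helper is
what we'd want to run before indexing (a sketch only; the ~2GB
ceiling and the helper name are our guesses based on this thread,
not anything documented by Lucene):

    import java.io.File;

    // Hypothetical pre-flight check: the ~2GB ceiling is a guess
    // based on this thread, not a documented Lucene constant.
    public class SizeGate {
        static final long ASSUMED_MAX_DOC_BYTES = Integer.MAX_VALUE; // ~2GB

        static boolean likelyToExceedTermLimit(File doc) {
            // With per-character tokenisation, the token count is roughly
            // the byte count, so file size is a usable proxy.
            return doc.length() >= ASSUMED_MAX_DOC_BYTES;
        }
    }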

TX



Re: ArrayIndexOutOfBoundsException: -65536

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Tue, Oct 14, 2014 at 1:29 AM, Trejkaz <tr...@trypticon.org> wrote:

> Bit of thread necromancy here, but I figured it was relevant because
> we get exactly the same error.

Wow, blast from the past ...

>> Is it possible you are indexing an absurdly enormous document...?
>
> We're seeing a case here where the document certainly could qualify as
> "absurdly enormous". The doc itself is 2GB in size and the
> tokenisation is per-character, not per-word, so the number of
> generated terms must be enormous. Probably enough to fill 2GB...
>
> So I'm wondering if there is more info somewhere on why this is (or
> was? We're still on 3.6.x) a limit, and whether it can be detected
> up-front. A large amount of indexing time (~30 minutes) could be
> avoided if we could detect ahead of time that indexing would fail.

The limit is still there; it's because Lucene uses an int internally
to address its memory buffer.
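
A toy illustration (not Lucene's actual buffer code, just signed
32-bit overflow in isolation) of how an offset pushed past
Integer.MAX_VALUE surfaces as a negative array index:

    public class OverflowDemo {
        public static void main(String[] args) {
            int offset = Integer.MAX_VALUE; // 2147483647, the int ceiling
            offset += 1;                    // wraps to -2147483648
            byte[] buffer = new byte[16];
            // Throws ArrayIndexOutOfBoundsException with the negative
            // index in the exception message.
            System.out.println(buffer[offset]);
        }
    }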

It's probably easiest to set a limit on the maximum doc size you will
index?  Or, use LimitTokenCountFilter (available in newer releases) to
index only the first N tokens...
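
For example, a sketch against Lucene 4.x, where the convenience
wrapper LimitTokenCountAnalyzer (built on LimitTokenCountFilter)
lives in org.apache.lucene.analysis.miscellaneous; adjust the
Version constant to your release:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.miscellaneous.LimitTokenCountAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.util.Version;

    public class CappedIndexing {
        public static IndexWriterConfig cappedConfig() {
            // Wrap whatever analyzer you already use so each field
            // stops after the first N tokens instead of overflowing
            // the int-addressed buffer.
            Analyzer base = new StandardAnalyzer(Version.LUCENE_47);
            Analyzer capped = new LimitTokenCountAnalyzer(base, 1000000);
            return new IndexWriterConfig(Version.LUCENE_47, capped);
        }
    }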

Mike McCandless

http://blog.mikemccandless.com
