Posted to java-user@lucene.apache.org by John Cecere <jo...@oracle.com> on 2014/02/14 19:36:49 UTC

IndexWriter croaks on large file

I'm having a problem with Lucene 4.5.1. Whenever I attempt to index a file > 2GB in size, it dies with the following exception:

java.lang.IllegalArgumentException: startOffset must be non-negative, and endOffset must be >= startOffset, 
startOffset=-2147483648,endOffset=-2147483647

Essentially, I'm doing this:

Directory directory = new MMapDirectory(indexPath);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_45);
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_45, analyzer);
IndexWriter iw = new IndexWriter(directory, iwc);

InputStream is = <my input stream>;
InputStreamReader reader = new InputStreamReader(is);

Document doc = new Document();
doc.add(new StoredField("fileid", fileid));
doc.add(new StoredField("pathname", pathname));
doc.add(new TextField("content", reader));

iw.addDocument(doc);

It's the IndexWriter addDocument method that throws the exception. In looking at the Lucene source code, it appears that the offsets 
being used internally are ints, which makes it somewhat obvious why this is happening: startOffset=-2147483648 is exactly 
Integer.MIN_VALUE, i.e. the character offset wrapped around once it passed 2,147,483,647.

This issue never happened when I used Lucene 3.6.0. 3.6.0 was perfectly capable of handling a file over 2GB in this manner. What has 
changed, and how do I get around this? Is Lucene no longer capable of handling files this large, or is there some other way I should 
be doing this?

Here's the full stack trace sans my code:

java.lang.IllegalArgumentException: startOffset must be non-negative, and endOffset must be >= startOffset, 
startOffset=-2147483648,endOffset=-2147483647
	at org.apache.lucene.analysis.tokenattributes.OffsetAttributeImpl.setOffset(OffsetAttributeImpl.java:45)
	at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:183)
	at org.apache.lucene.analysis.standard.StandardFilter.incrementToken(StandardFilter.java:49)
	at org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
	at org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:82)
	at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:174)
	at org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:248)
	at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:254)
	at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:446)
	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1551)
	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1221)
	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1202)

Thanks,
John

-- 
John Cecere
Principal Engineer - Oracle Corporation
732-987-4317 / john.cecere@oracle.com



Re: IndexWriter croaks on large file

Posted by Glen Newton <gl...@gmail.com>.
You should consider making each _line_ of the log file a (Lucene)
document (assuming it is a log-per-line log file)
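
E.g. something along these lines -- a rough, untested sketch, not code from this thread. It reuses the IndexWriter, fileid,
pathname and InputStream from your snippet, and the "line" / "linenumber" field names and the indexLogFile helper are just
made up for illustration:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

static void indexLogFile(IndexWriter iw, String fileid, String pathname, InputStream is) throws IOException {
    BufferedReader br = new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8));
    String line;
    long lineNumber = 0;
    while ((line = br.readLine()) != null) {
        lineNumber++;
        Document doc = new Document();
        doc.add(new StoredField("fileid", fileid));                         // same file-level metadata on every line
        doc.add(new StoredField("pathname", pathname));
        doc.add(new LongField("linenumber", lineNumber, Field.Store.YES));  // lets a hit be mapped back to its line
        doc.add(new TextField("line", line, Field.Store.NO));               // the indexed text is now one line, not 2GB
        iw.addDocument(doc);                                                // one Lucene document per log line
    }
}

That way no single document comes anywhere near the 2GB offset limit, and as a bonus a search hit tells you which line
matched rather than just which file.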

-Glen

On Fri, Feb 14, 2014 at 4:12 PM, John Cecere <jo...@oracle.com> wrote:
> I'm not sure in today's world I would call 2GB 'immense' or 'enormous'. At
> any rate, I don't have control over the size of the documents that go into
> my database. Sometimes my customer's log files end up really big. I'm
> willing to have huge indexes for these things.
>
> Wouldn't just changing from int to long for the offsets solve the problem?
> I'm sure it would probably have to be changed in a lot of places, but why
> impose such a limitation? Especially since it's using an InputStream and
> only dealing with a block of data at a time.
>
> I'll take a look at your suggestion.
>
> Thanks,
> John
>


Re: IndexWriter croaks on large file

Posted by John Cecere <jo...@oracle.com>.
I'm not sure in today's world I would call 2GB 'immense' or 'enormous'. At any rate, I don't have control over the size of the 
documents that go into my database. Sometimes my customer's log files end up really big. I'm willing to have huge indexes for these 
things.

Wouldn't just changing from int to long for the offsets solve the problem? I'm sure it would probably have to be changed in a lot 
of places, but why impose such a limitation? Especially since it's using an InputStream and only dealing with a block of data at a 
time.

I'll take a look at your suggestion.

Thanks,
John


On 2/14/14 3:20 PM, Michael McCandless wrote:
> Hmm, why are you indexing such immense documents?
>
> In 3.x Lucene never sanity checked the offsets, so we would silently
> index negative (int overflow'd) offsets into e.g. term vectors.
>
> But in 4.x, we now detect this and throw the exception you're seeing,
> because it can lead to index corruption when you index the offsets
> into the postings.
>
> If you really must index such enormous documents, maybe you could
> create a custom tokenizer (derived from StandardTokenizer) that
> "fixes" the offsets before setting them? Or maybe just doesn't even
> set them.
>
> Note that position can also overflow if your documents get too large.
>
>
>
> Mike McCandless
>
> http://blog.mikemccandless.com

-- 
John Cecere
Principal Engineer - Oracle Corporation
732-987-4317 / john.cecere@oracle.com



Re: IndexWriter croaks on large file

Posted by Michael McCandless <lu...@mikemccandless.com>.
Hmm, why are you indexing such immense documents?

In 3.x Lucene never sanity checked the offsets, so we would silently
index negative (int overflow'd) offsets into e.g. term vectors.

But in 4.x, we now detect this and throw the exception you're seeing,
because it can lead to index corruption when you index the offsets
into the postings.

If you really must index such enormous documents, maybe you could
create a custom tokenizer (derived from StandardTokenizer) that
"fixes" the offsets before setting them? Or maybe just doesn't even
set them.

Note that position can also overflow if your documents get too large.
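
One way that could look -- a rough, untested sketch only, and ClampingOffsetAttributeImpl is a made-up name. Instead of
subclassing StandardTokenizer itself, you can hand it an AttributeFactory that supplies an OffsetAttribute implementation
whose setOffset clamps overflowed values instead of throwing. The clamped offsets are meaningless, so this only makes
sense if nothing downstream (highlighting, term vectors with offsets) relies on them:

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttributeImpl;
import org.apache.lucene.util.Attribute;
import org.apache.lucene.util.AttributeImpl;
import org.apache.lucene.util.AttributeSource;
import org.apache.lucene.util.Version;

// Made-up class: an OffsetAttribute whose setter clamps instead of throwing when the
// character offset has wrapped past Integer.MAX_VALUE.
public class ClampingOffsetAttributeImpl extends OffsetAttributeImpl {
    @Override
    public void setOffset(int startOffset, int endOffset) {
        if (startOffset < 0 || endOffset < startOffset) {
            // overflowed: pin both to a legal (but meaningless) value
            startOffset = Integer.MAX_VALUE;
            endOffset = Integer.MAX_VALUE;
        }
        super.setOffset(startOffset, endOffset);
    }
}

// Factory that hands out the clamping impl whenever an OffsetAttribute is requested.
final AttributeSource.AttributeFactory factory = new AttributeSource.AttributeFactory() {
    @Override
    public AttributeImpl createAttributeInstance(Class<? extends Attribute> attClass) {
        if (attClass == OffsetAttribute.class) {
            return new ClampingOffsetAttributeImpl();
        }
        return AttributeSource.AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY.createAttributeInstance(attClass);
    }
};

// Analyzer that mirrors StandardAnalyzer's chain but builds StandardTokenizer with that factory.
Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new StandardTokenizer(Version.LUCENE_45, factory, reader);
        TokenStream result = new StandardFilter(Version.LUCENE_45, source);
        result = new LowerCaseFilter(Version.LUCENE_45, result);
        result = new StopFilter(Version.LUCENE_45, result, StandardAnalyzer.STOP_WORDS_SET);
        return new TokenStreamComponents(source, result);
    }
};

Since TextField doesn't index offsets or term vectors by default, the clamped values never actually make it into the
index; they exist only to get past the check.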



Mike McCandless

http://blog.mikemccandless.com

