Posted to java-user@lucene.apache.org by George Washington <gw...@hotmail.com> on 2006/01/20 05:52:20 UTC

Storing large text or binary source documents in the index and memory usage

I would like to store large source documents (>10MB) in the index in their 
original form, i.e. as text for text documents or as byte[] for binary 
documents.
I have no difficulty adding the source document as a field to the Lucene 
index document, but when I write the document to the index I consistently 
get out-of-memory errors for documents larger than approximately 9MB.
Is there a formula that can help calculate the maximum size of a document 
that can be added to the index?
Is there an alternative way to store such large documents that you can 
suggest?
I have 512MB memory under WinXP. Increasing the VM heap size does not help.
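
For reference, the indexing code is essentially the following (a minimal 
sketch assuming the 1.9-dev Field API; the path, field name, and analyzer 
are illustrative):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;

  public class StoreLargeDoc {
    public static void main(String[] args) throws Exception {
      // Create (or overwrite) an index on disk.
      IndexWriter writer =
          new IndexWriter("C:/index", new StandardAnalyzer(), true);

      // Stand-in for a >10MB text document.
      char[] buf = new char[10 * 1024 * 1024];
      java.util.Arrays.fill(buf, 'x');

      Document doc = new Document();
      // Stored but not indexed, so the original form is kept verbatim.
      doc.add(new Field("source", new String(buf),
                        Field.Store.YES, Field.Index.NO));
      // A binary document would be added as a byte[] instead:
      // doc.add(new Field("source", bytes, Field.Store.YES));

      writer.addDocument(doc);   // the OutOfMemoryError happens here
      writer.close();
    }
  }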
Many thanks



RE: Storing large text or binary source documents in the index and memory usage

Posted by George Washington <gw...@hotmail.com>.
Thank you Daniel, but the best I get from setMaxBufferedDocs(1) is an OOM 
error after five iterations of 10MB each in the JUnit test provided by 
Chris, running inside Eclipse 3.1.
I had already tried setMaxBufferedDocs(2) without success before my 
original post.
I also tried:
  writer.setMaxBufferedDocs(1);
  writer.setMergeFactor(2);
but with the same result.
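
For completeness, the whole setup under test looks roughly like this (a 
sketch; the path and analyzer are stand-ins, not my exact configuration):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;

  public class WriterSetup {
    public static void main(String[] args) throws Exception {
      IndexWriter writer =
          new IndexWriter("C:/index", new StandardAnalyzer(), true);
      writer.setMaxBufferedDocs(1);  // flush each document to disk at once
      writer.setMergeFactor(2);      // merge segments as eagerly as possible
      // ... add one ~10MB document per iteration, as in Chris's test ...
      writer.close();
    }
  }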

My heap size is set on the Eclipse command line with:
C:\eclipse\eclipse.exe -vmargs -Xmx350M
Increasing the size does not help.
If you actually tried Chris's test program and got it to run successfully, 
there must be something wrong in my Eclipse config. I cannot think of any 
other possible cause for the difference between your results and mine.
But I am open to all suggestions.
Thanks

_________________________________________________________________
ASUS M5 Ultra-slim lightweight is Now $1999 (was $2,999)  
http://a.ninemsn.com.au/b.aspx?URL=http%3A%2F%2Fwww%2Easus%2Ecom%2Eau%2F&_t=752129232&_r=Hotmail_tagline_23Nov05&_m=EXT


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: Storing large text or binary source documents in the index and memory usage

Posted by George Washington <gw...@hotmail.com>.
Thank you Chris for replying and for the trouble you took to test the 
problem. I am looking forward to a reply from the Lucene project.
Cheers




RE: Storing large text or binary source documents in the index and memory usage

Posted by Chris Hostetter <ho...@fucit.org>.
: otherwise I would have done so already. My real question is question number
: one, which did not receive a reply, is there a formula that can tell me if
: what is happening is reasonable and to be expected, or am I doing something

I've never played with the binary fields much, nor have I ever tried to 
add more than a few KB of data to any document, but my reading of the docs 
doesn't turn up any reason why you should be encountering this problem.

Typically, when people report problems like this, it turns out the problem 
had nothing to do with Lucene -- they were forgetting to close files they 
were reading while indexing, or something like that. So I tried writing a 
unit test that allocates arbitrary in-memory byte arrays to see if I could 
reproduce the problem, and sure enough I can.
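
In outline the test does something like this (an illustrative sketch, not 
the exact code attached to the Jira issue below):

  import junit.framework.TestCase;
  import org.apache.lucene.analysis.SimpleAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;

  public class TestBigBinaryFields extends TestCase {
    public void testAddManyBigDocs() throws Exception {
      String dir = System.getProperty("java.io.tmpdir") + "/bigdoc-test";
      IndexWriter writer = new IndexWriter(dir, new SimpleAnalyzer(), true);
      for (int i = 0; i < 100; i++) {
        Document doc = new Document();
        // 5MB of arbitrary in-memory data, stored as a binary field.
        doc.add(new Field("data", new byte[5 * 1024 * 1024],
                          Field.Store.YES));
        writer.addDocument(doc);   // OutOfMemoryError around the 9th doc
        System.out.println("added doc " + i);
      }
      writer.close();
    }
  }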

No matter what size heap I use, I can't add more than 9 documents 
containing fields of 5MB of data. It seems I can add as many 4MB documents 
as my heart desires, but once I go up to 5MB all hell breaks loose.

I didn't try playing with the various IndexWriter options to see what 
effect they had on the breaking point.

I've opened a Jira bug about this:

https://issues.apache.org/jira/browse/LUCENE-488


-Hoss




RE: Storing large text or binary source documents in the index and memory usage

Posted by George Washington <gw...@hotmail.com>.
Thank you ipowers for your reply. Perhaps I did not make myself clear 
enough. As I explained in my original posting, I want to store large 
documents in the Lucene index; storing them elsewhere is not an option, 
otherwise I would have done so already. My real question is question number 
one, which did not receive a reply: is there a formula that can tell me 
whether what is happening is reasonable and to be expected, or whether I am 
doing something wrong or something that could be done better? I realise I 
can break a large file down into smaller chunks (a sketch of what I mean is 
below), but I don't want to do that unless the answer to question one is 
that there is no alternative.
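
For concreteness, the chunking I would rather avoid looks something like 
this (field names and chunk size are arbitrary):

  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;

  public class ChunkedStore {
    static final int CHUNK = 1024 * 1024;   // 1MB per chunk, arbitrary

    // Spread one large byte[] over several small documents, keyed so the
    // original can be reassembled in order at retrieval time.
    static void addChunked(IndexWriter writer, String name, byte[] data)
        throws Exception {
      for (int i = 0, n = 0; i < data.length; i += CHUNK, n++) {
        int len = Math.min(CHUNK, data.length - i);
        byte[] piece = new byte[len];
        System.arraycopy(data, i, piece, 0, len);
        Document doc = new Document();
        doc.add(new Field("name", name,
                          Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("chunk", String.valueOf(n),
                          Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("data", piece, Field.Store.YES));
        writer.addDocument(doc);
      }
    }
  }
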
Cheers





>From: "George Washington" <gw...@hotmail.com>
>Reply-To: java-user@lucene.apache.org
>To: java-user@lucene.apache.org
>Subject: Storing large text or binary source documents in the index and 
>memory usage
>Date: Fri, 20 Jan 2006 04:52:20 +0000
>
>I would like to store large source documents (>10MB) in the index in their 
>original form, i.e. as text for text documents or as byte[] for binary 
>documents.
>I have no difficulty adding the source document as a field to the Lucene 
>index document, but when I write the index document to the index I 
>consistently get out-of-memory errors for documents larger than approx 9MB.
>Is there a formula that can help calculate the max size of a document which 
>can be added to the index?
>Is there an alternative way to store such large documents that you can 
>suggest?
>I have 512MB memory under WinXP. Increasing the VM heap size does not help.
>Many thanks
>
>_________________________________________________________________
>Make your dream car a reality 
>http://a.ninemsn.com.au/b.aspx?URL=http%3A%2F%2Fcarpoint%2Eninemsn%2Ecom%2Eau&_t=12345&_r=emailtagline&_m=EXT
>
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>For additional commands, e-mail: java-user-help@lucene.apache.org
>

_________________________________________________________________
Make your dream car a reality 
http://a.ninemsn.com.au/b.aspx?URL=http%3A%2F%2Fcarpoint%2Eninemsn%2Ecom%2Eau&_t=12345&_r=emailtagline&_m=EXT


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org