Posted to java-user@lucene.apache.org by Suba Suresh <su...@wolfram.com> on 2006/07/13 15:54:44 UTC

Out of memory error

I am indexing different document formats with Lucene 1.9. One of the PDF 
files I am indexing is 300MG. Whenever the index writer hits that file, 
indexing stops with an "Out of Memory" exception. I am using the PDFBox 
library to index. I have set the following merge factors in my code.

writer.setMergeFactor(1000);
writer.setMaxMergeDocs(9999999);
writer.setMaxBufferedDocs(1000);
writer.setMaxFieldLength(Integer.MAX_VALUE);

I would like any help and suggestions.

thanks,
suba suresh.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Out of memory error

Posted by Suba Suresh <su...@wolfram.com>.

Definitely. Thanks for both suggestions. Yes, it is 300MB (typo).

suba suresh.


Re: Out of memory error

Posted by Suba Suresh <su...@wolfram.com>.

Sorry for the late response. It took us some time to run it again. We 
increased the heap to 1G as you suggested, and it works: the indexer no 
longer crashes. (We are running into another problem with a PowerPoint 
file, but that is for another email.)
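
For reference, a small sketch for confirming the heap limit the JVM is actually running with, for example after starting the indexer with the standard -Xmx1g option. HeapCheck is a hypothetical name used only for illustration; Runtime.maxMemory() is part of the JDK.

```java
final class HeapCheck {
    /** Returns the maximum heap the JVM will attempt to use, in megabytes. */
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // Run as: java -Xmx1g HeapCheck
        System.out.println("Max heap: " + maxHeapMb() + " MB");
    }
}
```

The reported value is usually a little below the -Xmx setting, so treat it as a sanity check rather than an exact figure.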

The code change to use 
PDFTextStripper.writeText(org.pdfbox.pdmodel.PDDocument, java.io.Writer) 
did not work for us.


Thanks for all the help.

suba suresh.


RE: Out of memory error

Posted by "Rob Staveley (Tom)" <rs...@seseit.com>.

Let us know how you get on. There are a lot of people fighting very similar
battles on this list. 


Re: Out of memory error

Posted by Suba Suresh <su...@wolfram.com>.

Thanks.

I am using the getText(PDDocument) method of the PDFTextStripper. I will 
try the other suggestion.

suba suresh.


RE: Out of memory error

Posted by "Rob Staveley (Tom)" <rs...@seseit.com>.

If you are using
http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripper.html#getText(org.pdfbox.pdmodel.PDDocument),
you are going to get a large String and may need a 1G heap.

If, however, you are using
http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripper.html#writeText(org.pdfbox.pdmodel.PDDocument,%20java.io.Writer)
to go via a temporary file, you will not need so much RAM, but you need to use
http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.html#Field(java.lang.String,%20java.io.Reader)
to construct your Lucene field (rather than
http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.html#Field(java.lang.String,%20java.lang.String,%20org.apache.lucene.document.Field.Store,%20org.apache.lucene.document.Field.Index)).
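
A minimal sketch of the temporary-file approach described above, written against the JDK only so that it runs without PDFBox or Lucene on the classpath. TextSource and SpoolToTempFile are hypothetical names introduced for illustration; the comments mark where PDFTextStripper.writeText(PDDocument, Writer) and Field(String, Reader) would slot in.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

/** Hypothetical stand-in for any extractor that can stream its text to a Writer. */
interface TextSource {
    void writeTo(Writer out) throws IOException;
}

final class SpoolToTempFile {
    /** Streams extracted text to a temp file so the full text is never held in one String. */
    static File spool(TextSource source) throws IOException {
        File tmp = File.createTempFile("extracted-", ".txt");
        tmp.deleteOnExit();
        try (Writer out = Files.newBufferedWriter(tmp.toPath(), StandardCharsets.UTF_8)) {
            // With PDFBox, this is where stripper.writeText(pdDocument, out) would go.
            source.writeTo(out);
        }
        return tmp;
    }

    /** Opens the spooled text as a Reader, the type Field(String, Reader) accepts. */
    static Reader open(File spooled) throws IOException {
        return Files.newBufferedReader(spooled.toPath(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Fake "extractor" that emits three short pages of text.
        File tmp = spool(out -> {
            for (int page = 1; page <= 3; page++) {
                out.write("text of page " + page + "\n");
            }
        });
        try (BufferedReader in = new BufferedReader(open(tmp))) {
            // In the indexer: doc.add(new Field("contents", in)); writer.addDocument(doc);
            System.out.println(in.readLine());
        }
    }
}
```

The Reader has to stay open until the document has been added, because Lucene reads from it while indexing; close it and delete the temporary file afterwards.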
