Posted to java-user@lucene.apache.org by Suba Suresh <su...@wolfram.com> on 2006/07/13 15:54:44 UTC
Out of memory error
I am indexing different document formats with Lucene 1.9. One of the PDF
files I am indexing is 300MG. Whenever the index writer hits that file,
indexing stops with an "Out of Memory" exception. I am using the PDFBox
library to index. I have set the following merge factors in my code.
writer.setMergeFactor(1000);
writer.setMaxMergeDocs(9999999);
writer.setMaxBufferedDocs(1000);
writer.setMaxFieldLength(Integer.MAX_VALUE);
I would like any help and suggestions.
thanks,
suba suresh.
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Out of memory error
Posted by Suba Suresh <su...@wolfram.com>.
Definitely. Thanks for both suggestions. Yes, it is 300MB (typo).
suba suresh.
Re: Out of memory error
Posted by Suba Suresh <su...@wolfram.com>.
Sorry for my late response. It took us some time to run it again. We
increased the heap to 1G as you suggested, and it works. The
indexer is not crashing. (We are running into another problem with a
PowerPoint file. That is for another email.)
The code change with
PDFTextStripper.writeText(org.pdfbox.pdmodel.PDDocument, java.io.Writer)
did not work for us.
Thanks for all the help.
suba suresh.
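For anyone reproducing the heap fix described above: the maximum heap is usually raised with the JVM's -Xmx option (for example `java -Xmx1g ...`), and the effective limit can be checked from inside the program. The following is only a minimal sketch; the class name HeapCheck is hypothetical and is not part of the indexer discussed in this thread.

```java
public class HeapCheck {

    /** Returns the maximum heap the JVM will attempt to use, in mebibytes. */
    public static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // Run with e.g. "java -Xmx1g HeapCheck" to confirm the setting took effect.
        System.out.println("max heap: " + maxHeapMiB() + " MiB");
    }
}
```

The reported value is often slightly below the requested size, since some JVMs exclude part of the heap from this figure.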
RE: Out of memory error
Posted by "Rob Staveley (Tom)" <rs...@seseit.com>.
Let us know how you get on. There are a lot of people fighting very similar
battles on this list.
Re: Out of memory error
Posted by Suba Suresh <su...@wolfram.com>.
Thanks.
I am using the getText(PDDocument) method of the PDFTextStripper. I will
try the other suggestion.
suba suresh.
RE: Out of memory error
Posted by "Rob Staveley (Tom)" <rs...@seseit.com>.
If you are using
http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripper.html#getText(org.pdfbox.pdmodel.PDDocument),
you are going to get a large String and may need a 1G heap.
If, however, you are using
http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripper.html#writeText(org.pdfbox.pdmodel.PDDocument,%20java.io.Writer)
to go via a temporary file, you will not need so much RAM, but you need to use
http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.html#Field(java.lang.String,%20java.io.Reader)
to construct your Lucene field (rather than
http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.html#Field(java.lang.String,%20java.lang.String,%20org.apache.lucene.document.Field.Store,%20org.apache.lucene.document.Field.Index)).
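The temporary-file route described above can be sketched with the standard library alone. This is an illustration of the streaming pattern, not the PDFBox or Lucene API: TextSource and TextSink are hypothetical stand-ins for PDFTextStripper.writeText(PDDocument, Writer) and for code that builds a Field(String, Reader) and adds the document, and the class and method names are assumptions made for the example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class StreamingExtractSketch {

    /** Hypothetical stand-in for an extractor that streams text to a Writer. */
    public interface TextSource {
        void writeText(Writer out) throws IOException;
    }

    /** Hypothetical stand-in for a consumer that pulls text from a Reader. */
    public interface TextSink {
        void consume(Reader in) throws IOException;
    }

    /** Spools extracted text to a temp file, then hands the consumer a Reader over it. */
    public static void extractViaTempFile(TextSource source, TextSink sink) throws IOException {
        Path tmp = Files.createTempFile("extracted-", ".txt");
        try {
            // Step 1: stream the text to disk so it never exists as one large String.
            try (Writer out = Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
                source.writeText(out);
            }
            // Step 2: the consumer reads incrementally; the Reader stays open until it returns.
            try (Reader in = Files.newBufferedReader(tmp, StandardCharsets.UTF_8)) {
                sink.consume(in);
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }

    /** Example consumer: counts characters using a fixed-size buffer. */
    public static long countChars(Reader in) throws IOException {
        char[] buf = new char[8192];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        long[] seen = new long[1];
        extractViaTempFile(
            out -> { for (int i = 0; i < 100_000; i++) out.write("page text line\n"); },
            in -> seen[0] = countChars(in));
        System.out.println("characters streamed: " + seen[0]);
    }
}
```

In the real setup the sink would presumably add the document to the index while the Reader is still open, since a Reader-based field is read during indexing rather than when the field is constructed.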