
Posted to users@pdfbox.apache.org by P Williams <wi...@gmail.com> on 2011/11/25 18:30:55 UTC

PDFBOX-948

Hi All,

It appears that I am the victim of the caveat stated in PDFBOX-948
<https://issues.apache.org/jira/browse/PDFBOX-948>:

*"For normal sized PDFs files, the in-memory implementation
RandomAccessBuffer should not increase the memory usage too much, while
providing faster IO as all access operations are only memory copies.

Therefore, please consider switching the default to in-memory scratch
buffers. Users with very large files can still pass a temporary directory."*

I'm using the apache-solr-4.0-2011-10-14_08-56-59 snapshot, which uses Tika
0.10, which in turn uses PDFBox 1.6.0, and I'm getting the following
out-of-memory error:

Caused by: java.lang.OutOfMemoryError: Java heap space
        at org.apache.pdfbox.io.RandomAccessBuffer.expandBuffer(RandomAccessBuffer.java:151)
        at org.apache.pdfbox.io.RandomAccessBuffer.write(RandomAccessBuffer.java:131)
        at org.apache.pdfbox.io.RandomAccessFileOutputStream.write(RandomAccessFileOutputStream.java:108)
        at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
        at java.io.BufferedOutputStream.write(Unknown Source)
        at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.encryptData(SecurityHandler.java:294)
        at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptStream(SecurityHandler.java:391)
        at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:363)
        at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptObject(SecurityHandler.java:337)
        at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.proceedDecryption(SecurityHandler.java:177)
        at org.apache.pdfbox.pdmodel.encryption.StandardSecurityHandler.decryptDocument(StandardSecurityHandler.java:257)
        at org.apache.pdfbox.pdmodel.PDDocument.openProtection(PDDocument.java:1325)
        at org.apache.pdfbox.pdmodel.PDDocument.decrypt(PDDocument.java:796)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:84)
        at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
        at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:128)
        at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:237)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:671)

At which stage should I be able to pass a temporary directory?  Would it
need to be a Tika configuration?  Or does there need to be something
changed in PDFBox to even enable this option?

I have been eager to use Tika 0.10 with Solr because it includes the fix for
the full-text garbling issue tracked in TIKA-611. Unfortunately, it has
introduced new issues with my indexing workflow that appear to originate in
PDFBox 1.6.0.

Thanks,
Tricia

Re: PDFBOX-948

Posted by Andreas Lehmkuehler <an...@lehmi.de>.

On 25.11.2011 18:30, P Williams wrote:
> Hi All,
>
> It appears that I am the victim of the caveat stated in PDFBOX-948
> <https://issues.apache.org/jira/browse/PDFBOX-948>:
>
> *"For normal sized PDFs files, the in-memory implementation
> RandomAccessBuffer should not increase the memory usage too much, while
> providing faster IO as all access operations are only memory copies.
>
> Therefore, please consider switching the default to in-memory scratch
> buffers. Users with very large files can still pass a temporary directory."*
>
> I'm using apache-solr-4.0-2011-10-14_08-56-59 snapshot, which uses Tika
> 0.10 which uses PDFBox 1.6.0 and getting the following out of memory error:
>
> Caused by: java.lang.OutOfMemoryError: Java heap space
>         at org.apache.pdfbox.io.RandomAccessBuffer.expandBuffer(RandomAccessBuffer.java:151)
>         at org.apache.pdfbox.io.RandomAccessBuffer.write(RandomAccessBuffer.java:131)
>         at org.apache.pdfbox.io.RandomAccessFileOutputStream.write(RandomAccessFileOutputStream.java:108)
>         at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
>         at java.io.BufferedOutputStream.write(Unknown Source)
>         at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.encryptData(SecurityHandler.java:294)
>         at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptStream(SecurityHandler.java:391)
>         at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:363)
>         at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptObject(SecurityHandler.java:337)
>         at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.proceedDecryption(SecurityHandler.java:177)
>         at org.apache.pdfbox.pdmodel.encryption.StandardSecurityHandler.decryptDocument(StandardSecurityHandler.java:257)
>         at org.apache.pdfbox.pdmodel.PDDocument.openProtection(PDDocument.java:1325)
>         at org.apache.pdfbox.pdmodel.PDDocument.decrypt(PDDocument.java:796)
>         at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:84)
>         at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
>         at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:128)
>         at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:237)
>         at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:671)
>
> At which stage should I be able to pass a temporary directory?  Would it
> need to be a Tika configuration?  Or does there need to be something
> changed in PDFBox to even enable this option?
There are two possible ways to provide a temp directory/scratch file:

- Pass an instance of RandomAccessFile to the PDDocument.load method you are
  using (I'm not familiar with the Tika details, but I guess that method is
  used).
- If the PDFParser is used directly, either pass an instance of
  RandomAccessFile to the constructor or set the temp directory using
  setTempDirectory.
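
As a rough illustration of why a file-backed scratch buffer helps here, the
sketch below contrasts the two strategies using only the JDK. It is not PDFBox
code: the class ScratchFileSketch and its methods are made up for this example,
and java.io.RandomAccessFile stands in for whatever scratch-file type the
library actually expects. The in-memory variant has to grow a heap array to
hold the whole payload (the pattern that fails in expandBuffer above), while
the file variant only ever keeps one small chunk on the heap.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Arrays;

// Hypothetical illustration only; none of these names come from PDFBox or Tika.
final class ScratchFileSketch {

    private static final int CHUNK_SIZE = 8192;

    /**
     * Accumulates totalBytes in a growing heap array and returns the final
     * array capacity. Heap usage is at least as large as the payload.
     */
    static int writeToMemory(int totalBytes) {
        byte[] buffer = new byte[1024];
        byte[] chunk = new byte[CHUNK_SIZE];
        int size = 0;
        while (size < totalBytes) {
            int n = Math.min(chunk.length, totalBytes - size);
            if (size + n > buffer.length) {
                // Growing means allocating a bigger array and copying the old one.
                buffer = Arrays.copyOf(buffer, Math.max(buffer.length * 2, size + n));
            }
            System.arraycopy(chunk, 0, buffer, size, n);
            size += n;
        }
        return buffer.length;
    }

    /**
     * Writes totalBytes to a temporary file inside tempDir (or the default
     * temp directory if tempDir is null) and returns the file length.
     * Only the reusable chunk array lives on the heap.
     */
    static long writeToScratchFile(File tempDir, long totalBytes) throws IOException {
        File scratch = File.createTempFile("scratch", ".tmp", tempDir);
        byte[] chunk = new byte[CHUNK_SIZE];
        try (RandomAccessFile raf = new RandomAccessFile(scratch, "rw")) {
            long remaining = totalBytes;
            while (remaining > 0) {
                int n = (int) Math.min(chunk.length, remaining);
                raf.write(chunk, 0, n);
                remaining -= n;
            }
            return raf.length();
        } finally {
            // Remove the scratch file once the stream has been closed.
            scratch.delete();
        }
    }

    public static void main(String[] args) throws IOException {
        int size = 5 * 1024 * 1024;
        System.out.println("in-memory capacity:  " + writeToMemory(size));
        System.out.println("scratch file length: " + writeToScratchFile(null, size));
    }
}
```

The trade-off is the one described in PDFBOX-948: the in-memory path avoids
disk IO, whereas the scratch file keeps heap usage roughly constant for very
large streams.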

> I have been eager to get Tika 0.10 with Solr because it solves a full text
> garbling issue solved by TIKA-611, unfortunately it has introduced new
> issues with my indexing  workflow that appear to originate in pdfbox 1.6.0.
>
> Thanks,
> Tricia


BR
Andreas Lehmkühler