Posted to users@pdfbox.apache.org by Christian Ortolf <Ch...@gmx.ch> on 2009/10/19 09:21:49 UTC

Possible to read text without loading PDF to ram?

Hello,

Is there any way to read the text in a PDF without loading
the whole document into RAM?

My problem is that some documents cause OutOfMemory errors, and
increasing the heap size is not an option.

So would it be possible to read the text of a PDF sequentially,
or perhaps to load the PDF without its images so that its in-memory
size stays limited?

regards
Christian

Re: Possible to read text without loading PDF to ram?

Posted by Andreas Lehmkühler <an...@lehmi.de>.

Hi,

Christian Ortolf wrote:
> Hello,
> 
> Is there any way to read the text in a PDF without loading
> the whole document into RAM?
There is no dedicated option for that, but there may be a workaround:
try extracting the text one page at a time, so that each step should
need fewer resources.
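
The page-at-a-time idea can be sketched roughly as follows. This is only
an illustration, not PDFBox code: PageTextSource and PageByPageExtractor
are hypothetical names introduced here. With PDFBox, textOfPage could
plausibly be backed by a PDFTextStripper whose setStartPage and
setEndPage are both set to the wanted page before calling getText, and
pageCount by PDDocument.getNumberOfPages(). The point of the sketch is
that each page's text is written out immediately instead of being
collected into one large string. Whether this avoids the OutOfMemory
error depends on how much memory loading the document itself needs.

```java
import java.io.IOException;
import java.io.Writer;

public class PageByPageExtractor {

    /** Hypothetical abstraction over a document that yields text per page. */
    public interface PageTextSource {
        /** Number of pages in the document. */
        int pageCount() throws IOException;

        /** Text of a single page; pageNumber is 1-based. */
        String textOfPage(int pageNumber) throws IOException;
    }

    /**
     * Writes the text of every page to out, one page at a time, so that
     * only the text of the current page is held by this method.
     *
     * @return the number of pages written
     */
    public static int extract(PageTextSource source, Writer out) throws IOException {
        int pages = source.pageCount();
        for (int page = 1; page <= pages; page++) {
            String text = source.textOfPage(page);
            out.write(text);
            out.flush(); // hand the page's text on before extracting the next page
        }
        return pages;
    }
}
```

Passing a FileWriter (or any other streaming Writer) as out keeps the
extracted text out of the heap once each page has been written.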

> My problem is that some documents cause OutOfMemory errors, and
> increasing the heap size is not an option.
Hmm, on the other hand this could be an issue in PDFBox itself. Could
you provide us with a sample document that fails with an OutOfMemory
error? If so, please create an issue in JIRA [1] and attach the PDF
to it.

> So would it be possible to read the text of a PDF sequentially,
> or perhaps to load the PDF without its images so that its in-memory
> size stays limited?
During text extraction, all operators that aren't needed for the
extraction itself should be skipped. See [2] for details.
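
As a rough illustration of that skipping, assuming the properties file
in [2] maps content-stream operator names to handler classes, a lookup
table can decide which operators get processed at all. The sketch below
is not PDFBox's actual code: OperatorFilterSketch and the handler names
are hypothetical, and only the operator names (BT, ET, Tf, Tj for text;
re, f, Do for graphics and images) are standard PDF operators.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class OperatorFilterSketch {

    // Illustrative mapping only; the handler names are hypothetical.
    static final String MAPPING =
        "BT = BeginTextHandler\n" +
        "ET = EndTextHandler\n" +
        "Tf = SetFontHandler\n" +
        "Tj = ShowTextHandler\n";

    /** Parses a properties-style mapping of operator name to handler name. */
    public static Properties loadMapping(String text) throws IOException {
        Properties mapping = new Properties();
        mapping.load(new StringReader(text));
        return mapping;
    }

    /** Returns only the operators that have a handler; all others are skipped. */
    public static List<String> handledOperators(Properties mapping, List<String> operators) {
        List<String> handled = new ArrayList<String>();
        for (String operator : operators) {
            if (mapping.getProperty(operator) != null) {
                handled.add(operator);
            }
        }
        return handled;
    }
}
```

Given the operator sequence BT, Tf, re, f, Tj, Do, ET, only BT, Tf, Tj
and ET have an entry in this mapping, so the graphics and image
operators are never dispatched to a handler.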

BR
Andreas Lehmkühler

[1] https://issues.apache.org/jira/browse/PDFBOX
[2] http://svn.apache.org/viewvc/incubator/pdfbox/trunk/src/main/resources/Resources/PDFTextStripper.properties?view=log