Posted to user@tika.apache.org by Paul Pearcy <pa...@markit.com> on 2011/12/06 02:17:04 UTC

Processing large amounts of PDFs in parallel without running out of memory

Hello!
  First off, thanks to all who have contributed to this great library. It has made my life a lot easier :)

I am processing a large number of PDFs for search indexing, and after upgrading to Tika 0.9 I started hitting out-of-memory errors while processing PDFs. The heap dumps I get indicate that most of the memory is used up by PDFBox RandomAccessBuffers.

It appears that under the hood pdfbox can work with either a RandomAccessFile (http://pdfbox.apache.org/apidocs/org/apache/pdfbox/io/RandomAccessFile.html ) or a RandomAccessBuffer (http://pdfbox.apache.org/apidocs/org/apache/pdfbox/io/RandomAccessBuffer.html) and that tika uses RandomAccessBuffers for better performance. I'd like to sacrifice this performance for less RAM usage.

Is this possible?

Previously, I was passing Tika a byte array; I switched to a File in hopes that it would use RandomAccessFile, but that didn't appear to make a difference.

I have a hunch that using TikaInputStreams may be able to address this, but am not sure.
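For reference, this is roughly how I'm handing the document to Tika now (an untested sketch; I'm assuming TikaInputStream.get(File) and the Tika facade's parseToString behave the way the javadocs suggest):

```java
import java.io.File;
import java.io.InputStream;
import org.apache.tika.Tika;
import org.apache.tika.io.TikaInputStream;

public class FileBackedParse {
    public static void main(String[] args) throws Exception {
        File pdf = new File(args[0]);
        // Wrapping the File (rather than passing a byte[] or a plain
        // FileInputStream) gives Tika a stream that knows about its backing
        // file, so a parser could in principle pick a file-based code path
        // instead of buffering everything in RAM.
        InputStream stream = TikaInputStream.get(pdf);
        try {
            String text = new Tika().parseToString(stream);
            System.out.println(text.length());
        } finally {
            stream.close();
        }
    }
}
```

My understanding is that this alone doesn't help yet, since PDFParser currently ignores the backing file, hence my question below.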

Thanks and Best Regards,
Paul



________________________________
This e-mail, including accompanying communications and attachments, is strictly confidential and only for the intended recipient. Any retention, use or disclosure not expressly authorised by Markit is prohibited. This email is subject to all waivers and other terms at the following link: http://www.markit.com/en/about/legal/email-disclaimer.page

Please visit http://www.markit.com/en/about/contact/contact-us.page? for contact information on our offices worldwide.

RE: Processing large amounts of PDFs in parallel without running out of memory

Posted by Nick Burch <ni...@alfresco.com>.
On Mon, 12 Dec 2011, Paul Pearcy wrote:
> PDDocument pdfDocument =
>            PDDocument.load(new CloseShieldInputStream(stream), true);
>
> Anybody have thoughts on whether it makes sense to do this based on the 
> type of the underlying stream the parse method receives? Not sure if 
> there is a better option for controlling this behavior.

If the stream is a TikaInputStream, then we can check if we have a file, 
and get that if available. OfficeParser has similar code to what we'd need.

Can anyone spot a snag with doing this?

(If no-one pipes up in a few days, I'd suggest you open a new enhancement 
JIRA for the change, and I'll happily make the change if no-one else beats 
me to it!)
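For the record, the check I have in mind would look roughly like this (untested, and assuming TikaInputStream's cast/hasFile/getFile methods and PDFBox's load(File) overload behave the way I remember from OfficeParser):

```java
import java.io.File;
import java.io.InputStream;
import org.apache.commons.io.input.CloseShieldInputStream;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.tika.io.TikaInputStream;

public class PdfLoadSketch {
    // Inside PDFParser.parse(...): prefer the file-based PDFBox loader when
    // the incoming stream is a TikaInputStream backed by a real file.
    static PDDocument loadDocument(InputStream stream) throws Exception {
        TikaInputStream tis = TikaInputStream.cast(stream);
        if (tis != null && tis.hasFile()) {
            // File-based parsing: PDFBox can seek in the file directly
            // instead of pulling the whole document into heap-backed
            // RandomAccessBuffers.
            return PDDocument.load(tis.getFile());
        }
        // Fall back to the current stream-based path.
        return PDDocument.load(new CloseShieldInputStream(stream), true);
    }
}
```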

Nick

RE: Processing large amounts of PDFs in parallel without running out of memory

Posted by Paul Pearcy <pa...@markit.com>.
Thanks Nick.

I believe I found the relevant code:
https://github.com/apache/tika/blob/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java

PDDocument pdfDocument =
            PDDocument.load(new CloseShieldInputStream(stream), true);

It seems that passing a RandomAccessFile rather than a RandomAccessBuffer to the load method controls whether PDFBox keeps its temporary working data on disk or in memory.

Anybody have thoughts on whether it makes sense to do this based on the type of the underlying stream the parse method receives? Not sure if there is a better option for controlling this behavior.
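Concretely, I was picturing something like the following (untested; I'm assuming the three-argument load overload and the org.apache.pdfbox.io.RandomAccessFile(File, String) constructor exist as the 1.x javadocs suggest):

```java
import java.io.File;
import java.io.InputStream;
import org.apache.commons.io.input.CloseShieldInputStream;
import org.apache.pdfbox.io.RandomAccessFile;
import org.apache.pdfbox.pdmodel.PDDocument;

public class ScratchFileLoad {
    // Load a PDF using an on-disk scratch file, so PDFBox spills its working
    // data to disk instead of holding it in heap-backed RandomAccessBuffers.
    static PDDocument loadWithScratchFile(InputStream stream) throws Exception {
        File scratch = File.createTempFile("pdfbox-scratch", ".tmp");
        scratch.deleteOnExit();
        RandomAccessFile scratchFile = new RandomAccessFile(scratch, "rw");
        // Same call as in PDFParser today, but with an explicit on-disk
        // RandomAccess backing instead of the in-memory default.
        return PDDocument.load(new CloseShieldInputStream(stream), scratchFile, true);
    }
}
```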

Best Regards,
Paul

-----Original Message-----
From: Nick Burch [mailto:nick.burch@alfresco.com]
Sent: Monday, December 05, 2011 6:31 PM
To: user@tika.apache.org
Subject: Re: Processing large amounts of PDFs in parallel without running out of memory

On Mon, 5 Dec 2011, Paul Pearcy wrote:
> It appears that under the hood pdfbox can work with either a
> RandomAccessFile
> (http://pdfbox.apache.org/apidocs/org/apache/pdfbox/io/RandomAccessFile.html
> ) or a RandomAccessBuffer
> (http://pdfbox.apache.org/apidocs/org/apache/pdfbox/io/RandomAccessBuffer.html)
> and that tika uses RandomAccessBuffers for better performance. I'd like
> to sacrifice this performance for less RAM usage.
>
> Is this possible?

I think it should be a fairly simple change, to test if we have a
TikaInputStream, and if so one with a File, and if so use the File
constructor to PDFBox rather than the stream one.

I don't know the PDFBox related code well though, so I'll wait for others
to comment on the sanity of this... :)

Nick


Re: Processing large amounts of PDFs in parallel without running out of memory

Posted by Nick Burch <ni...@alfresco.com>.
On Mon, 5 Dec 2011, Paul Pearcy wrote:
> It appears that under the hood pdfbox can work with either a 
> RandomAccessFile 
> (http://pdfbox.apache.org/apidocs/org/apache/pdfbox/io/RandomAccessFile.html 
> ) or a RandomAccessBuffer 
> (http://pdfbox.apache.org/apidocs/org/apache/pdfbox/io/RandomAccessBuffer.html) 
> and that tika uses RandomAccessBuffers for better performance. I'd like 
> to sacrifice this performance for less RAM usage.
>
> Is this possible?

I think it should be a fairly simple change, to test if we have a 
TikaInputStream, and if so one with a File, and if so use the File 
constructor to PDFBox rather than the stream one.

I don't know the PDFBox related code well though, so I'll wait for others 
to comment on the sanity of this... :)

Nick