Posted to user@tika.apache.org by Chaitali Patel <ch...@net4nuts.com> on 2009/08/13 07:32:42 UTC

PDF content extraction takes lot of time

Hi,
 
I am using PDFParser to extract PDF content, but the extraction takes a
lot of time. In my case a 250KB file took around 5 minutes to extract,
and a 600KB file took around 10 minutes.
 
Following is the code snippet which I am using.
 
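import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

// "file" holds the path of the PDF to extract
InputStream input = new FileInputStream(new File(file));
try {
    ContentHandler textHandler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    PDFParser parser = new PDFParser();
    parser.parse(input, textHandler, metadata);
} finally {
    input.close(); // close the stream even if parsing fails
}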

 

Is there any optimization I can apply, or any other solution to this?

 

Please reply asap.


Re: PDF content extraction takes lot of time

Posted by Daniel Knapp <da...@mni.fh-giessen.de>.
Hello Jukka,

sorry for the late reply.

It seems the problem no longer appears with the released 0.5 version.

Thanks again for your help!

Regards,
Daniel

On 16.11.2009 at 14:48, Jukka Zitting wrote:

> Hi,
> 
> On Mon, Nov 16, 2009 at 2:43 PM, Daniel Knapp
> <da...@mni.fh-giessen.de> wrote:
>> Actually, I use Tika 0.5 from SVN. Most of my PDF files are parsed in a second,
>> but a few files took very long. What a pity! :(
> 
> Can you file a Tika improvement issue about PDF parsing speed and
> attach an example file that illustrates the problem (if you don't want
> to share the files in public, you can also send one to me in private)?
> I can profile the parsing process and report the issue back to the
> PDFBox project.
> 
> BR,
> 
> Jukka Zitting


Re: PDF content extraction takes lot of time

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Mon, Nov 16, 2009 at 2:43 PM, Daniel Knapp
<da...@mni.fh-giessen.de> wrote:
> Actually, I use Tika 0.5 from SVN. Most of my PDF files are parsed in a second,
> but a few files took very long. What a pity! :(

Can you file a Tika improvement issue about PDF parsing speed and
attach an example file that illustrates the problem (if you don't want
to share the files in public, you can also send one to me in private)?
I can profile the parsing process and report the issue back to the
PDFBox project.

BR,

Jukka Zitting
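
To find an example file worth attaching to such an issue, it can help to time each file separately. The following is a minimal sketch and not part of Tika: the class name ExtractionTimer, its methods, and the 1000 ms threshold are hypothetical, and the extractor lambda is a placeholder where the actual PDFParser call would go.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Times a text-extraction function per file so slow documents can be singled out. */
final class ExtractionTimer {

    /** Runs the extractor on each path and returns elapsed milliseconds per path, in input order. */
    static Map<String, Long> timeAll(List<String> paths, Function<String, String> extractor) {
        Map<String, Long> elapsed = new LinkedHashMap<>();
        for (String path : paths) {
            long start = System.nanoTime();
            try {
                extractor.apply(path);
            } catch (RuntimeException e) {
                // A failing file is still recorded with the time spent before the failure.
            }
            elapsed.put(path, (System.nanoTime() - start) / 1_000_000L);
        }
        return elapsed;
    }

    /** Returns the paths whose extraction took at least thresholdMillis. */
    static List<String> slowFiles(Map<String, Long> elapsed, long thresholdMillis) {
        List<String> slow = new ArrayList<>();
        for (Map.Entry<String, Long> entry : elapsed.entrySet()) {
            if (entry.getValue() >= thresholdMillis) {
                slow.add(entry.getKey());
            }
        }
        return slow;
    }

    public static void main(String[] args) {
        // Placeholder extractor (assumption): replace the lambda body with the real Tika call.
        Function<String, String> extractor = path -> "";
        Map<String, Long> elapsed = timeAll(Arrays.asList(args), extractor);
        for (String path : slowFiles(elapsed, 1000L)) {
            System.out.println(path + " took " + elapsed.get(path) + " ms");
        }
    }
}
```

Files reported by slowFiles are the natural candidates to attach to the issue or to send privately.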

Re: PDF content extraction takes lot of time

Posted by Daniel Knapp <da...@mni.fh-giessen.de>.
On 16.11.2009 at 14:27, Jukka Zitting wrote:

> Hi,
> 
> On Mon, Nov 16, 2009 at 2:20 PM, Daniel Knapp
> <da...@mni.fh-giessen.de> wrote:
>> I have the same problem. Some PDF files take 10 minutes or more to be parsed by Tika.
>> 
>> Is there any news about a new PDFBox version that will fix this problem in the near future?
> 
> Among other improvements the recent PDFBox 0.8.0 release used by Tika
> 0.5 (see tika-dev@ for a release candidate) contains some pretty nice
> speed improvements for PDFs with complex color profiles.

Thanks for your answer Jukka.

Actually, I use Tika 0.5 from SVN. Most of my PDF files are parsed in a second, but a few files took very long.
What a pity! :(

> See
> http://markmail.org/message/amyffqyrmg762vid for some background. I'm
> not sure if or how much this will affect text extraction speed.
> 
> BR,
> 
> Jukka Zitting


Re: PDF content extraction takes lot of time

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Mon, Nov 16, 2009 at 2:20 PM, Daniel Knapp
<da...@mni.fh-giessen.de> wrote:
> I have the same problem. Some PDF files take 10 minutes or more to be parsed by Tika.
>
> Is there any news about a new PDFBox version that will fix this problem in the near future?

Among other improvements the recent PDFBox 0.8.0 release used by Tika
0.5 (see tika-dev@ for a release candidate) contains some pretty nice
speed improvements for PDFs with complex color profiles. See
http://markmail.org/message/amyffqyrmg762vid for some background. I'm
not sure if or how much this will affect text extraction speed.

BR,

Jukka Zitting

Re: PDF content extraction takes lot of time

Posted by Daniel Knapp <da...@mni.fh-giessen.de>.
Hello,

I have the same problem. Some PDF files take 10 minutes or more to be parsed by Tika.

Is there any news about a new PDFBox version that will fix this problem in the near future?

The workaround with pdf2html doesn't work well for me.

BR,
Daniel

On 13.08.2009 at 16:00, Steen Manniche wrote:

> On Thu, Aug 13, 2009 at 03:30:22PM +0200, Jukka Zitting wrote:
>> Hi,
>> 
>> On Thu, Aug 13, 2009 at 7:32 AM, Chaitali Patel<ch...@net4nuts.com> wrote:
>>> I am using PDFParser to extract PDF content. But it is taking a lot of time
>>> for extracting. 250KB file in my case took around 5 minutes to get
>>> extracted. a 600KB file took around 10 minutes.
>>> [...]
>>> Is there any kind of optimization required or any other solution to this.?
>> 
>> Unfortunately no. The main cause of the slow parsing time is in the
>> PDFBox library.
>> 
>> I've tried profiling it and it looks like the parsing time could be
>> significantly reduced especially by optimizing or even avoiding some
>> of the font size calculations that PDFBox keeps doing even if you're
>> just extracting the text content.
> 
> You could also try preprocessing the PDF files with pdf2html
> (http://sourceforge.net/projects/pdftohtml/), and parse the output
> with the HTMLParser. You could additionally process the HTML and
> remove unwanted/unneeded content before parsing it with Tika.
> 
> Best Regards,
> Steen Manniche
> 
>> 
>> However, that's something to be discussed over at the Apache PDFBox
>> mailing lists. There isn't much we can do about it in Tika until a new
>> PDFBox release is available or someone contributes an alternative PDF
>> parser implementation.
>> 
>> BR,
>> 
>> Jukka Zitting
>> 
>> 
> 


Re: PDF content extraction takes lot of time

Posted by Steen Manniche <st...@dbc.dk>.
On Thu, Aug 13, 2009 at 03:30:22PM +0200, Jukka Zitting wrote:
> Hi,
> 
> On Thu, Aug 13, 2009 at 7:32 AM, Chaitali Patel<ch...@net4nuts.com> wrote:
> > I am using PDFParser to extract PDF content. But it is taking a lot of time
> > for extracting. 250KB file in my case took around 5 minutes to get
> > extracted. a 600KB file took around 10 minutes.
> > [...]
> > Is there any kind of optimization required or any other solution to this.?
> 
> Unfortunately no. The main cause of the slow parsing time is in the
> PDFBox library.
> 
> I've tried profiling it and it looks like the parsing time could be
> significantly reduced especially by optimizing or even avoiding some
> of the font size calculations that PDFBox keeps doing even if you're
> just extracting the text content.

You could also try preprocessing the PDF files with pdf2html
(http://sourceforge.net/projects/pdftohtml/), and parse the output
with the HTMLParser. You could additionally process the HTML and
remove unwanted/unneeded content before parsing it with Tika.

Best Regards,
Steen Manniche
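
A rough sketch of the preprocessing step described above, under stated assumptions: the pdftohtml binary is on the PATH, the installed version accepts the -i (ignore images), -noframes and -stdout options, and its output is UTF-8. Check these against the local pdftohtml help output before relying on them. The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

/** Converts a PDF to HTML by calling the external pdftohtml tool. */
final class PdfToHtmlPreprocessor {

    /** Builds the command line; the flags are assumptions about the installed pdftohtml. */
    static List<String> buildCommand(String pdfPath) {
        return Arrays.asList("pdftohtml", "-i", "-noframes", "-stdout", pdfPath);
    }

    /** Runs pdftohtml and returns the HTML it writes to standard output. */
    static String convert(String pdfPath) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(buildCommand(pdfPath));
        // Pass the tool's error messages through to this process's stderr.
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = builder.start();

        ByteArrayOutputStream html = new ByteArrayOutputStream();
        try (InputStream out = process.getInputStream()) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = out.read(buffer)) != -1) {
                html.write(buffer, 0, read);
            }
        }
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("pdftohtml exited with status " + exitCode);
        }
        // Assumes UTF-8 output; adjust if the tool is configured differently.
        return new String(html.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(convert(args[0]));
    }
}
```

The returned HTML can then be cleaned up and passed to Tika's HTML parser instead of sending the original PDF to PDFParser.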

> 
> However, that's something to be discussed over at the Apache PDFBox
> mailing lists. There isn't much we can do about it in Tika until a new
> PDFBox release is available or someone contributes an alternative PDF
> parser implementation.
> 
> BR,
> 
> Jukka Zitting
> 
> 


Re: PDF content extraction takes lot of time

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Thu, Aug 13, 2009 at 7:32 AM, Chaitali Patel<ch...@net4nuts.com> wrote:
> I am using PDFParser to extract PDF content. But it is taking a lot of time
> for extracting. 250KB file in my case took around 5 minutes to get
> extracted. a 600KB file took around 10 minutes.
> [...]
> Is there any kind of optimization required or any other solution to this.?

Unfortunately no. The main cause of the slow parsing time is in the
PDFBox library.

I've tried profiling it and it looks like the parsing time could be
significantly reduced especially by optimizing or even avoiding some
of the font size calculations that PDFBox keeps doing even if you're
just extracting the text content.

However, that's something to be discussed over at the Apache PDFBox
mailing lists. There isn't much we can do about it in Tika until a new
PDFBox release is available or someone contributes an alternative PDF
parser implementation.

BR,

Jukka Zitting
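
Until the parser itself gets faster, one possible mitigation (not suggested in the thread, so treat it as an assumption) is to cap the time spent on any single document so that one slow PDF cannot stall a whole batch. A minimal sketch using only java.util.concurrent follows. The class and method names are hypothetical, and the task in main is a placeholder for the real extraction call.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Runs an extraction task with an upper bound on how long the caller waits. */
final class TimeLimitedExtraction {

    /**
     * Runs the task on a worker thread and returns its result, or the fallback
     * if the task fails or does not finish within timeoutMillis.
     */
    static String runWithTimeout(Callable<String> task, long timeoutMillis, String fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread worker = new Thread(runnable, "extraction-worker");
            worker.setDaemon(true); // a stuck parse must not keep the JVM alive
            return worker;
        });
        Future<String> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker; CPU-bound code may ignore it
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the task itself threw an exception
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            future.cancel(true);
            return fallback;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Placeholder task (assumption): the body would call the parser and return the handler's text.
        String text = runWithTimeout(() -> "extracted text", 30_000L, "");
        System.out.println(text);
    }
}
```

Note that the timeout only bounds how long the caller waits. If the parsing code does not react to interruption, the worker thread may keep using CPU in the background until it finishes.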