Posted to dev@pdfbox.apache.org by "Martinez, Mel - 1004 - MITLL" <m....@ll.mit.edu> on 2010/01/14 22:53:47 UTC

RE: [jira] performance issue flood

Sorry about all the Jira issues flooding your in-boxes!  I'm done for now.

What are the chances of getting some or all of these performance tweaks
committed to the codebase?  My project really really really needs PDFBox to
be faster and yet I'm also constrained to only use the 'release' versions of
it.  If we could get these performance tweaks incorporated into the 1.0
release, that would really be helpful.

In particular, the BaseParser.readUntilEndStream() method improvement is
desperately needed (in https://issues.apache.org/jira/browse/PDFBOX-591 ).

That one should benefit practically all users of PDFBox.

Cheers,

Dr. Mel Martinez
m.martinez@ll.mit.edu

RE: [jira] performance issue flood

Posted by "Martinez, Mel - 1004 - MITLL" <m....@ll.mit.edu>.

Thanks, Jukka, for committing all those!

Yes, I'll use diff patches for future submissions.  I've done that with
other open source projects.  Not all of them accept patches, so this time I
just submitted full modified copies of the files.

Also - thanks for the _additional_ performance tweaks that you put in!

Philipp asked about the expected performance boost.  That depends on what
exactly you are doing.  My submissions were very closely targeted around
text extraction.  I had two phases:  1) load the document and 2) extract the
text.

In my test cases for text extraction to *memory*, from a 'typical' file that
my project expects to see (basically a presentation slide show with text and
graphics) I get a performance improvement of about 17.5 / 8.4 == ~2.1 or so.
However, within the context of other things, the real world improvement will
be less than that because these improvements don't speed up things like the
file output.  The gain also depends on the ratio of text objects to other
objects inside the file and on how the text objects are stored inside the
stream objects, so there are a lot of variables.
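
To make the arithmetic above concrete, here is a minimal sketch of how the
17.5 / 8.4 ratio turns into a smaller end-to-end gain once untouched work
such as file output is included.  The class name, the method names and the
0.6 fraction are hypothetical and chosen only for illustration; the formula
is the usual Amdahl's-law estimate, not anything measured in this thread.

```java
public class SpeedupEstimate {

    /** Speedup of the optimized part: old time divided by new time. */
    public static double ratio(double before, double after) {
        return before / after;
    }

    /**
     * Overall speedup when only part of the work got faster (Amdahl's law).
     * optimizedFraction is the share of the ORIGINAL run time spent in the
     * part that was optimized; the rest (e.g. file output) is unchanged.
     */
    public static double overall(double optimizedFraction, double partSpeedup) {
        return 1.0 / ((1.0 - optimizedFraction) + optimizedFraction / partSpeedup);
    }

    public static void main(String[] args) {
        double part = ratio(17.5, 8.4);   // the two figures quoted above
        double assumedFraction = 0.6;     // hypothetical: 60% of the time was extraction
        System.out.printf("extraction speedup: %.2f%n", part);
        System.out.printf("overall speedup:    %.2f%n", overall(assumedFraction, part));
    }
}
```

With the assumed 60% share, the overall factor comes out near 1.45 rather
than 2.1; a different share gives a different number.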

Nevertheless, most any operation that involves parsing a PDF will get SOME
boost from the readUntilEndStream() method fix and many should also get some
boost from some of the various other tweaks.

At this point, most of the load() side of my tests is spent in the stream
read() method, so there is not much improvement left to get from there.
There is definitely some more room on the 'extract text' side.

Mel

-----Original Message-----
From: Jukka Zitting [mailto:jukka.zitting@gmail.com] 
Sent: Friday, January 15, 2010 7:17 PM
To: dev@pdfbox.apache.org
Subject: Re: [jira] performance issue flood

Hi,

On Fri, Jan 15, 2010 at 8:30 AM, Philipp Koch <ph...@day.com> wrote:
> thanks a lot for your performance optimization contributions.

+1 Good stuff!

> what factor of (overall) speedup is to be expected?

I ran some simple tests and it looks like PDF loading (PDDocument.load
on a File) is now about 20% faster and text extraction
(PDFTextStripper.writeText on an already loaded PDDocument and a dummy
writer) about 30% faster than before Mel's patches and my additional
improvements.
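
A measurement of this kind (each phase timed separately, with extraction
writing to a dummy writer) could be sketched roughly as follows.  This is
not the test code used in this thread: PhaseTimer, DiscardingWriter and
Timed are hypothetical names, and the two phases are stand-ins so that the
example runs without PDFBox on the classpath.  In a real test the first
Callable would call PDDocument.load on a File and the second would call
PDFTextStripper.writeText with the loaded document and the writer.

```java
import java.io.Writer;
import java.util.concurrent.Callable;

public class PhaseTimer {

    /** Dummy writer: discards the text but counts characters written. */
    public static final class DiscardingWriter extends Writer {
        private long count;
        @Override public void write(char[] buf, int off, int len) { count += len; }
        @Override public void flush() { }
        @Override public void close() { }
        public long charsWritten() { return count; }
    }

    /** Result of one timed phase: its return value and elapsed nanoseconds. */
    public static final class Timed<T> {
        public final T value;
        public final long nanos;
        Timed(T value, long nanos) { this.value = value; this.nanos = nanos; }
    }

    /** Runs one phase and records how long it took. */
    public static <T> Timed<T> time(Callable<T> phase) throws Exception {
        long start = System.nanoTime();
        T value = phase.call();
        return new Timed<>(value, System.nanoTime() - start);
    }

    public static void main(String[] args) throws Exception {
        // Phase 1 stand-in (assumption): replace with the document load.
        Timed<String> loaded = time(() -> "stand-in for a loaded document");

        // Phase 2 stand-in (assumption): replace with the text extraction.
        DiscardingWriter sink = new DiscardingWriter();
        Timed<Long> extracted = time(() -> {
            sink.write(loaded.value);
            return Long.valueOf(sink.charsWritten());
        });

        System.out.println("load:    " + loaded.nanos + " ns");
        System.out.println("extract: " + extracted.nanos + " ns, "
                + extracted.value + " chars discarded");
    }
}
```

Counting the discarded characters is a cheap sanity check that the
extraction phase actually produced output.  For stable numbers, each phase
would normally be repeated several times after a warm-up run.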

PDFBox is still quite a bit slower than I'd hope, but this is already
a pretty good improvement.

PS. Mel, if you come up with other improvements, it'll be easier for
us to review and apply the changes if you submit them as patches
instead of full copies of the modified files. To create a patch, use
"svn diff" in your checkout.

BR,

Jukka Zitting

Re: [jira] performance issue flood

Posted by Philipp Koch <ph...@day.com>.

mel,
thanks a lot for your performance optimization contributions. what
factor of (overall) speedup is to be expected?

regards,
philipp
