Posted to users@pdfbox.apache.org by "Zhang, Lisheng" <Li...@BroadVision.com> on 2011/11/04 17:23:49 UTC
getText() performance in PDFBox 1.5 release
Hi,
I have been using PDFBox to extract text from PDF files for full text search for a few years,
and have found it to be a great product. Recently I downloaded PDFBox 1.5 and found that it can
extract text from many PDF files which could not be processed previously, thanks!!
The problem I have is that PDFTextStripper.getText(..) takes a long time to finish. For example,
our client has a 27MB PDF file which contains some graphics; getText(..) took 50 minutes to finish
even though it only extracted 100K of text in the end.
I tried changing the input parameters and the results are essentially the same. I would like to
know whether this speed is expected and whether it can be improved.
Thanks very much for any help, Lisheng
RE: getText() performance in PDFBox 1.5 release
Posted by "Zhang, Lisheng" <Li...@BroadVision.com>.
Hi,
With PDFBox 1.6 my problem is solved: the time is reduced to 33s
(with 1.5 it was 50m). The suppressDuplicateOverlappingText parameter
did not make much difference; I guess that is because my PDF does not
have much overlapping text (the resulting TXT is slightly different,
but not by much).
Thanks very much for the help!!!
Lisheng
-----Original Message-----
From: Zhang, Lisheng [mailto:Lisheng.Zhang@broadvision.com]
Sent: Friday, November 04, 2011 4:02 PM
To: users@pdfbox.apache.org
Subject: RE: getText() performance in PDFBox 1.5 release
Thanks very much for pointing that out!!!
I downloaded Tika 0.10 a few days ago, and the bundled CHANGES.txt did
not mention PDFBox 1.6. Based on that CHANGES.txt, I thought Tika
used 1.4.
I will download PDFBox 1.6 and retest.
Best regards, Lisheng
-----Original Message-----
From: Andreas Lehmkuehler [mailto:andreas@lehmi.de]
Sent: Friday, November 04, 2011 3:40 PM
To: users@pdfbox.apache.org
Subject: Re: getText() performance in PDFBox 1.5 release
Hi,
Am 04.11.2011 20:34, schrieb Zhang, Lisheng:
> Hi Mike,
>
> Thanks very much, I tested and result is the same, from source code
> it seems that suppressDuplicateOverlappingText parameter does not
> have effect if I call PDFTextStripper.getText(..) directly. I will
> check more to see if I can use method processEncodedText(..).
>
> Which version of PDFBox did you use (Tika has not used PDFBox 1.5 yet)?
According to [1], Tika 0.10 uses PDFBox 1.6, which includes some
performance improvements.
> Best regards, Lisheng
> <SNIP>
BR
Andreas Lehmkühler
[1] http://www.apache.org/dist/tika/CHANGES-0.10.txt
RE: getText() performance in PDFBox 1.5 release
Posted by "Zhang, Lisheng" <Li...@BroadVision.com>.
Thanks very much for pointing that out!!!
I downloaded Tika 0.10 a few days ago, and the bundled CHANGES.txt did
not mention PDFBox 1.6. Based on that CHANGES.txt, I thought Tika
used 1.4.
I will download PDFBox 1.6 and retest.
Best regards, Lisheng
Re: getText() performance in PDFBox 1.5 release
Posted by Andreas Lehmkuehler <an...@lehmi.de>.
Hi,
Am 04.11.2011 20:34, schrieb Zhang, Lisheng:
> Hi Mike,
>
> Thanks very much, I tested and result is the same, from source code
> it seems that suppressDuplicateOverlappingText parameter does not
> have effect if I call PDFTextStripper.getText(..) directly. I will
> check more to see if I can use method processEncodedText(..).
>
> Which version of PDFBox did you use (Tika has not used PDFBox 1.5 yet)?
According to [1], Tika 0.10 uses PDFBox 1.6, which includes some
performance improvements.
> Best regards, Lisheng
> <SNIP>
BR
Andreas Lehmkühler
[1] http://www.apache.org/dist/tika/CHANGES-0.10.txt
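When it is unclear which PDFBox version an application (or a bundle such as Tika) actually loads, one generic check is to ask the JVM where a class came from. The sketch below uses only standard Java calls; the class name ClasspathOrigin and the method describe are made up for illustration. It assumes the jar file name or its manifest carries the version, which may not hold for every build, so both values are reported and either can be missing.

```java
import java.net.URL;
import java.security.CodeSource;

class ClasspathOrigin {

    /** Describes where a class was loaded from and what version its jar manifest declares. */
    static String describe(Class<?> cls) {
        // The code source is null for classes loaded by the bootstrap loader.
        CodeSource source = cls.getProtectionDomain().getCodeSource();
        URL location = (source == null) ? null : source.getLocation();

        // Implementation-Version is read from the jar manifest and may be absent.
        Package pkg = cls.getPackage();
        String version = (pkg == null) ? null : pkg.getImplementationVersion();

        return cls.getName()
                + " loaded from " + (location == null ? "<unknown>" : location.toString())
                + ", manifest version " + (version == null ? "<not declared>" : version);
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Pass a class name such as org.apache.pdfbox.pdmodel.PDDocument when PDFBox
        // is on the classpath; without arguments this class inspects itself.
        String name = args.length > 0 ? args[0] : ClasspathOrigin.class.getName();
        System.out.println(describe(Class.forName(name)));
    }
}
```

Run with the PDFBox class name as the argument, the reported location should point at the jar that was really picked up, which is usually enough to tell a 1.5 jar from a 1.6 one.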
RE: getText() performance in PDFBox 1.5 release
Posted by "Zhang, Lisheng" <Li...@BroadVision.com>.
Hi Mike,
Thanks very much. I tested this and the result is the same. From the source code
it seems that the suppressDuplicateOverlappingText parameter has no
effect if I call PDFTextStripper.getText(..) directly. I will
check further to see whether I can use the method processEncodedText(..).
Which version of PDFBox did you use (Tika has not used PDFBox 1.5 yet)?
Best regards, Lisheng
-----Original Message-----
From: Michael McCandless [mailto:lucene@mikemccandless.com]
Sent: Friday, November 04, 2011 10:39 AM
To: users@pdfbox.apache.org
Subject: Re: getText() performance in PDFBox 1.5 release
Is it possible you're hitting this issue?
https://issues.apache.org/jira/browse/PDFBOX-956
Try setting suppressDuplicateOverlappingText to false and see if it
changes the extraction time?
Mike McCandless
http://blog.mikemccandless.com
On Fri, Nov 4, 2011 at 12:23 PM, Zhang, Lisheng
<Li...@broadvision.com> wrote:
> Hi,
>
> I have been using PDFBox to extract text from PDF files for full text search for a few years,
> and found it is a great product. Recently I downloaded PDFBox 1.5 and found that it can
> extract text from many PDF files which cannot be processed previously, thanks!!
>
> The problem I have is that it took long time for PDFTextStripper.getText(..) to finish, for example
> our client has a 27MB PDF file which contains some graphics, it took getText(..) 50m to finish
> even though it only extract 100K text eventually.
>
> I tried to change input parameters and results are same essentially, I would like to know if this
> speed is expected and the possibility to improve?
>
> Thanks very much for helps, Lisheng
>
Re: getText() performance in PDFBox 1.5 release
Posted by Michael McCandless <lu...@mikemccandless.com>.
Is it possible you're hitting this issue?
https://issues.apache.org/jira/browse/PDFBOX-956
Try setting suppressDuplicateOverlappingText to false and see if it
changes the extraction time?
Mike McCandless
http://blog.mikemccandless.com
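One way to check the suggestion above is to time getText(..) once with suppressDuplicateOverlappingText enabled and once with it disabled. The sketch below is only a minimal harness, not code from the PDFBox sources: the class name ExtractionTimer and its Timing type are made up for illustration, and a stand-in string replaces the real extraction so that it runs without PDFBox on the classpath. The commented lines show roughly where the PDFBox 1.x calls would go, assuming PDFTextStripper lives in org.apache.pdfbox.util as it does in the 1.x releases.

```java
import java.util.concurrent.Callable;

class ExtractionTimer {

    /** Result of one timed run: elapsed wall-clock time and length of the extracted text. */
    static final class Timing {
        final long elapsedMillis;
        final int textLength;

        Timing(long elapsedMillis, int textLength) {
            this.elapsedMillis = elapsedMillis;
            this.textLength = textLength;
        }
    }

    /** Runs the given extraction once and measures how long it takes. */
    static Timing time(Callable<String> extraction) throws Exception {
        long start = System.nanoTime();
        String text = extraction.call();
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000L;
        return new Timing(elapsedMillis, text.length());
    }

    public static void main(String[] args) throws Exception {
        for (boolean suppress : new boolean[] {true, false}) {
            // Stand-in extraction (assumption, for illustration only).
            // With PDFBox 1.x the body would look roughly like:
            //   PDDocument doc = PDDocument.load(new File(path));
            //   try {
            //       PDFTextStripper stripper = new PDFTextStripper();
            //       stripper.setSuppressDuplicateOverlappingText(suppress);
            //       return stripper.getText(doc);
            //   } finally {
            //       doc.close();
            //   }
            Timing t = time(() -> "stand-in text, suppress=" + suppress);
            System.out.println("suppress=" + suppress
                    + " elapsedMillis=" + t.elapsedMillis
                    + " textLength=" + t.textLength);
        }
    }
}
```

A single run per setting only gives a rough comparison; for a file that takes minutes to process, that is normally enough to see whether the flag matters.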