Posted to users@pdfbox.apache.org by "Allison, Timothy B." <ta...@mitre.org> on 2015/07/15 13:52:04 UTC
per page processing?
All,
Raymond Wu recently opened TIKA-1679 and recommended that we switch to per-page processing so that if there's an exception on one page, we'll still be able to extract contents from other pages.
The proposed fix is along these lines:
int nop = document.getNumberOfPages();
for (int i = 1; i <= nop; i++) {
    // a fresh PDF2XHTML per page, restricted to page i (page numbers are 1-based)
    PDF2XHTML pdf2XHTML = new PDF2XHTML(handler, metadata,
            extractAnnotationText, enableAutoSpace,
            suppressDuplicateOverlappingText, sortByPosition);
    try {
        pdf2XHTML.setStartPage(i);
        pdf2XHTML.setEndPage(i);
        pdf2XHTML.writeText(document, dummyWriter);
    } catch (Exception e) {
        // TODO ...
    }
}
Does this seem reasonable? Any gut reaction/estimates on the performance hit? Perhaps we should make this mode configurable?
Thank you.
Best,
Tim
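On the question of making this mode configurable, here is a minimal sketch of how a flag could switch between "abort on the first failure" and "record the failure and keep going". It deliberately avoids the real Tika and PDFBox classes: PageExtractor, Result, extractAll and the perPage flag are hypothetical names invented for illustration, and the extractor simply stands in for the setStartPage/setEndPage/writeText sequence above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch only: none of these names exist in Tika or PDFBox. */
public class PerPageSketch {

    /** Hypothetical stand-in for extracting the text of one page (1-based). */
    interface PageExtractor {
        String extract(int page) throws Exception;
    }

    /** Text gathered from the good pages, plus failures keyed by page number. */
    static final class Result {
        final StringBuilder text = new StringBuilder();
        final Map<Integer, Exception> failures = new LinkedHashMap<>();
    }

    /**
     * Walks pages 1..pageCount. With perPage == true a failing page is
     * recorded and skipped; with perPage == false the first failure is
     * rethrown, which mimics whole-document processing.
     */
    static Result extractAll(int pageCount, PageExtractor extractor, boolean perPage)
            throws Exception {
        Result result = new Result();
        for (int page = 1; page <= pageCount; page++) {
            try {
                result.text.append(extractor.extract(page));
            } catch (Exception e) {
                if (!perPage) {
                    throw e; // non-per-page mode: one bad page aborts everything
                }
                result.failures.put(page, e);
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Simulated document in which page 2 cannot be parsed.
        PageExtractor flaky = page -> {
            if (page == 2) {
                throw new IllegalStateException("bad page");
            }
            return "page " + page + "\n";
        };
        Result r = extractAll(3, flaky, true);
        System.out.print(r.text);                                   // page 1, then page 3
        System.out.println("failed pages: " + r.failures.keySet()); // failed pages: [2]
    }
}
```

Keeping the failures in a map rather than swallowing them means the caller can still decide what to report, for example by attaching the failed page numbers to the metadata.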
RE: per page processing?
Posted by "Allison, Timothy B." <ta...@mitre.org>.
Onward. Thank you!
Re: per page processing?
Posted by John Hewson <jo...@jahewson.com>.
> On 15 Jul 2015, at 04:52, Allison, Timothy B. <ta...@mitre.org> wrote:
>
> All,
> Raymond Wu recently opened TIKA-1679 and recommended that we switch to per-page processing so that if there's an exception on one page, we'll still be able to extract contents from other pages.
>
> The proposed fix is along these lines:
>
> int nop = document.getNumberOfPages();
> for (int i = 1; i <= nop; i++) {
>     // a fresh PDF2XHTML per page, restricted to page i (page numbers are 1-based)
>     PDF2XHTML pdf2XHTML = new PDF2XHTML(handler, metadata,
>             extractAnnotationText, enableAutoSpace,
>             suppressDuplicateOverlappingText, sortByPosition);
>     try {
>         pdf2XHTML.setStartPage(i);
>         pdf2XHTML.setEndPage(i);
>         pdf2XHTML.writeText(document, dummyWriter);
>     } catch (Exception e) {
>         // TODO ...
>     }
> }
>
> Does this seem reasonable? Any gut reaction/estimates on the performance hit? Perhaps we should make this mode configurable?
>
Looks fine to me; a quick look at the source of PDFTextStripper doesn’t indicate any performance issues.
— John
> Thank you.
>
> Best,
>
> Tim
---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org