Posted to users@pdfbox.apache.org by "Allison, Timothy B." <ta...@mitre.org> on 2015/07/15 13:52:04 UTC

per page processing?

All,
  Raymond Wu recently opened TIKA-1679 and recommended that we switch to per-page processing so that if there's an exception on one page, we'll still be able to extract contents from other pages.

  The proposed fix is along these lines:

            int nop = document.getNumberOfPages();
            // PDFTextStripper page numbers are 1-based and inclusive
            for (int i = 1; i <= nop; i++) {
                PDF2XHTML pdf2XHTML = new PDF2XHTML(handler, metadata,
                        extractAnnotationText, enableAutoSpace,
                        suppressDuplicateOverlappingText, sortByPosition);
                try {
                    // restrict this pass to page i only
                    pdf2XHTML.setStartPage(i);
                    pdf2XHTML.setEndPage(i);
                    pdf2XHTML.writeText(document, dummyWriter);
                } catch (Exception e) {
                    // TODO ...
                }
            }

  Does this seem reasonable?  Any gut reaction/estimates on the performance hit?  Perhaps we should make this mode configurable?
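
  To make the "configurable" idea concrete, here is a rough, library-free sketch of the control flow only. Everything in it is hypothetical and not part of Tika or PDFBox: the PageExtractor interface stands in for one single-page pass like the one above, and the continueOnError flag stands in for whatever config option we might add. With the flag off, the first failing page aborts the whole run (roughly today's behaviour); with it on, failed pages are recorded and skipped.

```java
import java.util.ArrayList;
import java.util.List;

class PerPageSketch {

    /** Hypothetical stand-in for one single-page extraction pass. */
    @FunctionalInterface
    interface PageExtractor {
        String extract(int pageNumber) throws Exception;
    }

    /** Text of the pages that succeeded, plus the numbers of the pages that failed. */
    static final class Result {
        final List<String> texts = new ArrayList<>();
        final List<Integer> failedPages = new ArrayList<>();
    }

    /**
     * Runs the extractor once per page, using 1-based page numbers.
     * continueOnError is an assumed config switch, not an existing option.
     */
    static Result extractPerPage(int numberOfPages, PageExtractor extractor,
                                 boolean continueOnError) {
        Result result = new Result();
        for (int page = 1; page <= numberOfPages; page++) {
            try {
                result.texts.add(extractor.extract(page));
            } catch (Exception e) {
                if (!continueOnError) {
                    // strict mode: one bad page fails the whole document
                    throw new RuntimeException("page " + page + " failed", e);
                }
                // lenient mode: remember the bad page and move on
                result.failedPages.add(page);
            }
        }
        return result;
    }
}
```

  Collecting the failed page numbers, rather than swallowing the exception, would at least let us report which pages were skipped.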

Thank you.

             Best,

                        Tim

RE: per page processing?

Posted by "Allison, Timothy B." <ta...@mitre.org>.

Onward.  Thank you!

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org

Re: per page processing?

Posted by John Hewson <jo...@jahewson.com>.

> On 15 Jul 2015, at 04:52, Allison, Timothy B. <ta...@mitre.org> wrote:
> 
> All,
>  Raymond Wu recently opened TIKA-1679 and recommended that we switch to per-page processing so that if there's an exception on one page, we'll still be able to extract contents from other pages.
> 
>  The proposed fix is along these lines:
> 
>             int nop = document.getNumberOfPages();
>            for(int i=1;i<=nop;i++) {
>                PDF2XHTML pdf2XHTML = new PDF2XHTML(handler, metadata,
>                extractAnnotationText, enableAutoSpace,
>                suppressDuplicateOverlappingText, sortByPosition);
>                try {
>                    pdf2XHTML.setStartPage(i);
>                    pdf2XHTML.setEndPage(i);
>                    pdf2XHTML.writeText(document, dummyWriter);
>                } catch(Exception e) {
>                    // TODO ...
>                }
> 
>  Does this seem reasonable?  Any gut reaction/estimates on the performance hit?  Perhaps we should make this mode configurable?
> 

Looks fine to me; a quick look at the source of PDFTextStripper doesn’t indicate any performance issues.

— John

