Posted to dev@tika.apache.org by "Michael McCandless (Updated) (JIRA)" <ji...@apache.org> on 2011/10/04 12:32:33 UTC

[jira] [Updated] (TIKA-742) PDF2XHTML fails to insert <p> nor space around page marker

     [ https://issues.apache.org/jira/browse/TIKA-742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated TIKA-742:
------------------------------------

    Attachment: 000086.pdf

PDF doc showing the issue (unfortunately not committable).
                
> PDF2XHTML fails to insert <p> nor space around page marker
> ----------------------------------------------------------
>
>                 Key: TIKA-742
>                 URL: https://issues.apache.org/jira/browse/TIKA-742
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 1.0
>
>         Attachments: 000086.pdf
>
>
> I have a test document (unfortunately not committable) whose page
> numbers are rendered with no separator (neither <p> nor a space)
> before the next word.  So I get words like:
>   * 1Massachusetts
>   * 2Course
>   * 3also
>   * 4The
> But when I ran the ExtractText -html command-line tool from PDFBox, I
> could see that <p> is inserted after these page numbers (spookily,
> without closing the previous <p>; I opened PDFBOX-1130 for that).
> So I made a simple change to Tika's PDF2XHTML: it now overrides
> writeStart/EndParagraph and calls handler.start/EndElement("p"), i.e.
> it passes the paragraph structure that PDFBox detects through to the
> resulting XHTML handler. This fixes the issue (I now see the page
> number as a separate paragraph, rendered with a newline in "text"
> mode from TikaCLI).
> Note that this test document is the same document from PDFBOX-1129
> (there are some quote characters that are not extracted correctly).
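
For illustration only, here is a minimal, self-contained sketch of the idea
described above: forward the paragraph boundaries that the stripper detects
to the SAX handler as <p> elements. It is not the actual patch. The classes
StripperStandIn, ParagraphForwardingStripper, TextModeHandler and
ParagraphSketch, and the hook names writeParagraphStart/writeParagraphEnd,
are hypothetical stand-ins so that the example runs without PDFBox or Tika;
it uses the plain org.xml.sax API rather than Tika's own XHTML handler.

```java
import java.util.Arrays;
import java.util.List;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for the PDFBox text stripper: it walks paragraphs
// that are assumed to be already detected and calls overridable hooks.
class StripperStandIn {
    void process(List<String> paragraphs) throws SAXException {
        for (String paragraph : paragraphs) {
            writeParagraphStart();
            writeString(paragraph);
            writeParagraphEnd();
        }
    }
    protected void writeParagraphStart() throws SAXException {}
    protected void writeParagraphEnd() throws SAXException {}
    protected void writeString(String text) throws SAXException {}
}

// Sketch of the described change: the paragraph hooks are overridden to
// emit <p> start/end events on the handler. The flag only exists so the
// "before" behaviour can be shown with the same class.
class ParagraphForwardingStripper extends StripperStandIn {
    private static final String XHTML = "http://www.w3.org/1999/xhtml";
    private final ContentHandler handler;
    private final boolean emitParagraphs;

    ParagraphForwardingStripper(ContentHandler handler, boolean emitParagraphs) {
        this.handler = handler;
        this.emitParagraphs = emitParagraphs;
    }
    @Override protected void writeParagraphStart() throws SAXException {
        if (emitParagraphs) handler.startElement(XHTML, "p", "p", new AttributesImpl());
    }
    @Override protected void writeParagraphEnd() throws SAXException {
        if (emitParagraphs) handler.endElement(XHTML, "p", "p");
    }
    @Override protected void writeString(String text) throws SAXException {
        handler.characters(text.toCharArray(), 0, text.length());
    }
}

// Minimal "text mode" consumer: appends characters and writes a newline
// whenever a <p> element is closed.
class TextModeHandler extends DefaultHandler {
    final StringBuilder out = new StringBuilder();
    @Override public void characters(char[] ch, int start, int length) {
        out.append(ch, start, length);
    }
    @Override public void endElement(String uri, String localName, String qName) {
        if ("p".equals(localName)) out.append('\n');
    }
}

class ParagraphSketch {
    // Renders the paragraphs to text, with or without <p> events.
    static String render(List<String> paragraphs, boolean emitParagraphs) {
        TextModeHandler handler = new TextModeHandler();
        try {
            new ParagraphForwardingStripper(handler, emitParagraphs).process(paragraphs);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return handler.out.toString();
    }

    public static void main(String[] args) {
        List<String> paragraphs = Arrays.asList("1", "Massachusetts");
        // Without <p> events the page number runs into the next word.
        System.out.println(render(paragraphs, false));
        // With <p> events each paragraph ends with a newline.
        System.out.print(render(paragraphs, true));
    }
}
```

Without the paragraph events the sketch yields "1Massachusetts", matching
the symptom reported above; with them, "1" and "Massachusetts" land on
separate lines.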

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira