Posted to dev@tika.apache.org by "Gregory Kanevsky (JIRA)" <ji...@apache.org> on 2010/08/26 17:30:54 UTC

[jira] Commented: (TIKA-100) Structured PDF parsing

    [ https://issues.apache.org/jira/browse/TIKA-100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902897#action_12902897 ] 

Gregory Kanevsky commented on TIKA-100:
---------------------------------------

This issue seems to be partially fixed. PDF2XHTML generates <div><p> and </p></div> to start and end each page. 

A related part of this issue is the ordering of PDF content. PDF2XHTML extends PDFBox's PDFTextStripper to extract text. By default (for performance reasons), 'sortByPosition' mode is turned off in PDFTextStripper.

I propose introducing an input metadata property that would turn it on when desired. I am not sure what the conventions are for defining such metadata properties (if any), though. The mode would be set in the PDF2XHTML constructor:

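private PDF2XHTML(ContentHandler handler, Metadata metadata)
            throws IOException {

        // Constant-first comparison avoids a NullPointerException
        // when the property is not set (metadata.get returns null).
        if ("true".equalsIgnoreCase(metadata.get("setSortByPosition"))) {
                setSortByPosition(true);
        }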

        ....
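
To show the intended behaviour in isolation, here is a minimal, self-contained sketch. It is not Tika or PDFBox code: a plain Map stands in for Metadata, FakeStripper is a hypothetical stand-in for PDFTextStripper that only tracks the flag, and the property name "setSortByPosition" is just the one used in the snippet above, not an established convention. The point is only that the flag stays off unless the property is present and equal to "true" (ignoring case).

```java
import java.util.HashMap;
import java.util.Map;

public class SortByPositionSketch {
    // Property name taken from the snippet above; not an established Tika convention.
    static final String SORT_BY_POSITION = "setSortByPosition";

    // Hypothetical stand-in for the text stripper: it only tracks the flag.
    static class FakeStripper {
        private boolean sortByPosition = false; // off by default, as described above

        void setSortByPosition(boolean value) {
            sortByPosition = value;
        }

        boolean isSortByPosition() {
            return sortByPosition;
        }
    }

    // Creates a stripper and turns sorting on only when the property is
    // present and equals "true" (case-insensitive). A missing property
    // yields null from the map, which the constant-first comparison handles.
    static FakeStripper configure(Map<String, String> metadata) {
        FakeStripper stripper = new FakeStripper();
        if ("true".equalsIgnoreCase(metadata.get(SORT_BY_POSITION))) {
            stripper.setSortByPosition(true);
        }
        return stripper;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        // Property absent: the default (off) is kept.
        System.out.println(configure(metadata).isSortByPosition());

        // Property set by the caller: sorting is switched on.
        metadata.put(SORT_BY_POSITION, "True");
        System.out.println(configure(metadata).isSortByPosition());
    }
}
```

A caller that does not set the property pays no extra cost, so the existing default behaviour is unchanged.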

> Structured PDF parsing
> ----------------------
>
>                 Key: TIKA-100
>                 URL: https://issues.apache.org/jira/browse/TIKA-100
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Jukka Zitting
>            Priority: Minor
>
> The PDF parser currently extracts and outputs document content as a single string. PDFBox could be used to support structuring at least down to page and paragraph (not sure how accurate) level.
