Posted to dev@pdfbox.apache.org by "Maruan Sahyoun (JIRA)" <ji...@apache.org> on 2013/12/03 18:10:36 UTC

[jira] [Commented] (PDFBOX-1792) Metadata not completely extracted with NonSequentialPDFParser on some documents

    [ https://issues.apache.org/jira/browse/PDFBOX-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13837907#comment-13837907 ] 

Maruan Sahyoun commented on PDFBOX-1792:
----------------------------------------

Could you share with us, or point us to the source showing, how you do the extraction? Using the ExtractText command line tool, both parser options produce the same result: the text within the annotation is not extracted.

In addition, the following code

        PDDocument document = PDDocument.loadNonSeq(new File("testAnnotations.pdf"), null);
        PDDocumentInformation docInfo = document.getDocumentInformation();
        PDDocumentCatalog catalog = document.getDocumentCatalog();
        List<PDAnnotation> la = ((PDPage)catalog.getAllPages().get(0)).getAnnotations();
        String annotationText = la.get(0).getContents();

gives you the same content using the NonSequentialPDFParser and the 'classic' parser, i.e. 'Here is a comment'.

All tests were done using pdfbox-1.8.3.

BR
Maruan 

> Metadata not completely extracted with NonSequentialPDFParser on some documents
> -------------------------------------------------------------------------------
>
>                 Key: PDFBOX-1792
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1792
>             Project: PDFBox
>          Issue Type: Bug
>          Components: PDModel
>    Affects Versions: 1.8.3
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: PDFBOX-1792.tar.gz
>
>
> The traditional parser is able to extract metadata from the Annotation test document from TIKA-738.  The NonSequentialPDFParser is not able to extract metadata.


