Posted to dev@tika.apache.org by "Nick Burch (JIRA)" <ji...@apache.org> on 2010/06/30 13:50:50 UTC

[jira] Created: (TIKA-452) Extract custom pdf metadata

Extract custom pdf metadata
---------------------------

                 Key: TIKA-452
                 URL: https://issues.apache.org/jira/browse/TIKA-452
             Project: Tika
          Issue Type: Improvement
          Components: parser
    Affects Versions: 0.7
            Reporter: Nick Burch
            Assignee: Nick Burch
            Priority: Minor
             Fix For: 0.8


While PDF files can contain custom metadata, we currently don't extract it.

Other parsers already extract their formats' custom metadata, and PDFBox makes the custom metadata available (in a not too nasty way), so the PDF parser should do the same.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (TIKA-452) Extract custom pdf metadata

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883900#action_12883900 ] 

Nick Burch commented on TIKA-452:
---------------------------------

Jeremias - thanks for the PDF. Unfortunately, I'm not seeing the custom metadata come through :( It only seems to have the normal metadata entries:
 COSName{Author} Author
 COSName{Creator} Creator
 COSName{CreationDate} CreationDate
 COSName{ModDate} ModDate
 COSName{Producer} Producer
 COSName{Title} Title
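
A minimal sketch of how "normal" entries like the ones above could be told apart from custom ones, assuming the Info dictionary has already been read into a plain map of key to value. The class name, the method name and the myCustomData key are hypothetical, and this is not the code committed to Tika; it only illustrates the idea of treating every key outside the standard set as custom metadata.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class CustomInfoKeys {
    // Standard Info dictionary keys defined by the PDF specification.
    static final Set<String> STANDARD_KEYS = Set.of(
            "Title", "Author", "Subject", "Keywords",
            "Creator", "Producer", "CreationDate", "ModDate", "Trapped");

    // Returns only the entries whose key is not a standard one,
    // preserving the order in which they appear in the input map.
    static Map<String, String> customEntries(Map<String, String> info) {
        Map<String, String> custom = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : info.entrySet()) {
            if (!STANDARD_KEYS.contains(e.getKey())) {
                custom.put(e.getKey(), e.getValue());
            }
        }
        return custom;
    }

    public static void main(String[] args) {
        Map<String, String> info = new LinkedHashMap<>();
        info.put("Author", "Example Author");
        info.put("Title", "Example Title");
        info.put("myCustomData", "some value"); // hypothetical custom key
        System.out.println(customEntries(info));
    }
}
```

For a file that only carries the six entries listed above, customEntries would return an empty map, which matches what I'm seeing with the sample PDF.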



[jira] Commented: (TIKA-452) Extract custom pdf metadata

Posted by "Jeremias Maerki (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883894#action_12883894 ] 

Jeremias Maerki commented on TIKA-452:
--------------------------------------

I can generate a small PDF for you with Apache FOP, if you want.


[jira] Resolved: (TIKA-452) Extract custom pdf metadata

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Burch resolved TIKA-452.
-----------------------------

    Resolution: Fixed

Feature added in r959275.

However, no unit test exists for this yet, because the only file I have with custom metadata in it is much too large. It would be good to find a small PDF with some custom metadata in it for testing.


[jira] Commented: (TIKA-452) Extract custom pdf metadata

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883917#action_12883917 ] 

Nick Burch commented on TIKA-452:
---------------------------------

I used PDFBox to add some custom metadata to your sample file, then used that as the basis for a test. Committed in r959305. That file is now also in SVN for testing future XMP-related metadata extraction enhancements.


[jira] Commented: (TIKA-452) Extract custom pdf metadata

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883915#action_12883915 ] 

Nick Burch commented on TIKA-452:
---------------------------------

Sounds like there's another kind of metadata we should be extracting too then!


[jira] Updated: (TIKA-452) Extract custom pdf metadata

Posted by "Jeremias Maerki (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeremias Maerki updated TIKA-452:
---------------------------------

    Attachment: xmp-example.pdf

Here's a small example (8KB) with custom XMP metadata made with FOP.


[jira] Commented: (TIKA-452) Extract custom pdf metadata

Posted by "Jeremias Maerki (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883905#action_12883905 ] 

Jeremias Maerki commented on TIKA-452:
--------------------------------------

Ah, I see, you were talking about the old Info dictionary. I thought this was about XMP metadata. In that case, I can't really help because I can't add custom metadata to the Info dictionary with FOP.
