Posted to dev@tika.apache.org by "Jukka Zitting (JIRA)" <ji...@apache.org> on 2010/04/10 19:16:41 UTC

[jira] Created: (TIKA-402) Support for Keynote and Pages documents

Support for Keynote and Pages documents
---------------------------------------

                 Key: TIKA-402
                 URL: https://issues.apache.org/jira/browse/TIKA-402
             Project: Tika
          Issue Type: New Feature
          Components: parser
            Reporter: Jukka Zitting


It would be nice to have support for documents created by Apple's Keynote and Pages applications. Both file formats are described in http://developer.apple.com/mac/library/documentation/AppleApplications/Conceptual/iWork2-0_XML/Chapter01/Introduction.html. I'm not sure if there already are open source parser libraries for these formats or if we'd need to directly process the XML content.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Updated: (TIKA-402) Support for Keynote and Pages documents

Posted by "Martijn van Groningen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Martijn van Groningen updated TIKA-402:
---------------------------------------

    Attachment: iwork.patch

I updated the patch and refactored it a bit, introducing an extractor for each format, following the approach I saw in the MS Office parser. Currently only Keynote has a working extractor; support for the Pages and Numbers formats will follow shortly.
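
As a rough illustration of the extractor-per-format layout described above, here is a minimal Java sketch. It is not taken from iwork.patch: the class, enum, and interface names are hypothetical, and the Keynote extractor is a placeholder that only strips tags rather than interpreting the real XML.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical sketch of one extractor per iWork format; not the code from iwork.patch. */
final class ExtractorDispatchSketch {

    /** The three formats mentioned in the thread. */
    enum Format { KEYNOTE, PAGES, NUMBERS }

    /** Assumed extractor contract: turn a format's XML into plain text. */
    interface Extractor {
        String extract(String xml);
    }

    private static final Map<Format, Extractor> EXTRACTORS = new EnumMap<>(Format.class);

    static {
        // Placeholder logic only: replaces tags with spaces instead of interpreting Keynote XML.
        EXTRACTORS.put(Format.KEYNOTE, xml -> xml.replaceAll("<[^>]*>", " ").trim());
        // No entries for PAGES and NUMBERS, mirroring "only Keynote has a working extractor".
    }

    /** Looks up the extractor registered for the format and applies it. */
    static String extract(Format format, String xml) {
        Extractor extractor = EXTRACTORS.get(format);
        if (extractor == null) {
            throw new UnsupportedOperationException("No extractor yet for " + format);
        }
        return extractor.extract(xml);
    }

    public static void main(String[] args) {
        System.out.println(extract(Format.KEYNOTE, "<slide><title>Hi</title></slide>"));
    }
}
```

Keeping the format-specific logic behind one small interface means a new format only needs a new map entry, and the calling parser does not change.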

> Support for Keynote and Pages documents
> ---------------------------------------
>
>                 Key: TIKA-402
>                 URL: https://issues.apache.org/jira/browse/TIKA-402
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>         Attachments: iwork.patch, iwork.patch, testKeynote.key
>
>
> It would be nice to have support for documents created by Apple's Keynote and Pages applications. Both file formats are described in http://developer.apple.com/mac/library/documentation/AppleApplications/Conceptual/iWork2-0_XML/Chapter01/Introduction.html. I'm not sure if there already are open source parser libraries for these formats or if we'd need to directly process the XML content.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (TIKA-402) Support for Keynote and Pages documents

Posted by "Martijn van Groningen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Martijn van Groningen updated TIKA-402:
---------------------------------------

    Attachment: iwork.patch
                testKeynote.key

I couldn't find a Java library that parses Keynote presentations, so I have made an initial patch that does. It is a work in progress, and I was hoping to get some feedback. The attached presentation was created with Keynote version 5 but uses Keynote format version 2.x.

The patch is working; I have tested it via the Tika CLI. The patch also includes two tests, one for parsing and one for auto-detection.

I have attached the test file separately, because binary files can't be included in a patch. The Keynote file should be placed in the test-documents package in the parsers module's resource directory.

Older Keynote format versions (1.x) are not supported yet, because the format is different. Also, if I remember correctly, a file in that older format is a directory rather than a compressed file. Support for Pages is not yet included.
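
For readers unfamiliar with the compressed layout, the following Java sketch shows one way to pull the main XML entry out of a zip-based package using only java.util.zip. It is an assumption-laden illustration, not the patch's code: the entry name index.apxl and all class and method names are assumptions, and the demo builds a tiny archive in memory instead of reading a real .key file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical sketch: locate the main XML entry inside a zip-based package. */
final class KeynoteZipSketch {

    // Assumption: the main XML entry of a Keynote 2.x package is named index.apxl.
    static final String INDEX_ENTRY = "index.apxl";

    /** Returns the bytes of the index entry, or null if no such entry is found. */
    static byte[] readIndexEntry(InputStream in) {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (INDEX_ENTRY.equals(entry.getName())) {
                    ByteArrayOutputStream out = new ByteArrayOutputStream();
                    byte[] buffer = new byte[4096];
                    int n;
                    // read() returns -1 at the end of the current entry
                    while ((n = zip.read(buffer)) != -1) {
                        out.write(buffer, 0, n);
                    }
                    return out.toByteArray();
                }
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a one-entry zip archive in memory, used here only for the demo. */
    static byte[] zipOf(String name, String content) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry(name));
            zip.write(content.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] pkg = zipOf(INDEX_ENTRY, "<presentation/>");
        byte[] xml = readIndexEntry(new ByteArrayInputStream(pkg));
        System.out.println(new String(xml, StandardCharsets.UTF_8));
    }
}
```

A directory-based package would need a different code path (reading the XML file from the file system), which is consistent with 1.x support being left out for now.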

> Support for Keynote and Pages documents
> ---------------------------------------
>
>                 Key: TIKA-402
>                 URL: https://issues.apache.org/jira/browse/TIKA-402
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>         Attachments: iwork.patch, testKeynote.key
>
>
> It would be nice to have support for documents created by Apple's Keynote and Pages applications. Both file formats are described in http://developer.apple.com/mac/library/documentation/AppleApplications/Conceptual/iWork2-0_XML/Chapter01/Introduction.html. I'm not sure if there already are open source parser libraries for these formats or if we'd need to directly process the XML content.



[jira] Updated: (TIKA-402) Support for Keynote and Pages documents

Posted by "Martijn van Groningen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Martijn van Groningen updated TIKA-402:
---------------------------------------

    Attachment: iwork.patch
                testPages.pages

I've updated the patch to include Pages support. The parser extracts all metadata and content from a Pages document. Every page has a div, and each paragraph is added as a p element. Table data is extracted as well; I've put it inside an XHTML table (<table><tr><td>...).
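
The output structure described above (a div per page, a p per paragraph, table data in table/tr/td) can be sketched with plain SAX events from the JDK. This is only an illustration under stated assumptions: the class and method names are hypothetical, the body wrapper element and the placement of the table after the pages are arbitrary choices, and the input is a simple in-memory model rather than a parsed Pages file.

```java
import java.io.StringWriter;
import java.util.Arrays;
import java.util.List;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

/** Hypothetical sketch: emit pages and table data as XHTML-style SAX events. */
final class PagesXhtmlSketch {

    private static void start(ContentHandler h, String name) throws SAXException {
        h.startElement("", name, name, new AttributesImpl());
    }

    private static void end(ContentHandler h, String name) throws SAXException {
        h.endElement("", name, name);
    }

    /** Writes one element that contains only text. */
    private static void textElement(ContentHandler h, String name, String text) throws SAXException {
        start(h, name);
        h.characters(text.toCharArray(), 0, text.length());
        end(h, name);
    }

    /** One div per page, one p per paragraph. */
    static void emitPage(ContentHandler h, List<String> paragraphs) throws SAXException {
        start(h, "div");
        for (String paragraph : paragraphs) {
            textElement(h, "p", paragraph);
        }
        end(h, "div");
    }

    /** One tr per row, one td per cell. */
    static void emitTable(ContentHandler h, List<List<String>> rows) throws SAXException {
        start(h, "table");
        for (List<String> row : rows) {
            start(h, "tr");
            for (String cell : row) {
                textElement(h, "td", cell);
            }
            end(h, "tr");
        }
        end(h, "table");
    }

    /** Serializes the events to a string; the body wrapper is an arbitrary choice. */
    static String render(List<List<String>> pages, List<List<String>> table) {
        try {
            SAXTransformerFactory factory = (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            handler.setResult(new StreamResult(out));
            handler.startDocument();
            start(handler, "body");
            for (List<String> page : pages) {
                emitPage(handler, page);
            }
            emitTable(handler, table);
            end(handler, "body");
            handler.endDocument();
            return out.toString();
        } catch (SAXException | TransformerConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(render(
                Arrays.asList(Arrays.asList("First paragraph", "Second paragraph")),
                Arrays.asList(Arrays.asList("A1", "B1"))));
    }
}
```

Writing against ContentHandler rather than building strings keeps the emitting code independent of where the events end up, which matches how Tika parsers hand their output to a caller-supplied handler.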

The code needs some cleanup and some more documentation. Numbers support will follow soon.

> Support for Keynote and Pages documents
> ---------------------------------------
>
>                 Key: TIKA-402
>                 URL: https://issues.apache.org/jira/browse/TIKA-402
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>         Attachments: iwork.patch, iwork.patch, iwork.patch, testKeynote.key, testPages.pages
>
>
> It would be nice to have support for documents created by Apple's Keynote and Pages applications. Both file formats are described in http://developer.apple.com/mac/library/documentation/AppleApplications/Conceptual/iWork2-0_XML/Chapter01/Introduction.html. I'm not sure if there already are open source parser libraries for these formats or if we'd need to directly process the XML content.
