Posted to dev@tika.apache.org by "Jukka Zitting (JIRA)" <ji...@apache.org> on 2010/09/08 20:11:32 UTC

[jira] Updated: (TIKA-509) Container contents extraction

     [ https://issues.apache.org/jira/browse/TIKA-509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting updated TIKA-509:
-------------------------------

    Attachment: 0001-TIKA-509-Container-contents-extraction.patch

I'm not too excited about the idea of introducing a completely new mechanism in parallel with the Parser API we already have. As far as I understand it, the Parser API already supports all the functionality you're looking for.

See the attached patch that copies the embedded document handling code from the POIFSContainerExtractor class to our existing OfficeParser implementation, and adds a generic ParserContainerExtractor class that implements the ContainerExtractor interface based on our existing Parser and Detector APIs.

This solution passes all the current test cases (see the modifications I made to POIFSContainerExtractorTest), implements the embedded document support asked for in TIKA-489, and as a bonus gives you ContainerExtractor support for all the package formats (zip, tar, cpio, etc.) that we already have parsers for.

> Container contents extraction
> -----------------------------
>
>                 Key: TIKA-509
>                 URL: https://issues.apache.org/jira/browse/TIKA-509
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 0.7
>            Reporter: Nick Burch
>            Assignee: Nick Burch
>            Priority: Minor
>         Attachments: 0001-TIKA-509-Container-contents-extraction.patch
>
>
> As discussed on the mailing list:
> http://mail-archives.apache.org/mod_mbox/tika-dev/201009.mbox/%3Calpine.DEB.1.10.1009010000250.5637@urchin.earth.li%3E
> This service will operate in a push mode, using streaming where possible (not all container formats will support that). Users can control recursion, and will be given the chance to process each embedded file in turn. It's up to them whether they process a file or skip it.
> It will work similarly to the current Parser code, with each container format having its own extractor in the parsers package and the interface defined in the core package. There will be an Auto extractor in the core package, configured with a list of parser extractors, just as AutoDetectParser is configured with a list of parsers.
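
To make the push model described above concrete, here is a minimal sketch that uses only java.util.zip. It is not the Tika API and not the code in the attached patch: the names PushExtractorSketch, EntryHandler, extract and demo are hypothetical, and the ".zip" suffix check is only a stand-in for real type detection. It shows the extractor pushing each embedded entry to a caller-supplied handler, with a flag that controls recursion into nested containers.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class PushExtractorSketch {

    /** Hypothetical callback, invoked once per embedded entry. */
    interface EntryHandler {
        /** May read the stream or ignore it; must not close it. */
        void handle(String name, InputStream content) throws IOException;
    }

    /** Streams through a zip container and pushes each entry to the handler. */
    public static void extract(InputStream container, boolean recurse,
                               EntryHandler handler) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(container)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                String name = entry.getName();
                // Assumption: a name suffix stands in for real type detection.
                if (recurse && name.endsWith(".zip")) {
                    // Buffer the nested archive so that closing the nested
                    // reader does not close the outer stream.
                    extract(new ByteArrayInputStream(readAll(zip)), true, handler);
                } else {
                    // Unread bytes are skipped by the next getNextEntry() call.
                    handler.handle(name, zip);
                }
            }
        }
    }

    /** Reads the rest of the current stream (here: one zip entry) into memory. */
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    /** Builds an in-memory zip archive from name/content pairs. */
    static byte[] zip(Map<String, byte[]> entries) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            for (Map.Entry<String, byte[]> e : entries.entrySet()) {
                out.putNextEntry(new ZipEntry(e.getKey()));
                out.write(e.getValue());
                out.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    /** Extracts a two-level test archive and returns the entry names seen. */
    public static List<String> demo(boolean recurse) throws IOException {
        Map<String, byte[]> inner = new LinkedHashMap<>();
        inner.put("inner.txt", "hello".getBytes(StandardCharsets.UTF_8));
        Map<String, byte[]> outer = new LinkedHashMap<>();
        outer.put("a.txt", "top".getBytes(StandardCharsets.UTF_8));
        outer.put("nested.zip", zip(inner));

        List<String> seen = new ArrayList<>();
        // This handler only records names and never reads the content,
        // which corresponds to the "skip it" case.
        extract(new ByteArrayInputStream(zip(outer)), recurse,
                (name, content) -> seen.add(name));
        return seen;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(demo(true));
        System.out.println(demo(false));
    }
}
```

With recursion enabled, the handler in this sketch sees the entries of the nested archive instead of the archive itself. With recursion disabled, the nested archive is handed over as an ordinary entry, and the caller decides what to do with it.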
