Posted to dev@tika.apache.org by "Nick Burch (JIRA)" <ji...@apache.org> on 2010/09/08 19:11:32 UTC

[jira] Commented: (TIKA-509) Container contents extraction

    [ https://issues.apache.org/jira/browse/TIKA-509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907318#action_12907318 ] 

Nick Burch commented on TIKA-509:
---------------------------------

Initial work committed in r995157.

This commit is a rough cut of the interfaces and key classes, so everyone can review them now that there's some concrete code to look at. One change is still required, though: an open InputStream should not be passed in on every firing of the handler. Instead, the handler should be given a chance to decline the embedded resource first, before the extractor potentially does a lot of work to generate or unpack it.
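
To make the proposed change concrete, here is a minimal sketch of a handler that is asked first and only handed a stream if it accepts. All the names here (EmbeddedResourceHandler, accept, handle, EmbeddedEntry, SketchExtractor) are made up for illustration and are not the interfaces committed in r995157; the in-memory entry just stands in for whatever expensive unpacking a real extractor would do.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

// Hypothetical handler interface: names are assumptions, not the committed API.
interface EmbeddedResourceHandler {
    // Asked first, with cheap metadata only. Return false to skip the resource.
    boolean accept(String filename, String mediaType);

    // Called only for accepted resources, with a freshly opened stream.
    void handle(String filename, String mediaType, InputStream stream) throws IOException;
}

// Stand-in for one embedded resource; open() models the expensive unpack step.
final class EmbeddedEntry {
    final String filename;
    final String mediaType;
    private final byte[] data;
    int timesOpened = 0; // counts how often the "expensive" step ran

    EmbeddedEntry(String filename, String mediaType, byte[] data) {
        this.filename = filename;
        this.mediaType = mediaType;
        this.data = data;
    }

    InputStream open() {
        timesOpened++;
        return new ByteArrayInputStream(data);
    }
}

// Hypothetical extractor loop: offer each entry to the handler before unpacking it.
final class SketchExtractor {
    static void extract(List<EmbeddedEntry> entries, EmbeddedResourceHandler handler)
            throws IOException {
        for (EmbeddedEntry entry : entries) {
            if (!handler.accept(entry.filename, entry.mediaType)) {
                continue; // declined, so the resource is never opened
            }
            try (InputStream stream = entry.open()) {
                handler.handle(entry.filename, entry.mediaType, stream);
            }
        }
    }
}
```

With this shape, a handler that only cares about Word documents would return false for an image, and the extractor would never open that entry at all.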

Also in the commit is a basic POIFS extractor. It only handles nested embedded office files (no images), and only for Word and Excel. However, it should give an idea of how things may work.

The config for the auto-detector isn't wired in yet; that'll be done once we have further extractors.

> Container contents extraction
> -----------------------------
>
>                 Key: TIKA-509
>                 URL: https://issues.apache.org/jira/browse/TIKA-509
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 0.7
>            Reporter: Nick Burch
>            Assignee: Nick Burch
>            Priority: Minor
>
> As discussed on the mailing list:
> http://mail-archives.apache.org/mod_mbox/tika-dev/201009.mbox/%3Calpine.DEB.1.10.1009010000250.5637@urchin.earth.li%3E
> This service will operate in a push mode, using streaming where possible (not all container formats will support that). Users can control recursion, and will be given the chance to process each embedded file in turn. It's up to them whether they process a file or skip it.
> It will work similarly to the current Parser code, with each container having its own extractor in the parsers package, and the interface defined in the core package. There will be an Auto extractor in the core package, configured with a list of parser extractors, just as AutoDetectParser is.
