You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@tika.apache.org by "Nick Burch (JIRA)" <ji...@apache.org> on 2010/09/08 15:50:33 UTC

[jira] Created: (TIKA-509) Container contents extraction

Container contents extraction
-----------------------------

                 Key: TIKA-509
                 URL: https://issues.apache.org/jira/browse/TIKA-509
             Project: Tika
          Issue Type: New Feature
          Components: parser
    Affects Versions: 0.7
            Reporter: Nick Burch
            Assignee: Nick Burch
            Priority: Minor


As discussed on the mailing list:
http://mail-archives.apache.org/mod_mbox/tika-dev/201009.mbox/%3Calpine.DEB.1.10.1009010000250.5637@urchin.earth.li%3E

This service will operate in a push mode, using streaming where possible (not all container formats will support that). Users can control recursion, and will be given the chance to process each embeded file in turn. It's up to them if they process a file or skip it.

It will work similar to the current Parser code, with each container having its own extractor in the parsers package, and the interface defined in the core package. There will be an Auto extractor in the core package, configured with a list of parser extractors just like AutoDetectParser does.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (TIKA-509) Container contents extraction

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907318#action_12907318 ] 

Nick Burch commented on TIKA-509:
---------------------------------

Initial work committed in r995157.

This commit shows a rough cut of the interfaces and key classes, so everyone can review them now there's some concrete code to look at. One change required though is to not pass an open inputstream in for every firing of the handler. Instead, the handler should be given a chance to decline the embeded resource first, before the extractor potentially does lots of work to generate/unpack the resource

Also in the commit is a basic POIFS extractor. It only does nested embeded office files, no images, and only for word+excel. However, it should give an idea of how things may work

The config for the auto-detector isn't wired in yet, that'll be done once we have further extractors.

> Container contents extraction
> -----------------------------
>
>                 Key: TIKA-509
>                 URL: https://issues.apache.org/jira/browse/TIKA-509
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 0.7
>            Reporter: Nick Burch
>            Assignee: Nick Burch
>            Priority: Minor
>
> As discussed on the mailing list:
> http://mail-archives.apache.org/mod_mbox/tika-dev/201009.mbox/%3Calpine.DEB.1.10.1009010000250.5637@urchin.earth.li%3E
> This service will operate in a push mode, using streaming where possible (not all container formats will support that). Users can control recursion, and will be given the chance to process each embeded file in turn. It's up to them if they process a file or skip it.
> It will work similar to the current Parser code, with each container having its own extractor in the parsers package, and the interface defined in the core package. There will be an Auto extractor in the core package, configured with a list of parser extractors just like AutoDetectParser does.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (TIKA-509) Container contents extraction

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12908155#action_12908155 ] 

Jukka Zitting commented on TIKA-509:
------------------------------------

In revision 995946 I started drafting a possible solution for the selective extraction feature.

> Container contents extraction
> -----------------------------
>
>                 Key: TIKA-509
>                 URL: https://issues.apache.org/jira/browse/TIKA-509
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 0.7
>            Reporter: Nick Burch
>            Assignee: Nick Burch
>            Priority: Minor
>         Attachments: 0001-TIKA-509-Container-contents-extraction.patch
>
>
> As discussed on the mailing list:
> http://mail-archives.apache.org/mod_mbox/tika-dev/201009.mbox/%3Calpine.DEB.1.10.1009010000250.5637@urchin.earth.li%3E
> This service will operate in a push mode, using streaming where possible (not all container formats will support that). Users can control recursion, and will be given the chance to process each embeded file in turn. It's up to them if they process a file or skip it.
> It will work similar to the current Parser code, with each container having its own extractor in the parsers package, and the interface defined in the core package. There will be an Auto extractor in the core package, configured with a list of parser extractors just like AutoDetectParser does.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (TIKA-509) Container contents extraction

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907633#action_12907633 ] 

Nick Burch commented on TIKA-509:
---------------------------------

Interface rename makes sense to me, I've done that

Not sure about the best way to do the strategy object, but hopefully someone else will do! :)

Recursion was broken in ParserContainerExtractor - I've re-enabled this, but I'm not sure if it's the best way to handle it or not? See r995438 for how I've made it work for now.

Next up, images in word!

> Container contents extraction
> -----------------------------
>
>                 Key: TIKA-509
>                 URL: https://issues.apache.org/jira/browse/TIKA-509
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 0.7
>            Reporter: Nick Burch
>            Assignee: Nick Burch
>            Priority: Minor
>         Attachments: 0001-TIKA-509-Container-contents-extraction.patch
>
>
> As discussed on the mailing list:
> http://mail-archives.apache.org/mod_mbox/tika-dev/201009.mbox/%3Calpine.DEB.1.10.1009010000250.5637@urchin.earth.li%3E
> This service will operate in a push mode, using streaming where possible (not all container formats will support that). Users can control recursion, and will be given the chance to process each embeded file in turn. It's up to them if they process a file or skip it.
> It will work similar to the current Parser code, with each container having its own extractor in the parsers package, and the interface defined in the core package. There will be an Auto extractor in the core package, configured with a list of parser extractors just like AutoDetectParser does.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (TIKA-509) Container contents extraction

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907599#action_12907599 ] 

Nick Burch commented on TIKA-509:
---------------------------------

Jukka - your patch looks good, just thought I'd check a few things

Are you thinking that the ContainerExtractor interface and ContainerEmbededResourceHandler will remain? And that this would be the way for people who's main interest is getting at the embeded documents to work? 

In terms of the parser / ParserContainerExtractor, are you thinking that we should try to make the container related Parsers call the nested parser from the ParseContext? The only issue with that I can see is that it isn't then possible for for users to say "I don't want that file, don't bother doing lots of work to extract it". Admittedly the ContainerExtractor doesn't support that either, but I had an idea for how to do that, and I can't easily see how that'd fit in with parsers

I'll apply your patch shortly, then carry on with my work on making the office format embeded resources available, but using your new pattern

> Container contents extraction
> -----------------------------
>
>                 Key: TIKA-509
>                 URL: https://issues.apache.org/jira/browse/TIKA-509
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 0.7
>            Reporter: Nick Burch
>            Assignee: Nick Burch
>            Priority: Minor
>         Attachments: 0001-TIKA-509-Container-contents-extraction.patch
>
>
> As discussed on the mailing list:
> http://mail-archives.apache.org/mod_mbox/tika-dev/201009.mbox/%3Calpine.DEB.1.10.1009010000250.5637@urchin.earth.li%3E
> This service will operate in a push mode, using streaming where possible (not all container formats will support that). Users can control recursion, and will be given the chance to process each embeded file in turn. It's up to them if they process a file or skip it.
> It will work similar to the current Parser code, with each container having its own extractor in the parsers package, and the interface defined in the core package. There will be an Auto extractor in the core package, configured with a list of parser extractors just like AutoDetectParser does.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (TIKA-509) Container contents extraction

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12908103#action_12908103 ] 

Nick Burch commented on TIKA-509:
---------------------------------

Support is now in place for .doc, .docx, .xls and .xlsx

There's a couple more office formats to support, and unit tests are needed for the general container formats which are supported via the PackageExtractor

> Container contents extraction
> -----------------------------
>
>                 Key: TIKA-509
>                 URL: https://issues.apache.org/jira/browse/TIKA-509
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 0.7
>            Reporter: Nick Burch
>            Assignee: Nick Burch
>            Priority: Minor
>         Attachments: 0001-TIKA-509-Container-contents-extraction.patch
>
>
> As discussed on the mailing list:
> http://mail-archives.apache.org/mod_mbox/tika-dev/201009.mbox/%3Calpine.DEB.1.10.1009010000250.5637@urchin.earth.li%3E
> This service will operate in a push mode, using streaming where possible (not all container formats will support that). Users can control recursion, and will be given the chance to process each embeded file in turn. It's up to them if they process a file or skip it.
> It will work similar to the current Parser code, with each container having its own extractor in the parsers package, and the interface defined in the core package. There will be an Auto extractor in the core package, configured with a list of parser extractors just like AutoDetectParser does.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (TIKA-509) Container contents extraction

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907319#action_12907319 ] 

Nick Burch commented on TIKA-509:
---------------------------------

Also on the config side of things, currently there's just a straight list of ContainerExtractors, and each one does its own mini-detection to decide if it's suitable. We may wish to change this later to work with MediaTypes as the Parsers do, however there are issues around how much extra work might be involved in the detection step - we don't want to duplicate the container parsing by forcing people to run it through a ContainerAwareDetector and then process the container again in a ContainerExtractor. Probably one to hold off deciding on until we have several extractors in place, when we'll have a better idea of how they might work and fit together?

> Container contents extraction
> -----------------------------
>
>                 Key: TIKA-509
>                 URL: https://issues.apache.org/jira/browse/TIKA-509
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 0.7
>            Reporter: Nick Burch
>            Assignee: Nick Burch
>            Priority: Minor
>
> As discussed on the mailing list:
> http://mail-archives.apache.org/mod_mbox/tika-dev/201009.mbox/%3Calpine.DEB.1.10.1009010000250.5637@urchin.earth.li%3E
> This service will operate in a push mode, using streaming where possible (not all container formats will support that). Users can control recursion, and will be given the chance to process each embeded file in turn. It's up to them if they process a file or skip it.
> It will work similar to the current Parser code, with each container having its own extractor in the parsers package, and the interface defined in the core package. There will be an Auto extractor in the core package, configured with a list of parser extractors just like AutoDetectParser does.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (TIKA-509) Container contents extraction

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting updated TIKA-509:
-------------------------------

    Attachment: 0001-TIKA-509-Container-contents-extraction.patch

I'm not too excited about the idea of introducing a completely new mechanism in parallel with the Parser API we already have. AFAIUI the Parser API already supports all the functionality you're looking for.

See the attached patch that copies the embedded document handling code from the POIFSContainerExtractor class to our existing OfficeParser implementation, and adds a generic ParserContainerExtractor class that implements the ContainerExtractor interface based on our existing Parser and Detector APIs.

This solution passes all the current test cases (see the modifications I made to POIFSContainerExtractorTest), implements the embedded document support asked for in TIKA-489, and as a bonus gives you ContainerExtractor support for all the package formats (zip, tar, cpio, etc.) that we already have parsers for.

> Container contents extraction
> -----------------------------
>
>                 Key: TIKA-509
>                 URL: https://issues.apache.org/jira/browse/TIKA-509
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 0.7
>            Reporter: Nick Burch
>            Assignee: Nick Burch
>            Priority: Minor
>         Attachments: 0001-TIKA-509-Container-contents-extraction.patch
>
>
> As discussed on the mailing list:
> http://mail-archives.apache.org/mod_mbox/tika-dev/201009.mbox/%3Calpine.DEB.1.10.1009010000250.5637@urchin.earth.li%3E
> This service will operate in a push mode, using streaming where possible (not all container formats will support that). Users can control recursion, and will be given the chance to process each embeded file in turn. It's up to them if they process a file or skip it.
> It will work similar to the current Parser code, with each container having its own extractor in the parsers package, and the interface defined in the core package. There will be an Auto extractor in the core package, configured with a list of parser extractors just like AutoDetectParser does.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (TIKA-509) Container contents extraction

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907419#action_12907419 ] 

Chris A. Mattmann commented on TIKA-509:
----------------------------------------

Wow, Jukka! +1!

> Container contents extraction
> -----------------------------
>
>                 Key: TIKA-509
>                 URL: https://issues.apache.org/jira/browse/TIKA-509
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 0.7
>            Reporter: Nick Burch
>            Assignee: Nick Burch
>            Priority: Minor
>         Attachments: 0001-TIKA-509-Container-contents-extraction.patch
>
>
> As discussed on the mailing list:
> http://mail-archives.apache.org/mod_mbox/tika-dev/201009.mbox/%3Calpine.DEB.1.10.1009010000250.5637@urchin.earth.li%3E
> This service will operate in a push mode, using streaming where possible (not all container formats will support that). Users can control recursion, and will be given the chance to process each embeded file in turn. It's up to them if they process a file or skip it.
> It will work similar to the current Parser code, with each container having its own extractor in the parsers package, and the interface defined in the core package. There will be an Auto extractor in the core package, configured with a list of parser extractors just like AutoDetectParser does.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (TIKA-509) Container contents extraction

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12908785#action_12908785 ] 

Nick Burch commented on TIKA-509:
---------------------------------

I think we're almost there. All office formats except .ppt are now supported, and we now have unit tests which show that the PackageParser based extraction was already worked as required (yey!)

> Container contents extraction
> -----------------------------
>
>                 Key: TIKA-509
>                 URL: https://issues.apache.org/jira/browse/TIKA-509
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 0.7
>            Reporter: Nick Burch
>            Assignee: Nick Burch
>            Priority: Minor
>         Attachments: 0001-TIKA-509-Container-contents-extraction.patch
>
>
> As discussed on the mailing list:
> http://mail-archives.apache.org/mod_mbox/tika-dev/201009.mbox/%3Calpine.DEB.1.10.1009010000250.5637@urchin.earth.li%3E
> This service will operate in a push mode, using streaming where possible (not all container formats will support that). Users can control recursion, and will be given the chance to process each embeded file in turn. It's up to them if they process a file or skip it.
> It will work similar to the current Parser code, with each container having its own extractor in the parsers package, and the interface defined in the core package. There will be an Auto extractor in the core package, configured with a list of parser extractors just like AutoDetectParser does.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (TIKA-509) Container contents extraction

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907607#action_12907607 ] 

Jukka Zitting commented on TIKA-509:
------------------------------------

Yes, I think the ContainerExtractor and ContainerEmbeddedResourceHandler (rename to EmbeddedResourceHandler?) interfaces should remain as they provide a much more convenient way to achieve this use case.

> [...] we should try to make the container related Parsers call the nested parser from the ParseContext?

Yes, that's the way the PackageParser was designed and how I'd like to see also other container formats handled.

> "I don't want that file, don't bother doing lots of work to extract it"

We could implement that with an optional strategy object that gets passed through the parse context along with the component parser.

> Container contents extraction
> -----------------------------
>
>                 Key: TIKA-509
>                 URL: https://issues.apache.org/jira/browse/TIKA-509
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 0.7
>            Reporter: Nick Burch
>            Assignee: Nick Burch
>            Priority: Minor
>         Attachments: 0001-TIKA-509-Container-contents-extraction.patch
>
>
> As discussed on the mailing list:
> http://mail-archives.apache.org/mod_mbox/tika-dev/201009.mbox/%3Calpine.DEB.1.10.1009010000250.5637@urchin.earth.li%3E
> This service will operate in a push mode, using streaming where possible (not all container formats will support that). Users can control recursion, and will be given the chance to process each embeded file in turn. It's up to them if they process a file or skip it.
> It will work similar to the current Parser code, with each container having its own extractor in the parsers package, and the interface defined in the core package. There will be an Auto extractor in the core package, configured with a list of parser extractors just like AutoDetectParser does.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.