Posted to dev@tika.apache.org by Nick Burch <ni...@alfresco.com> on 2010/09/01 11:54:20 UTC

Container Extractor?

Hi All

I've been thinking about extracting files from container formats (e.g. 
images in a .docx, pdfs in a zip file etc). Given the recent number of 
queries about embedded files and Tika lately, I was wondering if people 
thought this might be something worth adding as another part of Tika?

My idea is that you'd pass a container file to this "service". You'd also 
say if you wanted recursion, and which mime types interest you. The result 
would be, say, an iterator of input streams, which would probably also let 
you get the filenames and mime types where supported by the container.

Example uses would be:
* .doc file, non recursive, request image/png and image/jpeg
   gives you all the images in the word document
* .ppt file, recursive, request excel
   gives you excel files embedded in the powerpoint, and excel files embedded
   in the word documents embedded in the powerpoint
* .docx file, non recursive, request image/png
   treated as an OOXML file, not a plain zip file, and all png images
   from the magic embedded directory are returned.
* .zip file, recursive, request pdf
   gives you all pdf files anywhere in the zip
* .ogg file, non-recursive, request audio
   gives you the three different audio streams in your video file

You could pass the resultant input streams into the regular Tika parser if 
you wanted to process them, or even just save them into a directory 
if all you wanted was an extractor.
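
To make that a bit more concrete, here's a very rough sketch of how a 
consumer might drive such a service and hand the results to the normal Tika 
parsing chain. ContainerExtractor and EmbeddedResource are invented names - 
nothing like them exists yet - while AutoDetectParser, BodyContentHandler 
etc are the existing Tika classes:
====
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class ContainerDemo {
    // ContainerExtractor and EmbeddedResource are placeholders for the proposed API
    public void handleContainer(InputStream containerStream) throws Exception {
        ContainerExtractor extractor = new ContainerExtractor();
        // non-recursive, only interested in PDFs
        for (EmbeddedResource res : extractor.extract(containerStream, false, "application/pdf")) {
            InputStream stream = res.getInputStream();
            try {
                // hand each embedded file to the regular Tika parser
                BodyContentHandler handler = new BodyContentHandler();
                Metadata metadata = new Metadata();
                new AutoDetectParser().parse(stream, handler, metadata, new ParseContext());
                System.out.println(res.getFilename() + ": " + handler.toString());
            } finally {
                stream.close();
            }
        }
    }
}
====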

What do people think? Is this useful? Is this appropriate for Tika? If yes 
to these two, does the rough method signature sound sane?

Nick

PS I'm willing to do most of the coding on this if it's deemed suitable
    for Tika, but not for a few weeks probably, until Alfresco 3.4 is done

Re: Container Extractor?

Posted by Nick Burch <ni...@alfresco.com>.
On Wed, 1 Sep 2010, Andrzej Bialecki wrote:
>> I was thinking recursive could mean different things. For zip files, tar
>> files etc, it would probably just mean root directory vs descend into
>> all directories.
>
> There are no directories in these formats - it's just a flat namespace 
> that just happens to use the filesystem conventions. Java APIs for these 
> containers also provide only simple iterators. So I'm not sure if 
> there's any benefit to this distinction here... maybe provide a 
> FilenameFilter to control what path names to process?

OK, it looks like a simple directory-descent on/off flag isn't a great fit.

I guess we'll want to provide two ways to filter, one by filename (which 
is normally available), and one by mime type (which is sometimes 
available). Or I guess a callback of "do you want this one?" where we pass 
in all the information we have to hand. Any thoughts?
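
Something like this, perhaps (just a sketch - all of the names are invented):
====
// Invented names, purely to illustrate the "do you want this one?" callback idea
public interface EmbeddedResourceFilter {
    /**
     * @param filename the embedded file's name, or null if the container
     *                 doesn't record one
     * @param mimeType the embedded file's mime type, or null if unknown
     * @return true if the consumer wants the contents of this entry
     */
    boolean accept(String filename, String mimeType);
}
====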

> On the other hand I see a benefit in having an option to automatically 
> descend into embedded archives.

So we'd have some sort of filtering, plus a descend yes/no option? For a 
zip, the former exposes all files from all "directories", and the latter 
controls whether we descend into both embedded zips and other embedded 
containers like .doc? For a .docx, the former exposes all embedded files 
(but none of the OOXML file format internals), and the latter controls 
whether other embedded office documents are processed?

>> For OLE2, it would mean checking embedded documents of
>> embedded documents (normally but not always by means of descending into
>> child directories). Maybe there's a clearer name for this sort of thing?
>
> OLE2 is nothing special, it's the same with other archive types, you can 
> always have embedded archives within archives.

The OLE2 files aren't always so nice. Some store embedded files as 
directory entries, some stash them away in records...

Nick

Re: Container Extractor?

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2010-09-01 12:25, Nick Burch wrote:
> On Wed, 1 Sep 2010, Andrzej Bialecki wrote:
>> This would be very useful. We contemplated implementing something like
>> this in Nutch, to handle archives (jar/tar/zip/...), but having it in
>> Tika would be much better.
>
> I'd forgotten about tar, that's another one to handle... :)
>
>> Does recursive here mean that it would look into embedded zip files
>> too? Or that it would process all paths (since there is really no
>> hierarchy in zip files)?
>
> I was thinking recursive could mean different things. For zip files, tar
> files etc, it would probably just mean root directory vs descend into
> all directories.

There are no directories in these formats - it's just a flat namespace 
that just happens to use the filesystem conventions. Java APIs for these 
containers also provide only simple iterators. So I'm not sure if 
there's any benefit to this distinction here... maybe provide a 
FilenameFilter to control what path names to process?

On the other hand I see a benefit in having an option to automatically 
descend into embedded archives.

> For OLE2, it would mean checking embedded documents of
> embedded documents (normally but not always by means of descending into
> child directories). Maybe there's a clearer name for this sort of thing?

OLE2 is nothing special; it's the same with other archive types - you can 
always have embedded archives within archives. I think the following 
could be helpful (rough sketch below):

* a FilenameFilter to decide which paths to process
* a boolean "recursive" to specify that we want to descend into embedded 
archives, maybe with a list of interesting archive types?
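
In rough shape (names invented, FilenameFilter being the plain java.io one):

    // Just to illustrate the shape of the call - not a real Tika API
    public interface ContainerExtractor {
        List<InputStream> extract(InputStream container,
                                  FilenameFilter filter,  // which entry paths to keep
                                  boolean recursive)      // descend into embedded archives?
                throws IOException;
    }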

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Container Extractor?

Posted by Nick Burch <ni...@alfresco.com>.
On Wed, 1 Sep 2010, Andrzej Bialecki wrote:
> This would be very useful. We contemplated implementing something like 
> this in Nutch, to handle archives (jar/tar/zip/...), but having it in 
> Tika would be much better.

I'd forgotten about tar, that's another one to handle... :)

> Does recursive here mean that it would look into embedded zip files too? 
> Or that it would process all paths (since there is really no hierarchy 
> in zip files)?

I was thinking recursive could mean different things. For zip files, tar 
files etc, it would probably just mean root directory vs descend into all 
directories. For OLE2, it would mean checking embedded documents of embedded 
documents (normally but not always by means of descending into child 
directories). Maybe there's a clearer name for this sort of thing?

Nick

Re: Container Extractor?

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2010-09-01 11:54, Nick Burch wrote:
> Hi All
>
> I've been thinking about extracting files from container formats (e.g.
> images in a .docx, pdfs in a zip file etc). Given the recent number of
> queries about embedded files and Tika lately, I was wondering if people
> thought this might be something worth adding as another part of Tika?

This would be very useful. We contemplated implementing something like 
this in Nutch, to handle archives (jar/tar/zip/...), but having it in 
Tika would be much better.

> Example uses would be:
> * .doc file, non recursive, request image/png and image/jpeg
> gives you all the images in the word document
> * .ppt file, recursive, request excel
> gives you excel files embedded in the powerpoint, and excel files embedded
> in the word documents embedded in the powerpoint
> * .docx file, non recursive, request image/png
> treated as an OOXML file, not a plain zip file, and all png images
> from the magic embedded directory are returned.
> * .zip file, recursive, request pdf
> gives you all pdf files anywhere in the zip

Does recursive here mean that it would look into embedded zip files too? 
Or that it would process all paths (since there is really no hierarchy 
in zip files)?

> * .ogg file, non-recursive, request audio
> gives you the 3 different audio streams in your video file
>
> You could pass the resultant input streams into the regular tika parser
> if you wanted to process them, or even just save them into a directory
> if all you wanted was an extractor.
>
> What do people think? Is this useful? Is this appropriate for Tika? If
> yes to these two, does the rough method signature sound sane?


+1.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Container Extractor?

Posted by Nick Burch <ni...@alfresco.com>.
On Thu, 2 Sep 2010, Maxim Valyanskiy wrote:
> I think it is a good idea. I already have a POI-based part extractor for 
> office file formats. I can contribute some code once the API is done.

Excellent news :)

Now we just need to hammer out an API design that'll work for all the 
different use cases!

Nick

Re: Container Extractor?

Posted by Maxim Valyanskiy <ma...@jet.msk.su>.
  Hello!

On 01.09.2010 13:54, Nick Burch wrote:
> I've been thinking about extracting files from container formats (e.g. images in a 
> .docx, pdfs in a zip file etc). Given the recent number of queries about embedded 
> files and Tika lately, I was wondering if people thought this might be something 
> worth adding as another part of Tika?
>
> My idea is that you'd pass to this "service" a container file. You'd also say if 
> you wanted recursion, and which mime types interest you. The result would be say 
> an iterator of input stream, which would probably also let you get the filenames 
> and mime types where supported by the container.

I think it is a good idea. I already have a POI-based part extractor for office file 
formats. I can contribute some code once the API is done.

best wishes, Max

Re: Container Extractor?

Posted by Nick Burch <ni...@alfresco.com>.
On Wed, 1 Sep 2010, Ken Krugler wrote:
> http://lucene.472066.n3.nabble.com/Multiple-documents-per-input-stream-td647159.html#a647159
>
> I ran into a similar issue when trying to figure out how best to handle mbox 
> files.

Yeah, I guess we could optionally treat mbox (and .pst + similar) 
mailboxes as containers too.

My current thinking is that Tika should do roughly the right thing if 
people just throw it a document, but should allow finer-grained access to 
embedded resources for people who need control. Just FYI, at the moment my 
use case is to extract the embedded images out of uploaded .doc and .docx 
files, but I can see future requirements for parsing the metadata and text 
out of embedded documents via Tika too, so I want it to work for both :)

Nick

Re: Container Extractor?

Posted by Ken Krugler <kk...@transpac.com>.
Hi Nick,

Potentially interesting thread from about a year ago:

http://lucene.472066.n3.nabble.com/Multiple-documents-per-input-stream-td647159.html#a647159

I ran into a similar issue when trying to figure out how best to  
handle mbox files.

-- Ken

On Sep 1, 2010, at 2:54am, Nick Burch wrote:

> Hi All
>
> I've been thinking about extracting files from container formats (e.g.
> images in a .docx, pdfs in a zip file etc). Given the recent number
> of queries about embedded files and Tika lately, I was wondering if
> people thought this might be something worth adding as another part
> of Tika?
>
> My idea is that you'd pass to this "service" a container file. You'd  
> also say if you wanted recursion, and which mime types interest you.  
> The result would be say an iterator of input stream, which would  
> probably also let you get the filenames and mime types where  
> supported by the container.
>
> Example uses would be:
> * .doc file, non recursive, request image/png and image/jpeg
>  gives you all the images in the word document
> * .ppt file, recursive, request excel
>  gives you excel files embedded in the powerpoint, and excel files embedded
>  in the word documents embedded in the powerpoint
> * .docx file, non recursive, request image/png
>  treated as an OOXML file, not a plain zip file, and all png images
>  from the magic embedded directory are returned.
> * .zip file, recursive, request pdf
>  gives you all pdf files anywhere in the zip
> * .ogg file, non-recursive, request audio
>  gives you the 3 different audio streams in your video file
>
> You could pass the resultant input streams into the regular tika  
> parser if you wanted to process them, or even just save them into a  
> directory
> if all you wanted was an extractor.
>
> What do people think? Is this useful? Is this appropriate for Tika?  
> If yes to these two, does the rough method signature sound sane?
>
> Nick
>
> PS I'm willing to do most of the coding on this if it's deemed suitable
>    for Tika, but not for a few weeks probably, until Alfresco 3.4 is done

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: Container Extractor?

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Tue, Sep 7, 2010 at 12:39 PM, Nick Burch <ni...@alfresco.com> wrote:
> I'd see this as meaning that you pass in a TikaInputStream to the service,
> and a callback handler. If supported for the container, it will stream
> through the file, firing the callback handler as it goes. For most cases,
> the file will be buffered (to disk or memory as appropriate), the
> appropriate bits identified, and then the callback handler fired for each
> part.

Sounds good to me.

BR,

Jukka Zitting

Re: Container Extractor?

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hey Guys,

I've been following this discussion and one thing I'd like to add is that scientific data formats exhibit most of the properties that the container formats do as well. For instance, NetCDF does not support RandomAccess, and existing Java APIs to deal with those files require the full file to be available on disk in order to be loaded into the class methods for extracting information from the file. HDF is similar. So I'm going to follow this discussion a bit more closely now as I see it coming closer to a concrete idea! ;) I've been watching the TikaInputStream stuff that Jukka has been working on and I think that's a good starting point for addressing some of these issues.

Cheers,
Chris


On 9/7/10 3:39 AM, "Nick Burch" <ni...@alfresco.com> wrote:

On Tue, 7 Sep 2010, Jukka Zitting wrote:
> On Mon, Sep 6, 2010 at 1:19 PM, Nick Burch <ni...@alfresco.com> wrote:
>> Finally, pull vs push for the consumer.
>> [...]
>> I think the former would be a little bit more work for us, but is likely to
>> lead to cleaner and simpler code for consumers. What do people think?
>
> I'd start with a push mechanism as that supports streaming and is
> better in line with the current design of Tika.

OK, that seems sensible to me, we'll go for a push option where you
specify a callback helper that'll be triggered for each embedded file. It'd
then be up to you to decide if you wanted the contents or not, based on
the filename and/or mime type.

In terms of fully streaming approach though, I'm not sure how easy it'll
be. Reviewing the different container formats, the extent that they'll be
streamable vs need buffering is:
* Tar (+compressed) - can be streamed
* Ogg / Avi / etc - different parts of the file are interlaced. If we
    support streaming, the callbacks would need to handle being run
    in parallel, which might add too much complexity for users?
* OLE2 - can't be streamed, we're going to have to buffer the whole file,
    load it into POIFS, and only then start returning things
* Zip - we'll need to do (at least) two passes. The first pass we'll look
    at what files it contains, and use that to figure out if it's
    .docx, keynote, open office etc, or just plain zip. If it's a plain
    zip, 2nd pass will return each file in turn. If it's a zip-based
    document format, filetype-specific code will identify the embedded
    media for that format, and return each in turn.

I'd see this as meaning that you pass in a TikaInputStream to the service,
and a callback handler. If supported for the container, it will stream
through the file, firing the callback handler as it goes. For most cases,
the file will be buffered (to disk or memory as appropriate), the
appropriate bits identified, and then the callback handler fired for each
part.

Nick



++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
Phone: +1 (818) 354-8810
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Re: Container Extractor?

Posted by Nick Burch <ni...@alfresco.com>.
On Tue, 7 Sep 2010, Jukka Zitting wrote:
> On Mon, Sep 6, 2010 at 1:19 PM, Nick Burch <ni...@alfresco.com> wrote:
>> Finally, pull vs push for the consumer.
>> [...]
>> I think the former would be a little bit more work for us, but is likely to
>> lead to cleaner and simpler code for consumers. What do people think?
>
> I'd start with a push mechanism as that supports streaming and is
> better in line with the current design of Tika.

OK, that seems sensible to me, we'll go for a push option where you 
specify a callback helper that'll be triggered for each embedded file. It'd 
then be up to you to decide if you wanted the contents or not, based on 
the filename and/or mime type.

In terms of a fully streaming approach though, I'm not sure how easy it'll 
be. Reviewing the different container formats, the extent to which they'll 
be streamable vs needing buffering is:
* Tar (+compressed) - can be streamed
* Ogg / Avi / etc - different parts of the file are interlaced. If we
    support streaming, the callbacks would need to handle being run
    in parallel, which might add too much complexity for users?
* OLE2 - can't be streamed, we're going to have to buffer the whole file,
    load it into POIFS, and only then start returning things
* Zip - we'll need to do (at least) two passes. The first pass we'll look
    at what files it contains, and use that to figure out if it's
    .docx, keynote, open office etc, or just plain zip. If it's a plain
    zip, 2nd pass will return each file in turn. If it's a zip-based
    document format, filetype-specific code will identify the embedded
    media for that format, and return each in turn.

I'd see this as meaning that you pass in a TikaInputStream to the service, 
and a callback handler. If supported for the container, it will stream
through the file, firing the callback handler as it goes. For most cases, 
the file will be buffered (to disk or memory as appropriate), the
appropriate bits identified, and then the callback handler fired for each 
part.
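
As a rough illustration (ContainerExtractor and EmbeddedFileHandler are 
made-up names here; only TikaInputStream is an existing class):
====
// Sketch only - the container-side classes don't exist yet
TikaInputStream input = TikaInputStream.get(new File("report.docx"));
try {
    containerExtractor.extract(input, new EmbeddedFileHandler() {
        public void handle(String filename, String mimeType, InputStream contents) {
            // called once per embedded file; cheap to ignore entries you don't want
            if ("image/png".equals(mimeType)) {
                saveImage(filename, contents);   // hypothetical consumer code
            }
        }
    });
} finally {
    input.close();
}
====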

Nick

Re: Container Extractor?

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Mon, Sep 6, 2010 at 1:19 PM, Nick Burch <ni...@alfresco.com> wrote:
> Finally, pull vs push for the consumer.
> [...]
> I think the former would be a little bit more work for us, but is likely to
> lead to cleaner and simpler code for consumers. What do people think?

I'd start with a push mechanism as that supports streaming and is
better in line with the current design of Tika.

We can then add a pull layer on top of that either by using a
background thread like done by the ParsingReader class or by spooling
component data to temporary files or in-memory buffers when a
random-access backend is not available.
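
For the spooling variant, something like this could sit on top of the push
API (the container-side names are still hypothetical, as in Nick's sketch):

    // Hypothetical pull adapter: spool each embedded file to a temporary
    // file during the push pass, then let the caller iterate afterwards.
    // 'input' is the container's TikaInputStream, 'containerExtractor' as before.
    final List<File> spooled = new ArrayList<File>();
    containerExtractor.extract(input, new EmbeddedFileHandler() {
        public void handle(String filename, String mimeType, InputStream contents)
                throws IOException {
            File tmp = File.createTempFile("tika-embedded-", ".bin");
            OutputStream out = new FileOutputStream(tmp);
            try {
                byte[] buffer = new byte[8192];
                int n;
                while ((n = contents.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
            } finally {
                out.close();
            }
            spooled.add(tmp);
        }
    });
    // 'spooled' can now be walked pull-style, each file re-opened on demand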

BR,

Jukka Zitting

Re: Container Extractor?

Posted by Nick Burch <ni...@alfresco.com>.
On Wed, 1 Sep 2010, Nick Burch wrote:
> I've been thinking about extracting files from container formats (eg 
> images in a .docx, pdfs in a zip file etc).

I've been pondering the various feedback over the weekend, and hopefully 
now have a more detailed idea.

Firstly, the new service needs to work for both people who have the 
container file locally, and those streaming it remotely. Some container 
parsers may work better with input streams, some with files, so making the 
input contract be a TikaInputStream would seem to be the right way around 
this?

Next, how to control which child elements are returned. The container will 
usually know the embedded file name, but not always, and will often know 
the path details of it (e.g. /foo/bar.txt in a zip file). It may sometimes 
know the mime type. This seems to me too difficult to easily represent as 
a wish-list filter. So, I now think that probably the only way to work it 
is to offer all the details of every file to the consumer, and let them 
decide if they're interested or not. Ideally, the amount of work done by 
the container parser before the consumer decides they want an entry and 
asks for its contents should be minimised. (A filter wrapper can always be 
put around it as required.)

Nested embedded files - do we have a boolean flag for descend / don't 
descend, or do we pass that choice back to the consumer on a 
per-embedded-file basis, similar to above? I worry that the latter would 
make things too complicated and heavy-weight, so I'm leaning towards the 
simple boolean flag.

Finally, pull vs push for the consumer. The two forms would probably look 
something like:
====
Iterator<Embedded> embedded = containerExtractor.extract(inp, false);
while(embedded.hasNext()) {
   Embedded details = embedded.next();
   if("application/pdf".equals(details.getMimeType()) ||
      "pdf".equals(details.getSuffix())) {
        handlePDF(details.getInputStream());
   }
   if("/README.txt".equals(details.getFilename())) {
        handleREADME(details.getInputStream());
   }
}
====
containerExtractor.extract(inp, false, new EmbeddedHandler() {
    public void handle(String filename, String mimetype, InputStreamSource
                           futureInputStream) {
        if("application/pdf".equals(mimetype) ||
               (filename != null && filename.endsWith("pdf"))) {
            handlePDF(futureInputStream.getInputStream());
        }
    }
});
====

I think the former would be a little bit more work for us, but is likely 
to lead to cleaner and simpler code for consumers. What do people think?

Nick

Re: Container Extractor?

Posted by Nick Burch <ni...@alfresco.com>.
On Wed, 1 Sep 2010, Jukka Zitting wrote:
> The main complexity I see here is what the return values of such a
> service would look like, especially if you need to support cases where
> the container document is only available as an InputStream (i.e. no
> random access). Then you'd either need to use temporary files (or
> in-memory buffers) or a callback interface like this one:
>
>    public interface ComponentDocumentHandler {
>        void handleComponentDocument(
>            InputStream stream, Metadata metadata)
>            throws IOException, TikaException;
>    }

The issue is that for some file formats, we'll have to process the whole 
container anyway to do something useful. Even zip is problematic - we'll 
want to know if it's a plain .zip file, or a .docx file, or a Keynote 
file. That would potentially mean looking at all of the zip file's 
entries before we know whether we should expose every entry in the zip, or 
only the ones in certain special places. For the .docx case, we'll also 
need to look at the content types and rels entries to figure out the mime 
types, and potentially the real file names.

So, I think that if someone wants to use this service, they'll need to 
either have the file locally, or put up with buffering the whole thing in 
memory. Alas, I don't see this being a light-weight call.
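
That said, TikaInputStream should let us do that buffering lazily. Assuming 
it ends up with a getFile() style accessor (this is a sketch, not settled 
API), a container parser could do something like:
====
// NB: assumes TikaInputStream offers a getFile()-style accessor.
// Spool the incoming stream to a temporary file only when the format
// actually needs random access (zip central directory, OLE2, ...)
TikaInputStream tis = TikaInputStream.get(remoteStream);
try {
    File file = tis.getFile();   // triggers the spool-to-disk if needed
    // ... open 'file' with POIFS / zip handling and walk the entries ...
} finally {
    tis.close();
}
====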


In terms of linking it up with the Tika parser, I'm happy to go with 
whatever you suggest :)

Nick

Re: Container Extractor?

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Wed, Sep 1, 2010 at 11:54 AM, Nick Burch <ni...@alfresco.com> wrote:
> My idea is that you'd pass to this "service" a container file. You'd also
> say if you wanted recursion, and which mime types interest you. The result
> would be say an iterator of input stream, which would probably also let you
> get the filenames and mime types where supported by the container.

The main complexity I see here is what the return values of such a
service would look like, especially if you need to support cases where
the container document is only available as an InputStream (i.e. no
random access). Then you'd either need to use temporary files (or
in-memory buffers) or a callback interface like this one:

    public interface ComponentDocumentHandler {
        void handleComponentDocument(
            InputStream stream, Metadata metadata)
            throws IOException, TikaException;
    }

Such callbacks could be trivially produced by passing a custom Parser
instance through the ParseContext to the package parser. The custom
Parser class should have a parse() method like this:

    public void parse(
            InputStream stream, ContentHandler handler,
            Metadata metadata, ParseContext context)
            throws IOException, SAXException, TikaException {
        componentDocumentHandler.handleComponentDocument(stream, metadata);
    }
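
Hooking it up from the consumer side would then be roughly (assuming some 
EmbeddedDocumentExtractingParser class wrapping the handler, as sketched 
above; the rest is existing Tika API):

    // EmbeddedDocumentExtractingParser is a made-up name for the custom
    // Parser described above, wrapping a ComponentDocumentHandler
    ParseContext context = new ParseContext();
    context.set(Parser.class,
            new EmbeddedDocumentExtractingParser(componentDocumentHandler));

    // the package/container parser picks the Parser.class entry out of the
    // context and invokes it once per embedded document
    Parser parser = new AutoDetectParser();
    parser.parse(containerStream, new DefaultHandler(), new Metadata(), context);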

> What do people think? Is this useful? Is this appropriate for Tika? If yes
> to these two, does the rough method signature sound sane?

+1 to having something like this in Tika, as long as we can come up
with a clean API.

BR,

Jukka Zitting

Re: Container Extractor?

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
+1, Nick, this sounds great...

Cheers,
Chris



On 9/1/10 2:54 AM, "Nick Burch" <ni...@alfresco.com> wrote:

Hi All

I've been thinking about extracting files from container formats (e.g.
images in a .docx, pdfs in a zip file etc). Given the recent number of
queries about embedded files and Tika lately, I was wondering if people
thought this might be something worth adding as another part of Tika?

My idea is that you'd pass to this "service" a container file. You'd also
say if you wanted recursion, and which mime types interest you. The result
would be say an iterator of input stream, which would probably also let
you get the filenames and mime types where supported by the container.

Example uses would be:
* .doc file, non recursive, request image/png and image/jpeg
   gives you all the images in the word document
* .ppt file, recursive, request excel
   gives you excel files embedded in the powerpoint, and excel files embedded
   in the word documents embedded in the powerpoint
* .docx file, non recursive, request image/png
   treated as an OOXML file, not a plain zip file, and all png images
   from the magic embedded directory are returned.
* .zip file, recursive, request pdf
   gives you all pdf files anywhere in the zip
* .ogg file, non-recursive, request audio
   gives you the three different audio streams in your video file

You could pass the resultant input streams into the regular tika parser if
you wanted to process them, or even just save them into a directory
if all you wanted was an extractor.

What do people think? Is this useful? Is this appropriate for Tika? If yes
to these two, does the rough method signature sound sane?

Nick

PS I'm willing to do most of the coding on this if it's deemed suitable
    for Tika, but not for a few weeks probably, until Alfresco 3.4 is done



++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++