Posted to dev@tika.apache.org by Nick Burch <ni...@alfresco.com> on 2010/09/01 11:54:20 UTC
Container Extractor?
Hi All
I've been thinking about extracting files from container formats (e.g.
images in a .docx, PDFs in a zip file etc). Given the recent number of
queries about embedded files and Tika lately, I was wondering if people
thought this might be something worth adding as another part of Tika?
My idea is that you'd pass a container file to this "service". You'd also
say if you wanted recursion, and which mime types interest you. The result
would be, say, an iterator of input streams, which would probably also let
you get the filenames and mime types where supported by the container.
Example uses would be:
* .doc file, non-recursive, request image/png and image/jpeg
gives you all the images in the Word document
* .ppt file, recursive, request Excel
gives you Excel files embedded in the PowerPoint, and Excel files embedded
in the Word documents embedded in the PowerPoint
* .docx file, non-recursive, request image/png
treated as an OOXML file, not a plain zip file, and all PNG images
from the magic embedded directory are returned.
* .zip file, recursive, request PDF
gives you all PDF files anywhere in the zip
* .ogg file, non-recursive, request audio
gives you the 3 different audio streams in your video file
You could pass the resulting input streams into the regular Tika parser if
you wanted to process them, or even just save them into a directory
if all you wanted was an extractor.
What do people think? Is this useful? Is this appropriate for Tika? If yes
to these two, does the rough method signature sound sane?
Nick
PS I'm willing to do most of the coding on this if it's deemed suitable
for Tika, but not for a few weeks probably, until Alfresco 3.4 is done
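The iterator-of-streams idea above can be illustrated with nothing but the JDK's zip support. This is a hypothetical sketch, not an existing Tika API: the class and method names are invented, and matching on file suffixes stands in for real mime type detection.

```java
import java.io.*;
import java.util.*;
import java.util.zip.*;

// Hypothetical sketch of the proposed container service, zip-only and
// non-recursive: walk the container once and report the entries whose
// names match the requested suffixes.
public class ContainerSketch {
    public static List<String> matchingEntries(InputStream container,
                                               String... suffixes)
            throws IOException {
        List<String> matches = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(container)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                for (String suffix : suffixes) {
                    if (entry.getName().endsWith(suffix)) {
                        // A real extractor would hand the caller an
                        // InputStream for the entry, not just its name.
                        matches.add(entry.getName());
                        break;
                    }
                }
            }
        }
        return matches;
    }
}
```

A .docx opened this way with a ".png" filter would surface the images under its media directory, though a real implementation would treat OOXML specially rather than as a plain zip.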
Re: Container Extractor?
Posted by Nick Burch <ni...@alfresco.com>.
On Wed, 1 Sep 2010, Andrzej Bialecki wrote:
>> I was thinking recursive could mean different things. For zip files, tar
>> files etc, it would probably just mean root directory vs descend into
>> all directories.
>
> There are no directories in these formats - it's just a flat namespace
> that just happens to use the filesystem conventions. Java APIs for these
> containers also provide only simple iterators. So I'm not sure if
> there's any benefit to this distinction here... maybe provide a
> FilenameFilter to control what path names to process?
OK, looks like a directory descent on/off isn't a great fit.
I guess we'll want to provide two ways to filter: one by filename (which
is normally available), and one by mime type (which is sometimes
available). Or I guess a callback of "do you want this one?", where we pass
in all the information we have to hand. Any thoughts?
> On the other hand I see a benefit in having an option to automatically
> descend into embedded archives.
So we'd have some sort of filtering, and the descend yes/no option? For a
zip, the former exposes all files from all "directories", and the latter
will cause it to descend into both embedded zips, and embedded other
containers like .doc? For a .docx, the former exposes all embedded files
(but none of the OOXML file format stuff), and the latter controls whether
embedded other office documents are processed?
>> For OLE2, it would mean checking embedded documents of
>> embedded documents (normally but not always by means of descending into
>> child directories). Maybe there's a clearer name for this sort of thing?
>
> OLE2 is nothing special, it's the same with other archive types, you can
> always have embedded archives within archives.
The OLE2 files aren't always so nice. Some store embedded files as
directory entries, some stash them away in records...
Nick
Re: Container Extractor?
Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2010-09-01 12:25, Nick Burch wrote:
> On Wed, 1 Sep 2010, Andrzej Bialecki wrote:
>> This would be very useful. We contemplated implementing something like
>> this in Nutch, to handle archives (jar/tar/zip/...), but having it in
>> Tika would be much better.
>
> I'd forgotten about tar, that's another one to handle... :)
>
>> Does recursive here mean that it would look into embedded zip files
>> too? Or that it would process all paths (since there is really no
>> hierarchy in zip files)?
>
> I was thinking recursive could mean different things. For zip files, tar
> files etc, it would probably just mean root directory vs descend into
> all directories.
There are no directories in these formats - it's just a flat namespace
that just happens to use the filesystem conventions. Java APIs for these
containers also provide only simple iterators. So I'm not sure if
there's any benefit to this distinction here... maybe provide a
FilenameFilter to control what path names to process?
On the other hand I see a benefit in having an option to automatically
descend into embedded archives.
> For OLE2, it would mean checking embedded documents of
> embedded documents (normally but not always by means of descending into
> child directories). Maybe there's a clearer name for this sort of thing?
OLE2 is nothing special, it's the same with other archive types, you can
always have embedded archives within archives. I think the following
could be helpful:
* a FilenameFilter to decide what paths to process
* a boolean "recursive" to specify that we want to descend into embedded
archives, maybe with a list of interesting archive types?
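Andrzej's two suggestions could be combined into a small request object. A minimal sketch with hypothetical names (no such class exists in Tika), showing how a plain java.io.FilenameFilter can be applied to a container's flat entry paths:

```java
import java.io.*;

// Hypothetical request object combining the two controls suggested
// above: a FilenameFilter deciding which entry paths to process, and
// a flag saying whether to descend into embedded archives.
public class ExtractionRequest {
    private final FilenameFilter filter;
    private final boolean descendIntoEmbeddedArchives;

    public ExtractionRequest(FilenameFilter filter, boolean descend) {
        this.filter = filter;
        this.descendIntoEmbeddedArchives = descend;
    }

    // Entry paths in zip/tar are a flat namespace, so the path is
    // split into a notional directory and name for the filter's sake.
    public boolean accepts(String entryPath) {
        int slash = entryPath.lastIndexOf('/');
        File dir = new File(slash < 0 ? "" : entryPath.substring(0, slash));
        String name = slash < 0 ? entryPath : entryPath.substring(slash + 1);
        return filter.accept(dir, name);
    }

    public boolean isRecursive() {
        return descendIntoEmbeddedArchives;
    }
}
```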
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: Container Extractor?
Posted by Nick Burch <ni...@alfresco.com>.
On Wed, 1 Sep 2010, Andrzej Bialecki wrote:
> This would be very useful. We contemplated implementing something like
> this in Nutch, to handle archives (jar/tar/zip/...), but having it in
> Tika would be much better.
I'd forgotten about tar, that's another one to handle... :)
> Does recursive here mean that it would look into embedded zip files too?
> Or that it would process all paths (since there is really no hierarchy
> in zip files)?
I was thinking recursive could mean different things. For zip files, tar
files etc, it would probably just mean root directory vs descend into all
directories. For OLE2, it would mean checking embedded documents of embedded
documents (normally but not always by means of descending into child
directories). Maybe there's a clearer name for this sort of thing?
Nick
Re: Container Extractor?
Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2010-09-01 11:54, Nick Burch wrote:
> Hi All
>
> I've been thinking about extracting files from container formats (e.g.
> images in a .docx, PDFs in a zip file etc). Given the recent number of
> queries about embedded files and Tika lately, I was wondering if people
> thought this might be something worth adding as another part of Tika?
This would be very useful. We contemplated implementing something like
this in Nutch, to handle archives (jar/tar/zip/...), but having it in
Tika would be much better.
> Example uses would be:
> * .doc file, non-recursive, request image/png and image/jpeg
> gives you all the images in the Word document
> * .ppt file, recursive, request Excel
> gives you Excel files embedded in the PowerPoint, and Excel files embedded
> in the Word documents embedded in the PowerPoint
> * .docx file, non-recursive, request image/png
> treated as an OOXML file, not a plain zip file, and all PNG images
> from the magic embedded directory are returned.
> * .zip file, recursive, request PDF
> gives you all PDF files anywhere in the zip
Does recursive here mean that it would look into embedded zip files too?
Or that it would process all paths (since there is really no hierarchy
in zip files)?
> * .ogg file, non-recursive, request audio
> gives you the 3 different audio streams in your video file
>
> You could pass the resulting input streams into the regular Tika parser
> if you wanted to process them, or even just save them into a directory
> if all you wanted was an extractor.
>
> What do people think? Is this useful? Is this appropriate for Tika? If
> yes to these two, does the rough method signature sound sane?
+1.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: Container Extractor?
Posted by Nick Burch <ni...@alfresco.com>.
On Thu, 2 Sep 2010, Maxim Valyanskiy wrote:
> I think it is a good idea. I already have a POI-based part extractor for
> office file formats. I can contribute some code when the API is done.
Excellent news :)
Now we just need to hammer out an API design that'll work for all the
different use cases!
Nick
Re: Container Extractor?
Posted by Maxim Valyanskiy <ma...@jet.msk.su>.
Hello!
On 01.09.2010 13:54, Nick Burch wrote:
> I've been thinking about extracting files from container formats (e.g. images in a
> .docx, PDFs in a zip file etc). Given the recent number of queries about embedded
> files and Tika lately, I was wondering if people thought this might be something
> worth adding as another part of Tika?
>
> My idea is that you'd pass a container file to this "service". You'd also say if
> you wanted recursion, and which mime types interest you. The result would be, say,
> an iterator of input streams, which would probably also let you get the filenames
> and mime types where supported by the container.
I think it is a good idea. I already have a POI-based part extractor for office file
formats. I can contribute some code when the API is done.
best wishes, Max
Re: Container Extractor?
Posted by Nick Burch <ni...@alfresco.com>.
On Wed, 1 Sep 2010, Ken Krugler wrote:
> http://lucene.472066.n3.nabble.com/Multiple-documents-per-input-stream-td647159.html#a647159
>
> I ran into a similar issue when trying to figure out how best to handle mbox
> files.
Yeah, I guess we could optionally treat mbox (and .pst and similar)
mailboxes as containers too.
My current thinking is that Tika should do roughly the right thing if
people just throw it a document, but should allow finer-grained access to
embedded resources for people who need control. Just FYI, at the moment my
use case is to extract the embedded images out of uploaded .doc and .docx
files, but I can see future requirements for parsing the metadata and text
out of embedded documents via Tika too, so I want it to work for both :)
Nick
Re: Container Extractor?
Posted by Ken Krugler <kk...@transpac.com>.
Hi Nick,
Potentially interesting thread from about a year ago:
http://lucene.472066.n3.nabble.com/Multiple-documents-per-input-stream-td647159.html#a647159
I ran into a similar issue when trying to figure out how best to
handle mbox files.
-- Ken
On Sep 1, 2010, at 2:54am, Nick Burch wrote:
> Hi All
>
> I've been thinking about extracting files from container formats (e.g.
> images in a .docx, PDFs in a zip file etc). Given the recent number
> of queries about embedded files and Tika lately, I was wondering if
> people thought this might be something worth adding as another part
> of Tika?
>
> My idea is that you'd pass a container file to this "service". You'd
> also say if you wanted recursion, and which mime types interest you.
> The result would be, say, an iterator of input streams, which would
> probably also let you get the filenames and mime types where
> supported by the container.
>
> Example uses would be:
> * .doc file, non-recursive, request image/png and image/jpeg
> gives you all the images in the Word document
> * .ppt file, recursive, request Excel
> gives you Excel files embedded in the PowerPoint, and Excel files
> embedded
> in the Word documents embedded in the PowerPoint
> * .docx file, non-recursive, request image/png
> treated as an OOXML file, not a plain zip file, and all PNG images
> from the magic embedded directory are returned.
> * .zip file, recursive, request PDF
> gives you all PDF files anywhere in the zip
> * .ogg file, non-recursive, request audio
> gives you the 3 different audio streams in your video file
>
> You could pass the resulting input streams into the regular Tika
> parser if you wanted to process them, or even just save them into a
> directory if all you wanted was an extractor.
>
> What do people think? Is this useful? Is this appropriate for Tika?
> If yes to these two, does the rough method signature sound sane?
>
> Nick
>
> PS I'm willing to do most of the coding on this if it's deemed
> suitable
> for Tika, but not for a few weeks probably, until Alfresco 3.4 is
> done
--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g
Re: Container Extractor?
Posted by Jukka Zitting <ju...@gmail.com>.
Hi,
On Tue, Sep 7, 2010 at 12:39 PM, Nick Burch <ni...@alfresco.com> wrote:
> I'd see this as meaning that you pass in a TikaInputStream to the service,
> and a callback handler. If supported for the container, it will stream
> through the file, firing the callback handler as it goes. For most cases,
> the file will be buffered (to disk or memory as appropriate), the
> appropriate bits identified, and then the callback handler fired for each
> part.
Sounds good to me.
BR,
Jukka Zitting
Re: Container Extractor?
Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hey Guys,
I've been following this discussion, and one thing I'd like to add is that scientific data formats exhibit most of the properties of the container formats as well. For instance, NetCDF does not support random access, and the existing Java APIs for those files require the full file to be available on disk before its contents can be loaded and information extracted from it. HDF is similar. So I'm going to follow this discussion a bit more closely now, as I see it coming closer to a concrete idea! ;) I've been watching the TikaInputStream work that Jukka has been doing, and I think it's a good starting point for addressing some of these issues.
Cheers,
Chris
On 9/7/10 3:39 AM, "Nick Burch" <ni...@alfresco.com> wrote:
On Tue, 7 Sep 2010, Jukka Zitting wrote:
> On Mon, Sep 6, 2010 at 1:19 PM, Nick Burch <ni...@alfresco.com> wrote:
>> Finally, pull vs push for the consumer.
>> [...]
>> I think the former would be a little bit more work for us, but is likely to
>> lead to cleaner and simpler code for consumers. What do people think?
>
> I'd start with a push mechanism as that supports streaming and is
> better in line with the current design of Tika.
OK, that seems sensible to me; we'll go for a push option where you
specify a callback helper that'll be triggered for each embedded file. It'd
then be up to you to decide if you wanted the contents or not, based on
the filename and/or mime type.
In terms of a fully streaming approach though, I'm not sure how easy it'll
be. Reviewing the different container formats, the extent to which they'll
be streamable vs need buffering is:
* Tar (+compressed) - can be streamed
* Ogg / Avi / etc - different parts of the file are interleaved. If we
support streaming, the callbacks would need to handle being run
in parallel, which might add too much complexity for users?
* OLE2 - can't be streamed; we're going to have to buffer the whole file,
load it into POIFS, and only then start returning things
* Zip - we'll need to do (at least) two passes. On the first pass we'll look
at what files it contains, and use that to figure out if it's
.docx, Keynote, OpenOffice etc, or just a plain zip. If it's a plain
zip, the 2nd pass will return each file in turn. If it's a zip-based
document format, filetype-specific code will identify the embedded
media for that format, and return each in turn.
I'd see this as meaning that you pass in a TikaInputStream to the service,
and a callback handler. If supported for the container, it will stream
through the file, firing the callback handler as it goes. For most cases,
the file will be buffered (to disk or memory as appropriate), the
appropriate bits identified, and then the callback handler fired for each
part.
Nick
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
Phone: +1 (818) 354-8810
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Re: Container Extractor?
Posted by Nick Burch <ni...@alfresco.com>.
On Tue, 7 Sep 2010, Jukka Zitting wrote:
> On Mon, Sep 6, 2010 at 1:19 PM, Nick Burch <ni...@alfresco.com> wrote:
>> Finally, pull vs push for the consumer.
>> [...]
>> I think the former would be a little bit more work for us, but is likely to
>> lead to cleaner and simpler code for consumers. What do people think?
>
> I'd start with a push mechanism as that supports streaming and is
> better in line with the current design of Tika.
OK, that seems sensible to me; we'll go for a push option where you
specify a callback helper that'll be triggered for each embedded file. It'd
then be up to you to decide if you wanted the contents or not, based on
the filename and/or mime type.
In terms of a fully streaming approach though, I'm not sure how easy it'll
be. Reviewing the different container formats, the extent to which they'll
be streamable vs need buffering is:
* Tar (+compressed) - can be streamed
* Ogg / Avi / etc - different parts of the file are interleaved. If we
support streaming, the callbacks would need to handle being run
in parallel, which might add too much complexity for users?
* OLE2 - can't be streamed; we're going to have to buffer the whole file,
load it into POIFS, and only then start returning things
* Zip - we'll need to do (at least) two passes. On the first pass we'll look
at what files it contains, and use that to figure out if it's
.docx, Keynote, OpenOffice etc, or just a plain zip. If it's a plain
zip, the 2nd pass will return each file in turn. If it's a zip-based
document format, filetype-specific code will identify the embedded
media for that format, and return each in turn.
I'd see this as meaning that you pass in a TikaInputStream to the service,
and a callback handler. If supported for the container, it will stream
through the file, firing the callback handler as it goes. For most cases,
the file will be buffered (to disk or memory as appropriate), the
appropriate bits identified, and then the callback handler fired for each
part.
Nick
Re: Container Extractor?
Posted by Jukka Zitting <ju...@gmail.com>.
Hi,
On Mon, Sep 6, 2010 at 1:19 PM, Nick Burch <ni...@alfresco.com> wrote:
> Finally, pull vs push for the consumer.
> [...]
> I think the former would be a little bit more work for us, but is likely to
> lead to cleaner and simpler code for consumers. What do people think?
I'd start with a push mechanism as that supports streaming and is
better in line with the current design of Tika.
We can then add a pull layer on top of that either by using a
background thread like done by the ParsingReader class or by spooling
component data to temporary files or in-memory buffers when a
random-access backend is not available.
BR,
Jukka Zitting
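The background-thread variant of that pull layer can be sketched with a pipe. This is a simplified stand-in for the ParsingReader approach, not Tika code, and the push side here is reduced to writing a byte array:

```java
import java.io.*;

// Stand-in sketch of a pull layer over a push producer: a background
// thread plays the part of the push-style parser, writing component
// data into a pipe, while the consumer pulls from the other end.
public class PullAdapter {
    public static InputStream pull(byte[] pushedData) throws IOException {
        PipedInputStream in = new PipedInputStream();
        PipedOutputStream out = new PipedOutputStream(in);
        Thread producer = new Thread(() -> {
            try (OutputStream o = out) {
                // In the real thing, this is where the push callback
                // would be driven by the container parser.
                o.write(pushedData);
            } catch (IOException ignored) {
                // Reader closed early; nothing useful to do here.
            }
        });
        producer.start();
        return in;
    }
}
```

Closing the write end in the producer thread is what turns the consumer's reads into a clean end-of-stream rather than a broken pipe.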
Re: Container Extractor?
Posted by Nick Burch <ni...@alfresco.com>.
On Wed, 1 Sep 2010, Nick Burch wrote:
> I've been thinking about extracting files from container formats (e.g.
> images in a .docx, PDFs in a zip file etc).
I've been pondering the various feedback over the weekend, and hopefully
now have a more detailed idea.
Firstly, the new service needs to work both for people who have the
container file locally, and for those streaming it remotely. Some container
parsers may work better with input streams, some with files, so making the
input contract a TikaInputStream would seem to be the right way around
this?
Next, how to control which child elements are returned. The container will
usually know the embedded file name, but not always, and will often know
its path details (e.g. /foo/bar.txt in a zip file). It may sometimes
know the mime type. This seems to me too difficult to easily represent as
a wish-list filter. So, I now think that probably the only way to work it
is to offer all the details of every file to the consumer, and let them
decide if they're interested or not. Ideally, the amount of work done by
the container parser before the consumer decides they want an entry and
asks for its contents will be minimised. (A filter wrapper can always be
put around it as required.)
Nested embedded files - do we have a boolean flag for descend / don't
descend, or do we pass that choice back to the consumer for each embedded
file, similar to the above? I worry that the latter would make things too
complicated and heavy-weight, so I'm leaning towards the simple boolean
flag.
Finally, pull vs push for the consumer. The two forms would probably look
something like:
====
Iterator<Embedded> embedded = containerExtractor.extract(inp, false);
for (Embedded details : embedded) {
    if ("application/pdf".equals(details.getMimeType()) ||
            "pdf".equals(details.getSuffix())) {
        handlePDF(details.getInputStream());
    }
    if ("/README.txt".equals(details.getFilename())) {
        handleREADME(details.getInputStream());
    }
}
====
containerExtractor.extract(inp, false, new EmbeddedHandler() {
    public void handle(String filename, String mimetype,
                       InputStreamSource futureInputStream) {
        if ("application/pdf".equals(mimetype) ||
                (filename != null && filename.endsWith(".pdf"))) {
            handlePDF(futureInputStream.getInputStream());
        }
    }
});
====
I think the former would be a little bit more work for us, but is likely
to lead to cleaner and simpler code for consumers. What do people think?
Nick
Re: Container Extractor?
Posted by Nick Burch <ni...@alfresco.com>.
On Wed, 1 Sep 2010, Jukka Zitting wrote:
> The main complexity I see here is what the return values of such a
> service would look like, especially if you need to support cases where
> the container document is only available as an InputStream (i.e. no
> random access). Then you'd either need to use temporary files (or
> in-memory buffers) or a callback interface like this one:
>
> public interface ComponentDocumentHandler {
>     void handleComponentDocument(
>             InputStream stream, Metadata metadata)
>             throws IOException, TikaException;
> }
The issue is that for some file formats, we'll have to process the whole
container anyway to do something useful. Even zip is problematic - we'll
want to know if it's a plain .zip file, or a .docx file, or a Keynote
file. That would potentially mean looking at all of the zip file's
entries before we know whether we should expose every entry in the zip, or
only ones in certain special places. For the .docx case, we'll also need
to look at the content types and rels entries to figure out the mime types,
and potentially the real file names.
So I think that if someone wants to use this service, they'll need to
either have the file locally, or put up with buffering the whole thing in
memory. Alas, I don't see this being a light-weight call.
In terms of linking it up with the tika parser, I'm happy to go with
whatever you suggest :)
Nick
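The zip identification problem above can be sketched as a first pass over the entry names. OOXML packages always contain a [Content_Types].xml entry, so its presence is one cheap signal; real detection would check more markers (e.g. for Keynote or OpenDocument), and this sketch is not Tika code:

```java
import java.io.*;
import java.util.zip.*;

// First-pass sketch: scan the zip's entry names to decide whether it
// is an OOXML document or just a plain zip, before deciding which
// entries to expose. This consumes the stream, so the caller needs
// the file locally or buffered for the second pass.
public class ZipKindSniffer {
    public static String sniff(InputStream zipData) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(zipData)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                // Every OOXML package carries a content-types part.
                if ("[Content_Types].xml".equals(entry.getName())) {
                    return "ooxml";
                }
            }
        }
        return "plain-zip";
    }
}
```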
Re: Container Extractor?
Posted by Jukka Zitting <ju...@gmail.com>.
Hi,
On Wed, Sep 1, 2010 at 11:54 AM, Nick Burch <ni...@alfresco.com> wrote:
> My idea is that you'd pass to this "service" a container file. You'd also
> say if you wanted recursion, and which mime types interest you. The result
> would be say an iterator of input stream, which would probably also let you
> get the filenames and mime types where supported by the container.
The main complexity I see here is what the return values of such a
service would look like, especially if you need to support cases where
the container document is only available as an InputStream (i.e. no
random access). Then you'd either need to use temporary files (or
in-memory buffers) or a callback interface like this one:
public interface ComponentDocumentHandler {
    void handleComponentDocument(
            InputStream stream, Metadata metadata)
            throws IOException, TikaException;
}
Such callbacks could be trivially produced by passing a custom Parser
instance through the ParseContext to the package parser. The custom
Parser class should have a parse() method like this:
public void parse(
        InputStream stream, ContentHandler handler,
        Metadata metadata, ParseContext context)
        throws IOException, SAXException, TikaException {
    componentDocumentHandler.handleComponentDocument(stream, metadata);
}
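A self-contained sketch of that wiring, with local stand-ins for the Tika types so it compiles without Tika on the classpath (in real code the handler-carrying Parser would be registered via the ParseContext, and Metadata would replace the plain map):

```java
import java.io.*;
import java.util.*;

// Stand-in sketch of the callback wiring above: a "package parser"
// walks the container's component documents and fires the handler
// once per part, passing a stream plus whatever metadata is known.
public class ComponentExtraction {
    public interface ComponentDocumentHandler {
        void handleComponentDocument(InputStream stream,
                                     Map<String, String> metadata)
                throws IOException;
    }

    public static void parseContainer(Map<String, byte[]> parts,
                                      ComponentDocumentHandler handler)
            throws IOException {
        for (Map.Entry<String, byte[]> part : parts.entrySet()) {
            Map<String, String> metadata = new HashMap<>();
            metadata.put("resourceName", part.getKey());
            handler.handleComponentDocument(
                    new ByteArrayInputStream(part.getValue()), metadata);
        }
    }
}
```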
> What do people think? Is this useful? Is this appropriate for Tika? If yes
> to these two, does the rough method signature sound sane?
+1 to having something like this in Tika, as long as we can come up
with a clean API.
BR,
Jukka Zitting
Re: Container Extractor?
Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
+1, Nick, this sounds great...
Cheers,
Chris
On 9/1/10 2:54 AM, "Nick Burch" <ni...@alfresco.com> wrote:
Hi All
I've been thinking about extracting files from container formats (e.g.
images in a .docx, PDFs in a zip file etc). Given the recent number of
queries about embedded files and Tika lately, I was wondering if people
thought this might be something worth adding as another part of Tika?
My idea is that you'd pass a container file to this "service". You'd also
say if you wanted recursion, and which mime types interest you. The result
would be, say, an iterator of input streams, which would probably also let
you get the filenames and mime types where supported by the container.
Example uses would be:
* .doc file, non-recursive, request image/png and image/jpeg
gives you all the images in the Word document
* .ppt file, recursive, request Excel
gives you Excel files embedded in the PowerPoint, and Excel files embedded
in the Word documents embedded in the PowerPoint
* .docx file, non-recursive, request image/png
treated as an OOXML file, not a plain zip file, and all PNG images
from the magic embedded directory are returned.
* .zip file, recursive, request PDF
gives you all PDF files anywhere in the zip
* .ogg file, non-recursive, request audio
gives you the 3 different audio streams in your video file
You could pass the resulting input streams into the regular Tika parser if
you wanted to process them, or even just save them into a directory
if all you wanted was an extractor.
What do people think? Is this useful? Is this appropriate for Tika? If yes
to these two, does the rough method signature sound sane?
Nick
PS I'm willing to do most of the coding on this if it's deemed suitable
for Tika, but not for a few weeks probably, until Alfresco 3.4 is done
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++