
Posted to user@tika.apache.org by Andrzej Bialecki <ab...@getopt.org> on 2011/12/07 10:07:47 UTC

Recursive parsing

Hi,

Following the example here http://wiki.apache.org/tika/RecursiveMetadata 
I'm trying to parse nested documents and collect separately the content 
and metadata of each document. All is going well, in fact maybe a little 
too well ;) - parsers descend into internal components of compound 
documents, so e.g. I'm getting all images from Word docs as separate 
nested documents. This is very cool - it's good to know that Tika 
supports this when you need it.

However, I'd like to have an option to avoid recursing into compound 
documents, while still being able to process nested archives (like zip, 
tgz, etc). Is there any easy way to express this preference? I thought 
about using the type of handler passed to the RecursiveParser.parse(..) 
to decide when to stop recursing, but I noticed that in both cases 
(embedded components and entries in archives) an EmbeddedContentHandler 
is passed to the parse(...) method.

Oh, and I really would appreciate some further feedback on TIKA-675 - if 
this idea is ok I'd start working towards a patch.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Re: Recursive parsing

Posted by Nick Burch <ni...@alfresco.com>.

On Thu, 8 Dec 2011, Andrzej Bialecki wrote:
> I guess that could work, but it would be very messy - I would have to 
> keep a list of all potentially interesting mime types in my code, which 
> is difficult to maintain.

Or a list of interesting parsers in your other case!

> It would be much better if the parent parser passed a token in metadata, 
> basically saying "this is invoked from a XXXParser", so then I could 
> detect that it was the PackageParser that invoked the method, and act 
> accordingly.

You could turn it around a little bit. Call the detector to get the 
mimetype, then ask the composite parser for the right parser for that 
type. Based on what you get back, either do or don't supply the recursing 
parser to the parse context.

(The detector and parser map are both available from AutoDetectParser)

Nick
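
A minimal sketch of the decision described above, written in plain Java with no Tika dependency so that it runs on its own. The names ParserBasedRecursion and CONTAINER_PARSERS and the string-keyed map are hypothetical stand-ins: in real code the type would come from the detector and the map from the composite parser, both of which are said above to be available from AutoDetectParser.

```java
import java.util.Map;
import java.util.Set;

// Plain-Java model of the decision only; no Tika classes are used here.
final class ParserBasedRecursion {

    // Hypothetical: simple names of the parsers whose children we want to descend into.
    static final Set<String> CONTAINER_PARSERS = Set.of("PackageParser");

    // parserByType stands in for the composite parser's "media type -> parser" map.
    // Returns true only when the parser chosen for the detected type is a container parser.
    static boolean shouldRecurse(String detectedType, Map<String, String> parserByType) {
        String parser = parserByType.get(detectedType);
        return parser != null && CONTAINER_PARSERS.contains(parser);
    }

    private ParserBasedRecursion() {
    }
}
```

With an illustrative map such as "application/zip" -> "PackageParser" and "application/msword" -> "OfficeParser", shouldRecurse returns true for application/zip and false for application/msword. When it returns false, the caller would not supply the recursing parser on the parse context for that document.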

Re: Recursive parsing

Posted by Andrzej Bialecki <ab...@getopt.org>.

On 08/12/2011 11:13, Nick Burch wrote:
> On Wed, 7 Dec 2011, Andrzej Bialecki wrote:
>> However, I'd like to have an option to avoid recursing into compound
>> documents, while still being able to process nested archives (like
>> zip, tgz, etc). Is there any easy way to express this preference? I
>> thought about using the type of handler passed to the
>> RecursiveParser.parse(..) to decide when to stop recursing, but I
>> noticed that in both cases (embedded components and entries in
>> archives) an EmbeddedContentHandler is passed to the parse(...) method.
>
> I'd suggest you just put the logic into your nested parser. What I'd
> suggest is that you look at the mimetype of the source document, and use
> that to decide if you supply the recursing parser or not on the parse
> context.

I guess that could work, but it would be very messy - I would have to 
keep a list of all potentially interesting mime types in my code, which 
is difficult to maintain.

It would be much better if the parent parser passed a token in metadata, 
basically saying "this is invoked from a XXXParser", so then I could 
detect that it was the PackageParser that invoked the method, and act 
accordingly.



Re: Recursive parsing

Posted by Nick Burch <ni...@alfresco.com>.

On Wed, 7 Dec 2011, Andrzej Bialecki wrote:
> However, I'd like to have an option to avoid recursing into compound 
> documents, while still being able to process nested archives (like zip, 
> tgz, etc). Is there any easy way to express this preference? I thought 
> about using the type of handler passed to the RecursiveParser.parse(..) 
> to decide when to stop recursing, but I noticed that in both cases 
> (embedded components and entries in archives) an EmbeddedContentHandler 
> is passed to the parse(...) method.

I'd suggest you just put the logic into your nested parser. What I'd 
suggest is that you look at the mimetype of the source document, and use 
that to decide if you supply the recursing parser or not on the parse 
context.

Nick
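
The mimetype check suggested above could look roughly like this. It is a sketch only: MimeTypeRecursion and ARCHIVE_TYPES are hypothetical names, the three media types are an illustrative and incomplete list, and the mimetype string is assumed to have been obtained elsewhere (for example from detection). As Andrzej points out in his reply, the drawback is that such a list has to be maintained by hand.

```java
import java.util.Locale;
import java.util.Set;

// Plain-Java model of a mimetype-based recursion check; no Tika classes are used here.
final class MimeTypeRecursion {

    // Illustrative, incomplete list of archive types to descend into.
    static final Set<String> ARCHIVE_TYPES =
            Set.of("application/zip", "application/x-tar", "application/x-gzip");

    // Returns true when the base type (parameters stripped, lower-cased) is a listed archive type.
    static boolean shouldRecurse(String mimeType) {
        if (mimeType == null) {
            return false;
        }
        String base = mimeType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return ARCHIVE_TYPES.contains(base);
    }

    private MimeTypeRecursion() {
    }
}
```

The result of shouldRecurse would then decide whether the recursing parser is supplied on the parse context for the source document.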