Posted to user@tika.apache.org by Andrzej Bialecki <ab...@getopt.org> on 2011/12/07 10:07:47 UTC
Recursive parsing
Hi,
Following the example at http://wiki.apache.org/tika/RecursiveMetadata,
I'm trying to parse nested documents and collect the content and
metadata of each document separately. All is going well, in fact maybe a little
too well ;) - parsers descend into internal components of compound
documents, so e.g. I'm getting all images from Word docs as separate
nested documents. This is very cool - it's good to know that Tika
supports this when you need it.
However, I'd like to have an option to avoid recursing into compound
documents, while still being able to process nested archives (like zip,
tgz, etc). Is there any easy way to express this preference? I thought
about using the type of handler passed to the RecursiveParser.parse(..)
to decide when to stop recursing, but I noticed that in both cases
(embedded components and entries in archives) an EmbeddedContentHandler
is passed to the parse(...) method.
Oh, and I really would appreciate some further feedback on TIKA-675 - if
this idea is ok I'd start working towards a patch.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: Recursive parsing
Posted by Nick Burch <ni...@alfresco.com>.
On Thu, 8 Dec 2011, Andrzej Bialecki wrote:
> I guess that could work, but it would be very messy - I would have to
> keep a list of all potentially interesting mime types in my code, which
> is difficult to maintain.
Or a list of interesting parsers in your other case!
> It would be much better if the parent parser passed a token in metadata,
> basically saying "this is invoked from a XXXParser", so then I could
> detect that it was the PackageParser that invoked the method, and act
> accordingly.
You could turn it around a little bit. Call the detector to get the
mimetype, then ask the composite parser for the right parser for that
type. Based on what you get back, either do or don't supply the recursing
parser to the parse context.
(The detector and parser map are both available from AutoDetectParser.)
Nick
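A minimal sketch of the decision Nick describes above, written in plain Java with no Tika dependency so that it runs as-is. The names PARSER_FOR_TYPE, CONTAINER_PARSERS and shouldRecurse are hypothetical, and the type-to-parser pairings are assumptions for illustration rather than values read from Tika. In real code the type would come from the detector and the parser from the composite parser's map.

```java
import java.util.Map;
import java.util.Set;

/**
 * Sketch of the decision only: detect the type, look up which parser would
 * handle it, and recurse only when that parser is a container parser.
 */
final class RecursionDecision {

    // Hypothetical stand-in for the composite parser's type-to-parser map.
    // The parser names are illustrative labels, not looked up from Tika.
    static final Map<String, String> PARSER_FOR_TYPE = Map.of(
            "application/zip", "PackageParser",
            "application/x-tar", "PackageParser",
            "application/msword", "OfficeParser",
            "application/pdf", "PDFParser");

    // Parsers whose children we want to descend into (archives only).
    static final Set<String> CONTAINER_PARSERS = Set.of("PackageParser");

    /** Returns true if the recursing parser should be put on the parse context. */
    static boolean shouldRecurse(String detectedType) {
        String parser = PARSER_FOR_TYPE.get(detectedType);
        // Unknown types have no parser entry, so they are never recursed into.
        return parser != null && CONTAINER_PARSERS.contains(parser);
    }

    public static void main(String[] args) {
        String[] types = {"application/zip", "application/msword", "text/plain"};
        for (String type : types) {
            System.out.println(type + " -> recurse: " + shouldRecurse(type));
        }
    }
}
```

Keying the decision on the parser rather than on the MIME type keeps the hand-maintained list short: one entry per container parser instead of one per archive format.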
Re: Recursive parsing
Posted by Andrzej Bialecki <ab...@getopt.org>.
On 08/12/2011 11:13, Nick Burch wrote:
> On Wed, 7 Dec 2011, Andrzej Bialecki wrote:
>> However, I'd like to have an option to avoid recursing into compound
>> documents, while still being able to process nested archives (like
>> zip, tgz, etc). Is there any easy way to express this preference? I
>> thought about using the type of handler passed to the
>> RecursiveParser.parse(..) to decide when to stop recursing, but I
>> noticed that in both cases (embedded components and entries in
>> archives) an EmbeddedContentHandler is passed to the parse(...) method.
>
> I'd suggest you just put the logic into your nested parser. What I'd
> suggest is that you look at the mimetype of the source document, and use
> that to decide if you supply the recursing parser or not on the parse
> context.
I guess that could work, but it would be very messy - I would have to
keep a list of all potentially interesting mime types in my code, which
is difficult to maintain.
It would be much better if the parent parser passed a token in metadata,
basically saying "this is invoked from a XXXParser", so then I could
detect that it was the PackageParser that invoked the method, and act
accordingly.
Re: Recursive parsing
Posted by Nick Burch <ni...@alfresco.com>.
On Wed, 7 Dec 2011, Andrzej Bialecki wrote:
> However, I'd like to have an option to avoid recursing into compound
> documents, while still being able to process nested archives (like zip,
> tgz, etc). Is there any easy way to express this preference? I thought
> about using the type of handler passed to the RecursiveParser.parse(..)
> to decide when to stop recursing, but I noticed that in both cases
> (embedded components and entries in archives) an EmbeddedContentHandler
> is passed to the parse(...) method.
I'd suggest you just put the logic into your nested parser. What I'd
suggest is that you look at the mimetype of the source document, and use
that to decide if you supply the recursing parser or not on the parse
context.
Nick
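The MIME-type check suggested above might look roughly like the following. This is a sketch in plain Java with no Tika dependency. ARCHIVE_TYPES and recurseInto are hypothetical names, and the listed types are an assumed starting set, not a complete one. It also shows the maintenance cost raised elsewhere in the thread: every archive format of interest has to be listed by hand.

```java
import java.util.Locale;
import java.util.Set;

final class MimeTypeGate {

    // Hypothetical allowlist of archive types to descend into.
    // This is the list that has to be maintained by hand.
    static final Set<String> ARCHIVE_TYPES = Set.of(
            "application/zip",
            "application/x-tar",
            "application/x-gzip");

    /** Returns true if the recursing parser should be supplied for this document. */
    static boolean recurseInto(String mimeType) {
        if (mimeType == null) {
            return false;
        }
        // Drop parameters such as "; charset=UTF-8" and normalise case before the lookup.
        int semicolon = mimeType.indexOf(';');
        String base = (semicolon >= 0 ? mimeType.substring(0, semicolon) : mimeType)
                .trim()
                .toLowerCase(Locale.ROOT);
        return ARCHIVE_TYPES.contains(base);
    }

    public static void main(String[] args) {
        System.out.println(recurseInto("application/zip"));
        System.out.println(recurseInto("Application/X-Tar; charset=binary"));
        System.out.println(recurseInto("application/msword"));
    }
}
```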