Posted to dev@tika.apache.org by Stephane Bastian <st...@gmail.com> on 2008/12/09 08:27:36 UTC

Extending existing Parsers - No easy to do right now, could we make it easier?

Hi All,

I finally found some time to send an email and share some thoughts on
one of the stickiest issues I have had so far with Tika: it's almost
impossible to leverage and override the functionality of existing
Parsers. I believe the main reason is that the parse method leaves
no room to override existing behavior or provide my own logic. It's
pretty much an all-or-nothing kind of thing.

For instance, take the Html Parser and let's say I just need to extract
some meta-data not currently handled by Tika. If I'm not mistaken, I 
basically have two solutions:

1) Modify the current Html Parser, add code to extract the new metadata 
and submit a Patch to Tika
2) Create my own class:
    - copy and paste the existing code, because (as noted above) the
current parse() method leaves very little room to override existing
behavior or provide my own logic
    - add my code
    - register my class so that it's called for a given mimeType

In all the cases I have had so far, I simply needed to be able to register
my own ContentHandler on the source document (and not on the structured
content). Unfortunately, that is currently not possible.
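
To make the use case concrete, here is a minimal sketch of the kind of handler I have in mind. It uses only the SAX classes that ship with the JDK; MetaTagCollector and its collect() helper are hypothetical names made up for this example, not Tika classes, and the input is assumed to be well-formed XHTML so that the JDK parser can read it.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: records name/content pairs from meta elements. */
class MetaTagCollector extends DefaultHandler {
    final Map<String, String> values = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (!"meta".equalsIgnoreCase(qName)) {
            return; // only meta elements are of interest here
        }
        String name = atts.getValue("name");
        String content = atts.getValue("content");
        if (name != null && content != null) {
            values.put(name, content);
        }
    }

    /** Convenience helper (assumption: input is well-formed XHTML). */
    static Map<String, String> collect(String xhtml) throws Exception {
        MetaTagCollector collector = new MetaTagCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), collector);
        return collector.values;
    }
}
```

Real-world HTML is rarely well-formed, which is why such a handler would need to sit behind a tolerant HTML parser rather than the plain JDK one used in this sketch.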

So, I wanted to know 1) whether other people have had trouble extending
existing Parsers, and 2) whether this is an issue we should tackle.

BR,

Stephane Bastian

RE: Extending existing Parsers - No easy to do right now, could we make it easier?

Posted by Uwe Schindler <uw...@thetaphi.de>.

In my opinion, if somebody wants such a specialized parser with his own
optimizations, he could simply write his own parser using NekoHTML and plug
it into Tika.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Jukka Zitting [mailto:jukka.zitting@gmail.com]
> Sent: Tuesday, December 16, 2008 12:07 AM
> To: tika-dev@lucene.apache.org
> Subject: Re: Extending existing Parsers - No easy to do right now, could
> we make it easier?
> 
> Hi,
> 
> On Tue, Dec 9, 2008 at 1:04 PM, Stephane Bastian
> <st...@gmail.com> wrote:
> > In any case, as you pointed out Tika might not be the best place to do
> this.
> > However going back to my initial short term issue, which is extending
> the
> > Html Parser, I would definitely take the solution you proposed earlier
> if
> > it's still on the table ;)
> 
> I thought about this a bit more (see TIKA-182), and I must say that
> I'd rather not apply the patch to Tika. Doing so would create an extra
> binding between client code and the underlying parser library, and
> would make it difficult for us to later replace the parser if we
> wanted to.
> 
> BR,
> 
> Jukka Zitting

Re: Extending existing Parsers - No easy to do right now, could we make it easier?

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On Tue, Dec 9, 2008 at 1:04 PM, Stephane Bastian
<st...@gmail.com> wrote:
> In any case, as you pointed out Tika might not be the best place to do this.
> However going back to my initial short term issue, which is extending the
> Html Parser, I would definitely take the solution you proposed earlier if
> it's still on the table ;)

I thought about this a bit more (see TIKA-182), and I must say that
I'd rather not apply the patch to Tika. Doing so would create an extra
binding between client code and the underlying parser library, and
would make it difficult for us to later replace the parser if we
wanted to.

BR,

Jukka Zitting

Re: Extending existing Parsers - No easy to do right now, could we make it easier?

Posted by Stephane Bastian <st...@gmail.com>.

Hi,

You're definitely right that there would be a mapping between a given
document and XML, via a ContentHandler, which is kind of what Tika does
already. This also means that metadata would be extracted from the "raw"
ContentHandler.
In any case, as you pointed out Tika might not be the best place to do 
this.
However going back to my initial short term issue, which is extending 
the Html Parser, I would definitely take the solution you proposed 
earlier if it's still on the table ;)

BR,

Stephane Bastian

Jukka Zitting wrote:
> Hi,
>
> On Tue, Dec 9, 2008 at 12:19 PM, Stephane Bastian
> <st...@gmail.com> wrote:
>   
>> Parsing goes through several fairly well-defined steps, and in the case of
>> Tika it could be represented as follows:
>> 1) Generate SAX events out of the stream
>> 2) Extract metadata and save it in an instance of the Metadata class
>> 3) Generate SAX events about the structure of a document
>>     
>
> For many document types steps 1 and 2 are reversed, and 1 and 3 are
> actually just a single step. I'm not sure if there's much room for
> generalization here.
>
>   
>> How about slightly modifying Tika so that custom code can hook into 1) as
>> well? We could do this by adding an extra ContentHandler to the parse method:
>>
>> public void parse(InputStream stream, ContentHandler rawHandler,
>> ContentHandler structuredHandler, Metadata metadata);
>>     
>
> Most document types simply don't have a "raw" SAX stream, so I don't
> think this is a good idea in the general case. The only SAX events you
> have are the ones sent to the content handler we have now, so what
> you're trying to do could just as well be achieved using a
> TeeContentHandler on top of the existing Parser interface.
>
> What I believe you are looking for is a mechanism that would map the
> low-level details of all sorts of document types to XML. That might
> be interesting, but I'm not sure if Tika is the best place to do that.
> It might be a better idea to approach the parser libraries directly
> about a potential SAX mapping, as they are in a much better position
> to evaluate what such a mapping should look like and whether
> implementing it is reasonable.
>
>   
>> 2) Ability to leverage the MatchingContentHandler which is also working in
>> streaming mode. BTW, to me this part would probably deserve a project on its
>> own
>>     
>
> Thanks, I did think it was a good idea, but it's good to hear that
> others like it too. :-)
>
> BR,
>
> Jukka Zitting
>

Re: Extending existing Parsers - No easy to do right now, could we make it easier?

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On Tue, Dec 9, 2008 at 12:19 PM, Stephane Bastian
<st...@gmail.com> wrote:
> Parsing goes through several fairly well-defined steps, and in the case of
> Tika it could be represented as follows:
> 1) Generate SAX events out of the stream
> 2) Extract metadata and save it in an instance of the Metadata class
> 3) Generate SAX events about the structure of a document

For many document types steps 1 and 2 are reversed, and 1 and 3 are
actually just a single step. I'm not sure if there's much room for
generalization here.

> How about slightly modifying Tika so that custom code can hook into 1) as
> well? We could do this by adding an extra ContentHandler to the parse method:
>
> public void parse(InputStream stream, ContentHandler rawHandler,
> ContentHandler structuredHandler, Metadata metadata);

Most document types simply don't have a "raw" SAX stream, so I don't
think this is a good idea in the general case. The only SAX events you
have are the ones sent to the content handler we have now, so what
you're trying to do could just as well be achieved using a
TeeContentHandler on top of the existing Parser interface.
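
As a rough illustration of the tee idea, here is a minimal sketch that uses only the SAX classes shipped with the JDK. SimpleTeeHandler, ElementNameCollector and TeeDemo are hypothetical names made up for this example; in practice Tika's own TeeContentHandler would be the class to use, and its implementation may well differ from this sketch.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical stand-in for a tee: forwards each event to every delegate. */
class SimpleTeeHandler extends DefaultHandler {
    private final ContentHandler[] delegates;

    SimpleTeeHandler(ContentHandler... delegates) {
        this.delegates = delegates;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        for (ContentHandler handler : delegates) {
            handler.startElement(uri, localName, qName, atts);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        for (ContentHandler handler : delegates) {
            handler.endElement(uri, localName, qName);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        for (ContentHandler handler : delegates) {
            handler.characters(ch, start, length);
        }
    }
}

/** Records the names of the elements it sees, in document order. */
class ElementNameCollector extends DefaultHandler {
    final List<String> names = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        names.add(qName);
    }
}

class TeeDemo {
    /** Parses well-formed XML once and returns what each of two collectors saw. */
    static List<List<String>> run(String xml) throws Exception {
        ElementNameCollector first = new ElementNameCollector();
        ElementNameCollector second = new ElementNameCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)),
                       new SimpleTeeHandler(first, second));
        return List.of(first.names, second.names);
    }
}
```

The point of the sketch is that both handlers observe the same single pass over the stream, so the client's extra handler does not require any change to the Parser interface.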

What I believe you are looking for is a mechanism that would map the
low-level details of all sorts of document types to XML. That might
be interesting, but I'm not sure if Tika is the best place to do that.
It might be a better idea to approach the parser libraries directly
about a potential SAX mapping, as they are in a much better position
to evaluate what such a mapping should look like and whether
implementing it is reasonable.

> 2) Ability to leverage the MatchingContentHandler which is also working in
> streaming mode. BTW, to me this part would probably deserve a project on its
> own

Thanks, I did think it was a good idea, but it's good to hear that
others like it too. :-)

BR,

Jukka Zitting

Re: Extending existing Parsers - No easy to do right now, could we make it easier?

Posted by Stephane Bastian <st...@gmail.com>.

Hi Jukka,

This fix would definitely help me in the short run, since I have to
extend the Html parser for my specific needs. However, I'm thinking
that I may run into the same problem with another parser in a month or two.
Therefore I'm leaning toward finding a solution that would work for all
Parsers.

Let me throw an idea here:

Parsing goes through several fairly well-defined steps, and in the case
of Tika it could be represented as follows:
1) Generate SAX events out of the stream
2) Extract metadata and save it in an instance of the Metadata class
3) Generate SAX events about the structure of a document

For HTML pages:
    1) is done by CyberNeko for us. CyberNeko converts an HTML stream
(which most of the time is *not* well formed) to SAX events
    2) is basically the body of the parse method
    3) is kind of mixed into the body of the parse method

Right now, Tika lets us interact with 2) and 3) only at the cost of an
almost complete rewrite of the parent parser.

How about slightly modifying Tika so that custom code can hook into 1)
as well? We could do this by adding an extra ContentHandler to the parse
method:

public void parse(InputStream stream, ContentHandler rawHandler,
ContentHandler structuredHandler, Metadata metadata);

Of course, this means modifying the signature of the parse method a bit,
and this is not something we want to do if we don't have to.
However, I feel the benefits outweigh the cost of adding an extra
parameter, and it would give people a way to add extra functionality to
existing Parsers very quickly.
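
To show how the two handlers would be fed, here is a toy sketch of the idea. Everything in it is an assumption made for illustration: TwoHandlerParser, ToyXmlParser and NameRecorder are hypothetical names, a plain Map stands in for Tika's Metadata class, and the JDK SAX parser (well-formed XML only) stands in for the underlying HTML parser. In this toy, the "structured" handler receives only the boundaries of p elements.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical interface mirroring the proposed signature; not part of Tika. */
interface TwoHandlerParser {
    void parse(InputStream stream, ContentHandler rawHandler,
               ContentHandler structuredHandler, Map<String, String> metadata)
            throws IOException, SAXException;
}

/** Toy implementation: raw events, title metadata, and p-only structure. */
class ToyXmlParser implements TwoHandlerParser {
    @Override
    public void parse(InputStream stream, ContentHandler rawHandler,
                      ContentHandler structuredHandler, Map<String, String> metadata)
            throws IOException, SAXException {
        DefaultHandler dispatcher = new DefaultHandler() {
            private StringBuilder title; // non-null only while inside <title>

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts)
                    throws SAXException {
                rawHandler.startElement(uri, local, qName, atts); // step 1
                if ("title".equals(qName)) {
                    title = new StringBuilder();
                } else if ("p".equals(qName)) {
                    structuredHandler.startElement(uri, local, qName, atts); // step 3
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) throws SAXException {
                rawHandler.characters(ch, start, length);
                if (title != null) {
                    title.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) throws SAXException {
                rawHandler.endElement(uri, local, qName);
                if ("title".equals(qName) && title != null) {
                    metadata.put("title", title.toString()); // step 2
                    title = null;
                } else if ("p".equals(qName)) {
                    structuredHandler.endElement(uri, local, qName);
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(stream, dispatcher);
        } catch (ParserConfigurationException e) {
            throw new SAXException(e);
        }
    }
}

/** Records element names, to make it visible what each handler received. */
class NameRecorder extends DefaultHandler {
    final List<String> names = new ArrayList<>();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        names.add(qName);
    }
}
```

With two NameRecorder instances passed in, the raw one would see every element while the structured one would see only the p elements, which is the separation between steps 1) and 3) described above.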


As you pointed out, I could also work directly with the parser myself,
but in that case I would lose many of the benefits of using Tika:
1) Streaming
2) The ability to leverage the MatchingContentHandler, which also works
in streaming mode. BTW, to me this part would probably deserve a project
of its own
3) It shields me from the details of parsing a document and converting it
to SAX events (trivial for HTML but very handy for other documents such as
MS Office...)

BR,

Stephane Bastian

Jukka Zitting wrote:
> Hi,
>
> On Tue, Dec 9, 2008 at 8:27 AM, Stephane Bastian
> <st...@gmail.com> wrote:
>   
>> So, I wanted to know 1) whether other people have had trouble extending
>> existing Parsers, and 2) whether this is an issue we should tackle.
>>     
>
> We're of course open to contributions on issues like this, but I'm
> wondering if your use case would be better served by directly using
> the underlying parser library. If not, how about an extension point
> like the one defined in the patch below?
>
> BR,
>
> Jukka Zitting
>
> Index: src/main/java/org/apache/tika/parser/html/HtmlParser.java
> ===================================================================
> --- src/main/java/org/apache/tika/parser/html/HtmlParser.java	(revision 724309)
> +++ src/main/java/org/apache/tika/parser/html/HtmlParser.java	(working copy)
> @@ -84,6 +84,31 @@
>
>      }
>
> +    /**
> +     * Extra handler that can be specified by the client application for
> +     * additional processing of raw HTML SAX events generated by NekoHTML.
> +     */
> +    private ContentHandler extension;
> +
> +    /**
> +     * Returns the configured extension handler.
> +     *
> +     * @return configured extension handler, or <code>null</code>
> +     */
> +    public ContentHandler getExtension() {
> +        return extension;
> +    }
> +
> +    /**
> +     * Sets an extension handler for additional processing of the raw HTML
> +     * SAX events generated by the underlying HTML parser.
> +     *
> +     * @param extension extension handler
> +     */
> +    public void setExtension(ContentHandler extension) {
> +        this.extension = extension;
> +    }
> +
>      public void parse(
>              InputStream stream, ContentHandler handler, Metadata metadata)
>              throws IOException, SAXException, TikaException {
> @@ -102,9 +127,17 @@
>                  new MatchingContentHandler(getTitleHandler(metadata), title),
>                  new MatchingContentHandler(getMetaHandler(metadata), meta));
>
> +        // Simplify the HTML for Tika clients
> +        handler = new XHTMLDowngradeHandler(handler);
> +
> +        // Add the configured extension, if any
> +        if (extension != null) {
> +            handler = new TeeContentHandler(handler, extension);
> +        }
> +
>          // Parse the HTML document
>          SAXParser parser = new SAXParser();
> -        parser.setContentHandler(new XHTMLDowngradeHandler(handler));
> +        parser.setContentHandler(handler);
>          parser.parse(new InputSource(Utils.getUTF8Reader(stream, metadata)));
>      }
>

Re: Extending existing Parsers - No easy to do right now, could we make it easier?

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On Tue, Dec 9, 2008 at 8:27 AM, Stephane Bastian
<st...@gmail.com> wrote:
> So, I wanted to know 1) whether other people have had trouble extending
> existing Parsers, and 2) whether this is an issue we should tackle.

We're of course open to contributions on issues like this, but I'm
wondering if your use case would be better served by directly using
the underlying parser library. If not, how about an extension point
like the one defined in the patch below?

BR,

Jukka Zitting

Index: src/main/java/org/apache/tika/parser/html/HtmlParser.java
===================================================================
--- src/main/java/org/apache/tika/parser/html/HtmlParser.java	(revision 724309)
+++ src/main/java/org/apache/tika/parser/html/HtmlParser.java	(working copy)
@@ -84,6 +84,31 @@

     }

+    /**
+     * Extra handler that can be specified by the client application for
+     * additional processing of raw HTML SAX events generated by NekoHTML.
+     */
+    private ContentHandler extension;
+
+    /**
+     * Returns the configured extension handler.
+     *
+     * @return configured extension handler, or <code>null</code>
+     */
+    public ContentHandler getExtension() {
+        return extension;
+    }
+
+    /**
+     * Sets an extension handler for additional processing of the raw HTML
+     * SAX events generated by the underlying HTML parser.
+     *
+     * @param extension extension handler
+     */
+    public void setExtension(ContentHandler extension) {
+        this.extension = extension;
+    }
+
     public void parse(
             InputStream stream, ContentHandler handler, Metadata metadata)
             throws IOException, SAXException, TikaException {
@@ -102,9 +127,17 @@
                 new MatchingContentHandler(getTitleHandler(metadata), title),
                 new MatchingContentHandler(getMetaHandler(metadata), meta));

+        // Simplify the HTML for Tika clients
+        handler = new XHTMLDowngradeHandler(handler);
+
+        // Add the configured extension, if any
+        if (extension != null) {
+            handler = new TeeContentHandler(handler, extension);
+        }
+
         // Parse the HTML document
         SAXParser parser = new SAXParser();
-        parser.setContentHandler(new XHTMLDowngradeHandler(handler));
+        parser.setContentHandler(handler);
         parser.parse(new InputSource(Utils.getUTF8Reader(stream, metadata)));
     }