Posted to dev@tika.apache.org by Michael Wechner <mi...@wyona.com> on 2008/08/19 16:54:59 UTC
Customizing TikaConfig or rather getParser
Hi
We are currently using Tika and it works great so far, but now we would
like to have a getParser() method which doesn't depend on the mime-type
but rather on a node path or whatever. For example, we have different kinds
of XML documents in the filesystem, but they all share the same mime-type,
application/xml, so the only way to differentiate them is by path. Also, it
doesn't seem like one is meant to override TikaConfig:
http://incubator.apache.org/tika/apidocs/org/apache/tika/config/TikaConfig.html
How do other people handle such situations?
Thanks
Michael
Re: Customizing TikaConfig or rather getParser
Posted by Michael Wechner <mi...@wyona.com>.
Jukka Zitting wrote:
> Hi,
>
> On Thu, Sep 4, 2008 at 1:50 PM, Michael Wechner
> <mi...@wyona.com> wrote:
>
>> Jukka Zitting wrote:
>>
>>> The way I see it, an application should ideally only deal with a
>>> single Parser instance, that would be smart enough to select the
>>> appropriate parsing mechanism for each incoming document based on the
>>> associated metadata.
>>>
>> I am afraid that this makes the parsers less usable, but of course we could
>> introduce a meta-parser and then re-use the actual data parsers.
>>
>
> That's pretty much what the AutoDetectParser and CompositeParser
> classes are designed to do.
>
ok, thanks for this info
Michael
> BR,
>
> Jukka Zitting
>
Re: Customizing TikaConfig or rather getParser
Posted by Jukka Zitting <ju...@gmail.com>.
Hi,
On Thu, Sep 4, 2008 at 1:50 PM, Michael Wechner
<mi...@wyona.com> wrote:
> Jukka Zitting wrote:
>> The way I see it, an application should ideally only deal with a
>> single Parser instance, that would be smart enough to select the
>> appropriate parsing mechanism for each incoming document based on the
>> associated metadata.
>
> I am afraid that this makes the parsers less usable, but of course we could
> introduce a meta-parser and then re-use the actual data parsers.
That's pretty much what the AutoDetectParser and CompositeParser
classes are designed to do.
BR,
Jukka Zitting
Re: Customizing TikaConfig or rather getParser
Posted by Michael Wechner <mi...@wyona.com>.
Jukka Zitting wrote:
> Hi,
>
> On Thu, Sep 4, 2008 at 11:31 AM, Michael Wechner
> <mi...@wyona.com> wrote:
>
>> this seems to work for our use case, but it seems to me that the actual
>> problem is just transferred one step further down.
>>
>
> "There are few problems in computer science that can not be solved by
> adding another level of indirection." -Tom Christiansen
>
>
>> I think it would be better to separate the actual parser selection (via
>> chain of responsibility) from passing in metadata.
>>
>
> The way I see it, an application should ideally only deal with a
> single Parser instance, that would be smart enough to select the
> appropriate parsing mechanism for each incoming document based on the
> associated metadata.
>
I am afraid that this makes the parsers less usable, but of course we
could introduce a meta-parser and then re-use the actual data parsers.
But then again, one might ask why the mime-type should be treated specially ;-)
> The reason for making the Metadata object a modifiable input/output
> parameter (instead of just a return value) of the parse() method was
> that a client application could feed extra metadata to the parsing
> process. In your use case that extra metadata would be the path of the
> document.
>
this is how we are now using it.
Thanks
Michael
> BR,
>
> Jukka Zitting
>
Re: Customizing TikaConfig or rather getParser
Posted by Jukka Zitting <ju...@gmail.com>.
Hi,
On Thu, Sep 4, 2008 at 11:31 AM, Michael Wechner
<mi...@wyona.com> wrote:
> this seems to work for our use case, but it seems to me that the actual
> problem is just transferred one step further down.
"There are few problems in computer science that can not be solved by
adding another level of indirection." -Tom Christiansen
> I think it would be better to separate the actual parser selection (via
> chain of responsibility) from passing in metadata.
The way I see it, an application should ideally only deal with a
single Parser instance, that would be smart enough to select the
appropriate parsing mechanism for each incoming document based on the
associated metadata.
The reason for making the Metadata object a modifiable input/output
parameter (instead of just a return value) of the parse() method was
that a client application could feed extra metadata to the parsing
process. In your use case that extra metadata would be the path of the
document.
BR,
Jukka Zitting
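To make the idea of feeding extra metadata into a single parser concrete, here is a minimal, self-contained sketch. It deliberately does not use Tika's classes: DocParser, PathAwareParser, the "resourcePath" key and the "/invoices/" prefix are hypothetical stand-ins, and a plain Map plays the role of the Metadata object. It only illustrates the pattern of one entry point that reads input metadata, picks a delegate, and lets the delegate write output metadata back.

```java
import java.util.HashMap;
import java.util.Map;

/** Stand-in for a parser; this is NOT Tika's Parser interface. */
interface DocParser {
    void parse(String content, Map<String, String> metadata);
}

/** Single entry point that selects a delegate from the metadata it receives. */
class PathAwareParser implements DocParser {

    /** Hypothetical metadata key under which the client passes the document path. */
    static final String PATH_KEY = "resourcePath";

    // Two illustrative delegates; each records which one handled the document.
    private final DocParser invoiceParser =
            (content, md) -> md.put("parsedBy", "invoice");
    private final DocParser genericXmlParser =
            (content, md) -> md.put("parsedBy", "generic-xml");

    @Override
    public void parse(String content, Map<String, String> metadata) {
        // Input side: the client supplied the path as extra metadata.
        String path = metadata.getOrDefault(PATH_KEY, "");
        DocParser delegate =
                path.startsWith("/invoices/") ? invoiceParser : genericXmlParser;
        // Output side: the delegate may add entries to the same map.
        delegate.parse(content, metadata);
    }
}

class PathAwareParserDemo {
    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put("Content-Type", "application/xml");
        metadata.put(PathAwareParser.PATH_KEY, "/invoices/2008/08.xml");
        new PathAwareParser().parse("<invoice/>", metadata);
        System.out.println(metadata.get("parsedBy"));
    }
}
```

The client only ever sees one parser instance; everything that distinguishes documents with the same mime-type travels in the metadata map.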
Re: Customizing TikaConfig or rather getParser
Posted by Michael Wechner <mi...@wyona.com>.
Jukka Zitting wrote:
> Hi,
>
> On Mon, Aug 25, 2008 at 9:06 AM, Michael Wechner
> <mi...@wyona.com> wrote:
>
>> I think this is where the problem is, I mean the getParser(String) method.
>>
>> I would like to override this method by implementing my own chain of
>> responsibility.
>>
>
> How about the following:
>
> public class MyCustomParser extends CompositeParser {
>
>     public MyCustomParser() throws TikaException {
>         setConfig(TikaConfig.getDefaultConfig());
>         // or whatever config you want
>     }
>
>     protected Parser getParser(Metadata metadata) {
>         // Custom code to select an appropriate parser
>         // based on the input metadata (mime type,
>         // document path, whatever) passed by the client.
>         // Or fall back to:
>         return super.getParser(metadata);
>     }
>
> }
>
> Your client code would then look like:
>
> private Parser parser = new MyCustomParser();
>
> Metadata metadata = new Metadata();
> metadata.set(Metadata.CONTENT_TYPE, "application/xml");
> // plus whatever other metadata you need in MyCustomParser
>
> parser.parse(stream, handler, metadata);
>
> One of my design goals for the current Parser interface was that
> you can encapsulate this sort of functionality inside it.
>
this seems to work for our use case, but it seems to me that the actual
problem is just transferred one step further down.
I think it would be better to separate the actual parser selection (via
chain of responsibility) from passing in metadata.
Cheers
Michael
> BR,
>
> Jukka Zitting
>
Re: Customizing TikaConfig or rather getParser
Posted by Michael Wechner <mi...@wyona.com>.
Jukka Zitting wrote:
> Hi,
>
> On Mon, Aug 25, 2008 at 9:06 AM, Michael Wechner
> <mi...@wyona.com> wrote:
>
>> I think this is where the problem is, I mean the getParser(String) method.
>>
>> I would like to override this method by implementing my own chain of
>> responsibility.
>>
>
> How about the following:
>
> public class MyCustomParser extends CompositeParser {
>
>     public MyCustomParser() throws TikaException {
>         setConfig(TikaConfig.getDefaultConfig());
>         // or whatever config you want
>     }
>
>     protected Parser getParser(Metadata metadata) {
>         // Custom code to select an appropriate parser
>         // based on the input metadata (mime type,
>         // document path, whatever) passed by the client.
>         // Or fall back to:
>         return super.getParser(metadata);
>     }
>
> }
>
> Your client code would then look like:
>
> private Parser parser = new MyCustomParser();
>
> Metadata metadata = new Metadata();
> metadata.set(Metadata.CONTENT_TYPE, "application/xml");
> // plus whatever other metadata you need in MyCustomParser
>
> parser.parse(stream, handler, metadata);
>
> One of my design goals for the current Parser interface was that
> you can encapsulate this sort of functionality inside it.
>
thanks for the suggestions. Will give it a try and keep you posted on my
findings.
Thanks
Michael
> BR,
>
> Jukka Zitting
>
Re: Customizing TikaConfig or rather getParser
Posted by Jukka Zitting <ju...@gmail.com>.
Hi,
On Mon, Aug 25, 2008 at 9:06 AM, Michael Wechner
<mi...@wyona.com> wrote:
> I think this is where the problem is, I mean the getParser(String) method.
>
> I would like to override this method by implementing my own chain of
> responsibility.
How about the following:

    public class MyCustomParser extends CompositeParser {

        public MyCustomParser() throws TikaException {
            setConfig(TikaConfig.getDefaultConfig());
            // or whatever config you want
        }

        protected Parser getParser(Metadata metadata) {
            // Custom code to select an appropriate parser
            // based on the input metadata (mime type,
            // document path, whatever) passed by the client.
            // Or fall back to:
            return super.getParser(metadata);
        }

    }

Your client code would then look like:

    private Parser parser = new MyCustomParser();

    Metadata metadata = new Metadata();
    metadata.set(Metadata.CONTENT_TYPE, "application/xml");
    // plus whatever other metadata you need in MyCustomParser

    parser.parse(stream, handler, metadata);

One of my design goals for the current Parser interface was that
you can encapsulate this sort of functionality inside it.
BR,
Jukka Zitting
Re: Customizing TikaConfig or rather getParser
Posted by Michael Wechner <mi...@wyona.com>.
Thorsten Scherler wrote:
> On Tue, 2008-08-19 at 16:54 +0200, Michael Wechner wrote:
>
>> Hi
>>
>>
>
> Hi Michi,
>
Hello Thorsten :-)
>
>
> I would reuse the config and create a config file
> ("/PathTo/myConfig.xml") as follows. I asked whether doc-type is a
> possibility since it would make configuration much easier.
>
> Instead of using the plain mime type I would use the doc type:
>
what exactly do you mean by doc type?
> <parser name="parse-myDocType"
> class="org.apache.tika.parser.docType.MyDocTypeParser">
> <mime>myDoctype</mime>
> </parser>
>
> and then from your code call
> TikaConfig config = new TikaConfig("/PathTo/myConfig.xml");
> Parser parser = config.getParser("myDoctype");
>
I think this is where the problem is, I mean the getParser(String) method.
I would like to override this method by implementing my own chain of
responsibility.
Hence I think it would be nice to enhance this by introducing a new method
TikaConfig.getParser(ParserSelector)
(similar to
http://java.sun.com/j2se/1.4.2/docs/api/java/io/File.html#listFiles(java.io.FileFilter))
and ParserSelector would be an interface
(similar to http://java.sun.com/j2se/1.4.2/docs/api/java/io/FileFilter.html)
WDYT?
Thanks
Michael
> ...
>
> However, this is more about reusing the current code than finding a
> definitive solution, but maybe somebody else has another idea.
>
> HTH
>
> Regards
>
>
>> Thanks
>>
>> Michael
>>
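The ParserSelector idea proposed above, modelled on java.io.FileFilter, could look roughly like the following sketch. Nothing here exists in Tika: ParserSelector, SelectorChain, the "path" key, the "/news/" prefix and the parser names are all hypothetical, and a Map again stands in for the document metadata. The sketch only shows how selection via a chain of responsibility can be kept separate from the parsers themselves.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical selector, analogous to java.io.FileFilter; not part of Tika. */
interface ParserSelector {
    /** Returns a parser name if this selector handles the document, else empty. */
    Optional<String> select(Map<String, String> metadata);
}

/** Chain of responsibility: the first selector that gives an answer wins. */
class SelectorChain implements ParserSelector {
    private final List<ParserSelector> links;

    SelectorChain(ParserSelector... links) {
        this.links = Arrays.asList(links);
    }

    @Override
    public Optional<String> select(Map<String, String> metadata) {
        for (ParserSelector link : links) {
            Optional<String> choice = link.select(metadata);
            if (choice.isPresent()) {
                return choice;
            }
        }
        return Optional.empty(); // nobody in the chain felt responsible
    }
}

class SelectorChainDemo {
    /** Path-based selection first, mime-type-based selection as the fallback. */
    static ParserSelector defaultChain() {
        ParserSelector byPath = md -> {
            String path = md.get("path");
            return path != null && path.startsWith("/news/")
                    ? Optional.of("news-parser")
                    : Optional.empty();
        };
        ParserSelector byMimeType = md ->
                "application/xml".equals(md.get("Content-Type"))
                        ? Optional.of("xml-parser")
                        : Optional.empty();
        return new SelectorChain(byPath, byMimeType);
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put("Content-Type", "application/xml");
        metadata.put("path", "/news/today.xml");
        System.out.println(defaultChain().select(metadata).orElse("none"));
    }
}
```

Because a chain is itself a ParserSelector, chains can be nested, and the mime-type lookup becomes just one link rather than a special case.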
Re: Customizing TikaConfig or rather getParser
Posted by Thorsten Scherler <th...@apache.org>.
On Tue, 2008-08-19 at 16:54 +0200, Michael Wechner wrote:
> Hi
>
Hi Michi,
> We are currently using Tika and it works great so far, but now we would
> like to have a getParser() method which doesn't depend on the mime-type
> but rather on a node path or whatever.
I suppose the doc-type would be a good determination?
> For example we have different XML
> within the filesystem, but they all have the same mime-type
> application/xml, so the only way to differentiate is the path.
Or their doc-type?
> Also it
> doesn't seem like one should overwrite TikaConfig
>
> http://incubator.apache.org/tika/apidocs/org/apache/tika/config/TikaConfig.html
>
> How do other people handle such situations?
I would reuse the config and create a config file
("/PathTo/myConfig.xml") as follows. I asked whether doc-type is a
possibility since it would make configuration much easier.
Instead of using the plain mime type I would use the doc type:
<parser name="parse-myDocType"
class="org.apache.tika.parser.docType.MyDocTypeParser">
<mime>myDoctype</mime>
</parser>
and then from your code call
TikaConfig config = new TikaConfig("/PathTo/myConfig.xml");
Parser parser = config.getParser("myDoctype");
...
However, this is more about reusing the current code than finding a
definitive solution, but maybe somebody else has another idea.
HTH
Regards
> Thanks
>
> Michael
--
Thorsten Scherler thorsten.at.apache.org
Open Source Java consulting, training and solutions
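The thread leaves open what exactly "doc type" means. One possible reading, assumed here, is the name of the XML root element. Under that assumption, a lookup key for the parser could be derived with the JDK's StAX API before any parser is chosen. RootElementSniffer, parserNameFor and the parser names below are hypothetical; only the javax.xml.stream calls are real JDK API.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper: treats the root element name as the "doc type". */
class RootElementSniffer {

    /** Returns the local name of the first element in the document. */
    static String rootElement(String xml) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Do not process DTDs while sniffing.
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        return reader.getLocalName();
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
        throw new IllegalArgumentException("no root element found");
    }

    /** Maps the sniffed root element to an illustrative parser name. */
    static String parserNameFor(String xml) {
        String root = rootElement(xml);
        return "invoice".equals(root) ? "parse-invoice" : "parse-generic-xml";
    }

    public static void main(String[] args) {
        System.out.println(parserNameFor("<invoice><id>1</id></invoice>"));
    }
}
```

The resulting name could then serve as the key passed to a lookup such as config.getParser(...), so documents sharing the mime-type application/xml still reach different parsers.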