Posted to dev@tika.apache.org by Cedric Ulmer <ce...@francelabs.com> on 2023/01/31 10:34:06 UTC

Adding arguments to configure tika from the rest calls

Hi all,

We are experimenting with the regex-based detection capabilities of Tika combined with ManifoldCF, and an idea came to mind. First, the problem: for now, a Tika server has only one configuration. Therefore, if we set up regex-based entity extraction, it is applied to all documents (of the given MIME types). So if ManifoldCF calls the Tika server during a crawling phase, we cannot have different regex rules per crawling job: every job that calls the Tika server is processed the same way.

So here is the idea: would it be possible to make calls to a Tika server configurable via REST parameters, so that we could choose which config to use for the current call? Something like: ?enableNER=true&NERConfig=regex1
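
Seen from the client side, the proposal could look like the sketch below. It only builds the request and does not send it. The parameters `enableNER` and `NERConfig` are the names suggested in this message and are hypothetical: tika-server does not accept them today. The host, port and `/tika` endpoint are assumed defaults.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed tika-server location; adjust to your deployment.
TIKA_URL = "http://localhost:9998/tika"


def build_ner_request(document: bytes, ner_config: str) -> Request:
    """Build (but do not send) a PUT request that names a per-call NER config.

    'enableNER' and 'NERConfig' are the hypothetical parameters proposed
    in this thread, not an existing tika-server feature.
    """
    query = urlencode({"enableNER": "true", "NERConfig": ner_config})
    return Request(f"{TIKA_URL}?{query}", data=document, method="PUT")


# Two crawling jobs would simply name two different rule sets.
job_a = build_ner_request(b"some text", "regex1")
job_b = build_ner_request(b"some text", "regex2")
print(job_a.full_url)
print(job_b.full_url)
```

With this shape, each ManifoldCF job would only need to carry the name of its rule set; the rules themselves would stay on the server.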

Regards,

Cédric
CEO
France Labs - Your knowledge, now
Datafari Enterprise Search

RE: Adding arguments to configure tika from the rest calls

Posted by Julien Massiera <ju...@francelabs.com>.

Hi Tim,

Thanks for the documentation! It is clear in my opinion.

It is a very interesting way to configure parsers on the fly depending on what we want to do. In the case of the NER parser, the benefit would be limited though, as the only configuration option the parser offers is restricting the MIME types it processes (if I am not mistaken).

In our case, we found it very frustrating that the NER parser does not let us:

1/ Select which implementation classes to use through the parser config in the tika-config.xml file rather than through a JVM parameter
2/ Enable or disable specific parsers for a given request
3/ For the RegexNERecogniser implementation, define named groups of regexes in the parser config and choose the group(s) to apply through a header parameter (and disable RegexNERecogniser when we do not want regex NER, which goes back to point 2/)

Based on this, and according to what you said, the main solution would be to add more options to the NER parser config (point 3/) and create a NERServerConfig like the PDFServerConfig and the TesseractServerConfig. The cherry on the cake would be the ability to specify a tika-config.xml file via URL parameters.
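
To make point 3/ more concrete, here is a minimal sketch of the selection logic, written in Python for brevity rather than as Tika code. The group names, the patterns and the idea of a comma-separated header value are all assumptions for illustration; nothing here exists in RegexNERecogniser today.

```python
import re

# Hypothetical named groups of regex rules, as they might be declared once
# in the parser config. Group names and patterns are illustrative only.
REGEX_GROUPS = {
    "contacts": {
        "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    },
    "finance": {
        "IBAN": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
    },
}


def recognise(text: str, header_value: str) -> dict[str, list[str]]:
    """Apply only the groups named in a comma-separated header value.

    An empty header value selects no group, which effectively disables
    regex NER for that request (point 2/).
    """
    selected = [g.strip() for g in header_value.split(",") if g.strip()]
    found: dict[str, list[str]] = {}
    for group in selected:
        # Unknown group names are silently ignored in this sketch.
        for entity_type, pattern in REGEX_GROUPS.get(group, {}).items():
            matches = pattern.findall(text)
            if matches:
                found[entity_type] = matches
    return found


sample = "Write to jane@example.org, IBAN FR7630006000011234567890189."
print(recognise(sample, "contacts"))
print(recognise(sample, "contacts, finance"))
print(recognise(sample, ""))
```

The rules are compiled once, and each request only selects among them, which keeps the split between creation-time and runtime settings that Tim described.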

What do you think?

Julien

-----Original Message-----
From: Tim Allison <ta...@apache.org>
Sent: Wednesday, February 15, 2023 19:51
To: dev@tika.apache.org
Subject: Re: Adding arguments to configure tika from the rest calls

Here's a first attempt at documentation:
https://cwiki.apache.org/confluence/display/TIKA/Configuring+Parsers+At+Parse+Time+in+tika-server

Please let me know if you have any questions or want write access to improve the documentation!

On Wed, Feb 15, 2023 at 11:07 AM Julien Massiera <ju...@francelabs.com> wrote:
>
> Hi Tim,
>
> Following up on our mail thread, could you share more documentation on how to use headers to configure the PDFParser on the fly?
>
> Thanks,
> Julien
>
> -----Original Message-----
> From: Julien Massiera <ju...@francelabs.com>
> Sent: Friday, February 3, 2023 13:08
> To: dev@tika.apache.org
> Subject: RE: Adding arguments to configure tika from the rest calls
>
> Hi Tim,
>
> The NER parse config via headers, like the PDFParserConfig, sounds like an
> interesting approach, but I only discovered that feature thanks to your
> reply. I tried to find documentation about it; unfortunately, the only
> thing I found was a TBD note on this page:
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=109454066
>
> Could you tell us more about how to use it, so that we can test it and get a better idea of how it works and how useful it would be for NER?
>
> Thanks,
> Julien
>
> -----Original Message-----
> From: Tim Allison <ta...@apache.org>
> Sent: Tuesday, January 31, 2023 13:19
> To: dev@tika.apache.org
> Subject: Re: Adding arguments to configure tika from the rest calls
>
> Configuring specific parsers that don't have their own parser config 
> objects is a pain.  For example, we currently have an option to set 
> PDFParserConfig and TesseractParserConfig options via headers to 
> tika-server...and we have a way to extend this functionality to other 
> parsers.  This option is "not pretty"(TM), but it has the benefit of 
> correctly differentiating creation-time settings (applies to all
> files) from runtime-settings (applies to a specific file), and this process reuses a single static parser so there's no overhead in rebuilding the parser object for every file.
>
> So, we could add an ner parse config along the lines of the PDFParserConfig, or...
>
> ...I regret I can't tell if this is what you're proposing, but we could specify a tika-config.xml file via url parameters?  This would add overhead of loading the full parser for each parse where you specify your own custom parser.  Or, I guess, we could load x many default parsers and name them?
>
> On Tue, Jan 31, 2023 at 5:34 AM Cedric Ulmer <ce...@francelabs.com> wrote:
> >
> > Hi all,
> >
> > We are experimenting with the regex-based detection capabilities of Tika combined with ManifoldCF, and an idea came to mind. First, the problem: for now, a Tika server has only one configuration. Therefore, if we set up regex-based entity extraction, it is applied to all documents (of the given MIME types). So if ManifoldCF calls the Tika server during a crawling phase, we cannot have different regex rules per crawling job: every job that calls the Tika server is processed the same way.
> >
> > So here is the idea: would it be possible to make calls to a Tika
> > server configurable via REST parameters, so that we could choose
> > which config to use for the current call? Something like:
> > ?enableNER=true&NERConfig=regex1
> >
> > Regards,
> >
> > Cédric
> > CEO
> > France Labs - Your knowledge, now
> > Datafari Enterprise Search
> >
>
>

Re: Adding arguments to configure tika from the rest calls

Posted by Tim Allison <ta...@apache.org>.

Here's a first attempt at documentation:
https://cwiki.apache.org/confluence/display/TIKA/Configuring+Parsers+At+Parse+Time+in+tika-server

Please let me know if you have any questions or want write access to
improve the documentation!

RE: Adding arguments to configure tika from the rest calls

Posted by Julien Massiera <ju...@francelabs.com>.

Hi Tim,

Following up on our mail thread, could you share more documentation on how to use headers to configure the PDFParser on the fly?

Thanks,
Julien

RE: Adding arguments to configure tika from the rest calls

Posted by Julien Massiera <ju...@francelabs.com>.

Hi Tim,

The NER parse config via headers, like the PDFParserConfig, sounds like an interesting approach, but I only discovered that feature thanks to your reply. I tried to find documentation about it; unfortunately, the only thing I found was a TBD note on this page: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=109454066

Could you tell us more about how to use it, so that we can test it and get a better idea of how it works and how useful it would be for NER?

Thanks,
Julien 

Re: Adding arguments to configure tika from the rest calls

Posted by Tim Allison <ta...@apache.org>.

Configuring specific parsers that don't have their own parser config
objects is a pain.  For example, we currently have an option to set
PDFParserConfig and TesseractParserConfig options via headers to
tika-server...and we have a way to extend this functionality to other
parsers.  This option is "not pretty"(TM), but it has the benefit of
correctly differentiating creation-time settings (applies to all
files) from runtime-settings (applies to a specific file), and this
process reuses a single static parser so there's no overhead in
rebuilding the parser object for every file.
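
For illustration, a client-side sketch of the header mechanism. It assumes that tika-server maps headers starting with `X-Tika-PDF` onto PDFParserConfig settings and headers starting with `X-Tika-OCR` onto the Tesseract config; the option names used here (`OcrStrategy`, `extractInlineImages`, `Language`) and the server URL are assumptions to verify against the wiki page for your Tika version. The request is built but not sent.

```python
from urllib.request import Request

# Header prefixes as described in the tika-server documentation; treat the
# exact spelling as an assumption and verify it for your Tika version.
PDF_PREFIX = "X-Tika-PDF"
OCR_PREFIX = "X-Tika-OCR"


def config_headers(pdf: dict[str, str], ocr: dict[str, str]) -> dict[str, str]:
    """Turn per-request parser options into prefixed HTTP headers."""
    headers = {PDF_PREFIX + name: value for name, value in pdf.items()}
    headers.update({OCR_PREFIX + name: value for name, value in ocr.items()})
    return headers


def build_parse_request(url: str, document: bytes, headers: dict[str, str]) -> Request:
    """Build (but do not send) a PUT request carrying the runtime settings."""
    return Request(url, data=document, headers=headers, method="PUT")


headers = config_headers(
    pdf={"OcrStrategy": "ocr_only", "extractInlineImages": "true"},
    ocr={"Language": "eng"},
)
request = build_parse_request("http://localhost:9998/tika", b"%PDF-1.7 ...", headers)
for name, value in headers.items():
    print(f"{name}: {value}")
```

The settings travel with one request only, so the server can keep reusing the same parser instance for every file.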

So, we could add an ner parse config along the lines of the
PDFParserConfig, or...

...I regret I can't tell if this is what you're proposing, but we
could specify a tika-config.xml file via url parameters?  This would
add overhead of loading the full parser for each parse where you
specify your own custom parser.  Or, I guess, we could load x many
default parsers and name them?
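
The "load x many parsers and name them" option could be sketched as below, in Python for brevity. `NamedParser`, `ParserRegistry` and the config paths are hypothetical stand-ins, not Tika classes; the point is only that the costly build happens once per name at startup and never per request.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NamedParser:
    """Stand-in for a fully built parser; building one is the costly step."""
    name: str
    config_path: str


class ParserRegistry:
    """Build each named parser once at startup, then reuse it per request."""

    def __init__(self, configs: dict[str, str]) -> None:
        self.build_count = 0
        self._parsers = {
            name: self._build(name, path) for name, path in configs.items()
        }

    def _build(self, name: str, path: str) -> NamedParser:
        # In a real server this would load a tika-config.xml from 'path'.
        self.build_count += 1
        return NamedParser(name, path)

    def get(self, name: str) -> NamedParser:
        # Unknown names fail fast instead of triggering a costly load.
        if name not in self._parsers:
            raise KeyError(f"no parser configured under the name {name!r}")
        return self._parsers[name]


registry = ParserRegistry({"regex1": "conf/regex1.xml", "regex2": "conf/regex2.xml"})
for _ in range(1000):
    registry.get("regex1")
print(registry.build_count)
```

By contrast, accepting an arbitrary tika-config.xml per request would mean running the build step on every call, which is the overhead mentioned above.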
