Posted to dev@tika.apache.org by Nick Burch <ap...@gagravarr.org> on 2015/06/06 21:30:20 UTC

Re: Configuring parsers and translators

Anyone have any thoughts on this?

On Fri, 8 May 2015, Nick Burch wrote:
> Hi All
>
> This came up in TIKA-1623, but I thought it might be better brought out to 
> the list for discussion
>
> To configure parsers on a per-document basis, such as setting PDF 
> spacing tolerances, or telling Tesseract what language it should be 
> OCRing for, we have the *Config objects. You create one of these, use 
> the setters to configure it for your document, pop it onto the Parse 
> context and it's used when processing your document
>
> To configure parsers and translators on a per-JVM basis, to apply to all 
> documents processed, it's a bit less consistent. At least some look for 
> a properties file with a specific name, usually in the tika namespace, 
> and grab their settings / keys / etc out of that. At least some expect 
> to find a *Config with their program path on it, even though that 
> remains constant between documents. None of them support getting their 
> settings from the Tika Config
>
>
> As part of our evolution of parser preferences, we're moving towards 
> people either being able to set their preferences in code, or being able 
> to supply a Tika Config xml which sets their parser preferences or 
> overrides certain bits of the default. The code option works for people 
> who want to declare certain specific things, the Tika Config one gives 
> the same functionality but allows a consistent and clean way to set it 
> between Tika App, Tika Server and java code.
>
> Another related example is the External Parser support. Because you can 
> have multiple External Parser instances in your setup, one per format / 
> program, we look for all the 
> org/apache/tika/parser/external/tika-external-parsers.xml files on the 
> classpath, and create parser instances based on definitions in there
>
>
> What do we think about setting executable paths and keys/logins for 
> parsers like OCR, Strings, Translators etc? Always on ParseContext? 
> Properties? Custom xml config? Tika config xml? Other? Combination?
>
> Nick
>
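
A minimal sketch of the per-document mechanism described above: a settings object is created, configured through its setters, and placed on a context that the parser consults for that one document. OcrSettings, DocumentContext and OcrParserSketch are hypothetical stand-ins written for illustration, not Tika classes.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a per-document *Config object. */
class OcrSettings {
    private String language = "eng";

    String getLanguage() { return language; }
    void setLanguage(String language) { this.language = language; }
}

/** Hypothetical type-keyed holder, playing the role of the parse context. */
class DocumentContext {
    private final Map<Class<?>, Object> entries = new HashMap<>();

    <T> void set(Class<T> key, T value) { entries.put(key, value); }

    /** Returns the entry stored under the key, or the fallback if none was set. */
    <T> T get(Class<T> key, T fallback) {
        Object value = entries.get(key);
        return value == null ? fallback : key.cast(value);
    }
}

/** Hypothetical parser that reads its per-document settings from the context. */
class OcrParserSketch {
    String languageFor(DocumentContext context) {
        return context.get(OcrSettings.class, new OcrSettings()).getLanguage();
    }
}

class PerDocumentConfigDemo {
    public static void main(String[] args) {
        OcrSettings german = new OcrSettings();
        german.setLanguage("deu");

        DocumentContext context = new DocumentContext();
        context.set(OcrSettings.class, german);

        OcrParserSketch parser = new OcrParserSketch();
        System.out.println(parser.languageFor(context));               // configured document
        System.out.println(parser.languageFor(new DocumentContext())); // falls back to default
    }
}
```

Keying the context by class means each parser can look up its own settings type without knowing about anyone else's.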

Re: Configuring parsers and translators

Posted by Nick Burch <ap...@gagravarr.org>.
On Mon, 15 Jun 2015, Mattmann, Chris A (3980) wrote:
> Hey Nick, I guess my point is that parser context (aka config properties 
> for parsers) and custom config files, e.g. x.properties loaded from the 
> classpath, aren't configured from Tika app or server

Ah, good point. In my ideal world, you'd set the "all documents of this 
kind" settings (eg paths) in the config, then set the "this document 
only" settings (eg pdf column count, pdf inline image settings) via a 
command line option to the app / request header to the server, converted 
into ParseContext options[1]. That would then be largely the same as for 
the pure-Java users.

Hopefully there aren't too many settings which are debatable as to what 
they are!

Not sure how huge a tika config file this would all lead to...

I could see some value in properties files, for things that don't change 
between machines but do need configuration, eg the mappings for external 
parsers. Since it isn't obvious if you've missed one, I'm not sure we want 
to use them heavily for customisations for paths etc


Also, since you mention having been caught out by missing jars or missing 
service files, maybe we need to put something on the wiki about how to 
check if you have what you expected? (IIRC we log if a parser can't be 
found or can't be loaded, so mostly it's about how to enable that)

Nick

[1] Do we have tickets for adding these in yet?
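
One way the "request header converted into per-document options" idea could look, as a sketch only. The header prefix x-doc-option- and the option names are assumptions made up for this example, not something the app or server is known to support.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class HeaderOptionsSketch {
    /** Hypothetical header prefix marking a per-document option. */
    static final String PREFIX = "x-doc-option-";

    /** Collects per-document options from request headers and ignores all other headers. */
    static Map<String, String> perDocumentOptions(Map<String, String> headers) {
        Map<String, String> options = new LinkedHashMap<>();
        for (Map.Entry<String, String> header : headers.entrySet()) {
            // HTTP header names are case-insensitive, so normalise before matching.
            String name = header.getKey().toLowerCase(Locale.ROOT);
            if (name.startsWith(PREFIX)) {
                options.put(name.substring(PREFIX.length()), header.getValue());
            }
        }
        return options;
    }

    public static void main(String[] args) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Content-Type", "application/pdf");
        headers.put("X-Doc-Option-ocr-language", "fra");
        System.out.println(perDocumentOptions(headers));
    }
}
```

The resulting map would then be turned into whatever settings objects the parsers expect, so that server, app and pure-Java callers end up on the same code path.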

Re: Configuring parsers and translators

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Hey Nick, I guess my point is that parser context (aka config properties for parsers) and custom config files, e.g. x.properties loaded from the classpath, aren't configured from Tika app or server 

Sent from my iPhone

> On Jun 15, 2015, at 8:34 AM, Nick Burch <ap...@gagravarr.org> wrote:
> 
>> On Mon, 15 Jun 2015, Mattmann, Chris A (3980) wrote:
>> We also need to be mindful of Tika app and server where there is no current way to see config other than Tika config file and multiple conflicting ways to set it...
> 
> Hmm? Both app and server optionally take a config file, don't they? And both offer an option/flag/endpoint to tell you the parsers they found, the detectors they found, parser decorations etc.
> 
> I'd say that the app and the server are the easiest ways to know what's going on with your Tika install, it's the pure-Java case where it's harder to know what you do or don't have!
> 
> Or have I misunderstood the use case?
> 
> Nick

Re: Configuring parsers and translators

Posted by Nick Burch <ap...@gagravarr.org>.
On Mon, 15 Jun 2015, Mattmann, Chris A (3980) wrote:
> We also need to be mindful of Tika app and server where there is no 
> current way to see config other than Tika config file and multiple 
> conflicting ways to set it...

Hmm? Both app and server optionally take a config file, don't they? And 
both offer an option/flag/endpoint to tell you the parsers they found, the 
detectors they found, parser decorations etc.

I'd say that the app and the server are the easiest ways to know what's 
going on with your Tika install, it's the pure-Java case where it's harder 
to know what you do or don't have!

Or have I misunderstood the use case?

Nick

Re: Configuring parsers and translators

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
We also need to be mindful of Tika app and server where there is no current way to see config other than Tika config file and multiple conflicting ways to set it...

Sent from my iPhone

> On Jun 15, 2015, at 8:02 AM, Nick Burch <ap...@gagravarr.org> wrote:
> 
>> On Mon, 15 Jun 2015, Allison, Timothy B. wrote:
>> Agreed.  They are two separate but related issues.  TIKA-1508 should be fairly straightforward.  Should I start coding it?  Any other recommendations/concerns?
> 
> My personal view is that properties/configuration which apply to all documents of a type should be set at Parser creation time, either from a Tika Config object or someone in code doing "Parser p = new FooParser(); p.setblah();". Properties/config which vary from document to document should be set on the ParseContext
> 
> Not sure if we had consensus on that as a policy though?
> 
> 
> In terms of TIKA-1508, any chance you could pick two parsers which are currently configured somehow, and update the issue to show how they are configured now, and how you'd see them being configured in Tika Config? I think it might be easier to review with some concrete cases, rather than the abstract idea we have now
> 
> Nick

RE: Configuring parsers and translators

Posted by Nick Burch <ap...@gagravarr.org>.
On Mon, 15 Jun 2015, Allison, Timothy B. wrote:
> Agreed.  They are two separate but related issues.  TIKA-1508 should be 
> fairly straightforward.  Should I start coding it?  Any other 
> recommendations/concerns?

My personal view is that properties/configuration which apply to all 
documents of a type should be set at Parser creation time, either from a 
Tika Config object or someone in code doing "Parser p = new FooParser(); 
p.setblah();". Properties/config which vary from document to document 
should be set on the ParseContext

Not sure if we had consensus on that as a policy though?


In terms of TIKA-1508, any chance you could pick two parsers which are 
currently configured somehow, and update the issue to show how they are 
configured now, and how you'd see them being configured in Tika Config? I 
think it might be easier to review with some concrete cases, rather than 
the abstract idea we have now

Nick
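
The split proposed above can be sketched roughly as follows, assuming a made-up FooParserSketch: settings that hold for every document are set once on the parser, and a per-document value (standing in for something taken from the ParseContext) overrides the parser's default for a single run. The setting names and the command-line shape are illustrative only.

```java
import java.util.Optional;

/** Hypothetical parser; names and settings are illustrative only. */
class FooParserSketch {
    // Same for every document: set once, when the parser is created or configured.
    private String executablePath = "foo";
    // Parser-wide default for a setting that individual documents may override.
    private int defaultColumnCount = 1;

    void setExecutablePath(String executablePath) { this.executablePath = executablePath; }
    void setDefaultColumnCount(int columns) { this.defaultColumnCount = columns; }

    /** Describes how one document would be processed; the override is the per-document part. */
    String describeRun(String documentName, Optional<Integer> columnOverride) {
        int columns = columnOverride.orElse(defaultColumnCount);
        return executablePath + " --columns " + columns + " " + documentName;
    }
}

class ParserPolicyDemo {
    public static void main(String[] args) {
        FooParserSketch parser = new FooParserSketch();
        parser.setExecutablePath("/opt/foo/bin/foo");   // applies to all documents

        System.out.println(parser.describeRun("a.pdf", Optional.empty()));
        System.out.println(parser.describeRun("b.pdf", Optional.of(2)));  // this document only
    }
}
```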

Re: Configuring parsers and translators

Posted by Konstantin Gribov <gr...@gmail.com>.
I think there's a third concern that should be taken into account: dynamic
configuration (e.g. based on metadata, like a password provider on a
per-document basis).
Currently you can only inject dynamically configurable behavior via
ParseContext, but that adds complexity to recursive parser implementations.

-- 
Best regards,
Konstantin Gribov
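
To make the dynamic case concrete, here is a rough sketch in which a password is chosen from each document's metadata, and a recursive walk has to hand the same callback down to every embedded document. PasswordLookup, DocNode and the resourceName key are assumptions invented for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical callback: picks a password from a document's metadata. */
interface PasswordLookup {
    String passwordFor(Map<String, String> metadata);
}

/** Hypothetical document with metadata and embedded child documents. */
class DocNode {
    final Map<String, String> metadata;
    final List<DocNode> embedded = new ArrayList<>();

    DocNode(Map<String, String> metadata) { this.metadata = metadata; }
}

class RecursivePasswordSketch {
    /** Walks a document and its embedded documents, asking the lookup once per document. */
    static List<String> collectPasswords(DocNode doc, PasswordLookup lookup) {
        List<String> passwords = new ArrayList<>();
        passwords.add(lookup.passwordFor(doc.metadata));
        for (DocNode child : doc.embedded) {
            // The same lookup has to be threaded through every level of recursion.
            passwords.addAll(collectPasswords(child, lookup));
        }
        return passwords;
    }

    public static void main(String[] args) {
        DocNode root = new DocNode(Collections.singletonMap("resourceName", "outer.zip"));
        root.embedded.add(new DocNode(Collections.singletonMap("resourceName", "inner.pdf")));

        Map<String, String> known = new HashMap<>();
        known.put("inner.pdf", "s3cret");

        // Documents without a known password get an empty string.
        PasswordLookup lookup = metadata -> known.getOrDefault(metadata.get("resourceName"), "");
        System.out.println(collectPasswords(root, lookup));
    }
}
```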

On Mon, 15 Jun 2015 at 16:52, Allison, Timothy B. <ta...@mitre.org> wrote:

> Agreed.  They are two separate but related issues.  TIKA-1508 should be
> fairly straightforward.  Should I start coding it?  Any other
> recommendations/concerns?
>
>
>
> -----Original Message-----
> From: Tyler Palsulich [mailto:tpalsulich@gmail.com]
> Sent: Saturday, June 13, 2015 12:54 PM
> To: dev@tika.apache.org
> Subject: Re: Configuring parsers and translators
>
> It seems like there are two goals here, both aiming to centralize
> configuration:
>
> 1. Provide an easy mechanism to configure which parsers to use when
> (TIKA-1509).
> 2. Configure all individual parser parameters in Tika Config (not in, for
> example, TesseractOCRConfig.properties) (TIKA-1508).
>
> I'm also in favor of consolidating everything in Tika Config.
>
> Tyler
>
> On Mon, Jun 8, 2015 at 7:25 AM Allison, Timothy B. <ta...@mitre.org>
> wrote:
>
> > Tyler, I see your devil's advocate point.
> >
> > I strongly agree with Chris about the benefit of centralizing
> > configuration and making it easy to dump and modify the TikaConfig file.
> >
> > Even though the TikaConfig file might get ugly, it would be far better to
> > have everything nailed down there than searching through service
> > loaders...IMHO.
> >
> > I opened TIKA-1508 a while ago and haven't had any time to work on
> > it...this just deals with simple parameter settings for parsers, not the
> > far more difficult/interesting stuff that we've discussed with composite
> > parsers.
> >
> > >> My main worry with putting it all into config xml is that we accidentally
> > end up re-inventing spring badly...
> >
> > Yeah, or re-inventing Solr's parameter loading as my example does... :(
> >
> > I think that basic parameter setting should at least be fairly trivial to
> > code...time allowing...argh.
> >
> >
> > -----Original Message-----
> > From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
> > Sent: Saturday, June 06, 2015 7:01 PM
> > To: dev@tika.apache.org
> > Subject: Re: Configuring parsers and translators
> >
> > Hey Tyler,
> >
> > I hear you, but balance that against all the hidden things here
> > and there, and everywhere, that I constantly keep discovering and
> > having to pore through lines of TikaConfig - service loaders, class
> > loaders.
> >
> > When things work right - no problem. When something goes wrong;
> > HUGE waste of time.
> >
> > Cheers,
> > Chris
> >
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > Chris Mattmann, Ph.D.
> > Chief Architect
> > Instrument Software and Science Data Systems Section (398)
> > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > Office: 168-519, Mailstop: 168-527
> > Email: chris.a.mattmann@nasa.gov
> > WWW:  http://sunset.usc.edu/~mattmann/
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > Adjunct Associate Professor, Computer Science Department
> > University of Southern California, Los Angeles, CA 90089 USA
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >
> >
> >
> >
> > -----Original Message-----
> > From: Tyler Palsulich <tp...@gmail.com>
> > Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
> > Date: Saturday, June 6, 2015 at 3:59 PM
> > To: "dev@tika.apache.org" <de...@tika.apache.org>
> > Subject: Re: Configuring parsers and translators
> >
> > >(Devil's advocate hat slightly on.) My one hesitation about putting it all
> > >into tika-config is that the default might get to be a monstrosity --
> > >difficult for new users to use.
> > >
> > >Tyler
> > >
> > >On Sat, Jun 6, 2015 at 3:48 PM Mattmann, Chris A (3980) <
> > >chris.a.mattmann@jpl.nasa.gov> wrote:
> > >
> > >> I think it would be great to have all this in the Tika Config.
> > >>
> > >> The one thing then is to provide an example default config and
> > >> to make it *hugely* clear rather than all the levels of indirection
> > >> that we currently have going on which makes it super hard when
> > >> there is a config error (SPI, swallowing print messages, etc.)
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> -----Original Message-----
> > >> From: Tyler Palsulich <tp...@gmail.com>
> > >> Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
> > >> Date: Saturday, June 6, 2015 at 3:45 PM
> > >> To: "dev@tika.apache.org" <de...@tika.apache.org>
> > >> Subject: Re: Configuring parsers and translators
> > >>
> > >> >Hi Nick,
> > >> >
> > >> >I've been mulling this over since you sent the first message. But, I'm
> > >> >afraid I don't have a good solution or developed ideas.
> > >> >
> > >> >I agree, it would be very nice to consolidate all configuration for all
> > >> >parsers in the server and app.
> > >> >
> > >> >Is it feasible to put everything into tika-config? Then Parser
> > >> >implementations would read the config to pull out their own configuration.
> > >> >Or, would it be better to keep some configuration separate? Documentation
> > >> >would be an issue if every parser defines its own metadata keys... But, it
> > >> >might be an improvement since we don't have "free form" properties and
> > >> >configuration files.
> > >> >
> > >> >Tyler
> > >> >
> > >>
> > >>
> >
> >
>

RE: Configuring parsers and translators

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Agreed.  They are two separate but related issues.  TIKA-1508 should be fairly straightforward.  Should I start coding it?  Any other recommendations/concerns?



-----Original Message-----
From: Tyler Palsulich [mailto:tpalsulich@gmail.com] 
Sent: Saturday, June 13, 2015 12:54 PM
To: dev@tika.apache.org
Subject: Re: Configuring parsers and translators

It seems like there are two goals here, both aiming to centralize
configuration:

1. Provide an easy mechanism to configure which parsers to use when
(TIKA-1509).
2. Configure all individual parser parameters in Tika Config (not in, for
example, TesseractOCRConfig.properties) (TIKA-1508).

I'm also in favor of consolidating everything in Tika Config.

Tyler


Re: Configuring parsers and translators

Posted by Tyler Palsulich <tp...@gmail.com>.
It seems like there are two goals here, both aiming to centralize
configuration:

1. Provide an easy mechanism to configure which parsers to use when
(TIKA-1509).
2. Configure all individual parser parameters in Tika Config (not in, for
example, TesseractOCRConfig.properties) (TIKA-1508).

I'm also in favor of consolidating everything in Tika Config.

Tyler
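
As a rough sketch of the second goal, the code below reads per-parser parameters out of a single XML config string. The param element, its name attribute and the example.OcrParser class name are assumptions chosen for illustration; they are not a settled Tika Config schema.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

class ParserParamsSketch {
    /** Maps each parser class name to the name/value pairs of its param children. */
    static Map<String, Map<String, String>> readParams(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not read config", e);
        }
        Map<String, Map<String, String>> byParser = new LinkedHashMap<>();
        NodeList parsers = doc.getElementsByTagName("parser");
        for (int i = 0; i < parsers.getLength(); i++) {
            Element parser = (Element) parsers.item(i);
            Map<String, String> params = new LinkedHashMap<>();
            NodeList paramNodes = parser.getElementsByTagName("param");
            for (int j = 0; j < paramNodes.getLength(); j++) {
                Element param = (Element) paramNodes.item(j);
                params.put(param.getAttribute("name"), param.getTextContent().trim());
            }
            byParser.put(parser.getAttribute("class"), params);
        }
        return byParser;
    }

    public static void main(String[] args) {
        String xml = "<properties><parsers>"
                + "<parser class=\"example.OcrParser\">"
                + "<param name=\"tesseractPath\">/usr/bin</param>"
                + "<param name=\"language\">eng</param>"
                + "</parser>"
                + "</parsers></properties>";
        System.out.println(readParams(xml));
    }
}
```

Each parser would then apply its own map through its setters, which keeps the settings in one file rather than spread across separate properties files.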


RE: Configuring parsers and translators

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Tyler, I see your devil's advocate point.  

I strongly agree with Chris about the benefit of centralizing configuration and making it easy to dump and modify the TikaConfig file.

Even though the TikaConfig file might get ugly, it would be far better to have everything nailed down there than searching through service loaders...IMHO.

I opened TIKA-1508 a while ago and haven't had any time to work on it...this just deals with simple parameter settings for parsers, not the far more difficult/interesting stuff that we've discussed with composite parsers.

>> My main worry with putting it all into config xml is that we accidentally end up re-inventing Spring badly...

Yeah, or re-inventing Solr's parameter loading as my example does... :(

I think that basic parameter setting should at least be fairly trivial to code...time allowing...argh.
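
To make "basic parameter setting" a bit more concrete, here is a minimal sketch of one way it could be coded: name/value pairs (however they end up being read from the config) are pushed onto a settings object through its setters via reflection. This is not Tika's API. The class names SetterInjectionSketch and OcrSettings, and the parameter names language and timeoutSeconds, are made up for illustration. Unknown names fail loudly rather than being silently ignored, which is one possible answer to the "hard to debug config errors" concern raised in this thread.

```java
import java.lang.reflect.Method;
import java.util.LinkedHashMap;
import java.util.Map;

final class SetterInjectionSketch {

    /** Hypothetical settings bean for illustration; not a real Tika class. */
    public static final class OcrSettings {
        private String language = "eng";
        private int timeoutSeconds = 120;

        public void setLanguage(String language) { this.language = language; }
        public void setTimeoutSeconds(int timeoutSeconds) { this.timeoutSeconds = timeoutSeconds; }
        public String getLanguage() { return language; }
        public int getTimeoutSeconds() { return timeoutSeconds; }
    }

    /** Calls setXxx(value) for every name/value pair; throws if a name has no usable setter. */
    public static void apply(Object target, Map<String, String> params) {
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (!applyOne(target, e.getKey(), e.getValue())) {
                throw new IllegalArgumentException(
                        "No usable setter for parameter '" + e.getKey() + "' on "
                                + target.getClass().getName());
            }
        }
    }

    private static boolean applyOne(Object target, String name, String raw) {
        if (name.isEmpty()) {
            return false;
        }
        String setter = "set" + Character.toUpperCase(name.charAt(0)) + name.substring(1);
        for (Method m : target.getClass().getMethods()) {
            if (!m.getName().equals(setter) || m.getParameterCount() != 1) {
                continue;
            }
            Class<?> type = m.getParameterTypes()[0];
            Object value;
            if (type == String.class) {
                value = raw;
            } else if (type == int.class) {
                value = Integer.valueOf(raw);
            } else if (type == boolean.class) {
                value = Boolean.valueOf(raw);
            } else {
                continue; // only a few simple types are handled in this sketch
            }
            try {
                m.invoke(target, value);
            } catch (ReflectiveOperationException ex) {
                throw new IllegalStateException("Could not call " + setter, ex);
            }
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        OcrSettings settings = new OcrSettings();
        Map<String, String> params = new LinkedHashMap<>();
        params.put("language", "deu");
        params.put("timeoutSeconds", "30");
        apply(settings, params);
        System.out.println(settings.getLanguage() + " " + settings.getTimeoutSeconds());
    }
}
```

The same apply() call would work whether the pairs came from a config xml, a properties file or code, which is roughly the uniformity being asked for here.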


-----Original Message-----
From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov] 
Sent: Saturday, June 06, 2015 7:01 PM
To: dev@tika.apache.org
Subject: Re: Configuring parsers and translators

Hey Tyler,

I hear you, but balance that against all the hidden things here
and there, and everywhere, that I constantly keep discovering and
having to pore through lines of TikaConfig - service loaders, class
loaders.

When things work right - no problem. When something goes wrong;
HUGE waste of time.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++




-----Original Message-----
From: Tyler Palsulich <tp...@gmail.com>
Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
Date: Saturday, June 6, 2015 at 3:59 PM
To: "dev@tika.apache.org" <de...@tika.apache.org>
Subject: Re: Configuring parsers and translators

>(Devil's advocate hat slightly on.) My one hesitation about putting it all
>into tika-config is that the default might get to be a monstrosity --
>difficult for new users to use.
>
>Tyler
>
>On Sat, Jun 6, 2015 at 3:48 PM Mattmann, Chris A (3980) <
>chris.a.mattmann@jpl.nasa.gov> wrote:
>
>> I think it would be great to have all this in the Tika Config.
>>
>> The one thing then is to provide an example default config and
>> to make it *hugely* clear rather than all the levels of indirection
>> that we currently have going on which makes it super hard when
>> there is a config error (SPI, swallowing print messages, etc.)
>>
>>
>> -----Original Message-----
>> From: Tyler Palsulich <tp...@gmail.com>
>> Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
>> Date: Saturday, June 6, 2015 at 3:45 PM
>> To: "dev@tika.apache.org" <de...@tika.apache.org>
>> Subject: Re: Configuring parsers and translators
>>
>> >Hi Nick,
>> >
>> >I've been mulling this over since you sent the first message. But, I'm
>> >afraid I don't have a good solution or developed ideas.
>> >
>> >I agree, it would be very nice to consolidate all configuration for all
>> >parsers in the server and app.
>> >
>> >Is it feasible to put everything into tika-config? Then Parser
>> >implementations would read the config to pull out their own configuration.
>> >Or, would it be better to keep some configuration separate? Documentation
>> >would be an issue if every parser defines its own metadata keys... But, it
>> >might be an improvement since we don't have "free form" properties and
>> >configuration files.
>> >
>> >Tyler


Re: Configuring parsers and translators

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Hey Tyler,

I hear you, but balance that against all the hidden things here
and there, and everywhere, that I constantly keep discovering and
having to pore through lines of TikaConfig - service loaders, class
loaders.

When things work right - no problem. When something goes wrong;
HUGE waste of time.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++




-----Original Message-----
From: Tyler Palsulich <tp...@gmail.com>
Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
Date: Saturday, June 6, 2015 at 3:59 PM
To: "dev@tika.apache.org" <de...@tika.apache.org>
Subject: Re: Configuring parsers and translators

>(Devil's advocate hat slightly on.) My one hesitation about putting it all
>into tika-config is that the default might get to be a monstrosity --
>difficult for new users to use.
>
>Tyler
>
>On Sat, Jun 6, 2015 at 3:48 PM Mattmann, Chris A (3980) <
>chris.a.mattmann@jpl.nasa.gov> wrote:
>
>> I think it would be great to have all this in the Tika Config.
>>
>> The one thing then is to provide an example default config and
>> to make it *hugely* clear rather than all the levels of indirection
>> that we currently have going on which makes it super hard when
>> there is a config error (SPI, swallowing print messages, etc.)
>>
>>
>> -----Original Message-----
>> From: Tyler Palsulich <tp...@gmail.com>
>> Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
>> Date: Saturday, June 6, 2015 at 3:45 PM
>> To: "dev@tika.apache.org" <de...@tika.apache.org>
>> Subject: Re: Configuring parsers and translators
>>
>> >Hi Nick,
>> >
>> >I've been mulling this over since you sent the first message. But, I'm
>> >afraid I don't have a good solution or developed ideas.
>> >
>> >I agree, it would be very nice to consolidate all configuration for all
>> >parsers in the server and app.
>> >
>> >Is it feasible to put everything into tika-config? Then Parser
>> >implementations would read the config to pull out their own configuration.
>> >Or, would it be better to keep some configuration separate? Documentation
>> >would be an issue if every parser defines its own metadata keys... But, it
>> >might be an improvement since we don't have "free form" properties and
>> >configuration files.
>> >
>> >Tyler


Re: Configuring parsers and translators

Posted by Nick Burch <ap...@gagravarr.org>.
On Sat, 6 Jun 2015, Tyler Palsulich wrote:
> (Devil's advocate hat slightly on.) My one hesitation about putting it 
> all into tika-config is that the default might get to be a monstrosity 
> -- difficult for new users to use.

Assuming you don't want any translators, and have no non-standard paths to 
external parsers, and are happy with default parser orderings, then your 
default config would be:

<properties/>

(The plan so far remains with using the service loader to find parsers, 
detectors and friends, with the config just being used when you want to 
override parsers or parser orderings)
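
As a rough picture of "defaults unless you override", the sketch below resolves a setting by looking at a per-document value first, then a JVM-wide config override, then a built-in default. The precedence order, the class name LayeredSettingsSketch and the keys ocr.path / ocr.language are assumptions made for illustration; nothing here is existing Tika code or something the list has agreed on.

```java
import java.util.HashMap;
import java.util.Map;

final class LayeredSettingsSketch {
    private final Map<String, String> builtInDefaults = new HashMap<>();
    private final Map<String, String> configOverrides = new HashMap<>();

    /** Registers a value that applies when nothing else is configured. */
    public LayeredSettingsSketch withDefault(String key, String value) {
        builtInDefaults.put(key, value);
        return this;
    }

    /** Registers a JVM-wide override, as a config file might supply. */
    public LayeredSettingsSketch withConfigOverride(String key, String value) {
        configOverrides.put(key, value);
        return this;
    }

    /** Per-document value wins, then the JVM-wide config, then the built-in default. */
    public String resolve(String key, Map<String, String> perDocument) {
        if (perDocument != null && perDocument.containsKey(key)) {
            return perDocument.get(key);
        }
        if (configOverrides.containsKey(key)) {
            return configOverrides.get(key);
        }
        return builtInDefaults.get(key);
    }

    public static void main(String[] args) {
        LayeredSettingsSketch settings = new LayeredSettingsSketch()
                .withDefault("ocr.language", "eng")
                .withDefault("ocr.path", "tesseract");

        // No overrides yet: this is the "empty config" case, so defaults apply.
        System.out.println(settings.resolve("ocr.path", null));

        // A JVM-wide override plus a per-document override.
        settings.withConfigOverride("ocr.path", "/opt/tesseract/bin/tesseract");
        Map<String, String> perDocument = new HashMap<>();
        perDocument.put("ocr.language", "fra");
        System.out.println(settings.resolve("ocr.path", perDocument));
        System.out.println(settings.resolve("ocr.language", perDocument));
    }
}
```

With no overrides registered, every lookup falls through to the defaults, which mirrors the idea that an empty config should change nothing.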


My main worry with putting it all into config xml is that we accidentally 
end up re-inventing Spring badly...

Nick

Re: Configuring parsers and translators

Posted by Tyler Palsulich <tp...@gmail.com>.
(Devil's advocate hat slightly on.) My one hesitation about putting it all
into tika-config is that the default might get to be a monstrosity --
difficult for new users to use.

Tyler

On Sat, Jun 6, 2015 at 3:48 PM Mattmann, Chris A (3980) <
chris.a.mattmann@jpl.nasa.gov> wrote:

> I think it would be great to have all this in the Tika Config.
>
> The one thing then is to provide an example default config and
> to make it *hugely* clear rather than all the levels of indirection
> that we currently have going on which makes it super hard when
> there is a config error (SPI, swallowing print messages, etc.)
>
>
> -----Original Message-----
> From: Tyler Palsulich <tp...@gmail.com>
> Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
> Date: Saturday, June 6, 2015 at 3:45 PM
> To: "dev@tika.apache.org" <de...@tika.apache.org>
> Subject: Re: Configuring parsers and translators
>
> >Hi Nick,
> >
> >I've been mulling this over since you sent the first message. But, I'm
> >afraid I don't have a good solution or developed ideas.
> >
> >I agree, it would be very nice to consolidate all configuration for all
> >parsers in the server and app.
> >
> >Is it feasible to put everything into tika-config? Then Parser
> >implementations would read the config to pull out their own configuration.
> >Or, would it be better to keep some configuration separate? Documentation
> >would be an issue if every parser defines its own metadata keys... But, it
> >might be an improvement since we don't have "free form" properties and
> >configuration files.
> >
> >Tyler

Re: Configuring parsers and translators

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
I think it would be great to have all this in the Tika Config.

The one thing then is to provide an example default config and
to make it *hugely* clear rather than all the levels of indirection
that we currently have going on which makes it super hard when
there is a config error (SPI, swallowing print messages, etc.)


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++




-----Original Message-----
From: Tyler Palsulich <tp...@gmail.com>
Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
Date: Saturday, June 6, 2015 at 3:45 PM
To: "dev@tika.apache.org" <de...@tika.apache.org>
Subject: Re: Configuring parsers and translators

>Hi Nick,
>
>I've been mulling this over since you sent the first message. But, I'm
>afraid I don't have a good solution or developed ideas.
>
>I agree, it would be very nice to consolidate all configuration for all
>parsers in the server and app.
>
>Is it feasible to put everything into tika-config? Then Parser
>implementations would read the config to pull out their own configuration.
>Or, would it be better to keep some configuration separate? Documentation
>would be an issue if every parser defines its own metadata keys... But, it
>might be an improvement since we don't have "free form" properties and
>configuration files.
>
>Tyler


Re: Configuring parsers and translators

Posted by Tyler Palsulich <tp...@gmail.com>.
Hi Nick,

I've been mulling this over since you sent the first message. But, I'm
afraid I don't have a good solution or developed ideas.

I agree, it would be very nice to consolidate all configuration for all
parsers in the server and app.

Is it feasible to put everything into tika-config? Then Parser
implementations would read the config to pull out their own configuration.
Or, would it be better to keep some configuration separate? Documentation
would be an issue if every parser defines its own metadata keys... But, it
might be an improvement since we don't have "free form" properties and
configuration files.
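
To illustrate what "read the config to pull out their own configuration" might look like, here is a small sketch using only the JDK's DOM parser. The xml layout (a parser element with a class attribute and nested param elements), the class name example.OcrParser and the parameter names are all hypothetical; this is not the current tika-config format, just one shape it could take.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class ParserParamSketch {

    /** Hypothetical config layout, invented for this example. */
    public static final String SAMPLE =
            "<properties>"
          + "  <parsers>"
          + "    <parser class=\"example.OcrParser\">"
          + "      <param name=\"path\">/opt/tesseract/bin/tesseract</param>"
          + "      <param name=\"language\">eng</param>"
          + "    </parser>"
          + "  </parsers>"
          + "</properties>";

    /** Returns the param name/value pairs declared for one parser class (empty if none). */
    public static Map<String, String> paramsFor(String configXml, String parserClass) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(configXml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            // Surface config problems instead of swallowing them.
            throw new IllegalArgumentException("Could not read config xml", e);
        }
        Map<String, String> params = new LinkedHashMap<>();
        NodeList parsers = doc.getElementsByTagName("parser");
        for (int i = 0; i < parsers.getLength(); i++) {
            Element parser = (Element) parsers.item(i);
            if (!parserClass.equals(parser.getAttribute("class"))) {
                continue;
            }
            NodeList declared = parser.getElementsByTagName("param");
            for (int j = 0; j < declared.getLength(); j++) {
                Element param = (Element) declared.item(j);
                params.put(param.getAttribute("name"), param.getTextContent().trim());
            }
        }
        return params;
    }

    public static void main(String[] args) {
        System.out.println(paramsFor(SAMPLE, "example.OcrParser"));
        // An empty config simply yields no parameters for any parser.
        System.out.println(paramsFor("<properties/>", "example.OcrParser"));
    }
}
```

Each parser would only ever see the entries declared under its own class name, so the documentation burden becomes one list of param names per parser rather than a set of unrelated files.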

Tyler

On Sat, Jun 6, 2015 at 12:30 PM Nick Burch <ap...@gagravarr.org> wrote:

> Anyone have any thoughts on this?
>
> On Fri, 8 May 2015, Nick Burch wrote:
> > Hi All
> >
> > This came up in TIKA-1623, but I thought it might be better brought out to
> > the list for discussion
> >
> > To configure parsers on a per-document basis, such as setting PDF
> > spacing tolerances, or telling Tesseract what language it should be
> > OCRing for, we have the *Config objects. You create one of these, use
> > the setters to configure it for your document, pop it onto the Parse
> > context and it's used when processing your document
> >
> > To configure parsers and translators on a per-JVM basis, to apply to all
> > documents processed, it's a bit less consistent. At least some look for
> > a properties file with a specific name, usually in the tika namespace,
> > and grab their settings / keys / etc out of that. At least some expect
> > to find a *Config with their program path on it, even though that
> > remains constant between documents. None of them support getting their
> > settings from the Tika Config
> >
> >
> > As part of our evolution of parser preferences, we're moving towards
> > people either being able to set their preferences in code, or being able
> > to supply a Tika Config xml which sets their parser preferences or
> > overrides certain bits of the default. The code option works for people
> > who want to declare certain specific things, the Tika Config one gives
> > the same functionality but allows a consistent and clean way to set it
> > between Tika App, Tika Server and java code.
> >
> > Another related example is the External Parser support. Because you can
> > have multiple External Parser instances in your setup, one per format /
> > program, we look for all the
> > org/apache/tika/parser/external/tika-external-parsers.xml files on the
> > classpath, and create parser instances based on definitions in there
> >
> >
> > What do we think about setting executable paths and keys/logins for
> > parsers like OCR, Strings, Translators etc? Always on ParseContext?
> > Properties? Custom xml config? Tika config xml? Other? Combination?
> >
> > Nick
> >
>