Posted to user@tika.apache.org by Peter Kronenberg <pe...@torch.ai> on 2021/02/08 21:10:55 UTC

Tika-config

What is wrong with this?
I specified the tika-config env variable.  I know it works because if I make a syntax error in the tika-config.xml, it complains.  So it's finding the file.  But it's not applying the properties.

I have this tika-config.  I tried forward slashes instead of the double backslashes.  Same result.  No errors.  It's just not applying the values.

<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <parsers>
        <parser class="org.apache.tika.parser.DefaultParser">
        </parser>

        <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
            <params>
                <param name="tesseractPath" type="string">c:\\tesseract_config</param>
                <param name="tessdataPath" type="string">c:\\tessdata_config</param>
            </params>
        </parser>
    </parsers>
</properties>
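
One detail in the config above can be checked in isolation: XML has no backslash escape sequences, so `c:\\tesseract_config` reaches the application with both backslashes intact (the working example later in the thread uses single backslashes). The sketch below demonstrates this with the JDK's own XML parser rather than Tika's config loader; the class and method names are made up for illustration, and whether the doubled backslash has anything to do with the problem reported here is not established in the thread.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Illustration only: uses the JDK XML parser, not Tika's config loader.
class BackslashCheck {

    // Returns the text content of the first <param> element in the given XML.
    static String firstParamText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getElementsByTagName("param").item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // In Java source, "\\\\" is two backslash characters, matching c:\\ in the XML file.
        String xml = "<params><param name=\"tesseractPath\" type=\"string\">"
                + "c:\\\\tesseract_config</param></params>";
        // Prints c:\\tesseract_config (both backslashes survive parsing).
        System.out.println(firstParamText(xml));
    }
}
```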


Re: Tika-config

Posted by Peter Kronenberg <pe...@torch.ai>.
I'll have to play with this, but it seems totally counter-intuitive. If I set a parameter in tika-config, but I don't set it in tesseractConfig, then it will ignore my setting and just use the default?



________________________________
From: Tim Allison <ta...@apache.org>
Sent: Monday, February 8, 2021, 8:52 PM
To: Peter Kronenberg; user@tika.apache.org
Subject: Re: Tika-config

Looks like I forgot to reply to the list on one of your earlier emails.

This still holds:

bq. One thing to note is that if you set params programmatically, Tika
will ignore the default settings that you made in TikaConfig.  It will
only read the config from what you pass in via the ParseContext.  So,
if in your tikaconfig.xml you set 'resize' to 100, and then you
_don't_ set it in the TesseractConfig that you send in via the
ParseContext, it will revert to the overall default of 900

On Mon, Feb 8, 2021 at 8:47 PM Tim Allison <ta...@apache.org> wrote:
>
> I regret that I'm not able to reproduce this... that is, this works for me:
>
> @Test
> public void oneOff() throws Exception {
>     System.setProperty("tika.config", "C:\\users\\talli\\myconfig.xml");
>     TikaConfig config = new TikaConfig();
>     AutoDetectParser parser = new AutoDetectParser(config);
>     assertContains("quick brown fox", getXML("testOCR_spacing.png", parser).xml);
> }
>
>
> where myconfig.xml is:
> <?xml version="1.0" encoding="UTF-8"?>
> <properties>
>     <parsers>
>         <parser class="org.apache.tika.parser.DefaultParser">
>         </parser>
>
>         <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
>             <params>
>                 <param name="tesseractPath" type="string">C:\Program Files\Tesseract-OCR2</param>
>                 <param name="tessdataPath" type="string">C:\Program Files\Tesseract-OCR2\tessdata</param>
>             </params>
>         </parser>
>     </parsers>
> </properties>
>
> Whatever you set in your tessConfig will _override_ the underlying settings of the parser...all of them.  So, if you aren't setting the path there, then, yes, you won't see any effect.
>
> On Mon, Feb 8, 2021 at 5:35 PM Peter Kronenberg <pe...@torch.ai> wrote:
>>
>> Like this.
>>
>>
>>         TikaConfig tikaConfig = new TikaConfig();
>>
>>         final AutoDetectParser parser = new AutoDetectParser(tikaConfig);
>>
>>         final ParseContext parseContext = new ParseContext();
>>
>>         parseContext.set(AutoDetectParser.class, parser);
>>         parseContext.set(PDFParserConfig.class, pdfConfig);
>>         parseContext.set(TesseractOCRConfig.class, tessConfig);
>>
>> -----Original Message-----
>> From: Tim Allison <ta...@apache.org>
>> Sent: Monday, February 8, 2021 5:31 PM
>> To: user@tika.apache.org
>> Subject: Re: Tika-config
>>
>> How are you using the TikaConfig?


Re: Tika-config

Posted by Tim Allison <ta...@apache.org>.
Looks like I forgot to reply to the list on one of your earlier emails.

This still holds:

bq. One thing to note is that if you set params programmatically, Tika
will ignore the default settings that you made in TikaConfig.  It will
only read the config from what you pass in via the ParseContext.  So,
if in your tikaconfig.xml you set 'resize' to 100, and then you
_don't_ set it in the TesseractConfig that you send in via the
ParseContext, it will revert to the overall default of 900
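
The behaviour described above can be modelled in a few lines. This is a minimal sketch using hypothetical stand-in classes (`OcrSettings`, `OcrParserModel`), not Tika's actual implementation; it only mirrors the rule stated in this message, namely that a config passed at parse time is used instead of the internal one, with no per-field merging.

```java
// Hypothetical stand-ins for illustration; these are NOT Tika's classes.
class OcrSettings {
    String tesseractPath = "";   // default: empty path
    int resize = 900;            // overall default mentioned in this message
}

class OcrParserModel {
    // Configured once at initialization, e.g. from tika-config.
    private final OcrSettings defaultConfig;

    OcrParserModel(OcrSettings fromTikaConfig) {
        this.defaultConfig = fromTikaConfig;
    }

    // A config passed at parse time replaces the internal one as a whole; nothing is merged.
    OcrSettings effective(OcrSettings fromParseContext) {
        return fromParseContext != null ? fromParseContext : defaultConfig;
    }
}

class ReplaceDemo {
    public static void main(String[] args) {
        OcrSettings fromXml = new OcrSettings();
        fromXml.resize = 100;
        fromXml.tesseractPath = "C:\\Program Files\\Tesseract-OCR2";
        OcrParserModel parser = new OcrParserModel(fromXml);

        // No per-parse config: the tika-config value is used.
        System.out.println(parser.effective(null).resize);              // 100
        // Fresh per-parse config with resize left untouched: back to the default.
        System.out.println(parser.effective(new OcrSettings()).resize); // 900
    }
}
```

Under this model, the only way to keep the configured paths when passing a per-parse config is to set them on that config as well.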


Re: Tika-config

Posted by Tim Allison <ta...@apache.org>.
Wait, no, the clone method won't work, because, again, you'd have to
find the TesseractOCRParser in order to call clone, which is possible,
but annoying.

So, I _guess_ the solution is to keep track of what values were
changed from default in the TesseractOCRConfig.  This will add code
and functionality that no one has ever requested...do we need this?

I'm open to other solutions.
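
For discussion, the "keep track of what values were changed" idea might look roughly like the sketch below. It is a hypothetical model, not Tika code: `TrackedOverrides` and its methods are invented here, and settings are held in plain maps. The point is that only explicitly set values are recorded, so they can be layered over the parser's internal config, including the case from the dpi example where a value is deliberately set back to its default.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not Tika code.
class TrackedOverrides {
    // Only values the caller explicitly set are recorded here.
    private final Map<String, Object> explicitlySet = new HashMap<>();

    TrackedOverrides set(String name, Object value) {
        explicitlySet.put(name, value);
        return this;
    }

    // Layers the explicit values over the base config and leaves the base untouched.
    Map<String, Object> mergedOnto(Map<String, Object> base) {
        Map<String, Object> merged = new HashMap<>(base);
        merged.putAll(explicitlySet);
        return merged;
    }
}

class MergeDemo {
    public static void main(String[] args) {
        Map<String, Object> fromTikaConfig = new HashMap<>();
        fromTikaConfig.put("dpi", 200);
        fromTikaConfig.put("tesseractPath", "C:\\Program Files\\Tesseract-OCR2");

        // The caller explicitly sets dpi back to 100, which happens to equal the default.
        Map<String, Object> effective =
                new TrackedOverrides().set("dpi", 100).mergedOnto(fromTikaConfig);

        System.out.println(effective.get("dpi"));           // 100
        System.out.println(effective.get("tesseractPath")); // path kept from tika-config
    }
}
```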

On Mon, Feb 8, 2021 at 9:53 PM Tim Allison <ta...@apache.org> wrote:
>
> >property file and tika-config interact.
>
> If you use a tika-config, the parameters are set from there.  If you
> don't, we fall back to the property file.
>
> If you look in the TesseractOCRParser, there's a "defaultConfig".
> That is intended to be loaded and configured shortly after
> initialization and is to be used as the default config if the user
> does not otherwise pass in an OCRConfig at parse time.  That
> "internal" config is effectively static and can be used across threads
> because, under normal circumstances, it is never changed after
> initialization.  As above, it is either set by a tika-config file or
> by the properties shortly after initialization.
>
> > you can programmatically change parameter values
> Yes. If you programmatically call the setters on the parser, that will
> change the underlying defaultConfig...as you'd expect. And those
> changes will go into effect across all threads for that parser.  You
> will only change the values that you call. Everything else from the
> original initialization will be unchanged.  There's no great way to
> find the parser in the AutoDetectParser...  So, basically, don't do
> this.
>
> >or pass in a tika-config to the parser which is set in the parseContext, right.
> Uh, TesseractOCRConfig, right? At parse time, the OCRParser has an
> internal default config that was set as described above.  If you then
> pass in a new tessconfig at parse time via the parsecontext, it will
> use that _instead_ of the internal config that was set shortly after
> initialization.
>
> If you want to add a "clone" method or similar or a "getConfig" to
> TesseractOCRParser, that might work.  You'd get the default tessconfig
> (which was set via the tika-config file at initialization), clone it,
> modify it and then send it into a given parse at parse time via the
> ParseContext.  Something like that should work.
>
> As our code is currently set up (and I acknowledge there is always
> room for improvement), let's say the parameter is dpi, and the default
> is 100.
>
> If you set "dpi" to 200 in your tika-config.xml file, then the
> internal tessconfig will be 200.  Now let's say at parse time, you
> want to go back to the default...so you set dpi on a new tessconfig to
> 100 and then send that in via the parsecontext.  We don't currently
> have the code in place to know that you only changed one parameter in
> the tessconfig.  So, how would we know to overwrite that one value,
> but not, say, the empty path to tesseract or any of the other default
> values?
>
>
> On Mon, Feb 8, 2021 at 9:35 PM Peter Kronenberg
> <pe...@torch.ai> wrote:
> >
> > I still don't get how the property file and tika-config interact.  When you say an internal tessConfig I assume you mean the one that is packaged with tika, which could be replaced by another file in the same package (which is essentially what I'm doing now)
> >
> > Then, at runtime, you can programmatically change parameter values or pass in a tika-config to the parser which is set in the parseContext, right.  So wouldn't that simply override any values in the current config?  I don't understand how this would cause the default values to re-appear
> >
> > ________________________________
> > From: Tim Allison <ta...@apache.org>
> > Sent: Monday, February 8, 2021 9:25 PM
> > To: Peter Kronenberg <pe...@torch.ai>
> > Cc: user@tika.apache.org <us...@tika.apache.org>
> > Subject: Re: Tika-config
> >
> > sorry an "internal tessconfig"
> >
> > On Mon, Feb 8, 2021 at 9:23 PM Tim Allison <ta...@apache.org> wrote:
> > >
> > > Let's say you have an internal tessconfig file in the parser that
> > > you've configured through a tikaconfig.  When, at runtime, you send in
> > > a new tessconfig via the parsecontext, how can we tell which
> > > parameters you want to change from the new tessconfig?
> > >
> > > Yes, I realize that it would be possible to keep track of what
> > > parameters have been changed in the runtime config and then do
> > > something smart, but this hasn't been an issue to date.
> > >
> > > On Mon, Feb 8, 2021 at 9:15 PM Peter Kronenberg
> > > <pe...@torch.ai> wrote:
> > > >
> > > > > Ok, your last point might be the issue. If I don't set it in tesseractOCRConfig, then setting it in tika-config has no effect? I'm not sure I understand the thinking or logic behind this.
> > > >
> > > >
> > > > ________________________________
> > > > From: Tim Allison <ta...@apache.org>
> > > > Sent: Monday, February 8, 2021 8:47:07 PM
> > > > To: Peter Kronenberg <pe...@torch.ai>; user@tika.apache.org <us...@tika.apache.org>
> > > > Subject: Re: Tika-config
> > > >
> > > > I regret that I'm not able to reproduce this...that is, this works for me;
> > > >
> > > > @Test
> > > > public void oneOff() throws Exception {
> > > >     System.setProperty("tika.config", "C:\\users\\talli\\myconfig.xml");
> > > >     TikaConfig config = new TikaConfig();
> > > >     AutoDetectParser parser = new AutoDetectParser(config);
> > > >     assertContains("quick brown fox", getXML("testOCR_spacing.png", parser).xml);
> > > > }
> > > >
> > > >
> > > > where myconfig.xml is:
> > > > <?xml version="1.0" encoding="UTF-8"?>
> > > > <properties>
> > > >     <parsers>
> > > >         <parser class="org.apache.tika.parser.DefaultParser">
> > > >         </parser>
> > > >
> > > >         <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
> > > >             <params>
> > > >                 <param name="tesseractPath" type="string">C:\Program Files\Tesseract-OCR2</param>
> > > >                 <param name="tessdataPath" type="string">C:\Program Files\Tesseract-OCR2\tessdata</param>
> > > >             </params>
> > > >         </parser>
> > > >     </parsers>
> > > > </properties>
> > > >
> > > > Whatever you set in your tessConfig will _override_ the underlying settings of the parser...all of them.  So, if you aren't setting the path there, then, y, you won't see any effect.
> > > >
> > > > On Mon, Feb 8, 2021 at 5:35 PM Peter Kronenberg <pe...@torch.ai> wrote:
> > > >
> > > > Like this.
> > > >
> > > >
> > > >         TikaConfig tikaConfig = new TikaConfig();
> > > >
> > > >         final AutoDetectParser parser = new AutoDetectParser(tikaConfig);
> > > >
> > > >         final ParseContext parseContext = new ParseContext();
> > > >
> > > >         parseContext.set(AutoDetectParser.class, parser);
> > > >         parseContext.set(PDFParserConfig.class, pdfConfig);
> > > >         parseContext.set(TesseractOCRConfig.class, tessConfig);
> > > >
> > > > -----Original Message-----
> > > > From: Tim Allison <ta...@apache.org>
> > > > Sent: Monday, February 8, 2021 5:31 PM
> > > > To: user@tika.apache.org
> > > > Subject: Re: Tika-config
> > > >
> > > > How are you using the TikaConfig?
> > > >
> > > > On Mon, Feb 8, 2021 at 4:11 PM Peter Kronenberg <pe...@torch.ai> wrote:
> > > > >
> > > > > What is wrong with this?
> > > > >
> > > > > I specified the tika-config env variable.  I know it works because if
> > > > > I make a syntax error in the tika-config.xml, it complains.  So it’s
> > > > > finding the file.  But it’s not applying the properties
> > > > >
> > > > >
> > > > >
> > > > > I have this tika-config.  I tried forward slashes instead of the double backslashes.  Same result.  No errors.  It’s just not applying the values.
> > > > >
> > > > >
> > > > >
> > > > > <?xml version="1.0" encoding="UTF-8"?> <properties>
> > > > >     <parsers>
> > > > >         <parser class="org.apache.tika.parser.DefaultParser">
> > > > >         </parser>
> > > > >
> > > > >         <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
> > > > >             <params>
> > > > >                 <param name="tesseractPath" type="string">c:\\tesseract_config</param>
> > > > >                 <param name="tessdataPath" type="string">c:\\tessdata_config</param>
> > > > >             </params>
> > > > >         </parser>
> > > > >     </parsers>
> > > > > </properties>
> > > > >
> > > > >

RE: Tika-config

Posted by Peter Kronenberg <pe...@torch.ai>.
This just is not working for me.  Based on what you said, I added values to my TesseractOCRConfig.properties
tesseractPath=dummy
tessdataPath=dummy

Ran my code and it used these values instead of from tika-config.  I then put a syntax error in tika-config and it complained, so it’s still loading tika-config, but not using the values.

So I am just not understanding how this is supposed to work.

And this statement of yours bothers me:

If you use a tika-config, the parameters are set from there.  If you
don't, we fall back to the property file.

It sounds like you’re saying that if you use tika-config, the values in the property file are ignored.  I’m using the property files as my main source of values (plus changing them programmatically).  I want to only use tika-config to set a particular set of values (tesseractPath and tessdataPath)

It’s starting to sound like tika-config and properties file are either/or.  Not combined, as I assumed.  Which is why tika-config never worked for me in the first place, since it never seemed possible to change things programmatically after loading from tika-config

The env variables were so simple 😊

From: Peter Kronenberg <pe...@torch.ai>
Sent: Monday, February 8, 2021 10:46 PM
To: Tim Allison <ta...@apache.org>; user@tika.apache.org
Subject: Re: Tika-config

I'm gonna have to read this again slowly. But you corrected me when I said that tika-config is set in the parser context. But doesn't it get passed on by virtue of being set on the autodetect parser?

I haven't looked at the code yet to see the defaultConfig, but isn't the internal tesseractOCRConfig always used? Isn't that always the default?


Re: Tika-config

Posted by Peter Kronenberg <pe...@torch.ai>.
I'm gonna have to read this again slowly. But you corrected me when I said that tika-config is set in the parser context. But doesn't it get passed on by virtue of being set on the autodetect parser?

I haven't looked at the code yet to see the defaultConfig, but isn't the internal tesseractOCRConfig always used? Isn't that always the default?

> > > >
> > > > What is wrong with this?
> > > >
> > > > I specified the tika-config env variable.  I know it works because if
> > > > I make a syntax error in the tika-config.xml, it complains.  So it’s
> > > > finding the file.  But it’s not applying the properties
> > > >
> > > >
> > > >
> > > > I have this tika-config.  I tried forward slashes instead of the double backslashes.  Same result.  No errors.  It’s just not applying the values.
> > > >
> > > >
> > > >
> > > > <?xml version="1.0" encoding="UTF-8"?> <properties>
> > > >     <parsers>
> > > >         <parser class="org.apache.tika.parser.DefaultParser">
> > > >         </parser>
> > > >
> > > >         <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
> > > >             <params>
> > > >                 <param name="tesseractPath" type="string">c:\\tesseract_config</param>
> > > >                 <param name="tessdataPath" type="string">c:\\tessdata_config</param>
> > > >             </params>
> > > >         </parser>
> > > >     </parsers>
> > > > </properties>
> > > >
> > > >

Re: Tika-config

Posted by Tim Allison <ta...@apache.org>.
>property file and tika-config interact.

If you use a tika-config, the parameters are set from there.  If you
don't, we fall back to the property file.

If you look in the TesseractOCRParser, there's a "defaultConfig".
That is intended to be loaded and configured shortly after
initialization and is to be used as the default config if the user
does not otherwise pass in a TesseractOCRConfig at parse time.  That
"internal" config is effectively static and can be used across threads
because, under normal circumstances, it is never changed after
initialization.  As above, it is set either by a tika-config file or
by the properties file shortly after initialization.
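
A minimal sketch of that initialization step, under stated assumptions: the class and method names are hypothetical (not Tika classes), "dpi" is only an illustrative parameter, and the fallback is applied per parameter, which is one reading of the description above.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the parser's internal default config; not a Tika class.
final class InternalDefaultsSketch {
    private final Map<String, String> values;

    private InternalDefaultsSketch(Map<String, String> values) {
        // Copy and freeze the values so the config can be shared across threads.
        this.values = Collections.unmodifiableMap(new HashMap<>(values));
    }

    // Assumption: values from the tika-config win; anything it does not mention
    // falls back to the property file.
    static InternalDefaultsSketch initialize(Map<String, String> propertyFileValues,
                                             Map<String, String> tikaConfigParams) {
        Map<String, String> merged = new HashMap<>(propertyFileValues);
        merged.putAll(tikaConfigParams);
        return new InternalDefaultsSketch(merged);
    }

    String get(String name) {
        return values.get(name);
    }
}
```

Because the map is frozen at construction, nothing can change it afterwards, which is what makes sharing it across threads safe in this sketch.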

> you can programmatically change parameter values
Yes. If you programmatically call the setters on the parser, that will
change the underlying defaultConfig...as you'd expect. And those
changes will go into effect across all threads for that parser.  You
will only change the values that you call. Everything else from the
original initialization will be unchanged.  There's no great way to
find the parser in the AutoDetectParser...  So, basically, don't do
this.

>or pass in a tika-config to the parser which is set in the parseContext, right.
Uh, TesseractOCRConfig, right? At parse time, the OCRParser has an
internal default config that was set as described above.  If you then
pass in a new tessconfig at parse time via the parsecontext, it will
use that _instead_ of the internal config that was set shortly after
initialization.
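
The "instead of, not merged with" behaviour can be sketched as follows. This is a simplified model, not Tika's code: OcrConfigSketch and ParseTimeSelection are hypothetical names, and only two fields are modelled.

```java
// Hypothetical, simplified stand-in for TesseractOCRConfig.
final class OcrConfigSketch {
    String tesseractPath = "";  // built-in default: no path set
    int dpi = 100;              // illustrative built-in default
}

final class ParseTimeSelection {
    // Returns the config a parse would use: the one passed in via the context
    // if there is one, otherwise the parser's internal default.
    // There is no field-by-field merge.
    static OcrConfigSketch select(OcrConfigSketch internalDefault,
                                  OcrConfigSketch fromContext) {
        return fromContext != null ? fromContext : internalDefault;
    }
}
```

With an internal default whose tesseractPath came from the tika-config, passing a freshly constructed config selects the fresh one, so the path seen by the parse is the empty built-in default again.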

If you want to add a "clone" method or similar or a "getConfig" to
TesseractOCRParser, that might work.  You'd get the default tessconfig
(which was set via the tika-config file at initialization), clone it,
modify it and then send it into a given parse at parse time via the
ParseContext.  Something like that should work.
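
A rough sketch of that clone-then-modify idea, assuming such a copy facility existed. The thread says it does not exist yet, so every name below is hypothetical.

```java
// Hypothetical stand-in for a tessconfig with a copy constructor.
final class TessSettingsSketch {
    private String tesseractPath = "";
    private int dpi = 100;

    TessSettingsSketch() {
    }

    // Copy constructor playing the role of the proposed "clone" method.
    TessSettingsSketch(TessSettingsSketch other) {
        this.tesseractPath = other.tesseractPath;
        this.dpi = other.dpi;
    }

    String getTesseractPath() { return tesseractPath; }
    void setTesseractPath(String path) { this.tesseractPath = path; }
    int getDpi() { return dpi; }
    void setDpi(int dpi) { this.dpi = dpi; }
}

final class CloneThenModify {
    // Copies the parser's default config and changes a single value on the copy,
    // leaving the shared default untouched.
    static TessSettingsSketch perParseConfig(TessSettingsSketch parserDefault,
                                             int dpiForThisParse) {
        TessSettingsSketch copy = new TessSettingsSketch(parserDefault);
        copy.setDpi(dpiForThisParse);
        return copy;
    }
}
```

The copy keeps everything that was set at initialization (for example the tesseract path) and differs only in the one value changed for this parse.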

Here is how our code is currently set up (I acknowledge there is
always room for improvement).  Let's say the parameter is dpi, and the
default is 100.

If you set "dpi" to 200 in your tika-config.xml file, then the
internal tessconfig will be 200.  Now let's say at parse time, you
want to go back to the default...so you set dpi on a new tessconfig to
100 and then send that in via the parsecontext.  We don't currently
have the code in place to know that you only changed one parameter in
the tessconfig.  So, how would we know to overwrite that one value,
but not, say, the empty path to tesseract or any of the other default
values?



Re: Tika-config

Posted by Peter Kronenberg <pe...@torch.ai>.
I still don't get how the property file and tika-config interact.  When you say an internal tessConfig, I assume you mean the one that is packaged with Tika, which could be replaced by another file in the same package (which is essentially what I'm doing now).

Then, at runtime, you can programmatically change parameter values or pass in a tika-config to the parser which is set in the parseContext, right?  So wouldn't that simply override any values in the current config?  I don't understand how this would cause the default values to reappear.


Re: Tika-config

Posted by Tim Allison <ta...@apache.org>.
Sorry, I meant an "internal tessconfig".


Re: Tika-config

Posted by Tim Allison <ta...@apache.org>.
Let's say you have an internal tessconfig file in the parser that
you've configured through a tikaconfig.  When, at runtime, you send in
a new tessconfig via the parsecontext, how can we tell which
parameters you want to change from the new tessconfig?

Yes, I realize that it would be possible to keep track of what
parameters have been changed in the runtime config and then do
something smart, but this hasn't been an issue to date.
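
One possible shape for that "something smart", purely as an illustration: nothing like this exists in Tika according to this thread, and TrackedOverrides is a hypothetical name. The runtime config records only the parameters that were explicitly set, and those are overlaid on the parser's defaults.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical runtime config that remembers which parameters were explicitly set.
final class TrackedOverrides {
    private final Map<String, String> explicitlySet = new HashMap<>();

    TrackedOverrides set(String name, String value) {
        explicitlySet.put(name, value);
        return this;
    }

    // Overlays only the explicitly set parameters on a copy of the parser's
    // defaults; everything else keeps its initialization-time value.
    Map<String, String> applyTo(Map<String, String> parserDefaults) {
        Map<String, String> effective = new HashMap<>(parserDefaults);
        effective.putAll(explicitlySet);
        return effective;
    }
}
```

Since only explicit calls to set are recorded, a value that happens to equal the built-in default is still treated as a deliberate override.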


Re: Tika-config

Posted by Peter Kronenberg <pe...@torch.ai>.
OK, your last point might be the issue. If I don't set it in TesseractOCRConfig, then setting it in tika-config has no effect? I'm not sure I understand the thinking or logic behind this.



Re: Tika-config

Posted by Tim Allison <ta...@apache.org>.
I regret that I'm not able to reproduce this...that is, this works for me:

@Test
public void oneOff() throws Exception {
    System.setProperty("tika.config", "C:\\users\\talli\\myconfig.xml");
    TikaConfig config = new TikaConfig();
    AutoDetectParser parser = new AutoDetectParser(config);
    assertContains("quick brown fox", getXML("testOCR_spacing.png", parser).xml);
}


where myconfig.xml is:
<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <parsers>
        <parser class="org.apache.tika.parser.DefaultParser">
        </parser>

        <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
            <params>
                <param name="tesseractPath" type="string">C:\Program Files\Tesseract-OCR2</param>
                <param name="tessdataPath" type="string">C:\Program Files\Tesseract-OCR2\tessdata</param>
            </params>
        </parser>
    </parsers>
</properties>

Whatever you set in your tessConfig will _override_ the underlying settings
of the parser...all of them.  So, if you aren't setting the path there,
then, yes, you won't see any effect.
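
For checking which config file is actually picked up, here is a small sketch of the lookup order. It assumes that the no-argument TikaConfig constructor consults the "tika.config" system property first and then the TIKA_CONFIG environment variable; that order is an assumption to verify against the Tika version in use, and ConfigLocationSketch is a hypothetical helper, not part of Tika.

```java
// Hypothetical helper that mirrors the assumed lookup order for the config location.
final class ConfigLocationSketch {
    // Returns the system property if it is non-blank, otherwise the environment
    // variable (which may itself be null when neither is set).
    static String resolve(String systemProperty, String environmentVariable) {
        if (systemProperty != null && !systemProperty.trim().isEmpty()) {
            return systemProperty;
        }
        return environmentVariable;
    }

    // Convenience wrapper reading the two real sources.
    static String resolveFromEnvironment() {
        return resolve(System.getProperty("tika.config"), System.getenv("TIKA_CONFIG"));
    }
}
```

Printing the result of resolveFromEnvironment() before constructing the TikaConfig is a quick way to confirm that the expected file is the one being read.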


RE: Tika-config

Posted by Peter Kronenberg <pe...@torch.ai>.
Like this.


        TikaConfig tikaConfig = new TikaConfig();

        final AutoDetectParser parser = new AutoDetectParser(tikaConfig);

        final ParseContext parseContext = new ParseContext();

        parseContext.set(AutoDetectParser.class, parser);
        parseContext.set(PDFParserConfig.class, pdfConfig);
        parseContext.set(TesseractOCRConfig.class, tessConfig);


Re: Tika-config

Posted by Tim Allison <ta...@apache.org>.
How are you using the TikaConfig?
