Posted to dev@tika.apache.org by Peter Kronenberg <pe...@torch.ai> on 2021/02/09 21:21:28 UTC

RE: {EXTERNAL}[jira] [Commented] (TIKA-3297) Simplify parser configuration in 2.x

You're killing me here!  I just finished an implementation that relies on this.
I never figured out how to set properties at runtime if I use tika-config.

Can you please provide an example of setting properties with tika-config and then optionally changing them at runtime?  How do the TesseractOCRConfig and PDFParser objects get initialized if not from the corresponding .properties file?

-----Original Message-----
From: Tim Allison (Jira) <ji...@apache.org> 
Sent: Tuesday, February 9, 2021 4:10 PM
To: dev@tika.apache.org
Subject: {EXTERNAL}[jira] [Commented] (TIKA-3297) Simplify parser configuration in 2.x

    [ https://issues.apache.org/jira/browse/TIKA-3297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282043#comment-17282043 ]

Tim Allison commented on TIKA-3297:
-----------------------------------

I got rid of the .properties file for Tesseract.  Users can no longer set the Tesseract path, the tessdata path, or the ImageMagick path via the TesseractOCRConfig.  These _must_ be set via a tika-config.xml.  If there is a use case for setting these at parse time, let me know.

Now, when a user sends in a TesseractOCRConfig at parse time, that config remembers which fields the user set.  The TesseractOCRParser will now clone the default internal config and update only those fields that the user has set on the config sent in via the ParseContext.  In short, this will now "update" the baseline set via the tika-config.xml.  It will not replace the tika-config.xml settings wholesale.

If this looks good, I'll do the same to the PDFParser.
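
As a rough illustration of the "clone the baseline, update only what the user set" idea described above, here is a minimal, self-contained sketch. It is not Tika's code: ConfigMergeSketch, OcrConfig, updatedWith, the two fields, and their default values are all hypothetical names chosen for the example.

```java
import java.util.HashSet;
import java.util.Set;

class ConfigMergeSketch {

    // Hypothetical stand-in for a per-parser config; not Tika's TesseractOCRConfig.
    static final class OcrConfig {
        private String language = "eng";   // illustrative default
        private int timeoutSeconds = 120;  // illustrative default
        private final Set<String> userSet = new HashSet<>();

        void setLanguage(String language) {
            this.language = language;
            userSet.add("language");       // remember that the caller touched this field
        }

        void setTimeoutSeconds(int timeoutSeconds) {
            this.timeoutSeconds = timeoutSeconds;
            userSet.add("timeoutSeconds");
        }

        String getLanguage() {
            return language;
        }

        int getTimeoutSeconds() {
            return timeoutSeconds;
        }

        // Build a copy of this baseline, taking from 'update' only the fields
        // that were explicitly set on it. Neither input object is modified.
        OcrConfig updatedWith(OcrConfig update) {
            OcrConfig merged = new OcrConfig();
            merged.language = update.userSet.contains("language")
                    ? update.language : language;
            merged.timeoutSeconds = update.userSet.contains("timeoutSeconds")
                    ? update.timeoutSeconds : timeoutSeconds;
            return merged;
        }
    }

    public static void main(String[] args) {
        // Baseline, as if it had been loaded from a config file at startup.
        OcrConfig baseline = new OcrConfig();
        baseline.setLanguage("fra");
        baseline.setTimeoutSeconds(300);

        // Per-parse config: the caller changes only the timeout.
        OcrConfig perParse = new OcrConfig();
        perParse.setTimeoutSeconds(30);

        OcrConfig effective = baseline.updatedWith(perParse);
        System.out.println(effective.getLanguage() + " " + effective.getTimeoutSeconds());
    }
}
```

In this sketch the per-parse object's untouched language default does not clobber the baseline's language, which is the difference between "updating" and "overwriting".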

> Simplify parser configuration in 2.x
> ------------------------------------
>
>                 Key: TIKA-3297
>                 URL: https://issues.apache.org/jira/browse/TIKA-3297
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>
> We currently have .properties files and tika-config.xml and runtime configuration.  We should simplify to tika-config.xml.
> From a security perspective, I'm thinking we should also allow executable paths to be set only via tika-config.xml...not programmatically via a TesseractOCRConfig.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Re: {EXTERNAL}[jira] [Commented] (TIKA-3297) Simplify parser configuration in 2.x

Posted by Tim Allison <ta...@apache.org>.
Yes.

On Tue, Feb 9, 2021 at 7:16 PM Peter Kronenberg <pe...@torch.ai>
wrote:

> How are the default values defined?  Previously, it was whatever was in
> the default .properties file, right?  Are they just hard-coded now?
>
> -----Original Message-----
> From: Tim Allison <ta...@apache.org>
> Sent: Tuesday, February 9, 2021 5:59 PM
> To: <de...@tika.apache.org> <de...@tika.apache.org>
> Subject: Re: {EXTERNAL}[jira] [Commented] (TIKA-3297) Simplify parser
> configuration in 2.x
>
> > How do the TesseractOCRConfig and PDFParser objects get initialized if
> > not from the corresponding .properties file?
>
> Configuration is initialized by the default values.  If there's a
> tika-config.xml, that will overwrite those fields shortly after
> initialization.
>
> On Tue, Feb 9, 2021 at 4:21 PM Peter Kronenberg <pe...@torch.ai>
> wrote:
> >
> > You're killing me here!  I just finished an implementation that relies
> on this.
> > I never figured out how to set properties at runtime if I use
> tika-config.
> >
> > Can you please provide an example of setting properties with tika-config
> and then optionally changing them at runtime?  How does the
> TesseractOCRConfig and PDFParser objects get initialized if not from the
> corresponding .properties file?
> >
> > -----Original Message-----
> > From: Tim Allison (Jira) <ji...@apache.org>
> > Sent: Tuesday, February 9, 2021 4:10 PM
> > To: dev@tika.apache.org
> > Subject: {EXTERNAL}[jira] [Commented] (TIKA-3297) Simplify parser
> > configuration in 2.x
> >
> > CAUTION: This email originated from outside of the organization. DO NOT
> click links or open attachments unless you recognize the sender and know
> the content is safe.
> >
> >     [
> > https://issues.apache.org/jira/browse/TIKA-3297?page=com.atlassian.jir
> > a.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282
> > 043#comment-17282043 ]
> >
> > Tim Allison commented on TIKA-3297:
> > -----------------------------------
> >
> > I got rid of the .properties for tesseract.  Users can no longer set the
> tesseract path, tess data or imagemagick via the TesseractOCRConfig.  These
> _must_ be set via a tika-config.xml.  If there is a use case for setting
> these at parse time, let me know.
> >
> >
> >
> > Now, when a user sends in a TesseractOCRConfig at parse time, that
> config remembers what fields the user set.  The TesseractOCRParser will now
> clone the default internal config and update only those fields that the
> user has manipulated and sent in via the ParseContext.  In short, this will
> now "update" the baseline set via the tika-config.xml.  It will not
> overwrite what was set in the tika-config.xml file.
> >
> >
> >
> > If this looks good, I'll do the same to the PDFParser.
> >
> > > Simplify parser configuration in 2.x
> > > ------------------------------------
> > >
> > >                 Key: TIKA-3297
> > >                 URL: https://issues.apache.org/jira/browse/TIKA-3297
> > >             Project: Tika
> > >          Issue Type: Task
> > >            Reporter: Tim Allison
> > >            Priority: Major
> > >
> > > We currently have .properties files and tika-config.xml and runtime
> configuration.  We should simplify to tika-config.xml.
> > > From a security perspective, I'm thinking we should also allow
> executable paths to be set only via tika-config.xml...not programmatically
> via a TesseractConfig.
> >
> >
> >
> > --
> > This message was sent by Atlassian Jira
> > (v8.3.4#803005)
>

RE: {EXTERNAL}[jira] [Commented] (TIKA-3297) Simplify parser configuration in 2.x

Posted by Peter Kronenberg <pe...@torch.ai>.
How are the default values defined?  Previously, it was whatever was in the default .properties file, right?  Are they just hard-coded now?

-----Original Message-----
From: Tim Allison <ta...@apache.org> 
Sent: Tuesday, February 9, 2021 5:59 PM
To: <de...@tika.apache.org> <de...@tika.apache.org>
Subject: Re: {EXTERNAL}[jira] [Commented] (TIKA-3297) Simplify parser configuration in 2.x

> How do the TesseractOCRConfig and PDFParser objects get initialized if not from the corresponding .properties file?

Configuration is initialized by the default values.  If there's a tika-config.xml, that will overwrite those fields shortly after initialization.


Re: {EXTERNAL}[jira] [Commented] (TIKA-3297) Simplify parser configuration in 2.x

Posted by Tim Allison <ta...@apache.org>.
> How do the TesseractOCRConfig and PDFParser objects get initialized
> if not from the corresponding .properties file?

Configuration is initialized by the default values.  If there's a
tika-config.xml, that will overwrite those fields shortly after
initialization.
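
For illustration, the "defaults first, then the config file" order described above can be sketched with the JDK alone. DefaultsThenConfigSketch, Settings, its two fields, their default values, and the simplified <param> fragment are assumptions made up for this example; they are not Tika's classes or the full tika-config.xml schema.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class DefaultsThenConfigSketch {

    // Hypothetical settings holder; the field initializers play the role of
    // the hard-coded defaults.
    static final class Settings {
        String language = "eng";
        int timeoutSeconds = 120;
    }

    // Start from the defaults, then overwrite only the fields named by
    // <param> elements. A null xml means "no config file was supplied".
    static Settings load(String xml) {
        Settings settings = new Settings();
        if (xml == null) {
            return settings;
        }
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse config", e);
        }
        NodeList params = doc.getElementsByTagName("param");
        for (int i = 0; i < params.getLength(); i++) {
            Element param = (Element) params.item(i);
            String name = param.getAttribute("name");
            String value = param.getTextContent().trim();
            switch (name) {
                case "language":
                    settings.language = value;
                    break;
                case "timeoutSeconds":
                    settings.timeoutSeconds = Integer.parseInt(value);
                    break;
                default:
                    throw new IllegalArgumentException("unknown param: " + name);
            }
        }
        return settings;
    }

    public static void main(String[] args) {
        // Only the timeout is configured, so the language keeps its default.
        String xml = "<params><param name=\"timeoutSeconds\">300</param></params>";
        Settings settings = load(xml);
        System.out.println(settings.language + " " + settings.timeoutSeconds);
    }
}
```

Rejecting unknown parameter names is a design choice of the sketch, so that a typo in the config fails loudly instead of silently leaving a default in place.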


RE: {EXTERNAL}[jira] [Commented] (TIKA-3297) Simplify parser configuration in 2.x

Posted by Peter Kronenberg <pe...@torch.ai>.
I certainly agree that it was confusing and non-intuitive, but I didn't expect things to change so drastically, so quickly!

I'll take a look at the unit tests for examples, but it sounds like you're saying I will be able to use tika-config.xml for all my default settings and then still change them programmatically pretty much the same way I'm doing now?

-----Original Message-----
From: Tim Allison <ta...@apache.org> 
Sent: Tuesday, February 9, 2021 5:57 PM
To: <de...@tika.apache.org> <de...@tika.apache.org>
Subject: Re: {EXTERNAL}[jira] [Commented] (TIKA-3297) Simplify parser configuration in 2.x

Peter,

I did not intend to cause pain.  It felt like I spent numerous hours trying to help you debug what you were seeing and explaining the current configuration methods.  I was unsuccessful in communicating to you that what you were seeing was "expected."

Rather than spend more time trying to explain unsuccessfully how configuration worked, I thought it better to simplify configuration and make "updating" via the configs in the ParseContext possible (rather than overwriting).  If you remember, that was the part that you asked for numerous times and/or expressed surprise around.

In short, the PDFParserConfig and the TesseractOCRConfig, when sent in via the ParseContext, will update the settings from the baseline as set in the initial tika-config.

Unit tests that demonstrate this new behavior are here:
https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pdf-module/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java#L1072

https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-ocr-module/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java#L111

Cheers,

      Tim


Re: {EXTERNAL}[jira] [Commented] (TIKA-3297) Simplify parser configuration in 2.x

Posted by Tim Allison <ta...@apache.org>.
Peter,

I did not intend to cause pain.  It felt like I spent numerous hours
trying to help you debug what you were seeing and explaining the
current configuration methods.  I was unsuccessful in communicating to
you that what you were seeing was "expected."

Rather than spend more time trying to explain unsuccessfully how
configuration worked, I thought it better to simplify configuration
and make "updating" via the configs in the ParseContext possible
(rather than overwriting).  If you remember, that was the part that
you asked for numerous times and/or expressed surprise around.

In short, the PDFParserConfig and the TesseractOCRConfig, when sent in
via the ParseContext, will update the settings from the baseline as
set in the initial tika-config.
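
To make the per-parse flow concrete without depending on Tika itself, here is a small sketch of a type-keyed context whose contents are applied on top of a baseline. ParseContextSketch, Context, PdfSettings, resolve, and the two fields are hypothetical stand-ins for this example, and "null means not set" is only one possible way to tell set fields from unset ones. The linked unit tests remain the reference for the real behavior.

```java
import java.util.HashMap;
import java.util.Map;

class ParseContextSketch {

    // Hypothetical type-keyed context: holds at most one object per class.
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key) {
            return key.cast(entries.get(key)); // null when nothing was supplied
        }
    }

    // Hypothetical PDF settings; a null field means "not set by the caller".
    static final class PdfSettings {
        Boolean extractInlineImages;
        Boolean sortByPosition;
    }

    // Resolve the settings for one parse: copy the baseline, then apply only
    // the non-null fields of the per-parse object, if the context has one.
    static PdfSettings resolve(PdfSettings baseline, Context context) {
        PdfSettings effective = new PdfSettings();
        effective.extractInlineImages = baseline.extractInlineImages;
        effective.sortByPosition = baseline.sortByPosition;

        PdfSettings perParse = context.get(PdfSettings.class);
        if (perParse != null) {
            if (perParse.extractInlineImages != null) {
                effective.extractInlineImages = perParse.extractInlineImages;
            }
            if (perParse.sortByPosition != null) {
                effective.sortByPosition = perParse.sortByPosition;
            }
        }
        return effective;
    }

    public static void main(String[] args) {
        // Baseline, as if it came from the initial config.
        PdfSettings baseline = new PdfSettings();
        baseline.extractInlineImages = false;
        baseline.sortByPosition = true;

        // The caller changes one field for this parse only.
        PdfSettings perParse = new PdfSettings();
        perParse.extractInlineImages = true;

        Context context = new Context();
        context.set(PdfSettings.class, perParse);

        PdfSettings effective = resolve(baseline, context);
        System.out.println(effective.extractInlineImages + " " + effective.sortByPosition);
    }
}
```

An empty context simply yields a copy of the baseline, so callers who never pass a per-parse config see only the configured defaults.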

Unit tests that demonstrate this new behavior are here:
https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pdf-module/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java#L1072

https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-ocr-module/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java#L111

Cheers,

      Tim
