Posted to user@tika.apache.org by Peter Kronenberg <pe...@torch.ai> on 2021/02/10 15:11:47 UTC

New config paradigm

Ok, I’m gonna have questions 😊

In this code, I assume that this extracts the settings that are in the tika-config.  And we have to extract one parser at a time, right?

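// Load tika-config.xml from the classpath (tikaConfig is assumed to be declared outside this snippet).
try (InputStream is = TikaOCRParser.class.getResourceAsStream("/tika-config.xml")) {
    tikaConfig = new TikaConfig(is);
}
// findParser is a helper that is not shown here; it returns the parser of the requested class.
Parser pdfParser = findParser(tikaConfig.getParser(), org.apache.tika.parser.pdf.PDFParser.class);
// Read the settings this PDFParser instance was configured with.
PDFParserConfig pdfParserConfig = ((PDFParser)pdfParser).getPDFParserConfig();
System.out.println("OCR Strategy: " + pdfParserConfig.getOcrStrategy());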
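
Since findParser is not shown in the snippet above, here is a minimal sketch of what such a helper might look like. It is an assumption, not the actual helper. To keep it runnable without Tika on the classpath, Parser, CompositeParser, ParserDecorator, PdfParser and TextParser below are hypothetical stand-ins; the accessor names getAllComponentParsers() and getWrappedParser() mirror the methods on Tika's CompositeParser and ParserDecorator. Unlike the call above, this version returns the requested type directly, so no cast is needed.

```java
import java.util.Arrays;
import java.util.List;

public class FindParserSketch {

    // Hypothetical stand-ins for the Tika types; none of these are the real classes.
    interface Parser {}

    static class CompositeParser implements Parser {
        private final List<Parser> children;
        CompositeParser(Parser... children) { this.children = Arrays.asList(children); }
        List<Parser> getAllComponentParsers() { return children; }
    }

    static class ParserDecorator implements Parser {
        private final Parser wrapped;
        ParserDecorator(Parser wrapped) { this.wrapped = wrapped; }
        Parser getWrappedParser() { return wrapped; }
    }

    static class PdfParser implements Parser {}
    static class TextParser implements Parser {}

    /**
     * Depth-first search for the first parser of the wanted class.
     * Unwraps decorators and descends into composites; returns null if none is found.
     */
    static <T extends Parser> T findParser(Parser root, Class<T> wanted) {
        if (wanted.isInstance(root)) {
            return wanted.cast(root);
        }
        if (root instanceof ParserDecorator) {
            return findParser(((ParserDecorator) root).getWrappedParser(), wanted);
        }
        if (root instanceof CompositeParser) {
            for (Parser child : ((CompositeParser) root).getAllComponentParsers()) {
                T found = findParser(child, wanted);
                if (found != null) {
                    return found;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        PdfParser pdf = new PdfParser();
        Parser root = new CompositeParser(
                new TextParser(),
                new ParserDecorator(new CompositeParser(pdf)));
        // The PdfParser is found even though it sits behind a decorator and a nested composite.
        System.out.println(findParser(root, PdfParser.class) == pdf);
    }
}
```

So yes, with this approach each parser has to be looked up one class at a time.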

If I then proceed to do this


final PDFParserConfig pdfConfig = new PDFParserConfig();
pdfConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.AUTO);


final AutoDetectParser parser = new AutoDetectParser(tikaConfig);
final ParseContext parseContext = new ParseContext();

parseContext.set(AutoDetectParser.class, parser);
parseContext.set(PDFParserConfig.class, pdfConfig);


How do I now get the values that are actually being used from the composite parseContext?  I want to confirm that they are as expected.
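
One possible way to check this, sketched under assumptions: ParseContext stores each object under the class passed to set(), and get() with the same class hands that object back (or null if nothing was set), so parseContext.get(PDFParserConfig.class).getOcrStrategy() should show what was put in. My understanding is that PDFParser prefers a PDFParserConfig found in the ParseContext over the one built from tika-config.xml, though how the two are combined may differ between Tika versions. The stand-alone sketch below models that lookup; TypedContext, PdfSettings and effectiveOcrStrategy are hypothetical stand-ins (not Tika classes) so that it runs without Tika on the classpath.

```java
import java.util.HashMap;
import java.util.Map;

public class ContextReadBack {

    /** Hypothetical stand-in for a class-keyed context; not a Tika class. */
    static final class TypedContext {
        private final Map<String, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key.getName(), value);
        }

        /** Returns the object stored under the key, or null if nothing was set. */
        <T> T get(Class<T> key) {
            return key.cast(entries.get(key.getName()));
        }

        /** Returns the object stored under the key, or the given default if nothing was set. */
        <T> T get(Class<T> key, T defaultValue) {
            T value = get(key);
            return value != null ? value : defaultValue;
        }
    }

    /** Hypothetical stand-in for a per-parser settings object. */
    static final class PdfSettings {
        private final String ocrStrategy;
        PdfSettings(String ocrStrategy) { this.ocrStrategy = ocrStrategy; }
        String getOcrStrategy() { return ocrStrategy; }
    }

    /** Strategy that would apply: the context override if present, else the parser's default. */
    static String effectiveOcrStrategy(TypedContext context, PdfSettings parserDefault) {
        return context.get(PdfSettings.class, parserDefault).getOcrStrategy();
    }

    public static void main(String[] args) {
        PdfSettings fromConfigFile = new PdfSettings("NO_OCR");
        TypedContext context = new TypedContext();

        // Nothing set on the context yet, so the default from the config file applies.
        System.out.println(effectiveOcrStrategy(context, fromConfigFile));

        // After an override is set, reading it back by class shows the override.
        context.set(PdfSettings.class, new PdfSettings("AUTO"));
        System.out.println(effectiveOcrStrategy(context, fromConfigFile));
    }
}
```

Note that the lookup is by class, so you need to know which config class to ask for. I am not aware of a public method to enumerate everything stored in a ParseContext, so treat that as something to verify against your Tika version.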


RE: {EXTERNAL}New config paradigm

Posted by Peter Kronenberg <pe...@torch.ai>.
Well, both, I guess.  When I send the result back to the user, I want to also capture all the options that were used.  I can get them from what was set, but I'd rather get them from the 'source'.  If there is no practical way to do that, then once I'm convinced that it all 'works', I'll be OK with just echoing back what was set.

From: Tim Allison <ta...@apache.org>
Sent: Friday, February 12, 2021 11:40 AM
To: user@tika.apache.org
Subject: Re: {EXTERNAL}New config paradigm

Is the goal to do this on an ongoing/programmatic basis, or do you just want debugging info during development?

On Fri, Feb 12, 2021 at 9:54 AM Peter Kronenberg <pe...@torch.ai> wrote:
Still trying to understand how I get the settings that have been set on a parseContext.  In other words, let’s say that I just have a parseContext. I have no idea what configs have been added to it.  Is there a way to extract the parsers or the configs from the parseContext and view the settings?
I can use the settings that I *think* I passed into it, but I would rather get the settings from the parseContext itself, to ensure that they are what I think they are.




Re: {EXTERNAL}New config paradigm

Posted by Tim Allison <ta...@apache.org>.
Is the goal to do this on an ongoing/programmatic basis, or do you just
want debugging info during development?

On Fri, Feb 12, 2021 at 9:54 AM Peter Kronenberg <pe...@torch.ai>
wrote:

> Still trying to understand how I get the settings that have been set on a
> parseContext.  In other words, let’s say that I just have a parseContext. I
> have no idea what configs have been added to it.  Is there a way to extract
> the parsers or the configs from the parseContext and view the settings?
>
> I can use the settings that I *think* I passed into it, but I would
> rather get the settings from the parseContext itself, to ensure that they
> are what I think they are.

RE: {EXTERNAL}New config paradigm

Posted by Peter Kronenberg <pe...@torch.ai>.
Still trying to understand how I get the settings that have been set on a parseContext.  In other words, let’s say that I just have a parseContext. I have no idea what configs have been added to it.  Is there a way to extract the parsers or the configs from the parseContext and view the settings?
I can use the settings that I *think* I passed into it, but I would rather get the settings from the parseContext itself, to ensure that they are what I think they are.

