Posted to dev@tika.apache.org by Peter Kronenberg <pe...@torch.ai> on 2021/01/28 14:29:51 UTC

FW: {EXTERNAL}Invalid language code

Sent this to users@ but realized it’s probably more relevant here

From: Peter Kronenberg <pe...@torch.ai>
Sent: Wednesday, January 27, 2021 9:22 PM
To: user@tika.apache.org
Subject: RE: {EXTERNAL}Invalid language code

This email was sent from outside your organisation, yet is displaying the name of someone from your organisation. This often happens in phishing attempts. Please only interact with this email if you know its source and that the content is safe.

Upon looking at the code, I realize that I might have been too ambitious about putting the errors in the metadata.  I suppose an exception would be fine, similar to what it’s doing now 😊

From: Peter Kronenberg <pe...@torch.ai>
Sent: Wednesday, January 27, 2021 8:51 PM
To: user@tika.apache.org
Subject: RE: {EXTERNAL}Invalid language code


Different, but related issue.  It seems that Tika doesn’t support Tesseract scripts.  Looks like this came out with version 4.0.0.  See https://github.com/manisandro/gImageReader/issues/323

In the Tessdata directory there is a directory called script.  These are pseudo-language files that define the script or alphabet of the language.  See https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#LANGUAGES and https://github.com/tesseract-ocr/tessdata/tree/master/script

Right now, Tika uses a regular expression to validate the language string, assuming it is a set of ISO 639-2 language codes separated by plus signs.
In light of my previous comment about validating that the language (or script) file exists, I suggest splitting the language string on the plus sign and doing no other validation of the string, but instead actually checking that each file exists in either tessdata or tessdata/script.
If any of them doesn’t exist, a message would be put in the metadata
(which brings me to another issue: I think some of the warnings that Tika puts out should go into the metadata, perhaps with a tag of x-message, to make it easier to pass information back programmatically, since the warnings just go to the console and aren’t passed back to the caller.  But that’s another issue)
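
A minimal sketch of the existence check proposed above, not Tika's actual code. It assumes that Tesseract model files are named <code>.traineddata and that script models live in a "script" subdirectory of tessdata, as described in this message. The class name TessdataLanguageCheck and the method findMissing are hypothetical names used only for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika.
class TessdataLanguageCheck {

    // Assumption: model files are named "<code>.traineddata".
    private static final String SUFFIX = ".traineddata";

    /**
     * Splits a language string such as "eng+fra" on '+' and returns the
     * entries that have no model file in tessdata or tessdata/script.
     */
    public static List<String> findMissing(Path tessdata, String langSpec) {
        List<String> missing = new ArrayList<>();
        for (String part : langSpec.split("\\+")) {
            String name = part.trim();
            if (name.isEmpty()) {
                continue; // ignore empty entries such as in "eng++fra"
            }
            Path asLanguage = tessdata.resolve(name + SUFFIX);
            Path asScript = tessdata.resolve("script").resolve(name + SUFFIX);
            if (!Files.isRegularFile(asLanguage) && !Files.isRegularFile(asScript)) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway tessdata layout so the sketch runs anywhere.
        Path tessdata = Files.createTempDirectory("tessdata");
        Files.createDirectories(tessdata.resolve("script"));
        Files.createFile(tessdata.resolve("eng.traineddata"));
        Files.createFile(tessdata.resolve("script").resolve("Latin.traineddata"));

        // "xyz" has no model file in either location.
        System.out.println("Missing: " + findMissing(tessdata, "eng+Latin+xyz"));
    }
}
```

A caller could throw an exception, or record a message, whenever the returned list is not empty. Which of the two is preferable is exactly the open question in this thread.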

Thoughts?



From: Peter Kronenberg <pe...@torch.ai>
Sent: Wednesday, January 27, 2021 4:03 PM
To: user@tika.apache.org
Subject: {EXTERNAL}Invalid language code


If I pass in a nonexistent language code (i.e., the code matches the regular expression, but there is no corresponding language file in Tessdata), I am not getting any error message.  If I do it from the command line with Tesseract, I get an error, but with Tika, I’m not seeing any error in the logs.  Not sure why the error from Tesseract is not being displayed somewhere.  Tika just blindly calls Tesseract but then doesn’t get any output back.  Is that the expected behavior?

RE: FW: {EXTERNAL}Invalid language code

Posted by Peter Kronenberg <pe...@torch.ai>.
Tesseract puts out some very ugly and misleading messages, which simply assume that you haven’t set the tessdata directory correctly.

I’ve already been playing around with code to check the existence of the language files, which is a lot cleaner.



-----Original Message-----
From: Tim Allison <ta...@apache.org>
Sent: Thursday, January 28, 2021 1:07 PM
To: <de...@tika.apache.org> <de...@tika.apache.org>
Subject: Re: FW: {EXTERNAL}Invalid language code

>if any of them don’t exists, then a message would be put in the metadata

Rather than having us check the existence of the files, can we report tesseract complaining about tesseract not having that script installed...if it does?


Re: FW: {EXTERNAL}Invalid language code

Posted by Tim Allison <ta...@apache.org>.
>if any of them don’t exists, then a message would be put in the metadata

Rather than having us check the existence of the files, can we report
tesseract complaining about tesseract not having that script installed...if
it does?
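
A minimal sketch of the alternative suggested here: run the external command, collect whatever it writes to standard error, and hand that text back to the caller. This is not Tika's actual code. The class ProcessErrorReport, its Result type, and the run method are hypothetical names, and the sketch assumes Java 9 or later (for Redirect.DISCARD and InputStream.readAllBytes).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper, not part of Tika.
class ProcessErrorReport {

    /** Exit code of the child process plus the text it wrote to stderr. */
    public static final class Result {
        public final int exitCode;
        public final String stderr;

        Result(int exitCode, String stderr) {
            this.exitCode = exitCode;
            this.stderr = stderr;
        }
    }

    /** Runs the command, discards its stdout, and captures its stderr. */
    public static Result run(List<String> command)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Discard stdout so an unread, full pipe cannot block the child.
        builder.redirectOutput(ProcessBuilder.Redirect.DISCARD);
        Process process = builder.start();

        String stderr;
        try (InputStream err = process.getErrorStream()) {
            // Reads until the child closes its stderr, normally at exit.
            stderr = new String(err.readAllBytes(), StandardCharsets.UTF_8);
        }
        int exitCode = process.waitFor();
        return new Result(exitCode, stderr.trim());
    }

    public static void main(String[] args) throws Exception {
        // Demo with the Java launcher itself, which prints its version
        // banner to stderr; it stands in for the OCR command here.
        String java = Paths.get(System.getProperty("java.home"), "bin", "java")
                .toString();
        Result result = run(List.of(java, "-version"));
        System.out.println("exit code: " + result.exitCode);
        System.out.println("stderr: " + result.stderr);
    }
}
```

In the situation discussed in this thread, the command would be the Tesseract invocation (for illustration, something like run(List.of("tesseract", "page.png", "stdout", "-l", "xyz"))), and a non-empty stderr or non-zero exit code could be logged or wrapped in an exception so the caller sees Tesseract's own complaint.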

