Posted to dev@tika.apache.org by Juan Elosua <ju...@gmail.com> on 2019/12/05 11:43:29 UTC

Wrong language detection in tika server 1.22

Hi all,

Since this is my first email, allow me to give some context: my name is Juan
Elosua, and I came across Tika while looking for a document parser for an
information security project we are working on.

First of all, sorry if this is not the right way to report potential issues;
I was unsure how to communicate them.

The potential issue I found concerns tika-server version 1.22, and more
precisely the language detection endpoint (/language/stream).

If I send a PDF document to that endpoint, it returns 'th' (Thai) as the
detected language, but the document is in Spanish. When I converted the PDF
to a plain text file (using pdftotext) and reran the test, the language was
detected correctly as 'es':

    $ curl -X PUT --data-binary @BOE-A-2019-9455.pdf http://localhost:9998/language/stream
    th
    $ curl -X PUT --data-binary @BOE-A-2019-9455.txt http://localhost:9998/language/stream
    es

I used a publicly available PDF file to make this easy to reproduce. You can
find the original document here:
https://www.boe.es/boe/dias/2019/06/24/pdfs/BOE-A-2019-9455.pdf

Please let me know the best way to report issues.

I saw the "reporting issues" docs for Tika, but should I create an account in
order to report issues, or is that reserved for the core team?

Thanks in advance

Juan

Re: Wrong language detection in tika server 1.22

Posted by Juan Elosua <ju...@gmail.com>.

Hi Tim,

Understood, so the only difference between the /stream and /string endpoints
is the conversion of the byte stream to UTF-8.
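
To illustrate the point, here is a minimal sketch (not taken from the Tika
code base) of a check that a file decodes as UTF-8 before it is sent to
/language/stream. The helper name is_utf8 is hypothetical, and the sketch
assumes the iconv utility is installed.

```shell
# Succeeds (exit status 0) only if the file decodes cleanly as UTF-8.
# iconv stops with a non-zero status on the first invalid byte sequence.
is_utf8() {
  iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}
```

A PDF will typically fail this check because it contains binary data, while
pdftotext output should pass as long as it was written as UTF-8. For example,
"is_utf8 BOE-A-2019-9455.txt && echo ok" only prints "ok" for input the
endpoint can reasonably interpret.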

With the change on the wiki, it is clearer that the endpoint's handling of
the file is limited to that.

Thank you

Cheers

Juan

On Thu, Dec 5, 2019, 17:21 Tim Allison <ta...@apache.org> wrote:

> I just updated our wiki.  Please let me know if we can improve it further.
>
>
> https://cwiki.apache.org/confluence/display/TIKA/TikaJAXRS#TikaJAXRS-LanguageResource
>
> On Thu, Dec 5, 2019 at 10:44 AM Tim Allison <ta...@apache.org> wrote:
>
> > In looking at the source code for this (for the first time?)...it looks
> > like that endpoint expects UTF-8 text.  It does not parse the file and then
> > run lang id on the parsed text.

Re: Wrong language detection in tika server 1.22

Posted by Tim Allison <ta...@apache.org>.

I just updated our wiki.  Please let me know if we can improve it further.

https://cwiki.apache.org/confluence/display/TIKA/TikaJAXRS#TikaJAXRS-LanguageResource

On Thu, Dec 5, 2019 at 10:44 AM Tim Allison <ta...@apache.org> wrote:

> In looking at the source code for this (for the first time?)...it looks
> like that endpoint expects UTF-8 text.  It does not parse the file and then
> run lang id on the parsed text.

Re: Wrong language detection in tika server 1.22

Posted by Tim Allison <ta...@apache.org>.

In looking at the source code for this (for the first time?)...it looks
like that endpoint expects UTF-8 text.  It does not parse the file and then
run lang id on the parsed text.
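
A possible workaround, sketched under assumptions rather than taken from the
thread: ask the server to extract the text first, then pass that text to the
language endpoint. This assumes that tika-server's /tika resource returns
plain text when called with PUT and an "Accept: text/plain" header, and that
the server listens on localhost:9998 as in the original report. The function
name detect_language and the TIKA_URL variable are hypothetical.

```shell
# Base URL of the running tika-server (assumption: default from the thread).
TIKA_URL="${TIKA_URL:-http://localhost:9998}"

# Parse a document to plain text, then run language detection on that text.
detect_language() {
  # Step 1: PUT the raw file to /tika and request plain-text output.
  # Step 2: pipe the extracted text (read from stdin via @-) to
  #         /language/stream, which expects UTF-8 text.
  curl -s -X PUT -H "Accept: text/plain" --data-binary @"$1" "$TIKA_URL/tika" \
    | curl -s -X PUT --data-binary @- "$TIKA_URL/language/stream"
}
```

With a server running, this could be called as
"detect_language BOE-A-2019-9455.pdf".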
