Posted to dev@tika.apache.org by "Zaheer Beig (JIRA)" <ji...@apache.org> on 2014/08/30 12:13:52 UTC

[jira] [Created] (TIKA-1405) German content detected as French

Zaheer Beig created TIKA-1405:
---------------------------------

             Summary: German content detected as French
                 Key: TIKA-1405
                 URL: https://issues.apache.org/jira/browse/TIKA-1405
             Project: Tika
          Issue Type: Bug
          Components: languageidentifier
    Affects Versions: 1.4
         Environment: Linux
            Reporter: Zaheer Beig


Hi,
We are using Apache Tika 1.4 for document-to-text conversion and language detection in one of our projects. We are facing the following issues with language detection:

1. When the text is in all UPPER CASE, even though the language is English, it gets detected as Estonian.
2. For much of our German content, the language gets detected as French (though this is not the case for all German content).

Any update on this will be very helpful.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Re: [jira] [Created] (TIKA-1405) German content detected as French

Posted by Oleg Tikhonov <ol...@apache.org>.
Hi,
Does the content contain only one language, or is it mixed?
If the text contains a single language, then something seems wrong in
our language profiles. If it is mixed, then this is kind of OK: the first
language detected will be the answer.

What is the size of the content? One word or a larger chunk of text? In general,
language detection is more accurate on large texts than on small ones.
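
As a rough illustration of why text size matters, here is a toy sketch that
counts letter trigrams in a short and a longer German sample. It assumes an
n-gram style detector and is not Tika's actual implementation; the class
name TrigramSketch, the trigrams helper, and the sample sentences are all
made up for this example. A short input yields only a handful of trigrams,
so there is little evidence to compare against any language profile.

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration only; TrigramSketch is a hypothetical name, not a Tika class.
final class TrigramSketch {

    // Counts letter trigrams. Runs of non-letters collapse into one '_'
    // separator, and letters are lower-cased (a choice made for this sketch).
    static Map<String, Integer> trigrams(String text) {
        StringBuilder norm = new StringBuilder("_");
        for (char ch : text.toCharArray()) {
            if (Character.isLetter(ch)) {
                norm.append(Character.toLowerCase(ch));
            } else if (norm.charAt(norm.length() - 1) != '_') {
                norm.append('_');
            }
        }
        // Close the last word with a separator if the text ended on a letter.
        if (norm.charAt(norm.length() - 1) != '_') {
            norm.append('_');
        }
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= norm.length(); i++) {
            counts.merge(norm.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String shortText = "Das Haus";
        String longText = "Das Haus steht am Ende der Gasse und ist sehr alt";
        System.out.println("short: " + trigrams(shortText).size() + " distinct trigrams");
        System.out.println("long:  " + trigrams(longText).size() + " distinct trigrams");
    }
}
```

The longer sample produces many more distinct trigrams than the short one,
which is the intuition behind asking for the size of the input.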

Best regards,
Oleg

