Posted to dev@tika.apache.org by "Zaheer Beig (JIRA)" <ji...@apache.org> on 2014/08/30 12:13:52 UTC
[jira] [Created] (TIKA-1405) German content detected as French
Zaheer Beig created TIKA-1405:
---------------------------------
Summary: German content detected as French
Key: TIKA-1405
URL: https://issues.apache.org/jira/browse/TIKA-1405
Project: Tika
Issue Type: Bug
Components: languageidentifier
Affects Versions: 1.4
Environment: Linux
Reporter: Zaheer Beig
Hi,
We are using Apache Tika 1.4 for document-to-text conversion and language detection in one of our projects. We are facing the following issues with language detection:
1. When the text is in all UPPER CASE, even though the language is English, it gets detected as Estonian.
2. For much of our German content, the language gets detected as French [though this is not the case for all German content].
Any update on this will be very helpful.
--
This message was sent by Atlassian JIRA
(v6.2#6252)
Re: [jira] [Created] (TIKA-1405) German content detected as French
Posted by Oleg Tikhonov <ol...@apache.org>.
Hi,
Does the content contain only one language, or is it mixed?
If the text contains a single language, then something seems wrong with
our language profiles. If it is mixed, then this is kind of OK: the first
language detected will be the answer.
What is the size of the content: one word or a larger chunk of text? In
general, language detection is more precise on long text than on short text.
Best regards,
Oleg
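
To illustrate the point about sample size and language profiles, here is a minimal sketch of profile-based detection with character trigrams. It is not Tika's implementation: the class TrigramSketch, its methods, the two sample sentences, and the choice of cosine similarity are all assumptions made for illustration. The input is lower-cased before counting, so an all-caps string produces the same trigrams as its lower-case form. Whether lower-casing the input helps with the all-caps case in Tika 1.4 would still need to be tested.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Toy trigram-based language scorer (hypothetical; not Tika's LanguageIdentifier). */
final class TrigramSketch {

    /** Counts letter trigrams per word, after lower-casing and padding each word with '_'. */
    public static Map<String, Integer> trigrams(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (word.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            String padded = "_" + word + "_";
            for (int i = 0; i + 3 <= padded.length(); i++) {
                counts.merge(padded.substring(i, i + 3), 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Cosine similarity between two trigram count maps; 0.0 if either map is empty. */
    public static double similarity(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += (double) e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += (double) v * v;
        }
        return (normA == 0 || normB == 0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    /** Returns the key of the best-matching sample text, or "unknown" if nothing overlaps. */
    public static String detect(String text, Map<String, String> samples) {
        Map<String, Integer> query = trigrams(text);
        String best = "unknown";
        double bestScore = 0.0;
        for (Map.Entry<String, String> e : samples.entrySet()) {
            double score = similarity(query, trigrams(e.getValue()));
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Tiny made-up "profiles"; real profiles are built from large corpora.
        Map<String, String> samples = new HashMap<>();
        samples.put("de", "der hund und die katze sind im haus");
        samples.put("fr", "le chien et le chat sont dans la maison");
        System.out.println(detect("DIE KATZE", samples));
    }
}
```

A one-word input yields only a handful of trigrams, so a few trigrams shared with the wrong profile can decide the result. A longer input spreads the score over many more trigrams, which is one plausible reason detection is more reliable on larger text.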