Posted to dev@tika.apache.org by "Jan Høydahl (JIRA)" <ji...@apache.org> on 2010/08/24 19:05:17 UTC

[jira] Created: (TIKA-496) Language identifier profile comparison favors large profiles

Language identifier profile comparison favors large profiles
------------------------------------------------------------

                 Key: TIKA-496
                 URL: https://issues.apache.org/jira/browse/TIKA-496
             Project: Tika
          Issue Type: Bug
          Components: languageidentifier
    Affects Versions: 0.7
            Reporter: Jan Høydahl


I think I've found a flaw in the distance algorithm.

In the LanguageProfile.java distance() method, we normalize the frequency of an ngram by dividing its count by the total count.
The total count for a profile is simply the sum of all counts in the profile.

The problem is that the .ngp files are cut off at 1000 entries, so the total count becomes the sum of only those 1000 entries.
However, there is a long tail of lower-frequency ngrams which are cut off and therefore not included in the total count.
The effect is that ngrams from profiles with a large training set carry more weight than ngrams from profiles with a smaller training set.
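
To make the skew concrete, here is a minimal, self-contained sketch (class name, ngrams and numbers are all made up - this is not the actual Tika code): the same true frequency ends up with different normalized weights, purely depending on how much count mass each profile's cutoff discarded.

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class TruncatedTotalDemo {

        // Current behavior: normalize against the sum of only the
        // retained (post-cutoff) entries, not the real corpus total.
        static double normalized(Map<String, Long> retained, String ngram) {
            long truncatedTotal = 0;
            for (long c : retained.values()) {
                truncatedTotal += c;
            }
            return retained.get(ngram) / (double) truncatedTotal;
        }

        public static void main(String[] args) {
            // Hypothetical profiles: "xyz" has the same true frequency
            // (0.8%) in both corpora, but profile B's cutoff discarded
            // a bigger share of its total count mass.
            Map<String, Long> a = new LinkedHashMap<>();
            a.put("xyz", 10_000L);                // real total: 1,250,000
            a.put("<other retained>", 990_000L);

            Map<String, Long> b = new LinkedHashMap<>();
            b.put("xyz", 20_000L);                // real total: 2,500,000
            b.put("<other retained>", 1_480_000L);

            System.out.println("A: " + normalized(a, "xyz")); // 0.0100
            System.out.println("B: " + normalized(b, "xyz")); // ~0.0133
            // Against the real totals both would be 0.008; the truncated
            // totals inflate both, but B's more, purely due to the cutoff.
        }
    }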

You can see this effect especially well when classifying short texts in a language which has similar sister languages with larger training sets. My example is "no" vs. "da".

Sample from the tail of "no.ngp":
_gå 461
ask 461
ria 459
små 459

...and from the tail of "dk.ngp":
dbr 966
ost 966
ævn 964

It is obvious that "dk" is cut off at much higher counts than "no" (around 965 vs. around 460 at the 1000-entry mark), so both its truncated total and the count mass discarded below the cutoff are larger.
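
To put rough numbers on it (the fractions here are invented, just for illustration): if the 1000 retained "no" entries cover, say, 90% of all ngram occurrences in their corpus while the retained "dk" entries cover only 75% of a larger corpus, then an ngram with a true frequency of 1% is normalized to 0.01/0.90 ≈ 0.0111 in "no" but to 0.01/0.75 ≈ 0.0133 in "dk". Every "dk" entry is inflated more, so "dk" gets a systematic edge.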

One solution is to compute the real total count when generating the .ngp file and store that total in the profile file itself, instead of summing the counts when loading the cutoff profile.
Alternatively, normalize the counts before writing the .ngp file, so that the top entry is always 100000.
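
A rough sketch of the second variant (class and method names invented for illustration; the real profile writer may differ):

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class NgpWriterSketch {
        private static final long TOP = 100_000L;

        // Scale raw counts so the most frequent ngram is written as
        // exactly 100000. Assumes the map is ordered by descending
        // count, like the entries in a .ngp file.
        static Map<String, Long> normalizeCounts(Map<String, Long> rawCounts) {
            long max = rawCounts.values().iterator().next();
            Map<String, Long> scaled = new LinkedHashMap<>();
            rawCounts.forEach((ngram, count) ->
                    scaled.put(ngram, Math.round(count * (double) TOP / max)));
            return scaled;
        }
    }

The first variant would instead amount to one extra header line in the .ngp file carrying the real, pre-cutoff total, which the loader reads instead of summing the retained entries.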

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (TIKA-496) Language identifier profile comparison favors large profiles

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901977#action_12901977 ] 

Ken Krugler commented on TIKA-496:
----------------------------------

I think that the current profile data was generated from a corpus of the same set of documents (EU publications) translated into the target languages. In that situation, the total ngram counts should be similar, so the problem you mention shouldn't appear.

I'd be in favor of changing the profile file format to carry explicit frequencies rather than raw counts.
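
For example (a hypothetical layout, not a settled format), each .ngp line could carry the ngram's share of the full, pre-cutoff total instead of a raw count:

    th_ 0.0312
    he_ 0.0287
    _th 0.0251

The loader would then have nothing to sum, and the 1000-entry cutoff could no longer distort the normalization.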




[jira] Commented: (TIKA-496) Language identifier profile comparison favors large profiles

Posted by "Jan Høydahl (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901980#action_12901980 ] 

Jan Høydahl commented on TIKA-496:
----------------------------------

Well, Norway is not part of the EU, so those documents probably don't exist - the Norwegian corpus is arguably smaller, judging from the .ngp file, and there are no tests for "no" either.

