Posted to dev@tika.apache.org by "Ken Krugler (JIRA)" <ji...@apache.org> on 2010/01/24 19:52:17 UTC

[jira] Created: (TIKA-369) Improve accuracy of language detection

Improve accuracy of language detection
--------------------------------------

                 Key: TIKA-369
                 URL: https://issues.apache.org/jira/browse/TIKA-369
             Project: Tika
          Issue Type: Improvement
          Components: languageidentifier
    Affects Versions: 0.6
            Reporter: Ken Krugler
            Assignee: Ken Krugler


Currently the LanguageProfile code uses 3-grams to find the best language profile using Pearson's chi-square test. This has three issues:

1. The results aren't very good for short runs of text. Ted Dunning's paper (attached) indicates that a log-likelihood ratio (LLR) test works much better, which would also make language detection faster, since less text would need to be processed. It might be sufficient to re-enable support for 1..4-grams (similar to the original Nutch code) to improve quality.
2. The current LanguageIdentifier.isReasonablyCertain() method uses an exact value as a threshold for certainty. This is very sensitive to the amount of text being processed, and thus gives false negative results for short runs of text.
3. Certainty should also be based on how much better the result is for language X, compared to the next best language. If two languages both had identical sum-of-squares values, and this value was below the threshold, then the result is still not very certain.
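
As a rough illustration of points 2 and 3, here is a minimal sketch of a certainty check that looks at the gap between the best and second-best language as well as at an absolute threshold. It is not Tika's implementation: the class name, the method signature, and the maxDistance and minMargin parameters are all hypothetical, and the sketch simply assumes that each candidate language has a distance where lower means a closer match. The threshold is a parameter so that a caller could scale it with the amount of text, which is one possible way to approach point 2.

```java
import java.util.Map;

/** Hypothetical helper for illustration only; not part of Tika's API. */
final class CertaintySketch {

    /**
     * @param distances   language code -> distance from the text (lower is better)
     * @param maxDistance largest distance still accepted for the best match
     * @param minMargin   required relative gap between best and runner-up,
     *                    e.g. 0.2 means the best must be at least 20% closer
     */
    static boolean isReasonablyCertain(Map<String, Double> distances,
                                       double maxDistance, double minMargin) {
        double best = Double.POSITIVE_INFINITY;
        double second = Double.POSITIVE_INFINITY;
        for (double d : distances.values()) {
            if (d < best) {
                second = best;
                best = d;
            } else if (d < second) {
                second = d;
            }
        }
        // Absolute check (point 2). An empty map leaves best at infinity and fails here.
        if (best > maxDistance) {
            return false;
        }
        // Only one candidate: there is no runner-up to compare against.
        if (second == Double.POSITIVE_INFINITY) {
            return true;
        }
        // Both distances are zero: an exact tie, so the result is not certain.
        if (second == 0.0) {
            return false;
        }
        // Relative check (point 3): how much closer is the best than the runner-up?
        return (second - best) / second >= minMargin;
    }
}
```

With arbitrary illustrative values, isReasonablyCertain(Map.of("en", 0.010, "de", 0.011), 0.022, 0.2) is rejected because the two languages are nearly tied, even though both are under the threshold, while Map.of("en", 0.010, "de", 0.030) is accepted.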



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (TIKA-369) Improve accuracy of language detection

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler updated TIKA-369:
-----------------------------

    Attachment: lingdet-mccs.pdf

Smaller version of Ted Dunning's 1994 paper.


[jira] Updated: (TIKA-369) Improve accuracy of language detection

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler updated TIKA-369:
-----------------------------

    Attachment:     (was: dunning94-trimmed.pdf)


[jira] Updated: (TIKA-369) Improve accuracy of language detection

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler updated TIKA-369:
-----------------------------

    Attachment: dunning94-trimmed.pdf


[jira] Commented: (TIKA-369) Improve accuracy of language detection

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804288#action_12804288 ] 

Ken Krugler commented on TIKA-369:
----------------------------------

Karl Wettin had contributed a language detector to Lucene, though it was never rolled in. See [https://issues.apache.org/jira/browse/LUCENE-826]. This might be an interesting alternative.

Jean-François Halleux also contributed a "language guesser" to Lucene a while back. See [https://issues.apache.org/jira/browse/LUCENE-180]. This was marked as a duplicate of [LUCENE-826].


[jira] Updated: (TIKA-369) Improve accuracy of language detection

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler updated TIKA-369:
-----------------------------

    Description: 
Currently the LanguageProfile code uses 3-grams to find the best language profile using Pearson's chi-square test. This has three issues:

1. The results aren't very good for short runs of text. Ted Dunning's paper (attached) indicates that a log-likelihood ratio (LLR) test works much better, which would then make language detection faster due to less text needing to be processed.
2. The current LanguageIdentifier.isReasonablyCertain() method uses an exact value as a threshold for certainty. This is very sensitive to the amount of text being processed, and thus gives false negative results for short runs of text.
3. Certainty should also be based on how much better the result is for language X, compared to the next best language. If two languages both had identical sum-of-squares values, and this value was below the threshold, then the result is still not very certain.



  was:
Currently the LanguageProfile code uses 3-grams to find the best language profile using Pearson's chi-square test. This has three issues:

1. The results aren't very good for short runs of text. Ted Dunning's paper (attached) indicates that a Lucas-Lehmer-Riesel (LLR) test works much better, which would then make language detection faster due to less text needing to be processed. It might be sufficient to re-enable support for 1..4-grams (similar to original Nutch code) to improve quality.
2. The current LanguageIdentifier.isReasonablyCertain() method uses an exact value as a threshold for certainty. This is very sensitive to the amount of text being processed, and thus gives false negative results for short runs of text.
3. Certainty should also be based on how much better the result is for language X, compared to the next best language. If two languages both had identical sum-of-squares values, and this value was below the threshold, then the result is still not very certain.





[jira] Issue Comment Edited: (TIKA-369) Improve accuracy of language detection

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804288#action_12804288 ] 

Ken Krugler edited comment on TIKA-369 at 1/24/10 7:39 PM:
-----------------------------------------------------------

Karl Wettin had contributed a language detector to Lucene, though it was never rolled in. See [/LUCENE-826]. This might be an interesting alternative.

Jean-François Halleux also contributed a "language guesser" to Lucene a while back. See [LUCENE-180]. This was marked as a duplicate of [LUCENE-826].

      was (Author: kkrugler):
    Karl Wettin had contributed a language detector to Lucene, though it was never rolled in. See [https://issues.apache.org/jira/browse/LUCENE-826]. This might be an interesting alternative.

Jean-François Halleux also contributed a "language guesser" to Lucene a while back. See [https://issues.apache.org/jira/browse/LUCENE-180]. This was marked as a duplicate of [LUCENE-826].
  

[jira] Issue Comment Edited: (TIKA-369) Improve accuracy of language detection

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804288#action_12804288 ] 

Ken Krugler edited comment on TIKA-369 at 1/24/10 7:39 PM:
-----------------------------------------------------------

Karl Wettin had contributed a language detector to Lucene, though it was never rolled in. See [LUCENE-826]. This might be an interesting alternative.

Jean-François Halleux also contributed a "language guesser" to Lucene a while back. See [LUCENE-180]. This was marked as a duplicate of [LUCENE-826].

      was (Author: kkrugler):
    Karl Wettin had contributed a language detector to Lucene, though it was never rolled in. See [/LUCENE-826]. This might be an interesting alternative.

Jean-François Halleux also contributed a "language guesser" to Lucene a while back. See [LUCENE-180]. This was marked as a duplicate of [LUCENE-826].
  

[jira] Updated: (TIKA-369) Improve accuracy of language detection

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler updated TIKA-369:
-----------------------------

    Attachment: Surprise and Coincidence.pdf

Attaching another paper from Ted that makes it clearer why the chi-squared method currently used has problems for small text chunks.
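
To make the small-sample problem concrete, here is a minimal sketch comparing the textbook goodness-of-fit forms of Pearson's chi-square statistic and the log-likelihood ratio (G-squared) statistic. This is an illustration only: the class and method names are hypothetical, the formulas are the standard multinomial ones rather than anything taken from Tika or copied from the attached papers, and the code assumes every expected probability is greater than zero.

```java
/** Hypothetical illustration; not Tika code. */
final class FitStatisticsSketch {

    /** Pearson's statistic: sum over i of (k_i - N*p_i)^2 / (N*p_i). */
    static double pearsonChiSquare(long[] observed, double[] expectedProb) {
        long n = 0;
        for (long k : observed) {
            n += k;
        }
        double sum = 0.0;
        for (int i = 0; i < observed.length; i++) {
            double expected = n * expectedProb[i]; // assumed > 0
            double diff = observed[i] - expected;
            sum += diff * diff / expected;
        }
        return sum;
    }

    /** Log-likelihood ratio: 2 * sum over i of k_i * ln(k_i / (N*p_i)), with 0*ln(0) = 0. */
    static double logLikelihoodRatio(long[] observed, double[] expectedProb) {
        long n = 0;
        for (long k : observed) {
            n += k;
        }
        double sum = 0.0;
        for (int i = 0; i < observed.length; i++) {
            if (observed[i] > 0) { // zero counts contribute nothing
                double expected = n * expectedProb[i]; // assumed > 0
                sum += observed[i] * Math.log(observed[i] / expected);
            }
        }
        return 2.0 * sum;
    }
}
```

For example, take a 10-token sample in which an n-gram with expected probability 0.001 shows up once: observed counts {1, 9} against probabilities {0.001, 0.999}. Pearson's statistic comes out at roughly 98, dominated by the single rare event, while the log-likelihood ratio is roughly 7.3. With short text, one unexpected n-gram can swing the chi-square score far more than the LLR score.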
