Posted to dev@lucene.apache.org by "Karl Wettin (JIRA)" <ji...@apache.org> on 2007/03/07 05:22:24 UTC

[jira] Created: (LUCENE-826) Language detector

Language detector
-----------------

                 Key: LUCENE-826
                 URL: https://issues.apache.org/jira/browse/LUCENE-826
             Project: Lucene - Java
          Issue Type: New Feature
            Reporter: Karl Wettin
         Assigned To: Karl Wettin


A formula 1A token/ngram-based language detector. It requires a paragraph of text to avoid false positive classifications.

Depends on contrib/analyzers/ngrams for tokenization, and on Weka for classification (logistic support vector models), feature selection, and normalization of token frequencies. Optionally uses Wikipedia and NekoHTML for training data harvesting.

Initialized like this:

{code}
    LanguageRoot root = new LanguageRoot(new File("documentClassifier/language root"));

    root.addBranch("uralic");
    root.addBranch("fino-ugric", "uralic");
    root.addBranch("ugric", "uralic");
    root.addLanguage("fino-ugric", "fin", "finnish", "fi", "Suomi");

    root.addBranch("proto-indo european");
    root.addBranch("germanic", "proto-indo european");
    root.addBranch("northern germanic", "germanic");
    root.addLanguage("northern germanic", "dan", "danish", "da", "Danmark");
    root.addLanguage("northern germanic", "nor", "norwegian", "no", "Norge");
    root.addLanguage("northern germanic", "swe", "swedish", "sv", "Sverige");

    root.addBranch("west germanic", "germanic");
    root.addLanguage("west germanic", "eng", "english", "en", "UK");

    root.mkdirs();

    LanguageClassifier classifier = new LanguageClassifier(root);
    if (!new File(root.getDataPath(), "trainingData.arff").exists()) {
      classifier.compileTrainingData(); // from wikipedia
    }
    classifier.buildClassifier();
{code}


The training set built from Wikipedia consists of the pages describing the home country of each registered language, in the language to be trained. The example above passes this test:

(testEquals is the same as assertEquals, just not required to pass. Only one of them fails; see the comment.)
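A minimal sketch of what such a testEquals helper might look like (hypothetical; the actual helper ships in the attached patch):

```java
public class SoftAssert {
    // Hypothetical sketch: like assertEquals, but logs the mismatch
    // instead of failing the test run.
    static void testEquals(String expected, String actual) {
        boolean equal = (expected == null) ? actual == null : expected.equals(actual);
        if (!equal) {
            System.out.println("soft assertion failed: expected <" + expected
                    + "> but was <" + actual + ">");
        }
    }

    public static void main(String[] args) {
        testEquals("swe", "swe"); // passes silently
        testEquals("dan", "nor"); // logged, but the run continues
        System.out.println("done");
    }
}
```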

{code}
    assertEquals("swe", classifier.classify(sweden_in_swedish).getISO());
    testEquals("swe", classifier.classify(norway_in_swedish).getISO());
    testEquals("swe", classifier.classify(denmark_in_swedish).getISO());
    testEquals("swe", classifier.classify(finland_in_swedish).getISO());
    testEquals("swe", classifier.classify(uk_in_swedish).getISO());

    testEquals("nor", classifier.classify(sweden_in_norwegian).getISO());
    assertEquals("nor", classifier.classify(norway_in_norwegian).getISO());
    testEquals("nor", classifier.classify(denmark_in_norwegian).getISO());
    testEquals("nor", classifier.classify(finland_in_norwegian).getISO());
    testEquals("nor", classifier.classify(uk_in_norwegian).getISO());

    testEquals("fin", classifier.classify(sweden_in_finnish).getISO());
    testEquals("fin", classifier.classify(norway_in_finnish).getISO());
    testEquals("fin", classifier.classify(denmark_in_finnish).getISO());
    assertEquals("fin", classifier.classify(finland_in_finnish).getISO());
    testEquals("fin", classifier.classify(uk_in_finnish).getISO());

    testEquals("dan", classifier.classify(sweden_in_danish).getISO());
    // it is ok that this fails. dan and nor are very similar, and the document about norway in danish is very small.
    testEquals("dan", classifier.classify(norway_in_danish).getISO()); 
    assertEquals("dan", classifier.classify(denmark_in_danish).getISO());
    testEquals("dan", classifier.classify(finland_in_danish).getISO());
    testEquals("dan", classifier.classify(uk_in_danish).getISO());

    testEquals("eng", classifier.classify(sweden_in_english).getISO());
    testEquals("eng", classifier.classify(norway_in_english).getISO());
    testEquals("eng", classifier.classify(denmark_in_english).getISO());
    testEquals("eng", classifier.classify(finland_in_english).getISO());
    assertEquals("eng", classifier.classify(uk_in_english).getISO());
{code}

I don't know how well it works on lots of languages, but this fits my needs for now. I'll try to do more work on considering the language trees when classifying.

It takes a bit of time and RAM to build the training data, so the patch contains a pre-compiled arff-file.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-826) Language detector

Posted by "Peter Taylor (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541115 ] 

Peter Taylor commented on LUCENE-826:
-------------------------------------

Uh never mind ;) I have poked around and I am guessing you are using version 3.5.3 or thereabouts.





[jira] Updated: (LUCENE-826) Language detector

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wettin updated LUCENE-826:
-------------------------------

    Attachment: ld.tar.gz

Added support for all modern large Germanic, Balto-Slavic, and Latin languages, plus some others. I'll add the complete Indo-Iranian tree soon.

The test case gathers and classifies random pages from Wikipedia in the target language. False positives occur only on articles that are too small (again, 160 characters, one paragraph, is required) or on articles with very mixed language (e.g. one that is mostly a discography of a non-native band).

Documents with mixed languages could probably be handled at the paragraph level, reporting that the document is in language A but contains paragraphs (quotes, etc.) in languages B and C.
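The paragraph-level idea could be sketched like this. The Function stands in for the real classifier.classify(text).getISO() call, and the toy stub classifier is for illustration only:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

public class ParagraphLevelSketch {
    // Classify each paragraph separately and count languages per document,
    // so a document in language A with quoted paragraphs in B and C is
    // reported as such rather than misclassified as a whole.
    static Map<String, Integer> languagesByParagraph(String document,
                                                     Function<String, String> classify) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String paragraph : document.split("\n\\s*\n")) {
            if (paragraph.trim().length() < 160) continue; // too short to trust
            counts.merge(classify.apply(paragraph), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Toy stand-in classifier; a real run would wrap LanguageClassifier.
        Function<String, String> stub = p -> p.contains("och") ? "swe" : "eng";
        String doc = "a".repeat(200) + " och " + "b".repeat(10)
                + "\n\n" + "c".repeat(200);
        System.out.println(languagesByParagraph(doc, stub));
    }
}
```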

Supported languages (35):

swedish
danish
norwegian
icelandic
faroese

dutch
afrikaans
frisian

low german
german

english

latvian
lithuanian

russian
ukrainian
belarusian

czech
slovak
polish

bosnian
croatian
macedonian
bulgarian
slovenian
serbian

italian
spanish
french
portuguese

armenian

greek

hungarian
finnish
estonian

modern persian (farsi)

There are some languages in the training set that, due to low representation in Wikipedia, also have problems with false positive classifications:

Faroese, with its 80 paragraphs (the mean is 600), gets some 60% false positives.

Macedonian, with its 150 paragraphs, gets 45% false positives, most often as Serbian.

Croatian is often confused with Bosnian.

Also, some of these southern Slavic languages can be written in either the Cyrillic or the Latin alphabet, and that is something I should consider a bit.

All other languages are detected without any problems.

One simple way to reduce the false positives here is to manually check the training data. There are some <!-- html comments --> here and there. Hopefully they are washed away by the feature selection.
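A quick cleanup pass for those stray HTML comments before tokenization could be as simple as this (a sketch; the actual harvesting code may handle it elsewhere):

```java
public class StripComments {
    // Remove <!-- ... --> blocks from harvested Wikipedia text.
    // (?s) lets . match newlines so multi-line comments are caught
    // too; the non-greedy .*? stops at the first closing marker.
    static String stripHtmlComments(String raw) {
        return raw.replaceAll("(?s)<!--.*?-->", "");
    }

    public static void main(String[] args) {
        System.out.println(stripHtmlComments("Sverige<!-- wiki\nmarkup -->Norge"));
    }
}
```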

Preparing the training data (downloading from Wikipedia, parsing, tokenizing) for all these languages takes just a few minutes on my dual core, but the token feature selection (selecting the 7000 most prominent tokens out of 65000, in 20000 paragraphs of text) takes 90 minutes and consumes something like 700MB of heap.

Once the arff-file is created, the classifier takes 10 minutes to compile (the support vectors), and once done it consumes no more than a fistful of MB. It could probably be serialized and dumped to disk for faster loading at startup.
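The serialize-and-dump idea could look roughly like this. Weka classifiers implement java.io.Serializable, so plain object streams work; a serializable stand-in replaces the real model here so the sketch is self-contained:

```java
import java.io.*;

public class ModelCache {
    // Dump a built model to disk once, then reload it at startup
    // instead of recompiling the support vectors every time.
    static void save(Serializable model, File file) throws IOException {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(model);
        }
    }

    static Object load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                 new ObjectInputStream(new FileInputStream(file))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        File file = File.createTempFile("classifier", ".model");
        save("stand-in model", file); // a real run would pass the classifier
        System.out.println(load(file));
        file.delete();
    }
}
```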

The time it takes to classify a document will of course depend on its size. Wikipedia articles average out at about 500ms.

For really speedy classification of very large texts, one could switch to REPTree instead of the SVM. It does the job 95% as well (given a big enough text), but in 1% of the time, or 2ms per classification. I still focus on 160-character paragraphs, though.

Next step is optimization. The current training data for the 35 languages is 25000 instances with 7000 attributes. That is an insane amount of data. Way too much.

I think the CPU and RAM requirements can be optimized quite a bit by simply making the number of training instances (paragraphs) per language more even, say 500 per language. It is quite Gaussian right now, and that is wrong. Also, selecting 100 attributes (tokens) per language for use in the SVM rather than 200, as now, does not change the classification quality much, but would cut the time for creating the training data and building the classifier to roughly the square root of what it is now.
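The balancing step could be sketched as a simple per-language quota (hypothetical helper; 500 per the comment above, 2 here for brevity):

```java
import java.util.*;

public class BalanceTrainingData {
    // Cap the number of training paragraphs per language at a fixed
    // quota so the class distribution is even rather than Gaussian.
    static Map<String, List<String>> cap(Map<String, List<String>> byLanguage,
                                         int quota) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        byLanguage.forEach((lang, paragraphs) -> out.put(lang,
                new ArrayList<>(paragraphs.subList(0,
                        Math.min(quota, paragraphs.size())))));
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> data = new LinkedHashMap<>();
        data.put("swe", Arrays.asList("p1", "p2", "p3"));
        data.put("fao", Arrays.asList("p1"));
        System.out.println(cap(data, 2));
    }
}
```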

For now I run on my 6 languages. It takes just a minute to download data from Wikipedia, tokenize, and build the classifier, and classification time is about 100ms on average for a Wikipedia article.






[jira] Commented: (LUCENE-826) Language detector

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12478712 ] 

Karl Wettin commented on LUCENE-826:
------------------------------------

Ahhh, I could not let it go without some more tests. I added a bunch of languages and it seems to work quite splendidly. Again, 10-fold cross-validation output on paragraphs 160+ characters long:

Time taken to build model: 45.51 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances        5566               98.8808 %
Incorrectly Classified Instances        63                1.1192 %
Kappa statistic                          0.9874
Mean absolute error                      0.139 
Root mean squared error                  0.2555
Relative absolute error                 93.6301 %
Root relative squared error             93.7791 %
Total Number of Instances             5629     

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
  0.996     0.003      0.988     0.996     0.992      0.997    eng
  0.988     0          0.998     0.988     0.993      0.995    swe
  0.984     0.002      0.982     0.984     0.983      0.996    spa
  0.988     0          0.995     0.988     0.992      0.997    fre
  0.979     0.001      0.982     0.979     0.981      0.992    nld
  0.97      0.002      0.97      0.97      0.97       0.993    nor
  1         0          1         1         1          1        afr
  0.914     0.001      0.946     0.914     0.93       0.992    dan
  0.986     0.001      0.981     0.986     0.984      0.999    pot
  0.998     0.001      0.993     0.998     0.995      0.999    fin
  0.99      0.001      0.993     0.99      0.992      0.999    ita
  0.998     0          0.998     0.998     0.998      0.999    ger

=== Confusion Matrix ===

    a    b    c    d    e    f    g    h    i    j    k    l   <-- classified as
 1044    1    1    0    0    0    0    0    1    1    0    0 |    a = eng
    2  425    0    0    2    0    0    0    0    0    1    0 |    b = swe
    0    0  434    1    1    0    0    0    5    0    0    0 |    c = spa
    2    0    0  418    0    0    0    0    0    1    0    2 |    d = fre
    4    0    2    0  333    0    0    0    0    0    1    0 |    e = nld
    1    0    0    0    0  322    0    7    1    0    1    0 |    f = nor
    0    0    0    0    0    0  230    0    0    0    0    0 |    g = afr
    1    0    0    0    2   10    0  139    0    0    0    0 |    h = dan
    0    0    5    0    0    0    0    0  362    0    0    0 |    i = pot
    0    0    0    0    0    0    0    1    0  440    0    0 |    j = fin
    2    0    0    0    1    0    0    0    0    1  417    0 |    k = ita
    1    0    0    1    0    0    0    0    0    0    0 1002 |    l = ger



The language tree used for this run:

{code}
    root.addBranch("uralic");
    root.addBranch("uralic", "fino-ugric");
    root.addBranch("uralic", "ugric");
    //root.addLanguage("hungarian", "ugric");
    root.addLanguage("fino-ugric", "fin", "finnish", "fi", "Suomi");
    //root.addLanguage("sami", "fino-ugric");
    //root.addLanguage("estonian", "fino-ugric");
    //root.addLanguage("livonian", "fino-ugric");

    root.addBranch("proto-indo european");

    root.addBranch("proto-indo european", "italic");
    root.addBranch("italic", "latino-faliscan");
    root.addBranch("latino-faliscan", "latin");
    root.addLanguage("latin", "ita", "italian", "it", "Italia");
    root.addLanguage("latin", "fre", "french", "fr", "France");
    root.addLanguage("latin", "pot", "portugese", "pt", "Portugal");
    root.addLanguage("latin", "spa", "spanish", "es", "Espa%C3%B1a");

    root.addBranch("proto-indo european", "germanic");
    root.addBranch("germanic", "northern germanic");
    root.addLanguage("northern germanic", "dan", "danish", "da", "Danmark");
    root.addLanguage("northern germanic", "nor", "norwegian", "no", "Norge");
    root.addLanguage("northern germanic", "swe", "swedish", "sv", "Sverige");

    root.addBranch("germanic", "west germanic");
    root.addLanguage("west germanic", "eng", "english", "en", "UK");
    root.addLanguage("west germanic", "ger", "german", "de", "Deutschland");

    root.addBranch("west germanic", "middle dutch");
    root.addLanguage("middle dutch", "nld", "dutch", "nl", "Nederland");
    root.addLanguage("middle dutch", "afr", "afrikaans", "af", "Nederland");
{code}



[jira] Commented: (LUCENE-826) Language detector

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12478691 ] 

Karl Wettin commented on LUCENE-826:
------------------------------------

Some performance numbers: using only paragraphs 160+ characters long as training data, I get these results from a 10-fold cross-validation:


Time taken to build model: 2.12 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances        1199               98.6831 %
Incorrectly Classified Instances        16                1.3169 %
Kappa statistic                          0.9814
Mean absolute error                      0.2408
Root mean squared error                  0.3173
Relative absolute error                 84.8251 %
Root relative squared error             84.235  %
Total Number of Instances             1215     

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
  1         0.009      0.989     1         0.995      0.995    eng
  0.979     0.001      0.995     0.979     0.987      0.994    swe
  0.973     0.003      0.984     0.973     0.979      0.996    nor
  0.946     0.005      0.935     0.946     0.941      0.975    dan
  0.989     0          1         0.989     0.995      0.997    fin

=== Confusion Matrix ===

   a   b   c   d   e   <-- classified as
 562   0   0   0   0 |   a = eng
   3 183   0   1   0 |   b = swe
   1   0 183   4   0 |   c = nor
   1   1   3  87   0 |   d = dan
   1   0   0   1 184 |   e = fin

> Language detector
> -----------------
>
>                 Key: LUCENE-826
>                 URL: https://issues.apache.org/jira/browse/LUCENE-826
>             Project: Lucene - Java
>          Issue Type: New Feature
>            Reporter: Karl Wettin
>         Assigned To: Karl Wettin
>         Attachments: ld.tar.gz
>
>
> A formula 1A token/ngram-based language detector. Requires a paragraph of text to avoid false positive classifications. 
> Depends on contrib/analyzers/ngrams for tokenization, Weka for classification (logistic support vector models) feature selection and normalization of token freuencies.  Optionally Wikipedia and NekoHTML for training data harvesting.
> Initialized like this:
> {code}
>     LanguageRoot root = new LanguageRoot(new File("documentClassifier/language root"));
>     root.addBranch("uralic");
>     root.addBranch("fino-ugric", "uralic");
>     root.addBranch("ugric", "uralic");
>     root.addLanguage("fino-ugric", "fin", "finnish", "fi", "Suomi");
>     root.addBranch("proto-indo european");
>     root.addBranch("germanic", "proto-indo european");
>     root.addBranch("northern germanic", "germanic");
>     root.addLanguage("northern germanic", "dan", "danish", "da", "Danmark");
>     root.addLanguage("northern germanic", "nor", "norwegian", "no", "Norge");
>     root.addLanguage("northern germanic", "swe", "swedish", "sv", "Sverige");
>     root.addBranch("west germanic", "germanic");
>     root.addLanguage("west germanic", "eng", "english", "en", "UK");
>     root.mkdirs();
>     LanguageClassifier classifier = new LanguageClassifier(root);
>     if (!new File(root.getDataPath(), "trainingData.arff").exists()) {
>       classifier.compileTrainingData(); // from wikipedia
>     }
>     classifier.buildClassifier();
> {code}
> Training set build from Wikipedia is the pages describing the home country of each registred language in the language to train. Above example pass this test:
> (testEquals is the same as assertEquals, just not required. Only one of them fail, see comment.)
> {code}
>     assertEquals("swe", classifier.classify(sweden_in_swedish).getISO());
>     testEquals("swe", classifier.classify(norway_in_swedish).getISO());
>     testEquals("swe", classifier.classify(denmark_in_swedish).getISO());
>     testEquals("swe", classifier.classify(finland_in_swedish).getISO());
>     testEquals("swe", classifier.classify(uk_in_swedish).getISO());
>     testEquals("nor", classifier.classify(sweden_in_norwegian).getISO());
>     assertEquals("nor", classifier.classify(norway_in_norwegian).getISO());
>     testEquals("nor", classifier.classify(denmark_in_norwegian).getISO());
>     testEquals("nor", classifier.classify(finland_in_norwegian).getISO());
>     testEquals("nor", classifier.classify(uk_in_norwegian).getISO());
>     testEquals("fin", classifier.classify(sweden_in_finnish).getISO());
>     testEquals("fin", classifier.classify(norway_in_finnish).getISO());
>     testEquals("fin", classifier.classify(denmark_in_finnish).getISO());
>     assertEquals("fin", classifier.classify(finland_in_finnish).getISO());
>     testEquals("fin", classifier.classify(uk_in_finnish).getISO());
>     testEquals("dan", classifier.classify(sweden_in_danish).getISO());
>     // it is OK that this fails: dan and nor are very similar, and the document about Norway in Danish is very small.
>     testEquals("dan", classifier.classify(norway_in_danish).getISO()); 
>     assertEquals("dan", classifier.classify(denmark_in_danish).getISO());
>     testEquals("dan", classifier.classify(finland_in_danish).getISO());
>     testEquals("dan", classifier.classify(uk_in_danish).getISO());
>     testEquals("eng", classifier.classify(sweden_in_english).getISO());
>     testEquals("eng", classifier.classify(norway_in_english).getISO());
>     testEquals("eng", classifier.classify(denmark_in_english).getISO());
>     testEquals("eng", classifier.classify(finland_in_english).getISO());
>     assertEquals("eng", classifier.classify(uk_in_english).getISO());
> {code}
> I don't know how well it works across many languages, but it fits my needs for now. I'll try to do more work on taking the language trees into account when classifying.
> It takes a bit of time and RAM to build the training data, so the patch contains a pre-compiled ARFF file.
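The quoted test relies on a testEquals helper that isn't shown in the snippet. A minimal sketch of such a lenient assertion (the helper name is taken from the test above; the implementation is an assumption, not code from the attached patch) might look like:

```java
// Sketch of a lenient assertion: logs a mismatch instead of failing the
// test, unlike JUnit's assertEquals. Hypothetical helper, not taken from
// the attached patch.
public class LenientAssert {
    /** Returns true when expected equals actual; otherwise logs and returns false. */
    public static boolean testEquals(String expected, String actual) {
        if (expected.equals(actual)) {
            return true;
        }
        System.err.println("soft assertion failed: expected <" + expected
                + "> but was <" + actual + ">");
        return false;
    }
}
```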

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-826) Language detector

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541202 ] 

Karl Wettin commented on LUCENE-826:
------------------------------------

Peter Taylor - 08/Nov/07 10:15 AM
> Just out of curiosity which version of Weka are you using...

You can also check out the all-Lucene, no-dependencies Bayesian classifier in LUCENE-1039; there is a spell checker in its test case.

I have 600 instances per class and 25 classes. I get great results with ^3-4, 3- and 3-5$ ngrams of context-sensitive 2-5 word sentences. Using a LUCENE-550 index is 4-5 times faster (100-300 ms) than a RAMDirectory (500-1600 ms) for classification.
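The ^ and $ markers above presumably anchor ngrams to the start and end of the text. A rough stand-in for that kind of tokenization (illustration only; the class and method names are assumptions, and the patch itself uses contrib/analyzers/ngrams) could be:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of anchored character-ngram tokenization: "^" marks a gram that
// touches the start of the text and "$" one that touches the end.
// Illustration of the idea only, not the contrib/analyzers/ngrams code.
public class AnchoredNgrams {
    /** Extracts all character ngrams of length min..max, anchor-marked. */
    public static List<String> ngrams(String text, int min, int max) {
        List<String> out = new ArrayList<String>();
        for (int n = min; n <= max; n++) {
            for (int i = 0; i + n <= text.length(); i++) {
                String gram = text.substring(i, i + n);
                if (i == 0) {
                    gram = "^" + gram;            // prefix gram
                }
                if (i + n == text.length()) {
                    gram = gram + "$";            // suffix gram
                }
                out.add(gram);
            }
        }
        return out;
    }
}
```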




[jira] Commented: (LUCENE-826) Language detector

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805027#action_12805027 ] 

Karl Wettin commented on LUCENE-826:
------------------------------------

Hi Ken,

it's hard for me to compare. I'll rant a bit about my experience with language detection, though.

I still haven't found a single strategy that works well on any text: a user query, a sentence, a paragraph or a complete document. 1-5 grams using SVM or NB work pretty well for all of them, but you really need to train with the same sort of data you want to classify. Even when training with a mix of text lengths, it tends to perform a lot worse than having one classifier per data type. And you still probably want to twiddle the classifier knobs to make it work well with the data you are classifying and training with.

In some cases I've used 1-10 grams and other times I've used 2-4 grams. Sometimes I've used an SVM and other times a simple decision tree.

To sum it up, to achieve good quality I've always had to build a classifier for the specific use case. Weka has a great test suite for figuring out what to use: set it up, press play and return one week later to find out.
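The per-use-case evaluation described above boils down to cross-validating candidate classifiers on your own data. A plain-Java sketch of the k-fold split that Weka's test suite automates (the class and method names here are assumptions for illustration, not Weka API) might look like:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Sketch of the evaluation loop implied above: split labeled data into
// k folds so each candidate classifier can be scored on held-out folds.
// Plain-Java stand-in for what Weka's experimenter automates.
public class KFold {
    /** Returns k roughly equal folds of the given items, shuffled with the seed. */
    public static <T> List<List<T>> folds(List<T> items, int k, long seed) {
        List<T> shuffled = new ArrayList<T>(items);
        Collections.shuffle(shuffled, new Random(seed));
        List<List<T>> out = new ArrayList<List<T>>();
        for (int f = 0; f < k; f++) {
            out.add(new ArrayList<T>());
        }
        for (int i = 0; i < shuffled.size(); i++) {
            out.get(i % k).add(shuffled.get(i));   // round-robin assignment
        }
        return out;
    }
}
```

Training on k-1 folds and scoring on the remaining one, for each fold in turn, gives a per-classifier accuracy estimate you can compare before committing to one.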



[jira] Updated: (LUCENE-826) Language detector

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wettin updated LUCENE-826:
-------------------------------

    Attachment: ld.tar.gz

Tarball with code and a precompiled training data set that detects Swedish, Danish, Norwegian, English and Finnish.



[jira] Closed: (LUCENE-826) Language detector

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wettin closed LUCENE-826.
------------------------------

    Resolution: Won't Fix

Too many dependencies. There will be something better in Mahout in the future.



[jira] Commented: (LUCENE-826) Language detector

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804285#action_12804285 ] 

Ken Krugler commented on LUCENE-826:
------------------------------------

I think Nutch (and eventually Mahout) plan to use Tika for charset/mime-type/language detection going forward.

I've filed an issue [TIKA-369] about improving the current Tika code, which is a simplification of the Nutch code. When I used it on lots of docs there were performance issues, and for small chunks of text the quality isn't very good.

It would be interesting if Karl could comment on the approach Ted Dunning took (many years ago - 1994 :)) versus what he did.



[jira] Commented: (LUCENE-826) Language detector

Posted by "Peter Taylor (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541094 ] 

Peter Taylor commented on LUCENE-826:
-------------------------------------

Just out of curiosity which version of Weka are you using...

I ask because, in newer versions of Weka, the LanguageClassifier.java source file has the following problem:

stringToWordVector.setDelimiters(";"); <-- the setDelimiters method has disappeared
stringToWordVector.setNormalizeDocLength(new SelectedTag(StringToWordVector.FILTER_NORMALIZE_ALL, StringToWordVector.TAGS_FILTER)); <-- this works

and in older versions of Weka:

stringToWordVector.setDelimiters(";"); <-- this now works :-)
stringToWordVector.setNormalizeDocLength(new SelectedTag(StringToWordVector.FILTER_NORMALIZE_ALL, StringToWordVector.TAGS_FILTER)); <-- older versions of the API simply expect a boolean value rather than a SelectedTag object as a parameter

Please advise :-)

Cheers,

Peter
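One way to cope with a method that exists in some library versions but not others is to probe for it reflectively before calling. This generic sketch shows the idea (it is not from the patch and is not Weka-specific; whether it is the right fix for setDelimiters here is an open question):

```java
import java.lang.reflect.Method;

// Sketch of a version-tolerant call: invoke a setter only if the class
// on the classpath still declares it. Generic illustration for coping
// with API churn such as the setDelimiters() removal discussed above.
public class SoftInvoke {
    /** True if clazz declares a public method with this name and parameter types. */
    public static boolean hasMethod(Class<?> clazz, String name, Class<?>... params) {
        try {
            clazz.getMethod(name, params);
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    /** Calls the String-argument method if present; returns true when invoked. */
    public static boolean callIfPresent(Object target, String name, String arg) throws Exception {
        if (!hasMethod(target.getClass(), name, String.class)) {
            return false;                       // method gone in this version
        }
        Method m = target.getClass().getMethod(name, String.class);
        m.invoke(target, arg);
        return true;
    }
}
```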

> Language detector
> -----------------
>
>                 Key: LUCENE-826
>                 URL: https://issues.apache.org/jira/browse/LUCENE-826
>             Project: Lucene - Java
>          Issue Type: New Feature
>            Reporter: Karl Wettin
>            Assignee: Karl Wettin
>         Attachments: ld.tar.gz, ld.tar.gz
>
>
> A formula 1A token/ngram-based language detector. Requires a paragraph of text to avoid false positive classifications. 
> Depends on contrib/analyzers/ngrams for tokenization, Weka for classification (logistic support vector models) feature selection and normalization of token freuencies.  Optionally Wikipedia and NekoHTML for training data harvesting.
> Initialized like this:
> {code}
>     LanguageRoot root = new LanguageRoot(new File("documentClassifier/language root"));
>     root.addBranch("uralic");
>     root.addBranch("fino-ugric", "uralic");
>     root.addBranch("ugric", "uralic");
>     root.addLanguage("fino-ugric", "fin", "finnish", "fi", "Suomi");
>     root.addBranch("proto-indo european");
>     root.addBranch("germanic", "proto-indo european");
>     root.addBranch("northern germanic", "germanic");
>     root.addLanguage("northern germanic", "dan", "danish", "da", "Danmark");
>     root.addLanguage("northern germanic", "nor", "norwegian", "no", "Norge");
>     root.addLanguage("northern germanic", "swe", "swedish", "sv", "Sverige");
>     root.addBranch("west germanic", "germanic");
>     root.addLanguage("west germanic", "eng", "english", "en", "UK");
>     root.mkdirs();
>     LanguageClassifier classifier = new LanguageClassifier(root);
>     if (!new File(root.getDataPath(), "trainingData.arff").exists()) {
>       classifier.compileTrainingData(); // from wikipedia
>     }
>     classifier.buildClassifier();
> {code}
> Training set build from Wikipedia is the pages describing the home country of each registred language in the language to train. Above example pass this test:
> (testEquals is the same as assertEquals, just not required. Only one of them fail, see comment.)
> {code}
>     assertEquals("swe", classifier.classify(sweden_in_swedish).getISO());
>     testEquals("swe", classifier.classify(norway_in_swedish).getISO());
>     testEquals("swe", classifier.classify(denmark_in_swedish).getISO());
>     testEquals("swe", classifier.classify(finland_in_swedish).getISO());
>     testEquals("swe", classifier.classify(uk_in_swedish).getISO());
>     testEquals("nor", classifier.classify(sweden_in_norwegian).getISO());
>     assertEquals("nor", classifier.classify(norway_in_norwegian).getISO());
>     testEquals("nor", classifier.classify(denmark_in_norwegian).getISO());
>     testEquals("nor", classifier.classify(finland_in_norwegian).getISO());
>     testEquals("nor", classifier.classify(uk_in_norwegian).getISO());
>     testEquals("fin", classifier.classify(sweden_in_finnish).getISO());
>     testEquals("fin", classifier.classify(norway_in_finnish).getISO());
>     testEquals("fin", classifier.classify(denmark_in_finnish).getISO());
>     assertEquals("fin", classifier.classify(finland_in_finnish).getISO());
>     testEquals("fin", classifier.classify(uk_in_finnish).getISO());
>     testEquals("dan", classifier.classify(sweden_in_danish).getISO());
>     // it is ok that this fails. dan and nor are very similar, and the document about norway in danish is very small.
>     testEquals("dan", classifier.classify(norway_in_danish).getISO()); 
>     assertEquals("dan", classifier.classify(denmark_in_danish).getISO());
>     testEquals("dan", classifier.classify(finland_in_danish).getISO());
>     testEquals("dan", classifier.classify(uk_in_danish).getISO());
>     testEquals("eng", classifier.classify(sweden_in_english).getISO());
>     testEquals("eng", classifier.classify(norway_in_english).getISO());
>     testEquals("eng", classifier.classify(denmark_in_english).getISO());
>     testEquals("eng", classifier.classify(finland_in_english).getISO());
>     assertEquals("eng", classifier.classify(uk_in_english).getISO());
> {code}
> I don't know how well it works with lots of languages, but this fits my needs for now. I'll try to do more work on considering the language trees when classifying.
> It takes a bit of time and RAM to build the training data, so the patch contains a pre-compiled ARFF file.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-826) Language detector

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12478694 ] 

Karl Wettin commented on LUCENE-826:
------------------------------------

Foot note:

The difference between this and the Nutch gram-based language identifier is considerable. For a start, this computes the feature vectors from full words, edge-grams, and bi-grams where the two characters are the same. The frequency is normalized against the text size. The same goes for analysis at classification time. The n most important tokens (feature selection using ranked information gain) are selected for consideration by the classifier, currently 200 (out of 1,000 per language) per registered language. So with the default test (5 languages) there are 1,000 tokens. It is really speedy on my dual core.
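
The feature extraction described above could be sketched roughly like this (hypothetical class and method names — the actual patch delegates tokenization to contrib/analyzers/ngrams): full words, edge-grams (leading and trailing character sequences of each word), and bi-grams where both characters are equal, with counts normalized against the text size.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the feature extraction described above:
// full words, edge-grams, and same-character bi-grams, with
// frequencies normalized by text length. Not the patch's actual code.
public class FeatureSketch {

    public static Map<String, Double> extract(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.toLowerCase().split("\\W+")) {
            if (word.isEmpty()) continue;
            add(counts, word); // full word
            // edge-grams: word prefixes and suffixes up to 3 characters
            for (int n = 1; n < Math.min(word.length(), 4); n++) {
                add(counts, "^" + word.substring(0, n));
                add(counts, word.substring(word.length() - n) + "$");
            }
            // bi-grams where the two characters are the same, e.g. "tt"
            for (int i = 0; i + 1 < word.length(); i++) {
                if (word.charAt(i) == word.charAt(i + 1)) {
                    add(counts, word.substring(i, i + 2));
                }
            }
        }
        // normalize token frequencies against the text size
        Map<String, Double> normalized = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            normalized.put(e.getKey(), e.getValue() / (double) text.length());
        }
        return normalized;
    }

    private static void add(Map<String, Integer> counts, String token) {
        counts.merge(token, 1, Integer::sum);
    }
}
```

The same extraction would then run at both training and classification time, so document length drops out of the comparison.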

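The ranked information-gain selection mentioned in the comment above can be illustrated with a minimal sketch (the patch itself delegates this to Weka; class and method names here are hypothetical): score each token by how much knowing its presence reduces the entropy over the language labels, then keep the top n.

```java
import java.util.*;

// Hypothetical sketch of ranked information-gain feature selection.
// Scores each token by the entropy reduction its presence/absence
// gives over the class labels, then keeps the top-n tokens.
public class InfoGainSketch {

    static double entropy(Collection<Integer> classCounts, int total) {
        double h = 0;
        for (int c : classCounts) {
            if (c == 0 || total == 0) continue;
            double p = c / (double) total;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    /** docs: the token set of each document; labels: its language. */
    public static List<String> topTokens(List<Set<String>> docs,
                                         List<String> labels, int n) {
        int total = docs.size();
        Map<String, Integer> labelCounts = new HashMap<>();
        for (String l : labels) labelCounts.merge(l, 1, Integer::sum);
        double baseEntropy = entropy(labelCounts.values(), total);

        Set<String> vocab = new HashSet<>();
        docs.forEach(vocab::addAll);

        // information gain = base entropy - conditional entropy
        Map<String, Double> gain = new HashMap<>();
        for (String token : vocab) {
            Map<String, Integer> with = new HashMap<>();
            Map<String, Integer> without = new HashMap<>();
            int withTotal = 0;
            for (int i = 0; i < total; i++) {
                if (docs.get(i).contains(token)) {
                    with.merge(labels.get(i), 1, Integer::sum);
                    withTotal++;
                } else {
                    without.merge(labels.get(i), 1, Integer::sum);
                }
            }
            double cond =
                (withTotal / (double) total) * entropy(with.values(), withTotal)
                + ((total - withTotal) / (double) total)
                    * entropy(without.values(), total - withTotal);
            gain.put(token, baseEntropy - cond);
        }
        List<String> ranked = new ArrayList<>(vocab);
        ranked.sort((a, b) -> Double.compare(gain.get(b), gain.get(a)));
        return ranked.subList(0, Math.min(n, ranked.size()));
    }
}
```

With 200 tokens kept per registered language, the five-language default would end up with the 1,000-token vocabulary mentioned above.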