Posted to dev@lucene.apache.org by "Karl Wettin (JIRA)" <ji...@apache.org> on 2008/04/12 20:21:04 UTC
[jira] Closed: (LUCENE-826) Language detector
[ https://issues.apache.org/jira/browse/LUCENE-826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Karl Wettin closed LUCENE-826.
------------------------------
Resolution: Won't Fix
Too many dependencies and such. There will be something better in Mahout in the future.
> Language detector
> -----------------
>
> Key: LUCENE-826
> URL: https://issues.apache.org/jira/browse/LUCENE-826
> Project: Lucene - Java
> Issue Type: New Feature
> Reporter: Karl Wettin
> Assignee: Karl Wettin
> Attachments: ld.tar.gz, ld.tar.gz
>
>
> A formula 1A token/ngram-based language detector. Requires a paragraph of text to avoid false positive classifications.
> Depends on contrib/analyzers/ngrams for tokenization and on Weka for classification (logistic support vector models), feature selection, and normalization of token frequencies. Optionally uses Wikipedia and NekoHTML for harvesting training data.
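To make the idea of normalized n-gram frequencies concrete, here is a minimal sketch in plain Java. It is not the code from the patch: it uses neither contrib/analyzers/ngrams nor Weka, it assumes character n-grams (the description says only "token/ngram-based"), and the class name NgramProfile and the method profile are hypothetical names chosen for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration; not part of the LUCENE-826 patch.
class NgramProfile {

    /** Counts overlapping character n-grams and scales the counts to relative frequencies. */
    static Map<String, Double> profile(String text, int n) {
        String normalized = text.toLowerCase(Locale.ROOT);
        Map<String, Double> freq = new HashMap<>();
        int total = normalized.length() - n + 1; // number of n-gram positions
        if (total <= 0) {
            return freq; // text shorter than n: no n-grams
        }
        for (int i = 0; i < total; i++) {
            freq.merge(normalized.substring(i, i + n), 1.0, Double::sum);
        }
        // Normalize so that the frequencies of one text sum to 1.
        for (Map.Entry<String, Double> e : freq.entrySet()) {
            e.setValue(e.getValue() / total);
        }
        return freq;
    }

    public static void main(String[] args) {
        System.out.println(profile("banana", 2));
    }
}
```

Dividing by the number of n-gram positions makes profiles of texts of different length comparable, which is one plausible reason a normalization step is listed among the dependencies above.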
> Initialized like this:
> {code}
> LanguageRoot root = new LanguageRoot(new File("documentClassifier/language root"));
> root.addBranch("uralic");
> root.addBranch("fino-ugric", "uralic");
> root.addBranch("ugric", "uralic");
> root.addLanguage("fino-ugric", "fin", "finnish", "fi", "Suomi");
> root.addBranch("proto-indo european");
> root.addBranch("germanic", "proto-indo european");
> root.addBranch("northern germanic", "germanic");
> root.addLanguage("northern germanic", "dan", "danish", "da", "Danmark");
> root.addLanguage("northern germanic", "nor", "norwegian", "no", "Norge");
> root.addLanguage("northern germanic", "swe", "swedish", "sv", "Sverige");
> root.addBranch("west germanic", "germanic");
> root.addLanguage("west germanic", "eng", "english", "en", "UK");
> root.mkdirs();
> LanguageClassifier classifier = new LanguageClassifier(root);
> if (!new File(root.getDataPath(), "trainingData.arff").exists()) {
>   classifier.compileTrainingData(); // from wikipedia
> }
> classifier.buildClassifier();
> {code}
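The branch and language registrations above describe a tree. As a rough sketch of how such a hierarchy could be stored and queried, the following plain-Java class keeps a child-to-parent map. It is an assumption made for illustration, not the LanguageRoot class from the attachment: the class name LanguageTreeSketch, the shortened addLanguage signature, and the ancestors method are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not the LanguageRoot class from the patch.
class LanguageTreeSketch {

    // Node name (branch name or ISO code) -> parent branch; top-level branches map to null.
    private final Map<String, String> parentOf = new HashMap<>();

    void addBranch(String name) {
        parentOf.put(name, null);
    }

    void addBranch(String name, String parent) {
        requireKnown(parent);
        parentOf.put(name, parent);
    }

    void addLanguage(String branch, String iso) {
        requireKnown(branch);
        parentOf.put(iso, branch);
    }

    /** Returns the branches from the top-level family down to the node's direct parent. */
    List<String> ancestors(String name) {
        requireKnown(name);
        List<String> path = new ArrayList<>();
        for (String p = parentOf.get(name); p != null; p = parentOf.get(p)) {
            path.add(p);
        }
        Collections.reverse(path);
        return path;
    }

    private void requireKnown(String name) {
        if (!parentOf.containsKey(name)) {
            throw new IllegalArgumentException("unknown node: " + name);
        }
    }

    public static void main(String[] args) {
        LanguageTreeSketch tree = new LanguageTreeSketch();
        tree.addBranch("proto-indo european");
        tree.addBranch("germanic", "proto-indo european");
        tree.addBranch("northern germanic", "germanic");
        tree.addLanguage("northern germanic", "swe");
        System.out.println(tree.ancestors("swe"));
    }
}
```

With a structure like this, two languages that share a long ancestor path (for example two northern germanic languages) can be recognized as close relatives.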
> The training set built from Wikipedia consists of the pages describing the home country of each registered language, written in the language being trained. The example above passes this test:
> (testEquals is the same as assertEquals, except that it is not required to pass. Only one of them fails; see the comment.)
> {code}
> assertEquals("swe", classifier.classify(sweden_in_swedish).getISO());
> testEquals("swe", classifier.classify(norway_in_swedish).getISO());
> testEquals("swe", classifier.classify(denmark_in_swedish).getISO());
> testEquals("swe", classifier.classify(finland_in_swedish).getISO());
> testEquals("swe", classifier.classify(uk_in_swedish).getISO());
> testEquals("nor", classifier.classify(sweden_in_norwegian).getISO());
> assertEquals("nor", classifier.classify(norway_in_norwegian).getISO());
> testEquals("nor", classifier.classify(denmark_in_norwegian).getISO());
> testEquals("nor", classifier.classify(finland_in_norwegian).getISO());
> testEquals("nor", classifier.classify(uk_in_norwegian).getISO());
> testEquals("fin", classifier.classify(sweden_in_finnish).getISO());
> testEquals("fin", classifier.classify(norway_in_finnish).getISO());
> testEquals("fin", classifier.classify(denmark_in_finnish).getISO());
> assertEquals("fin", classifier.classify(finland_in_finnish).getISO());
> testEquals("fin", classifier.classify(uk_in_finnish).getISO());
> testEquals("dan", classifier.classify(sweden_in_danish).getISO());
> // it is ok that this fails. dan and nor are very similar, and the document about norway in danish is very small.
> testEquals("dan", classifier.classify(norway_in_danish).getISO());
> assertEquals("dan", classifier.classify(denmark_in_danish).getISO());
> testEquals("dan", classifier.classify(finland_in_danish).getISO());
> testEquals("dan", classifier.classify(uk_in_danish).getISO());
> testEquals("eng", classifier.classify(sweden_in_english).getISO());
> testEquals("eng", classifier.classify(norway_in_english).getISO());
> testEquals("eng", classifier.classify(denmark_in_english).getISO());
> testEquals("eng", classifier.classify(finland_in_english).getISO());
> assertEquals("eng", classifier.classify(uk_in_english).getISO());
> {code}
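The message does not show how testEquals is implemented. One possible reading of "the same as assertEquals, just not required" is a soft check that records a mismatch and lets the test continue. The sketch below illustrates that reading in plain Java without JUnit; the class name SoftChecks and the failures method are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical illustration of a required check next to an optional one.
class SoftChecks {

    private final List<String> failures = new ArrayList<>();

    /** Required check: throws immediately on a mismatch. */
    static void assertEquals(Object expected, Object actual) {
        if (!Objects.equals(expected, actual)) {
            throw new AssertionError("expected " + expected + " but was " + actual);
        }
    }

    /** Optional check: records a mismatch and lets the caller continue. */
    boolean testEquals(Object expected, Object actual) {
        boolean ok = Objects.equals(expected, actual);
        if (!ok) {
            failures.add("expected " + expected + " but was " + actual);
        }
        return ok;
    }

    /** Mismatches recorded so far by testEquals. */
    List<String> failures() {
        return failures;
    }

    public static void main(String[] args) {
        SoftChecks checks = new SoftChecks();
        checks.testEquals("dan", "nor"); // recorded, does not abort
        assertEquals("swe", "swe");      // would throw on a mismatch
        System.out.println(checks.failures());
    }
}
```

Collecting the optional failures in a list makes it possible to report them at the end of a run, so a tolerated miss such as the small Danish document about Norway stays visible without failing the build.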
> I don't know how well it works on lots of languages, but this fits my needs for now. I'll try to do more work on considering the language trees when classifying.
> It takes a bit of time and RAM to build the training data, so the patch contains a pre-compiled arff-file.
--
This message is automatically generated by JIRA.