Posted to solr-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2010/12/05 15:23:46 UTC

[Solr Wiki] Update of "LanguageDetection" by GrantIngersoll

The "LanguageDetection" page has been changed by GrantIngersoll.
http://wiki.apache.org/solr/LanguageDetection

--------------------------------------------------

New page:
= Solr's Language Detection =

<!> [[Solr4.0]]

See https://issues.apache.org/jira/browse/SOLR-1979.

= Introduction =

This feature detects the language of a document before indexing so that language-appropriate decisions can be made about analysis and related processing.  It currently relies on Tika's language detection capabilities, which cover many, but not all, languages.  See http://tika.apache.org/0.8/detection.html for more information on the supported languages.
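
As a rough illustration of what "making decisions about analysis" might look like, the sketch below picks a language-specific field name from a detected language code. It is not taken from SOLR-1979: the class name, the `text_de`-style field naming, and the fall-back to the base field are all assumptions made for the example, and any such fields would have to exist in your own schema.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Solr: routes text to a per-language field.
class LanguageFieldRouter {
    private final Set<String> supported;

    LanguageFieldRouter(String... supportedLanguages) {
        this.supported = new HashSet<>(Arrays.asList(supportedLanguages));
    }

    // Returns e.g. "text_de" when the detection is trusted and the language
    // has its own field; otherwise falls back to the plain base field.
    String fieldFor(String baseField, String languageCode, boolean reasonablyCertain) {
        if (reasonablyCertain && languageCode != null && supported.contains(languageCode)) {
            return baseField + "_" + languageCode;
        }
        return baseField;
    }

    public static void main(String[] args) {
        LanguageFieldRouter router = new LanguageFieldRouter("en", "de");
        System.out.println(router.fieldFor("text", "de", true));
        System.out.println(router.fieldFor("text", "de", false));
    }
}
```

Falling back to the base field when the detector is unsure keeps uncertain documents searchable without committing them to the wrong language-specific analysis chain.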

= Configuration =

= Input Parameters =

= Examples =

= Caveats =

Because Tika uses an n-gram based approach to detection, it tends to perform poorly on very short inputs.  We rely on Tika's LanguageIdentifier.isReasonablyCertain() method to indicate how confident Tika is in the detection.  There is currently no way to pass in your own threshold; see https://issues.apache.org/jira/browse/TIKA-568 for more info.
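
To see why short inputs are a problem for n-gram approaches in general, here is a toy character-trigram guesser. It is deliberately simplistic and is not Tika's implementation: the class name, the tiny training sentences, the overlap score, and the `minTrigrams` / `minScore` thresholds are all assumptions chosen for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Toy trigram-based language guesser (illustrative only, not Tika's algorithm).
class TrigramGuesser {
    private final Map<String, Map<String, Integer>> profiles = new HashMap<>();

    // Count overlapping character trigrams in the lower-cased text.
    static Map<String, Integer> trigrams(String text) {
        Map<String, Integer> counts = new HashMap<>();
        String s = text.toLowerCase(Locale.ROOT);
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    // Build a profile for a language from a sample text.
    void train(String lang, String sample) {
        profiles.put(lang, trigrams(sample));
    }

    // Fraction of the input's trigram occurrences that also appear in the profile.
    private static double overlap(Map<String, Integer> input, Map<String, Integer> profile) {
        int total = 0;
        int hits = 0;
        for (Map.Entry<String, Integer> e : input.entrySet()) {
            total += e.getValue();
            if (profile.containsKey(e.getKey())) {
                hits += e.getValue();
            }
        }
        return total == 0 ? 0.0 : (double) hits / total;
    }

    // Returns the best-scoring language, or empty when there is too little
    // evidence (fewer than minTrigrams trigrams) or the best score is too low.
    Optional<String> guess(String text, int minTrigrams, double minScore) {
        Map<String, Integer> input = trigrams(text);
        int total = 0;
        for (int c : input.values()) {
            total += c;
        }
        if (total < minTrigrams) {
            return Optional.empty();
        }
        String best = null;
        double bestScore = 0.0;
        for (Map.Entry<String, Map<String, Integer>> p : profiles.entrySet()) {
            double score = overlap(input, p.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = p.getKey();
            }
        }
        return bestScore >= minScore ? Optional.ofNullable(best) : Optional.empty();
    }

    public static void main(String[] args) {
        TrigramGuesser g = new TrigramGuesser();
        g.train("en", "the quick brown fox jumps over the lazy dog and the cat");
        g.train("de", "der schnelle braune fuchs jagt den faulen hund und die katze");
        System.out.println(g.guess("the dog and the fox", 5, 0.5));
        System.out.println(g.guess("der hund und die katze", 5, 0.5));
        System.out.println(g.guess("die", 5, 0.5));
    }
}
```

The three-letter input "die" yields a single trigram, which matches the toy German profile perfectly even though "die" is also an English word. With so little evidence a high score means very little, which is why this sketch refuses to answer below a minimum trigram count; a certainty check such as isReasonablyCertain() plays a comparable gatekeeping role.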

= Resources =

 * http://tika.apache.org