Posted to solr-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2011/10/16 06:10:28 UTC

[Solr Wiki] Update of "LanguageDetection" by RobertMuir

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "LanguageDetection" page has been changed by RobertMuir:
http://wiki.apache.org/solr/LanguageDetection?action=diff&rev1=10&rev2=11

Comment:
update documentation for additional implementation

  
  = Introduction =
  
- This feature adds the ability to detect the language of a document before indexing and then make appropriate decisions about analysis, etc. It is implemented as an UpdateRequestProcessor, and currently relies on Tika's language detection capabilities, which covers many, but not all, languages.  See http://tika.apache.org/0.10/detection.html for more information on the languages supported.
+ This feature adds the ability to detect the language of a document before indexing and then make appropriate decisions about analysis, etc. It is implemented as an UpdateRequestProcessor, and there are two implementations: 
+  * Tika implementation, based upon Tika's language detection capabilities, which cover many, but not all, languages. See http://tika.apache.org/0.10/detection.html for more information on the languages supported.
+  * LangDetect implementation based upon http://code.google.com/p/language-detection/ which supports more languages (53) and has some advanced CJK support.
  
  The component also supports automatic renaming of fields according to detected language and other advanced parameters, all explained in the next section.
  
  = Configuration =
  The UpdateRequestProcessor is configured in solrconfig.xml, and supports many parameters. All parameters listed may also be overridden on the update request itself. A minimal configuration specifies the input fields for language identification as well as the output field for the detected language code:
  {{{
- <processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
+ <processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
+    <lst name="defaults">
+      <str name="langid.fl">title,subject,text,keywords</str>
+      <str name="langid.langField">language_s</str>
+    </lst>
+ </processor>
+ }}}
+ 
+ Alternatively, using the implementation based on http://code.google.com/p/language-detection/
+ {{{
+ <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
     <lst name="defaults">
       <str name="langid.fl">title,subject,text,keywords</str>
       <str name="langid.langField">language_s</str>
@@ -152, +164 @@

  
  = Examples =
  
- == Detect and map Scandinavian languages and fallback to generic for other languages ==
+ == Detect and map Scandinavian languages with Tika and fallback to generic for other languages ==
  
  {{{
-  <processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
+  <processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
     <str name="langid">true</str>
     <str name="langid.fl">title,body</str>
     <str name="langid.langField">language</str>
@@ -168, +180 @@

  
  = Caveats =
  
- Since Tika uses an n-gram based approach to detection, it is susceptible to poor detection on especially short inputs. The threshold you specify in langid.threshold is normalized to match a certain similarity score in Tika, but this is not reliable for thresholds lower than 0.8. In the future, the detection quality may be improved due to changes in Tika or use of other language detection libraries.
+ Since both implementations use an n-gram based approach to detection, they are susceptible to poor detection, especially on short inputs. The threshold you specify in langid.threshold is normalized to match a certain similarity score in Tika, but this is not reliable for thresholds lower than 0.8. In the future, the detection quality may be improved due to changes in Tika or use of other language detection libraries.
  
  = Resources =
  
   * [[http://tika.apache.org/|Apache Tika]]
+  * [[http://code.google.com/p/language-detection/|Language detection Library for Java]]
   * [[https://issues.apache.org/jira/browse/SOLR-1979|SOLR-1979]]