Posted to dev@lucene.apache.org by lu...@jakarta.apache.org on 2004/07/08 15:30:01 UTC

[Jakarta Lucene Wiki] Updated: IndexingOtherLanguages

   Date: 2004-07-08T06:30:01
   Editor: 128.230.38.21 <>
   Wiki: Jakarta Lucene Wiki
   Page: IndexingOtherLanguages
   URL: http://wiki.apache.org/jakarta-lucene/IndexingOtherLanguages

   no comment

Change Log:

------------------------------------------------------------------------------
@@ -10,7 +10,7 @@
 
  1. Know the encoding of the documents you wish to index.  Java assumes the platform's default encoding when reading files unless you tell it otherwise.  To create a Reader that reads other encodings, see [http://java.sun.com/j2se/1.4.2/docs/api/java/io/InputStreamReader.html InputStreamReader].  I find it easiest to convert all of my files to UTF-8 before indexing, and then I read them in by doing:[[BR]]
     `Reader reader = new InputStreamReader(new FileInputStream("path to file"), "UTF-8");`
-Note:  The demo supplied with Lucene does not support UTF-8 out of the box.  You will have to modify it.
+    
 
  2. Identify the Analyzer you will use or write your own if none exists.  There are many great analyzers available that will index a wide variety of languages.  See [http://jakarta.apache.org/lucene/docs/lucene-sandbox/ Sandbox] for some.  Otherwise, look around the web.  If you are writing your own, consider donating it to the Lucene Sandbox so that others can benefit from your brilliance.  See item 3. below for what is needed in a custom analyzer.
      'Put example of writing an Analyzer here'

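The reading step in item 1 of the wiki text above can be tried on its own, without Lucene. The following is only a minimal sketch: the class name `EncodingAwareReader` and the helpers `open` and `readAll` are made up for illustration and are not part of Lucene or its demo. It writes a small UTF-8 file, reads it back through an `InputStreamReader` with the encoding named explicitly, and checks that the non-ASCII characters survive the round trip.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;

// Hypothetical helper class, for illustration only (not part of Lucene).
class EncodingAwareReader {

    // Open a file with an explicitly named encoding instead of relying
    // on the platform default.
    static Reader open(String path, String encoding) throws IOException {
        return new BufferedReader(
                new InputStreamReader(new FileInputStream(path), encoding));
    }

    // Drain a Reader into a String and close it afterwards.
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        try {
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } finally {
            reader.close();
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Sample text with accented Latin and CJK characters.
        String text = "na\u00efve caf\u00e9 \u65e5\u672c\u8a9e";

        File tmp = File.createTempFile("sample", ".txt");
        tmp.deleteOnExit();

        // Write the sample as UTF-8 so the file's encoding is known.
        Writer w = new OutputStreamWriter(new FileOutputStream(tmp), "UTF-8");
        try {
            w.write(text);
        } finally {
            w.close();
        }

        // Read it back, naming the same encoding explicitly.
        String roundTrip = readAll(open(tmp.getPath(), "UTF-8"));
        System.out.println(roundTrip.equals(text));
    }
}
```

The Reader returned by `open` is the kind of object you would hand to whatever builds your Lucene document; how that is wired up depends on the Lucene version in use and is left out here. Passing the wrong encoding name for a file does not raise an error, it just produces garbled characters, which is why knowing the source encoding comes first.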