You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Liaqat Ali <li...@gmail.com> on 2007/12/04 11:53:51 UTC

Indexing Non-English text

Hi,
I m facing a problem while indexing a small .txt file with Lucene. The 
file which i want to index with lucene is in Urdu language (varient of 
Arabic and Persian). But the Index i get is in Unicode form, not in the 
real form (original Urdu text). This program works good for a file in 
English language. This is the code i use for indexing..

        FileReader file = new FileReader ("urdoc.txt");
        BufferedReader buff = new BufferedReader(file);
        String line = buff.readLine();
        boolean eof = false;
        buff.close();
        String indexDir = "D:\\index";
               Analyzer analyzer = new StandardAnalyzer();
            boolean createFlag = true;
        IndexWriter writer =
                    new IndexWriter(indexDir, analyzer, createFlag);
            Document document  = new Document();
        document.add(new Field("fieldname",line, Field.Store.YES,
        Field.Index.TOKENIZED));
            writer.addDocument(document);
            writer.close();

Kindly guide me, what I should do, would i have to change this code or 
whatever else do you suggest?

Liaqat

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Indexing Non-English text

Posted by Grant Ingersoll <gs...@apache.org>.

FileReader is dependent on your local locale.  http://wiki.apache.org/lucene-java/IndexingOtherLanguages 
  has some useful tips.  Essentially, you need to make sure you  
control the encodings at all input points of your application.  Lucene  
will do the appropriate thing internally.

On Dec 4, 2007, at 5:53 AM, Liaqat Ali wrote:

> Hi,
> I m facing a problem while indexing a small .txt file with Lucene.  
> The file which i want to index with lucene is in Urdu language  
> (varient of Arabic and Persian). But the Index i get is in Unicode  
> form, not in the real form (original Urdu text). This program works  
> good for a file in English language. This is the code i use for  
> indexing..
>
>       FileReader file = new FileReader ("urdoc.txt");
>       BufferedReader buff = new BufferedReader(file);
>       String line = buff.readLine();
>       boolean eof = false;
>       buff.close();
>       String indexDir = "D:\\index";
>              Analyzer analyzer = new StandardAnalyzer();
>           boolean createFlag = true;
>       IndexWriter writer =
>                   new IndexWriter(indexDir, analyzer, createFlag);
>           Document document  = new Document();
>       document.add(new Field("fieldname",line, Field.Store.YES,
>       Field.Index.TOKENIZED));
>           writer.addDocument(document);
>           writer.close();
>
> Kindly guide me, what I should do, would i have to change this code  
> or whatever else do you suggest?
>
> Liaqat
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org