Posted to java-user@lucene.apache.org by KK <di...@gmail.com> on 2009/05/25 16:02:11 UTC

Hit highlighting for non-english unicode index/queries not working?

Hi,
I'm trying to index some non-English texts. Indexing and searching are
working fine. From the command line I'm able to provide UTF-8 Unicode text
as input like this,
\u0BAA\u0BB0\u0BBF\u0BA3\u0BBE\u0BAE
and I'm able to get the search results.
Then I tried to add hit highlighting for the same. So I started with simple
English texts and used phrase queries as the input queries. My code
looks like this,


import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Date;
import java.io.*;
import java.nio.charset.Charset;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.FilterIndexReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.TopDocCollector;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.SimpleAnalyzer;


/** Simple command-line based search demo. */
public class LuceneSearcher {
    private static final String indexPath = "/opt/lucene/index" + "/core36";
//core36 refers to the exact index directory for tamil pages

    private void searchIndex(String terms) throws Exception{
        String queryString = "";
        PhraseQuery phrase = new PhraseQuery();
        String[] termArray = terms.split(" ");
        for (int i=0; i<termArray.length; i++) {
            System.out.println("adding " + termArray[i]);
            //phrase.add(new Term("content", termArray[i]));
            //queryString += termArray[i];
        }
        //phrase.add(new Term("content", "ubuntu"));
        String tamilQuery = "\u0BAA\u0BBF\u0BB0\u0B9A\u0BC1\u0BB0";
        //tamilQuery = new String("ubuntu");
        phrase.add(new Term("content", tamilQuery));
        phrase.setSlop(1);
        System.out.println("phrase query " + phrase.toString());

         IndexSearcher searcher = new IndexSearcher(indexPath);
        QueryParser queryParser = null;
        try {
            queryParser = new QueryParser("content", new SimpleAnalyzer());
        } catch (Exception ex) {
             ex.printStackTrace();
        }

        //Query query = queryParser.parse(queryString);

        Hits hits = null;
        try {
             hits = searcher.search(phrase);
        } catch (Exception ex) {
             ex.printStackTrace();
        }
        //for highlighter section
        QueryScorer scorer = new QueryScorer(phrase);
        Highlighter highlighter = new Highlighter(scorer);

        for (int i = 0; i < hits.length(); i++) {
            String content = hits.doc(i).get("content");
            TokenStream stream = new SimpleAnalyzer().tokenStream("content", new StringReader(content));
            String fragment = highlighter.getBestFragments(stream, content, 5, "...");
            System.out.println(fragment);
        }


        int hitCount = hits.length();
        System.out.println("Results found :" + hitCount);

        /*
        for (int ix=0; ix<hitCount; ix++) {
             Document doc = hits.doc(ix);
            System.out.println(doc.get("content"));
        }
        */
    }

    public static void main(String args[]) throws Exception{
         LuceneSearcher searcher = new LuceneSearcher();
        String termString = args[0];
        System.out.println("searching for " + args[0]);
        searcher.searchIndex(termString);
    }

}
----------------------code ends here---------------------------------
NB: Please ignore basic coding conventions [indentation, comments, etc.].
You might find some unnecessary code intermixed with the highlighting code;
please ignore it.

Now when I searched some English docs I got the results with <B></B>
tags surrounding the hits, like this:

<B>Ubuntu</B> Press Releases Media Contact <B>Ubuntu</B> News Home
<B>Ubuntu</B> Security NoticesThese are the <B>Ubuntu</B> security notices
that affect the current supported releases of <B>Ubuntu</B>. These notices
are also posted

Now I thought of testing the same for Tamil texts. Before that, I would like
to add one more piece of information: prior to adding the highlighting code,
I was able to search a Lucene index from the command line using the raw
Unicode text like this,
[kk@kk-laptop]$ java LuceneSearcher "\u0BAA\u0BBF\u0BB0\u0B9A\u0BC1\u0BB0"

and it gave me the page that matches the above query. Now I tried to do the
same along with highlighting. In the code I posted above you can see that I
commented out the English terms, added one Tamil Unicode query, and tried
to see if it gives me the same result I was getting prior to highlighting,
and found that I'm not getting any results. This might be because the query
I'm forming from these Unicode texts is wrong, or maybe it's something else.
I'm not able to figure out what exactly is going wrong; some silly mistake,
I guess, but I still can't find it. Could someone take the pain to go
through the above code and find out what's wrong? Thank you very much.

Thanks,
KK.
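One thing worth checking in a setup like this: the JVM does not interpret
\uXXXX escapes in command-line arguments at run time (only javac does, at
compile time), so main() may be receiving the literal backslash text rather
than Tamil characters. If so, a small helper along these lines (hypothetical,
not part of the posted code) can decode the escapes before the query is built:

```java
public class UnicodeUnescape {
    // Decode literal \uXXXX escape sequences in a string into the
    // characters they name. All other characters pass through unchanged.
    static String unescape(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            if (s.charAt(i) == '\\' && i + 5 < s.length() && s.charAt(i + 1) == 'u') {
                out.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                out.append(s.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The argument below is the literal 36-character escape text,
        // as a shell would pass it; the output is the Tamil word itself.
        System.out.println(unescape("\\u0BAA\\u0BB0\\u0BBF\\u0BA3\\u0BBE\\u0BAE"));
    }
}
```

With this, the term handed to PhraseQuery would contain real Tamil
characters regardless of how the argument was typed.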

Re: Hit highlighting for non-english unicode index/queries not working?

Posted by Erick Erickson <er...@gmail.com>.
LowerCaseFilter is part of Lucene, as are any number of other filters. The
basic idea is just that *after* tokenization, there may be further
transformations you want to do on each token, such as lower-casing
it, stemming it, skipping it, <whatever else you might like to do>....

But watch out a bit: there are token Filters and search Filters, and they
have nothing to do with each other <G>.

Erick
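The tokenize-then-filter idea above can be sketched in plain Java (an
illustration only, not the Lucene API; in Lucene a chain of TokenFilters
does this over a TokenStream):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Plain-Java sketch of an analysis pipeline: tokenize first (split on
// whitespace), then run each token through a per-token transformation,
// here just lower-casing. Lucene's TokenFilter chain works the same way.
public class PipelineSketch {
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String raw : text.split("\\s+")) {           // tokenization step
            if (!raw.isEmpty()) {
                tokens.add(raw.toLowerCase(Locale.ROOT)); // per-token filter step
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Ubuntu Security Notices"));
        // prints [ubuntu, security, notices]
    }
}
```

Because the whitespace split never looks inside a token, Tamil words pass
through intact while English words still get case-folded.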


Re: Hit highlighting for non-english unicode index/queries not working?

Posted by KK <di...@gmail.com>.
Thank you, Erick.
As of now I'm using WhitespaceAnalyzer with no stemming and no stop-word
removal. Now I feel writing a simple analyzer won't be that difficult after
going through your mail. I'll give it a try. I don't have any idea about
filters, but I'm pretty sure it must be simple, and I will definitely go
through the examples in LIA 2nd Edn. Thank you.

--KK


Re: Hit highlighting for non-english unicode index/queries not working?

Posted by Erick Erickson <er...@gmail.com>.
It's fairly easy to construct your own analyzer by stringing together some
filters and tokenizers. LIA (1st ed)
had a SynonymAnalyzer. You probably want something like
(WARNING, example only, I'm not even sure it compiles!! Ripped
off from the WIKI)

public class MyAnalyzer extends Analyzer {
    public TokenStream tokenStream(String field, final Reader reader) {
        return new LowerCaseFilter(new WhitespaceTokenizer(reader));
    }
}

There are a number of Filters you can string together if you want to, say,
remove stop words, etc.

HTH
Erick
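The stop-word remark can be sketched in plain Java as well (again an
illustration only; in Lucene this step is what StopFilter does when chained
after the tokenizer):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java sketch of a stop-word filtering step: drop tokens that
// appear in a fixed stop set, keep everything else in order.
public class StopWordSketch {
    static final Set<String> STOP =
            new HashSet<String>(Arrays.asList("the", "of", "a", "are"));

    static List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<String>();
        for (String t : tokens) {
            if (!STOP.contains(t)) {
                kept.add(t);   // not a stop word, keep it
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(filter(Arrays.asList("the", "releases", "of", "ubuntu")));
        // prints [releases, ubuntu]
    }
}
```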

On Tue, May 26, 2009 at 6:38 AM, KK <di...@gmail.com> wrote:

> Thank you, @Muir.
> I was earlier using SimpleAnalyzer for all purposes, but as you recommended
> the whitespace one, I tried that analyzer, and the good thing is that
> I'm now able to index/search non-English text as well as support hit
> highlighting for these non-English texts. Thank you very much.
> But now there is one silly problem. As WhitespaceAnalyzer does not do
> anything other than separate the tokens based on whitespace, for English
> pages case-folding is getting missed. Unless I provide the exact words,
> including the right case, it does not give me results, which is quite
> obvious. As I went through the LIA 2nd Edn book, I found that it mentions
> we can use analyzers at the document level and also at the field level. I
> was quite amazed at the granularity of analysis supported by Lucene. It's
> there; we just have to make use of it. So I'm thinking of giving it a try,
> which will help me support both English and non-English
> indexing/searching/highlighting. Thank you all. Any ideas on the same are
> always welcome.
>
> Thanks,
> KK.
>
>
> On Tue, May 26, 2009 at 1:24 AM, Robert Muir <rc...@gmail.com> wrote:
>
> > As mentioned previously, I don't think your text is being analyzed the
> > way you want.
> >
> > SimpleAnalyzer will break your word \u0BAA\u0BB0\u0BBF\u0BA3\u0BBE\u0BAE
> > (பரிணாம) into 3 tokens:
> >
> > \u0BAA\u0BB0
> > \u0BA3
> > \u0BAE
> >
> > Not only does it incorrectly split your word into three words, but it
> > completely drops the dependent vowels (\u0BBF and \u0BBE).
> >
> > This is why I would recommend trying WhitespaceAnalyzer instead.
> > Also take a look at the Luke index tool; it's a very quick way to see
> > how your words are being analyzed by various analyzers.
> >
> >
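Muir's diagnosis can be checked directly from the JDK. In this era of
Lucene, SimpleAnalyzer is (as far as I can tell) backed by a tokenizer that
keeps only maximal runs of Character.isLetter() characters, and Tamil
dependent vowel signs are combining marks (Unicode category Mn), for which
isLetter() is false; hence the split and the dropped vowels:

```java
// Why a letter-run tokenizer mangles Tamil: the base consonants are
// letters, but the dependent vowel signs are combining marks, so a
// Character.isLetter() scan breaks the word at each vowel sign.
public class TamilTokenDemo {
    public static void main(String[] args) {
        System.out.println(Character.isLetter('\u0BAA')); // TAMIL LETTER PA -> true
        System.out.println(Character.isLetter('\u0BBF')); // TAMIL VOWEL SIGN I -> false
        System.out.println(Character.isLetter('\u0BBE')); // TAMIL VOWEL SIGN AA -> false
    }
}
```

A whitespace tokenizer never consults isLetter(), which is why it leaves
such words intact.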
> > On Mon, May 25, 2009 at 10:02 AM, KK <di...@gmail.com> wrote:
> >
> > > Hi,
> > > I'm trying to index some non-english texts. Indexing and searching is
> > > working fine. From command line I'm able to provide the utf-8 unicoded
> > text
> > > as input like this,
> > > \u0BAA\u0BB0\u0BBF\u0BA3\u0BBE\u0BAE
> > > and able to get the search results.
> > > Then I tried to add hit highlighting for the same. So I started with
> > simple
> > > english texts and used pharse queries for providing input queries. My
> > code
> > > looks like this,
> > >
> > >
> > > import java.io.FileReader;
> > > import java.io.IOException;
> > > import java.io.InputStreamReader;
> > > import java.util.Date;
> > > import java.io.*;
> > > import java.nio.charset.Charset;
> > >
> > > import org.apache.lucene.analysis.Analyzer;
> > > import org.apache.lucene.analysis.standard.StandardAnalyzer;
> > > import org.apache.lucene.document.Document;
> > > import org.apache.lucene.index.FilterIndexReader;
> > > import org.apache.lucene.index.IndexReader;
> > > import org.apache.lucene.index.Term;
> > > import org.apache.lucene.queryParser.QueryParser;
> > > import org.apache.lucene.search.HitCollector;
> > > import org.apache.lucene.search.Hits;
> > > import org.apache.lucene.search.IndexSearcher;
> > > import org.apache.lucene.search.Query;
> > > import org.apache.lucene.search.PhraseQuery;
> > > import org.apache.lucene.search.ScoreDoc;
> > > import org.apache.lucene.search.Searcher;
> > > import org.apache.lucene.search.TopDocCollector;
> > > import org.apache.lucene.search.highlight.Highlighter;
> > > import org.apache.lucene.search.highlight.QueryScorer;
> > > import org.apache.lucene.search.Scorer;
> > > import org.apache.lucene.analysis.TokenStream;
> > > import org.apache.lucene.analysis.SimpleAnalyzer;
> > >
> > >
> > > /** Simple command-line based search demo. */
> > > public class LuceneSearcher {
> > >    private static final String indexPath = "/opt/lucene/index" +
> > "/core36";
> > > //core36 refers to the exact index directory for tamil pages
> > >
> > >    private void searchIndex(String terms) throws Exception{
> > >        String queryString = "";
> > >        PhraseQuery phrase = new PhraseQuery();
> > >        String[] termArray = terms.split(" ");
> > >        for (int i=0; i<termArray.length; i++) {
> > >            System.out.println("adding " + termArray[i]);
> > >            //phrase.add(new Term("content", termArray[i]));
> > >            //queryString += termArray[i];
> > >        }
> > >        /
> > >        //phrase.add(new Term("content", "ubuntu"));
> > >        String tamilQuery = new
> > > String("\u0BAA\u0BBF\u0BB0\u0B9A\u0BC1\u0BB0");
> > >        //tamilQuery = new String("ubuntu");
> > >        phrase.add(new Term("content", tamilQuery));
> > >        phrase.setSlop(1);
> > >        System.out.println("phrase query " + phrase.toString());
> > >
> > >         IndexSearcher searcher = new IndexSearcher(indexPath);
> > >        QueryParser queryParser = null;
> > >        try {
> > >            queryParser = new QueryParser("content", new SimpleAnalyzer());
> > >        } catch (Exception ex) {
> > >             ex.printStackTrace();
> > >        }
> > >
> > >        //Query query = queryParser.parse(queryString);
> > >
> > >        Hits hits = null;
> > >        try {
> > >             hits = searcher.search(phrase);
> > >        } catch (Exception ex) {
> > >             ex.printStackTrace();
> > >        }
> > >        //for highlighter section
> > >        QueryScorer scorer = new QueryScorer(phrase);
> > >        Highlighter highlighter = new Highlighter(scorer);
> > >
> > >        for (int i = 0; i < hits.length(); i++) {
> > >            String content = hits.doc(i).get("content");
> > >            TokenStream stream = new SimpleAnalyzer().tokenStream("content", new StringReader(content));
> > >            String fragment = highlighter.getBestFragments(stream, content, 5, "...");
> > >            System.out.println(fragment);
> > >        }
> > >
> > >
> > >        int hitCount = hits.length();
> > >        System.out.println("Results found :" + hitCount);
> > >
> > >        /*
> > >        for (int ix=0; ix<hitCount; ix++) {
> > >             Document doc = hits.doc(ix);
> > >            System.out.println(doc.get("content"));
> > >        }
> > >        */
> > >    }
> > >
> > >    public static void main(String args[]) throws Exception{
> > >         LuceneSearcher searcher = new LuceneSearcher();
> > >        String termString = args[0];
> > >        System.out.println("searching for " + args[0]);
> > >        searcher.searchIndex(termString);
> > >    }
> > >
> > > }
> > > ----------------------code ends here---------------------------------
> > > NB: Please ignore basic coding conventions [indentation, comments, etc.].
> > > You might find some unnecessary code intermixed with the highlighting
> > > code; ignore it.
> > >
> > > Now when I searched for some english docs I got the results with <B></B>
> > > tags surrounding the hits, like this:
> > >
> > > <B>Ubuntu</B> Press Releases Media Contact <B>Ubuntu</B> News Home
> > > <B>Ubuntu</B> Security NoticesThese are the <B>Ubuntu</B> security
> > > notices that affect the current supported releases of <B>Ubuntu</B>.
> > > These notices are also posted
> > >
> > > Now I thought of testing the same for Tamil texts. Before this I would
> > > like to add one more piece of information: prior to adding the code for
> > > highlighting, I was able to search a Lucene index from the command line
> > > using the raw unicode texts like this,
> > > [kk@kk-laptop]$ java LuceneSearcher "\u0BAA\u0BBF\u0BB0\u0B9A\u0BC1\u0BB0"
> > >
> > > and it gives me the page that matches the above query. Now I tried to do
> > > the same along with highlighting. So in the code I posted above you can
> > > see that I commented out the english terms, added one tamil unicode
> > > query, and tried to see if it gives me the same result that I was getting
> > > prior to highlighting, and found that I'm not getting any results. This
> > > might be because the query I'm forming using these unicode texts is
> > > wrong, or maybe something else. I'm not able to figure out what exactly
> > > is going wrong. Some silly mistake, I guess, but still I'm not able to
> > > find it. Could someone take the pains to go through the above code and
> > > find out what's wrong? Thank you very much.
> > >
> > > Thanks,
> > > KK.
> > >
> >
> >
> >
> > --
> > Robert Muir
> > rcmuir@gmail.com
> >
>

Re: Hit highlighting for non-english unicode index/queries not working?

Posted by KK <di...@gmail.com>.
Thank you @Muir.
I was earlier using SimpleAnalyzer for all purposes, but as you recommended
the whitespace one, I tried that analyzer, and the good news is that I'm now
able to index/search non-english text as well as support hit highlighting
for these non-english texts. Thank you very much.
But now there is one small problem. As WhitespaceAnalyzer does nothing other
than separate tokens on whitespace, case-folding is lost for english pages.
Unless I provide the exact words, including the right case, it doesn't give
me results, which is quite obvious. Going through the LIA 2nd edition book, I
found that it mentions we can use analyzers at the document level and also at
the field level. I was quite amazed at the granularity of analysis supported
by Lucene; it's there, we just have to make use of it. So I'm thinking of
giving that a try, which should help me support both english and non-english
indexing/searching/highlighting. Thank you all. Any ideas on the same are
always welcome.
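For reference, the field-level idea can be wired up with Lucene's
PerFieldAnalyzerWrapper. A rough sketch, assuming the 2.x-era API used
elsewhere in this thread; the field names here ("content", "englishContent")
are only illustrative:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.WhitespaceAnalyzer;

public class AnalyzerSetup {
    public static Analyzer build() {
        // Default analyzer: plain whitespace split, which leaves Tamil
        // (and other non-English) tokens intact.
        PerFieldAnalyzerWrapper wrapper =
                new PerFieldAnalyzerWrapper(new WhitespaceAnalyzer());
        // A hypothetical English-only field gets letter tokenization
        // plus lowercasing, so case-folding works there.
        wrapper.addAnalyzer("englishContent", new SimpleAnalyzer());
        return wrapper;
    }
}
```

The same wrapper instance should be handed to both the IndexWriter and the
QueryParser, so that queries are analyzed the same way the index was built.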

Thanks,
KK.


On Tue, May 26, 2009 at 1:24 AM, Robert Muir <rc...@gmail.com> wrote:

> as mentioned previously, i dont think your text is being analyzed the way
> you want.
>
> SimpleAnalyzer will break your word \u0BAA\u0BB0\u0BBF\u0BA3\u0BBE\u0BAE
> (பரிணாம) into 3 tokens:
>
> \u0BAA\u0BB0
> \u0BA3
> \u0BAE
>
> Not only does it incorrectly split your word into three words, but it
> completely drops the dependent vowels (\u0BBF and \u0BBE).
>
> This is why i would recommend trying whitespace analyzer instead.
> Also take a look at the Luke index tool, its a very quick way to see how
> your words are being analyzed by various analyzers.
>

Re: Hit highlighting for non-english unicode index/queries not working?

Posted by Robert Muir <rc...@gmail.com>.
As mentioned previously, I don't think your text is being analyzed the way
you want.

SimpleAnalyzer will break your word \u0BAA\u0BB0\u0BBF\u0BA3\u0BBE\u0BAE
(பரிணாம) into 3 tokens:

\u0BAA\u0BB0
\u0BA3
\u0BAE

Not only does it incorrectly split your word into three words, but it
completely drops the dependent vowels (\u0BBF and \u0BBE).

This is why I would recommend trying WhitespaceAnalyzer instead.
Also take a look at the Luke index tool; it's a very quick way to see how
your words are being analyzed by various analyzers.
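This split can be reproduced with plain JDK code, no Lucene required:
SimpleAnalyzer's LetterTokenizer emits maximal runs of Character.isLetter()
characters, and the Tamil dependent vowel signs fail that test because they
are combining marks, not letters. A minimal sketch of that rule (not Lucene's
actual class, just the same test):

```java
import java.util.ArrayList;
import java.util.List;

public class LetterSplitDemo {
    // Mimics the rule LetterTokenizer (used by SimpleAnalyzer) applies:
    // a token is a maximal run of Character.isLetter() characters.
    static List<String> letterTokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetter(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                // Tamil dependent vowel signs are combining marks
                // (category Mc), not letters, so they end the current
                // token and are themselves dropped.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String word = "\u0BAA\u0BB0\u0BBF\u0BA3\u0BBE\u0BAE"; // பரிணாம
        System.out.println(letterTokenize(word)); // prints [பர, ண, ம]
        System.out.println(Character.isLetter('\u0BBF')); // prints false
    }
}
```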


On Mon, May 25, 2009 at 10:02 AM, KK <di...@gmail.com> wrote:




-- 
Robert Muir
rcmuir@gmail.com

Re: Hit highlighting for non-english unicode index/queries not working?

Posted by Michael McCandless <lu...@mikemccandless.com>.
Could you boil this example down to a smaller test case that fails?

E.g., make a RAMDirectory, index one document (that should show
highlighting), search it, run the highlighter, and show that it's not working?
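Such a boiled-down test might look like this. This is only a sketch against
the Lucene 2.4-era API that the code in this thread targets (RAMDirectory,
Hits, and the contrib Highlighter), so double-check the calls against your
version:

```java
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.store.RAMDirectory;

public class HighlightRepro {
    public static void main(String[] args) throws Exception {
        String text = "\u0BAA\u0BBF\u0BB0\u0B9A\u0BC1\u0BB0 some more text";
        WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer();

        // Index a single document in an in-memory directory.
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, analyzer, true,
                IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        doc.add(new Field("content", text, Field.Store.YES,
                Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();

        // Search for the Tamil word and try to highlight it.
        IndexSearcher searcher = new IndexSearcher(dir);
        TermQuery query = new TermQuery(
                new Term("content", "\u0BAA\u0BBF\u0BB0\u0B9A\u0BC1\u0BB0"));
        System.out.println("hits: " + searcher.search(query).length());

        Highlighter highlighter = new Highlighter(new QueryScorer(query));
        // Expected: the text with <B>...</B> around the Tamil word.
        System.out.println(
                highlighter.getBestFragment(analyzer, "content", text));
    }
}
```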

Mike

On Mon, May 25, 2009 at 10:02 AM, KK <di...@gmail.com> wrote:

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org