Posted to java-user@lucene.apache.org by Michael McCandless <lu...@mikemccandless.com> on 2009/05/25 18:14:51 UTC
Re: Hit highlighting for non-english unicode index/queries not working?

Could you boil down this example to a smaller test case that fails?

E.g., create a RAMDirectory, index one document (one that should show
highlighting), search it, run the highlighter, and show that it's not working?
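
As a first check before building that test case, here is a minimal sketch
(plain Java, no Lucene on the classpath) of how splitting on non-letters
treats the Tamil term from the code below. It assumes SimpleAnalyzer splits
text wherever Character.isLetter is false and lowercases the pieces. The class
name LetterSplitCheck and the method splitOnNonLetters are made up for
illustration and are not part of Lucene.

```java
import java.util.ArrayList;
import java.util.List;

public class LetterSplitCheck {

    // Splits text wherever Character.isLetter is false and lowercases the
    // pieces. This is a stand-in for letter-based tokenization (an
    // assumption about what the analyzer does), not Lucene's own code.
    static List<String> splitOnNonLetters(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A non-letter ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The same term the quoted code passes to new Term("content", ...).
        String term = "\u0BAA\u0BBF\u0BB0\u0B9A\u0BC1\u0BB0";
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            System.out.printf("U+%04X isLetter=%b%n", (int) c, Character.isLetter(c));
        }
        List<String> tokens = splitOnNonLetters(term);
        System.out.println(tokens.size() + " token(s) from one query term");
    }
}
```

If the vowel signs U+0BBF and U+0BC1 report isLetter=false, the token stream
handed to the highlighter would never contain the whole term that the
hand-built PhraseQuery looks for. That may be worth ruling out in the small
test case.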

Mike

On Mon, May 25, 2009 at 10:02 AM, KK <di...@gmail.com> wrote:
> Hi,
> I'm trying to index some non-English text. Indexing and searching are
> working fine. From the command line I'm able to provide the text as
> Unicode escapes, like this,
> \u0BAA\u0BB0\u0BBF\u0BA3\u0BBE\u0BAE
> and get the search results.
> Then I tried to add hit highlighting for the same. I started with simple
> English text and used phrase queries for the input queries. My code
> looks like this,
>
>
> import java.io.FileReader;
> import java.io.IOException;
> import java.io.InputStreamReader;
> import java.util.Date;
> import java.io.*;
> import java.nio.charset.Charset;
>
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.index.FilterIndexReader;
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.index.Term;
> import org.apache.lucene.queryParser.QueryParser;
> import org.apache.lucene.search.HitCollector;
> import org.apache.lucene.search.Hits;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.search.Query;
> import org.apache.lucene.search.PhraseQuery;
> import org.apache.lucene.search.ScoreDoc;
> import org.apache.lucene.search.Searcher;
> import org.apache.lucene.search.TopDocCollector;
> import org.apache.lucene.search.highlight.Highlighter;
> import org.apache.lucene.search.highlight.QueryScorer;
> import org.apache.lucene.search.Scorer;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.SimpleAnalyzer;
>
>
> /** Simple command-line based search demo. */
> public class LuceneSearcher {
>    // core36 refers to the exact index directory for the Tamil pages
>    private static final String indexPath = "/opt/lucene/index" + "/core36";
>
>    private void searchIndex(String terms) throws Exception{
>        String queryString = "";
>        PhraseQuery phrase = new PhraseQuery();
>        String[] termArray = terms.split(" ");
>        for (int i=0; i<termArray.length; i++) {
>            System.out.println("adding " + termArray[i]);
>            //phrase.add(new Term("content", termArray[i]));
>            //queryString += termArray[i];
>        }
>        //phrase.add(new Term("content", "ubuntu"));
>        String tamilQuery = "\u0BAA\u0BBF\u0BB0\u0B9A\u0BC1\u0BB0";
>        //tamilQuery = new String("ubuntu");
>        phrase.add(new Term("content", tamilQuery));
>        phrase.setSlop(1);
>        System.out.println("phrase query " + phrase.toString());
>
>         IndexSearcher searcher = new IndexSearcher(indexPath);
>        QueryParser queryParser = null;
>        try {
>            queryParser = new QueryParser("content", new SimpleAnalyzer());
>        } catch (Exception ex) {
>             ex.printStackTrace();
>        }
>
>        //Query query = queryParser.parse(queryString);
>
>        Hits hits = null;
>        try {
>             hits = searcher.search(phrase);
>        } catch (Exception ex) {
>             ex.printStackTrace();
>        }
>        //for highlighter section
>        QueryScorer scorer = new QueryScorer(phrase);
>        Highlighter highlighter = new Highlighter(scorer);
>
>        for (int i = 0; i < hits.length(); i++) {
>            String content = hits.doc(i).get("content");
>            TokenStream stream = new SimpleAnalyzer().tokenStream("content", new StringReader(content));
>            String fragment = highlighter.getBestFragments(stream, content, 5, "...");
>            System.out.println(fragment);
>        }
>
>
>        int hitCount = hits.length();
>        System.out.println("Results found :" + hitCount);
>
>        /*
>        for (int ix=0; ix<hitCount; ix++) {
>             Document doc = hits.doc(ix);
>            System.out.println(doc.get("content"));
>        }
>        */
>    }
>
>    public static void main(String args[]) throws Exception{
>         LuceneSearcher searcher = new LuceneSearcher();
>        String termString = args[0];
>        System.out.println("searching for " + args[0]);
>        searcher.searchIndex(termString);
>    }
>
> }
> ----------------------code ends here---------------------------------
> NB: Please ignore basic coding conventions [indentation, comments etc.]. You
> might find some unnecessary code intermixed with the highlighting code;
> please ignore it.
>
> Now when I searched some English docs I got the results with <B></B>
> tags surrounding the hits, like this,
>
> <B>Ubuntu</B> Press Releases Media Contact <B>Ubuntu</B> News Home
> <B>Ubuntu</B> Security NoticesThese are the <B>Ubuntu</B> security notices
> that affect the current supported releases of <B>Ubuntu</B>. These notices
> are also posted
>
> Then I wanted to test the same for Tamil text. One more piece of
> information first: before adding the highlighting code I was able to
> search a Lucene index from the command line using the raw Unicode
> escapes, like this,
> [kk@kk-laptop]$ java LuceneSearcher "\u0BAA\u0BBF\u0BB0\u0B9A\u0BC1\u0BB0"
>
> and it gave me the page that matches the above query. Now I tried to do the
> same along with highlighting. In the code posted above you can see that I
> commented out the English terms and added one Tamil Unicode query, to see
> if it gives me the same result I was getting before adding highlighting.
> I'm not getting any results. This might be because the query I'm forming
> from these Unicode escapes is wrong, or it may be something else. I can't
> figure out what exactly is going wrong. Probably some silly mistake, but I
> can't find it. Could someone take the trouble to go through the above code
> and find out what's wrong? Thank you very much.
>
> Thanks,
> KK.
>
