Posted to java-user@lucene.apache.org by -Arne- <ar...@googlemail.com> on 2010/03/01 12:15:18 UTC

Highlighting large documents (Lucene 3.0.0)

Hi,

I'm using Lucene 3.0.0 and have large documents to search (log files of
0.5-20 MB). For better search results, the query terms are truncated on the
left and right, i.e. wrapped in wildcards: a search for "user" becomes
"*user*". Search performance is quite good, even for complex queries with
more than one search term, but highlighting the search results takes quite a
while. I have tried the default Highlighter, which doesn't seem to be fast
enough, and the FastVectorHighlighter, which seems to be fast enough but
doesn't return fragments for truncated queries (for non-truncated queries I
do get fragments).
Could anybody please tell me the best way to highlight large documents and,
if the FastVectorHighlighter is the solution for faster highlighting, how to
highlight truncated search queries?

Thanks in advance,
-Arne-
-- 
View this message in context: http://old.nabble.com/Highlighting-large-documents-%28Lucene-3.0.0%29-tp27714198p27714198.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Highlighting large documents (Lucene 3.0.0)

Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.
-Arne- wrote:
> Hi Koji,
> thanks for your answer. Could you help me once again? What exactly am I
> supposed to do?
>
>   
Here is the concrete program I have in mind:

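import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ngram.NGramTokenizer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.Field.TermVector;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriter.MaxFieldLength;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.vectorhighlight.FastVectorHighlighter;
import org.apache.lucene.search.vectorhighlight.FieldQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class TestHighlightTruncatedSearchQuery {

  static Directory dir = new RAMDirectory();
  // the same bi-gram analyzer is used for indexing and for query parsing
  static Analyzer analyzer = new BiGramAnalyzer();
  static final String[] DOCS = {
    "import org.apache.lucene.analysis.Analyzer;",
    "import org.apache.lucene.analysis.TokenStream;",
    "import org.apache.lucene.analysis.ngram.NGramTokenizer;",
    "import org.apache.lucene.index.IndexWriter;",
    "import org.apache.lucene.index.IndexWriter.MaxFieldLength;",
    "import org.apache.lucene.store.Directory;",
    "import org.apache.lucene.store.RAMDirectory;"
  };
  static final String F = "f";

  public static void main(String[] args) throws Exception {
    makeIndex();
    searchIndex();
  }

  static void makeIndex() throws IOException {
    IndexWriter writer = new IndexWriter( dir, analyzer, true, MaxFieldLength.LIMITED );
    for( String value : DOCS ){
      Document doc = new Document();
      // FastVectorHighlighter needs term vectors with positions and offsets
      doc.add( new Field( F, value, Store.YES, Index.ANALYZED, TermVector.WITH_POSITIONS_OFFSETS ) );
      writer.addDocument( doc );
    }
    writer.close();
  }

  static void searchIndex() throws Exception {
    IndexSearcher searcher = new IndexSearcher( dir, true );
    IndexReader reader = searcher.getIndexReader();
    // Lucene 3.0 requires the Version argument here
    QueryParser parser = new QueryParser( Version.LUCENE_30, F, analyzer );
    // use "Direct" rather than "*Direct*"
    Query query = parser.parse( "Direct" );
    FastVectorHighlighter h = new FastVectorHighlighter();
    FieldQuery fieldQuery = h.getFieldQuery( query );
    TopDocs docs = searcher.search( query, 10 );
    for( ScoreDoc scoreDoc : docs.scoreDocs ){
      String snippet = h.getBestFragment( fieldQuery, reader, scoreDoc.doc, F, 100 );
      System.out.println( scoreDoc.doc + " : " + snippet );
    }
    searcher.close();
  }

  // splits the text into overlapping 2-character tokens
  static class BiGramAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
      return new NGramTokenizer( reader, 2, 2 );
    }
  }
}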
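
The program above needs the Lucene jars to run. As a rough illustration of the
idea behind it, here is a minimal plain-Java sketch (no Lucene involved) of how
a bi-gram match plus character offsets can be turned into a highlighted string.
The class OffsetHighlightSketch, its methods, and the choice of <b> tags are
assumptions made for this illustration; the fragments that FastVectorHighlighter
actually returns may look different.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetHighlightSketch {

  // A bi-gram token together with the character offsets it covers in the text.
  // (Hypothetical helper type, loosely mirroring term vectors with offsets.)
  static final class Token {
    final String term;
    final int start;
    final int end;

    Token(String term, int start, int end) {
      this.term = term;
      this.start = start;
      this.end = end;
    }
  }

  // Split text into overlapping 2-character tokens and remember their offsets.
  static List<Token> bigramsWithOffsets(String text) {
    List<Token> tokens = new ArrayList<>();
    for (int i = 0; i + 2 <= text.length(); i++) {
      tokens.add(new Token(text.substring(i, i + 2), i, i + 2));
    }
    return tokens;
  }

  // Find the first run of consecutive tokens equal to the query's bi-grams and
  // wrap the characters they cover in <b>...</b>. Returns null if nothing matches.
  static String highlight(String text, String query) {
    List<Token> doc = bigramsWithOffsets(text);
    List<Token> q = bigramsWithOffsets(query);
    if (q.isEmpty()) {
      return null; // queries shorter than two characters produce no bi-grams
    }
    for (int i = 0; i + q.size() <= doc.size(); i++) {
      boolean match = true;
      for (int j = 0; j < q.size(); j++) {
        if (!doc.get(i + j).term.equals(q.get(j).term)) {
          match = false;
          break;
        }
      }
      if (match) {
        int start = doc.get(i).start;
        int end = doc.get(i + q.size() - 1).end;
        return text.substring(0, start) + "<b>" + text.substring(start, end)
            + "</b>" + text.substring(end);
      }
    }
    return null;
  }

  public static void main(String[] args) {
    // prints: import org.apache.lucene.store.<b>Direct</b>ory;
    System.out.println(highlight("import org.apache.lucene.store.Directory;", "Direct"));
  }
}
```

Because the match is located through token offsets, the stored text never has
to be re-analyzed at highlight time, which is why storing positions and offsets
in the term vector matters for large documents.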


Koji

-- 
http://www.rondhuit.com/en/




Re: Highlighting large documents (Lucene 3.0.0)

Posted by -Arne- <ar...@googlemail.com>.
Hi Koji,
thanks for your answer. Could you help me once again? What exactly am I
supposed to do?


Koji Sekiguchi-2 wrote:
> 
> I'm not sure this is the best way, but can you index and search
> the highlighting field with NGram? Since FVH supports highlighting
> NGram fields, you can search for "user" as it is (rather than
> "*user*") and still get highlights from the NGram field.
> 
> Koji

-- 
View this message in context: http://old.nabble.com/Highlighting-large-documents-%28Lucene-3.0.0%29-tp27714198p27745072.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.




Re: Highlighting large documents (Lucene 3.0.0)

Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.
-Arne- wrote:
> Hi,
>
> I'm using Lucene 3.0.0 and have large documents to search (log files of
> 0.5-20 MB). For better search results, the query terms are truncated on the
> left and right, i.e. wrapped in wildcards: a search for "user" becomes
> "*user*". Search performance is quite good, even for complex queries with
> more than one search term, but highlighting the search results takes quite a
> while. I have tried the default Highlighter, which doesn't seem to be fast
> enough, and the FastVectorHighlighter, which seems to be fast enough but
> doesn't return fragments for truncated queries (for non-truncated queries I
> do get fragments).
> Could anybody please tell me the best way to highlight large documents and,
> if the FastVectorHighlighter is the solution for faster highlighting, how to
> highlight truncated search queries?
>
> Thanks in advance,
> -Arne-
>   
I'm not sure this is the best way, but can you index and search
the highlighting field with NGram? Since FVH supports highlighting
NGram fields, you can search for "user" as it is (rather than
"*user*") and still get highlights from the NGram field.
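
To show why an NGram field can stand in for a "*user*" style query, here is a
minimal plain-Java sketch (no Lucene involved). It assumes bi-grams, i.e.
overlapping 2-character tokens; the class BigramSubstringSketch and its methods
are hypothetical names used only for this illustration. A substring of the
document produces the same consecutive bi-grams as the query, so a plain
phrase-style match on the tokens finds it without any wildcard.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class BigramSubstringSketch {

  // Split text into overlapping 2-character tokens, e.g. "user" -> [us, se, er].
  static List<String> bigrams(String text) {
    List<String> tokens = new ArrayList<>();
    for (int i = 0; i + 2 <= text.length(); i++) {
      tokens.add(text.substring(i, i + 2));
    }
    return tokens;
  }

  // Token position at which the query's bi-grams occur consecutively in the
  // document's bi-grams, or -1 if they do not. The query must have at least
  // two characters, otherwise it has no bi-grams to match on.
  static int phrasePosition(String docText, String queryText) {
    List<String> query = bigrams(queryText);
    if (query.isEmpty()) {
      return -1;
    }
    return Collections.indexOfSubList(bigrams(docText), query);
  }

  public static void main(String[] args) {
    System.out.println(bigrams("user"));                               // [us, se, er]
    System.out.println(phrasePosition("superuser logged in", "user")); // 5
    System.out.println(phrasePosition("superuser logged in", "admin")); // -1
  }
}
```

The trade-off is index size: an NGram field holds roughly one token per
character of the original text, which is worth keeping in mind for log files
of up to 20 MB.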

Koji

-- 
http://www.rondhuit.com/en/

