You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by ma...@yahoo.co.uk on 2004/10/29 01:16:11 UTC

Faster highlighting with TermPositionVectors

Thanks to the recent changes (see CVS) in TermFreqVector support we can now make use of term offset information held 
in the Lucene index rather than incurring the cost of re-analyzing text to highlight it.

I have created a  class ( see http://www.inperspective.com/lucene/TokenSources.java ) which handles creating
a TokenStream from the TermPositionVector stored in the database which can then be passed to the highlighter.
This approach is significantly faster than re-parsing the original text.
If people are happy with this class I'll add it to the Highlighter sandbox but it may sit better elsewhere in the Lucene code base
as a more general purpose utility.

BTW as part of putting this together I found that the TermFreq code throws a null pointer when indexing fields
that produce no tokens (ie empty or all stopwords). Otherwise things work very well.


Cheers
Mark



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: Faster highlighting with TermPositionVectors

Posted by Fred Toth <ft...@synernet.com>.
Hi,

We are very interested in highlighting, but haven't gotten around
to reviewing the state of the highlighting mechanisms.

Could someone possibly give me the "big picture" on highlighting?
What code is available?
How does it work?
What are the current issues?

Many thanks,

Fred

At 07:16 PM 10/28/2004, you wrote:
>Thanks to the recent changes (see CVS) in TermFreqVector support we can 
>now make use of term offset information held
>in the Lucene index rather than incurring the cost of re-analyzing text to 
>highlight it.
>
>I have created a  class ( see 
>http://www.inperspective.com/lucene/TokenSources.java ) which handles creating
>a TokenStream from the TermPositionVector stored in the database which can 
>then be passed to the highlighter.
>This approach is significantly faster than re-parsing the original text.
>If people are happy with this class I'll add it to the Highlighter sandbox 
>but it may sit better elsewhere in the Lucene code base
>as a more general purpose utility.
>
>BTW as part of putting this together I found that the TermFreq code throws 
>a null pointer when indexing fields
>that produce no tokens (ie empty or all stopwords). Otherwise things work 
>very well.
>
>
>Cheers
>Mark
>
>
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>For additional commands, e-mail: lucene-user-help@jakarta.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: Faster highlighting with TermPositionVectors

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
Mark,

This is great stuff!

One quick comment just at my look at the code (I haven't tried it yet). 
  Shouldn't the tpv variable be used in this method?

     public static TokenStream getAnyTokenStream(IndexReader reader,int 
docId, String field,Analyzer analyzer) throws IOException
     {
		TokenStream ts=null;

		TermFreqVector tfv=(TermFreqVector) 
reader.getTermFreqVector(docId,field);
		if(tfv!=null)
		{
		    if(tfv instanceof TermPositionVector)
		    {
		        //the most efficient choice..
 >>>		        TermPositionVector tpv=(TermPositionVector) 
reader.getTermFreqVector(docId,field);
		        ts=getTokenStream(reader,docId,field);
		    }
		}
		//No token info stored so fall back to analyzing raw content
		if(ts==null)
		{
		    ts=getTokenStream(reader,docId,field,analyzer);
		}
		return ts;
     }

Erik


On Oct 28, 2004, at 7:16 PM, markharw00d@yahoo.co.uk wrote:

> Thanks to the recent changes (see CVS) in TermFreqVector support we 
> can now make use of term offset information held
> in the Lucene index rather than incurring the cost of re-analyzing 
> text to highlight it.
>
> I have created a  class ( see 
> http://www.inperspective.com/lucene/TokenSources.java ) which handles 
> creating
> a TokenStream from the TermPositionVector stored in the database which 
> can then be passed to the highlighter.
> This approach is significantly faster than re-parsing the original 
> text.
> If people are happy with this class I'll add it to the Highlighter 
> sandbox but it may sit better elsewhere in the Lucene code base
> as a more general purpose utility.
>
> BTW as part of putting this together I found that the TermFreq code 
> throws a null pointer when indexing fields
> that produce no tokens (ie empty or all stopwords). Otherwise things 
> work very well.
>
>
> Cheers
> Mark
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org