Posted to java-user@lucene.apache.org by "Kevin A. Burton" <bu...@newsmonster.org> on 2004/03/31 02:56:45 UTC

Performance of hit highlighting and finding term positions for a specific document

I'm playing with this package:

http://home.clara.net/markharwood/lucene/highlight.htm

Trying to do hit highlighting.  This implementation uses another 
Analyzer to find the positions for the result terms. 

This seems very inefficient, since Lucene already knows the 
frequency and position of given terms in the index.

My question is whether it's hard to find a TermPosition for a given term 
in a given document rather than the whole index.

IndexReader.termPositions( Term term ) is term-specific, not term- and 
document-specific.

Also, it seems that after all this time Lucene should have efficient 
hit highlighting as a standard package.  Is there any interest in seeing 
a contribution in the sandbox for this if it uses the index positions?

-- 

Please reply using PGP.

    http://peerfear.org/pubkey.asc    
    
    NewsMonster - http://www.newsmonster.org/
    
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
       AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
  IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

Re: Performance of hit highlighting and finding term positions for a specific document

Posted by "Kevin A. Burton" <bu...@newsmonster.org>.

Erik Hatcher wrote:

> On Mar 30, 2004, at 7:56 PM, Kevin A. Burton wrote:
>
>> Trying to do hit highlighting.  This implementation uses another 
>> Analyzer to find the positions for the result terms.
>> This seems that it's very inefficient since lucene already knows the 
>> frequency and position of given terms in the index.
>
>
> What if the original analyzer removed stop words, stemmed, and 
> injected synonyms?

Just use the same analyzer :)... I agree it's not the best approach for 
this reason and the CPU reason.

>> Also it seems that after all this time that Lucene should have 
>> efficient hit highlighting as a standard package.  Is there any 
>> interest in seeing a contribution in the sandbox for this if it uses 
>> the index positions?
>
>
> Big +1, regardless of the implementation details.  Hit highlighting is 
> so commonly requested that having it available at least in the 
> sandbox, or perhaps even in the core, makes a lot of sense. 

Well if we could make it efficient by using the frequency and positions 
of terms we're all set :)... I just need to figure out how to do this 
efficiently per document.
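
The per-document lookup in question can be pictured with a toy positional index. This is only a sketch under stated assumptions: PositionalIndex, addDocument, and positions are hypothetical names, not Lucene classes, and everything lives in memory. In Lucene itself, TermPositions extends TermDocs, so skipping the enumeration to a target document before reading positions may give a similar effect; check the Javadoc for your version. Note that the positions here are token ordinals, not character offsets.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy positional inverted index: term -> (docId -> token positions). Hypothetical, not Lucene code. */
final class PositionalIndex {
    private final Map<String, Map<Integer, List<Integer>>> postings = new HashMap<>();

    /** Indexes pre-tokenized text; position n means "the nth token", counted from 0. */
    void addDocument(int docId, String[] tokens) {
        for (int pos = 0; pos < tokens.length; pos++) {
            postings.computeIfAbsent(tokens[pos], t -> new HashMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(pos);
        }
    }

    /** Positions of one term in one document only; empty if the term does not occur there. */
    List<Integer> positions(String term, int docId) {
        Map<Integer, List<Integer>> byDoc = postings.get(term);
        if (byDoc == null) {
            return Collections.<Integer>emptyList();
        }
        List<Integer> found = byDoc.get(docId);
        if (found == null) {
            return Collections.<Integer>emptyList();
        }
        return Collections.unmodifiableList(found);
    }
}
```

The lookup touches only the postings of the requested term and document, which is the property a position-based highlighter would rely on.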

Kevin


Re: Performance of hit highlighting and finding term positions for a specific document

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Mar 30, 2004, at 7:56 PM, Kevin A. Burton wrote:
> Trying to do hit highlighting.  This implementation uses another 
> Analyzer to find the positions for the result terms.
> This seems that it's very inefficient since lucene already knows the 
> frequency and position of given terms in the index.

What if the original analyzer removed stop words, stemmed, and 
injected synonyms?

> Also it seems that after all this time that Lucene should have 
> efficient hit highlighting as a standard package.  Is there any 
> interest in seeing a contribution in the sandbox for this if it uses 
> the index positions?

Big +1, regardless of the implementation details.  Hit highlighting is so 
commonly requested that having it available at least in the sandbox, or 
perhaps even in the core, makes a lot of sense.

	Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

RE: Performance of hit highlighting and finding term positions for a specific document

Posted by Jochen Frey <lu...@quontis.com>.

> Several solutions have been proposed.  The simplest is to not scan past
> the first 10k or so for snippets unless nothing relevant is found in the
> first 10k.  I don't think Mark's highlighter yet does this, but I might
> be mistaken.
> 
> > since lucene already knows the
> > frequency and position of given terms in the index.
> 
> Lucene indexes record that a term is the nth term, not that it occurs at
> the nth character in the text.  The latter is needed for highlighting,
> but storing this would make indexes much larger and slower to update.
> 

None of the solutions that I know of (other than re-parsing) work for
us (for us the highlighting must be confined to exactly one sentence), and
even though we are desperate to have something smarter, we would not want to
lose the benefits of super small and fast indexes.
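
One way to confine a highlight to a single sentence, once the character offset of a match is known, is to ask the JDK's java.text.BreakIterator for the enclosing sentence boundaries. A minimal sketch, assuming English text; SentenceWindow and enclosingSentence are hypothetical names, and how the match offset is obtained (re-parsing or stored offsets) is left open.

```java
import java.text.BreakIterator;
import java.util.Locale;

/** Hypothetical helper: finds the sentence around a known match offset. */
final class SentenceWindow {
    /** Returns the trimmed sentence that contains the character at matchOffset. */
    static String enclosingSentence(String text, int matchOffset) {
        if (matchOffset < 0 || matchOffset >= text.length()) {
            throw new IllegalArgumentException("offset outside text: " + matchOffset);
        }
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        sentences.setText(text);
        int end = sentences.following(matchOffset); // first boundary after the match
        int start = sentences.previous();           // boundary just before that one
        return text.substring(start, end).trim();
    }
}
```

The result is trimmed because sentence boundaries reported by BreakIterator typically include the whitespace that follows the sentence.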

We have pondered developing a package (but don't currently have the time)
that would store token locations outside of the Lucene core, and hacking
Lucene to get token ids.

Sorry, no real solutions here; I guess this post is a +1 for keeping indexes
small and fast, and a +1 for this being a real problem without a perfect
solution (yet).

Jochen




Re: Performance of hit highlighting and finding term positions for a specific document

Posted by Doug Cutting <cu...@apache.org>.

Kevin A. Burton wrote:
> I'm playing with this package:
> 
> http://home.clara.net/markharwood/lucene/highlight.htm
> 
> Trying to do hit highlighting.  This implementation uses another 
> Analyzer to find the positions for the result terms.
> This seems that it's very inefficient

Does it just seem inefficient, or is it actually too inefficient in 
practice?  Folks have benchmarked this, and, for documents less than 10k 
characters or so, re-tokenizing is fast enough.  But it can be slow if 
the majority of your documents are longer than this.

Several solutions have been proposed.  The simplest is to not scan past 
the first 10k or so for snippets unless nothing relevant is found in the 
first 10k.  I don't think Mark's highlighter yet does this, but I might 
be mistaken.
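
The bounded scan described above could look roughly like the following. It is a sketch, not code from any existing highlighter: BoundedScan, hitOffsets, and the whitespace-and-punctuation tokenization are assumptions made for illustration, and the limit is a parameter rather than a fixed 10k.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper: looks for query-term hits in a leading window first. */
final class BoundedScan {
    /**
     * Offsets of query-term hits inside the first `limit` characters; the rest
     * of the text is scanned only when that window contains no hit at all.
     */
    static List<Integer> hitOffsets(String text, Set<String> terms, int limit) {
        int cut = Math.max(0, Math.min(limit, text.length()));
        // Extend the cut to the end of the current word so no token is split in two.
        while (cut < text.length() && Character.isLetterOrDigit(text.charAt(cut))) {
            cut++;
        }
        List<Integer> hits = scan(text, 0, cut, terms);
        return hits.isEmpty() ? scan(text, cut, text.length(), terms) : hits;
    }

    /** Lowercases each run of letters or digits in [from, to) and records matching start offsets. */
    private static List<Integer> scan(String text, int from, int to, Set<String> terms) {
        List<Integer> hits = new ArrayList<>();
        int i = from;
        while (i < to) {
            if (!Character.isLetterOrDigit(text.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            while (i < to && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            if (terms.contains(text.substring(start, i).toLowerCase(Locale.ROOT))) {
                hits.add(start);
            }
        }
        return hits;
    }
}
```

The trade-off is that a better snippet further into a long document is never considered once the leading window yields any hit.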

> since lucene already knows the 
> frequency and position of given terms in the index.

Lucene indexes record that a term is the nth term, not that it occurs at 
the nth character in the text.  The latter is needed for highlighting, 
but storing this would make indexes much larger and slower to update.
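
The difference between a term position and a character offset is easy to see in a small tokenizer that records both. This is an illustrative sketch; OffsetDemo and its Token class are hypothetical and unrelated to Lucene's own token classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical demo: records both the ordinal position and the character span of each token. */
final class OffsetDemo {
    static final class Token {
        final String term;
        final int position;    // "this is the nth term", counted from 0
        final int startOffset; // index of the first character
        final int endOffset;   // index one past the last character

        Token(String term, int position, int startOffset, int endOffset) {
            this.term = term;
            this.position = position;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /** Splits on anything that is not a letter or digit. */
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (!Character.isLetterOrDigit(text.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            tokens.add(new Token(text.substring(start, i), tokens.size(), start, i));
        }
        return tokens;
    }
}
```

For the text "Lucene is fast", the token "fast" has position 2 but spans characters 10 to 14. A highlighter needs the span, and in this sketch keeping it costs two extra integers per token occurrence on top of the position.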

Doug


Re: RE : Performance of hit highlighting and finding term positions for a specific document

Posted by "Kevin A. Burton" <bu...@newsmonster.org>.

Rasik Pandey wrote:

>Hello,
>
>  
>
>>I've been meaning to look into good ways to store token offset
>>information to allow for very
>>efficient highlighting and I believe Mark may also be looking
>>into improving the highlighter via
>>other means such as temporary ram indexes. Search the archives
>>to get a background on some of the
>>ideas we've tossed around ('Dmitry's Term Vector stuff, plus
>>some' and 'Demoting results' come to
>>mind as threads that touch this topic).
>>    
>>
>
>It would be nice if CachingRewrittenQueryWrapper.java, which I sent to lucene-dev last week (see below), became part of these highlighting efforts, if appropriate. We use it to collect the terms for a query that searches over multiple indices.
>  
>
Actually I had to write one for my tests with the highlighter. I'm using 
a MultiSearcher and a WildcardQuery which the highlighter didn't have 
support for. 

My impl was fairly basic so I wouldn't suggest a contribution... I'm 
sure yours is better.  The suggested changes to the highlighter for 
providing tokens would make this work well together.

Kevin


RE : Performance of hit highlighting and finding term positions for a specific document

Posted by Rasik Pandey <ra...@ajlsm.com>.

Hello,

> I've been meaning to look into good ways to store token offset
> information to allow for very
> efficient highlighting and I believe Mark may also be looking
> into improving the highlighter via
> other means such as temporary ram indexes. Search the archives
> to get a background on some of the
> ideas we've tossed around ('Dmitry's Term Vector stuff, plus
> some' and 'Demoting results' come to
> mind as threads that touch this topic).

It would be nice if CachingRewrittenQueryWrapper.java, which I sent to lucene-dev last week (see below), became part of these highlighting efforts, if appropriate. We use it to collect the terms for a query that searches over multiple indices.
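
The idea of collecting the terms that a query expands to in each index can be illustrated without Lucene. The sketch below assumes each index exposes a naturally sorted term dictionary and handles only trailing-* wildcards; TermCollector and collect are hypothetical names and are not part of CachingRewrittenQueryWrapper.

```java
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical helper: expands a wildcard per index and keeps the union of matching terms. */
final class TermCollector {
    /**
     * Expands a trailing-* wildcard against each index's term dictionary.
     * Dictionaries are assumed to use natural String ordering.
     */
    static SortedSet<String> collect(String wildcard, List<SortedSet<String>> dictionaries) {
        if (!wildcard.endsWith("*")) {
            throw new IllegalArgumentException("only trailing-* patterns are handled");
        }
        String prefix = wildcard.substring(0, wildcard.length() - 1);
        SortedSet<String> collected = new TreeSet<>();
        for (SortedSet<String> dictionary : dictionaries) {
            // tailSet jumps to the first term >= prefix; stop once terms no longer share it.
            for (String term : dictionary.tailSet(prefix)) {
                if (!term.startsWith(prefix)) {
                    break;
                }
                collected.add(term);
            }
        }
        return collected;
    }
}
```

Because each index can contain different terms, the same wildcard expands differently per index, and a highlighter needs the union to mark every matching term.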

Regards,
RBP





> -----Original Message-----
> From: Rasik Pandey [mailto:rasik.pandey@ajlsm.com]
> Sent: Wednesday, 17 March 2004 13:36
> To: 'Lucene Developers List'; korfut@lycos.com
> Subject: RE: Query Term Collector (was: Re: New highlighter
> package available)
> 
> Hello All,
> 
> I don't know how this Thread/issue was resolved, but if you are
> still interested I have a simple way of doing this term
> collection ONLY at query time. I've tested it and it works with
> highlighting, etc. without the extra rewrite() call on the
> index.
> 
> Comments are welcome!
> 
> 
> package org.apache.lucene.search;
> 
> import org.apache.lucene.search.Weight;
> import org.apache.lucene.search.Searcher;
> import org.apache.lucene.search.Query;
> import org.apache.lucene.search.Similarity;
> import org.apache.lucene.index.IndexReader;
> 
> import java.io.IOException;
> 
> /*Rasik Pandey rasik.pandey@ajlsm.com*/
> /**Simple wrapper for a Lucene query that
>  * collects all queries generated by calling
>  * rewrite on the original Lucene query and stores
>  * them in a BooleanQuery.
>  *
>  * A Searcher will call the rewrite() method
>  * for each index and hence generate a query
>  * containing terms for the respective index. This
>  * class collects these queries so that they may be
>  * used for highlighting, query expansion, etc. by
>  * retrieving the underlying terms.
>  *
>  * @see #rewrite
>  * @see #getRewrittenQueries
>  * @see #resetRewrittenQueries
>  * @see #getOriginalQuery
>  */
> public class CachingRewrittenQueryWrapper extends Query{
>     protected org.apache.lucene.search.Query originalQuery = null;
>     protected BooleanQuery rewrittenQueries = new BooleanQuery();
> 
>     public CachingRewrittenQueryWrapper(Query originalQuery) {
>         this.originalQuery = originalQuery;
>     }
> 
>     public BooleanQuery getRewrittenQueries() {
>         return this.rewrittenQueries;
>     }
> 
>     public void resetRewrittenQueries() {
>         BooleanQuery newCachedQuery = new BooleanQuery();
> 
>         newCachedQuery.setMaxClauseCount(this.rewrittenQueries.getMaxClauseCount());
>         this.rewrittenQueries = newCachedQuery;
>     }
> 
>     public Query getOriginalQuery() {
>         return this.originalQuery;
>     }
> 
>     public void setBoost(float b) {
>         this.originalQuery.setBoost(b);
>     }
> 
>     public float getBoost() {
>         return this.originalQuery.getBoost();
>     }
> 
> 
>     protected Weight createWeight(Searcher searcher) {
>         return this.originalQuery.createWeight(searcher);
>     }
> 
>     public Query rewrite(IndexReader reader) throws IOException {
>         Query rewrittenQuery = this.originalQuery.rewrite(reader);
>         this.rewrittenQueries.add(rewrittenQuery, false, false);
>         return rewrittenQuery;
>     }
> 
>     public Query combine(Query[] queries) {
>         return this.originalQuery.combine(queries);
>     }
> 
>     public Similarity getSimilarity(Searcher searcher) {
>         return this.originalQuery.getSimilarity(searcher);
>     }
> 
>     protected void finalize() throws Throwable {
>         super.finalize();
>         //TODO maybe something here to ensure that all resources held by rewrittenQueries are cleaned up properly
>     }
> 
>     public String toString() {
>         return this.originalQuery.toString();
>     }
> 
>     public String toString(String field) {
>        return this.originalQuery.toString(field);
>     }
> }





Re: Performance of hit highlighting and finding term positions for a specific document

Posted by Bruce Ritchie <br...@jivesoftware.com>.

Kevin A. Burton wrote:

> I'm playing with this package:
> 
> http://home.clara.net/markharwood/lucene/highlight.htm
> 
> Trying to do hit highlighting.  This implementation uses another 
> Analyzer to find the positions for the result terms.
> This seems that it's very inefficient since lucene already knows the 
> frequency and position of given terms in the index.
> 
> My question is whether it's hard to find a TermPosition for a given term 
> in a given document rather than the whole index.
> 
> IndexReader.termPositions( Term term ) is term specific not term and 
> document specific.

As far as I know it's not currently possible to get this information from a standard lucene index.

> Also it seems that after all this time that Lucene should have efficient 
> hit highlighting as a standard package.  Is there any interest in seeing 
> a contribution in the sandbox for this if it uses the index positions?

I've been meaning to look into good ways to store token offset information to allow for very 
efficient highlighting and I believe Mark may also be looking into improving the highlighter via 
other means such as temporary ram indexes. Search the archives to get a background on some of the 
ideas we've tossed around ('Dmitry's Term Vector stuff, plus some' and 'Demoting results' come to 
mind as threads that touch this topic).

Regards,

Bruce Ritchie
http://www.jivesoftware.com/

Re: RE : Performance of hit highlighting and finding term positions for a specific document

Posted by "Kevin A. Burton" <bu...@newsmonster.org>.

Rasik Pandey wrote:

>Kevin,
>
>  
>
>>http://home.clara.net/markharwood/lucene/highlight.htm
>>
>>Trying to do hit highlighting.  This implementation uses
>>another
>>Analyzer to find the positions for the result terms.
>>
>>This seems that it's very inefficient since lucene already
>>knows the
>>frequency and position of given terms in the index.
>>    
>>
>
>Can you explain in more detail what you mean here?
>
It uses the StandardAnalyzer again to re-tokenize the document text... when it 
finds a token that matched the search request, it highlights it.

It works... it's just not very efficient.
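
A stripped-down version of this re-analysis approach might look like the sketch below. It is not the highlighter package's code: ReanalyzingHighlighter, its tiny stop-word list, the lowercasing "analyzer", and the <B> markers are all assumptions chosen for illustration, and no HTML escaping is attempted.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical highlighter: re-tokenizes the stored text and marks tokens matching query terms. */
final class ReanalyzingHighlighter {
    // Assumed stop words; a real analyzer would supply its own list.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "of"));

    /** queryTerms must already be in analyzed (lowercased) form. */
    static String highlight(String text, Set<String> queryTerms) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            if (!Character.isLetterOrDigit(text.charAt(i))) {
                out.append(text.charAt(i++)); // copy separators through unchanged
                continue;
            }
            int start = i;
            while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            String surface = text.substring(start, i);
            // Apply the same analysis as at index time before comparing.
            String term = surface.toLowerCase(Locale.ROOT);
            boolean hit = !STOP_WORDS.contains(term) && queryTerms.contains(term);
            out.append(hit ? "<B>" + surface + "</B>" : surface);
        }
        return out.toString();
    }
}
```

The whole text is walked for every hit that is displayed, which is where the cost comes from. The comparison is made on the analyzed form of each token, so the analysis here has to match what was used at index time.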

Kevin


RE : Performance of hit highlighting and finding term positions for a specific document

Posted by Rasik Pandey <ra...@ajlsm.com>.

Kevin,

> http://home.clara.net/markharwood/lucene/highlight.htm
> 
> Trying to do hit highlighting.  This implementation uses
> another
> Analyzer to find the positions for the result terms.
> 
> This seems that it's very inefficient since lucene already
> knows the
> frequency and position of given terms in the index.

Can you explain in more detail what you mean here?


Regards,
RBP







Re: Performance of hit highlighting and finding term positions for a specific document

Posted by Stephane James Vaucher <va...@cirano.qc.ca>.

I agree with you that a highlight package should be available directly 
from the Lucene website. Depending on a personal web site for such a 
much-desired feature seems a little weird to me. Hosting it with the 
project would also oblige the community to support this functionality, 
which seems appropriate.

cheers,
sv

On Tue, 30 Mar 2004, Kevin A. Burton wrote:

> I'm playing with this package:
> 
> http://home.clara.net/markharwood/lucene/highlight.htm
> 
> Trying to do hit highlighting.  This implementation uses another 
> Analyzer to find the positions for the result terms. 
> 
> This seems that it's very inefficient since lucene already knows the 
> frequency and position of given terms in the index.
> 
> My question is whether it's hard to find a TermPosition for a given term 
> in a given document rather than the whole index.
> 
> IndexReader.termPositions( Term term ) is term specific not term and 
> document specific.
> 
> Also it seems that after all this time that Lucene should have efficient 
> hit highlighting as a standard package.  Is there any interest in seeing 
> a contribution in the sandbox for this if it uses the index positions?
> 
> 

