Posted to dev@lucene.apache.org by David Kaelbling <dk...@blackducksoftware.com> on 2008/04/23 22:15:48 UTC

SpanScorer handling of non-disjoint phrases

Hi,

I've been using the 2.3.1 contrib highlighter with the 2/10/2008
SpanHighlighter patch, and have run into some trouble.  If I have two
phrases in a query that share terms (e.g. "hello world" and "hello
goodbye") the SpanScorer seems to not highlight 'hello' consistently.

It looks to me like WeightedSpanTermExtractor.extract() is clobbering
the span positions for 'hello' the second time it encounters the term.
Should terms.putAll(booleanTerms) and terms.putAll(disjunctTerms) really
be replacing the old entry, or should they try to addPositionSpans()?

        Thanks,
        David

PS: And while I'm asking, it looks like getWeightedSpanTermsWithScores()
will wrap the cachingTokenFilter passed it by SpanScorer.init() into
another CachingTokenFilter, duplicating the cache?

-- 
David Kaelbling
Senior Software Engineer
Black Duck Software, Inc.

dkaelbling@blackducksoftware.com
T +1.781.810.2041
F +1.781.891.5145

http://www.blackducksoftware.com



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: SpanScorer handling of non-disjoint phrases

Posted by Mark Miller <ma...@gmail.com>.

On Wed, 2008-04-23 at 16:15 -0400, David Kaelbling wrote:
> Hi,
> 
> I've been using the 2.3.1 contrib highlighter with the 2/10/2008
> SpanHighlighter patch, and have run into some trouble.  If I have two
> phrases in a query that share terms (e.g. "hello world" and "hello
> goodbye") the SpanScorer seems to not highlight 'hello' consistently.
> 
> It looks to me like WeightedSpanTermExtractor.extract() is clobbering
> the span positions for 'hello' the second time it encounters the term.
> Should terms.putAll(booleanTerms) and terms.putAll(disjunctTerms) really
> be replacing the old entry, or should they try to addPositionSpans()?
> 
>         Thanks,
>         David
> 
> PS: And while I'm asking, it looks like getWeightedSpanTermsWithScores()
> will wrap the cachingTokenFilter passed it by SpanScorer.init() into
> another CachingTokenFilter, duplicating the cache?
> 

Hmmm...reminds me of an early dev bug I thought I added a test case for
and fixed.

I will take a look as soon as I can.

- mark



Re: SpanScorer handling of non-disjoint phrases

Posted by Mark Miller <ma...@gmail.com>.

Hmmm...my quick test of a query with two phrases and a common term
appeared to work correctly. Could you submit an example that
demonstrates the failure or perhaps shed some further light on the
problem?

As to your P.S. question, you are right...that particular method was
needlessly re-wrapping the stream. I have fixed it now, thanks for
pointing it out.
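
For illustration only, here is a minimal sketch of the kind of guard that
avoids double wrapping. The thread does not show the actual fix, so this is
an assumption about its general shape. TokenStreamSketch, CachingFilterSketch
and ensureCaching are hypothetical stand-ins, not the real Lucene classes or
the patch's code.

```java
// Hypothetical stand-ins for Lucene's TokenStream and CachingTokenFilter,
// reduced to what is needed to show the wrapping check.
class TokenStreamSketch {}

class CachingFilterSketch extends TokenStreamSketch {
    final TokenStreamSketch input;

    CachingFilterSketch(TokenStreamSketch input) {
        this.input = input;
    }
}

final class CachingWrap {
    private CachingWrap() {}

    /** Returns the stream itself if it already caches, otherwise wraps it once. */
    static CachingFilterSketch ensureCaching(TokenStreamSketch stream) {
        if (stream instanceof CachingFilterSketch) {
            // Already a caching filter: reuse it rather than stacking a second cache.
            return (CachingFilterSketch) stream;
        }
        return new CachingFilterSketch(stream);
    }
}
```

With a check like this, a stream that SpanScorer.init() has already wrapped
would be passed through unchanged instead of being cached a second time.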

- Mark



Re: SpanScorer handling of non-disjoint phrases

Posted by David Kaelbling <dk...@blackducksoftware.com>.

On Wed, 23 Apr 2008 at 21:58:41 -0400, Mark Miller wrote:
>
> Hmmm...my quick test of a query with two phrases and a common term 
> appeared to work correctly. Could you submit an example that
> demonstrates the failure or perhaps shed some further light on the
> problem?

Hi,

"Already fixed" is entirely possible!  I'm using an old snapshot from
2/10/2008, and the code I was looking at (in WeightedSpanTermExtractor) doesn't
seem to exist any more -- maybe it mutated into QueryTermExtractor?  Anyway
the query was:

+contents:"hello world" +contents:1.0 +(contents:movie contents:"hello dolly 1.0")

The WeightedSpanTermExtractor code looked like this:

    if (query instanceof BooleanQuery) {
      BooleanClause[] queryClauses = ((BooleanQuery) query).getClauses();
      Map booleanTerms = new HashMap();
      for (int i = 0; i < queryClauses.length; i++) {
        if (!queryClauses[i].isProhibited()) {
          extract(queryClauses[i].getQuery(), booleanTerms);
        }
      }
      terms.putAll(booleanTerms);
    } else if (query instanceof PhraseQuery) { ...

If a term in 'booleanTerms' was already in the terms map, putAll discarded
the old value.  I had to tweak this to merge the maps instead: if both the
old and new entries are position sensitive, combine their position spans;
otherwise keep the position-insensitive WeightedSpanTerm.
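
A minimal sketch of that merge, assuming a map keyed on the term text. The
thread does not include the actual tweak, so SpanTermSketch (a stand-in for
the patch's WeightedSpanTerm), its fields, and mergeInto are hypothetical
names chosen for illustration, and modern generics are used for brevity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the patch's WeightedSpanTerm; not the real class.
final class SpanTermSketch {
    final String term;
    final boolean positionSensitive;
    // Each entry is a {start, end} position pair.
    final List<int[]> positionSpans = new ArrayList<>();

    SpanTermSketch(String term, boolean positionSensitive) {
        this.term = term;
        this.positionSensitive = positionSensitive;
    }

    SpanTermSketch addSpan(int start, int end) {
        positionSpans.add(new int[] {start, end});
        return this;
    }
}

final class TermMapMerger {
    private TermMapMerger() {}

    /** Merges incoming into terms without discarding spans already recorded. */
    static void mergeInto(Map<String, SpanTermSketch> terms,
                          Map<String, SpanTermSketch> incoming) {
        for (Map.Entry<String, SpanTermSketch> e : incoming.entrySet()) {
            SpanTermSketch added = e.getValue();
            SpanTermSketch existing = terms.get(e.getKey());
            if (existing == null) {
                terms.put(e.getKey(), added);
            } else if (existing.positionSensitive && added.positionSensitive) {
                // Both are phrase-bound: keep the spans from both phrases.
                existing.positionSpans.addAll(added.positionSpans);
            } else if (existing.positionSensitive) {
                // The new entry matches at any position, so it takes over.
                terms.put(e.getKey(), added);
            }
            // Otherwise the existing entry is already position insensitive; keep it.
        }
    }
}
```

Calling TermMapMerger.mergeInto(terms, booleanTerms) in place of
terms.putAll(booleanTerms) would keep the spans for 'hello' from both
phrases in this simplified model.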

If you're using a HashSet of WeightedTerms rather than a Map keyed on Terms, 
the collision I experienced may not happen.

	Thanks,
	David
