Posted to dev@lucene.apache.org by Dawid Weiss <da...@gmail.com> on 2017/05/22 11:03:57 UTC

Span query positions screwed up in Solr (field with WordDelimiterGraphFilter)

I have this curious situation which puzzles me. There is a field with
WDGF set up (flattening filter at indexing time). The input document
has this:

AAA,BBB CCC - DDD

This is indexed as follows:

term => pos
AAA,BBB  => 1
AAA => 1
BBB => 2
CCC => 3
- => 4
DDD => 6

(note the absence of any term at position '5').
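To make the position arithmetic concrete, here is a minimal Java sketch. It is not Lucene code: the class and method names are made up for illustration, and the position increments (1, 0, 1, 1, 1, 2) are an assumption reconstructed from the absolute positions listed above, counted so that the first token lands on position 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PositionGapDemo {
    /** Turns per-token position increments into absolute positions (first token lands on 1). */
    static int[] positions(int[] increments) {
        int[] out = new int[increments.length];
        int pos = 0;
        for (int i = 0; i < increments.length; i++) {
            pos += increments[i];   // an increment of 0 stacks a token on the previous one
            out[i] = pos;
        }
        return out;
    }

    /** Lists the positions between consecutive tokens that no token occupies. */
    static List<Integer> emptyPositions(int[] positions) {
        List<Integer> empty = new ArrayList<>();
        for (int i = 1; i < positions.length; i++) {
            for (int p = positions[i - 1] + 1; p < positions[i]; p++) {
                empty.add(p);
            }
        }
        return empty;
    }

    public static void main(String[] args) {
        // Assumed increments for: AAA,BBB  AAA  BBB  CCC  -  DDD
        int[] pos = positions(new int[] {1, 0, 1, 1, 1, 2});
        System.out.println(Arrays.toString(pos) + " empty=" + emptyPositions(pos));
    }
}
```

An increment of 2 on the last token is what leaves position 5 unoccupied in this model.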

The query analyzer for:

field:"AAA,BBB CCC - DDD"

returns the same token sequence, but the debug query in Solr yields:

+SpanNearQuery(
  spanNear([
    spanOr([
      field:AAA,BBB,
      spanNear([field:AAA, field:BBB], 0, true)
    ]),
    funding_program:CCC,
    funding_program:-,
    funding_program:DDD], 0, true)
)

which doesn't match the input document (because of the skipped position).
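The mismatch can be reproduced with a toy model of an ordered, zero-slop phrase match. This is a simplified sketch rather than Lucene's span matching: it keeps a single position per term, ignores the stacked "AAA,BBB" token, and uses hypothetical class and method names.

```java
import java.util.HashMap;
import java.util.Map;

final class ContiguousPhraseDemo {
    /** Toy index: one position per term, taken from the listing in the post. */
    static Map<String, Integer> exampleIndex() {
        Map<String, Integer> index = new HashMap<>();
        index.put("AAA", 1);
        index.put("BBB", 2);
        index.put("CCC", 3);
        index.put("-", 4);
        index.put("DDD", 6);
        return index;
    }

    /** Ordered match with slop 0: each term must sit exactly one position after the previous one. */
    static boolean matchesContiguously(Map<String, Integer> index, String... terms) {
        Integer previous = null;
        for (String term : terms) {
            Integer position = index.get(term);
            if (position == null) {
                return false;               // term is not indexed at all
            }
            if (previous != null && position != previous + 1) {
                return false;               // a skipped position breaks the chain
            }
            previous = position;
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Integer> index = exampleIndex();
        System.out.println(matchesContiguously(index, "AAA", "BBB", "CCC", "-", "DDD"));
        System.out.println(matchesContiguously(index, "AAA", "BBB", "CCC", "-"));
    }
}
```

In this model the full phrase fails only at the step from "-" (position 4) to "DDD" (position 6); the prefix up to "-" still matches.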

Does anyone have a clue where the problem may originate?

Dawid

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: Span query positions screwed up in Solr (field with WordDelimiterGraphFilter)

Posted by Dawid Weiss <da...@gmail.com>.

Thanks for the follow-up guys. I'll comment on the issue.
Dawid

On Tue, May 23, 2017 at 4:03 PM, jim ferenczi <ji...@gmail.com> wrote:
> [...]

Re: Span query positions screwed up in Solr (field with WordDelimiterGraphFilter)

Posted by jim ferenczi <ji...@gmail.com>.

Nice, thanks David!

Dawid, I opened https://issues.apache.org/jira/browse/LUCENE-7848
However, when I try to replicate it, I get the following output:
Term => pos
AAA,BBB  => 1
AAA => 1
BBB => 2
CCC => 3
DDD => 5

Can you comment on the issue with the options you used for the WDGF?

2017-05-23 14:43 GMT+02:00 David Smiley <da...@gmail.com>:

> [...]

Re: Span query positions screwed up in Solr (field with WordDelimiterGraphFilter)

Posted by David Smiley <da...@gmail.com>.

Jim,
With SpanNearQuery (phrase), gaps need not be simulated with slop; you can get the real thing.  See SpanNearQuery.addGap(int width).  Perhaps you've seen outdated code in places like WeightedSpanTermExtractor that predated the existence of addGap.
~ David
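The difference between an explicit gap and slop can be shown with a small stand-alone model. It does not call Lucene (where, in the versions I am aware of, addGap is a method of the SpanNearQuery.Builder returned by SpanNearQuery.newOrderedNearQuery(field)); the class below, its name, and the gapsBefore array are hypothetical and only mimic the idea of a gap pinned to one place in the phrase.

```java
final class ExactGapMatchDemo {
    /**
     * positions[i]  = document position of the i-th phrase term.
     * gapsBefore[i] = number of empty positions required between term i-1 and term i
     *                 (gapsBefore[0] is unused).
     * Matches only if every step has exactly the required width.
     */
    static boolean matches(int[] positions, int[] gapsBefore) {
        for (int i = 1; i < positions.length; i++) {
            int expectedStep = 1 + gapsBefore[i];
            if (positions[i] - positions[i - 1] != expectedStep) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[] gaps = {0, 0, 0, 0, 1};                 // one empty position before DDD
        int[] indexed = {1, 2, 3, 4, 6};              // AAA BBB CCC - _ DDD
        int[] shifted = {1, 2, 4, 5, 6};              // AAA BBB _ CCC - DDD
        System.out.println(matches(indexed, gaps));
        System.out.println(matches(shifted, gaps));
    }
}
```

Because the gap is tied to the step before DDD, a document whose empty position falls elsewhere is rejected in this model.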

> On May 23, 2017, at 8:24 AM, jim ferenczi <ji...@gmail.com> wrote:
> [...]

Re: Span query positions screwed up in Solr (field with WordDelimiterGraphFilter)

Posted by jim ferenczi <ji...@gmail.com>.

Hi Dawid,
This is indeed related to the changes made to QueryBuilder in the presence of a
graph token stream. We moved to SpanQueries to handle quoted graph token
streams in this change:
https://issues.apache.org/jira/browse/LUCENE-7699
So instead of building all paths in the graph, we try to build a single
optimized span query, but gaps are ignored, as you noticed in your analysis.
We could simulate the gaps with the slop factor of the SpanQueries, but it
would not be 100% accurate. I think it's a common problem when converting
a PhraseQuery into a SpanQuery.
However, I don't understand whether it's the WDGF that creates the gap or the
flattening filter. In fact, I don't understand why there is a gap in your
input. Is it expected?
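Why slop would only approximate the gap can be sketched with a toy ordered-near matcher. This is a simplified model, not Lucene's span implementation: it treats slop as the total number of skipped positions across the whole phrase, and the class and method names are invented for the example.

```java
final class SlopSimulationDemo {
    /**
     * Ordered near match over the positions of the phrase terms:
     * positions must strictly increase, and the total number of
     * skipped positions must not exceed the allowed slop.
     */
    static boolean matchesWithSlop(int[] positions, int slop) {
        int skipped = 0;
        for (int i = 1; i < positions.length; i++) {
            int step = positions[i] - positions[i - 1];
            if (step < 1) {
                return false;            // out of order or overlapping
            }
            skipped += step - 1;
        }
        return skipped <= slop;
    }

    public static void main(String[] args) {
        int[] indexed = {1, 2, 3, 4, 6};   // AAA BBB CCC - _ DDD (gap before DDD)
        int[] other   = {1, 2, 4, 5, 6};   // AAA BBB _ CCC - DDD (gap before CCC)
        System.out.println(matchesWithSlop(indexed, 0));
        System.out.println(matchesWithSlop(indexed, 1));
        System.out.println(matchesWithSlop(other, 1));
    }
}
```

With slop 1 the indexed document matches, but so does a document whose empty position sits somewhere else in the phrase, which is the kind of inaccuracy slop introduces in this model.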


2017-05-23 10:23 GMT+02:00 Dawid Weiss <da...@gmail.com>:

> [...]

Re: Span query positions screwed up in Solr (field with WordDelimiterGraphFilter)

Posted by Dawid Weiss <da...@gmail.com>.

This seems to be related to (or broken by) LUCENE-7638, which Jim worked on
quite recently, actually.

https://issues.apache.org/jira/browse/LUCENE-7638

Dawid

On Tue, May 23, 2017 at 10:20 AM, Dawid Weiss <da...@gmail.com> wrote:
> [...]

Re: Span query positions screwed up in Solr (field with WordDelimiterGraphFilter)

Posted by Dawid Weiss <da...@gmail.com>.

I looked a bit deeper into this and it seems to be a Lucene
QueryBuilder bug. Or me not understanding how graph token streams
should be handled. The situation is as follows:

1. Solr calls Lucene's QueryBuilder.createFieldQuery for "AAA,BBB CCC - DDD".
2. QueryBuilder sees that the query is a graph and that it's quoted,
so it invokes analyzeGraphPhrase, here:

      } else if (isGraph) {
        // graph
        if (quoted) {
          return analyzeGraphPhrase(stream, field, phraseSlop);
        } else {
          return analyzeGraphBoolean(field, stream, operator);
        }

Note that even though this is a quoted query, it is analyzed and the
resulting token stream has a "position gap".

3. analyzeGraphPhrase creates a span query for the side path:

SpanQuery q = createSpanQuery(ts, field);

but createSpanQuery doesn't look at positions at all; it just assumes
they're contiguous:

      return new SpanNearQuery(terms.toArray(new SpanTermQuery[0]), 0, true);

Looks to me like the created span query should try to correctly
determine the slop factor and not assume a contiguous token sequence?
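A position-aware variant could look roughly like the sketch below. It is not the QueryBuilder code: it is a stand-alone model with hypothetical names that walks a single, non-branching token path, assumes the position increments are available for each token, and shows two options: emitting a gap marker where positions were skipped, or folding the skipped positions into a single slop value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class GapAwareClauseSketch {
    /**
     * Emits the term clauses in order and inserts a "gap(n)" marker wherever a
     * position increment greater than 1 shows that n = increment - 1 positions
     * were skipped before the term.
     */
    static List<String> clauses(String[] terms, int[] increments) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < terms.length; i++) {
            int skipped = increments[i] - 1;
            if (i > 0 && skipped > 0) {      // a leading gap is ignored in this sketch
                out.add("gap(" + skipped + ")");
            }
            out.add(terms[i]);
        }
        return out;
    }

    /** The alternative: fold all skipped positions into one slop value. */
    static int slopFor(int[] increments) {
        int slop = 0;
        for (int i = 1; i < increments.length; i++) {
            slop += Math.max(0, increments[i] - 1);
        }
        return slop;
    }

    public static void main(String[] args) {
        String[] terms = {"AAA", "BBB", "CCC", "-", "DDD"};
        int[] increments = {1, 1, 1, 1, 2};   // assumed from the positions in the original post
        System.out.println(clauses(terms, increments) + " slop=" + slopFor(increments));
    }
}
```

The gap marker keeps the information about where the skipped position is, while the slop value only records how many positions were skipped in total.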

Dawid

On Mon, May 22, 2017 at 1:03 PM, Dawid Weiss <da...@gmail.com> wrote:
> [...]