Posted to java-user@lucene.apache.org by "Allison, Timothy B." <ta...@mitre.org> on 2016/06/20 19:53:44 UTC

RE: SpanQuery - How to wrap a NOT subquery

Bouncing over to user’s list.

As you’ve found, spans are different from regular queries.  MUST_NOT at the BooleanQuery level means that the term must not appear anywhere in the document; whereas spans focus on terms near each other.

Have you tried SpanNotQuery?  This would allow you at least to do something like:

termA but not if zyx or yyy appears X words before or Y words after
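
To make the difference concrete, here is a small model in plain Java. It does not call Lucene: the class and method names (NotSemanticsSketch, mustNotAnywhere, notNear) are made up for illustration, and the window rule only approximates what the pre and post arguments of SpanNotQuery express.

```java
import java.util.Arrays;
import java.util.List;

// Toy model only: plain Java over token lists, not Lucene code.
class NotSemanticsSketch {

    // Document-level NOT (what MUST_NOT expresses): the document matches if it
    // contains `term` and none of the excluded terms anywhere.
    static boolean mustNotAnywhere(List<String> tokens, String term, List<String> excluded) {
        if (!tokens.contains(term)) return false;
        for (String e : excluded) {
            if (tokens.contains(e)) return false;
        }
        return true;
    }

    // Proximity NOT: the document matches if some occurrence of `term` has no
    // excluded term within `pre` tokens before it or `post` tokens after it.
    static boolean notNear(List<String> tokens, String term, List<String> excluded, int pre, int post) {
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(term)) continue;
            int from = Math.max(0, i - pre);
            // Widen to long so a very large `post` cannot wrap around.
            int to = (int) Math.min((long) tokens.size() - 1, (long) i + post);
            boolean clean = true;
            for (int j = from; j <= to; j++) {
                if (j != i && excluded.contains(tokens.get(j))) {
                    clean = false;
                    break;
                }
            }
            if (clean) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("termA", "foo", "bar", "baz", "zyx");
        List<String> excluded = Arrays.asList("zyx", "yyy");
        System.out.println(mustNotAnywhere(doc, "termA", excluded)); // false: zyx is in the document
        System.out.println(notNear(doc, "termA", excluded, 2, 2));   // true: zyx is 4 tokens away
        System.out.println(notNear(doc, "termA", excluded, 2, 4));   // false: zyx falls inside the window
    }
}
```

The same document is rejected by the document-level check and accepted by the proximity check until the window is wide enough to reach the excluded term.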



From: Brandon Miller [mailto:computerengineer.brandon@gmail.com]
Sent: Monday, June 20, 2016 2:36 PM
To: dev@lucene.apache.org
Subject: SpanQuery - How to wrap a NOT subquery

Greetings!

I'm wanting to support this:
TermA within_N_terms_of (abc and cba or xyz and not zyx or not yyy)

Focusing on the sub-query:
I have ANDs and ORs figured out (special tricks playing with slops and such).

I'm having the hardest time figuring out how to wrap a NOT.

Outside of SpanQuery, I'm using a BooleanQuery with a MUST_NOT clause.  That's fine (if you know another way, I'd like to hear that, too, but this appears to work dandy).

However, SpanQuery requires its sub-queries to be of type SpanQuery as well; SpanMultiTermQueryWrapper will let you wrap anything derived from MultiTermQuery (which includes AutomatonQuery).

Right now, I'm at a loss.  We have huge, complex, nested boolean queries inside proximity operators with our current solution.

If I need to write a custom solution, then that's what I need to hear and perhaps a couple of pointers.

Thanks a bunch and God bless!

Brandon

Re: SpanQuery - How to wrap a NOT subquery

Posted by Brandon Miller <co...@gmail.com>.

Thank you!!

Okay, I think I have that all squared away.

*SpanLastQuery*:
I need something like SpanFirstQuery, except that it would be
SpanLastQuery.  Is there a way to get that to work?

*Proximity weighting getting ignored*:
I also need to get span term boosting working.
Here's my query:
"one thousand two hundred thirty" pre/5 (seven:5.32 or three:3 or two:2.9)
Here's the resulting Solr query:
spanNear([spanNear([field:one, field:thousand, field:two, field:hundred,
field:thirty], 0, true), spanOr([spanOr([(field:seven)^5.32,
(field:three)^3.0]), (field:two)^2.9])], 5, true)

It's returning
[1232, 1233, 1237]
Expected:
[1237, 1233, 1232]

Here's what the scoreDocs has to say about this search's results:
[doc=1232 score=5.903808 shardIndex=0,
 doc=1233 score=5.903808 shardIndex=0,
 doc=1237 score=5.903808 shardIndex=0]

Notice that the scores were all the exact same.
Why don't the boosts appear to be working?

Thank you!

On Tue, Jun 21, 2016 at 1:40 PM, Allison, Timothy B. <ta...@mitre.org>
wrote:

> >Awesome, 0 pre and 1 post works!
>
> Great!
>
> > What if I wanted to match thirty, but exclude if six or seven are
> > included anywhere in the document?
>
> Any time you need "anywhere in the document", use a "regular" query (not
> SpanQuery).  As you wrote initially, you can construct a BooleanQuery that
> includes a complex SpanQuery and another Query that is
> BooleanClause.Occur.MUST_NOT.
>
> > I also tried 0 pre and 0 post
> You'd use those if you wanted to find something that didn't contain
> something else:
>
> ["William Clinton"~2 Jefferson]!~0,0
>
> Find 'william' within two words of 'clinton', but not if 'jefferson'
> appears between them.
>
> > I replaced pre with Integer.MAX_VALUE and post with Integer.MAX_VALUE -
> > 5 and it works!
> I'll have to think about this one...
>
>

RE: SpanQuery - How to wrap a NOT subquery

Posted by "Allison, Timothy B." <ta...@mitre.org>.

>Awesome, 0 pre and 1 post works!

Great!

> What if I wanted to match thirty, but exclude if six or seven are included anywhere in the document?

Any time you need "anywhere in the document", use a "regular" query (not SpanQuery).  As you wrote initially, you can construct a BooleanQuery that includes a complex SpanQuery and another Query that is BooleanClause.Occur.MUST_NOT.

> I also tried 0 pre and 0 post
You'd use those if you wanted to find something that didn't contain something else: 

["William Clinton"~2 Jefferson]!~0,0

Find 'william' within two words of 'clinton', but not if 'jefferson' appears between them.
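
A rough way to picture the 0,0 case is to treat each match as a half-open [start, end) range of token positions. The sketch below is plain Java rather than Lucene, its names (SpanContainmentSketch, nearButNotContaining) are hypothetical, and it assumes an ordered match with 'william' before 'clinton', which may not be how the parser treats ~2.

```java
import java.util.Arrays;
import java.util.List;

// Toy model only: spans are half-open [start, end) token ranges, not Lucene objects.
class SpanContainmentSketch {

    // With pre = 0 and post = 0 the exclusion window is the candidate span itself,
    // so a candidate is rejected only when the excluded token sits inside it.
    static boolean containsExcluded(List<String> tokens, int start, int end, String excluded) {
        return tokens.subList(start, end).contains(excluded);
    }

    // Finds the first "first ... second" span with at most `slop` tokens in between
    // that does not contain `excluded`; returns {start, end} or null if none exists.
    static int[] nearButNotContaining(List<String> tokens, String first, String second,
                                      int slop, String excluded) {
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(first)) continue;
            for (int j = i + 1; j < tokens.size() && j - i - 1 <= slop; j++) {
                if (tokens.get(j).equals(second) && !containsExcluded(tokens, i, j + 1, excluded)) {
                    return new int[] {i, j + 1};
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> inside = Arrays.asList("william", "jefferson", "clinton");
        List<String> outside = Arrays.asList("william", "j", "clinton", "jefferson");
        System.out.println(nearButNotContaining(inside, "william", "clinton", 2, "jefferson") != null);  // false
        System.out.println(nearButNotContaining(outside, "william", "clinton", 2, "jefferson") != null); // true
    }
}
```

In the second document 'jefferson' sits directly after the span rather than inside it, so a 0,0 window does not reject the match.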

> I replaced pre with Integer.MAX_VALUE and post with Integer.MAX_VALUE - 5 and it works!
I'll have to think about this one...

Re: SpanQuery - How to wrap a NOT subquery

Posted by Brandon Miller <co...@gmail.com>.

Awesome, 0 pre and 1 post works!
I replaced pre with Integer.MAX_VALUE and post with Integer.MAX_VALUE - 5
and it works!

If I replace post with Integer.MAX_VALUE - 4 (or -3, -2, -1, -0), it fails.
But, if it's -(5+), it appears to work.

Thank you guys for suffering through my inexperience with Solr.

*NOTE*: In case someone would find it helpful to follow my reasoning before
I discovered the work-around above:

I don't understand why 0,1 works and Integer.MAX_VALUE, Integer.MAX_VALUE
doesn't.  I mean I know that six and seven both come one word after thirty *in
this case*.  This is case-dependent.  What if I wanted to match thirty, but
exclude if six or seven are included anywhere in the document?

How will I know what numbers to plug into pre and post when they could be
anywhere in the document?

In this case, those numbers worked.  Why didn't the big numbers work?
After all, six and seven were unique throughout the whole number (i.e. six
and seven were only at the end of the document).

I also tried 0 pre and 0 post, but that gave me the same as I had when I
had pre and post as really large numbers.
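
One possible explanation for the -4 / -5 boundary is integer overflow. This is an unverified assumption about the implementation, not something confirmed in this thread: suppose the post-side check adds post to the end position of the candidate span using plain int arithmetic. The sketch below is standalone Java with hypothetical names (SpanNotOverflowSketch, acceptedIntMath, acceptedLongMath) and only demonstrates the arithmetic.

```java
// Toy arithmetic only; the comparison below is an assumed shape, not Lucene source.
class SpanNotOverflowSketch {

    // Accept the candidate (do NOT exclude it) when its window ends at or before
    // the start of the excluded span. Plain int addition wraps around on overflow.
    static boolean acceptedIntMath(int candidateEnd, int post, int excludeStart) {
        return candidateEnd + post <= excludeStart;
    }

    // Same check with the sum widened to long, so it cannot wrap.
    static boolean acceptedLongMath(int candidateEnd, int post, int excludeStart) {
        return (long) candidateEnd + post <= excludeStart;
    }

    public static void main(String[] args) {
        // Tokens: one(0) thousand(1) one(2) hundred(3) thirty(4) six(5)
        int candidateEnd = 5; // "thirty" occupies position 4, so its span ends at 5
        int excludeStart = 5; // "six" starts at position 5
        int[] posts = {0, 1, Integer.MAX_VALUE - 5, Integer.MAX_VALUE - 4, Integer.MAX_VALUE};
        for (int post : posts) {
            System.out.println("post=" + post
                    + " int=" + acceptedIntMath(candidateEnd, post, excludeStart)
                    + " long=" + acceptedLongMath(candidateEnd, post, excludeStart));
        }
    }
}
```

Under that assumption, post = 1 and post = Integer.MAX_VALUE - 5 both reject the candidate, while Integer.MAX_VALUE - 4 and larger wrap to a negative number and let it through. Because "thirty" ends at position 5 in these documents, that would line up with the threshold reported above. A post of 0 also accepts the candidate, since a zero-width window never reaches "six".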

On Tue, Jun 21, 2016 at 12:50 PM, Allison, Timothy B. <ta...@mitre.org>
wrote:

> >Perhaps I'm misunderstanding the pre/post parameters?
>
> Pre/post parameters: 'six' or 'seven' should not appear within $pre tokens
> before 'thirty' or within $post tokens after 'thirty'.
>
> Maybe something like this:
> spanNear([
>   spanNear([field:one, field:thousand, field:one, field:hundred], 0, true),
>   spanNot(field:thirty, spanOr([field:six, field:seven]), 0, 1)
>   ], 0, true)
>
>

RE: SpanQuery - How to wrap a NOT subquery

Posted by "Allison, Timothy B." <ta...@mitre.org>.

>Perhaps I'm misunderstanding the pre/post parameters?

Pre/post parameters: 'six' or 'seven' should not appear within $pre tokens before 'thirty' or within $post tokens after 'thirty'.

Maybe something like this:
spanNear([
  spanNear([field:one, field:thousand, field:one, field:hundred], 0, true),
  spanNot(field:thirty, spanOr([field:six, field:seven]), 0, 1)
  ], 0, true)

Re: SpanQuery - How to wrap a NOT subquery

Posted by Brandon Miller <co...@gmail.com>.

I saw the second post--the first post was new to me.

We plan on connecting with those people later on, but right now, I'm trying
to write a stop-gap dtSearch compiler until we can at least secure the
funding we need to employ their help.

Right now, I have a very functional query parser, with just a few holes
needing to be patched.

I rewrote my AND NOT and OR NOT queries.

Now I'm perplexed why this query is not working as expected:
spanNear([
  spanNear([field:one, field:thousand, field:one, field:hundred], 0, true),
  spanNot(field:thirty, spanOr([field:six, field:seven]), 2147483647, 2147483647)
  ], 0, true)

is returning 1130..1139.

expected:<[1130, 1131, 1132, 1133, 1134, 1135, 1138, 1139]> but was:<[1130,
1131, 1132, 1133, 1134, 1135, 1136, 1137, 1138, 1139]>

I would've expected 1136 and 1137 to have been excluded.

Original dtSearch string: "one thousand one hundred" pre/0 (thirty and not
(six or seven))
I even tried it with pre/5 to see if there was something funny going on
with that, but it gave the same results: 1130..1139.

If you can tell me what it should look like when the SpanQuery is converted
to a string, I should be able to figure out the rest.

Perhaps I'm misunderstanding the pre/post parameters?

Thank you for any help!

On Tue, Jun 21, 2016 at 9:46 AM, Allison, Timothy B. <ta...@mitre.org>
wrote:

> > dtSearch allows a user to have NOTs embedded in proximity searches.
>
> And, if you're heading down the path of building your own queryparser to
> handle dtSearch's syntax, please read and heed Charlie Hull's post:
>
> http://www.flax.co.uk/blog/2016/05/13/old-new-query-parser/
>
> See also:
>
>
> http://www.flax.co.uk/blog/2012/04/24/dtsolr-an-open-source-replacement-for-the-dtsearch-closed-source-search-engine/
>
>

RE: SpanQuery - How to wrap a NOT subquery

Posted by "Allison, Timothy B." <ta...@mitre.org>.

> dtSearch allows a user to have NOTs embedded in proximity searches.

And, if you're heading down the path of building your own queryparser to handle dtSearch's syntax, please read and heed Charlie Hull's post:

http://www.flax.co.uk/blog/2016/05/13/old-new-query-parser/

See also:

http://www.flax.co.uk/blog/2012/04/24/dtsolr-an-open-source-replacement-for-the-dtsearch-closed-source-search-engine/

RE: SpanQuery - How to wrap a NOT subquery

Posted by "Allison, Timothy B." <ta...@mitre.org>.

In the syntax for <self_promotion>LUCENE-5205’s SpanQueryParser [0]</self_promotion>, that’d be

["one thousand one hundred thirty" (six seven)]!~0,1

In English: find “one thousand one hundred thirty”, but not if six or seven comes immediately after it.
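
As a sanity check on that reading, the following standalone Java sketch replays it over the ten example IDs. It is a toy model, not Lucene. It assumes each ID is indexed as lowercase English words with no "and" (for example 1136 as "one thousand one hundred thirty six"), and the names (ProximityNotWorkedExample, tokensFor, phraseNotFollowedBy) are invented for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: replays "phrase, but not if six or seven comes right after it".
class ProximityNotWorkedExample {
    static final String[] DIGITS =
            {"", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"};

    // Spells out 1130..1139 only, e.g. 1136 -> [one, thousand, one, hundred, thirty, six].
    static List<String> tokensFor(int id) {
        List<String> tokens =
                new ArrayList<>(Arrays.asList("one", "thousand", "one", "hundred", "thirty"));
        int unit = id - 1130;
        if (unit > 0) tokens.add(DIGITS[unit]);
        return tokens;
    }

    // True if `phrase` occurs and no excluded token appears within `post` tokens
    // after it. The pre side is 0 here, so nothing before the phrase is checked.
    static boolean phraseNotFollowedBy(List<String> tokens, List<String> phrase,
                                       List<String> excluded, int post) {
        for (int start = 0; start + phrase.size() <= tokens.size(); start++) {
            int end = start + phrase.size();
            if (!tokens.subList(start, end).equals(phrase)) continue;
            int windowEnd = (int) Math.min((long) end + post, tokens.size());
            boolean blocked = false;
            for (String t : tokens.subList(end, windowEnd)) {
                if (excluded.contains(t)) blocked = true;
            }
            if (!blocked) return true;
        }
        return false;
    }

    static List<Integer> matchingIds() {
        List<String> phrase = Arrays.asList("one", "thousand", "one", "hundred", "thirty");
        List<String> excluded = Arrays.asList("six", "seven");
        List<Integer> hits = new ArrayList<>();
        for (int id = 1130; id <= 1139; id++) {
            if (phraseNotFollowedBy(tokensFor(id), phrase, excluded, 1)) hits.add(id);
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(matchingIds()); // [1130, 1131, 1132, 1133, 1134, 1135, 1138, 1139]
    }
}
```

With pre = 0 and post = 1 the model drops only 1136 and 1137.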

[0] https://github.com/tballison/lucene-addons/tree/master/lucene-5205

From: Brandon Miller [mailto:computerengineer.brandon@gmail.com]
Sent: Monday, June 20, 2016 4:12 PM
To: Allison, Timothy B. <ta...@mitre.org>; solr-user@lucene.apache.org
Subject: Re: SpanQuery - How to wrap a NOT subquery

Thank you, Timothy.

I have support for and am using SpanNotQuery elsewhere.  Maybe there is another use for it that I'm not considering.  I'm wondering if there's a clever way of reusing it in order to satisfy the requirements of proximity NOTs, too.

dtSearch allows a user to have NOTs embedded in proximity searches.
I.e.
Let's say you have an index whose ID has been converted to English phrases, like 1001 would be "One thousand one"

"one thousand one hundred" pre/0 (thirty and not (six or seven))
Returns: 1130, 1131, 1132, 1133, 1134, 1135,            1138, 1139

Perhaps I've been staring at the screen too long and the obvious answer is hiding from me.

Here's how I'm trying to implement it, but it's incorrect...  It's giving me 1130..1139 without excluding anything.



    public Query visitNot_expr(Not_exprContext ctx) {
        //ProximityNotSupportedFor("NOT");
        Query subquery = visit(ctx.expr());
        BooleanQuery.Builder query = new BooleanQuery.Builder();
        query.add(subquery, BooleanClause.Occur.MUST_NOT);
        // TODO: Consolidate this so that we don't use MatchAllDocsQuery, but using the other query, to increase performance
        query.add(new MatchAllDocsQuery(), BooleanClause.Occur.SHOULD);

        if (currentlyInASpanQuery) {
            SpanQuery matchAllDocs = getSpanWildcardQuery(new Term(defaultFieldName, "*"));
            SpanNotQuery snq = new SpanNotQuery(matchAllDocs, (SpanQuery) subquery, Integer.MAX_VALUE, Integer.MAX_VALUE);
            return snq;
        } else {
            return query.build();
        }
    }

    protected SpanQuery getSpanWildcardQuery(Term term) {
        WildcardQuery wq = new WildcardQuery(term);
        SpanQuery swq = new SpanMultiTermQueryWrapper<>(wq);
        return swq;
    }



Re: SpanQuery - How to wrap a NOT subquery

Posted by Brandon Miller <co...@gmail.com>.

Thank you, Timothy.

I have support for and am using SpanNotQuery elsewhere.  Maybe there is
another use for it that I'm not considering.  I'm wondering if there's a
clever way of reusing it in order to satisfy the requirements of proximity
NOTs, too.

dtSearch allows a user to have NOTs embedded in proximity searches.
For example:
Let's say you have an index in which each ID has been converted to an English
phrase, so 1001 would be "One thousand one".

"one thousand one hundred" pre/0 (thirty and not (six or seven))
Returns: 1130, 1131, 1132, 1133, 1134, 1135,            1138, 1139

Perhaps I've been staring at the screen too long and the obvious answer is
hiding from me.

Here's how I'm trying to implement it, but it's incorrect...  It's giving
me 1130..1139 without excluding anything.



    public Query visitNot_expr(Not_exprContext ctx) {
        //ProximityNotSupportedFor("NOT");
        Query subquery = visit(ctx.expr());
        BooleanQuery.Builder query = new BooleanQuery.Builder();
        query.add(subquery, BooleanClause.Occur.MUST_NOT);
        // TODO: Consolidate this so that we don't use MatchAllDocsQuery, but using
        // the other query, to increase performance
        query.add(new MatchAllDocsQuery(), BooleanClause.Occur.SHOULD);

        if (currentlyInASpanQuery) {
            SpanQuery matchAllDocs = getSpanWildcardQuery(new Term(defaultFieldName, "*"));
            SpanNotQuery snq = new SpanNotQuery(matchAllDocs, (SpanQuery) subquery,
                    Integer.MAX_VALUE, Integer.MAX_VALUE);
            return snq;
        } else {
            return query.build();
        }
    }

    protected SpanQuery getSpanWildcardQuery(Term term) {
        WildcardQuery wq = new WildcardQuery(term);
        SpanQuery swq = new SpanMultiTermQueryWrapper<>(wq);
        return swq;
    }

