Posted to solr-user@lucene.apache.org by Mark Bennett <mb...@ideaeng.com> on 2009/08/10 23:39:16 UTC

Overview of Query Parsing API Stack? / Dismax parsing, new 1.4 parsing, etc.

There are some good wiki pages on the syntax to use for queries, including
nested queries.

But trying to traipse through the code to get "the big picture" is a bit
involved.

A couple of examples:

Over the past few months I've had several questions about dismax, and why it
was or wasn't doing something a certain way.  I came up with a workaround
for CJK, but today I'm back looking at the shingles stuff and where,
exactly, shingle queries break.  I found the logical discussions about *why*
in some of the threads, but the actual code path makes quite a few hops, to
util classes, and to Lucene, etc.  I'll get there eventually, but having a
map would be nice.

Another example: at the last Meetup it was mentioned that big changes are
coming to query parsing pretty soon.  Understanding the "before" and "after"
logic would be nice, and I don't recall whether that impacted just Lucene,
or if Solr was also going to be affected.

--
Mark Bennett / New Idea Engineering, Inc. / mbennett@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513

Re: Overview of Query Parsing API Stack? / Dismax parsing, new 1.4 parsing, etc.

Posted by Mark Bennett <mb...@ideaeng.com>.

Thanks Hoss and Yonik.

Hoss, you had a particularly pertinent passage:
> ... because the normal Lucene QueryParser uses whitespace ...
> and breaks up the input on the whitespace boundaries
> before it ever passes those chunks ... to the analyzers

This is EXACTLY what the issue is.  At first I thought it was the result of
using dismax, but from what you said, I'm guessing it affects all queries.
And does somebody have a "worked" example of engineering around it?

Yonik,

I was surprised by your IBM comments, because based on what they had
presented at the meetup, I also thought it would be more "granular".  Have
you chatted with them to confirm?




Re: Overview of Query Parsing API Stack? / Dismax parsing, new 1.4 parsing, etc.

Posted by Yonik Seeley <yo...@lucidimagination.com>.

On Thu, Aug 20, 2009 at 10:16 PM, Chris
Hostetter<ho...@fucit.org> wrote:
> coming in Lucene 2.9 (which is what Solr 1.4 will use) is a completely new
> QueryParser framework, which (i'm told) is supposed to make it much easier
> to create custom query parser syntaxes,

I've quickly looked, but haven't seen this to be the case.
The new query parser framework uses the same JavaCC grammar and
creates intermediate objects that eventually create Lucene Query
objects.

From an IBM perspective (where this parser came from), it makes it
easier to add a new syntax because they have multiple back-ends
(Lucene being one, probably OmniFind or other proprietary search
engines being others).  But from the Lucene perspective, there is only
Lucene as a back-end.

So if you want to try and extend the syntax of the lucene query
parser, it still seems to come down to hacking on the JavaCC grammar
(the hard part).

-Yonik
http://www.lucidimagination.com

Re: Overview of Query Parsing API Stack? / Dismax parsing, new 1.4 parsing, etc.

Posted by Chris Hostetter <ho...@fucit.org>.

: Subject: Overview of Query Parsing API Stack? / Dismax parsing,
:     new 1.4  parsing, etc.

Oh, what i would give for time to sit and document in depth how some of 
this stuff works (assuming i first had time to verify that it really does 
work the way i think)

The nutshell answer is that as far as solr (1.4) is concerned, the main
unit of "query parsing" is a QParser ... lots of places in the code base
may care about parsing different strings for the purposes of producing a
Query object, but ultimately they all use a QParser.

QParsers are plugins that you can configure instances of in your
solrconfig.xml and assign names to.  by default, all of the various pieces
of code in solr that do any sort of query related parsing use some basic
convention to pick a QParser by name -- so StandardRequestHandler uses the
QParser named "lucene" for parsing the "q" param, while
DisMaxRequestHandler uses a QParser named "dismax" for "q", and "func" for
the "bf" param.  so if you wanted to make some change so that *any* code
path anywhere attempting to use the lucene syntax got your custom query
parsing logic, you could configure a QParser with the name "lucene" and
override the default.
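
To make the name-based lookup concrete, here is a minimal sketch in plain
Java.  It is not Solr's code: the ParserRegistry class, its register and
parse methods, and the toy parsers that return strings are all hypothetical.
Only the names "lucene", "dismax" and "func" come from the description
above.  It shows how registering a parser under an existing name changes
the result for every caller that looks that name up.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for named QParser plugins; not the real Solr API.
public class ParserRegistry {
    // Maps a parser name to a function that turns a query string into a
    // toy string description of the resulting query.
    private final Map<String, Function<String, String>> parsers = new HashMap<>();

    public ParserRegistry() {
        // Defaults, using the parser names mentioned in the email.
        register("lucene", q -> "LuceneQuery(" + q + ")");
        register("dismax", q -> "DisMaxQuery(" + q + ")");
        register("func", q -> "FunctionQuery(" + q + ")");
    }

    // Registering under an existing name replaces the earlier parser.
    public void register(String name, Function<String, String> parser) {
        parsers.put(name, parser);
    }

    // Looks the parser up by name and applies it to the query string.
    public String parse(String parserName, String queryString) {
        Function<String, String> parser = parsers.get(parserName);
        if (parser == null) {
            throw new IllegalArgumentException("Unknown parser: " + parserName);
        }
        return parser.apply(queryString);
    }

    public static void main(String[] args) {
        ParserRegistry registry = new ParserRegistry();
        System.out.println(registry.parse("lucene", "title:solr"));
        // Override the default "lucene" parser with custom logic.
        registry.register("lucene", q -> "CustomQuery(" + q + ")");
        System.out.println(registry.parse("lucene", "title:solr"));
    }
}
```

In a real installation the override would be declared in solrconfig.xml
rather than in code; the sketch only illustrates the lookup-by-name idea.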

The brilliantly confusing magic comes into play when strings to be parsed
start with the "local params" syntax (ie: "{!foo a=f b=z}blah blah") ...
that tells the parsing code to override whatever QParser it would have
used for that string, and to pass everything after the "}" character to the
parser named "foo", with a=f and b=z added to the list of SolrParams it's
already got (from the query string, or default params in solrconfig,
etc...)
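
As a rough illustration of that prefix handling, the sketch below splits a
string into parser name, extra params, and the remaining body.  It is a
simplification under stated assumptions, not Solr's implementation: the
LocalParamsSketch class is hypothetical, the key=value pairs are assumed to
be separated by whitespace, and quoting and other special cases are ignored.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified, hypothetical illustration of the "{!name k=v ...}rest" prefix.
public class LocalParamsSketch {
    public final String parserName;          // parser chosen for the body
    public final Map<String, String> params; // extra params from the prefix
    public final String body;                // text handed to that parser

    private LocalParamsSketch(String parserName, Map<String, String> params, String body) {
        this.parserName = parserName;
        this.params = params;
        this.body = body;
    }

    public static LocalParamsSketch parse(String input, String defaultParser) {
        Map<String, String> params = new LinkedHashMap<>();
        // No prefix: keep the caller's default parser and the whole string.
        if (!input.startsWith("{!")) {
            return new LocalParamsSketch(defaultParser, params, input);
        }
        int close = input.indexOf('}');
        if (close < 0) {
            throw new IllegalArgumentException("Missing '}' in: " + input);
        }
        String name = defaultParser;
        String prefix = input.substring(2, close).trim();
        if (!prefix.isEmpty()) {
            for (String token : prefix.split("\\s+")) {
                int eq = token.indexOf('=');
                if (eq < 0) {
                    name = token; // a bare token names the parser
                } else {
                    params.put(token.substring(0, eq), token.substring(eq + 1));
                }
            }
        }
        // Everything after the "}" goes to the named parser.
        return new LocalParamsSketch(name, params, input.substring(close + 1));
    }

    public static void main(String[] args) {
        LocalParamsSketch p = parse("{!foo a=f b=z}blah blah", "lucene");
        System.out.println(p.parserName + " " + p.params + " " + p.body);
    }
}
```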

For most types of queries, the QParser ultimately uses Lucene's
"QueryParser" class, or some subclass of it (DisMaxQueryParser used by the
DisMaxQPlugin is a subclass of "QueryParser") and 9 times out of 10 if
people want to customize query parsing without inventing a 100% new
syntax, they also write a subclass.

coming in Lucene 2.9 (which is what Solr 1.4 will use) is a completely new
QueryParser framework, which (i'm told) is supposed to make it much easier
to create custom query parser syntaxes, but i haven't had time to look at
it to see what all the fuss is about.  so in theory you could use it to
implement a new QPlugin in Solr 1.4.

no matter how you ultimately implement code that goes from "String" to
"Query" you have to be concerned about the type of data in the field that
the Query object refers to (if it was lowercased at index time, you want to
lowercase at query time, etc...).  Solr does its best to help query
parsers out by supporting an <analyzer type="query"/> in the schema.xml so
that the schema creator can specify how to "analyze" a piece of
input when building queries, but depending on the query syntax it's not
always easy to get the behavior you expect from a particular query parser
/ analyzer pair.  (This part of query parsing typically trips people up when
dealing with multiword synonyms, or analyzers that don't tokenize on
whitespace, because the normal Lucene QueryParser uses whitespace as part
of its markup, and breaks up the input on the whitespace boundaries
before it ever passes those chunks of input to the analyzers.)
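
The whitespace problem can be shown with a toy example in plain Java.
Nothing here is Lucene or Solr code: the WhitespaceSplitSketch class, its
analyze method (a pretend analyzer that maps the two-word phrase "new york"
to the single token "ny"), and the sample query are all made up for
illustration.  The point is only that an analyzer which needs to see
adjacent words cannot do its job if the parser hands it one
whitespace-separated chunk at a time.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of "split on whitespace, then analyze"; hypothetical code.
public class WhitespaceSplitSketch {
    // Pretend analyzer: lowercases, rewrites the multiword synonym
    // "new york" to "ny", then tokenizes on whitespace.
    static List<String> analyze(String text) {
        String normalized = text.toLowerCase().replace("new york", "ny");
        List<String> tokens = new ArrayList<>();
        for (String t : normalized.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Mimics a parser that breaks the input on whitespace first and only
    // then passes each chunk to the analyzer on its own.
    static List<String> splitThenAnalyze(String query) {
        List<String> tokens = new ArrayList<>();
        for (String chunk : query.trim().split("\\s+")) {
            tokens.addAll(analyze(chunk));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The analyzer never sees "new" and "york" together here.
        System.out.println(splitThenAnalyze("New York pizza")); // [new, york, pizza]
        // Given the whole string, the synonym rule fires.
        System.out.println(analyze("New York pizza"));          // [ny, pizza]
    }
}
```

The second call corresponds to handing the analyzer the whole string as a
single chunk, which is roughly what happens to a quoted phrase in the
Lucene query syntax.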

: But trying to traipse through the code to get "the big picture" is a bit
: involved.

like i said: the world of query parsing in solr all revolves around the
QParser API ... if you want to make sense of it, start there, and work out
in both directions.

PS: please, please, please ... as you make progress on understanding these
internals, feel free to plagiarize this email as the starting point of a
new wiki page documenting your understanding for others who come along
with the same question.


-Hoss