Posted to solr-user@lucene.apache.org by Mark Bennett <mb...@ideaeng.com> on 2009/08/10 23:39:16 UTC
Overview of Query Parsing API Stack? / Dismax parsing, new 1.4
parsing, etc.
There are some good Wiki pages on the syntax to use for queries, including
nested queries.
But trying to traipse through the code to get "the big picture" is a bit
involved.
A couple of examples:
Over the past few months I've had several questions about dismax, and why it
was or wasn't doing something a certain way. I came up with a workaround
for CJK, but today I'm back looking at the shingles stuff and where,
exactly, shingle queries break. I found the logical discussions about *why*
in some of the threads, but the actual code path makes quite a few hops, to
util classes, and to Lucene, etc. I'll get there eventually, but having a
map would be nice.
Another example: at the last Meetup it was mentioned that big changes are
coming to query parsing pretty soon. Understanding the "before" and "after"
logic would be nice, and I don't recall whether that impacted just Lucene,
or if Solr was also going to be affected.
--
Mark Bennett / New Idea Engineering, Inc. / mbennett@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
Re: Overview of Query Parsing API Stack? / Dismax parsing, new 1.4
parsing, etc.
Posted by Mark Bennett <mb...@ideaeng.com>.
Thanks Hoss and Yonik.
Hoss, you had a particularly pertinent passage:
> ... because the normal Lucene QueryParser uses whitespace ...
> and breaks up the input on the whitespace boundaries
> before it ever passes those chunks ... to the analyzers
This is EXACTLY what the issue is. At first I thought it was the result of
using dismax, but from what you said, I'm guessing it affects all queries.
And does somebody have a "worked" example of engineering around it?
Yonik,
I was surprised by your IBM comments, because based on what they had
presented at the meetup, I also thought it would be more "granular". Have
you chatted with them to confirm?
--
Mark Bennett / New Idea Engineering, Inc. / mbennett@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
Re: Overview of Query Parsing API Stack? / Dismax parsing, new 1.4
parsing, etc.
Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Thu, Aug 20, 2009 at 10:16 PM, Chris
Hostetter<ho...@fucit.org> wrote:
> coming in Lucene 2.9 (which is what Solr 1.4 will use) is a completely new
> QueryParser framework, which (i'm told) is supposed to make it much easier
> to create custom query parser syntaxes,
I've quickly looked, but haven't seen this to be the case.
The new query parser framework uses the same JavaCC grammar and
creates intermediate objects that eventually create Lucene Query
objects.
From an IBM perspective (where this parser came from), it makes it
easier to add a new syntax because they have multiple back-ends
(Lucene being one, probably OmniFind or other proprietary search
engines being others). But from the Lucene perspective, there is only
Lucene as a back-end.
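
One way to picture the "intermediate objects, multiple back-ends" design is the toy Python sketch below. It is only an illustration of the general pattern, not the new framework's actual API: TermNode, AndNode and both builder functions are hypothetical names, and the second back-end is invented purely for contrast. The grammar that would produce the tree is deliberately left out.

```python
from dataclasses import dataclass


# Hypothetical intermediate objects that a grammar would emit.
@dataclass
class TermNode:
    field: str
    text: str


@dataclass
class AndNode:
    children: list


def build_lucene_style(node):
    """Back-end 1: render the tree as a Lucene-like query string."""
    if isinstance(node, TermNode):
        return f"{node.field}:{node.text}"
    return "(" + " ".join("+" + build_lucene_style(c) for c in node.children) + ")"


def build_other_backend(node):
    """Back-end 2 (invented): render the same tree as nested dicts."""
    if isinstance(node, TermNode):
        return {"term": {node.field: node.text}}
    return {"all_of": [build_other_backend(c) for c in node.children]}


# The same intermediate tree feeds both builders.
tree = AndNode([TermNode("title", "solr"), TermNode("body", "dismax")])
```

With a single back-end, the intermediate tree buys less: changing the syntax still means changing whatever produces the tree.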
So if you want to try and extend the syntax of the lucene query
parser, it still seems to come down to hacking on the JavaCC grammar
(the hard part).
-Yonik
http://www.lucidimagination.com
Re: Overview of Query Parsing API Stack? / Dismax parsing, new 1.4
parsing, etc.
Posted by Chris Hostetter <ho...@fucit.org>.
: Subject: Overview of Query Parsing API Stack? / Dismax parsing,
: new 1.4 parsing, etc.
Oh, what i would give for time to sit and document in depth how some of
this stuff works (assuming i first had time to verify that it really does
work the way i think)
The nutshell answer is that as far as solr (1.4) is concerned, the main
unit of "query parsing" is a QParser ... lots of places in the code base
may care about parsing different strings for the purposes of producing a
Query object, but ultimately they all use a QParser.
QParsers are plugins that you can configure instances of in your
solrconfig.xml and assign names to. by default, all of the various pieces of
code in solr that do any sort of query related parsing use some basic
convention to pick a QParser by name -- so StandardRequestHandler uses the
QParser named "lucene" for parsing the "q" param, while
DisMaxRequestHandler uses a QParser named "dismax" for "q", and "func" for
the "bf" param. so if you wanted to make some change so that *any* code
path anywhere attempting to use the lucene syntax got your custom query
parsing logic, you could configure a QParser with the name "lucene" and
override the default.
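
The pick-a-parser-by-name convention can be sketched as a toy model in Python. This is not Solr's Java API: the registry dict, the handler functions and the tuples standing in for Query objects are all assumptions made for illustration. Only the parser names ("lucene", "dismax", "func") and the param names ("q", "bf") come from the description above.

```python
# Toy stand-ins for parsers; each returns a tuple instead of a real Query.
def lucene_parser(qstr):
    return ("lucene-query", qstr)


def dismax_parser(qstr):
    return ("dismax-query", qstr)


def func_parser(qstr):
    return ("function-query", qstr)


# Registry of named parsers, standing in for what solrconfig.xml would declare.
PARSERS = {"lucene": lucene_parser, "dismax": dismax_parser, "func": func_parser}


def standard_handler(params):
    # Looks up the parser named "lucene" for the "q" param.
    return PARSERS["lucene"](params["q"])


def dismax_handler(params):
    # Looks up "dismax" for "q" and "func" for "bf" (if present).
    main = PARSERS["dismax"](params["q"])
    boost = PARSERS["func"](params["bf"]) if "bf" in params else None
    return main, boost


def my_custom_parser(qstr):
    # Hypothetical custom logic: lowercase everything.
    return ("my-custom-query", qstr.lower())


# Re-registering the name "lucene" changes every code path that asks for it.
PARSERS["lucene"] = my_custom_parser
```

Because the handlers look the parser up by name at call time, standard_handler picks up the custom parser without being modified itself.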
The brilliantly confusing magic comes into play when strings to be parsed
start with the "local params" syntax (ie: "{!foo a=f b=z}blah blah") ...
that tells the parsing code to override whatever QParser it would have
used for that string, and to pass everything after the "}" character to the
parser named "foo", with a=f and b=z added to the list of SolrParams it's
already got (from the query string, or default params in solrconfig,
etc...)
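
A minimal Python sketch of that prefix handling follows, under simplifying assumptions: key=value pairs are separated by whitespace, values are never quoted, and the parser name is the first bare token after "{!". The function name split_local_params is hypothetical and this is not how Solr itself implements it.

```python
def split_local_params(qstr, default_parser, request_params):
    """Return (parser_name, params, remaining_query) for a raw query string."""
    params = dict(request_params)  # start from the params we already have
    if not qstr.startswith("{!"):
        return default_parser, params, qstr

    end = qstr.find("}")
    if end == -1:
        raise ValueError("unterminated local params: " + qstr)

    parser = default_parser
    for i, token in enumerate(qstr[2:end].split()):
        if "=" in token:
            key, value = token.split("=", 1)
            params[key] = value   # local params are added to the existing ones
        elif i == 0:
            parser = token        # first bare token names the parser
        else:
            raise ValueError("unexpected token in local params: " + token)

    # Everything after the "}" goes to the chosen parser.
    return parser, params, qstr[end + 1:]
```

For the example string above, with "lucene" as the default parser, this returns the parser name "foo", the original params plus a=f and b=z, and the remaining text "blah blah".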
For most types of queries, the QParser ultimately uses Lucene's
"QueryParser" class, or some subclass of it (the DisMaxQueryParser used by
the DisMaxQPlugin is a subclass of "QueryParser") and 9 times out of 10 if
people want to customize query parsing without inventing a 100% new
syntax, they also write a subclass.
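
The "keep the syntax, override one hook in a subclass" approach looks roughly like this in a toy Python model. ToyQueryParser and AliasingQueryParser are hypothetical classes. Lucene's real QueryParser is Java and far richer; the hook here is only loosely modeled on its overridable per-field factory methods.

```python
class ToyQueryParser:
    """Tiny 'field:text' parser; clauses are (field, text) tuples."""

    def __init__(self, default_field):
        self.default_field = default_field

    def parse(self, qstr):
        clauses = []
        for chunk in qstr.split():
            field, sep, text = chunk.partition(":")
            if not sep:  # no explicit field given
                field, text = self.default_field, chunk
            clauses.append(self.get_field_query(field, text))
        return clauses

    def get_field_query(self, field, text):
        # Hook that subclasses override to change how clauses are built.
        return (field, text)


class AliasingQueryParser(ToyQueryParser):
    """Same syntax, but short field aliases are expanded."""

    ALIASES = {"t": "title"}

    def get_field_query(self, field, text):
        return super().get_field_query(self.ALIASES.get(field, field), text)
```

The subclass never touches the tokenizing loop in parse(); it only changes what is built for each clause.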
coming in Lucene 2.9 (which is what Solr 1.4 will use) is a completely new
QueryParser framework, which (i'm told) is supposed to make it much easier
to create custom query parser syntaxes, but i haven't had time to look at
it to see what all the fuss is about. so in theory you could use it to
implement a new QPlugin in Solr 1.4.
no matter how you ultimately implement code that goes from "String" to
"Query" you have to be concerned about the type of data in the field that
the Query object refers to (if it was lowercased at index time, you want to
lowercase at query time, etc...). Solr does its best to help query
parsers out by supporting an <analyzer type="query"/> in the schema.xml so
that the schema creator can specify how to "analyze" a piece of
input when building queries, but depending on the query syntax it's not
always easy to get the behavior you expect from a particular query parser
/ analyzer pair. (This part of query parsing typically trips people up when
dealing with multiword synonyms, or analyzers that don't tokenize on
whitespace, because the normal Lucene QueryParser uses whitespace as part
of its markup, and breaks up the input on the whitespace boundaries
before it ever passes those chunks of input to the analyzers.)
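
The whitespace problem can be reproduced with a toy Python model. The "analyzer" below is a made-up stand-in that lowercases and emits two-word shingles, and both parse functions are hypothetical. The point is only the order of operations: splitting happens before analysis, so the analyzer never sees two adjacent words.

```python
def shingle_analyzer(text):
    """Toy analyzer: lowercase, tokenize, emit words plus 2-word shingles."""
    words = text.lower().split()
    shingles = [words[i] + " " + words[i + 1] for i in range(len(words) - 1)]
    return words + shingles


def parse_whitespace_first(qstr, analyzer):
    """Mimics a parser that splits on whitespace, then analyzes each chunk."""
    terms = []
    for chunk in qstr.split():
        terms.extend(analyzer(chunk))  # analyzer only ever sees one word
    return terms


def parse_whole_string(qstr, analyzer):
    """Contrast case: hand the whole input to the analyzer at once."""
    return analyzer(qstr)
```

For the input "New York City", parse_whitespace_first yields only the three single words, while parse_whole_string also yields "new york" and "york city". The same ordering issue applies to multiword synonyms.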
: But trying to traipse through the code to get "the big picture" is a bit
: involved.
like i said: the world of query parsing in solr all revolves around the
QParser API ... if you want to make sense of it, start there, and work out
in both directions.
PS: please, please, please ... as you make progress on understanding these
internals, feel free to plagiarize this email as the starting point of a
new wiki page documenting your understanding for others who come along
with the same question.
-Hoss