
Posted to solr-user@lucene.apache.org by Steven White <sw...@gmail.com> on 2015/05/04 14:29:17 UTC

Re: analyzer, indexAnalyzer and queryAnalyzer

Thanks Doug.  This is extremely helpful.  It is much appreciated that you
took the time to write it all.

Do we have a Solr / Lucene wiki with such "did you know?" write ups?  If
not, just having this kind of knowledge in an email isn't good enough as it
won't be as searchable as a wiki.

Steve

On Wed, Apr 29, 2015 at 9:24 PM, Doug Turnbull <
dturnbull@opensourceconnections.com> wrote:

> So Solr has the idea of a query parser. The query parser is a convenient
> way of passing a search string to Solr and having Solr parse it into
> underlying Lucene queries. You can see a list of query parsers here:
> http://wiki.apache.org/solr/QueryParser
>
> What this means is that the query parser does work to pull terms into
> individual clauses *before* analysis is run. It's a parsing layer that sits
> outside the analysis chain. This creates problems like the "sea biscuit"
> problem, whereby we declare "sea biscuit" as a query time synonym of
> "seabiscuit". As you may know synonyms are checked during analysis.
> However, if the query parser splits up "sea" from "biscuit" before running
> analysis, the query time analyzer will fail to apply the synonym. The string
> "sea" is brought by itself to the query time analyzer and of course won't
> match "sea biscuit".
> Same with the string "biscuit" in isolation. If the full string "sea
> biscuit" was brought to the analyzer, it would see [sea] next to [biscuit]
> and declare it a synonym of seabiscuit. Thanks to the query parser, the
> analyzer has lost the association between the terms, and both terms aren't
> brought together to the analyzer.
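
The effect described above can be sketched with a toy model. This is plain Python, not Solr or Lucene code: the synonym table, the function names, and the whitespace tokenizer are all assumptions made for illustration. It only shows why analyzing each whitespace-separated chunk on its own prevents a two-token synonym from firing.

```python
# Toy model only; none of these names exist in Solr or Lucene.
# Hypothetical synonym rule: the token pair [sea][biscuit] -> [seabiscuit].
SYNONYMS = {("sea", "biscuit"): ["seabiscuit"]}


def analyze(text):
    """Lowercase, split on whitespace, then apply two-token synonyms."""
    tokens = text.lower().split()
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in SYNONYMS:
            # Both tokens are visible together, so the synonym can match.
            out.extend(SYNONYMS[pair])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def parse_then_analyze(query):
    """Mimic a query parser that splits on whitespace before analysis."""
    return [tok for chunk in query.split() for tok in analyze(chunk)]


def analyze_whole(query):
    """Mimic handing the full query string to the analyzer in one piece."""
    return analyze(query)
```

In this sketch, parse_then_analyze("sea biscuit") never shows the analyzer both tokens at once, while analyze_whole("sea biscuit") does, which is the difference the paragraph above describes.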
>
> My colleague John Berryman wrote a pretty good blog post on this
>
> http://opensourceconnections.com/blog/2013/10/27/why-is-multi-term-synonyms-so-hard-in-solr/
>
> There are several solutions out there that attempt to address this problem.
> One from Ted Sullivan at Lucidworks
>
> https://lucidworks.com/blog/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/
>
> Another popular one is the hon-lucene-synonyms plugin:
>
> https://github.com/healthonnet/hon-lucene-synonyms
>
> Yet another work-around is to use the field query parser:
>
> http://lucene.apache.org/solr/5_1_0/solr-core/org/apache/solr/search/FieldQParserPlugin.html
>
> I also tend to write my own query parsers, so on the one hand it's annoying
> that query parsers have the problems above; on the flip side, Solr makes it
> very easy to implement whatever parsing you think is appropriate with a
> small bit of Java/Lucene knowledge.
>
> Hopefully that explanation wasn't too deep, but it's an important thing to
> know about Solr. Are you asking out of curiosity, or do you have a specific
> problem?
>
> Thanks
> -Doug
>
> On Wed, Apr 29, 2015 at 6:32 PM, Steven White <sw...@gmail.com>
> wrote:
>
> > Hi Doug,
> >
> > I don't understand what you mean by the following:
> >
> > > For example, if a user searches for q=hot dogs&defType=edismax&qf=title
> > > body the *query parser* *not* the *analyzer* first turns the query into:
> >
> > If I have indexAnalyzer and queryAnalyzer in a fieldType that are 100%
> > identical, does the example you provided still stand?  If so, why?  Or do
> > you mean something totally different by "query parser"?
> >
> > Thanks
> >
> > Steve
> >
> >
> > On Wed, Apr 29, 2015 at 4:18 PM, Doug Turnbull <
> > dturnbull@opensourceconnections.com> wrote:
> >
> > > *> 1) If the content of indexAnalyzer and queryAnalyzer are exactly the
> > > same, that's the same as if I have an analyzer only, right?*
> > > 1) Yes
> > >
> > > *> 2) Under the hood, all three are the same thing when it comes to what
> > > kind of data and configuration attributes they can take, right?*
> > > 2) Yes. Both take in text and output a token stream.
> > >
> > > *> What I'm trying to figure out is this: besides being able to configure a
> > > fieldType to have different analyzer settings at index and query time,
> > > there is nothing else that's unique about each.*
> > >
> > > The only thing to look out for in Solr land is the query parser. Most Solr
> > > query parsers treat whitespace as meaningful.
> > >
> > > For example, if a user searches for q=hot dogs&defType=edismax&qf=title
> > > body the *query parser* *not* the *analyzer* first turns the query into:
> > >
> > > (title:hot title:dog) | (body:hot body:dog)
> > >
> > > each word which *then* gets analyzed. This is because the query parser
> > > tries to be smart and turn "hot dog" into hot OR dog, or more specifically
> > > making them two must clauses.
> > >
> > > This trips quite a few folks up. You can use the field query parser, which
> > > passes the whole string to the field's analyzer and builds a phrase query
> > > from the result. Hope that helps
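
As a rough illustration of the field query parser suggestion above, here is a minimal Python sketch that builds the request parameters. The {!field f=...} prefix is Solr's local-params syntax for selecting the field query parser; the helper name and the choice of the "title" field are assumptions for this example, and no request is actually sent.

```python
from urllib.parse import urlencode


def field_query_params(field, text):
    """Build Solr request parameters that send `text` through the field query parser.

    Hypothetical helper for illustration; only the {!field f=...} syntax is Solr's.
    """
    # The whole of `text` reaches the field's query analyzer in one piece,
    # instead of being split on whitespace by the query parser first.
    return {"q": "{!field f=%s}%s" % (field, text)}


# Example: search the (assumed) "title" field for the full string "hot dogs".
params = field_query_params("title", "hot dogs")
query_string = urlencode(params)  # URL-encoded form, ready to append to a select URL
```

The resulting query_string can be appended to whatever select handler URL a given installation uses; that URL is deliberately left out here because it depends on the deployment.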
> > >
> > >
> > > --
> > > *Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections,
> > > LLC | 240.476.9983 | http://www.opensourceconnections.com
> > > Author: Taming Search <http://manning.com/turnbull> from Manning
> > > Publications
> > > This e-mail and all contents, including attachments, is considered to be
> > > Company Confidential unless explicitly stated otherwise, regardless
> > > of whether attachments are marked as such.
> > > On Wed, Apr 29, 2015 at 3:41 PM, Steven White <sw...@gmail.com>
> > > wrote:
> > >
> > > > Hi Everyone,
> > > >
> > > > Looking at Solr's schema.xml, there are three kinds of analyzers:
> > > > analyzer, indexAnalyzer and queryAnalyzer.  I have two questions about
> > > > them:
> > > >
> > > > 1) If the content of indexAnalyzer and queryAnalyzer are exactly the
> > > > same, that's the same as if I have an analyzer only, right?
> > > >
> > > > 2) Under the hood, all three are the same thing when it comes to what
> > > > kind of data and configuration attributes they can take, right?
> > > >
> > > > What I'm trying to figure out is this: besides being able to configure a
> > > > fieldType to have different analyzer settings at index and query time,
> > > > there is nothing else that's unique about each.
> > > >
> > > > Thanks
> > > >
> > > > Steve
> > > >
> > >
> >
>

Re: analyzer, indexAnalyzer and queryAnalyzer

Posted by Shawn Heisey <ap...@elyograg.org>.

On 5/4/2015 6:29 AM, Steven White wrote:
> Thanks Doug.  This is extremely helpful.  It is much appreciated that you
> took the time to write it all.
> 
> Do we have a Solr / Lucene wiki with such "did you know?" write ups?  If
> not, just having this kind of knowledge in an email isn't good enough as it
> won't be as searchable as a wiki.

There is a community-editable wiki.  If you want write permission, just
create an account on that wiki and let us know (either here or on the
#solr IRC channel) what your username is, and we can get you added to
the contributors group.

https://wiki.apache.org/solr

The Apache Solr Reference Guide is kept on another wiki system, but only
committers can edit that wiki, because it is released as official
documentation.  Community users can comment on its pages if they have
suggestions.

https://cwiki.apache.org/confluence/display/solr/Apache+Solr+Reference+Guide

Everything that happens on both wikis is visible to anyone who
subscribes to the commits mailing list, so if there is good information
available that should go into the official documentation, editing the
community wiki or commenting on the reference guide is usually enough to
make the committers aware of it.

You can find information on the various mailing lists here:

https://lucene.apache.org/core/discussion.html
https://lucene.apache.org/solr/resources.html#mailing-lists

Thanks,
Shawn