Posted to solr-user@lucene.apache.org by Yonik Seeley <yo...@lucidimagination.com> on 2009/10/10 14:01:34 UTC

Re: Dismax: Impossible to search for a _phrase_ in tokenized and untokenized fields at the same time

On Sat, Oct 10, 2009 at 6:34 AM, Alex Baranov <al...@gmail.com> wrote:
>
> Hello,
>
> It seems to me that there is no way to use the dismax handler to search
> both tokenized and untokenized fields at the same time when searching for a
> phrase.
>
> Consider the following example. I have two fields in the index: product_name
> and product_name_un. The schema looks like this:
>
>    <fieldType name="string_ignore_case" class="solr.TextField"
>               positionIncrementGap="100" omitNorms="true">
>      <analyzer>
>        <tokenizer class="solr.KeywordTokenizerFactory"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>      </analyzer>
>    </fieldType>
>
>    <fieldType name="text_no_stopwords_en" class="solr.TextField"
>               positionIncrementGap="100">
>      <analyzer>
>        <tokenizer class="solr.StandardTokenizerFactory"/>
>        <filter class="solr.WordDelimiterFilterFactory"
>                generateWordParts="1" generateNumberParts="1" catenateWords="0"
>                catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>        <filter class="solr.ISOLatin1AccentFilterFactory"/>
>        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>        <filter class="solr.SnowballPorterFilterFactory"
>                language="English"/>
>      </analyzer>
>    </fieldType>
>
>    <field name="product_name" type="text_no_stopwords_en" indexed="true"
>           stored="true"/>
>    <field name="product_name_un" type="string_ignore_case" indexed="true"
>           stored="true"/>
>
>    <copyField source="product_name" dest="product_name_un"/>
>
> I'm using dismax to search both of them at the same time:
> "defType=dismax&qf=product_name product_name_un^2.0". (This is done to bring
> to the top of the results the products whose name _equals_ the entered
> criteria.)
>
> 1. When I search for a phrase (two or more keywords), e.g. <blue
> car>, the input string is tokenized, and even though I have
> product_name_un="blue car" in the index, the "product_name_un^2.0" part of
> the dismax config has no effect.

Hmmm, right.  This is due to the fact that the Lucene query parser
(still actually used in dismax) breaks things up by whitespace
*before* analysis (so the analyzer for the untokenized field never
sees the two tokens together).
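
To make the order of operations concrete, here is a toy Python model of it.
This is only a sketch, not Lucene or Solr code: keyword_analyzer and
parser_splits_first are made-up names that stand in for the
KeywordTokenizer + LowerCaseFilter chain and for the parser's whitespace
splitting.

```python
# Toy model only: these helpers are hypothetical, not Lucene or Solr APIs.

def keyword_analyzer(text):
    """Stand-in for KeywordTokenizer + LowerCaseFilter: whole input is one token."""
    return [text.lower()]


def parser_splits_first(user_input, analyzer):
    """Split on whitespace first, then analyze each chunk on its own."""
    terms = []
    for chunk in user_input.split():
        terms.extend(analyzer(chunk))
    return terms


# What the untokenized field would hold for the value "Blue Car".
indexed_terms = set(keyword_analyzer("Blue Car"))

# Unquoted input: the analyzer only ever sees "blue" and "car" separately.
unquoted_terms = parser_splits_first("blue car", keyword_analyzer)

# Quoted input: the phrase reaches the analyzer as a single string.
quoted_terms = keyword_analyzer("blue car")

unquoted_hits = [t for t in unquoted_terms if t in indexed_terms]
quoted_hits = [t for t in quoted_terms if t in indexed_terms]
```

In this model the unquoted input produces no term that equals the single
indexed token, while the quoted input does, which is the behaviour described
in points 1 and 2.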

> 2. When I enter <"blue car"> (in quotes), the string is not tokenized and the
> "product_name_un^2.0" part works, but nothing is found in the product_name
> field.

Using explicit quotes will make a phrase query, so blue and car must
appear right next to each other in product_name.
If it's OK to require both blue and car in product_name, then you can
just set a slop for explicit phrase queries with the qs parameter.
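
As a rough sketch of what such a request could look like, the Python below
builds the query string with the standard library. The host, port, path and
the slop value of 2 are assumptions made for illustration; only the qf value
and the field names come from the original message.

```python
from urllib.parse import urlencode, parse_qs

# Request parameters for an explicit phrase query with some slop allowed.
params = {
    "defType": "dismax",
    "q": '"blue car"',                          # explicit quotes: phrase query
    "qf": "product_name product_name_un^2.0",   # fields from the original post
    "qs": "2",                                  # phrase slop; illustrative value
    "rows": "10",
}

query_string = urlencode(params)

# Assumed local Solr URL, purely for illustration.
url = "http://localhost:8983/solr/select?" + query_string

# Round-trip check that the encoding preserved the parameters.
decoded = parse_qs(query_string)
```

A larger qs lets the quoted terms sit further apart in product_name, but both
terms are still required there.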

-Yonik
http://www.lucidimagination.com





> I.e. there is no way to have a proper search against two fields at the same
> time. The workaround that I found is using the "bq" parameter to specify a
> boost query against the product_name_un field. But I don't think that this
> should be the only solution.
>
>
> Another note, related to that: when I set product_name_un as the default
> search field and query with ../select/?q=blue car&rows=10&... I get
> empty results despite having the "blue car" value in that field in the
> index. I have to use quotes again to fix that... Shouldn't it determine
> the field type and apply the corresponding analyzers/tokenizers/etc.?
>
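
The bq workaround mentioned in the quoted text could look roughly like the
sketch below. The exact boost query used in the original setup is not shown
in the thread, so the bq form here is an assumption, and build_params is a
hypothetical helper name.

```python
def build_params(user_input):
    """Build dismax parameters that boost exact matches on product_name_un.

    The main query runs against the tokenized field only; the boost query
    wraps the raw input in quotes so the untokenized field sees it whole.
    """
    # Escape backslashes and double quotes so the input stays one phrase.
    escaped = user_input.replace("\\", "\\\\").replace('"', '\\"')
    return {
        "defType": "dismax",
        "q": user_input,
        "qf": "product_name",
        "bq": f'product_name_un:"{escaped}"^2.0',  # assumed form of the boost
    }


params = build_params("blue car")
```

Since bq only adds to the score of documents that already match the main
query, this boosts exact-name matches without changing which documents are
returned.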

Re: Dismax: Impossible to search for a _phrase_ in tokenized and untokenized fields at the same time

Posted by Alex Baranov <al...@gmail.com>.

I guess this is a bug that should be filed in JIRA (if it is not there
already). Should I file it?


> Hmmm, right.  This is due to the fact that the Lucene query parser
> (still actually used in dismax) breaks things up by whitespace
> *before* analysis (so the analyzer for the untokenized field never
> sees the two tokens together).
>

Is there a way to tell the Lucene parser not to break things up on
whitespace? Should one use some whitespace code instead of an actual <space>?

I think what we need here is some kind of "special quotes" that tell Solr
not to use the Lucene query parser at all (this might be very useful in
situations like this one, when the search is applied to the default field,
i.e. when the field is not specified).

> If it's OK to require both blue and car in product_name, then you can
> just set a slop for explicit phrase queries with the qs parameter.
>

Unfortunately, that doesn't work for me, but thanks for the suggestion.

Alex Baranov.
