Posted to solr-user@lucene.apache.org by Greg Bowyer <gb...@shopzilla.com> on 2010/06/05 00:21:00 UTC

Help with Shingled queries

Hi all

You have an interesting and, by the looks of things, very solid project here
with Solr, however ..

I have an index that contains a large number of "phrases" that I need to
search over. Each of these phrases is fairly short, on average about 4 words
long.

The search terms that I am given to search these phrases are very long and
quite arbitrary; sometimes they are up to 25 words long.

As such, the performance of my index when built naively is erratic: some
searches are very fast, but on average they are somewhat slower.

I have attempted to improve this situation by using shingling for the phrases
and the related search queries. In my schema I have the following:


    <fieldType name="bigramed_phrase" class="solr.TextField"
               positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ShingleFilterFactory" outputUnigrams="true"
                outputUnigramIfNoNgram="true"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ShingleFilterFactory" outputUnigrams="false"
                outputUnigramIfNoNgram="true"/>
      </analyzer>
    </fieldType>

In the index, as seen with Luke, I do indeed have a large range of shingled
terms.

When I run the analyser for either query or index terms, I also see the
breakdown with the shingled terms correctly displayed.

However, when I attempt to use this in a query, I do not see the shingled
terms in the debug output. For example, with the query "short red evil fox" I
would expect to see the shingles
'short_red' 'red_evil' 'evil_fox'

but instead I get the following

"debug":{
  "rawquerystring":"short red evil fox",
  "querystring":"short red evil fox",
  "parsedquery":"+() ()",
  "parsedquery_toString":"+() ()",
  "explain":{},
  "QParser":"DisMaxQParser",
  "altquerystring":null,
  "boostfuncs":null,
  "filter_queries":["atomId:(8235 100000914 100000911 )"],
  "parsed_filter_queries":["atomId:8235 atomId:100000914 atomId:100000911"],
  "timing":{ ......

Does anyone know what I could be doing wrong here? Is it a bug in the debug
output, a stupid mistake, misconception or piece of idiocy on my part, or
something else?


Many thanks

-- Greg Bowyer



Re: Help with Shingled queries

Posted by Chris Hostetter <ho...@fucit.org>.
: the queryparser first splits on whitespace.

FWIW: Robert is referring to the LuceneQParser, and it also applies to the
DisMaxQParser ... whitespace is considered markup in those parsers unless
it's escaped or quoted.

The FieldQParser may make more sense for your use case - or you may need a
custom QParser (hard to tell)
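
A minimal sketch of what trying the FieldQParser could look like, written in
Python only to build the request parameters. The field name "phrase" and the
host in the printed URL are assumptions for illustration and do not come from
this thread; the {!field f=...} local-params prefix is what selects that
parser, so the whole string reaches the field's query analyzer in one piece.

```python
from urllib.parse import urlencode


def field_query_params(field, text):
    """Build request parameters that route `text` through the field parser.

    `field` is a hypothetical field name; use a field declared with the
    shingled fieldType in your own schema.
    """
    return {
        # {!field f=...} hands the complete string to that field's analyzer
        # instead of letting the parser split it on whitespace first.
        "q": "{!field f=%s}%s" % (field, text),
        # Ask Solr to include the parsed query in the response.
        "debugQuery": "true",
    }


field_params = field_query_params("phrase", "short red evil fox")

# Assumed local Solr URL, shown only to make the request shape concrete.
print("http://localhost:8983/solr/select?" + urlencode(field_params))
```

Whether the resulting query behaves well for 25-word inputs would still need
checking against the parsedquery shown in the debug output.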

To answer your specific question...

: > the debug output, for example with the term "short red evil fox" I would
: > expect
: > to see the shingles
: > 'short_red' 'red_evil' 'evil_fox'
: >
: > but instead I get the following
: >
: > "debug":{
: >  "rawquerystring":"short red evil fox",
: >  "querystring":"short red evil fox",
: >  "parsedquery":"+() ()",
: >  "parsedquery_toString":"+() ()",
: >  "explain":{},
: >  "QParser":"DisMaxQParser",

...you are using the DisMaxQParser, but evidently you haven't configured
the qf or pf fields, so you are getting a query that is completely empty.
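
A minimal sketch of a dismax request with qf and pf supplied, again using
Python only to assemble the parameters. The field name "phrase" is an
assumption, not something given in this thread. The same parameters could
equally be set as defaults on the request handler in solrconfig.xml.

```python
from urllib.parse import urlencode


def dismax_params(text, qf, pf=None):
    """Build dismax request parameters with explicit query fields.

    `qf` and `pf` are space-separated field lists; the names passed in
    below are hypothetical.
    """
    params = {
        "defType": "dismax",   # select the DisMaxQParser
        "q": text,
        "qf": qf,              # fields searched for each query word
        "debugQuery": "true",  # show parsedquery in the response
    }
    if pf:
        params["pf"] = pf      # fields used for whole-query phrase matching
    return params


dismax = dismax_params("short red evil fox", qf="phrase", pf="phrase")
print(urlencode(dismax))
```

With qf set, parsedquery should no longer be "+() ()", although the words
matched through qf are still analyzed one at a time.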



-Hoss


Re: Help with Shingled queries

Posted by Robert Muir <rc...@gmail.com>.
the queryparser first splits on whitespace.

so each individual word of your query (short, red, evil, fox) gets its own
tokenstream. a one-token stream has no neighbour to pair with, and therefore
isn't shingled.
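
The effect can be illustrated with a toy model in Python. This is not Solr
code: analyze() is a hypothetical stand-in for the query analyzer in the
schema above (whitespace split, lowercase, bigram shingles), and the "_"
separator simply follows the notation used in the original post.

```python
def analyze(text, output_unigrams=False, unigram_if_no_ngram=True, sep="_"):
    """Toy stand-in for the query analyzer: split, lowercase, bigram shingles."""
    tokens = text.lower().split()
    # Pair each token with its right-hand neighbour.
    bigrams = [sep.join(pair) for pair in zip(tokens, tokens[1:])]
    if output_unigrams:
        # Token ordering and positions are not modelled here.
        return tokens + bigrams
    if not bigrams and unigram_if_no_ngram:
        # Nothing to pair with, so fall back to the single word.
        return tokens
    return bigrams


query = "short red evil fox"

# The analysis page sees the whole string at once.
whole = analyze(query)

# The query parser splits on whitespace first, then analyzes each word alone.
per_word = [analyze(word) for word in query.split()]

print(whole)     # ['short_red', 'red_evil', 'evil_fox']
print(per_word)  # [['short'], ['red'], ['evil'], ['fox']]
```

This is why the analysis page shows shingles while the parsed query does not:
the two paths hand the analyzer different inputs.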



-- 
Robert Muir
rcmuir@gmail.com