
Posted to solr-user@lucene.apache.org by Kevin Osborn <os...@yahoo.com> on 2010/02/13 07:13:33 UTC

parsing strings into phrase queries

Right now if I have the query model:(Nokia BH-212V), the parser turns this into +(model:nokia model:"bh 212 v"). The problem is that I might have a model called Nokia BH-212, so this is completely missed. In my case, I would like my query to be +(model:nokia model:bh model:212 model:v).

This is my schema for the field:

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="true" />
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory" protected="protwords.txt" />
        <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.SynonymFilterFactory" synonyms="query_synonyms.txt" ignoreCase="true" expand="true" />
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory" protected="protwords.txt" />
        <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
      </analyzer>
    </fieldType>

Re: parsing strings into phrase queries

Posted by Otis Gospodnetic <ot...@yahoo.com>.

This sounds useful to me!
Here's a pointer: http://wiki.apache.org/solr/HowToContribute

Thanks!
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/

________________________________
From: Kevin Osborn <os...@yahoo.com>
To: solr-user@lucene.apache.org
Sent: Thu, February 18, 2010 1:15:11 PM
Subject: Re: parsing strings into phrase queries

The PositionFilter worked great for my purpose, along with another filter that I built.

In my case, my indexed data may be something like "X150", so a query for "Nokia X150" should match, but I don't want random matches on "x". However, if my indexed data is "G7", I do want a query on "PowerShot G7" to match on "g" and "7", so a simple length filter will not do. Instead I built a custom filter (that I am willing to contribute back) that filters out singletons that are surrounded by longer tokens (3 or more by default). So "PowerShot G7" becomes "power" "shot" "g" "7", but "Nokia X150" becomes "nokia" "150".

I then put the results of this into a PositionFilter. This allows "Nokia X150ABC" to match against the "X150" part. So far I really like this for partial part number searches. To boost exact matches, I used copyField to create another field without PositionFilter and then ran an optional phrase query on that.

________________________________
From: Lance Norskog <go...@gmail.com>
To: solr-user@lucene.apache.org
Sent: Wed, February 17, 2010 7:23:23 PM
Subject: Re: parsing strings into phrase queries

That would be great. After reading this and the PositionFilter class I
still don't know how to use it.

On Wed, Feb 17, 2010 at 12:38 PM, Robert Muir <rc...@gmail.com> wrote:
> i think we can improve the docs/wiki to show this example use case, i
> noticed the wiki explanation for this filter gives a more complex shingles
> example, which is interesting, but this seems to be a common problem and
> maybe we should add this use case.
>
> On Wed, Feb 17, 2010 at 1:54 PM, Chris Hostetter
> <ho...@fucit.org>wrote:
>
>>
>> : take a look at PositionFilter
>>
>> Right, there was another thread recently where almost the exact same issue
>> was discussed...
>>
>> http://old.nabble.com/Re%3A-Tokenizer-question-p27120836.html
>>
>> ..except that i was ignorant of the existence of PositionFilter when i
>> wrote that message.
>>
>>
>>
>> -Hoss
>>
>>
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>

-- 
Lance Norskog
goksron@gmail.com

Re: parsing strings into phrase queries

Posted by Kevin Osborn <os...@yahoo.com>.

The PositionFilter worked great for my purpose, along with another filter that I built.

In my case, my indexed data may be something like "X150", so a query for "Nokia X150" should match, but I don't want random matches on "x". However, if my indexed data is "G7", I do want a query on "PowerShot G7" to match on "g" and "7", so a simple length filter will not do. Instead I built a custom filter (that I am willing to contribute back) that filters out singletons that are surrounded by longer tokens (3 or more by default). So "PowerShot G7" becomes "power" "shot" "g" "7", but "Nokia X150" becomes "nokia" "150".

I then put the results of this into a PositionFilter. This allows "Nokia X150ABC" to match against the "X150" part. So far I really like this for partial part number searches. To boost exact matches, I used copyField to create another field without PositionFilter and then ran an optional phrase query on that.
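
For readers trying to reproduce this setup, here is a rough sketch of how the pieces described above might be wired together in schema.xml. It is not the actual configuration from this message: the type names text_parts and text_exact, the field names model and model_exact, and the class com.example.SingletonFilterFactory (a stand-in for the custom filter, which was not posted) are all hypothetical, and the analyzer chains are trimmed down from the original schema. It also assumes the custom filter runs only at query time.

```xml
<!-- Sketch only. Hypothetical names: text_parts, text_exact, model, model_exact,
     com.example.SingletonFilterFactory (stand-in for the custom filter). -->

<!-- Partial-match type: the query analyzer ends with the custom filter,
     followed by PositionFilter. -->
<fieldType name="text_parts" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" generateWordParts="1" generateNumberParts="1" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" generateWordParts="1" generateNumberParts="1" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="com.example.SingletonFilterFactory" />
    <filter class="solr.PositionFilterFactory" />
  </analyzer>
</fieldType>

<!-- Exact-match type: one analyzer for index and query, no PositionFilter,
     so token positions are preserved for phrase queries. -->
<fieldType name="text_exact" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" generateWordParts="1" generateNumberParts="1" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>

<field name="model" type="text_parts" indexed="true" stored="true" />
<field name="model_exact" type="text_exact" indexed="true" stored="false" />

<!-- Index the same source text into the exact-match field. -->
<copyField source="model" dest="model_exact" />
```

A query along the lines of +model:(Nokia X150) model_exact:"Nokia X150"^5 would then require the partial match and treat the phrase clause as an optional boost. The ^5 is an arbitrary value chosen for illustration.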

Re: parsing strings into phrase queries

Posted by Lance Norskog <go...@gmail.com>.

Thanks Robert, that helped.

On Thu, Feb 18, 2010 at 5:48 AM, Robert Muir <rc...@gmail.com> wrote:
> i gave it a rough shot Lance, if there's a better way to explain it, please
> edit



-- 
Lance Norskog
goksron@gmail.com

Re: parsing strings into phrase queries

Posted by Robert Muir <rc...@gmail.com>.

i gave it a rough shot Lance, if there's a better way to explain it, please
edit



-- 
Robert Muir
rcmuir@gmail.com

Re: parsing strings into phrase queries

Posted by Lance Norskog <go...@gmail.com>.

That would be great. After reading this and the PositionFilter class I
still don't know how to use it.


-- 
Lance Norskog
goksron@gmail.com

Re: parsing strings into phrase queries

Posted by Robert Muir <rc...@gmail.com>.

i think we can improve the docs/wiki to show this example use case, i
noticed the wiki explanation for this filter gives a more complex shingles
example, which is interesting, but this seems to be a common problem and
maybe we should add this use case.

-- 
Robert Muir
rcmuir@gmail.com

Re: parsing strings into phrase queries

Posted by Chris Hostetter <ho...@fucit.org>.

: take a look at PositionFilter

Right, there was another thread recently where almost the exact same issue 
was discussed...

http://old.nabble.com/Re%3A-Tokenizer-question-p27120836.html

..except that i was ignorant of the existence of PositionFilter when i 
wrote that message.



-Hoss

Re: parsing strings into phrase queries

Posted by Robert Muir <rc...@gmail.com>.

take a look at PositionFilter
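
For anyone landing here without context, a minimal sketch of what this suggestion might look like when applied to the query analyzer from the original post. It assumes a Solr release that ships solr.PositionFilterFactory. The only change is the last filter; everything else is copied from the posted schema. By default PositionFilter sets the position increment of every token after the first to 0, so the parts that WordDelimiterFilter produces from BH-212V end up stacked at one position and the query parser should build optional term clauses instead of the phrase "bh 212 v".

```xml
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory" />
  <filter class="solr.SynonymFilterFactory" synonyms="query_synonyms.txt" ignoreCase="true" expand="true" />
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
  <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" />
  <filter class="solr.LowerCaseFilterFactory" />
  <filter class="com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory" protected="protwords.txt" />
  <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
  <!-- Added: place all tokens after the first at the same position, so split
       parts become alternative terms rather than a phrase. -->
  <filter class="solr.PositionFilterFactory" />
</analyzer>
```

With this in place, model:(Nokia BH-212V) would be expected to parse to something close to +(model:nokia (model:bh model:212 model:v)). Running the query with debugQuery=on is the way to confirm the parsed form on a given setup.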


Re: parsing strings into phrase queries

Posted by Erick Erickson <er...@gmail.com>.

I don't see a good way to fix this without some heuristic you'd have to
implement to munge your query. There's no good way for SOLR to intuit that
what you want is a partial match in this case. If you can create some
rules like "remove any single letters after numbers in the query"
that would be "good enough", you might get something satisfactory.
But I don't know of a way for SOLR to do this for you....

And if you *can't* write a "good enough" rule, this sounds like an
intractable problem.

Not much help I know....
