Posted to solr-user@lucene.apache.org by Paul Forsyth <pf...@ez.no> on 2009/09/14 13:44:14 UTC

Searching for the '+' character

Hi all,

I need some help with a curious problem I can't find a solution for. I
am somewhat of a newbie with the various analyzers and handlers and
how they work together, so I'm looking for advice on how to proceed
with my issue.

I have content with text like 'product+' which has been indexed as
text. I need to search for the character '+', but try as I might I
can't do this.

From the docs it should just be a matter of escaping:

http://lucene.apache.org/java/2_4_1/queryparsersyntax.html#Escaping%20Special%20Characters

So queries like:

http://localhost:8983/solr/select/?q=\+&debugQuery=true or http://localhost:8983/solr/select/?q=\%2B&debugQuery=true

should do the trick but they don't. I get:

http://pastie.org/616055 and http://pastie.org/616052, respectively.

Only with the + URL-encoded does it appear in the output, but no
results are returned.
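
A likely reason the raw `q=\+` URL loses the plus sign: in a URL query string an unencoded '+' is decoded as a space, so only the percent-encoded form (%2B) reaches the query parser as '+'. The sketch below uses Python's standard library, not Solr itself, purely to illustrate that decoding step. The localhost URL is the one from the message above.

```python
from urllib.parse import parse_qs, quote_plus

# A raw '+' in a query string is decoded as a space, so the parser sees "\ ".
raw = parse_qs("q=\\+&debugQuery=true")["q"][0]

# Percent-encoding both the backslash and the plus keeps them intact.
encoded_value = quote_plus("\\+")
decoded = parse_qs("q=" + encoded_value + "&debugQuery=true")["q"][0]

url = "http://localhost:8983/solr/select/?q=" + encoded_value + "&debugQuery=true"
print(repr(raw), repr(decoded), url)
```

Encoding only gets the character as far as the query parser. Whether it then matches anything still depends on how the field's analyzers treat '+'.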

I believe that the + is being stripped somehow, but I'm not sure where
exactly to look.

I included the debug info from the query, but I'm not sure if the output
is helpful.

Does anyone have ideas on this issue, and how I should try to proceed?

Many thanks,

Paul

Re: Searching for the '+' character

Posted by Paul Forsyth <pf...@ez.no>.

Interesting. I thought that would be the 'hard' approach rather than
adding a filter, but I guess that's all it really is anyway.

Has this been done before? Building a filter to transform a word there
and back?

On 14 Sep 2009, at 17:17, Chantal Ackermann wrote:

>
>
> Paul Forsyth schrieb:
>> Hi Erick,
>> In this specific case my client does have a new product with a '+' at
>> the end. Its just one of those odd ones!
>> Customers are expected to put + into the search box so i have to have
>> results to show.
>> I hear your concerns though. Originally i thought I would need to
>> transform the + into something else, and do this back and forwards to
>> get a match!
>
> sorry for jumping into the discussion with my little knowledge - but  
> I actually think transforming the '+' into something else in the  
> index (something like 'pluzz' that has a low probability to appear  
> as such in the regular input) is a good solution. You just have to  
> do the same on the query side. You could have your own filter for  
> that to put it in the schema or just do it "manually" at index and  
> query time.
>
> is that a possibility?
>
> Chantal
>

Best regards,

Paul Forsyth

mail: pf@ez.no
skype: paulforsyth

Re: Searching for the '+' character

Posted by Chantal Ackermann <ch...@btelligent.de>.


Paul Forsyth schrieb:
> Hi Erick,
> 
> In this specific case my client does have a new product with a '+' at
> the end. Its just one of those odd ones!
> 
> Customers are expected to put + into the search box so i have to have
> results to show.
> 
> I hear your concerns though. Originally i thought I would need to
> transform the + into something else, and do this back and forwards to
> get a match!

Sorry for jumping into the discussion with my limited knowledge, but I
actually think transforming the '+' into something else in the index
(something like 'pluzz' that is unlikely to appear as such in
the regular input) is a good solution. You just have to do the same on
the query side. You could write your own filter for that and put it in
the schema, or just do it "manually" at index and query time.

Is that a possibility?
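
A minimal sketch of the "manual" variant, assuming the application rewrites text before sending it to Solr at both index and query time. The function names are hypothetical and 'pluzz' is simply the placeholder suggested above.

```python
PLACEHOLDER = "pluzz"  # hypothetical marker, unlikely to occur in real input


def encode_plus(text):
    """Replace every '+' with the placeholder before indexing or querying."""
    return text.replace("+", PLACEHOLDER)


def decode_plus(text):
    """Restore the '+' when showing stored text back to the user."""
    return text.replace(PLACEHOLDER, "+")


indexed = encode_plus("product+")   # what gets sent to the index
query = encode_plus("product+")     # what gets sent as the query term
print(indexed, query == indexed, decode_plus(indexed))
```

Note that a bare '+' query becomes 'pluzz', which would still have to match inside the token 'productpluzz', so this alone only covers searches for the full word.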

Chantal

Re: Searching for the '+' character

Posted by Paul Forsyth <pf...@ez.no>.

Hi Erick,

In this specific case my client does have a new product with a '+' at
the end. It's just one of those odd ones!

Customers are expected to put + into the search box, so I have to have
results to show.

I hear your concerns though. Originally I thought I would need to
transform the + into something else, and do this back and forth to
get a match!

Hopefully this will be a standard Solr install, but with this tweak
for escaped chars....

Paul

On 14 Sep 2009, at 17:01, Erick Erickson wrote:

> Before you go too much further with this, I've just got to ask  
> whetherthe
> use case for searching "product+" really serves your customers.
> If you mess around with analyzers to make things include the "+",
> what does that mean for "&"? "*"? "."? any other weird character
> you can think of?
>
> Would it be a bad thing for "product" to match "product+" and vice
> versa? Would it be more or less confusing for your users to have  
> "product"
> FAIL to match "product+"?
>
> Of course only you really know your problem space, but think carefully
> about this issue before you take on the work of making "product+" work
> because it'll inevitably be waaaay more work than you think. Imagine  
> the
> bug reports when "product&" fails to match "product+", both of which
> fail to match "product"....
>
> I'd also get a copy of Luke and look at the index to be sure what you
> *think*
> is in there is *actually* there. It'll also help you understand what
> analyzers
> do better.
>
> Don't forget that using different analyzers when indexing and  
> querying will
> lead to...er..."interesting" results.
>
> Best
> Erick

Best regards,

Paul Forsyth

mail: pf@ez.no
skype: paulforsyth

Re: Searching for the '+' character

Posted by Erick Erickson <er...@gmail.com>.

Before you go too much further with this, I've just got to ask whether the
use case for searching "product+" really serves your customers.
If you mess around with analyzers to make things include the "+",
what does that mean for "&"? "*"? "."? any other weird character
you can think of?

Would it be a bad thing for "product" to match "product+" and vice
versa? Would it be more or less confusing for your users to have "product"
FAIL to match "product+"?

Of course only you really know your problem space, but think carefully
about this issue before you take on the work of making "product+" work
because it'll inevitably be waaaay more work than you think. Imagine the
bug reports when "product&" fails to match "product+", both of which
fail to match "product"....

I'd also get a copy of Luke and look at the index to be sure what you
*think* is in there is *actually* there. It'll also help you better
understand what analyzers do.

Don't forget that using different analyzers when indexing and querying will
lead to...er..."interesting" results.

Best
Erick

On Mon, Sep 14, 2009 at 11:38 AM, Paul Forsyth <pf...@ez.no> wrote:

> Thanks Ahmet,
>
> Thats excellent, thanks :) I may have to increase the gramsize to take into
> account other possible uses but i can now read around these filters to make
> the adjustments.
>
> With regard to WordDelimiterFilterFactory. Is there a way to place a
> delimiter on this filter to still get most of its functionality without it
> absorbing the + signs? Will i loose a lot of 'good' functionality by
> removing it? 'preserveOriginal' sounds promising and seems to work but is it
> a good idea to use this?

Re: Searching for the '+' character

Posted by Matt Weber <ma...@mattweber.org>.

Why don't you create a synonym for + that expands to your customer's
product name that includes the plus? You can even have your front end
(FE) do this sort of replacement BEFORE submitting to Solr.
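
A rough sketch of the front-end replacement, assuming whitespace-separated search terms. The mapping and the function name are hypothetical, and 'product+' stands in for the real product name.

```python
# Hypothetical mapping from a bare symbol to the full product name.
REWRITES = {"+": "product+"}


def rewrite_query(user_input):
    """Expand bare symbols typed by the user before the query goes to Solr."""
    terms = user_input.split()
    return " ".join(REWRITES.get(term, term) for term in terms)


print(rewrite_query("+"))
print(rewrite_query("cheap +"))
```

The rewritten term would still need to be escaped for the query parser and must match how the product name was indexed.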

Thanks,

Matt Weber

On Sep 14, 2009, at 11:42 AM, AHMET ARSLAN wrote:

>> Thanks Ahmet,
>>
>> Thats excellent, thanks :) I may have to increase the
>> gramsize to take into account other possible uses but i can
>> now read around these filters to make the adjustments.
>>
>> With regard to WordDelimiterFilterFactory. Is there a way
>> to place a delimiter on this filter to still get most of its
>> functionality without it absorbing the + signs?
>
> Yes you are right, preserveOriginal="1" will causes the original  
> token to be indexed without modifications.
>
>> Will i loose a lot of 'good' functionality by removing it?
>
> It depends of your input data. It is used to break one token into  
> subwords.
> Like: "Wi-Fi" -> "Wi", "Fi" and "PowerShot" -> "Power", "Shot"
> If you input data set contains such words, you may need it.
>
> But I think just to make last character searchable, using  
> NGramFilter(s) is not an optimal solution. I don't know what type of  
> dataset you have but, I think using separate two fields (with  
> different types) for that is more suitable. One field will contain  
> actual data itself. The other will hold only the last character(s).
>
> You can achieve this by a copyField or programatically during  
> indexing. The type of the field lastCharsField will be using  
> EdgeNGramFilter so that only last character of token(s) will pass  
> that filter.
>
> During searching you will search those two fields:
> originalField:\+ OR lastCharsField:\+
>
> The query lastCharsField:\+ will return you all the products ending  
> with +.
>
> Hope this helps.
>
>
>
>
>

Re: Searching for the '+' character

Posted by AHMET ARSLAN <io...@yahoo.com>.

> Thanks Ahmet,
> 
> Thats excellent, thanks :) I may have to increase the
> gramsize to take into account other possible uses but i can
> now read around these filters to make the adjustments.
> 
> With regard to WordDelimiterFilterFactory. Is there a way
> to place a delimiter on this filter to still get most of its
> functionality without it absorbing the + signs? 

Yes, you are right: preserveOriginal="1" will cause the original token to be indexed without modifications.

> Will i loose a lot of 'good' functionality by removing it?

It depends on your input data. The filter is used to break one token into subwords,
for example "Wi-Fi" -> "Wi", "Fi" and "PowerShot" -> "Power", "Shot".
If your input data set contains such words, you may need it.

But I think that using NGramFilter(s) just to make the last character searchable is not an optimal solution. I don't know what type of dataset you have, but I think using two separate fields (with different types) is more suitable. One field will contain the actual data itself. The other will hold only the last character(s).

You can achieve this with a copyField or programmatically during indexing. The type of the field lastCharsField will use EdgeNGramFilter so that only the last character of each token passes that filter.

When searching, you query both fields:
originalField:\+ OR lastCharsField:\+

The query lastCharsField:\+ will return all the products ending with +.
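
As a sketch of how a client might build that two-field query, the helper below escapes the characters listed as special in the Lucene query parser syntax docs and joins the field clauses with OR. The function names are hypothetical; the field names are the ones used above.

```python
import re

# Characters (and the two-character operators) the Lucene query parser
# treats as syntax, per the "Escaping Special Characters" docs.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\]|&&|\|\|)')


def escape_lucene(term):
    """Prefix each special character or operator with a backslash."""
    return _SPECIAL.sub(r"\\\1", term)


def two_field_query(term, fields=("originalField", "lastCharsField")):
    """Build 'fieldA:term OR fieldB:term' with the term escaped."""
    escaped = escape_lucene(term)
    return " OR ".join(f"{field}:{escaped}" for field in fields)


print(two_field_query("+"))
```

The resulting string still has to be URL-encoded before it is placed in the q parameter.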

Hope this helps.

Re: Searching for the '+' character

Posted by Paul Forsyth <pf...@ez.no>.

Thanks Ahmet,

That's excellent, thanks :) I may have to increase the gram size to take
into account other possible uses, but I can now read around these
filters to make the adjustments.

With regard to WordDelimiterFilterFactory: is there a way to place a
delimiter on this filter to still get most of its functionality
without it absorbing the + signs? Will I lose a lot of 'good'
functionality by removing it? 'preserveOriginal' sounds promising and
seems to work, but is it a good idea to use this?

On 14 Sep 2009, at 16:16, AHMET ARSLAN wrote:

>
>
> --- On Mon, 9/14/09, Paul Forsyth <pf...@ez.no> wrote:
>
>> From: Paul Forsyth <pf...@ez.no>
>> Subject: Re: Searching for the '+' character
>> To: solr-user@lucene.apache.org
>> Date: Monday, September 14, 2009, 5:55 PM
>> With words like 'product+' i'd expect
>> a search for '+' to return results like any other character
>> or word, so '+' would be found within 'product+' or similar
>> text.
>>
>> I've tried removing the worddelimiter from the query
>> analyzer, restarting and reindexing but i get the same
>> result. Nothing is found. I assume one of the filters could
>> be adjusted to keep the '+'.
>>
>> Weird thing is that i tried to remove all filters from the
>> analyzer and i get the same result.
>>
>> Paul
>
> When you remove all filters '+' is kept, but still '+' won't match  
> 'product+'. Because you want to search inside a token.
>
> If + sign is always at the end of of your text, and you want to  
> search only last character of your text EdgeNGramFilterFactory can  
> do that.
> with the settings side="back" maxGramSize="1" minGramSize="1"
>
> The fieldType below will match '+' to 'product+'
>
> <fieldType name="textx" class="solr.TextField"  
> positionIncrementGap="100">
>      <analyzer type="index">
>        <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>        <filter class="ISOLatin1AccentFilterFactory"/>
>        <filter class="solr.SnowballPorterFilterFactory"  
> language="English"/>
> 	<filter class="solr.EdgeNGramFilterFactory" side="back"  
> maxGramSize="1" minGramSize="1"/>
>      </analyzer>
>      <analyzer type="query">
>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>        <filter class="solr.SynonymFilterFactory"  
> synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>        <filter class="ISOLatin1AccentFilterFactory"/>
>        <filter class="solr.SnowballPorterFilterFactory"  
> language="English"/>
>      </analyzer>
>    </fieldType>
>
>
> But this time 'product+' will be reduced to only '+'. You won't be  
> able to search it otherways for example product*. Along with the  
> last character, if you want to keep the original word it self you  
> can set maxGramSize to 512. By doing this token 'product+' will  
> produce 8 tokens: (and query product* or product+ will return it )
>
> + word
> t+ word
> ct+ word
> uct+ word
> duct+ word
> oduct+ word
> roduct+ word
> product+ word
>
> If + sign can be anywhere inside the text you can use  
> NGramTokenFilter.
> Hope this helps.
>
>
>

Best regards,

Paul Forsyth

mail: pf@ez.no
skype: paulforsyth

Re: Searching for the '+' character

Posted by AHMET ARSLAN <io...@yahoo.com>.


--- On Mon, 9/14/09, Paul Forsyth <pf...@ez.no> wrote:

> From: Paul Forsyth <pf...@ez.no>
> Subject: Re: Searching for the '+' character
> To: solr-user@lucene.apache.org
> Date: Monday, September 14, 2009, 5:55 PM
> With words like 'product+' i'd expect
> a search for '+' to return results like any other character
> or word, so '+' would be found within 'product+' or similar
> text.
> 
> I've tried removing the worddelimiter from the query
> analyzer, restarting and reindexing but i get the same
> result. Nothing is found. I assume one of the filters could
> be adjusted to keep the '+'.
> 
> Weird thing is that i tried to remove all filters from the
> analyzer and i get the same result.
> 
> Paul

When you remove all filters, '+' is kept, but '+' still won't match 'product+', because you are trying to search inside a token.

If the + sign is always at the end of your text, and you want to search only the last character of your text, EdgeNGramFilterFactory can do that
with the settings side="back" maxGramSize="1" minGramSize="1".

The fieldType below will match '+' to 'product+':

<fieldType name="textx" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <!-- back edge n-gram of size 1: keeps only the last character of each token -->
    <filter class="solr.EdgeNGramFilterFactory" side="back" maxGramSize="1" minGramSize="1"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>


But this time 'product+' will be reduced to just '+', so you won't be able to find it any other way, for example with product*. If you want to keep the original word itself along with the last character, you can set maxGramSize to 512. The token 'product+' will then produce the 8 tokens below, and the queries product* or product+ will return it:

+ word
t+ word
ct+ word
uct+ word
duct+ word
oduct+ word
roduct+ word
product+ word

If the + sign can appear anywhere inside the text, you can use NGramTokenFilter instead.
Hope this helps.
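
To make the token list above easier to check, here is a rough sketch in Python of what the two n-gram settings do to a single token. It is not Solr code: the function names back_edge_ngrams and all_ngrams are made up for illustration, and the sketch assumes the earlier filters leave 'product+' unchanged.

```python
def back_edge_ngrams(token, min_gram=1, max_gram=1):
    """Suffixes of `token` whose lengths run from min_gram to max_gram."""
    longest = min(max_gram, len(token))
    return [token[-n:] for n in range(min_gram, longest + 1)]


def all_ngrams(token, min_gram=1, max_gram=2):
    """Every substring of `token` with a length from min_gram to max_gram."""
    grams = []
    for size in range(min_gram, min(max_gram, len(token)) + 1):
        for start in range(len(token) - size + 1):
            grams.append(token[start:start + size])
    return grams


# Mirrors side="back" with maxGramSize="1": only the last character survives.
print(back_edge_ngrams("product+"))

# Mirrors side="back" with maxGramSize="512": every suffix, full word included.
print(back_edge_ngrams("product+", max_gram=512))

# Plain n-grams also cover a '+' that sits in the middle of a token.
print("+" in all_ngrams("pro+duct"))
```

The second call is the one that corresponds to the 8 tokens listed above. The price of the larger maxGramSize is a bigger index, since every indexed word contributes one token per suffix.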

Re: Searching for the '+' character

Posted by Paul Forsyth <pf...@ez.no>.

With words like 'product+' I'd expect a search for '+' to return
results like any other character or word, so '+' would be found within
'product+' or similar text.

I've tried removing the WordDelimiterFilter from the query analyzer,
restarting and reindexing, but I get the same result: nothing is found.
I assume one of the filters could be adjusted to keep the '+'.

The weird thing is that I tried removing all filters from the analyzer
and I still get the same result.

Paul

On 14 Sep 2009, at 15:17, AHMET ARSLAN wrote:

>> Hi Ahmet,
>>
>> I believe its the WhitespaceTokenizerFactory, but i may be
>> wrong.
>>
>> I've pasted the schema.xml into http://pastie.org/616162
>>
>
> I looked at your field type named text.
>
> WordDelimiterFilterFactory is eating up '+'
>
> You can use .../solr/admin/analysis.jsp tool to see behaviour of  
> each tokenizer/tokenfilter for particular input.
>
> But more importantly do you want to return documents containing  
> 'product+' by searching '+'? You said you need to search for the  
> character '+'. What that query supposed to return back?
>
>
>

Best regards,

Paul Forsyth

mail: pf@ez.no
skype: paulforsyth

Re: Searching for the '+' character

Posted by AHMET ARSLAN <io...@yahoo.com>.

> Hi Ahmet,
> 
> I believe its the WhitespaceTokenizerFactory, but i may be
> wrong.
> 
> I've pasted the schema.xml into http://pastie.org/616162
> 

I looked at your field type named text.

WordDelimiterFilterFactory is eating up the '+'.

You can use the .../solr/admin/analysis.jsp tool to see the behaviour of each tokenizer/token filter for a particular input.

But more importantly, do you want to return documents containing 'product+' by searching for '+'? You said you need to search for the character '+'. What is that query supposed to return?
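
As a very simplified illustration of why the '+' disappears, the Python sketch below keeps only runs of ASCII letters and digits, which is roughly how the delimiter-splitting part of WordDelimiterFilter treats non-alphanumeric characters by default. The function name split_on_delimiters is invented for this example, and the real filter does a lot more (case-change splitting, number handling, catenation options), so treat this as an approximation only.

```python
import re


def split_on_delimiters(token):
    """Keep only runs of ASCII letters and digits; everything else is dropped."""
    return re.findall(r"[A-Za-z0-9]+", token)


# The trailing '+' is treated as a delimiter and thrown away at index time.
print(split_on_delimiters("product+"))

# A query consisting only of '+' is left with no tokens at all.
print(split_on_delimiters("+"))
```

If the same filter is in the query analyzer, a search for '+' ends up with nothing to look for, which would fit the empty result you are seeing. The analysis.jsp page remains the reliable way to confirm what your actual field type does.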

Re: Searching for the '+' character

Posted by Paul Forsyth <pf...@ez.no>.

Hi Ahmet,

I believe it's the WhitespaceTokenizerFactory, but I may be wrong.

I've pasted the schema.xml into http://pastie.org/616162



On 14 Sep 2009, at 14:29, AHMET ARSLAN wrote:

>> Hi all,
>>
>> I need some help with a curious problem i can't find a
>> solution for. I am somewhat of a newbie with the various
>> analyzers and handlers and how they work together, so im
>> looking for advice on how to proceed with my issue.
>>
>> I have content with text like 'product+' which has been
>> indexed as text. I need to search for the character '+', but
>> try as I might i can't do this.
>>
>> From the docs it should just be a matter of escaping:
>
>> I believe that the + is being stripped somehow but im not
>> sure where exactly to look.
>
> I think your analyzer is eating up +, which tokenizer are you using  
> in it?
> Do you want to return documents containing 'product+' by searching  
> '+'?
>
>
>
>

Best regards,

Paul Forsyth

mail: pf@ez.no
skype: paulforsyth

Re: Searching for the '+' character

Posted by AHMET ARSLAN <io...@yahoo.com>.

> Hi all,
> 
> I need some help with a curious problem i can't find a
> solution for. I am somewhat of a newbie with the various
> analyzers and handlers and how they work together, so im
> looking for advice on how to proceed with my issue.
> 
> I have content with text like 'product+' which has been
> indexed as text. I need to search for the character '+', but
> try as I might i can't do this.
> 
> From the docs it should just be a matter of escaping:

> I believe that the + is being stripped somehow but im not
> sure where exactly to look.

I think your analyzer is eating up the +. Which tokenizer are you using in it?
Do you want to return documents containing 'product+' by searching for '+'?
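
Separately from the analyzer, the two URLs in the original post do not send the same query. In a URL query string a literal '+' is decoded as a space, so only the %2B form actually delivers a '+' to the query parser. The Python sketch below shows the decoding, with urllib standing in for the servlet container (that substitution is an assumption, and decoded_q is a made-up helper name).

```python
from urllib.parse import parse_qs, quote


def decoded_q(query_string):
    """Return the value of the q parameter after URL decoding."""
    return parse_qs(query_string)["q"][0]


# q=\+ : the '+' becomes a space, so the parser sees a backslash and a space.
print(repr(decoded_q("q=\\+")))

# q=\%2B : the '+' survives, so the parser sees an escaped plus sign.
print(repr(decoded_q("q=\\%2B")))

# Encoding both characters is the safest way to build the parameter.
print("q=" + quote("\\+", safe=""))
```

That would explain why the '+' only shows up in the debug output for the URL-encoded request. Getting the character through to the parser still does not help if the analyzer then drops it.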