
Posted to solr-user@lucene.apache.org by Audrey Foo <au...@hotmail.com> on 2009/10/12 17:39:39 UTC

capitalization and delimiters


In my search docs, I have content such as 'powershot' and 'powerShot'.
I would expect a query for 'powerShot' to be searched as 'power', 'shot' and 'powershot', so that results for all of these are returned. Instead, only results for 'power' and 'shot' are returned.
Any suggestions?
In the schema, index analyzer:
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
  <filter class="solr.LowerCaseFilterFactory"/>
In the schema, query analyzer:
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>

Thanks
Audrey

Re: capitalization and delimiters

Posted by Yonik Seeley <yo...@lucidimagination.com>.

On Fri, Oct 16, 2009 at 2:20 PM, Shalin Shekhar Mangar
<sh...@gmail.com> wrote:
> On Fri, Oct 16, 2009 at 9:56 PM, Audrey Foo <au...@hotmail.com> wrote:
>
>>
>> Hi Shalin
>> I mixed up and sent the wrong schema, one that I had been testing with.
>> I was using the same configuration as the example schema with the same
>> results. I re-tested by re-indexing just to confirm. Also, yes I do have
>> lowercase factory after the word delimiter.
>> powerShot does not return the results for 'powershot' only for power and
>> shot.
>> If I switch lowercase factory before word delimiter, then I do get the
>> results for powershot, but may not get the results if just searching 'power'
>> or 'shot'.
>>
>
> OK, thanks for the clarification. You need to add preserveOriginal="1" to
> your index-time WDF configuration. This will index the original token as
> well as the parts so that all of "powershot", "power" and "shot" should
> match "powerShot".

That's not the problem... the WDF config in the example server splits
and catenates... no need for preserving the original.

The issue is that a query of "powershot" or "power shot" would match
an index with "PowerShot" or "power-shot".
But if the index contains "powershot", then a query of "powerShot"
will be split to "power Shot" and not match.
It's a known limitation on the query side (can't both catenate and
split on the query side).

-Yonik
http://www.lucidimagination.com
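
The asymmetry described above can be reproduced with a small toy model. This is only a sketch and not Solr's implementation: the function names (analyze, index_terms, query_terms, matches) are hypothetical, the splitting rule is a simplified stand-in for the word delimiter filter, and a query is assumed to match when every one of its terms appears among the indexed terms. The index side is assumed to both split and catenate, the query side only to split.

```python
import re


def analyze(text, generate_parts, catenate_words):
    """Toy stand-in for whitespace tokenizing, word-delimiter splitting and lowercasing."""
    terms = set()
    for token in text.split():
        # Split on non-alphanumeric characters and on lower-to-upper case changes.
        # Splitting on the empty case-change match requires Python 3.7 or later.
        pieces = re.split(r"[^A-Za-z0-9]+|(?<=[a-z])(?=[A-Z])", token)
        parts = [p for p in pieces if p]
        if generate_parts:
            terms.update(p.lower() for p in parts)
        if catenate_words and parts:
            terms.add("".join(parts).lower())
    return terms


def index_terms(text):
    # Assumed index-time behaviour: emit the parts and the catenated word.
    return analyze(text, generate_parts=True, catenate_words=True)


def query_terms(text):
    # Assumed query-time behaviour: emit the parts only, no catenation.
    return analyze(text, generate_parts=True, catenate_words=False)


def matches(query, document):
    # Simplification: every query term must be present in the indexed terms.
    return query_terms(query) <= index_terms(document)


if __name__ == "__main__":
    for query, document in [
        ("powershot", "PowerShot"),
        ("power shot", "power-shot"),
        ("powerShot", "powershot"),
    ]:
        print(query, "vs", document, "->", matches(query, document))
```

In this model the first two pairs match and the third does not: the indexed "powershot" yields a single term, while the query "powerShot" is split into "power" and "shot", neither of which is in the index.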

Re: capitalization and delimiters

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.

On Fri, Oct 16, 2009 at 9:56 PM, Audrey Foo <au...@hotmail.com> wrote:

>
> Hi Shalin
> I mixed up and sent the wrong schema, one that I had been testing with.
> I was using the same configuration as the example schema with the same
> results. I re-tested by re-indexing just to confirm. Also, yes I do have
> lowercase factory after the word delimiter.
> powerShot does not return the results for 'powershot' only for power and
> shot.
> If I switch lowercase factory before word delimiter, then I do get the
> results for powershot, but may not get the results if just searching 'power'
> or 'shot'.
>

OK, thanks for the clarification. You need to add preserveOriginal="1" to
your index-time WDF configuration. This will index the original token as
well as the parts so that all of "powershot", "power" and "shot" should
match "powerShot". Make sure you re-index after making the changes.

-- 
Regards,
Shalin Shekhar Mangar.

RE: capitalization and delimiters

Posted by Audrey Foo <au...@hotmail.com>.

Hi Shalin,
I mixed up and sent the wrong schema, one that I had been testing with.
I was using the same configuration as the example schema, with the same results. I re-tested by re-indexing just to confirm. Also, yes, I do have the lowercase filter factory after the word delimiter.
A search for 'powerShot' does not return the results for 'powershot', only those for 'power' and 'shot'.
If I move the lowercase filter factory before the word delimiter, then I do get the results for 'powershot', but may not get results when searching for just 'power' or 'shot'.

Thanks
Audrey

> Date: Wed, 14 Oct 2009 23:28:46 +0530
> Subject: Re: capitalization and delimiters
> From: shalinmangar@gmail.com
> To: solr-user@lucene.apache.org
> CC: aufmy@hotmail.com
> 
> On Mon, Oct 12, 2009 at 9:09 PM, Audrey Foo <au...@hotmail.com> wrote:
> 
> >
> > In my search docs, I have content such as 'powershot' and 'powerShot'.
> > I would expect 'powerShot' would be searched as 'power', 'shot' and
> > 'powershot', so that results for all these are returned. Instead, only
> > results for 'power' and 'shot' are returned.
> > Any suggestions?
> > In the schema, index analyzer:<filter
> > class="solr.WordDelimiterFilterFactory" generateWordParts="0"
> > generateNumberParts="0" catenateWords="1" catenateNumbers="1"
> > catenateAll="0"/><filter class="solr.LowerCaseFilterFactory"/>
> > In the schema, query analyzer<filter
> > class="solr.WordDelimiterFilterFactory" generateWordParts="1"
> > generateNumberParts="1" catenateWords="0" catenateNumbers="0"
> > catenateAll="0" splitOnCaseChange="1"/><filter
> > class="solr.LowerCaseFilterFactory"/>
> >
> 
> I find your index-time and query-time configuration very strange. Assuming
> that you also have a lowercase filter, it seems that a token "powerShot"
> will not be split and indexed as "powershot". Then during query, both
> "power" and "shot" will match nothing.
> 
> I suggest you start with the configuration given in the example schema.
> Else, it'd be easier for us if you can help us understand the reasons behind
> changing these parameters.
> 
> -- 
> Regards,
> Shalin Shekhar Mangar.
 		 	   		  

Re: capitalization and delimiters

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.

On Mon, Oct 12, 2009 at 9:09 PM, Audrey Foo <au...@hotmail.com> wrote:

>
> In my search docs, I have content such as 'powershot' and 'powerShot'.
> I would expect 'powerShot' would be searched as 'power', 'shot' and
> 'powershot', so that results for all these are returned. Instead, only
> results for 'power' and 'shot' are returned.
> Any suggestions?
> In the schema, index analyzer:<filter
> class="solr.WordDelimiterFilterFactory" generateWordParts="0"
> generateNumberParts="0" catenateWords="1" catenateNumbers="1"
> catenateAll="0"/><filter class="solr.LowerCaseFilterFactory"/>
> In the schema, query analyzer<filter
> class="solr.WordDelimiterFilterFactory" generateWordParts="1"
> generateNumberParts="1" catenateWords="0" catenateNumbers="0"
> catenateAll="0" splitOnCaseChange="1"/><filter
> class="solr.LowerCaseFilterFactory"/>
>

I find your index-time and query-time configuration very strange. Assuming
that you also have a lowercase filter, it seems that a token "powerShot"
will not be split and will instead be indexed as the single term
"powershot". Then at query time, neither "power" nor "shot" will match
anything.

I suggest you start with the configuration given in the example schema.
Otherwise, it would be easier for us if you could help us understand the
reasons behind changing these parameters.

-- 
Regards,
Shalin Shekhar Mangar.