Posted to solr-user@lucene.apache.org by F Knudson <fk...@lanl.gov> on 2007/09/30 21:47:37 UTC

Letter-number transitions - can this be turned off

Is there a flag to disable the letter-number transition in the
solr.WordDelimiterFilterFactory?  We are indexing category codes and
thesaurus codes, for which the letter-number transition makes no sense.  It
is bloating the index (which is already large).

Thanks
F Knudson
-- 
View this message in context: http://www.nabble.com/Letter-number-transitions---can-this-be-turned-off-tf4544769.html#a12969359
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Letter-number transitions - can this be turned off

Posted by F Knudson <fk...@lanl.gov>.
Thanks for your helpful suggestions.

I have considered other analyzers but WDF has great strengths.  I will
experiment with maintaining transitions and then consider modifying the
code.

F. Knudson



Re: Letter-number transitions - can this be turned off

Posted by Mike Klaas <mi...@gmail.com>.
On 30-Sep-07, at 12:47 PM, F Knudson wrote:

> Is there a flag to disable the letter-number transition in the
> solr.WordDelimiterFilterFactory?  We are indexing category codes, thesaurus
> codes for which this letter number transition makes no sense.  It is
> bloating the indexing (which is already large).

Have you considered using a different analyzer?

If you want to continue using WDF, you could make a quick change
around line 320:

             if (splitOnCaseChange == 0 &&
                 (lastType & ALPHA) != 0 && (type & ALPHA) != 0) {
               // ALPHA->ALPHA: always ignore if case isn't considered.

             } else if ((lastType & UPPER)!=0 && (type & LOWER)!=0) {
               // UPPER->LOWER: Don't split
             } else {

	    ...

by adding a clause that catches ALPHA -> NUMERIC (and vice versa) and  
ignores it.
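
As a rough illustration of that extra clause, here is a standalone sketch. It
is not the WDF source: the class, the bit values, the NUMERIC constant and the
splitOnNumerics flag are all assumptions made up for this example, and only
the branch structure mirrors the snippet above.

```java
import java.util.ArrayList;
import java.util.List;

class TransitionSketch {
    // Character-type bit flags. Names follow the snippet above; NUMERIC and
    // the actual values are assumptions for this sketch.
    static final int LOWER = 1, UPPER = 2, NUMERIC = 4;
    static final int ALPHA = LOWER | UPPER;

    static int charType(char c) {
        if (Character.isLowerCase(c)) return LOWER;
        if (Character.isUpperCase(c)) return UPPER;
        if (Character.isDigit(c)) return NUMERIC;
        return 0; // everything else acts as a delimiter in this sketch
    }

    // splitOnNumerics is a hypothetical flag: 0 means "ignore letter-number
    // transitions", anything else means "split on them".
    static List<String> split(String s, int splitOnCaseChange, int splitOnNumerics) {
        List<String> parts = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        int lastType = 0;
        for (char c : s.toCharArray()) {
            int type = charType(c);
            boolean boundary;
            if (type == 0) {
                boundary = true;   // delimiter: always ends the current part
            } else if (lastType == 0 || lastType == type) {
                boundary = false;  // start of a part, or no transition at all
            } else if (splitOnCaseChange == 0
                    && (lastType & ALPHA) != 0 && (type & ALPHA) != 0) {
                boundary = false;  // ALPHA->ALPHA: ignore if case isn't considered
            } else if ((lastType & UPPER) != 0 && (type & LOWER) != 0) {
                boundary = false;  // UPPER->LOWER: don't split
            } else if (splitOnNumerics == 0
                    && (((lastType & ALPHA) != 0 && (type & NUMERIC) != 0)
                     || ((lastType & NUMERIC) != 0 && (type & ALPHA) != 0))) {
                boundary = false;  // added clause: ALPHA<->NUMERIC, don't split
            } else {
                boundary = true;
            }
            if (boundary && cur.length() > 0) {
                parts.add(cur.toString());
                cur.setLength(0);
            }
            if (type != 0) cur.append(c);
            lastType = type;
        }
        if (cur.length() > 0) parts.add(cur.toString());
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("A12b", 1, 0)); // letter-number splits ignored
        System.out.println(split("A12b", 1, 1)); // letter-number splits kept
    }
}
```

With the hypothetical flag set to 0, a code such as A12b stays a single part,
while case changes and explicit delimiters are still handled as before.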

Another approach that I am using locally is to maintain the  
transitions, but force tokens to be a minimum size (so r2d2 doesn't  
tokenize to four tokens but arrr2222deee2222 does).
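
One way to read the minimum-size idea, sketched standalone. This is only an
assumed interpretation and not the code in the patch: the class, the method
names and the minSize parameter are hypothetical, and the rule used here is
"if any piece would be shorter than minSize, keep the token whole".

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class MinSizeSketch {
    // Split at every digit/non-digit transition (the only transition this
    // sketch looks at; case changes and delimiters are ignored).
    static List<String> splitLetterDigit(String s) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= s.length(); i++) {
            boolean atEnd = (i == s.length());
            if (atEnd || Character.isDigit(s.charAt(i)) != Character.isDigit(s.charAt(i - 1))) {
                parts.add(s.substring(start, i));
                start = i;
            }
        }
        return parts;
    }

    // Keep the split only if every piece has at least minSize characters
    // (hypothetical parameter); otherwise return the token unsplit.
    static List<String> tokenize(String s, int minSize) {
        List<String> parts = splitLetterDigit(s);
        for (String p : parts) {
            if (p.length() < minSize) {
                return Collections.singletonList(s);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("r2d2", 2));
        System.out.println(tokenize("arrr2222deee2222", 2));
    }
}
```

Under this assumed rule and a minSize of 2, r2d2 stays one token and
arrr2222deee2222 becomes four.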

There is a patch here: http://issues.apache.org/jira/browse/SOLR-293

If you vote for it, I promise to get it in for 1.3 <g>

-Mike