Posted to solr-user@lucene.apache.org by Floyd Wu <fl...@gmail.com> on 2013/08/22 03:54:33 UTC

How to avoid underscore sign indexing problem?

When using StandardAnalyzer to tokenize the string "Pacific_Rim", I get:

ST
text         raw_bytes                            start  end  type         position
pacific_rim  [70 61 63 69 66 69 63 5f 72 69 6d]   0      11   <ALPHANUM>   1

How can I make this string tokenize into the two tokens "Pacific" and
"Rim"? Should I set _ as a stopword?
Please kindly help with this.
Many thanks.

Floyd

Re: How to avoid underscore sign indexing problem?

Posted by Floyd Wu <fl...@gmail.com>.
Alright, thanks for all your help. I finally fixed this problem using
PatternReplaceFilterFactory + WordDelimiterFilterFactory.

I first replace the _ (underscore) using PatternReplaceFilterFactory, and
then use WordDelimiterFilterFactory to generate the word and number parts
to increase search hits. Although this decreases precision a little, my
users need higher recall than precision.

Thank you all.

Floyd
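[A sketch of the chain Floyd describes, as it might appear in schema.xml. The field type name and parameter values are illustrative guesses, not taken from the thread; note that WordDelimiterFilterFactory already treats _ as a delimiter, so the PatternReplace step may be redundant depending on your settings.]

```xml
<fieldType name="text_split_underscore" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Strip underscores from tokens so they no longer glue words together -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="_" replacement=" " replace="all"/>
    <!-- Split on the remaining delimiters and emit word/number parts -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The Analysis screen of the Solr admin UI is the quickest way to confirm such a chain behaves as intended on sample input.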





2013/8/22 Floyd Wu <fl...@gmail.com>

> After trying some search cases and different parameter combinations of
> WordDelimiter, I wonder what is the best strategy to index the string
> "2DA012_ISO MARK 2" so that it can be searched by the term "2DA012"?
>
> What if I just want _ to be removed at both query and index time? What
> should I configure, and how?
>
> Floyd
>
>
>
> 2013/8/22 Floyd Wu <fl...@gmail.com>
>
>> Thank you all.
>> By the way, Jack, I am going to buy your book. Where can I buy it?
>> Floyd
>>
>>
>> 2013/8/22 Jack Krupansky <ja...@basetechnology.com>
>>
>>> "I thought that the StandardTokenizer always split on punctuation, "
>>>
>>> Proving that you haven't read my book! The section on the standard
>>> tokenizer details the rules that the tokenizer uses (in addition to
>>> extensive examples.) That's what I mean by "deep dive."
>>>
>>> -- Jack Krupansky
>>>
>>> -----Original Message----- From: Shawn Heisey
>>> Sent: Wednesday, August 21, 2013 10:41 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: How to avoid underscore sign indexing problem?
>>>
>>>
>>> On 8/21/2013 7:54 PM, Floyd Wu wrote:
>>>
>>>> When using StandardAnalyzer to tokenize the string "Pacific_Rim", I get:
>>>>
>>>> ST
>>>> text         raw_bytes                            start  end  type         position
>>>> pacific_rim  [70 61 63 69 66 69 63 5f 72 69 6d]   0      11   <ALPHANUM>   1
>>>>
>>>> How can I make this string tokenize into the two tokens "Pacific" and
>>>> "Rim"? Should I set _ as a stopword?
>>>> Please kindly help with this.
>>>> Many thanks.
>>>>
>>>
>>> Interesting.  I thought that the StandardTokenizer always split on
>>> punctuation, but apparently that's not the case for the underscore
>>> character.
>>>
>>> You can always use the WordDelimiterFilter after the StandardTokenizer.
>>>
>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
>>>
>>> Thanks,
>>> Shawn
>>>
>>
>>
>

Re: How to avoid underscore sign indexing problem?

Posted by Floyd Wu <fl...@gmail.com>.
After trying some search cases and different parameter combinations of
WordDelimiter, I wonder what is the best strategy to index the string
"2DA012_ISO MARK 2" so that it can be searched by the term "2DA012"?

What if I just want _ to be removed at both query and index time? What
should I configure, and how?

Floyd
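[One hedged possibility for the "2DA012_ISO MARK 2" case, with illustrative parameter values that would need testing against real queries: WordDelimiterFilterFactory already treats _ as a delimiter, so applying it at both index and query time with catenation enabled can keep "2DA012" matchable.]

```xml
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1"
        splitOnNumerics="1" catenateAll="1" preserveOriginal="1"/>
```

With splitOnNumerics and catenateAll, the indexed token "2DA012_ISO" would yield parts like 2, DA, 012, ISO, while query-side analysis of "2DA012" produces 2, DA, 012, so the part tokens can line up at query time. The exact behavior depends on the query parser and should be verified in the Analysis screen.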



2013/8/22 Floyd Wu <fl...@gmail.com>

> Thank you all.
> By the way, Jack, I am going to buy your book. Where can I buy it?
> Floyd
>
>
> 2013/8/22 Jack Krupansky <ja...@basetechnology.com>
>
>> "I thought that the StandardTokenizer always split on punctuation, "
>>
>> Proving that you haven't read my book! The section on the standard
>> tokenizer details the rules that the tokenizer uses (in addition to
>> extensive examples.) That's what I mean by "deep dive."
>>
>> -- Jack Krupansky
>>
>> -----Original Message----- From: Shawn Heisey
>> Sent: Wednesday, August 21, 2013 10:41 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: How to avoid underscore sign indexing problem?
>>
>>
>> On 8/21/2013 7:54 PM, Floyd Wu wrote:
>>
>>> When using StandardAnalyzer to tokenize the string "Pacific_Rim", I get:
>>>
>>> ST
>>> text         raw_bytes                            start  end  type         position
>>> pacific_rim  [70 61 63 69 66 69 63 5f 72 69 6d]   0      11   <ALPHANUM>   1
>>>
>>> How can I make this string tokenize into the two tokens "Pacific" and
>>> "Rim"? Should I set _ as a stopword?
>>> Please kindly help with this.
>>> Many thanks.
>>>
>>
>> Interesting.  I thought that the StandardTokenizer always split on
>> punctuation, but apparently that's not the case for the underscore
>> character.
>>
>> You can always use the WordDelimiterFilter after the StandardTokenizer.
>>
>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
>>
>> Thanks,
>> Shawn
>>
>
>

Re: How to avoid underscore sign indexing problem?

Posted by Floyd Wu <fl...@gmail.com>.
Thank you all.
By the way, Jack, I am going to buy your book. Where can I buy it?
Floyd


2013/8/22 Jack Krupansky <ja...@basetechnology.com>

> "I thought that the StandardTokenizer always split on punctuation, "
>
> Proving that you haven't read my book! The section on the standard
> tokenizer details the rules that the tokenizer uses (in addition to
> extensive examples.) That's what I mean by "deep dive."
>
> -- Jack Krupansky
>
> -----Original Message----- From: Shawn Heisey
> Sent: Wednesday, August 21, 2013 10:41 PM
> To: solr-user@lucene.apache.org
> Subject: Re: How to avoid underscore sign indexing problem?
>
>
> On 8/21/2013 7:54 PM, Floyd Wu wrote:
>
>> When using StandardAnalyzer to tokenize the string "Pacific_Rim", I get:
>>
>> ST
>> text         raw_bytes                            start  end  type         position
>> pacific_rim  [70 61 63 69 66 69 63 5f 72 69 6d]   0      11   <ALPHANUM>   1
>>
>> How can I make this string tokenize into the two tokens "Pacific" and
>> "Rim"? Should I set _ as a stopword?
>> Please kindly help with this.
>> Many thanks.
>>
>
> Interesting.  I thought that the StandardTokenizer always split on
> punctuation, but apparently that's not the case for the underscore
> character.
>
> You can always use the WordDelimiterFilter after the StandardTokenizer.
>
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
>
> Thanks,
> Shawn
>

Re: How to avoid underscore sign indexing problem?

Posted by Jack Krupansky <ja...@basetechnology.com>.
Exactly - Solr does not define the punctuation; UAX#29 defines it, and I
have deciphered the UAX#29 rules and included them in my book. Some
punctuation is always punctuation and always removed, and some is
conditional on context; I tried to lay out all the implied rules.

-- Jack Krupansky

-----Original Message----- 
From: Steve Rowe
Sent: Friday, August 23, 2013 12:30 AM
To: solr-user@lucene.apache.org
Subject: Re: How to avoid underscore sign indexing problem?

Dan,

StandardTokenizer implements the word boundary rules from the Unicode Text 
Segmentation standard annex UAX#29:

   http://www.unicode.org/reports/tr29/#Word_Boundaries

Every character sequence within UAX#29 boundaries that contains a numeric or 
an alphabetic character is emitted as a term, and nothing else is emitted.

Punctuation can be included within a term, e.g. "1,248.99" or "192.168.1.1".

To split on underscores, you can convert underscores to e.g. spaces by 
adding PatternReplaceCharFilterFactory to your analyzer:

    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="_" 
replacement=" "/>

This replacement will be performed prior to StandardTokenizer, which will 
then see token-splitting spaces instead of underscores.

Steve

On Aug 22, 2013, at 10:23 PM, Dan Davis <da...@gmail.com> wrote:

> Ah, but what is the definition of punctuation in Solr?
>
>
> On Wed, Aug 21, 2013 at 11:15 PM, Jack Krupansky
> <ja...@basetechnology.com> wrote:
>
>> "I thought that the StandardTokenizer always split on punctuation, "
>>
>> Proving that you haven't read my book! The section on the standard
>> tokenizer details the rules that the tokenizer uses (in addition to
>> extensive examples.) That's what I mean by "deep dive."
>>
>> -- Jack Krupansky
>>
>> -----Original Message----- From: Shawn Heisey
>> Sent: Wednesday, August 21, 2013 10:41 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: How to avoid underscore sign indexing problem?
>>
>>
>> On 8/21/2013 7:54 PM, Floyd Wu wrote:
>>
>>> When using StandardAnalyzer to tokenize the string "Pacific_Rim", I get:
>>>
>>> ST
>>> text         raw_bytes                            start  end  type         position
>>> pacific_rim  [70 61 63 69 66 69 63 5f 72 69 6d]   0      11   <ALPHANUM>   1
>>>
>>> How can I make this string tokenize into the two tokens "Pacific" and
>>> "Rim"? Should I set _ as a stopword?
>>> Please kindly help with this.
>>> Many thanks.
>>>
>>
>> Interesting.  I thought that the StandardTokenizer always split on
>> punctuation, but apparently that's not the case for the underscore
>> character.
>>
>> You can always use the WordDelimiterFilter after the StandardTokenizer.
>>
>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
>>
>> Thanks,
>> Shawn
>> 

Re: How to avoid underscore sign indexing problem?

Posted by Steve Rowe <sa...@gmail.com>.
Dan,

StandardTokenizer implements the word boundary rules from the Unicode Text Segmentation standard annex UAX#29:

   http://www.unicode.org/reports/tr29/#Word_Boundaries

Every character sequence within UAX#29 boundaries that contains a numeric or an alphabetic character is emitted as a term, and nothing else is emitted.

Punctuation can be included within a term, e.g. "1,248.99" or "192.168.1.1".

To split on underscores, you can convert underscores to e.g. spaces by adding PatternReplaceCharFilterFactory to your analyzer:

    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="_" replacement=" "/>

This replacement will be performed prior to StandardTokenizer, which will then see token-splitting spaces instead of underscores.

Steve
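[Outside Solr, the ordering Steve describes - char filter before tokenizer - is easy to demonstrate with a small standalone Python sketch. This illustrates the principle only; it is not Solr code.]

```python
import re

def char_filter(text: str) -> str:
    """Mimic PatternReplaceCharFilterFactory with pattern="_" replacement=" "."""
    return re.sub(r"_", " ", text)

def tokenize(text: str) -> list[str]:
    """A stand-in tokenizer: split on whitespace."""
    return text.split()

# Because the char filter runs first, the tokenizer never sees the underscore.
print(tokenize(char_filter("Pacific_Rim")))   # ['Pacific', 'Rim']
print(tokenize("Pacific_Rim"))                # ['Pacific_Rim'] - no split without it
```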

On Aug 22, 2013, at 10:23 PM, Dan Davis <da...@gmail.com> wrote:

> Ah, but what is the definition of punctuation in Solr?
> 
> 
> On Wed, Aug 21, 2013 at 11:15 PM, Jack Krupansky <ja...@basetechnology.com> wrote:
> 
>> "I thought that the StandardTokenizer always split on punctuation, "
>> 
>> Proving that you haven't read my book! The section on the standard
>> tokenizer details the rules that the tokenizer uses (in addition to
>> extensive examples.) That's what I mean by "deep dive."
>> 
>> -- Jack Krupansky
>> 
>> -----Original Message----- From: Shawn Heisey
>> Sent: Wednesday, August 21, 2013 10:41 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: How to avoid underscore sign indexing problem?
>> 
>> 
>> On 8/21/2013 7:54 PM, Floyd Wu wrote:
>> 
>>> When using StandardAnalyzer to tokenize the string "Pacific_Rim", I get:
>>>
>>> ST
>>> text         raw_bytes                            start  end  type         position
>>> pacific_rim  [70 61 63 69 66 69 63 5f 72 69 6d]   0      11   <ALPHANUM>   1
>>>
>>> How can I make this string tokenize into the two tokens "Pacific" and
>>> "Rim"? Should I set _ as a stopword?
>>> Please kindly help with this.
>>> Many thanks.
>>> 
>> 
>> Interesting.  I thought that the StandardTokenizer always split on
>> punctuation, but apparently that's not the case for the underscore
>> character.
>> 
>> You can always use the WordDelimiterFilter after the StandardTokenizer.
>>
>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
>> 
>> Thanks,
>> Shawn
>> 


Re: How to avoid underscore sign indexing problem?

Posted by Dan Davis <da...@gmail.com>.
Ah, but what is the definition of punctuation in Solr?


On Wed, Aug 21, 2013 at 11:15 PM, Jack Krupansky <ja...@basetechnology.com> wrote:

> "I thought that the StandardTokenizer always split on punctuation, "
>
> Proving that you haven't read my book! The section on the standard
> tokenizer details the rules that the tokenizer uses (in addition to
> extensive examples.) That's what I mean by "deep dive."
>
> -- Jack Krupansky
>
> -----Original Message----- From: Shawn Heisey
> Sent: Wednesday, August 21, 2013 10:41 PM
> To: solr-user@lucene.apache.org
> Subject: Re: How to avoid underscore sign indexing problem?
>
>
> On 8/21/2013 7:54 PM, Floyd Wu wrote:
>
>> When using StandardAnalyzer to tokenize the string "Pacific_Rim", I get:
>>
>> ST
>> text         raw_bytes                            start  end  type         position
>> pacific_rim  [70 61 63 69 66 69 63 5f 72 69 6d]   0      11   <ALPHANUM>   1
>>
>> How can I make this string tokenize into the two tokens "Pacific" and
>> "Rim"? Should I set _ as a stopword?
>> Please kindly help with this.
>> Many thanks.
>>
>
> Interesting.  I thought that the StandardTokenizer always split on
> punctuation, but apparently that's not the case for the underscore
> character.
>
> You can always use the WordDelimiterFilter after the StandardTokenizer.
>
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
>
> Thanks,
> Shawn
>

Re: How to avoid underscore sign indexing problem?

Posted by Jack Krupansky <ja...@basetechnology.com>.
"I thought that the StandardTokenizer always split on punctuation, "

Proving that you haven't read my book! The section on the standard tokenizer 
details the rules that the tokenizer uses (in addition to extensive 
examples.) That's what I mean by "deep dive."

-- Jack Krupansky

-----Original Message----- 
From: Shawn Heisey
Sent: Wednesday, August 21, 2013 10:41 PM
To: solr-user@lucene.apache.org
Subject: Re: How to avoid underscore sign indexing problem?

On 8/21/2013 7:54 PM, Floyd Wu wrote:
> When using StandardAnalyzer to tokenize the string "Pacific_Rim", I get:
>
> ST
> text         raw_bytes                            start  end  type         position
> pacific_rim  [70 61 63 69 66 69 63 5f 72 69 6d]   0      11   <ALPHANUM>   1
>
> How can I make this string tokenize into the two tokens "Pacific" and
> "Rim"? Should I set _ as a stopword?
> Please kindly help with this.
> Many thanks.

Interesting.  I thought that the StandardTokenizer always split on
punctuation, but apparently that's not the case for the underscore
character.

You can always use the WordDelimiterFilter after the StandardTokenizer.

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory

Thanks,
Shawn 


Re: How to avoid underscore sign indexing problem?

Posted by Shawn Heisey <so...@elyograg.org>.
On 8/21/2013 7:54 PM, Floyd Wu wrote:
> When using StandardAnalyzer to tokenize the string "Pacific_Rim", I get:
>
> ST
> text         raw_bytes                            start  end  type         position
> pacific_rim  [70 61 63 69 66 69 63 5f 72 69 6d]   0      11   <ALPHANUM>   1
>
> How can I make this string tokenize into the two tokens "Pacific" and
> "Rim"? Should I set _ as a stopword?
> Please kindly help with this.
> Many thanks.

Interesting.  I thought that the StandardTokenizer always split on
punctuation, but apparently that's not the case for the underscore
character.

You can always use the WordDelimiterFilter after the StandardTokenizer.

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory

Thanks,
Shawn
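[In schema.xml terms, Shawn's suggestion would look roughly like the following. This is an illustrative sketch, not a tested configuration; WordDelimiterFilter treats the underscore as a delimiter out of the box, so no extra configuration is needed for that character.]

```xml
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- Splits "pacific_rim" into "pacific" and "rim" -->
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" generateNumberParts="1"/>
</analyzer>
```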