Posted to solr-user@lucene.apache.org by Jian Xu <jo...@yahoo.com> on 2012/04/11 21:59:10 UTC

Question about solr.WordDelimiterFilterFactory

Hello,

I am new to Solr/Lucene. I am tasked with indexing a large number of documents, some of which contain decimal numbers. I am looking for a way to index these documents so that adjacent numeric characters (such as [0-9.,]) are treated as a single token. For example,

12.34 => "12.34"
12,345 => "12,345"

However, "," and "." should be treated as delimiters as usual when adjacent to non-digit characters. For example,

ab,cd => "ab" "cd".

This is so that searching for "12.34" will match "12.34" but not "12 34". Searching for "ab.cd" should match both "ab.cd" and "ab cd".
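To make the intended rule concrete, here is a standalone prototype of the desired tokenization using a plain regular expression. This is only an illustration of the rule itself, not Solr code; the class and method names are mine:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Prototype of the desired rule: '.' and ',' glue adjacent digits into one
// token; everywhere else they act as delimiters.
public class NumberAwareTokenizer {
    // A numeric token is digits optionally joined by '.' or ',' with digits
    // on BOTH sides; any other token is a run of letters.
    private static final Pattern TOKEN =
        Pattern.compile("\\d+(?:[.,]\\d+)*|\\p{L}+");

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("12.34"));   // [12.34]
        System.out.println(tokenize("12,345"));  // [12,345]
        System.out.println(tokenize("ab,cd"));   // [ab, cd]
    }
}
```

With this rule, "ab.cd" tokenizes to "ab" "cd", so it would match documents containing "ab cd" as well, matching the requirement above.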

After doing some research on Solr, it seems there is a built-in filter called solr.WordDelimiterFilterFactory that supports a "types" attribute which maps special characters to different delimiter types. However, it isn't exactly what I want: it doesn't provide a context check, such as requiring that "," or "." be surrounded by digit characters.
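For reference, the "types" mechanism looks roughly like this in schema.xml (the field type and file name here are my own choices, not from the thread):

```xml
<fieldType name="text_wdf" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- "types" points at a file that reassigns character classes -->
    <filter class="solr.WordDelimiterFilterFactory" types="wdfftypes.txt"/>
  </analyzer>
</fieldType>
```

with wdfftypes.txt along the lines of:

```
\u002E => DIGIT
\u002C => DIGIT
```

The mapping applies unconditionally, wherever the character occurs. There is no way to express "only when surrounded by digits", which is exactly the limitation described above.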

Does anyone have experience configuring Solr to meet this requirement? Is writing my own plugin necessary for such a simple thing?

Thanks in advance!

-Jian

Re: Question about solr.WordDelimiterFilterFactory

Posted by Jian Xu <jo...@yahoo.com>.
Erick,

Thank you for your response! 

The problem with this approach is that searching for "12:34" will also match "12.34", which is not what I want.


________________________________
 From: Erick Erickson <er...@gmail.com>
To: solr-user@lucene.apache.org; Jian Xu <jo...@yahoo.com> 
Sent: Thursday, April 12, 2012 8:01 AM
Subject: Re: Question about solr.WordDelimiterFilterFactory
 
WordDelimiterFilterFactory will _almost_ do what you want
by setting things like catenateWords=0 and catenateNumbers=1,
_except_ that the punctuation will be removed. So
12.34 -> 1234
ab,cd -> ab cd

is that "close enough"?
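For concreteness, the settings Erick describes would be applied in schema.xml roughly like this (field type name and the other parameters shown are my own sketch):

```xml
<fieldType name="text_catenate" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- catenateNumbers="1" joins number parts: 12.34 also indexes 1234 -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="1"/>
  </analyzer>
</fieldType>
```

The punctuation itself is discarded, so "12.34" and "12:34" normalize to the same catenated term, which is the looseness at issue in this thread.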

Otherwise, writing a simple Filter is probably the way to go.
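Short of writing a custom Filter, one configuration-only route that approximates the requirement (a sketch not proposed in the thread, and not a drop-in answer) is solr.PatternTokenizerFactory with group="0", which emits only what the regex matches, so "." and "," survive only between digits:

```xml
<fieldType name="text_numpattern" class="solr.TextField">
  <analyzer>
    <!-- group="0": each whole regex match becomes one token -->
    <tokenizer class="solr.PatternTokenizerFactory"
               pattern="\d+(?:[.,]\d+)*|[A-Za-z]+" group="0"/>
  </analyzer>
</fieldType>
```

This trades away everything WordDelimiterFilterFactory does (case changes, possessives, and so on), so it only fits if this one rule is all the field needs.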

Best
Erick

On Wed, Apr 11, 2012 at 1:59 PM, Jian Xu <jo...@yahoo.com> wrote:

Re: Question about solr.WordDelimiterFilterFactory

Posted by Erick Erickson <er...@gmail.com>.
WordDelimiterFilterFactory will _almost_ do what you want
by setting things like catenateWords=0 and catenateNumbers=1,
_except_ that the punctuation will be removed. So
12.34 -> 1234
ab,cd -> ab cd

is that "close enough"?

Otherwise, writing a simple Filter is probably the way to go.

Best
Erick

On Wed, Apr 11, 2012 at 1:59 PM, Jian Xu <jo...@yahoo.com> wrote: