Posted to solr-user@lucene.apache.org by Floyd Wu <fl...@gmail.com> on 2012/03/03 04:52:48 UTC

Remove underscore char when indexing and query problem

Hi there,

I have a document and its title is "20111213_solr_apache conference report".

When I use the analysis web interface to see exactly which tokens Solr
produces, the following is the result:

term text   20111213_solr   apache       conference   report
term type   <NUM>           <ALPHANUM>   <ALPHANUM>   <ALPHANUM>


Why is 20111213_solr tokenized as a single <NUM> token, and why isn't the
"_" char removed? (I've added "_" as a stop word in stopwords.txt.)

I did another test with "20111213_solr_apache conference_report".
As you can see, the difference is that I added an underscore char between
conference and report. Analyzing this string gives:

term text   20111213_solr   apache       conference   report
term type   <NUM>           <ALPHANUM>   <ALPHANUM>   <ALPHANUM>

This time the underscore char between conference and report is removed!

Why? How can I make Solr remove the underscore char and behave consistently?
Please help with this.

Thanks in advance.

Floyd

Re: Remove underscore char when indexing and query problem

Posted by Erick Erickson <er...@gmail.com>.
Look at the admin/analysis page and be sure to check the "verbose"
checkboxes. That'll show you what each filter does to the input. My
guess is that WordDelimiterFilterFactory has different parameters
and that's what you're seeing. WDFF can be tricky to understand...

If that's not helpful, you need to provide your field definition.

Best
Erick
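
For anyone who wants consistent behavior without first working out which
tokenizer or filter is responsible, one option is to normalize underscores
before the text reaches the analyzer at all. The sketch below is only an
illustration in Python: normalize_underscores is a hypothetical client-side
helper, not part of Solr, and it assumes you can preprocess both the
documents you index and the queries you send. Note that a stop word entry
for "_" can only drop a token that consists of exactly "_"; it cannot strip
the character from inside a token such as 20111213_solr. Inside Solr, a
roughly equivalent step would be a char filter such as
solr.PatternReplaceCharFilterFactory placed before the tokenizer, but whether
that fits depends on the field definition, which is not shown in this thread.

```python
import re

# Hypothetical helper, not part of Solr. Apply it to field values before
# indexing and to user input before querying, so both sides see the same text.
_UNDERSCORES = re.compile(r"_+")


def normalize_underscores(text: str) -> str:
    """Replace each run of underscores with a space, then collapse whitespace."""
    spaced = _UNDERSCORES.sub(" ", text)
    return " ".join(spaced.split())


if __name__ == "__main__":
    titles = [
        "20111213_solr_apache conference report",
        "20111213_solr_apache conference_report",
    ]
    for title in titles:
        # Both titles normalize to the same string, so the analyzer
        # never sees an underscore in either case.
        print(normalize_underscores(title))
```

With the underscores gone before analysis, both test titles from the
original question reach the tokenizer as the same string, whatever rule the
tokenizer applies to "_" between digits and letters.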
