Posted to solr-user@lucene.apache.org by Anirudha Jadhav <an...@nyu.edu> on 2014/04/09 23:51:00 UTC

WordDelimiterFilter issue and suggested fix

indexed term: bdeut_NullableValue
query term 1 : bdeut_nullablevalue (lowercase)
query term 2 : nullablevalue (lowercase)

current field type definition in order:
Whitespace Tokenizer
Word Delimiter Filter
Lowercase filter

current analysis output: [INDEX]schema_field
WT
text
bdeut_NullableValue


WDF
text
bdeut_NullableValuebdeut
Nullable
Value
bdeutNullableValue


LCF
text
bdeut_nullablevaluebdeutnullablevaluebdeutnullablevalue

current Query Analysis output:schema_field

WT
text
bdeut_nullablevalue

WDF
text
bdeut_nullablevalue
bdeut
nullablevalue
bdeutnullablevalue

LCF
text
bdeut_nullablevalue
bdeut
nullablevalue
bdeutnullablevalue



Problem:

1. Query: bdeut_nullablevalue gets no hits.
We cannot match lowercase terms unless we either disable camel-case splitting or move LCF
before WDF; either way we lose the benefit of the camel-case splitting in WDF.
Debug phrase query strings:
<str name="parsedquery_toString">schema_field:"(bdeut_nullablevalue bdeut)
(nullablevalue bdeutnullablevalue)"</str>
<str name="parsedquery_toString">schema_field:"(bdeut_nullablevalue bdeut)
nullable (value bdeutnullablevalue)"</str>

2. Query : "nullablevalue" would also not work.
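
To make the position mismatch behind both failures concrete, here is a small Python sketch. It is only a toy model, not the Lucene WordDelimiterFilter: the function names are made up for illustration, and the placement of the preserved original (first position) and the catenated form (last position) is an assumption taken from the debug query strings above.

```python
import re

def word_delimiter(token):
    """Toy stand-in for WDF: returns one set of terms per position."""
    parts = []
    # Split on non-alphanumeric delimiters, then on lower-to-upper case changes.
    for chunk in re.split(r"[^A-Za-z0-9]+", token):
        parts.extend(re.findall(r"[A-Z]*[a-z0-9]+|[A-Z]+", chunk))
    if not parts:
        return [{token}]
    positions = [{p} for p in parts]
    positions[0].add(token)             # assumed: original kept at the first position
    positions[-1].add("".join(parts))   # assumed: catenated form at the last position
    return positions

def analyze(token):
    """Toy chain: word delimiter step followed by lowercasing."""
    return [{t.lower() for t in pos} for pos in word_delimiter(token)]

def phrase_matches(indexed, query):
    """True if every query position overlaps consecutive indexed positions."""
    n, m = len(indexed), len(query)
    return any(
        all(indexed[s + i] & query[i] for i in range(m))
        for s in range(n - m + 1)
    )

indexed = analyze("bdeut_NullableValue")
print(phrase_matches(indexed, analyze("bdeut_nullablevalue")))  # False
print(phrase_matches(indexed, analyze("nullablevalue")))        # False
print(phrase_matches(indexed, analyze("bdeutNullableValue")))   # True
```

In this model the indexed side has "nullable" at the second position, while the lowercase query expects "nullablevalue" or "bdeutnullablevalue" there, so the phrase cannot line up.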


The solution I propose is as follows; let me know your suggestions.

Suggested fix: add an option to the word delimiter filter factory,
"recursive=[false/true]".
This would run the WDF again, including preserve original, on the subset of tokens.
Note: over multiple passes of the WDF, the word positions of the terms do not
change.

To test the recursive step, I just modified the analyzer chain to run WDF
twice:

WT
text
NullableValue



WDF
text
NullableValue
Nullable
Value
NullableValue



WDF
text
NullableValue
Nullable
Value
NullableValue
Nullable
Value
NullableValue
Nullable
Value
NullableValue


LCF
text
nullablevalue
nullable
value
nullablevalue
nullable
value
nullablevalue
nullable
value
nullablevalue



-- 
Anirudha P. Jadhav

Re: WordDelimiterFilter issue and suggested fix

Posted by Erick Erickson <er...@gmail.com>.

This really doesn't seem necessary. What is your actual field
definition? I also think your cut/paste is messed up, this is wrong:
bdeut_nullablevaluebdeutnullablevaluebdeutnullablevalue

The vertical bars in the admin/analysis page are quite important here.
Either the cut/paste is doing weird things in the e-mail or you have
something I've never seen before happening. You should be able to
index your test string and search it just fine without resorting to
changing WDFF.

Sometimes all the options for WDFF can mess you up. It works for me
OOB with the default settings for text_en_splitting, indexing
bdeut_NullableValue
and searching
bdeut_NullableValue
bdeutNullableValue
"bdeut NullableValue"

all get hits as I would expect. What version of Solr?

Best,
Erick
