Posted to solr-user@lucene.apache.org by anuvenk <an...@hotmail.com> on 2008/01/05 07:24:14 UTC

solr word delimiter

I have the word delimiter filter factory in the text field definition, both at
index and query time.
But it has some negative effects on search terms like "h1-b visa".
It splits h1-b into three tokens: h, 1, b. Now, if I understand right, does
Solr look for matches for 'h', '1' and 'b' separately
because they are three different tokens? This is giving some undesired
results: docs that have 'h' somewhere, '1' somewhere and 'b' somewhere. How
can I solve this problem?
I tried adding a synonym like h1-b => h1b visa
It does filter out some results, but I'm trying to find a global solution rather
than adding synonyms for all kinds of immigration forms like i-94, k-1, etc.
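
A toy model may help make the question concrete. The Python sketch below is not Solr's implementation: `split_word_parts` and `matches_any_order` are hypothetical helpers that only mimic splitting on delimiters and on letter/digit boundaries, and then check whether each part occurs anywhere in a document, which is the behaviour described above. How Solr actually combines the parts depends on the analyzer and query parser settings.

```python
import re

def split_word_parts(token):
    """Split a token on non-alphanumeric characters and letter/digit boundaries."""
    # Each match is a maximal run of letters or a maximal run of digits.
    return re.findall(r"[A-Za-z]+|[0-9]+", token)

def matches_any_order(doc_text, query_token):
    """Return True if every part of the query token occurs somewhere in the doc."""
    doc_parts = set()
    for word in doc_text.lower().split():
        doc_parts.update(split_word_parts(word))
    return all(part in doc_parts for part in split_word_parts(query_token.lower()))

print(split_word_parts("h1-b"))  # ['h', '1', 'b']
print(matches_any_order("plan b for section h item 1", "h1-b"))
print(matches_any_order("k-1 visa", "h1-b"))
```

In this simplified model, a document that merely contains 'h', '1' and 'b' as separate words matches a query for h1-b, which is the kind of undesired hit reported here.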
-- 
View this message in context: http://www.nabble.com/solr-word-delimiter-tp14630435p14630435.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: solr word delimiter

Posted by anuvenk <an...@hotmail.com>.

The word delimiter filter is set to
generateWordParts=1, generateNumberParts=1, catenateWords=1, catenateNumbers=1
both at index and query time.

Now I have this synonym mapping: k-1 => k1 visa

Here is the parsedquery_toString:
<str name="parsedquery_toString">
+(text:"k (1 k) 1 visa"^0.8 | name:"k (1 k) 1 visa"^2.0)~0.01 (text:"k (1 k)
1 visa"~25^0.8 | name:"k (1 k) 1 visa"~25^2.0)~0.01
</str>

Why is Solr grouping it this way: k (1 k) 1 visa? (I mean the "1 k" within
brackets.)
Also, after k-1 gets split by the word delimiter, does catenateWords=1 make k1
a single token?

As for the matching, with
(text:"k (1 k) 1 visa"^0.8
documents that have the exact phrase "k1 visa" would rank higher, and docs with
just k1 might rank next.
Since I have ps set to 25, would it also match docs that have 'k' and
'1' within 25 words of one another? Or k1 and visa within 25 words of one
another, because k1 is a single token? I get confused about how Solr
matches documents in cases like this.

Re: solr word delimiter

Posted by Yonik Seeley <yo...@apache.org>.

On Jan 5, 2008 2:28 PM, anuvenk <an...@hotmail.com> wrote:
> That's what I'm thinking too. If I remove the solr.WordDelimiterFilterFactory
> filter from both index and query, the word h1-b will remain as is in the
> index, correct? So if someone searches for h1b (without the hyphen), would it
> still return the h1-b doc?

for "h1-b" to match "h1b", it will take either a synonym or something
like WordDelimiterFilter.
You can configure WordDelimiterFilter to only catenate too... so h1-b
would become h1b at both index and query time.  The downside is that
it might catenate things you want.

-Yonik
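
The catenate-only idea can be sketched with a toy normalizer. This is an illustration in Python, not Solr code: `catenate_all` is a hypothetical helper that drops delimiters and joins the remaining letter and digit runs, roughly what the filter's catenate options are meant to do, and the same function is assumed to run at both index and query time.

```python
import re

def catenate_all(token):
    """Join the letter and digit runs of a token, dropping delimiters."""
    parts = re.findall(r"[A-Za-z]+|[0-9]+", token)
    return "".join(parts).lower()

# The same normalization is applied at index time and at query time.
indexed = {catenate_all(word) for word in "H1-B visa holders".split()}

for query in ["h1-b", "h1b", "H1B"]:
    normalized = catenate_all(query)
    print(query, "->", normalized, normalized in indexed)

# Possible side effect: a part of a catenated token no longer matches on its own.
print(catenate_all("2008") in {catenate_all("2008-01")})
```

With this normalization both "h1-b" and "h1b" reduce to the same term, so neither a synonym per form nor separate 'h', '1', 'b' tokens are needed. The last line shows one kind of trade-off in this toy model: once "2008-01" is indexed only as "200801", a search for "2008" alone does not find it.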

Re: solr word delimiter

Posted by anuvenk <an...@hotmail.com>.

That's what I'm thinking too. If I remove the solr.WordDelimiterFilterFactory
filter from both index and query, the word h1-b will remain as is in the
index, correct? So if someone searches for h1b (without the hyphen), would it
still return the h1-b doc?

Otis Gospodnetic wrote:
> 
> It sounds like you simply want to drop solr.WordDelimiterFilterFactory
> from your analyzer definition, no?
> 
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch