Posted to solr-user@lucene.apache.org by Lukas Kahwe Smith <ml...@pooteeweet.org> on 2010/06/13 11:27:30 UTC

dealing with dash chars in fields when using dismax

Hi,

I am using dismax on solr 1.4 and I am running into an issue with fields that contain dash chars:
Foo-Bar - Company

Now if someone searches for exactly "Foo-Bar - Company" the resulting dismax query would disallow "Company" when trying to find a match.
Obviously I could just ignore all "-" in any query strings, but that would prevent power users from prohibiting words. Is there some magic I can enable that would make this use case possible with sensible scoring for the results?

Of course I could put in some work on the client side to index things accordingly, but I guess in that case I would rather just remove support for prohibiting words.

regards,
Lukas Kahwe Smith
mls@pooteeweet.org




Re: dealing with dash chars in fields when using dismax

Posted by Lance Norskog <go...@gmail.com>.
There will always be edge cases and the parser cannot be all things to
all people. Most applications have an application layer that creates
the actual Solr query, and that is where you'll have to handle this
one.

On Sun, Jun 13, 2010 at 8:25 AM, Lukas Kahwe Smith <ml...@pooteeweet.org> wrote:
>
> On 13.06.2010, at 17:20, Erick Erickson wrote:
>
>> <<<but still is there some clean solution that doesn't mean a lot of coding
>> work on my end to handle dash both as a special and as a normal char.>>>
>>
>> And how would the code know? You're essentially asking for DWIM (Do What I
>> Mean) functionality, which I've been awaiting for many years....
>>
>> It seems a reasonable approach would be to have your power users understand
>> they needed to escape hyphens. Or introduce your own syntax for negation
>> which would be a simple string substitution on the way through. Or.....
>> Because somewhere you need some external input that distinguishes between "I
>> mean this hyphen to be a negation, but this other one to be a literal".
>>
>> If this seems irrelevant, then I'm missing your point pretty badly. A use
>> case or two where this distinction is important would be helpful. Or is that
>> use-case <G>?
>
>
> No, I was just wondering if someone by chance implemented the DWIM I want :)
> But I guess for now I will just escape, since we do not advertise + and - syntax anyway atm.
> Then again, more and more people are learning how it works in Google and are starting to just try it out when they are doing searches.
>
> What I might end up doing though is escaping dashes except in one specific case:
> foo-bar (escape)
> foo - bar (escape)
> foo -bar (do not escape, i.e. prohibit bar)
>
> This should enable power users and should rarely hit non power users.
>
> regards,
> Lukas Kahwe Smith
> mls@pooteeweet.org



-- 
Lance Norskog
goksron@gmail.com

Re: dealing with dash chars in fields when using dismax

Posted by Lukas Kahwe Smith <ml...@pooteeweet.org>.
On 13.06.2010, at 17:20, Erick Erickson wrote:

> <<<but still is there some clean solution that doesn't mean a lot of coding
> work on my end to handle dash both as a special and as a normal char.>>>
> 
> And how would the code know? You're essentially asking for DWIM (Do What I
> Mean) functionality, which I've been awaiting for many years....
> 
> It seems a reasonable approach would be to have your power users understand
> they needed to escape hyphens. Or introduce your own syntax for negation
> which would be a simple string substitution on the way through. Or.....
> Because somewhere you need some external input that distinguishes between "I
> mean this hyphen to be a negation, but this other one to be a literal".
> 
> If this seems irrelevant, then I'm missing your point pretty badly. A use
> case or two where this distinction is important would be helpful. Or is that
> use-case <G>?


No, I was just wondering if someone by chance implemented the DWIM I want :)
But I guess for now I will just escape, since we do not advertise + and - syntax anyway atm.
Then again, more and more people are learning how it works in Google and are starting to just try it out when they are doing searches.

What I might end up doing though is escaping dashes except in one specific case:
foo-bar (escape)
foo - bar (escape)
foo -bar (do not escape, i.e. prohibit bar)

This should enable power users and should rarely hit non power users.
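
The three cases above could be sketched roughly as follows. This is only a minimal sketch under a few assumptions: the function name escape_literal_dashes is made up for illustration, the input is raw user text that has not been escaped yet, and a backslash is what escapes a dash for the query parser. A dash is left alone only when it starts a token and is directly followed by a word.

```python
import re

_DASH = re.compile(r"-")


def escape_literal_dashes(query: str) -> str:
    """Escape every dash except one that looks like a prohibit operator.

    Hypothetical helper, not part of Solr: a dash counts as an operator
    only if it is at the start of the string or preceded by whitespace,
    AND it is immediately followed by a non-space character.
    """

    def repl(match: "re.Match[str]") -> str:
        i = match.start()
        starts_token = i == 0 or query[i - 1].isspace()
        has_operand = i + 1 < len(query) and not query[i + 1].isspace()
        if starts_token and has_operand:
            return "-"      # "foo -bar": keep as prohibit operator
        return "\\-"        # "foo-bar", "foo - bar": literal dash

    return _DASH.sub(repl, query)


if __name__ == "__main__":
    for raw in ("foo-bar", "foo - bar", "foo -bar", "Foo-Bar - Company"):
        print(raw, "=>", escape_literal_dashes(raw))
```

Input that already contains escaped dashes would get escaped a second time, so this would have to run exactly once, on the text as the user typed it.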

regards,
Lukas Kahwe Smith
mls@pooteeweet.org




Re: dealing with dash chars in fields when using dismax

Posted by Erick Erickson <er...@gmail.com>.
<<<sure .. escaping ends up being the same as removing>>>
I don't think so. Removing would mean that the same exact-match search would
match documents both with and without hyphens, i.e. searching for "my - way"
would match original content of either "my way" or "my - way". Escaping the
hyphen, by contrast, would cause only the correct exact match to be returned.
This may or may not be the desired behavior...

<<<but still is there some clean solution that doesn't mean a lot of coding
work on my end to handle dash both as a special and as a normal char.>>>

And how would the code know? You're essentially asking for DWIM (Do What I
Mean) functionality, which I've been awaiting for many years....

It seems a reasonable approach would be to have your power users understand
they needed to escape hyphens. Or introduce your own syntax for negation
which would be a simple string substitution on the way through. Or.....
Because somewhere you need some external input that distinguishes between "I
mean this hyphen to be a negation, but this other one to be a literal".
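
To illustrate the "own syntax for negation" idea, here is a minimal sketch of such a string substitution. Everything named here is an assumption made for the example: the function translate_negation is hypothetical, and "!" is just one arbitrary choice of marker. Every hyphen the user types is treated as literal and escaped with a backslash, and only the marker at the start of a token is turned into the real prohibit operator.

```python
import re


def translate_negation(query: str, marker: str = "!") -> str:
    """Rewrite an application-level negation marker into a leading dash.

    Hypothetical helper: `marker` is an invented application syntax,
    not something the query parser itself knows about.
    """
    # Step 1: every hyphen in the user's text is literal, so escape it.
    escaped = query.replace("-", "\\-")
    # Step 2: a marker that starts a token and is followed by a
    # non-space character becomes the prohibit operator.
    pattern = re.compile(r"(?<!\S)" + re.escape(marker) + r"(?=\S)")
    return pattern.sub("-", escaped)


if __name__ == "__main__":
    print(translate_negation("Foo-Bar - Company !Inc"))
```

The external input that distinguishes the two meanings is then the choice of character: "-" is always literal, the marker is always negation.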

If this seems irrelevant, then I'm missing your point pretty badly. A use
case or two where this distinction is important would be helpful. Or is that
use-case <G>?

Best
Erick

On Sun, Jun 13, 2010 at 11:00 AM, Lukas Kahwe Smith <ml...@pooteeweet.org> wrote:

>
> On 13.06.2010, at 16:57, Erick Erickson wrote:
>
> > Have you tried escaping the dashes? Your dismax definition
> > and the output from the analysis admin page would also help.
>
>
> sure .. escaping ends up being the same as removing. but i guess it would
> be the better approach of course. but still is there some clean solution
> that doesn't mean a lot of coding work on my end to handle dash both as a
> special and as a normal char.
>
> something like doing the search twice both with the dash escaped and not
> escaped and then some intelligent scoring to produce the final result set.
>
> regards,
> Lukas Kahwe Smith
> mls@pooteeweet.org

Re: dealing with dash chars in fields when using dismax

Posted by Lukas Kahwe Smith <ml...@pooteeweet.org>.
On 13.06.2010, at 16:57, Erick Erickson wrote:

> Have you tried escaping the dashes? Your dismax definition
> and the output from the analysis admin page would also help.


Sure, escaping ends up being the same as removing, but I guess it would be the better approach of course. Still, is there some clean solution that doesn't mean a lot of coding work on my end to handle the dash both as a special and as a normal char?

Something like doing the search twice, once with the dash escaped and once not escaped, and then some intelligent scoring to produce the final result set.
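
Purely as a rough sketch of what such a merge could look like on the client side: the function merge_hits, the (document id, score) tuples and the literal_boost factor are all invented for this example, and no Solr call is shown. It keeps the best score per document and slightly favours hits from the escaped (literal) query. Scores from two different queries are not generally comparable, so the boost is an arbitrary assumption rather than real "intelligent scoring".

```python
from typing import Dict, List, Tuple

Hit = Tuple[str, float]  # (document id, score) from some search call


def merge_hits(literal_hits: List[Hit],
               operator_hits: List[Hit],
               literal_boost: float = 1.5) -> List[Hit]:
    """Combine two result lists, keeping the best score per document.

    Hypothetical helper: `literal_hits` come from the query with dashes
    escaped, `operator_hits` from the query with dashes left as operators.
    """
    best: Dict[str, float] = {}
    for doc_id, score in literal_hits:
        best[doc_id] = max(best.get(doc_id, 0.0), score * literal_boost)
    for doc_id, score in operator_hits:
        best[doc_id] = max(best.get(doc_id, 0.0), score)
    # Highest score first; ties broken by document id for stable output.
    return sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    print(merge_hits([("a", 1.0)], [("a", 1.2), ("b", 1.4)]))
```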

regards,
Lukas Kahwe Smith
mls@pooteeweet.org




Re: dealing with dash chars in fields when using dismax

Posted by Erick Erickson <er...@gmail.com>.
Have you tried escaping the dashes? Your dismax definition
and the output from the analysis admin page would also help.

Best
Erick

On Sun, Jun 13, 2010 at 5:27 AM, Lukas Kahwe Smith <ml...@pooteeweet.org> wrote:

> Hi,
>
> I am using dismax on solr 1.4 and I am running into an issue with fields
> that contain dash chars:
> Foo-Bar - Company
>
> Now if someone searches for exactly "Foo-Bar - Company" the resulting
> dismax query would disallow "Company" when trying to find a match.
> Obviously I could just ignore all "-" in any query strings, but that would
> prevent power users from prohibiting words. Is there some magic I can enable
> that would make this use case possible with sensible scoring for the
> results?
>
> Of course I could put in some work on the client side to index things
> accordingly, but I guess in that case I would rather just remove support for
> prohibiting words.
>
> regards,
> Lukas Kahwe Smith
> mls@pooteeweet.org