Posted to solr-user@lucene.apache.org by "Burgmans, Tom" <to...@wolterskluwer.com> on 2013/03/13 16:55:52 UTC

RE: [SPAM] Re: strange edismax parsing when searching in multiple fields (#TB)

The main reason for using stopwords is to speed up query performance, since we see that a large share of the query time is consumed by highlighting stopwords. Also, when reading the full highlighted document, we think it is more readable when only meaningful words are highlighted.

For searching, in fact, I would prefer to keep stopwords...


-----Original Message-----
From: Walter Underwood [mailto:wunder@wunderwood.org]
Sent: Wednesday 13 March 2013 04:43
To: solr-user@lucene.apache.org
Subject: [SPAM] Re: strange edismax parsing when searching in multiple fields (#TB)
Importance: Low

Or don't use stopwords. I haven't used stopwords for, oh, a dozen years or so.

Removing stopwords was a hack developed for 16-bit computers and 40 megabyte disks. We don't need to do that any more.

wunder

On Mar 13, 2013, at 8:28 AM, Ahmet Arslan wrote:

> I would merge stop_en.txt and stop_fr.txt. Use same set of stop words for all fields that you search on.
>
> You might find this useful : http://bibwild.wordpress.com/2010/04/14/solr-stop-wordsdismax-gotcha/
>
> --- On Wed, 3/13/13, Burgmans, Tom <to...@wolterskluwer.com> wrote:
>
>> From: Burgmans, Tom <to...@wolterskluwer.com>
>> Subject: strange edismax parsing when searching in multiple fields (#TB)
>> To: "solr-user@lucene.apache.org" <so...@lucene.apache.org>
>> Date: Wednesday, March 13, 2013, 5:22 PM
>> Hi group,
>>
>> Background:
>> I have a collection containing English and French documents.
>> I made sure to index the English content in field "body"
>> (fieldType=text_en) and the French content in field
>> "body_fr" (fieldType=text_fr).
>>
>> The user could be either English or French so the goal is to
>> execute the queries against both fields simultaneously
>> without knowing the query language upfront. The query is
>> analyzed differently for each field. For both fields a
>> stopFilter is configured, each with its own list of stopwords
>> (different per language).
>>
>> The issue:
>> When I search for 'a result' (without single quotes) in
>> field "body" and "body_fr" at the same time, then "a" is
>> considered a stopword in English and removed for field
>> "body", but not in French so both terms are still searched
>> inside "body_fr". What happens is that the query is parsed
>> (edismax) into this construction:
>>
>> ((body_fr:a)~1.0 (body:result | body_fr:result)~1.0)
>>
>> This query returns only French documents, although there are
>> many English documents in the index that contain the term
>> 'result' as well. How can that happen? I think it is related
>> to the way my query is parsed: there seems to be an
>> AND-relationship between (body_fr:a) and (body:result |
>> body_fr:result). There is no English document that has
>> (body_fr:a), so that's why they don't show up. To me, a much
>> more logical parsed query would be:
>>
>> ((body:result)~1.0 | (body_fr:a body_fr:result)~1.0)
>>
>> How should I interpret this? Is it a bug in edismax? Is it
>> intended, and if so, why?
>>
>> Thanks for any hint,
>> Tom
>>
>> This email and any attachments may contain confidential or
>> privileged information
>> and is intended for the addressee only. If you are not the
>> intended recipient, please
>> immediately notify us by email or telephone and delete the
>> original email and attachments
>> without using, disseminating or reproducing its contents to
>> anyone other than the intended
>> recipient. Wolters Kluwer shall not be liable for the
>> incorrect or incomplete transmission of
>> of this email or any attachments, nor for unauthorized use
>> by its employees.
>>
>> Wolters Kluwer nv has its registered address in Alphen aan
>> den Rijn, The Netherlands, and is registered
>> with the Trade Registry of the Dutch Chamber of Commerce
>> under number 33202517.
>>

--
Walter Underwood
wunder@wunderwood.org
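
The behaviour described in the original question can be reproduced with a small toy model. This is only a sketch, not Solr's implementation: it assumes that every per-term clause is required (for example mm=100%, which the thread does not confirm), and the stopword lists and documents are made up for illustration. It shows why the clause (body_fr:a) excludes English documents when the stopword lists differ per field, and why a merged stopword list, as Ahmet suggests, avoids the problem.

```python
# Toy model of dismax-style query building. Stopword lists and documents are
# invented for illustration; this is not Solr's actual code.

def build_clauses(query, stopwords_by_field):
    """Build one clause per query term: the (field, term) pairs that survive analysis."""
    clauses = []
    for term in query.lower().split():
        alternatives = [
            (field, term)
            for field, stopwords in stopwords_by_field.items()
            if term not in stopwords
        ]
        if alternatives:  # a term removed in every field produces no clause at all
            clauses.append(alternatives)
    return clauses


def matches(doc, clauses):
    """Assumption: every clause is required; a clause matches if any field contains the term."""
    return all(
        any(term in doc.get(field, set()) for field, term in clause)
        for clause in clauses
    )


# Indexed terms per field. The English stop filter already removed "a" at index time.
english_doc = {"body": {"good", "result"}}
french_doc = {"body_fr": {"il", "a", "result"}}

# Different stopword list per language versus one merged list for both fields.
per_language = {"body": {"a", "the"}, "body_fr": {"le", "la"}}
merged_list = {"a", "the", "le", "la"}
merged = {"body": merged_list, "body_fr": merged_list}

split_clauses = build_clauses("a result", per_language)
merged_clauses = build_clauses("a result", merged)

print(split_clauses)
print(matches(english_doc, split_clauses), matches(french_doc, split_clauses))
print(merged_clauses)
print(matches(english_doc, merged_clauses), matches(french_doc, merged_clauses))
```

With the per-language lists, the term "a" yields a clause that only body_fr can satisfy, so the English document fails it. With the merged list, "a" disappears from both fields and only the "result" clause remains, which both documents satisfy.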

Re: [SPAM] Re: strange edismax parsing when searching in multiple fields (#TB)

Posted by Walter Underwood <wu...@wunderwood.org>.
Yeah, the Ultraseek highlighter did not highlight standalone stopwords. It did highlight stopwords in phrases. That is the "vitamin a" test.

wunder

--
Walter Underwood
wunder@wunderwood.org
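
The "vitamin a" rule mentioned above (skip standalone stopwords, but keep stopwords that are part of a phrase) could be sketched as follows. This is a hypothetical helper, not Ultraseek's or Solr's code; the function name and the stopword list are assumptions, and it only treats double-quoted text as a phrase.

```python
import re

# Illustrative stopword list; a real deployment would load its own.
STOPWORDS = {"a", "an", "the", "of", "in"}


def highlight_terms(query):
    """Return the terms worth highlighting: phrases kept whole, standalone stopwords dropped."""
    terms = []
    # Each match is either a double-quoted phrase (first group) or a bare word (second group).
    for phrase, word in re.findall(r'"([^"]+)"|(\S+)', query):
        if phrase:
            terms.append(f'"{phrase}"')  # keep the phrase intact, stopwords included
        elif word.lower() not in STOPWORDS:
            terms.append(word)
    return terms


print(highlight_terms('a result'))
print(highlight_terms('"vitamin a" in food'))
```

For the query 'a result' only 'result' survives, while the quoted phrase "vitamin a" is passed through unchanged.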




RE: [SPAM] Re: strange edismax parsing when searching in multiple fields (#TB)

Posted by Ahmet Arslan <io...@yahoo.com>.
Hi Tom,

I don't use stop word removal either. Instead, I use the hl.q parameter, fed with only the "meaningful words":
http://wiki.apache.org/solr/HighlightingParameters#hl.q
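
As a rough sketch of that approach, the request below searches with the full user query but highlights only a separate list of meaningful words passed through hl.q. The field names body and body_fr come from the original question; the helper name, the choice of hl.fl, and the way the meaningful words are obtained are assumptions. How hl.q is parsed can depend on the Solr version, so check that against your own setup.

```python
from urllib.parse import urlencode


def build_select_params(user_query, meaningful_words):
    """Build Solr select parameters: search on the full query, highlight only chosen words."""
    params = {
        "q": user_query,          # stopwords stay in the search query
        "defType": "edismax",
        "qf": "body body_fr",     # both language fields, as in the original question
        "hl": "true",
        "hl.fl": "body,body_fr",
    }
    if meaningful_words:
        # hl.q overrides the query used for highlighting only.
        params["hl.q"] = " ".join(meaningful_words)
    return params


query_string = urlencode(build_select_params("a result", ["result"]))
print(query_string)
```

The resulting string can be appended to the select handler URL of whichever core or collection is being queried.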

