Posted to solr-user@lucene.apache.org by Tom Mortimer <to...@gmail.com> on 2013/11/07 12:49:27 UTC

eDisMax, multiple language support and stopwords

Hi all,

Thanks for the help and advice I've got here so far!

Another question: I want to support stopwords at search time, so that e.g.
the query "oscar and wilde" is equivalent to "oscar wilde" (this is with
lowercaseOperators=false). Fair enough, I have the stopword "and" in the
query analyser chain.

However, I also need to support French as well as English, so I've got _en
and _fr versions of the text fields, with appropriate stemming and
stopwords. I index French content into the _fr fields and English into the
_en fields. I'm searching with eDisMax over both versions, e.g.:

    <str name="qf">headline_en headline_fr</str>

However, this means I get no results for "oscar and wilde". The parsed
query is:

    (+((DisjunctionMaxQuery((headline_fr:osca | headline_en:oscar))
    DisjunctionMaxQuery((headline_fr:and))
    DisjunctionMaxQuery((headline_fr:wild | headline_en:wild)))~3))/no_coord

If I add "and" to the French stopwords list, I *do* get results, and the
parsed query is:

    (+((DisjunctionMaxQuery((headline_fr:osca | headline_en:oscar))
    DisjunctionMaxQuery((headline_fr:wild | headline_en:wild)))~2))/no_coord

This implies that the only solution is to have a minimal, shared stopwords
list for all languages I want to support. Is this correct, or is there a
way of supporting this kind of searching with per-language stopword lists?

Thanks for any ideas!

Tom

Re: eDisMax, multiple language support and stopwords

Posted by Liu Bo <di...@gmail.com>.
Happy to see someone with a setup similar to ours.

We have a similar multi-language search feature, and we index content in
different languages into _fr and _en fields, as you've done.

At search time, though, we require a language code as a parameter to specify
which language the client wants to search in. This is normally decided by
the website being visited, e.g.: qf=name description&language=en

Our search components then pick the right fields to search on: name_en and
description_en.
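
To illustrate the field selection, here is a minimal Python sketch. It is not our actual search component: the function name `localized_qf`, the list of supported languages, and the error handling are assumptions made for this example.

```python
def localized_qf(qf, language, supported=("en", "fr")):
    """Append the language suffix to every field named in qf.

    Hypothetical helper: the real component works inside Solr, this only
    shows the field-name mapping.
    """
    if language not in supported:
        raise ValueError(f"unsupported language: {language!r}")
    return " ".join(f"{field}_{language}" for field in qf.split())


# qf=name description&language=en
print(localized_qf("name description", "en"))  # name_en description_en
```

Because only one language's fields are queried, every query term passes through a single analyser chain, so the per-field stopword mismatch cannot arise.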

We used to support searching across all languages and removed that later,
since each site tells the customer which language is supported. We also
don't think we have many language experts on our websites who know more than
two languages and need to search them at the same time.





-- 
All the best

Liu Bo

Re: eDisMax, multiple language support and stopwords

Posted by Tom Mortimer <to...@gmail.com>.
Ah, thanks Markus. I think I'll just add the Boolean operators to the
stopwords list in that case.
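
A rough Python sketch of that idea, assuming the operator words are "and", "or" and "not" (the exact list is my assumption) and using made-up per-language lists. The helper name `with_shared_stopwords` is hypothetical:

```python
# Assumed operator words; adjust to whatever the query parser treats as operators.
BOOLEAN_OPERATORS = frozenset({"and", "or", "not"})


def with_shared_stopwords(per_language, shared=BOOLEAN_OPERATORS):
    """Return new per-language stopword sets that all include the shared words."""
    return {lang: set(words) | set(shared) for lang, words in per_language.items()}


# Illustrative lists only, not the real stopword files.
per_language = {"en": {"the", "of", "and"}, "fr": {"le", "la", "et"}}
merged = with_shared_stopwords(per_language)

# Every language now drops the operator words.
print(all(BOOLEAN_OPERATORS <= words for words in merged.values()))  # True
```

This only evens out the operator words. Words that remain in one list only, such as "et" here, would presumably still produce a clause for the other language's field.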

Tom




RE: eDisMax, multiple language support and stopwords

Posted by Markus Jelsma <ma...@openindex.io>.
This is an ancient problem. The issue here is your mm parameter: it gets confused because a different number of tokens is filtered/emitted for each field, so it is never going to work just like this. The easiest option is not to use the stop filter.

http://lucene.472066.n3.nabble.com/Dismax-Minimum-Match-Stopwords-Bug-td493483.html
https://issues.apache.org/jira/browse/SOLR-3085
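
The mismatch can be reproduced with a toy model. This is a minimal Python sketch, not Solr code: the stopword lists, the helpers `build_clauses` and `matches`, and the whitespace tokenisation are all assumptions made for illustration, and stemming is ignored. It assumes every clause is required, as the ~3 and ~2 in the parsed queries suggest.

```python
# Assumed per-field stopword lists; the French one does not contain "and".
STOPWORDS = {
    "headline_en": {"and", "the", "of"},
    "headline_fr": {"et", "le", "la"},
}


def build_clauses(query, stopwords):
    """One clause per query term, listing the fields where the term survives."""
    clauses = []
    for term in query.lower().split():
        fields = [f for f, stops in sorted(stopwords.items()) if term not in stops]
        if fields:  # a term removed by every field produces no clause at all
            clauses.append((term, fields))
    return clauses


def matches(doc, clauses, mm):
    """True if at least mm clauses match the document in any of their fields."""
    hits = sum(
        1 for term, fields in clauses
        if any(term in doc.get(f, ()) for f in fields)
    )
    return hits >= mm


# An English document: only the _en field is populated.
doc = {"headline_en": {"oscar", "wilde"}}

clauses = build_clauses("oscar and wilde", STOPWORDS)
print(len(clauses), matches(doc, clauses, mm=len(clauses)))  # 3 False

# Adding "and" to every list removes the stray clause.
shared = {f: stops | {"and"} for f, stops in STOPWORDS.items()}
clauses = build_clauses("oscar and wilde", shared)
print(len(clauses), matches(doc, clauses, mm=len(clauses)))  # 2 True
```

In the first case "and" survives only in headline_fr, so it becomes a required clause that an English document can never satisfy.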
 