Posted to users@solr.apache.org by Basheer K <ba...@digitalapicraft.com> on 2021/04/23 04:59:18 UTC

Want to know about Solr security updates.

I would like to subscribe to security updates for Apache Solr. Could you
please let me know the mailing list ID?

Best Regards,
Basheer K

Re: When to use or not use stem field

Posted by Alessandro Benedetti <a....@sease.io>.

Elaborating on top of the already good answers:
"Out of the box, the scoring will already take care of it."
Are we sure? I mean, it will "mostly" take care of it.

When using multi-field search, you can approach scoring in different ways.
For example, using edismax and the tie factor you can move from a pure
disjunction (max) query to a pure boolean (sum) query, or anything in
between, to calculate the score.
A query "term1" on the fields qf=text text_stemmed produces:

Query Term = term1
Stemmed Query Term = term
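
As a rough illustration of such a request, here is a minimal Python sketch that builds the query string for a two-field edismax search. defType, q, qf and tie are standard Solr query parameters and the field names come from this thread; the helper name edismax_params is made up for the example.

```python
from urllib.parse import urlencode


def edismax_params(query: str, tie: float) -> str:
    """Build a query string for a two-field edismax search (illustrative helper)."""
    params = {
        "defType": "edismax",
        "q": query,
        "qf": "text text_stemmed",  # field names taken from this thread
        "tie": tie,  # 0.0 = best clause only, 1.0 = sum of all clauses
    }
    return urlencode(params)


pure_disjunction = edismax_params("term1", tie=0.0)
pure_sum = edismax_params("term1", tie=1.0)
print(pure_disjunction)
print(pure_sum)
```

Values of tie between 0.0 and 1.0 give the "anything in between" behaviour.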

*Pure Disjunction*
text:term1 | text_stemmed:term
The score is that of the highest-scoring clause.
For a document that contains the exact term "term1", the winning clause
could be either of the two.

*term1* in the field text has term frequency TF1 and document frequency DF1
*term* in the field text_stemmed has term frequency TF and document
frequency DF
TF >= TF1 (= if only term1 was originally present in the field, > if term1,
term2, term were present and stemmed to 'term')
IDF <= IDF1 (= if only term1 was originally present in the field in the
corpus, < if term1, term2, term were present and stemmed to 'term')

Documents containing different terms may have matches with higher or lower
TF, while DF is always going to be >= DF1.
BM25 saturates the impact of term frequency on the score, but you may
still get the winning clause from text_stemmed:term because of term
frequency.
So I think we can say that the exact match is likely to win because of the
Inverse Document Frequency factor, but it's not guaranteed in a pure
disjunction.

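e.g.
*Doc1*
text: "*term1* bla bla bla bla"
TF1 (un-stemmed) = 1
TF (stemmed) = 1
DF1 = 100
DF = 101

*Doc2*
text: "*term2* *term3* *term4* *term5* bla bla *term6* bla bla"
TF1 (un-stemmed) = 0 (no match)
TF (stemmed) = 5
DF = 101

Here Doc2 can outscore Doc1 through text_stemmed:term thanks to its higher
TF, even though it never contains the exact term.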

*Pure Boolean*
text:term1 | text_stemmed:term
The score is the sum of the scoring clauses.
But the observation is similar:
depending on the term frequency, we are likely to see a better score
for documents matching the exact term in the field 'text' (because the
exact term in the field 'text' has a higher inverse document frequency
and we add the score of its stemmed counterpart).
But not always, because the inverse document frequency may not compensate
enough.
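
To put numbers on the Doc1/Doc2 example, here is a minimal Python sketch using a simplified BM25 formula and the max-plus-tie combination. The corpus size, average field length, k1 and b values are assumptions chosen for illustration, and real Lucene scoring includes details this sketch ignores.

```python
import math


def bm25(tf: int, df: int, doc_len: int, n_docs: int, avg_len: float,
         k1: float = 1.2, b: float = 0.75) -> float:
    """Simplified BM25 score of one term in one field (no boosts)."""
    if tf == 0:
        return 0.0
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    norm = k1 * (1 - b + b * doc_len / avg_len)
    return idf * tf / (tf + norm)


def dismax(clause_scores: list, tie: float) -> float:
    """Best clause score plus tie times the sum of the other clauses."""
    best = max(clause_scores)
    return best + tie * (sum(clause_scores) - best)


N_DOCS, AVG_LEN = 10_000, 7.0  # assumed corpus size and average field length

# Clause order: [text:term1, text_stemmed:term]
# Doc1: exact match once, 5 tokens long.
doc1 = [bm25(1, 100, 5, N_DOCS, AVG_LEN), bm25(1, 101, 5, N_DOCS, AVG_LEN)]
# Doc2: no exact match, 5 stemmed matches, 9 tokens long.
doc2 = [bm25(0, 100, 9, N_DOCS, AVG_LEN), bm25(5, 101, 9, N_DOCS, AVG_LEN)]

for tie in (0.0, 1.0):
    print(tie, round(dismax(doc1, tie), 3), round(dismax(doc2, tie), 3))
```

Under these assumptions Doc2 outranks Doc1 with tie=0.0, while Doc1 comes out ahead with tie=1.0. Other corpus statistics can change either outcome.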

I know many other factors affect the score, but without boosting to a
certain extent (what extent is not easy to say), I don't think we can
guarantee the un-stemmed match wins.
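
A boost of this kind is usually expressed in qf with the field^boost syntax. The Python sketch below only formats such a qf value; the boost of 2.0 is an arbitrary placeholder, not a recommendation, and the helper name is hypothetical.

```python
def qf_with_boosts(boosts: dict) -> str:
    """Format a qf value; fields with boost 1 are written without a suffix."""
    parts = []
    for field, boost in boosts.items():
        parts.append(field if boost == 1 else f"{field}^{boost:g}")
    return " ".join(parts)


# Placeholder boost: favour the un-stemmed field over the stemmed one.
qf = qf_with_boosts({"text": 2.0, "text_stemmed": 1.0})
print(qf)
```

How large the boost needs to be depends on the corpus, so it is worth checking against real queries.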

Cheers
--------------------------
Alessandro Benedetti
Apache Lucene/Solr Committer
Director, R&D Software Engineer, Search Consultant

www.sease.io


Re: When to use or not use stem field

Posted by Walter Underwood <wu...@wunderwood.org>.

Use stemming for regular text, like news articles or product descriptions.
You want to match “job” to both “jobs report” and “job numbers”.

Use unstemmed for proper names—people, places, products. You do not
want “job” to match “Steve Jobs” but you do want it to match “Book of Job”.
You don’t want “gate” to match “Bill Gates”. You do not want “see” to match
the movie “Saw”. Really, you don’t, that happened at Netflix.
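
The effect on proper names can be sketched in Python with a toy stemmer. The plural-stripping rule below is a deliberately naive stand-in, not the Porter or Snowball algorithm that a real analyzer would use, and the function names are made up for the example.

```python
def toy_stem(token: str) -> str:
    """Strip a trailing plural 's'; a naive stand-in for a real stemmer."""
    t = token.lower()
    return t[:-1] if len(t) > 3 and t.endswith("s") else t


def matches(query: str, text: str, stemmed: bool) -> bool:
    """True if the single-term query matches any token of the text."""
    normalize = toy_stem if stemmed else str.lower
    return normalize(query) in {normalize(tok) for tok in text.split()}


for title in ("jobs report", "Steve Jobs", "Book of Job"):
    print(title, matches("job", title, stemmed=True),
          matches("job", title, stemmed=False))
```

With stemming, "job" matches all three titles, including "Steve Jobs"; without it, only "Book of Job" matches.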

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)


Re: When to use or not use stem field

Posted by Markus Jelsma <ma...@openindex.io>.

Hello,

I would use both at the same time. You do not always want to find all
stemmed forms of a term, but the unstemmed form instead, or at least have
the latter score higher. Out of the box, the scoring will already
take care of it.

I actually prefer both in one field, using the KeywordRepeat filter, but
that leads to other headaches that require even more work to fix.
Use both fields and keep it simple.

Regards,
Markus


When to use or not use stem field

Posted by The Maverick <ma...@posteo.de>.

Hello

I have a schema with two fields.
One is stemmed and one isn't.
When should I use the stemmed field in my search (and when shouldn't I)?

Regards
S

Re: Want to know about Solr security updates.

Posted by Jan Høydahl <ja...@cominvent.com>.

Hi,

Almost the same reply as for new releases: there is a separate security section on the website, also with an Atom feed.
You could probably also set up an email rule on this list that looks for "CVE" in the subject and routes those messages to a different folder.

Jan
