Posted to java-user@lucene.apache.org by Anh Dũng Bùi <du...@gmail.com> on 2022/12/27 23:22:40 UTC

Question for SynonymQuery

Hi Lucene users,

I recently came across SynonymQuery and found out that it only supports
single-term synonyms (since it accepts a list of Term which will be
considered as synonyms). We have some multi-term synonyms like "internet
device" <-> "wifi router" or "dns" <-> "domain name service". Am I right
that I need to use something like a BooleanQuery for these cases?

I have 2 other follow-up questions:
- Does SynonymQuery have any advantage over BooleanQuery? Or is it only
different in how scores are computed? As I understand SynonymWeight will
consider all terms as exactly the same while BooleanQuery will favor the
documents with more matched terms.
- Is it worth it to support multi-term synonyms in SynonymQuery? My feeling
is that it's better to just use BooleanQuery in those cases, since to
support multi-term synonyms it needs to accept a list of Query, which would
make it behave like a BooleanQuery. Also how scoring works with multi-term
is another problem.

Thanks & Regards!
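
The scoring difference raised in the first follow-up question can be illustrated without Lucene. The sketch below is a toy model, not Lucene's implementation: `saturate` is a made-up stand-in for a similarity function such as BM25, and the class and method names are hypothetical. It assumes, as the SynonymQuery Javadoc describes, that synonym-style scoring sums the term frequencies of all synonyms and invokes the similarity once, whereas a disjunction of SHOULD clauses scores each term separately and adds the results.

```java
import java.util.List;

// Toy model of the two scoring styles; not Lucene code.
class SynonymScoringSketch {

    // Hypothetical saturating score: grows with tf but levels off.
    static double saturate(int tf) {
        return tf / (tf + 1.0);
    }

    // Synonym-style: treat all synonyms as one term, so sum the frequencies first.
    static double synonymStyle(List<Integer> termFreqs) {
        int total = 0;
        for (int tf : termFreqs) {
            total += tf;
        }
        return saturate(total);
    }

    // Boolean-style: score every term on its own and add the scores.
    static double booleanStyle(List<Integer> termFreqs) {
        double score = 0.0;
        for (int tf : termFreqs) {
            score += saturate(tf);
        }
        return score;
    }

    public static void main(String[] args) {
        List<Integer> docA = List.of(2, 0); // one synonym occurs twice
        List<Integer> docB = List.of(1, 1); // both synonyms occur once
        System.out.println("synonym-style: A=" + synonymStyle(docA) + " B=" + synonymStyle(docB));
        System.out.println("boolean-style: A=" + booleanStyle(docA) + " B=" + booleanStyle(docB));
    }
}
```

In this model both documents tie under synonym-style scoring, while the boolean-style sum ranks the document containing both synonyms higher, which is the behavior described in the question.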

Re: Question for SynonymQuery

Posted by Mikhail Khludnev <mk...@apache.org>.
Hello.

1) Yes. That's the purpose.
2) I've skimmed through QueryBuilder.java. My conclusion is that it creates a
BooleanQuery of SHOULD clauses (although it arguably should be something like
DisjunctionMaxQuery) over PhraseQuery or MultiPhraseQuery instances.
Good hack!
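
The query shape described in point 2, one SHOULD clause per synonym with each clause a phrase, can be modelled in plain Java. This is only a sketch of the matching logic under simplifying assumptions (whitespace tokenization, exact adjacency, no scoring); the class and method names are hypothetical and none of this is Lucene API. For contrast it also models a per-position form, in which any term from each group is enough.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Plain-Java model of the matching logic only; not Lucene code.
class PhraseDisjunctionSketch {

    // Naive tokenizer (assumption): lowercase and split on whitespace.
    static List<String> tokens(String text) {
        return Arrays.asList(text.toLowerCase().trim().split("\\s+"));
    }

    // Models a phrase clause: the phrase must occur as a contiguous run of tokens.
    static boolean containsPhrase(String doc, String phrase) {
        return Collections.indexOfSubList(tokens(doc), tokens(phrase)) >= 0;
    }

    // Models SHOULD clauses over phrases: ("wifi router") OR ("internet device").
    static boolean matchesAnyPhrase(String doc, List<String> phrases) {
        for (String phrase : phrases) {
            if (containsPhrase(doc, phrase)) {
                return true;
            }
        }
        return false;
    }

    // Models the per-position form: ("wifi" OR "internet") AND ("router" OR "device").
    static boolean matchesPerPosition(String doc, List<List<String>> alternatives) {
        List<String> docTokens = tokens(doc);
        for (List<String> group : alternatives) {
            if (Collections.disjoint(docTokens, group)) {
                return false; // no term of this group occurs in the document
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> synonyms = List.of("wifi router", "internet device");
        List<List<String>> groups = List.of(List.of("wifi", "internet"), List.of("router", "device"));
        String doc = "an internet router";
        System.out.println("any phrase:   " + matchesAnyPhrase(doc, synonyms));
        System.out.println("per position: " + matchesPerPosition(doc, groups));
    }
}
```

The document "an internet router" satisfies the per-position form but neither phrase, so only the disjunction of phrases keeps the two synonyms apart.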


-- 
Sincerely yours
Mikhail Khludnev

Re: Question for SynonymQuery

Posted by Michael Wechner <mi...@wyona.com>.
Independent of the synonym implementation, you might want to consider vector/similarity search. For example, if the query is "internet device", then the cosine similarities of the multi-terms "internet device", "wifi router" and "wifi device" using the model "all-mpnet-base-v2" are:

{"cosineSimilarity":1,"cosineDistance":0,"sentenceOne":"internet device","sentenceTwo":"internet device"}

{"cosineSimilarity":0.47380197,"cosineDistance":0.526198,"sentenceOne":"internet device","sentenceTwo":"wifi router"}

{"cosineSimilarity":0.74852204,"cosineDistance":0.25147796,"sentenceOne":"internet device","sentenceTwo":"wifi device"}

As you can see, with this model "wifi device" is closer to "internet device" than "wifi router" is. If you consider "wifi device" a false positive, then this is of course not helpful, but otherwise it might be useful considering the original question of this thread.

HTH

Michael
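
For reference, the two numbers reported per pair are related by cosineDistance = 1 - cosineSimilarity (for example 0.47380197 + 0.526198 is approximately 1). A minimal sketch of that computation follows. The vectors in `main` are made-up toy values, not embeddings produced by "all-mpnet-base-v2", and the class and method names are hypothetical; obtaining the embeddings from a model is not shown.

```java
// Cosine similarity and distance for two dense vectors; toy data only.
class CosineSketch {

    static double cosineSimilarity(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    static double cosineDistance(double[] a, double[] b) {
        return 1.0 - cosineSimilarity(a, b);
    }

    public static void main(String[] args) {
        // Hypothetical 3-dimensional "embeddings" for illustration.
        double[] query = {1.0, 2.0, 3.0};
        double[] candidate = {2.0, 1.0, 3.0};
        System.out.println("similarity = " + cosineSimilarity(query, candidate));
        System.out.println("distance   = " + cosineDistance(query, candidate));
    }
}
```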



On 2 Jan 2023 at 17:54, Mikhail Khludnev wrote:
> Hello Trevor.
> Can you help me better understand this approach? If we have a text "wifi
> router" and inject "internet device" at indexing time, terms reside at the
> same positions. How to avoid false positive match for query "wifi device"?

RE: Question for SynonymQuery

Posted by Trevor Nicholls <tr...@castingthevoid.com>.
Hi Mikhail

Yes, if my text contains "wifi router", and my synonym map includes "wifi router","internet device", then if I search for "wifi device" I will get a match. While I can see that on the strictest criteria this might be incorrect, in practice I would happily see that returned as a match. I wouldn't call it a false positive, it's more like an unintended benefit.
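
The match described above follows directly from the positions. The sketch below is a plain-Java model, not Lucene code, and its names are hypothetical. It assumes the synonym's tokens are stacked one by one onto the positions of the original tokens, as Mikhail's question puts it ("terms reside at the same positions"), and that a phrase query only checks that its tokens occur at consecutive positions.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Plain-Java model of stacked index-time synonyms; not Lucene code.
class StackedPositionsSketch {

    // Index the original tokens and inject the synonym's tokens at the same positions.
    static List<Set<String>> indexWithStackedSynonym(String[] original, String[] synonym) {
        List<Set<String>> positions = new ArrayList<>();
        int length = Math.max(original.length, synonym.length);
        for (int i = 0; i < length; i++) {
            Set<String> atPosition = new LinkedHashSet<>();
            if (i < original.length) {
                atPosition.add(original[i]);
            }
            if (i < synonym.length) {
                atPosition.add(synonym[i]);
            }
            positions.add(atPosition);
        }
        return positions;
    }

    // Phrase match: the phrase tokens must be found at consecutive positions.
    static boolean phraseMatches(List<Set<String>> positions, String... phrase) {
        for (int start = 0; start + phrase.length <= positions.size(); start++) {
            boolean all = true;
            for (int j = 0; j < phrase.length; j++) {
                if (!positions.get(start + j).contains(phrase[j])) {
                    all = false;
                    break;
                }
            }
            if (all) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Set<String>> index = indexWithStackedSynonym(
                new String[] {"wifi", "router"}, new String[] {"internet", "device"});
        System.out.println(index); // position 0 holds wifi/internet, position 1 holds router/device
        System.out.println("wifi device matches: " + phraseMatches(index, "wifi", "device"));
    }
}
```

Because position 0 holds both "wifi" and "internet" and position 1 holds both "router" and "device", any combination of one token per position matches, including "wifi device".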

No doubt there are pathological cases where I would not be so happy but nobody has come up with one in our application yet. As I said there's scope for improvement in our implementation, but at this point I'm not convinced that the benefit of plugging this gap justifies the cost.

If somebody points you to a better option I would also be interested in seeing it.

cheers
T

-----Original Message-----
From: Mikhail Khludnev <mk...@apache.org> 
Sent: Tuesday, 3 January 2023 09:55
To: java-user@lucene.apache.org
Subject: Re: Question for SynonymQuery

Hello Trevor.
Can you help me better understand this approach? If we have a text "wifi router" and inject "internet device" at indexing time, terms reside at the same positions. How to avoid false positive match for query "wifi device"?

--
Sincerely yours
Mikhail Khludnev
https://t.me/MUST_SEARCH
A caveat: Cyrillic!


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Question for SynonymQuery

Posted by Mikhail Khludnev <mk...@apache.org>.
Hello Trevor.
Can you help me better understand this approach? If we have the text "wifi
router" and inject "internet device" at indexing time, the terms reside at the
same positions. How do we avoid a false-positive match for the query "wifi device"?

On Mon, Jan 2, 2023 at 4:16 PM Trevor Nicholls <tr...@castingthevoid.com>
wrote:

> Hi Anh
>
> The two links Michael shared relate to questions I asked when I was trying
> to get synonym matching with our application.
>
> I really do have multi-term synonym matching working at this point;
> there's always scope for improvement of course but with the hints supplied
> in those threads I was able to index our documents and search them using a
> variety of synonymous terms, both single words and phrases.
>
> Our application does not use either BooleanQuery or SynonymQuery; I have
> just used the standard QueryParser. Instead the synonym processing occurs
> in the indexing phase, which is not only simpler (one search pattern, one
> query), but I think you would also find it gives you superior
> performance (because the synonym processing occurs once at indexing time
> and not at all during searching - and I'm sure you'll be doing far more
> searching than indexing).
>
> cheers
> T
>

-- 
Sincerely yours
Mikhail Khludnev
https://t.me/MUST_SEARCH
A caveat: Cyrillic!

RE: Question for SynonymQuery

Posted by Trevor Nicholls <tr...@castingthevoid.com>.
Hi Anh

The two links Michael shared relate to questions I asked when I was trying to get synonym matching with our application.

I really do have multi-term synonym matching working at this point; there's always scope for improvement of course, but with the hints supplied in those threads I was able to index our documents and search them using a variety of synonymous terms, both single words and phrases.

Our application does not use either BooleanQuery or SynonymQuery; I have just used the standard QueryParser. Instead, the synonym processing occurs in the indexing phase, which is not only simpler (one search pattern, one query), but I think you would also find it gives you superior performance (because the synonym processing occurs once at indexing time and not at all during searching, and I'm sure you'll be doing far more searching than indexing).

cheers
T


-----Original Message-----
From: Michael Wechner <mi...@wyona.com> 
Sent: Thursday, 29 December 2022 08:56
To: java-user@lucene.apache.org
Subject: Re: Question for SynonymQuery

Hi Anh

The following Stackoverflow link might help

https://stackoverflow.com/questions/73240494/can-someone-assist-me-with-a-multi-word-synonym-problem-in-lucene

The following thread seems to confirm, that escaping the space with a backslash does not help

https://lists.apache.org/list?java-user@lucene.apache.org:2022-3

HTH

Michael



Re: Question for SynonymQuery

Posted by Mikhail Khludnev <mk...@apache.org>.
Hello, Satnam.
It seems I achieved what you are asking for.
https://github.com/mkhludnev/likely/blob/381b491d25e4d2035dd5b8a891dfdcfe2b986b90/src/test/java/org/apache/lucene/playground/TestMultiPulty.java#L32
It expands API and UI into phrases, which match as you expect.

On Fri, Jan 20, 2023 at 4:18 PM _ SATNAM <sa...@gmail.com> wrote:

> Hey Mikhail and  Anh Dung Bui
> i am also struggling with synonym query
> my use case  for eg
> I created synonyms for word
> API ------> Application program interface
> UI ---------> user interface
>
> doc 1 --->  This is API and it is called Application program interface
> doc2  ----> How i help you in UI things
> doc3-----> my substance interface
> doc4 ------> how to write c++ program
>
> what i want to achieve is when i search for API UI together
>
> expected result
> it must highlight  ---> API and  Application program interface in doc1
> ------> UI in doc2
>
> but coming output is
> it  highlighted  ---> API and  Application program interface in doc1
> ------> UI in doc2
> -----> interface  in doc 3
> ------> program in doc4
>
> Do you have any suggesting how i achieve this
>
> (API) OR (UI)
> Each term act as phrase query for  API  UI
> no single tokens be matched ,phrase should be matched
>
>
>


-- 
Sincerely yours
Mikhail Khludnev
https://t.me/MUST_SEARCH
A caveat: Cyrillic!

Re: Question for SynonymQuery

Posted by _ SATNAM <sa...@gmail.com>.
Hey Mikhail and Anh Dung Bui,

I am also struggling with synonym queries. My use case, for example: I created synonyms for these words:

API ------> Application program interface
UI ---------> user interface

doc1 ---> This is API and it is called Application program interface
doc2 ---> How i help you in UI things
doc3 ---> my substance interface
doc4 ---> how to write c++ program

What I want to achieve concerns searching for API and UI together.

Expected result, it must highlight:
---> API and Application program interface in doc1
---> UI in doc2

But the actual output is that it highlighted:
---> API and Application program interface in doc1
---> UI in doc2
---> interface in doc3
---> program in doc4

Do you have any suggestions for how I can achieve this?

(API) OR (UI)
Each term should act as a phrase query for API and UI:
no single tokens should be matched, only whole phrases.
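
The expected behavior above can be written down as a small plain-Java model. This is a sketch of the desired matching rule only, not Lucene code and not the linked test: the synonym table is copied from this message, the tokenizer is a naive whitespace split, and all class and method names are hypothetical. Each query word is expanded into itself plus its whole synonym phrase, and a phrase counts as a hit only if it occurs contiguously in the document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Plain-Java model of phrase-only synonym expansion; not Lucene code.
class PhraseExpansionSketch {

    // Synonym table taken from the message above (lowercased).
    static final Map<String, String> SYNONYMS = Map.of(
            "api", "application program interface",
            "ui", "user interface");

    // Naive tokenizer (assumption): lowercase and split on whitespace.
    static List<String> tokens(String text) {
        return Arrays.asList(text.toLowerCase().trim().split("\\s+"));
    }

    // Expand each query word into itself plus its whole synonym phrase.
    static List<String> expand(String query) {
        List<String> phrases = new ArrayList<>();
        for (String word : tokens(query)) {
            phrases.add(word);
            String synonym = SYNONYMS.get(word);
            if (synonym != null) {
                phrases.add(synonym);
            }
        }
        return phrases;
    }

    // Return the expanded phrases that occur contiguously in the document.
    static List<String> highlights(String doc, String query) {
        List<String> docTokens = tokens(doc);
        List<String> hits = new ArrayList<>();
        for (String phrase : expand(query)) {
            if (Collections.indexOfSubList(docTokens, tokens(phrase)) >= 0) {
                hits.add(phrase);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String[] docs = {
            "This is API and it is called Application program interface",
            "How i help you in UI things",
            "my substance interface",
            "how to write c++ program"
        };
        for (String doc : docs) {
            System.out.println(highlights(doc, "API UI") + " <- " + doc);
        }
    }
}
```

Under this rule doc3 and doc4 produce no hits, because "interface" and "program" on their own are never query phrases.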





On Thu, Jan 19, 2023 at 6:56 AM Anh Dũng Bùi <du...@gmail.com> wrote:

> Thanks Mikhail!
>
> It turns out I used FlattenGraphFilter, which caused the PositionLength to be
> all 1 and resulted in the behavior above =)
>
> A side note is that we don't need to use WORD_SEPARATOR in the synonym
> file. SynonymMap.Parser.analyze would tokenize and append the separator for
> us.
>
> Regards,
> Anh Dung Bui
>
> On Mon, Jan 2, 2023 at 8:07 Mikhail Khludnev <mk...@apache.org> wrote:
>
> > Hello Anh,
> > I was intrigued by your question. And I managed it to work somehow.
> > see
> >
> >
> https://github.com/mkhludnev/likely/blob/eval-mulyw-syns/src/test/java/org/apache/lucene/playground/TestMultiPulty.java
> > Beware, synonym files
> >
> >
> https://github.com/mkhludnev/likely/blob/eval-mulyw-syns/src/test/resources/org/apache/lucene/playground/multy-syn.txt
> > should use
> >
> >
> https://lucene.apache.org/core/8_0_0/analyzers-common/org/apache/lucene/analysis/synonym/SynonymMap.html#WORD_SEPARATOR
> > Have a nice hack!
> >
> > On Thu, Dec 29, 2022 at 10:00 AM Anh Dũng Bùi <du...@gmail.com>
> wrote:
> >
> > > Thanks everyone for the insight. I guess I'll use BooleanQuery then.
> > >
> > > There is also a caveat I noticed (not sure if it's an issue or not),
> > which
> > > is slightly different from the mentioned thread. When I have a
> multi-word
> > > synonym, let's say "wifi router" and "internet device". Then using
> > > SynonymGraphFilter at query time (when building the SynonymMap I
> already
> > > escaped space with the backslash) would produce this TokenStream for a
> > > query of "wifi router"
> > >
> > > "wifi" (PositionIncrement=1,PositionLength=1), "internet"
> > > (PositionIncrement=0,PositionLength=1), "router"
> > > (PositionIncrement=1,PositionLength=1), "device"
> > > (PositionIncrement=0,PositionLength=1)
> > >
> > > This has the same effect as if I had 2 synonyms: "wifi"/"internet" and
> > > "router"/"device". If I convert this to a BooleanQuery it would become
> > > ("wifi" OR "internet") AND ("router" OR "device"), but what I would
> like
> > to
> > > achieve is ("wifi" AND "router") OR ("internet" AND "device")
> > >
> > > I'm curious if there would be some workaround for this case
> > >
> > > Thanks,
> > > Anh Dung Bui
> > >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> > https://t.me/MUST_SEARCH
> > A caveat: Cyrillic!
> >
>

Re: Question for SynonymQuery

Posted by Mikhail Khludnev <mk...@apache.org>.
Right. SynonymMap.WORD_SEPARATOR
<https://lucene.apache.org/core/8_0_0/analyzers-common/org/apache/lucene/analysis/synonym/SynonymMap.html#WORD_SEPARATOR>
was a redundant complication. Spaces work fine.

On Thu, Jan 19, 2023 at 4:26 AM Anh Dũng Bùi <du...@gmail.com> wrote:

> Thanks Mikhail!
>
> It turns out I used FlattenGraphFilter, which caused the PositionLength to be
> all 1 and resulted in the behavior above =)
>
> A side note is that we don't need to use WORD_SEPARATOR in the synonym
> file. SynonymMap.Parser.analyze would tokenize and append the separator for
> us.
>
> Regards,
> Anh Dung Bui
>
> > > > Am 27.12.22 um 20:22 schrieb Anh Dũng Bùi:
> > > > > Hi Lucene users,
> > > > >
> > > > > I recently came across SynonymQuery and found out that it only
> > supports
> > > > > single-term synonyms (since it accepts a list of Term which will be
> > > > > considered as synonyms). We have some multi-term synonyms like
> > > "internet
> > > > > device" <-> "wifi router" or "dns" <-> "domain name service". Am I
> > > right
> > > > > that I need to use something like a BooleanQuery for these cases?
> > > > >
> > > > > I have 2 other follow-up questions:
> > > > > - Does SynonymQuery have any advantage over BooleanQuery? Or is it
> > only
> > > > > different in how scores are computed? As I understand SynonymWeight
> > > will
> > > > > consider all terms as exactly the same while BooleanQuery will
> favor
> > > the
> > > > > documents with more matched terms.
> > > > > - Is it worth it to support multi-term synonyms in SynonymQuery? My
> > > > feeling
> > > > > is that it's better to just use BooleanQuery in those cases, since
> to
> > > > > support multi-term synonyms it needs to accept a list of Query,
> which
> > > > would
> > > > > make it behave like a BooleanQuery. Also how scoring works with
> > > > multi-term
> > > > > is another problem.
> > > > >
> > > > > Thanks & Regards!
> > > > >
> > > >
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > >
> > > >
> > >
> >
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> > https://t.me/MUST_SEARCH
> > A caveat: Cyrillic!
> >
>


-- 
Sincerely yours
Mikhail Khludnev
https://t.me/MUST_SEARCH
A caveat: Cyrillic!

Re: Question for SynonymQuery

Posted by Anh Dũng Bùi <du...@gmail.com>.
Thanks Mikhail!

It turns out I was using FlattenGraphFilter, which caused every
PositionLength to be 1 and resulted in the behavior above =)
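
To illustrate why the flattened positions matter, here is a minimal plain-Java
sketch (not Lucene code). The class TokenGraphPaths and its methods are made up
for illustration. The "graph" attribute values below are an assumption about
what an unflattened stream could look like, not captured output. The sketch
walks every path through a token graph defined by PositionIncrement and
PositionLength:

```java
import java.util.ArrayList;
import java.util.List;

public class TokenGraphPaths {
    /** One token with its PositionIncrement and PositionLength (hypothetical model class). */
    static final class Token {
        final String term;
        final int posInc;
        final int posLen;

        Token(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    /** Enumerates every path from the first node to the last node of the token graph. */
    static List<String> paths(List<Token> tokens) {
        int n = tokens.size();
        int[] from = new int[n];
        int[] to = new int[n];
        int pos = -1;
        int last = 0;
        for (int i = 0; i < n; i++) {
            Token t = tokens.get(i);
            pos += t.posInc;          // node where this token starts
            from[i] = pos;
            to[i] = pos + t.posLen;   // node where this token ends
            last = Math.max(last, to[i]);
        }
        List<String> out = new ArrayList<>();
        walk(tokens, from, to, 0, last, "", out);
        return out;
    }

    private static void walk(List<Token> tokens, int[] from, int[] to,
                             int node, int last, String prefix, List<String> out) {
        if (node == last) {
            out.add(prefix.trim());
            return;
        }
        for (int i = 0; i < tokens.size(); i++) {
            if (from[i] == node) {
                walk(tokens, from, to, to[i], last, prefix + " " + tokens.get(i).term, out);
            }
        }
    }

    public static void main(String[] args) {
        // Flattened stream as reported in this thread: every PositionLength is 1.
        List<Token> flattened = List.of(
            new Token("wifi", 1, 1), new Token("internet", 0, 1),
            new Token("router", 1, 1), new Token("device", 0, 1));
        // Assumed graph-shaped stream: each phrase keeps its own path.
        List<Token> graph = List.of(
            new Token("internet", 1, 1), new Token("wifi", 0, 2),
            new Token("device", 1, 2), new Token("router", 1, 1));
        System.out.println(paths(flattened)); // four mixed combinations
        System.out.println(paths(graph));     // only the two real phrases
    }
}
```

With all PositionLength values equal to 1, the two phrases collapse into
("wifi" OR "internet") followed by ("router" OR "device"). With graph-shaped
positions, only "internet device" and "wifi router" remain as paths.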

A side note is that we don't need to use WORD_SEPARATOR in the synonym
file. SynonymMap.Parser.analyze would tokenize and append the separator for
us.
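
As a rough picture of that last point, here is a plain-Java imitation (not the
Lucene implementation) of joining the words of one synonym entry with the
separator character. The class SynonymEntryJoin and its join method are
hypothetical, and whitespace splitting stands in for whatever analyzer the
parser is configured with:

```java
public class SynonymEntryJoin {
    // Same value as SynonymMap.WORD_SEPARATOR (the NUL character).
    static final char WORD_SEPARATOR = '\u0000';

    /** Splits a phrase on whitespace and joins the words with WORD_SEPARATOR. */
    static String join(String phrase) {
        String[] words = phrase.trim().split("\\s+");
        return String.join(String.valueOf(WORD_SEPARATOR), words);
    }

    public static void main(String[] args) {
        String joined = join("wifi router");
        // Show the separator as a visible marker for printing.
        System.out.println(joined.replace(WORD_SEPARATOR, '|'));
    }
}
```

So a synonym file line can be written with ordinary spaces, and the separator
only appears in the internal representation.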

Regards,
Anh Dung Bui

On Mon, Jan 2, 2023 at 8:07 Mikhail Khludnev <mk...@apache.org> wrote:

> Hello Anh,
> I was intrigued by your question, and I managed to get it working somehow.
> See
>
> https://github.com/mkhludnev/likely/blob/eval-mulyw-syns/src/test/java/org/apache/lucene/playground/TestMultiPulty.java
> Beware, synonym files
>
> https://github.com/mkhludnev/likely/blob/eval-mulyw-syns/src/test/resources/org/apache/lucene/playground/multy-syn.txt
> should use
>
> https://lucene.apache.org/core/8_0_0/analyzers-common/org/apache/lucene/analysis/synonym/SynonymMap.html#WORD_SEPARATOR
> Have a nice hack!
>

Re: Question for SynonymQuery

Posted by Mikhail Khludnev <mk...@apache.org>.
Hello Anh,
I was intrigued by your question, and I managed to get it working somehow.
See
https://github.com/mkhludnev/likely/blob/eval-mulyw-syns/src/test/java/org/apache/lucene/playground/TestMultiPulty.java
Beware: synonym files such as
https://github.com/mkhludnev/likely/blob/eval-mulyw-syns/src/test/resources/org/apache/lucene/playground/multy-syn.txt
should use
https://lucene.apache.org/core/8_0_0/analyzers-common/org/apache/lucene/analysis/synonym/SynonymMap.html#WORD_SEPARATOR
Have a nice hack!

On Thu, Dec 29, 2022 at 10:00 AM Anh Dũng Bùi <du...@gmail.com> wrote:

> Thanks everyone for the insight. I guess I'll use BooleanQuery then.
>
> There is also a caveat I noticed (not sure if it's an issue or not), which
> is slightly different from the mentioned thread. Say I have a multi-word
> synonym such as "wifi router" and "internet device". Using
> SynonymGraphFilter at query time (when building the SynonymMap I already
> escaped the space with a backslash) then produces this TokenStream for a
> query of "wifi router":
>
> "wifi" (PositionIncrement=1,PositionLength=1), "internet"
> (PositionIncrement=0,PositionLength=1), "router"
> (PositionIncrement=1,PositionLength=1), "device"
> (PositionIncrement=0,PositionLength=1)
>
> This has the same effect as if I had 2 synonyms: "wifi"/"internet" and
> "router"/"device". If I convert this to a BooleanQuery it would become
> ("wifi" OR "internet") AND ("router" OR "device"), but what I would like to
> achieve is ("wifi" AND "router") OR ("internet" AND "device")
>
> I'm curious if there would be some workaround for this case
>
> Thanks,
> Anh Dung Bui
>


-- 
Sincerely yours
Mikhail Khludnev
https://t.me/MUST_SEARCH
A caveat: Cyrillic!

Re: Question for SynonymQuery

Posted by Anh Dũng Bùi <du...@gmail.com>.
Thanks everyone for the insight. I guess I'll use BooleanQuery then.

There is also a caveat I noticed (not sure if it's an issue or not), which
is slightly different from the mentioned thread. Say I have a multi-word
synonym such as "wifi router" and "internet device". Using
SynonymGraphFilter at query time (when building the SynonymMap I already
escaped the space with a backslash) then produces this TokenStream for a
query of "wifi router":

"wifi" (PositionIncrement=1,PositionLength=1), "internet"
(PositionIncrement=0,PositionLength=1), "router"
(PositionIncrement=1,PositionLength=1), "device"
(PositionIncrement=0,PositionLength=1)

This has the same effect as if I had 2 synonyms: "wifi"/"internet" and
"router"/"device". If I convert this to a BooleanQuery it would become
("wifi" OR "internet") AND ("router" OR "device"), but what I would like to
achieve is ("wifi" AND "router") OR ("internet" AND "device")

I'm curious if there would be some workaround for this case

Thanks,
Anh Dung Bui


On Thu, Dec 29, 2022 at 4:56 AM Michael Wechner <mi...@wyona.com>
wrote:

> Hi Anh
>
> The following Stack Overflow link might help
>
>
> https://stackoverflow.com/questions/73240494/can-someone-assist-me-with-a-multi-word-synonym-problem-in-lucene
>
> The following thread seems to confirm that escaping the space with a
> backslash does not help
>
> https://lists.apache.org/list?java-user@lucene.apache.org:2022-3
>
> HTH
>
> Michael
>

Re: Question for SynonymQuery

Posted by Michael Wechner <mi...@wyona.com>.
Hi Anh

The following Stack Overflow link might help

https://stackoverflow.com/questions/73240494/can-someone-assist-me-with-a-multi-word-synonym-problem-in-lucene

The following thread seems to confirm that escaping the space with a
backslash does not help

https://lists.apache.org/list?java-user@lucene.apache.org:2022-3

HTH

Michael


On 27.12.22 at 20:22, Anh Dũng Bùi wrote:
> Hi Lucene users,
>
> I recently came across SynonymQuery and found out that it only supports
> single-term synonyms (since it accepts a list of Term which will be
> considered as synonyms). We have some multi-term synonyms like "internet
> device" <-> "wifi router" or "dns" <-> "domain name service". Am I right
> that I need to use something like a BooleanQuery for these cases?
>
> I have 2 other follow-up questions:
> - Does SynonymQuery have any advantage over BooleanQuery? Or is it only
> different in how scores are computed? As I understand SynonymWeight will
> consider all terms as exactly the same while BooleanQuery will favor the
> documents with more matched terms.
> - Is it worth it to support multi-term synonyms in SynonymQuery? My feeling
> is that it's better to just use BooleanQuery in those cases, since to
> support multi-term synonyms it needs to accept a list of Query, which would
> make it behave like a BooleanQuery. Also how scoring works with multi-term
> is another problem.
>
> Thanks & Regards!
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org