You are viewing a plain text version of this content. The canonical link for it is here.

Posted to users@solr.apache.org by Colvin Cowie <co...@gmail.com> on 2022/08/01 08:58:55 UTC

What's the best way to formulate this query?

Hello,

Maybe the answer to this is obvious and I'm missing something, but here
goes:

Suppose I have a field which contains a string of one or more tokens from a
set. The set has about 50 possible values, and the values themselves are
arbitrary (though they are known ahead of time, and could be ordered
alphabetically if it helped). e.g.
doc1: "red"
doc2: "blip red"
doc3: "aardvark blip red"
doc4: "aardvark potato"

I want to query the field for all documents that contain at least one of
the tokens specified in the query *and no tokens that aren't in the query*.
What's the best query for that?

For example, querying for

   - "*red*" should *only* match doc1 above
   - "*blip red*" should match doc1 *and* doc2
   - "*blip red potato*" should also match doc1 and doc 2.
   - "*aardvark blip*" would not match any of the documents since neither
   term appears on its own above, and it would need "*red*" as well to
   match doc3.
   - "*aardvark blip red potato*" would match all of the documents.


Options?

   1. I could formulate the query to include all the required tokens and
   negate all the other tokens from the set, e.g. "*blip red*" would
be "*+(blip
   red) -(aardvark potato....)*", and "*red*" would be "*+(red) -(aardvark
   blip potato...)*"... The size of the set is fixed, so the number of
   terms in the query won't change, just whether they are included or
   excluded. But having to specify all the negations seems inefficient.
   2. I could change the way the data is indexed so that the field is
   concatenated deterministically and tokenized as a single value, and query
   for combinations of terms. e.g. "*blip red*" would be "*blip red
   blip-red*", but with more than a handful of terms the fan-out becomes
   significant, e.g. "*aardvark* *blip red*" becomes "*aardvark blip red
   aardvark-blip aardvark-red blip-red aardvark-blip-red *" and so on, with
   (2^N)-1 combinations.

So option 1 should be fairly constant regardless of the number of terms but
may be wasteful for low numbers of terms, while option 2 generates > 1000
combinations for a query with 10 terms. Is that a problem for Lucene in
practice though? For 20 terms it would create >1 million combinations,
which does sound like a problem, but a query with that many terms may not
be needed.

I'm leaning towards 1 - but is it a bad solution? Is there a better option
I'm missing?

On a related note, does the EnumFieldType enable a more efficient query
than other field types, or does it just provide explicit sorting? i.e.
would a multivalued EFT be better for this?

Thanks,
Colvin

Re: What's the best way to formulate this query?

Posted by Colvin Cowie <co...@gmail.com>.

Thanks for the suggestions Mikhail, cheers

On Tue, 2 Aug 2022 at 07:25, Mikhail Khludnev <mk...@apache.org> wrote:

> Here are a few discussions
> https://issues.apache.org/jira/browse/LUCENE-7148
> and
>
> https://lucene.apache.org/core/8_11_0/sandbox/org/apache/lucene/search/CoveringQuery.html
>
>
> On Tue, Aug 2, 2022 at 1:25 PM Mikhail Khludnev <mk...@apache.org> wrote:
>
> > Hi, Colvin.
> > It reminds me of percolator match logic. I've heard of such plugins for
> > Elastic&Solr.
> > Think about min_should_match in dismax - mm.
> > If one indexes a number of words in a dedicated field, then count every
> > term hit via constant score ^=1, sum hits score, then cut off matches
> with
> > a weak coverage via {!frange} (compare sum of scores to a field with a
> > number of tokens). It was discussed in comments/list years ago. Not sure
> if
> > we moved toward it already.  I also remember that such logic built-in
> > you-know-where
> >
> https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-terms-set-query.html
> > .
> >
> > On Mon, Aug 1, 2022 at 6:59 PM Colvin Cowie <co...@gmail.com>
> > wrote:
> >
> >> Hello,
> >>
> >> Maybe the answer to this is obvious and I'm missing something, but here
> >> goes:
> >>
> >> Suppose I have a field which contains a string of one or more tokens
> from
> >> a
> >> set. The set has about 50 possible values, and the values themselves are
> >> arbitrary (though they are known ahead of time, and could be ordered
> >> alphabetically if it helped). e.g.
> >> doc1: "red"
> >> doc2: "blip red"
> >> doc3: "aardvark blip red"
> >> doc4: "aardvark potato"
> >>
> >> I want to query the field for all documents that contain at least one of
> >> the tokens specified in the query *and no tokens that aren't in the
> >> query*.
> >> What's the best query for that?
> >>
> >> For example, querying for
> >>
> >>    - "*red*" should *only* match doc1 above
> >>    - "*blip red*" should match doc1 *and* doc2
> >>    - "*blip red potato*" should also match doc1 and doc 2.
> >>    - "*aardvark blip*" would not match any of the documents since
> neither
> >>    term appears on its own above, and it would need "*red*" as well to
> >>    match doc3.
> >>    - "*aardvark blip red potato*" would match all of the documents.
> >>
> >>
> >> Options?
> >>
> >>    1. I could formulate the query to include all the required tokens and
> >>    negate all the other tokens from the set, e.g. "*blip red*" would
> >> be "*+(blip
> >>    red) -(aardvark potato....)*", and "*red*" would be "*+(red)
> -(aardvark
> >>    blip potato...)*"... The size of the set is fixed, so the number of
> >>    terms in the query won't change, just whether they are included or
> >>    excluded. But having to specify all the negations seems inefficient.
> >>    2. I could change the way the data is indexed so that the field is
> >>    concatenated deterministically and tokenized as a single value, and
> >> query
> >>    for combinations of terms. e.g. "*blip red*" would be "*blip red
> >>    blip-red*", but with more than a handful of terms the fan-out becomes
> >>    significant, e.g. "*aardvark* *blip red*" becomes "*aardvark blip red
> >>    aardvark-blip aardvark-red blip-red aardvark-blip-red *" and so on,
> >> with
> >>    (2^N)-1 combinations.
> >>
> >> So option 1 should be fairly constant regardless of the number of terms
> >> but
> >> may be wasteful for low numbers of terms, while option 2 generates >
> 1000
> >> combinations for a query with 10 terms. Is that a problem for Lucene in
> >> practice though? For 20 terms it would create >1 million combinations,
> >> which does sound like a problem, but a query with that many terms may
> not
> >> be needed.
> >>
> >> I'm leaning towards 1 - but is it a bad solution? Is there a better
> option
> >> I'm missing?
> >>
> >> On a related note, does the EnumFieldType enable a more efficient query
> >> than other field types, or does it just provide explicit sorting? i.e.
> >> would a multivalued EFT be better for this?
> >>
> >> Thanks,
> >> Colvin
> >>
> >
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> >
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>

Re: What's the best way to formulate this query?

Posted by Mikhail Khludnev <mk...@apache.org>.

Here are a few discussions https://issues.apache.org/jira/browse/LUCENE-7148
and
https://lucene.apache.org/core/8_11_0/sandbox/org/apache/lucene/search/CoveringQuery.html


On Tue, Aug 2, 2022 at 1:25 PM Mikhail Khludnev <mk...@apache.org> wrote:

> Hi, Colvin.
> It reminds me of percolator match logic. I've heard of such plugins for
> Elastic&Solr.
> Think about min_should_match in dismax - mm.
> If one indexes a number of words in a dedicated field, then count every
> term hit via constant score ^=1, sum hits score, then cut off matches with
> a weak coverage via {!frange} (compare sum of scores to a field with a
> number of tokens). It was discussed in comments/list years ago. Not sure if
> we moved toward it already.  I also remember that such logic built-in
> you-know-where
> https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-terms-set-query.html
> .
>
> On Mon, Aug 1, 2022 at 6:59 PM Colvin Cowie <co...@gmail.com>
> wrote:
>
>> Hello,
>>
>> Maybe the answer to this is obvious and I'm missing something, but here
>> goes:
>>
>> Suppose I have a field which contains a string of one or more tokens from
>> a
>> set. The set has about 50 possible values, and the values themselves are
>> arbitrary (though they are known ahead of time, and could be ordered
>> alphabetically if it helped). e.g.
>> doc1: "red"
>> doc2: "blip red"
>> doc3: "aardvark blip red"
>> doc4: "aardvark potato"
>>
>> I want to query the field for all documents that contain at least one of
>> the tokens specified in the query *and no tokens that aren't in the
>> query*.
>> What's the best query for that?
>>
>> For example, querying for
>>
>>    - "*red*" should *only* match doc1 above
>>    - "*blip red*" should match doc1 *and* doc2
>>    - "*blip red potato*" should also match doc1 and doc 2.
>>    - "*aardvark blip*" would not match any of the documents since neither
>>    term appears on its own above, and it would need "*red*" as well to
>>    match doc3.
>>    - "*aardvark blip red potato*" would match all of the documents.
>>
>>
>> Options?
>>
>>    1. I could formulate the query to include all the required tokens and
>>    negate all the other tokens from the set, e.g. "*blip red*" would
>> be "*+(blip
>>    red) -(aardvark potato....)*", and "*red*" would be "*+(red) -(aardvark
>>    blip potato...)*"... The size of the set is fixed, so the number of
>>    terms in the query won't change, just whether they are included or
>>    excluded. But having to specify all the negations seems inefficient.
>>    2. I could change the way the data is indexed so that the field is
>>    concatenated deterministically and tokenized as a single value, and
>> query
>>    for combinations of terms. e.g. "*blip red*" would be "*blip red
>>    blip-red*", but with more than a handful of terms the fan-out becomes
>>    significant, e.g. "*aardvark* *blip red*" becomes "*aardvark blip red
>>    aardvark-blip aardvark-red blip-red aardvark-blip-red *" and so on,
>> with
>>    (2^N)-1 combinations.
>>
>> So option 1 should be fairly constant regardless of the number of terms
>> but
>> may be wasteful for low numbers of terms, while option 2 generates > 1000
>> combinations for a query with 10 terms. Is that a problem for Lucene in
>> practice though? For 20 terms it would create >1 million combinations,
>> which does sound like a problem, but a query with that many terms may not
>> be needed.
>>
>> I'm leaning towards 1 - but is it a bad solution? Is there a better option
>> I'm missing?
>>
>> On a related note, does the EnumFieldType enable a more efficient query
>> than other field types, or does it just provide explicit sorting? i.e.
>> would a multivalued EFT be better for this?
>>
>> Thanks,
>> Colvin
>>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>


-- 
Sincerely yours
Mikhail Khludnev

Re: What's the best way to formulate this query?

Posted by Mikhail Khludnev <mk...@apache.org>.

Hi, Colvin.
It reminds me of percolator match logic. I've heard of such plugins for
Elastic&Solr.
Think about min_should_match in dismax - mm.
If one indexes a number of words in a dedicated field, then count every
term hit via constant score ^=1, sum hits score, then cut off matches with
a weak coverage via {!frange} (compare sum of scores to a field with a
number of tokens). It was discussed in comments/list years ago. Not sure if
we moved toward it already.  I also remember that such logic built-in
you-know-where
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-terms-set-query.html
.

On Mon, Aug 1, 2022 at 6:59 PM Colvin Cowie <co...@gmail.com>
wrote:

> Hello,
>
> Maybe the answer to this is obvious and I'm missing something, but here
> goes:
>
> Suppose I have a field which contains a string of one or more tokens from a
> set. The set has about 50 possible values, and the values themselves are
> arbitrary (though they are known ahead of time, and could be ordered
> alphabetically if it helped). e.g.
> doc1: "red"
> doc2: "blip red"
> doc3: "aardvark blip red"
> doc4: "aardvark potato"
>
> I want to query the field for all documents that contain at least one of
> the tokens specified in the query *and no tokens that aren't in the query*.
> What's the best query for that?
>
> For example, querying for
>
>    - "*red*" should *only* match doc1 above
>    - "*blip red*" should match doc1 *and* doc2
>    - "*blip red potato*" should also match doc1 and doc 2.
>    - "*aardvark blip*" would not match any of the documents since neither
>    term appears on its own above, and it would need "*red*" as well to
>    match doc3.
>    - "*aardvark blip red potato*" would match all of the documents.
>
>
> Options?
>
>    1. I could formulate the query to include all the required tokens and
>    negate all the other tokens from the set, e.g. "*blip red*" would
> be "*+(blip
>    red) -(aardvark potato....)*", and "*red*" would be "*+(red) -(aardvark
>    blip potato...)*"... The size of the set is fixed, so the number of
>    terms in the query won't change, just whether they are included or
>    excluded. But having to specify all the negations seems inefficient.
>    2. I could change the way the data is indexed so that the field is
>    concatenated deterministically and tokenized as a single value, and
> query
>    for combinations of terms. e.g. "*blip red*" would be "*blip red
>    blip-red*", but with more than a handful of terms the fan-out becomes
>    significant, e.g. "*aardvark* *blip red*" becomes "*aardvark blip red
>    aardvark-blip aardvark-red blip-red aardvark-blip-red *" and so on, with
>    (2^N)-1 combinations.
>
> So option 1 should be fairly constant regardless of the number of terms but
> may be wasteful for low numbers of terms, while option 2 generates > 1000
> combinations for a query with 10 terms. Is that a problem for Lucene in
> practice though? For 20 terms it would create >1 million combinations,
> which does sound like a problem, but a query with that many terms may not
> be needed.
>
> I'm leaning towards 1 - but is it a bad solution? Is there a better option
> I'm missing?
>
> On a related note, does the EnumFieldType enable a more efficient query
> than other field types, or does it just provide explicit sorting? i.e.
> would a multivalued EFT be better for this?
>
> Thanks,
> Colvin
>


-- 
Sincerely yours
Mikhail Khludnev

Re: What's the best way to formulate this query?

Posted by Colvin Cowie <co...@gmail.com>.

That's an interesting suggestion Radu, thank you

On Mon, 1 Aug 2022 at 10:24, Radu Gheorghe <ra...@sematext.com>
wrote:

> Hi Colvin,
>
> You wouldn't normally query with more than e.g. 1K terms at once, because
> the query can get expensive.
>
> Here's a crazy idea: map words to numbers, sorted alphabetically. For
> example:
>
> aardvark - 1
> blip - 2
> potato - 3
> red - 4
>
> When you formulate the query, you do the same translation, sort the terms,
> then search for something like:
> - any of the words
> - negate any ranges between them
>
> For example, if I'm searching for "red potato", then the query will be
> something like:
>
> (3 OR 4) -{* TO 3} -{3 TO 4} -{4 TO *}
>
> Note that I added the 3 to 4 range (exclusive), even though it doesn't make
> sense, because the naive implementation wouldn't check if some numbers are
> consecutive and remove ranges that make no sense. That would be an
> optimization.
>
> Best regards,
> Radu
> --
> Elasticsearch/OpenSearch & Solr Consulting, Production Support & Training
> Sematext Cloud - Full Stack Observability
> https://sematext.com/ <http://sematext.com/>
>
>
> On Mon, Aug 1, 2022 at 11:59 AM Colvin Cowie <co...@gmail.com>
> wrote:
>
> > Hello,
> >
> > Maybe the answer to this is obvious and I'm missing something, but here
> > goes:
> >
> > Suppose I have a field which contains a string of one or more tokens
> from a
> > set. The set has about 50 possible values, and the values themselves are
> > arbitrary (though they are known ahead of time, and could be ordered
> > alphabetically if it helped). e.g.
> > doc1: "red"
> > doc2: "blip red"
> > doc3: "aardvark blip red"
> > doc4: "aardvark potato"
> >
> > I want to query the field for all documents that contain at least one of
> > the tokens specified in the query *and no tokens that aren't in the
> query*.
> > What's the best query for that?
> >
> > For example, querying for
> >
> >    - "*red*" should *only* match doc1 above
> >    - "*blip red*" should match doc1 *and* doc2
> >    - "*blip red potato*" should also match doc1 and doc 2.
> >    - "*aardvark blip*" would not match any of the documents since neither
> >    term appears on its own above, and it would need "*red*" as well to
> >    match doc3.
> >    - "*aardvark blip red potato*" would match all of the documents.
> >
> >
> > Options?
> >
> >    1. I could formulate the query to include all the required tokens and
> >    negate all the other tokens from the set, e.g. "*blip red*" would
> > be "*+(blip
> >    red) -(aardvark potato....)*", and "*red*" would be "*+(red)
> -(aardvark
> >    blip potato...)*"... The size of the set is fixed, so the number of
> >    terms in the query won't change, just whether they are included or
> >    excluded. But having to specify all the negations seems inefficient.
> >    2. I could change the way the data is indexed so that the field is
> >    concatenated deterministically and tokenized as a single value, and
> > query
> >    for combinations of terms. e.g. "*blip red*" would be "*blip red
> >    blip-red*", but with more than a handful of terms the fan-out becomes
> >    significant, e.g. "*aardvark* *blip red*" becomes "*aardvark blip red
> >    aardvark-blip aardvark-red blip-red aardvark-blip-red *" and so on,
> with
> >    (2^N)-1 combinations.
> >
> > So option 1 should be fairly constant regardless of the number of terms
> but
> > may be wasteful for low numbers of terms, while option 2 generates > 1000
> > combinations for a query with 10 terms. Is that a problem for Lucene in
> > practice though? For 20 terms it would create >1 million combinations,
> > which does sound like a problem, but a query with that many terms may not
> > be needed.
> >
> > I'm leaning towards 1 - but is it a bad solution? Is there a better
> option
> > I'm missing?
> >
> > On a related note, does the EnumFieldType enable a more efficient query
> > than other field types, or does it just provide explicit sorting? i.e.
> > would a multivalued EFT be better for this?
> >
> > Thanks,
> > Colvin
> >
>

Re: What's the best way to formulate this query?

Posted by Radu Gheorghe <ra...@sematext.com>.

Hi Colvin,

You wouldn't normally query with more than e.g. 1K terms at once, because
the query can get expensive.

Here's a crazy idea: map words to numbers, sorted alphabetically. For
example:

aardvark - 1
blip - 2
potato - 3
red - 4

When you formulate the query, you do the same translation, sort the terms,
then search for something like:
- any of the words
- negate any ranges between them

For example, if I'm searching for "red potato", then the query will be
something like:

(3 OR 4) -{* TO 3} -{3 TO 4} -{4 TO *}

Note that I added the 3 to 4 range (exclusive), even though it doesn't make
sense, because the naive implementation wouldn't check if some numbers are
consecutive and remove ranges that make no sense. That would be an
optimization.

Best regards,
Radu
--
Elasticsearch/OpenSearch & Solr Consulting, Production Support & Training
Sematext Cloud - Full Stack Observability
https://sematext.com/ <http://sematext.com/>


On Mon, Aug 1, 2022 at 11:59 AM Colvin Cowie <co...@gmail.com>
wrote:

> Hello,
>
> Maybe the answer to this is obvious and I'm missing something, but here
> goes:
>
> Suppose I have a field which contains a string of one or more tokens from a
> set. The set has about 50 possible values, and the values themselves are
> arbitrary (though they are known ahead of time, and could be ordered
> alphabetically if it helped). e.g.
> doc1: "red"
> doc2: "blip red"
> doc3: "aardvark blip red"
> doc4: "aardvark potato"
>
> I want to query the field for all documents that contain at least one of
> the tokens specified in the query *and no tokens that aren't in the query*.
> What's the best query for that?
>
> For example, querying for
>
>    - "*red*" should *only* match doc1 above
>    - "*blip red*" should match doc1 *and* doc2
>    - "*blip red potato*" should also match doc1 and doc 2.
>    - "*aardvark blip*" would not match any of the documents since neither
>    term appears on its own above, and it would need "*red*" as well to
>    match doc3.
>    - "*aardvark blip red potato*" would match all of the documents.
>
>
> Options?
>
>    1. I could formulate the query to include all the required tokens and
>    negate all the other tokens from the set, e.g. "*blip red*" would
> be "*+(blip
>    red) -(aardvark potato....)*", and "*red*" would be "*+(red) -(aardvark
>    blip potato...)*"... The size of the set is fixed, so the number of
>    terms in the query won't change, just whether they are included or
>    excluded. But having to specify all the negations seems inefficient.
>    2. I could change the way the data is indexed so that the field is
>    concatenated deterministically and tokenized as a single value, and
> query
>    for combinations of terms. e.g. "*blip red*" would be "*blip red
>    blip-red*", but with more than a handful of terms the fan-out becomes
>    significant, e.g. "*aardvark* *blip red*" becomes "*aardvark blip red
>    aardvark-blip aardvark-red blip-red aardvark-blip-red *" and so on, with
>    (2^N)-1 combinations.
>
> So option 1 should be fairly constant regardless of the number of terms but
> may be wasteful for low numbers of terms, while option 2 generates > 1000
> combinations for a query with 10 terms. Is that a problem for Lucene in
> practice though? For 20 terms it would create >1 million combinations,
> which does sound like a problem, but a query with that many terms may not
> be needed.
>
> I'm leaning towards 1 - but is it a bad solution? Is there a better option
> I'm missing?
>
> On a related note, does the EnumFieldType enable a more efficient query
> than other field types, or does it just provide explicit sorting? i.e.
> would a multivalued EFT be better for this?
>
> Thanks,
> Colvin
>