Posted to solr-user@lucene.apache.org by Vincenzo D'Amore <v....@gmail.com> on 2020/04/21 10:15:26 UTC

Payloads

Hi All,


I'm still struggling with payloads. To understand my problem better, I've
created a minimal reproducible example.

Basically, I have a multivalued field with payloads, using this schema
configuration:



  <fieldType name="payloads" stored="true" indexed="true" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="float" delimiter=":"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    </analyzer>
  </fieldType>

  <field name="multipayload" type="payloads" indexed="true" stored="true" multiValued="true" />



The field is populated with data like this:


  <doc>
    <field name="id">1</field>
    <field name="multipayload">A:1 B:2 C:3 D:4</field>
    <field name="multipayload">A:0.1 B:0.2 E:5 F:6</field>
    <field name="multipayload">E:0.5 F:0.6</field>
  </doc>



I want to be able to query the multipayload field with any number of
tokens, in any order, and get as the score the SUM of the payload values of
those tokens, counting only the rows (values) of the multipayload field
that contain all the tokens of the query (basically an AND condition
applied to each row). For example:



   1. I run the query with B F A as clauses. I expect a match on the
   second row of the doc with id=1, and so a score of 0.2 + 0.1 + 6 = 6.3
   2. I run the query with F E as clauses. I expect a match on the second
   and the third row of the doc with id=1, and thus a score of (6 + 5)
   + (0.6 + 0.5) = 12.1
   3. I run the query with A F as clauses. I expect no match, and thus a
   score of 0.0
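
To make the rule concrete, here is a minimal Python sketch of the scoring I have in mind, applied to the first two examples. It is only an illustration of the desired behaviour, not Solr code; the names parse_row and expected_score are hypothetical.

```python
def parse_row(row):
    """Turn 'A:1 B:2' into {'A': 1.0, 'B': 2.0}."""
    payloads = {}
    for token in row.split():
        term, _, value = token.partition(":")
        payloads[term] = float(value)
    return payloads


def expected_score(rows, query):
    """Sum the payloads of the query terms over every row that contains all of them."""
    terms = query.split()
    total = 0.0
    for row in rows:
        payloads = parse_row(row)
        # AND condition applied per row: skip rows missing any query term.
        if all(term in payloads for term in terms):
            total += sum(payloads[term] for term in terms)
    return total


# The three values of the multipayload field for the doc with id=1.
ROWS = ["A:1 B:2 C:3 D:4", "A:0.1 B:0.2 E:5 F:6", "E:0.5 F:0.6"]

print(round(expected_score(ROWS, "B F A"), 4))  # 6.3
print(round(expected_score(ROWS, "F E"), 4))    # 12.1
```

The order of the query terms does not matter here, which is the main difference from a phrase match.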



I tried to use a query like this:



http://localhost:8983/solr/test/select?debugQuery=true&q={!payload_score
f=multipayload v=$pl func=sum includeSpanScore=false
operator=phrase}&pl=__MY_CLAUSES__
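
For anyone who wants to reproduce this, the request can be built with a small Python helper so that the local params and the clauses are URL-encoded correctly. This is just a sketch using the standard library: the host, core name and parameters are the ones from the URL above, while the function name build_payload_query_url is hypothetical.

```python
from urllib.parse import urlencode


def build_payload_query_url(clauses, base="http://localhost:8983/solr/test/select"):
    """Build the select URL for the payload_score query shown above."""
    params = {
        "debugQuery": "true",
        # Local params copied from the query above; $pl is dereferenced by Solr.
        "q": "{!payload_score f=multipayload v=$pl func=sum "
             "includeSpanScore=false operator=phrase}",
        "pl": clauses,
    }
    # urlencode percent-encodes the braces and '$' and turns spaces into '+'.
    return base + "?" + urlencode(params)


print(build_payload_query_url("B F A"))
```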



The results I obtain are:



   1. B F A: No results
   2. F E: 6.5 (from a match on row #2: 6 and on row #3: 0.5), a result
   of the span query, I presume
   3. E F: 12.1 (as expected, but only because “by chance” the sequence
   matches as a phrase on rows #2 and #3)
   4. A F: No results (as expected)
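
One possible reading of these numbers can be sketched in a few lines of Python. It assumes that the three values are indexed as one continuous run of positions (no positionIncrementGap is set on the field type above, so I am assuming a gap of 0) and that the phrase operator sums the payloads of every exact, in-order adjacent match. Both points are assumptions rather than something I verified in the index, and the names flatten and phrase_score are hypothetical.

```python
def flatten(rows, gap=0):
    """Return (position, term, payload) triples for all rows of one document."""
    stream = []
    position = -1
    for i, row in enumerate(rows):
        if i > 0:
            position += gap  # extra positions between two values (assumed 0 here)
        for token in row.split():
            term, _, value = token.partition(":")
            position += 1
            stream.append((position, term, float(value)))
    return stream


def phrase_score(rows, query, gap=0):
    """Sum payloads over every in-order, adjacent match; None if nothing matches."""
    terms = query.split()
    stream = flatten(rows, gap)
    by_pos = {pos: (term, payload) for pos, term, payload in stream}
    total = None
    for pos, _, _ in stream:
        window = [by_pos.get(pos + k) for k in range(len(terms))]
        if all(w is not None and w[0] == t for w, t in zip(window, terms)):
            total = (total or 0.0) + sum(w[1] for w in window)
    return total


ROWS = ["A:1 B:2 C:3 D:4", "A:0.1 B:0.2 E:5 F:6", "E:0.5 F:0.6"]

for q in ("B F A", "F E", "E F", "A F"):
    score = phrase_score(ROWS, q)
    print(q, "->", None if score is None else round(score, 4))
```

Under these assumptions the 6.5 for F E would come from the F that ends row #2 being immediately followed by the E that starts row #3, i.e. a phrase match across the boundary between two values.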



Looking into the Solr payloads code
(https://github.com/apache/lucene-solr/blob/1d85cd783863f75cea133fb9c452302214165a4d/solr/core/src/java/org/apache/solr/util/PayloadUtils.java#L139),
I see that:



   - There are only two options, OR and phrase, while I think that my case
   needs an AND operator
   - The phrase option has a hardwired distance of 0 for the span query:
   query = new SpanNearQuery(terms.toArray(new SpanTermQuery[terms.size()]),
   0, true);



I think that a phrase query with a huge distance (e.g. 100) could behave as
an AND query, but I’m just guessing. In any case, to suit my case I think
that in general I’d need an AND option, or the possibility to define the
span behaviour of the phrase query in a more flexible way.
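
To test this guess on paper, here is a simplified Python model of an unordered "near" match: the query matches if one occurrence of every term fits in a window with at most `slop` spare positions. It is a sketch under several assumptions: the matching is unordered (the Solr code quoted above passes true for in-order), query terms are distinct, and `gap` stands for the extra positions left between two values of the field. The names positions and near_match are hypothetical.

```python
from itertools import product


def positions(rows, gap):
    """Map each term to the positions it occupies, with `gap` extra positions between rows."""
    index = {}
    position = -1
    for i, row in enumerate(rows):
        if i > 0:
            position += gap
        for token in row.split():
            position += 1
            index.setdefault(token.partition(":")[0], []).append(position)
    return index


def near_match(rows, query, slop, gap):
    """True if one occurrence of every query term fits within the allowed slop."""
    terms = query.split()
    index = positions(rows, gap)
    if any(term not in index for term in terms):
        return False
    for combo in product(*(index[term] for term in terms)):
        # Spare positions inside the window that are not covered by query terms.
        if max(combo) - min(combo) + 1 - len(terms) <= slop:
            return True
    return False


ROWS = ["A:1 B:2 C:3 D:4", "A:0.1 B:0.2 E:5 F:6", "E:0.5 F:0.6"]

print(near_match(ROWS, "B F A", slop=100, gap=0))   # True: all three terms are in row #2
print(near_match(ROWS, "C E", slop=100, gap=0))     # True: C is in row #1, E in row #2
print(near_match(ROWS, "C E", slop=100, gap=1000))  # False: the rows are too far apart
```

In this model a large distance only behaves like a per-row AND when the gap between values is larger than the distance; otherwise terms from different rows can match together.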



Even if my case is quite specific, I think that the current implementation
of the phrase option is also not well suited to the more general case of
weights associated with part-of-speech classes. That is, in my opinion, a
more classic usage of payloads, where for example I want to deboost
adjectives relative to nouns:



   - a *race horse* is a *horse* that runs in races
   - a *horse race* is a *race* for horses



In general, it seems to me that the absence of an AND option and the
hardwired phrase distance of 0 are quite limiting.


Thanks in advance for your time,

Vincenzo

-- 
Vincenzo D'Amore