Posted to dev@lucene.apache.org by Michael Sokolov <ms...@gmail.com> on 2022/01/13 14:41:57 UTC

Re: Payloads for each term

Oh interesting! I did not know about this FeatureField. (The link was to
the old repo, which is now gone;
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/document/FeatureField.java
worked for me.)

On Wed, Nov 11, 2020 at 4:37 PM Mayya Sharipova
<ma...@elastic.co.invalid> wrote:
>
> For sparse vectors, we found that Lucene's FeatureField could also be useful. It stores features as terms and feature values as term frequencies, and provides several convenient functions to calculate scores based on feature values.
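A minimal sketch of the FeatureField approach described above, assuming recent Lucene Java APIs; the field name "features" and the feature names are invented for illustration, and the weight/pivot values are arbitrary:

```java
// Hedged sketch: store sparse per-document features as FeatureFields.
// The feature value is encoded as the term frequency of the feature "term".
import org.apache.lucene.document.Document;
import org.apache.lucene.document.FeatureField;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

public class FeatureFieldSketch {
  static Document buildDoc() {
    Document doc = new Document();
    // One FeatureField per non-zero dimension (names are examples only).
    doc.add(new FeatureField("features", "pagerank", 0.8f));
    doc.add(new FeatureField("features", "freshness", 0.3f));
    return doc;
  }

  static Query rescoreQuery(Query mainQuery) {
    // Combine the main query with a saturation function over one feature;
    // weight=1.0 and pivot=0.5 are placeholder values.
    return new BooleanQuery.Builder()
        .add(mainQuery, BooleanClause.Occur.MUST)
        .add(FeatureField.newSaturationQuery("features", "pagerank", 1.0f, 0.5f),
             BooleanClause.Occur.SHOULD)
        .build();
  }
}
```

FeatureField also offers newLinearQuery and newLogQuery for other score shapes.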
>
> On Fri, Nov 6, 2020 at 11:16 AM Michael McCandless <lu...@mikemccandless.com> wrote:
>>
>> Also, be aware that recent Lucene versions enabled compression for BinaryDocValues fields, which might hurt performance of your second solution.
>>
>> This compression is not yet something you can easily turn off, but there are ongoing discussions/PRs about how to make it more easily configurable for applications that really care more about search CPU cost over index size for BinaryDocValues fields: https://issues.apache.org/jira/browse/LUCENE-9378
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Fri, Nov 6, 2020 at 10:21 AM Michael McCandless <lu...@mikemccandless.com> wrote:
>>>
>>> In addition to payloads having kind of high overhead (they slow down indexing, do not compress very well I think, and slow down search since you must pull positions), they are also sort of a forced fit for your use case, right?  Because a payload in Lucene is per-term-position, whereas you really need this vector per-term (irrespective of the positions where that term occurs in each document)?
>>>
>>> Your second solution is an intriguing one.  So you would use Lucene's custom term frequencies to store indices into that per-document map encoded into a BinaryDocValues field?  During indexing I guess you would need a TokenFilter that hands out these indices in order (0, 1, 2, ...) based on the unique terms it sees, and after all tokens are done, it exports a byte[] serialized map?  Hmm, except term frequency 0 is not allowed, so you'd need to + 1 to all indices.
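One way the TokenFilter described above might look, as a hedged sketch; the class name is invented, and this assumes a field indexed with frequencies and a Similarity that accepts custom term frequencies:

```java
// Hedged sketch: a TokenFilter that emits each unique term once, with its
// term frequency set to (index-in-map + 1), since frequency 0 is not allowed.
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TermFrequencyAttribute;

public final class FeaturePointerFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final TermFrequencyAttribute freqAtt = addAttribute(TermFrequencyAttribute.class);
  // Insertion order hands out pointers 0, 1, 2, ... per unique term.
  private final Map<String, Integer> pointers = new LinkedHashMap<>();

  public FeaturePointerFilter(TokenStream in) {
    super(in);
  }

  @Override
  public boolean incrementToken() throws IOException {
    while (input.incrementToken()) {
      String term = termAtt.toString();
      if (pointers.containsKey(term)) {
        continue; // drop repeats so frequencies do not accumulate
      }
      int pointer = pointers.size();
      pointers.put(term, pointer);
      freqAtt.setTermFrequency(pointer + 1); // +1: term frequency 0 is illegal
      return true;
    }
    return false;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    pointers.clear();
  }

  /** After the stream is exhausted, this map can be serialized into the blob. */
  public Map<String, Integer> pointers() {
    return pointers;
  }
}
```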
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>>
>>> On Mon, Oct 26, 2020 at 6:16 AM Bruno Roustant <br...@gmail.com> wrote:
>>>>
>>>> Hi Ankur,
>>>> Indeed payloads are the standard way to solve this problem. For light queries retrieving a few top-N results, that should be efficient. For multi-term queries, it could become costly if you need to access the payloads of too many terms.
>>>> Also, there is an experimental PostingsFormat called SharedTermsUniformSplit (class named STUniformSplitPostingsFormat) that would allow you to effectively share the overlapping terms in the index while keeping your ~50 fields. This would solve the index bloat issue, but would not fully solve the seeks issue. You might want to benchmark this approach too.
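A hedged sketch of wiring STUniformSplitPostingsFormat in for the feature fields via a per-field codec; the delegate codec class depends on your Lucene version (Lucene87Codec is used here as an example), and the "feature_" field-name convention is invented:

```java
// Hedged sketch: route selected fields to the experimental shared-terms
// UniformSplit postings format; other fields keep the default format.
import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.lucene87.Lucene87Codec;
import org.apache.lucene.codecs.uniformsplit.sharedterms.STUniformSplitPostingsFormat;
import org.apache.lucene.index.IndexWriterConfig;

public class SharedTermsCodecSketch {
  static IndexWriterConfig config() {
    PostingsFormat sharedTerms = new STUniformSplitPostingsFormat();
    IndexWriterConfig iwc = new IndexWriterConfig();
    iwc.setCodec(new Lucene87Codec() {
      @Override
      public PostingsFormat getPostingsFormatForField(String field) {
        // Invented naming convention for the ~50 feature fields.
        if (field.startsWith("feature_")) {
          return sharedTerms;
        }
        return super.getPostingsFormatForField(field);
      }
    });
    return iwc;
  }
}
```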
>>>>
>>>> Bruno
>>>>
>>>> Le ven. 23 oct. 2020 à 02:48, Ankur Goel <an...@gmail.com> a écrit :
>>>>>
>>>>> Hi Lucene Devs,
>>>>>            I have a need to store a sparse feature vector on a per-term basis. The total number of possible dimensions is small (~50) and known at indexing time. The feature values will be used in scoring along with corpus statistics. It looks like payloads were created for exactly this purpose, but some workaround is needed to minimize the performance penalty, as mentioned on the wiki.
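For reference, the payload route usually looks something like this hedged sketch: a TokenFilter that attaches each term's feature vector as payload bytes (class and map names are invented):

```java
// Hedged sketch: attach a per-term feature vector as a payload on each token.
// Note payloads are stored per position, so repeated terms repeat the bytes.
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Map;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.BytesRef;

public final class FeaturePayloadFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);
  private final Map<String, float[]> vectors; // term -> feature values

  public FeaturePayloadFilter(TokenStream in, Map<String, float[]> vectors) {
    super(in);
    this.vectors = vectors;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    float[] v = vectors.get(termAtt.toString());
    if (v != null) {
      ByteBuffer buf = ByteBuffer.allocate(4 * v.length);
      for (float f : v) {
        buf.putFloat(f);
      }
      payloadAtt.setPayload(new BytesRef(buf.array()));
    }
    return true;
  }
}
```

At search time the bytes come back via PostingsEnum.getPayload() when the postings are pulled with the PAYLOADS flag, which is the cost the wiki warns about.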
>>>>>
>>>>> An alternative is to override the term frequency to be a pointer into a Map<pointer, Feature_Vector> serialized and stored in BinaryDocValues. At query time, the matching docId will be used to advance the doc-values iterator to the starting offset of this map. The term frequency will then be used as the key to look up the Feature_Vector in the serialized map. That's my current plan, but I haven't benchmarked it.
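The serialization half of this plan can be sketched in plain Java; the blob layout below ([int vectorCount][per vector: int length, then the floats]) and the class name are invented for illustration, with the pointer being (term frequency - 1) per Mike's later note that frequency 0 is not allowed:

```java
// Hedged sketch: encode/decode the per-document BinaryDocValues blob that
// maps a pointer (term frequency - 1) to its sparse feature vector.
import java.nio.ByteBuffer;
import java.util.List;

public class FeatureBlobCodec {
  public static byte[] encode(List<float[]> vectors) {
    int size = 4; // vector count
    for (float[] v : vectors) {
      size += 4 + 4 * v.length; // length prefix + floats
    }
    ByteBuffer buf = ByteBuffer.allocate(size);
    buf.putInt(vectors.size());
    for (float[] v : vectors) {
      buf.putInt(v.length);
      for (float f : v) {
        buf.putFloat(f);
      }
    }
    return buf.array();
  }

  /** Decode the vector at the given pointer, i.e. (term frequency - 1). */
  public static float[] decode(byte[] blob, int pointer) {
    ByteBuffer buf = ByteBuffer.wrap(blob);
    int count = buf.getInt();
    if (pointer < 0 || pointer >= count) {
      throw new IllegalArgumentException("bad pointer: " + pointer);
    }
    for (int i = 0; i < pointer; i++) {
      int skip = buf.getInt();               // length of this vector
      buf.position(buf.position() + 4 * skip); // skip its floats
    }
    int len = buf.getInt();
    float[] v = new float[len];
    for (int i = 0; i < len; i++) {
      v[i] = buf.getFloat();
    }
    return v;
  }
}
```

A length-prefixed layout like this makes the lookup a linear skip; a fixed-width offset table at the front would make it O(1) at the cost of a few extra bytes.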
>>>>>
>>>>> The problem I am trying to solve is reducing index bloat and eliminating unnecessary seeks: currently these ~50 dimensions are stored as separate fields in the index with very high term overlap, and Lucene does not share the terms dictionary across different fields. Sharing it could itself be a new feature for Lucene, but I imagine it would require a lot of work.
>>>>>
>>>>> Any ideas are welcome :-)
>>>>>
>>>>> Thanks
>>>>> -Ankur

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org