Posted to dev@lucene.apache.org by "Uwe Schindler (JIRA)" <ji...@apache.org> on 2017/05/27 13:57:04 UTC

[jira] [Comment Edited] (LUCENE-7854) Indexing custom term frequencies

    [ https://issues.apache.org/jira/browse/LUCENE-7854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16027435#comment-16027435 ] 

Uwe Schindler edited comment on LUCENE-7854 at 5/27/17 1:56 PM:
----------------------------------------------------------------

Patch looks good, although I am not sure if we really should check for the existence of the attribute. What happens if somebody has configured an analyzer with a TermFreqAttribute and uses it for all fields? If he/she disables freqs for one field, this will break indexing. We also don't throw an exception if one has a TokenStream with offsets and positions and we don't index them! :-)

IMHO: I'd always add the termfreq attribute when creating the inverter. As the default is "1" anyway, if there is no TokenFilter that modifies the attribute, everything works as it did before. If we have a filter that changes the attribute, its value is used. Quite simple, and less if/then/else logic.
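
To make the suggestion concrete, here is a minimal sketch in plain Java, not Lucene code. The names FreqAttr and InverterSketch, and the "term|N" token syntax that stands in for a filter setting the frequency, are hypothetical and exist only for illustration. The point is that the attribute is always present with a default of 1, so the inverter needs no existence check, and a field with freqs disabled simply ignores the value instead of failing.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the term frequency attribute; this is not Lucene's class.
final class FreqAttr {
    private int termFreq = 1; // default: every token counts once

    void set(int freq) {
        if (freq < 1) {
            throw new IllegalArgumentException("term frequency must be >= 1, got " + freq);
        }
        termFreq = freq;
    }

    int get() { return termFreq; }

    void clear() { termFreq = 1; } // reset to the default before each token
}

final class InverterSketch {
    // Builds term -> stored frequency for one document.
    // A token written as "term|N" simulates a filter that set the attribute to N (assumption).
    static Map<String, Integer> invert(List<String> tokens, boolean indexFreqs) {
        FreqAttr attr = new FreqAttr(); // always added, so no "is it present?" branch is needed
        Map<String, Integer> postings = new LinkedHashMap<>();
        for (String token : tokens) {
            attr.clear();
            String term = token;
            int bar = token.indexOf('|');
            if (bar >= 0) {
                term = token.substring(0, bar);
                attr.set(Integer.parseInt(token.substring(bar + 1)));
            }
            if (indexFreqs) {
                // Untouched attribute adds 1, which is the old counting behaviour.
                postings.merge(term, attr.get(), Integer::sum);
            } else {
                // Freqs disabled for this field: the value is ignored, nothing is thrown.
                postings.putIfAbsent(term, 1);
            }
        }
        return postings;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("a", "b|5", "a");
        System.out.println(invert(tokens, true));  // {a=2, b=5}
        System.out.println(invert(tokens, false)); // {a=1, b=1}
    }
}
```

With this shape the same analyzer can be shared by all fields: fields that index freqs pick up the custom value, and fields that do not are unaffected.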


was (Author: thetaphi):
Patch looks good, although I am not sure if we really should check for existence of the attribute. What happens if somebody has configured an analyzer with a TermFreqAttribute and uses it for all fields. If he disables freqs for one field this will break indexing. We also don't throw ex, if one has a tokenstream with offsets and positions and we don't index them!!! :-)

IMHO: I'd always add the termfreq attribute when creating the inverter. As the default is "1" anyways, if there is no tokenfilter that modifies the attribute all works as it worked before. If we have a filter that changes the attribute it is used. Quite simple and less if/then/else logic.

> Indexing custom term frequencies
> --------------------------------
>
>                 Key: LUCENE-7854
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7854
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: master (7.0)
>
>         Attachments: LUCENE-7854.patch, LUCENE-7854.patch
>
>
> When you index a field with {{IndexOptions.DOCS_AND_FREQS}}, Lucene will store just the docID and term frequency (how many times that term occurred in that document) for all documents that have a given term.
> We compute that term frequency by counting how many times a given token appeared in the field during analysis.
> But it can be useful, in expert use cases, to customize what Lucene stores as the term frequency, e.g. to hold custom scoring signals that are a function of term and document (this is my use case).  Users have also asked for this before, e.g. see https://stackoverflow.com/questions/26605090/lucene-overwrite-term-frequency-at-index-time.
> One way to do this today is to stuff your custom data into a {{byte[]}} payload.  But that's quite inefficient, forcing you to index positions, and pay the overhead of retrieving payloads at search time.
> Another approach is "token stuffing": just enumerate the same token N times where N is the custom number you want to store, but that's also inefficient when N gets high.
> I think we can make this simple to do in Lucene.  I have a working version, using my own custom indexing chain, but the required changes are quite simple so I think we can add it to Lucene's default indexing chain?
> I created a new token attribute, {{TermDocFrequencyAttribute}}, and tweaked the indexing chain to use that attribute's value as the term frequency if it's present, and if the index options are {{DOCS_AND_FREQS}} for that field.
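
The cost of the "token stuffing" workaround described above can be illustrated with a small plain-Java sketch. It does not use Lucene; TokenStuffingSketch and its methods are hypothetical names. It only shows that storing a frequency of N by counting requires pushing N identical tokens through analysis, whereas a frequency attribute would carry the same number on a single token.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class TokenStuffingSketch {
    // Token stuffing: emit the same term n times so that counting yields n.
    static List<String> stuff(String term, int n) {
        return new ArrayList<>(Collections.nCopies(n, term));
    }

    // Frequency as computed by counting occurrences of the term in the token list.
    static int countedFreq(List<String> tokens, String term) {
        return Collections.frequency(tokens, term);
    }

    public static void main(String[] args) {
        List<String> tokens = stuff("signal", 1000);
        // 1000 tokens are produced just to store the single number 1000.
        System.out.println(tokens.size() + " tokens -> freq " + countedFreq(tokens, "signal"));
    }
}
```

The work grows linearly with N, which is why the approach becomes inefficient for large values.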


