Posted to dev@lucene.apache.org by Edward Ribeiro <ed...@gmail.com> on 2019/11/28 21:51:24 UTC

DelimitedTermFrequencyTokenFilter

Hi,

Does anyone have an example of DelimitedTermFrequencyTokenFilter usage that
they could share?

I have been banging my head against the wall trying to make it work (
https://gist.github.com/eribeiro/ebb24feb3fd84931b7c288b9b716ed49 ) and I
don't know what I am doing wrong.

I am creating a custom analyzer that uses a WhitespaceTokenizer to split a
string like "a|10 b|2 c|9" and passes the tokens to
DelimitedTermFrequencyTokenFilter. I am adding the text to the document
through a custom field so that it is indexed without positions and offsets.

The debugger shows that the string is parsed correctly by
DelimitedTermFrequencyTokenFilter and that its char and term attributes are
set up properly. But the term frequency of each term is 1 when I inspect the
index via Luke. Curiously, the output of my snippet shows the correct total
term frequency, as seen below:

field="text",maxDoc=1,docCount=1,sumTotalTermFreq=123,sumDocFreq=3
a|10 b|23 c|90
SumTotalTermFreq: 123
SumDocFreq: 3

Cheers,
Edward
PS: I am a Lucene newbie so it may be something quite stupid.

Re: DelimitedTermFrequencyTokenFilter

Posted by Edward Ribeiro <ed...@gmail.com>.

Oh, silly of me. :)

Thanks,
Edward

On Fri, Nov 29, 2019 at 07:13, Alan Woodward <ro...@gmail.com>
wrote:

> I think it’s working fine - Luke is showing you the docFreq of the term,
> which will be 1 as it only appears in a single document.

Re: DelimitedTermFrequencyTokenFilter

Posted by Alan Woodward <ro...@gmail.com>.

I think it’s working fine - Luke is showing you the docFreq of the term, which will be 1 as it only appears in a single document.
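
To make the distinction concrete, here is a minimal plain-Java sketch that recomputes the numbers from the original message. It does not use Lucene at all: the class TermStatsSketch and its methods are made-up names for illustration, and the parsing only approximates what the filter does with "term|freq" tokens. It assumes a single document containing "a|10 b|23 c|90".

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not Lucene code.
public class TermStatsSketch {

    // Parses whitespace-separated "term|freq" tokens into term -> frequency.
    // A token without '|' gets frequency 1. Summing repeated terms is this
    // sketch's choice, not a statement about Lucene's behaviour.
    public static Map<String, Integer> parseDoc(String text) {
        Map<String, Integer> freqs = new LinkedHashMap<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            int bar = token.lastIndexOf('|');
            String term = bar < 0 ? token : token.substring(0, bar);
            int freq = bar < 0 ? 1 : Integer.parseInt(token.substring(bar + 1));
            freqs.merge(term, freq, Integer::sum);
        }
        return freqs;
    }

    // docFreq: number of documents that contain the term at least once.
    public static int docFreq(List<Map<String, Integer>> docs, String term) {
        int count = 0;
        for (Map<String, Integer> doc : docs) {
            if (doc.containsKey(term)) {
                count++;
            }
        }
        return count;
    }

    // totalTermFreq: the term's frequency summed over all documents.
    public static long totalTermFreq(List<Map<String, Integer>> docs, String term) {
        long total = 0;
        for (Map<String, Integer> doc : docs) {
            total += doc.getOrDefault(term, 0);
        }
        return total;
    }

    // sumTotalTermFreq: totalTermFreq summed over every term in the field.
    public static long sumTotalTermFreq(List<Map<String, Integer>> docs) {
        long total = 0;
        for (Map<String, Integer> doc : docs) {
            for (int freq : doc.values()) {
                total += freq;
            }
        }
        return total;
    }

    // sumDocFreq: docFreq summed over every term, i.e. the number of
    // distinct (term, document) pairs.
    public static long sumDocFreq(List<Map<String, Integer>> docs) {
        long total = 0;
        for (Map<String, Integer> doc : docs) {
            total += doc.size();
        }
        return total;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> docs = Arrays.asList(parseDoc("a|10 b|23 c|90"));
        System.out.println("docFreq(a) = " + docFreq(docs, "a"));             // 1
        System.out.println("totalTermFreq(a) = " + totalTermFreq(docs, "a")); // 10
        System.out.println("sumTotalTermFreq = " + sumTotalTermFreq(docs));   // 123
        System.out.println("sumDocFreq = " + sumDocFreq(docs));               // 3
    }
}
```

With one document, every term has a docFreq of 1 no matter how large its per-document frequency is, while sumTotalTermFreq (10 + 23 + 90 = 123) and sumDocFreq (3) match the output in the original message.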
