Posted to dev@lucene.apache.org by Andrzej Bialecki <ab...@getopt.org> on 2009/11/08 01:47:21 UTC
Omit positions but not TF
Hi,
During one of the discussions at ApacheCon it occurred to me that it would
be useful to have an option to discard positional information but still
keep the term frequency. Position-dependent queries wouldn't work then,
but all other queries would still work fine and we would get the right
scoring.
I believe it should be possible to do this without changing the file
format, if we used a negative term frequency for terms without positions
- we would have to check for that condition in SegmentTermDocs, change
the flags there and flip the sign of docFreq. Eventually we may want
to add a separate flag for this and bump the format version.
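
A minimal sketch of the sign trick described above, assuming the stored
frequency is always at least 1 so that its sign is free to act as a flag.
The class and method names are hypothetical and are not Lucene API; the
real postings reader (SegmentTermDocs) is more involved than this.

```java
// Hypothetical illustration only: none of these names exist in Lucene.
class SignedFreqSketch {

    // Writer side: store the frequency negated when positions are omitted.
    static int encode(int termFreq, boolean omitPositions) {
        if (termFreq < 1) {
            throw new IllegalArgumentException("termFreq must be >= 1");
        }
        return omitPositions ? -termFreq : termFreq;
    }

    // Reader side: a negative stored value signals "no positions for this term".
    static boolean positionsOmitted(int stored) {
        return stored < 0;
    }

    // Reader side: flip the sign back to recover the real frequency for scoring.
    static int termFreq(int stored) {
        return stored < 0 ? -stored : stored;
    }

    public static void main(String[] args) {
        int stored = encode(3, true);
        System.out.println("stored=" + stored
            + " positionsOmitted=" + positionsOmitted(stored)
            + " termFreq=" + termFreq(stored));
    }
}
```

The point of the sketch is only that scoring can still see the true
frequency while the reader learns from the sign that it must not look
for position data.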
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
Re: Omit positions but not TF
Posted by Simon Willnauer <si...@googlemail.com>.
On Mon, Nov 9, 2009 at 6:03 PM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> How about opening an issue? This way someone else can come along and
> pick up the torch...
+1
>
> Mike
>
> On Mon, Nov 9, 2009 at 11:26 AM, Andrzej Bialecki <ab...@getopt.org> wrote:
>> Andrzej Bialecki wrote:
>>>
>>> Michael McCandless wrote:
>>>>
>>>> +1
>>>>
>>>> I guess we'd add a Fieldable.setOmitPositions? And then save that in
>>>> FieldInfos, and fix the postings writing/reading to respect it? Ie,
>>>> we can just change the index format. Encoding as negative numbers
>>>
>>> Yes, that's what I had in mind. I was a bit shy of bumping the format
>>> version, but likely there will be other changes that we can put under the
>>> same next version of the format.
>>>
>>>> isn't great because the termFreq is written as a vInt, which consumes
>>>> 5 bytes to encode any negative number. Wanna cough up a patch?
>>>
>>> Heh .. that's the right term for it, I haven't looked at the details of
>>> oal.index.* since 2.4-ish or so ... we'll see ;)
>>
>> Ehh, sorry - I think I'll give up for now, after looking at the combinatoric
>> increase in the number of arguments to various indexing classes ...
>>
Re: Omit positions but not TF
Posted by Michael McCandless <lu...@mikemccandless.com>.
How about opening an issue? This way someone else can come along and
pick up the torch...
Mike
On Mon, Nov 9, 2009 at 11:26 AM, Andrzej Bialecki <ab...@getopt.org> wrote:
> Andrzej Bialecki wrote:
>>
>> Michael McCandless wrote:
>>>
>>> +1
>>>
>>> I guess we'd add a Fieldable.setOmitPositions? And then save that in
>>> FieldInfos, and fix the postings writing/reading to respect it? Ie,
>>> we can just change the index format. Encoding as negative numbers
>>
>> Yes, that's what I had in mind. I was a bit shy of bumping the format
>> version, but likely there will be other changes that we can put under the
>> same next version of the format.
>>
>>> isn't great because the termFreq is written as a vInt, which consumes
>>> 5 bytes to encode any negative number. Wanna cough up a patch?
>>
>> Heh .. that's the right term for it, I haven't looked at the details of
>> oal.index.* since 2.4-ish or so ... we'll see ;)
>
> Ehh, sorry - I think I'll give up for now, after looking at the combinatoric
> increase in the number of arguments to various indexing classes ...
>
Re: Omit positions but not TF
Posted by Andrzej Bialecki <ab...@getopt.org>.
Andrzej Bialecki wrote:
> Michael McCandless wrote:
>> +1
>>
>> I guess we'd add a Fieldable.setOmitPositions? And then save that in
>> FieldInfos, and fix the postings writing/reading to respect it? Ie,
>> we can just change the index format. Encoding as negative numbers
>
> Yes, that's what I had in mind. I was a bit shy of bumping the format
> version, but likely there will be other changes that we can put under
> the same next version of the format.
>
>> isn't great because the termFreq is written as a vInt, which consumes
>> 5 bytes to encode any negative number. Wanna cough up a patch?
>
> Heh .. that's the right term for it, I haven't looked at the details of
> oal.index.* since 2.4-ish or so ... we'll see ;)
Ehh, sorry - I think I'll give up for now, after looking at the
combinatoric increase in the number of arguments to various indexing
classes ...
--
Best regards,
Andrzej Bialecki <><
Re: Omit positions but not TF
Posted by Andrzej Bialecki <ab...@getopt.org>.
Michael McCandless wrote:
> +1
>
> I guess we'd add a Fieldable.setOmitPositions? And then save that in
> FieldInfos, and fix the postings writing/reading to respect it? Ie,
> we can just change the index format. Encoding as negative numbers
Yes, that's what I had in mind. I was a bit shy of bumping the format
version, but likely there will be other changes that we can put under
the same next version of the format.
> isn't great because the termFreq is written as a vInt, which consumes
> 5 bytes to encode any negative number. Wanna cough up a patch?
Heh .. that's the right term for it, I haven't looked at the details of
oal.index.* since 2.4-ish or so ... we'll see ;)
> Probably this should wait until 3.1.
+1.
--
Best regards,
Andrzej Bialecki <><
Re: Omit positions but not TF
Posted by Michael McCandless <lu...@mikemccandless.com>.
+1
I guess we'd add a Fieldable.setOmitPositions? And then save that in
FieldInfos, and fix the postings writing/reading to respect it? Ie,
we can just change the index format. Encoding as negative numbers
isn't great because the termFreq is written as a vInt, which consumes
5 bytes to encode any negative number. Wanna cough up a patch?
Probably this should wait until 3.1.
Mike
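
To make the vInt point above concrete, here is a small self-contained
sketch of a variable-length int encoder: 7 data bits per byte, with the
high bit set while more bytes follow. It is written from memory to mirror
the scheme Lucene uses for vInts, so treat it as an illustration rather
than the library's code; the class and helper names are made up for this
example.

```java
import java.io.ByteArrayOutputStream;

// Illustrative vInt codec; not the Lucene classes themselves.
class VIntSizeDemo {

    // Encode an int using 7 bits per byte, low-order bits first.
    static byte[] writeVInt(int i) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80); // continuation bit set
            i >>>= 7;                     // unsigned shift, so negatives terminate
        }
        out.write(i);
        return out.toByteArray();
    }

    // Decode the bytes produced by writeVInt back into an int.
    static int readVInt(byte[] bytes) {
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            shift += 7;
        }
        return value;
    }

    public static void main(String[] args) {
        int[] samples = {1, 127, 128, -1};
        for (int s : samples) {
            System.out.println(s + " -> " + writeVInt(s).length + " byte(s)");
        }
    }
}
```

A negative int has its top bit set, so all 32 bits are significant and
the encoder needs five 7-bit groups. Typical small term frequencies fit
in a single byte, which is why a negative flag value costs four extra
bytes per posting.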