Posted to dev@lucene.apache.org by Andrzej Bialecki <ab...@getopt.org> on 2009/11/08 01:47:21 UTC

Omit positions but not TF

Hi,

During one of the discussions at ApacheCon it occurred to me that it would 
be useful to have an option to discard positional information but still 
keep the term frequency. Position-dependent queries wouldn't work then, 
but all other queries would still work fine and we would get the right 
scoring.

I believe it should be possible to do this without changing the file 
format, if we used a negative term frequency for terms without positions 
- we would have to check for that condition in SegmentTermDocs, change 
the flags there and flip the sign of docFreq. Eventually we may want 
to add a separate flag for this and bump the format version.
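
To make the sign trick concrete, here is a minimal sketch of the idea, 
not actual Lucene code. The class SignFlaggedFreq and its methods 
encode, positionsOmitted and decode are hypothetical names invented for 
illustration; the only assumption is that a real term frequency is always 
positive, so the sign bit is free to carry the "no positions" flag.

```java
// Hypothetical helper illustrating the proposal; this is NOT a Lucene API.
final class SignFlaggedFreq {
    private SignFlaggedFreq() {}

    // Store the frequency negated when positions were omitted for the term.
    static int encode(int termFreq, boolean omitPositions) {
        if (termFreq <= 0) {
            throw new IllegalArgumentException("termFreq must be positive: " + termFreq);
        }
        return omitPositions ? -termFreq : termFreq;
    }

    // A negative stored value signals that no positions follow.
    static boolean positionsOmitted(int stored) {
        return stored < 0;
    }

    // Flip the sign back to recover the real term frequency for scoring.
    static int decode(int stored) {
        return stored < 0 ? -stored : stored;
    }

    public static void main(String[] args) {
        int stored = encode(3, true);
        System.out.println("stored=" + stored
                + " freq=" + decode(stored)
                + " positionsOmitted=" + positionsOmitted(stored));
    }
}
```

A reader would check positionsOmitted first, skip any position decoding 
when it returns true, and use decode for the frequency in either case.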

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Omit positions but not TF

Posted by Simon Willnauer <si...@googlemail.com>.
On Mon, Nov 9, 2009 at 6:03 PM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> How about opening an issue?  This way someone else can come along and
> pick up the torch...
+1
>
> Mike
>
> On Mon, Nov 9, 2009 at 11:26 AM, Andrzej Bialecki <ab...@getopt.org> wrote:
>> Andrzej Bialecki wrote:
>>>
>>> Michael McCandless wrote:
>>>>
>>>> +1
>>>>
>>>> I guess we'd add a Fieldable.setOmitPositions?  And then save that in
>>>> FieldInfos, and fix the postings writing/reading to respect it?  Ie,
>>>> we can just change the index format.  Encoding as negative numbers
>>>
>>> Yes, that's what I had in mind. I was a bit shy of bumping the format
>>> version, but likely there will be other changes that we can put under the
>>> same next version of the format.
>>>
>>>> isn't great because the termFreq is written as a vInt, which consumes
>>>> 5 bytes to encode any negative number.  Wanna cough up a patch?
>>>
>>> Heh .. that's the right term for it, I haven't looked at the details of
>>> oal.index.* since 2.4-ish or so ... we'll see ;)
>>
>> Ehh, sorry - I think I'll give up for now, after looking at the combinatoric
>> increase in the number of arguments to various indexing classes ...
>>


Re: Omit positions but not TF

Posted by Michael McCandless <lu...@mikemccandless.com>.
How about opening an issue?  This way someone else can come along and
pick up the torch...

Mike

On Mon, Nov 9, 2009 at 11:26 AM, Andrzej Bialecki <ab...@getopt.org> wrote:
> Andrzej Bialecki wrote:
>>
>> Michael McCandless wrote:
>>>
>>> +1
>>>
>>> I guess we'd add a Fieldable.setOmitPositions?  And then save that in
>>> FieldInfos, and fix the postings writing/reading to respect it?  Ie,
>>> we can just change the index format.  Encoding as negative numbers
>>
>> Yes, that's what I had in mind. I was a bit shy of bumping the format
>> version, but likely there will be other changes that we can put under the
>> same next version of the format.
>>
>>> isn't great because the termFreq is written as a vInt, which consumes
>>> 5 bytes to encode any negative number.  Wanna cough up a patch?
>>
>> Heh .. that's the right term for it, I haven't looked at the details of
>> oal.index.* since 2.4-ish or so ... we'll see ;)
>
> Ehh, sorry - I think I'll give up for now, after looking at the combinatoric
> increase in the number of arguments to various indexing classes ...
>


Re: Omit positions but not TF

Posted by Andrzej Bialecki <ab...@getopt.org>.
Andrzej Bialecki wrote:
> Michael McCandless wrote:
>> +1
>>
>> I guess we'd add a Fieldable.setOmitPositions?  And then save that in
>> FieldInfos, and fix the postings writing/reading to respect it?  Ie,
>> we can just change the index format.  Encoding as negative numbers
> 
> Yes, that's what I had in mind. I was a bit shy of bumping the format 
> version, but likely there will be other changes that we can put under 
> the same next version of the format.
> 
>> isn't great because the termFreq is written as a vInt, which consumes
>> 5 bytes to encode any negative number.  Wanna cough up a patch?
> 
> Heh .. that's the right term for it, I haven't looked at the details of 
> oal.index.* since 2.4-ish or so ... we'll see ;)

Ehh, sorry - I think I'll give up for now, after looking at the 
combinatoric increase in the number of arguments to various indexing 
classes ...

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





Re: Omit positions but not TF

Posted by Andrzej Bialecki <ab...@getopt.org>.
Michael McCandless wrote:
> +1
> 
> I guess we'd add a Fieldable.setOmitPositions?  And then save that in
> FieldInfos, and fix the postings writing/reading to respect it?  Ie,
> we can just change the index format.  Encoding as negative numbers

Yes, that's what I had in mind. I was a bit shy of bumping the format 
version, but likely there will be other changes that we can put under 
the same next version of the format.

> isn't great because the termFreq is written as a vInt, which consumes
> 5 bytes to encode any negative number.  Wanna cough up a patch?

Heh .. that's the right term for it, I haven't looked at the details of 
oal.index.* since 2.4-ish or so ... we'll see ;)

> Probably this should wait until 3.1.

+1.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: Omit positions but not TF

Posted by Michael McCandless <lu...@mikemccandless.com>.
+1

I guess we'd add a Fieldable.setOmitPositions?  And then save that in
FieldInfos, and fix the postings writing/reading to respect it?  Ie,
we can just change the index format.  Encoding as negative numbers
isn't great because the termFreq is written as a vInt, which consumes
5 bytes to encode any negative number.  Wanna cough up a patch?
Probably this should wait until 3.1.
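
The vInt cost can be checked with a small standalone sketch. It assumes 
the usual variable-length scheme (7 payload bits per byte, high bit set 
when more bytes follow, the int treated as unsigned); the class 
VIntSizeDemo and its writeVInt method are illustrative names, not the 
Lucene classes themselves.

```java
import java.io.ByteArrayOutputStream;

// Standalone illustration of variable-length int encoding; not Lucene code.
final class VIntSizeDemo {
    private VIntSizeDemo() {}

    // Emit 7 bits per byte, low-order bits first; the high bit marks "more bytes follow".
    static byte[] writeVInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7; // unsigned shift, so a negative value still terminates
        }
        out.write(value);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        for (int v : new int[] {1, 127, 128, -1, -3}) {
            System.out.println(v + " -> " + writeVInt(v).length + " byte(s)");
        }
    }
}
```

Because a negative int always has its top bit set, all 32 bits have to be 
written, which takes five 7-bit groups; a small positive frequency fits in 
a single byte.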

Mike

On Sat, Nov 7, 2009 at 7:47 PM, Andrzej Bialecki <ab...@getopt.org> wrote:
> Hi,
>
> During one of the discussions at ApacheCon it occurred to me that it would
> be useful to have an option to discard positional information but still keep
> the term frequency. Position-dependent queries wouldn't work then, but all
> other queries would still work fine and we would get the right
> scoring.
>
> I believe it should be possible to do this without changing the file format,
> if we used a negative term frequency for terms without positions - we would
> have to check for that condition in SegmentTermDocs, change the flags there
> and flip the sign of docFreq. Eventually we may want to add a separate
> flag for this and bump the format version.
>