Posted to dev@lucene.apache.org by Aleksandra Woźniak <al...@gmail.com> on 2013/08/01 10:20:02 UTC

VInt block length in Lucene 4.1 postings format

Hi all,

Recently I wanted to try out some modifications of Lucene's postings
format (namely, copying blocks that have no deletions without
int-decoding/encoding -- this is similar to what was described here:
https://issues.apache.org/jira/browse/LUCENE-2082). I started with changing
Lucene 4.1 postings format to check what can be done there.

I came across the following problem: in Lucene41PostingsReader the length
(number of bytes) of the last, vInt-encoded block of postings is not known
until all individual postings have been read and decoded. When reading this
block we only know the number of postings to read and decode, not the number
of bytes, since vInts vary in size by definition.
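
To illustrate why the posting count alone does not give the byte length,
here is a minimal sketch of vInt encoding (7 data bits per byte, high bit
set on every byte except the last). The class and method names are made up
for illustration and are not part of Lucene; the loop is meant to mirror
what DataOutput.writeVInt does.

```java
import java.io.ByteArrayOutputStream;

public class VIntSize {
    // Hypothetical helper: encode a non-negative int as a vInt.
    // Each byte carries 7 bits of payload, low-order bits first;
    // the high bit signals that more bytes follow.
    static byte[] encodeVInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // The same number of values can occupy different numbers of bytes.
        int[] samples = {1, 127, 128, 16383, 16384};
        for (int v : samples) {
            System.out.println(v + " -> " + encodeVInt(v).length + " byte(s)");
        }
    }
}
```

So two blocks holding the same number of postings can differ in size,
depending on the magnitude of the doc deltas and frequencies they contain.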

If I want to copy the whole block without vInt decoding/encoding, I need
to know how many bytes I have to read from the postings index input. So, my
question is: is there a clean way to determine the length of this block
(i.e. the number of bytes that this block has)? Is the number of bytes in a
posting list tracked somewhere in Lucene 4.1 postings format?

Thanks,
Aleksandra

Re: VInt block length in Lucene 4.1 postings format

Posted by Han Jiang <ji...@gmail.com>.
Hi Aleksandra,

The PostingsReader uses a skip list to determine the start file
pointer of each block (both FOR-packed and vInt-encoded). This
information is currently maintained by Lucene41SkipReader.

The tricky part is that, for each term, the skip data is stored right at
the end of the TermFreqs blocks. So, if you fetch the startFP of the vInt
block and know the docTermStartOffset & skipOffset for the current term,
you can calculate what you need.

http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/codecs/lucene41/Lucene41PostingsFormat.html#Frequencies
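
The calculation described above can be sketched roughly as follows. This
is only an illustration of the arithmetic, not code from Lucene: the class,
method and parameter names are hypothetical, and it assumes the term
actually has skip data (terms with too few documents do not, in which case
skipOffset is not available and this approach does not apply).

```java
public class VIntBlockLength {
    /**
     * Hypothetical helper: byte length of a term's trailing vInt block.
     *
     * docTermStartFP   - file pointer where this term's TermFreqs data starts
     * skipOffset       - length of this term's TermFreqs data, i.e. the offset
     *                    of its skip data relative to docTermStartFP
     * vIntBlockStartFP - file pointer where the vInt block starts
     *                    (assumed to be obtained via the skip list)
     */
    static long vIntBlockLength(long docTermStartFP, long skipOffset,
                                long vIntBlockStartFP) {
        if (skipOffset < 0) {
            throw new IllegalArgumentException("term has no skip data");
        }
        // The skip data begins where the TermFreqs blocks end,
        // so this is also the end of the vInt block.
        long skipDataStartFP = docTermStartFP + skipOffset;
        long length = skipDataStartFP - vIntBlockStartFP;
        if (length < 0) {
            throw new IllegalStateException("vInt block starts after skip data");
        }
        return length;
    }

    public static void main(String[] args) {
        // Made-up numbers: TermFreqs data starts at 1000 and is 300 bytes
        // long; the vInt block starts at 1260, so it spans 40 bytes.
        System.out.println(vIntBlockLength(1000L, 300L, 1260L));
    }
}
```

With that length in hand, the block could in principle be copied as raw
bytes instead of being decoded and re-encoded.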


-- 
Han Jiang

Team of Search Engine and Web Mining,
School of Electronic Engineering and Computer Science,
Peking University, China
