You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by Sebastiano Vigna <vi...@di.unimi.it> on 2013/02/16 11:40:21 UTC

Interleaving and new Lucene formats

I'd like to redo the benchmarks published on MG4J's home page with Lucene 4.1. However, for this I'd need to know whether when using PForDelta coding the counts (a.k.a. within-document frequencies) are stored interleaved with the document pointers as in 3.6.2 (and, if not so, the cheapest way to force a count read for each returned document, even modifiying the code if it's more efficient than otherwise).

It would also be important for me to force PForDelta everywhere, if possible, as the point is benchmarking different index representations, and mixing with variable-byte makes the benchmark difficult to interpret.

Thank you!

Ciao,

					seba


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Interleaving and new Lucene formats

Posted by Sebastiano Vigna <vi...@di.unimi.it>.
On 16 February 2013 14:35, Robert Muir <rc...@gmail.com> wrote:

2. index them, but specify you won't ask for them in the DocsEnum: and
> just use that to iterate documents.
>
>       TermsEnum termsEnum = reader.terms("body").iterator(null);
>       boolean found = termsEnum.seekExact(new BytesRef("dogs"), false);
>       // pass 0, to not ask for frequencies
>       DocsEnum docsEnum = termsEnum.docs(reader.getLiveDocs(), null, 0);
>

Thank you. That's exactly the information I was looking for.

Re: Interleaving and new Lucene formats

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
>From IndexReader API javadoc:

 <p>There are two different types of IndexReaders:
 <ul>
  <li>{@link AtomicReader}: These indexes do not consist of several sub-readers,
  they are atomic. They support retrieval of stored fields, doc values, terms,
  and postings.
  <li>{@link CompositeReader}: Instances (like {@link DirectoryReader})
  of this reader can only
  be used to get stored fields from the underlying AtomicReaders,
  but it is not possible to directly retrieve postings. To do that, get
  the sub-readers via {@link CompositeReader#getSequentialSubReaders}.
  Alternatively, you can mimic an {@link AtomicReader} (with a serious
slowdown),
  by wrapping composite readers with {@link SlowCompositeReaderWrapper}.
 </ul>

You need to get an AtomicReader somehow (or wrap with a
SlowCompositeReaderWrapper).

D.

On Sun, Feb 17, 2013 at 11:30 AM, Sebastiano Vigna <vi...@di.unimi.it> wrote:
> On 16 February 2013 14:35, Robert Muir <rc...@gmail.com> wrote:
>>
>>
>>       TermsEnum termsEnum = reader.terms("body").iterator(null);
>>       boolean found = termsEnum.seekExact(new BytesRef("dogs"), false);
>>       // pass 0, to not ask for frequencies
>>       DocsEnum docsEnum = termsEnum.docs(reader.getLiveDocs(), null, 0);
>
>
> I'm painstakingly chasing  through the API, but IndexReader (I had to guess
> the class you meant for "reader") does not have a terms() method. Only the
> Fields class. Could you please expand the first line in something that
> compiles? Thanks!
>
> seba

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Interleaving and new Lucene formats

Posted by Sebastiano Vigna <vi...@di.unimi.it>.
On 16 February 2013 14:35, Robert Muir <rc...@gmail.com> wrote:

>
>       TermsEnum termsEnum = reader.terms("body").iterator(null);
>       boolean found = termsEnum.seekExact(new BytesRef("dogs"), false);
>       // pass 0, to not ask for frequencies
>       DocsEnum docsEnum = termsEnum.docs(reader.getLiveDocs(), null, 0);
>

I'm painstakingly chasing  through the API, but IndexReader (I had to guess
the class you meant for "reader") does not have a terms() method. Only the
Fields class. Could you please expand the first line in something that
compiles? Thanks!

seba

Re: Interleaving and new Lucene formats

Posted by Robert Muir <rc...@gmail.com>.
On Sat, Feb 16, 2013 at 8:19 AM, Sebastiano Vigna <vi...@di.unimi.it> wrote:
>
> I never asked for that. It looks like you're entirely missing my point.
> Which is to do a fair benchmark between radically different implementations
> of an index structure.

"It would also be important for me to force PForDelta everywhere"

>
>>
>> Thats right. Also keep in mind: in the FOR case the blocks themselves
>> are interleaved, so you have a block of 128 doc deltas, then a block
>> of 128 freqs follow, then 128 doc deltas again, then 128 freqs.
>> finally the vint remainder is docs+freqs interleaved as vints.
>
>
> OK, we are slowly getting there.
>
> So the question is: do you decode interleaved freqs blocks *always*, or do
> you do it *lazily* when freqs are actually used?

They are only decoded when they are present, and asked for up front in
the enumerator.
Currently scorers ask for these, because they use them for scoring.

If you want to remove frequencies from the equation, you can:

1. omit them completely at indexing time:

    FieldType ft = new FieldType(TextField.TYPE_NOT_STORED);
    ft.setIndexOptions(IndexOptions.DOCS_ONLY);
    Field field = new Field("body", "body contents", ft);

2. index them, but specify you won't ask for them in the DocsEnum: and
just use that to iterate documents.

      TermsEnum termsEnum = reader.terms("body").iterator(null);
      boolean found = termsEnum.seekExact(new BytesRef("dogs"), false);
      // pass 0, to not ask for frequencies
      DocsEnum docsEnum = termsEnum.docs(reader.getLiveDocs(), null, 0);

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Interleaving and new Lucene formats

Posted by Sebastiano Vigna <vi...@di.unimi.it>.
On 16 February 2013 13:19, Robert Muir <rc...@gmail.com> wrote:

I think you are missing my point: this interleaving is part of the
> whole design of this postings format. You can't just turn it off and
> force it to be always FOR: or you would need a new postings format
>

I never asked for that. It looks like you're entirely missing my point.
Which is to do a fair benchmark between radically different implementations
of an index structure.


> Thats right. Also keep in mind: in the FOR case the blocks themselves
> are interleaved, so you have a block of 128 doc deltas, then a block
> of 128 freqs follow, then 128 doc deltas again, then 128 freqs.
> finally the vint remainder is docs+freqs interleaved as vints.
>

OK, we are slowly getting there.

So the question is: do you decode interleaved freqs blocks *always*, or do
you do it *lazily* when freqs are actually used?

My only concern, again, is to do a fair benchmark against a non-interleaved
index. Which means that I have to force freqs reading on the
non-interleaved index, or I would penalize Lucene's structure. On the other
hand, if Lucene does not decode freqs when this is not necessary, I would
penalize the non-interleaved index by forcing freqs reads (I am using
a NullCollector to avoid any kind of scoring).

If both indices do not read freqs when this is not necessary, then I'm also
interested in forcing a count read to test the speed of access to counts
during conjunctive queries.

The alternative is doing tests only on phrasal and span queries, where
everybody has to get the same data anyway, but I think this misses to
detect some aspects because there's a lot of CPU work and sequential
position reading.

Re: Interleaving and new Lucene formats

Posted by Robert Muir <rc...@gmail.com>.
On Sat, Feb 16, 2013 at 7:05 AM, Sebastiano Vigna <vi...@di.unimi.it> wrote:
> On 16 February 2013 11:45, Robert Muir <rc...@gmail.com> wrote:
>
>> But forcing that wouldn't be testing the 4.1 index format, it would be
>> something else (something not interesting).
>
>
> Do you mind if I have my own share of knowledge and have my idea about
> interesting benchmarks? :)
>
> You didn't answer, but the undertext *seems* that counts are no longer
> interleaved. Again, is it the case?
>
> Forcing a count is an essential test for the index efficiency, as you need
> counts to do scoring. Testing with a scorer is not a good idea because the
> scorer CPU usage is difficult to control across different implementations.
> So the only way of testing a non-interleaved index against an interleaved
> index (or comparing the speed of count access against a non-interleaved
> index) is to force a count reading without any other activity.

I think you are missing my point: this interleaving is part of the
whole design of this postings format. You can't just turn it off and
force it to be always FOR: or you would need a new postings format
with a different design to match (it would need to encode term
dictionary and skip data and other things differently, and track
offsets within FOR blocks and other things).

For example by only recording full FOR blocks of 128 that are not
shared across terms, there are less pointers (e.g. term dictionary)
and other upkeep and hassle. And thats an important part of the design
of this format, that the block encoding doesn't need to worry about
being efficient for all these low frequency terms, we just continue to
encode them pretty much as we did before. It makes the index slightly
larger, but simplifies the case that does matter, and after all, these
low frequency terms aren't bottlenecking anyone anyway.

>
> So essentially you code every blocks of 128 postings using FOR, but fall
> back to VByte for the tail ( <128). For low-frequency terms, this means just
> VByte. Right?
>

Thats right. Also keep in mind: in the FOR case the blocks themselves
are interleaved, so you have a block of 128 doc deltas, then a block
of 128 freqs follow, then 128 doc deltas again, then 128 freqs.
finally the vint remainder is docs+freqs interleaved as vints.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Interleaving and new Lucene formats

Posted by Sebastiano Vigna <vi...@di.unimi.it>.
On 16 February 2013 11:45, Robert Muir <rc...@gmail.com> wrote:

> But forcing that wouldn't be testing the 4.1 index format, it would be
> something else (something not interesting).
>

Do you mind if I have my own share of knowledge and have my idea about
interesting benchmarks? :)

You didn't answer, but the undertext *seems* that counts are no longer
interleaved. Again, is it the case?

Forcing a count is an essential test for the index efficiency, as you need
counts to do scoring. Testing with a scorer is not a good idea because the
scorer CPU usage is difficult to control across different implementations.
So the only way of testing a non-interleaved index against an interleaved
index (or comparing the speed of count access against a non-interleaved
index) is to force a count reading without any other activity.

4.1 index format mixes with variable byte because its more efficient
than using FOR everywhere. This means FOR blocks in this format are
always size 128. The remainder is encoded as vbyte.

So essentially you code every blocks of 128 postings using FOR, but fall
back to VByte for the tail ( <128). For low-frequency terms, this means
just VByte. Right?

Re: Interleaving and new Lucene formats

Posted by Robert Muir <rc...@gmail.com>.
On Sat, Feb 16, 2013 at 5:40 AM, Sebastiano Vigna <vi...@di.unimi.it> wrote:
> I'd like to redo the benchmarks published on MG4J's home page with Lucene 4.1. However, for this I'd need to know whether when using PForDelta coding the counts (a.k.a. within-document frequencies) are stored interleaved with the document pointers as in 3.6.2 (and, if not so, the cheapest way to force a count read for each returned document, even modifiying the code if it's more efficient than otherwise).
>
> It would also be important for me to force PForDelta everywhere, if possible, as the point is benchmarking different index representations, and mixing with variable-byte makes the benchmark difficult to interpret.

But forcing that wouldn't be testing the 4.1 index format, it would be
something else (something not interesting).

4.1 index format mixes with variable byte because its more efficient
than using FOR everywhere. This means FOR blocks in this format are
always size 128. The remainder is encoded as vbyte.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org