Posted to dev@lucene.apache.org by Konstantin <ac...@gmail.com> on 2016/07/08 14:13:27 UTC

Compaction logic

Hello, my name is Konstantin. I'm currently reading Lucene's sources and
wondering why particular technical decisions were made.

Full disclosure - I'm writing my own inverted index implementation as a pet
project, https://github.com/kk00ss/Rhinodog . It's about 4 kloc of Scala,
and there are tests comparing it with Lucene on a wiki dump (I actually run
it only on a small part of it, ~500MB).

Most interesting to me is why the compaction algorithm is implemented the
way it is - it's clear and simple, but wouldn't it be better to merge
postings lists on a per-term basis? The current Lucene implementation is
probably better for HDDs, and the proposed one would need an SSD to show
adequate performance. But it would mean more, smaller compactions, each
much cheaper. Sometimes, when a term has a small postings list, a per-term
merge would be inefficient, but I think some threshold could be used.
This idea comes from the assumption that when half of the documents have
been removed from a segment, not all terms might need compaction, assuming
a non-uniform distribution of terms among documents (which seems likely to
me, an amateur ;-) ).
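
A rough sketch of the idea in Java (purely hypothetical code, not
Rhinodog's or Lucene's; it assumes globally unique doc IDs and a
liveness check for deletions):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.PriorityQueue;
    import java.util.function.IntPredicate;

    // Hypothetical per-term compaction with a size threshold: merge one
    // term's postings lists from several segments, dropping deleted docs.
    class PerTermCompactor {
        // Below this size a per-term merge would not pay off; leave the
        // term for an ordinary whole-segment merge.
        static final int THRESHOLD = 128;

        // Each int[] is one segment's sorted doc IDs for the term.
        // Returns the merged list, or null when the term is too small.
        static int[] compact(List<int[]> segments, IntPredicate isLive) {
            int total = segments.stream().mapToInt(s -> s.length).sum();
            if (total < THRESHOLD) return null;

            // k-way merge via a priority queue of {docId, segment, offset}.
            PriorityQueue<int[]> pq =
                new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
            for (int i = 0; i < segments.size(); i++)
                if (segments.get(i).length > 0)
                    pq.add(new int[] {segments.get(i)[0], i, 0});

            List<Integer> out = new ArrayList<>();
            while (!pq.isEmpty()) {
                int[] top = pq.poll();
                if (isLive.test(top[0])) out.add(top[0]); // deletions drop out here
                int[] seg = segments.get(top[1]);
                int next = top[2] + 1;
                if (next < seg.length) pq.add(new int[] {seg[next], top[1], next});
            }
            return out.stream().mapToInt(Integer::intValue).toArray();
        }
    }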

Does it make any sense?
BTW, any input about Rhinodog and its benchmarks vs Lucene would be
appreciated.

Re: Compaction logic

Posted by Konstantin <ac...@gmail.com>.
To Michael McCandless:
Yes, I meant the Lucene Codecs abstraction - but it alone doesn't cut it,
and there is probably no easy way to integrate this approach into Lucene.
There is also not enough evidence that it makes sense at all. I should test
this approach with B-trees and small compactions first, with the existing
Lucene benchmarks as a reference.
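
For reference, the hook I was looking at is something like this minimal
sketch (it should compile against lucene-core; "Lucene50" is just a
stand-in looked up by name for whatever custom postings format would
actually do the per-term storage):

    import org.apache.lucene.codecs.Codec;
    import org.apache.lucene.codecs.FilterCodec;
    import org.apache.lucene.codecs.PostingsFormat;

    // A codec that delegates everything to the default codec but swaps
    // the postings format. A real experiment would substitute a custom
    // PostingsFormat here; looking one up by name is only a placeholder.
    public class PerTermCodec extends FilterCodec {
        private final PostingsFormat postings = PostingsFormat.forName("Lucene50");

        public PerTermCodec() {
            super("PerTermCodec", Codec.getDefault());
        }

        @Override
        public PostingsFormat postingsFormat() {
            return postings;
        }
    }

This controls only the on-disk format, though - merging is still driven
per segment by the MergePolicy, which is why the Codec abstraction alone
doesn't cut it for per-term compaction.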
To Erick Erickson:
About the "typical user query" - it's clear now that that was speculation.

Thanks everybody

2016-07-10 22:08 GMT+03:00 Erick Erickson <er...@gmail.com>:

> Comments from the peanut gallery...
>
> I'd state it much more harshly. There's no such thing as a "typical
> user query" ;) We spend a lot of time trying to score documents to
> return the "best" answer... which is totally irrelevant to some of
> the applications we see where the only concern is aggregations. Which
> is totally irrelevant for apps (think, say, patents or drug research or
> most other things legal, and many things academic) where the overriding
> concern is seeing _all_ the documents pertaining to the search. Which
> is totally irrelevant to e-commerce, where the concern is how much
> margin, say, an aggregator makes, which is totally irrelevant to <insert
> case N+1 here>.
>
> I truly wish there were a better answer here, but until there is, I'd
> just use Mike's stuff if you can; at least that way you're comparing the
> new code against a long-running benchmark.
>
> FWIW,
> Erick
>
> On Sun, Jul 10, 2016 at 6:45 AM, Michael McCandless
> <lu...@mikemccandless.com> wrote:
> > On Sat, Jul 9, 2016 at 5:44 PM, Konstantin <ac...@gmail.com>
> wrote:
> >>
> >> Thanks
> >> I'm aware of the current implementation of merging in Lucene on a high
> >> level.
> >> Yes, Rhinodog uses a B-tree for storing everything; it is a bottleneck
> >> on writes, but it's almost as fast on reads as direct access to a
> >> location on disk.
> >
> > Slower writes for faster reads is the right tradeoff for a search
> > engine, in general, IMO.
> >>
> >> (With a cold cache, while using an SSD, reads take less time than
> >> decoding blocks.) But maybe there is a way to decouple merging/storing
> >> + codecs from everything else? Just quickly looking over the sources,
> >> it actually seems like a hard task to me, with yet unclear benefits.
> >> I'll compare these compaction strategies.
> >
> > You mean like Lucene's Codec abstractions?
> >>
> >> Also, I have a question about search performance - I'm most likely
> >> testing it in the wrong way - do you test performance on real user
> >> queries? What kinds of queries are more likely? Those where the query
> >> words have similar frequencies, or those where the words' frequencies
> >> differ by orders of magnitude?
> >
> > It's not possible to answer this :(
> >
> > Real user queries, and the real documents those users were querying, are
> > by far the best, but they are not easy to come by.
> >
> > In the nightly wikipedia benchmark, e.g.
> > http://home.apache.org/~mikemccand/lucenebench/Phrase.html , I use
> > synthetically generated queries derived from an index to try to mix up
> > the relative frequencies of the terms.
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
>

Re: Compaction logic

Posted by Erick Erickson <er...@gmail.com>.
Comments from the peanut gallery...

I'd state it much more harshly. There's no such thing as a "typical
user query" ;) We spend a lot of time trying to score documents to
return the "best" answer... which is totally irrelevant to some of
the applications we see where the only concern is aggregations. Which
is totally irrelevant for apps (think, say, patents or drug research or
most other things legal, and many things academic) where the overriding
concern is seeing _all_ the documents pertaining to the search. Which
is totally irrelevant to e-commerce, where the concern is how much
margin, say, an aggregator makes, which is totally irrelevant to <insert
case N+1 here>.

I truly wish there were a better answer here, but until there is, I'd
just use Mike's stuff if you can; at least that way you're comparing the
new code against a long-running benchmark.

FWIW,
Erick

On Sun, Jul 10, 2016 at 6:45 AM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> On Sat, Jul 9, 2016 at 5:44 PM, Konstantin <ac...@gmail.com> wrote:
>>
>> Thanks
>> I'm aware of the current implementation of merging in Lucene on a high
>> level.
>> Yes, Rhinodog uses a B-tree for storing everything; it is a bottleneck on
>> writes, but it's almost as fast on reads as direct access to a location
>> on disk.
>
> Slower writes for faster reads is the right tradeoff for a search engine, in
> general, IMO.
>>
>> (With a cold cache, while using an SSD, reads take less time than
>> decoding blocks.) But maybe there is a way to decouple merging/storing +
>> codecs from everything else? Just quickly looking over the sources, it
>> actually seems like a hard task to me, with yet unclear benefits. I'll
>> compare these compaction strategies.
>
> You mean like Lucene's Codec abstractions?
>>
>> Also, I have a question about search performance - I'm most likely
>> testing it in the wrong way - do you test performance on real user
>> queries? What kinds of queries are more likely? Those where the query
>> words have similar frequencies, or those where the words' frequencies
>> differ by orders of magnitude?
>
> It's not possible to answer this :(
>
> Real user queries, and the real documents those users were querying, are
> by far the best, but they are not easy to come by.
>
> In the nightly wikipedia benchmark, e.g.
> http://home.apache.org/~mikemccand/lucenebench/Phrase.html , I use
> synthetically generated queries derived from an index to try to mix up the
> relative frequencies of the terms.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>



Re: Compaction logic

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Sat, Jul 9, 2016 at 5:44 PM, Konstantin <ac...@gmail.com> wrote:

> Thanks
> I'm aware of the current implementation of merging in Lucene on a high
> level.
> Yes, Rhinodog uses a B-tree for storing everything; it is a bottleneck on
> writes, but it's almost as fast on reads as direct access to a location
> on disk.
>
Slower writes for faster reads is the right tradeoff for a search engine,
in general, IMO.

> (With a cold cache, while using an SSD, reads take less time than
> decoding blocks.) But maybe there is a way to decouple merging/storing +
> codecs from everything else? Just quickly looking over the sources, it
> actually seems like a hard task to me, with yet unclear benefits. I'll
> compare these compaction strategies.
>
You mean like Lucene's Codec abstractions?

> Also, I have a question about search performance - I'm most likely
> testing it in the wrong way - do you test performance on real user
> queries? What kinds of queries are more likely? Those where the query
> words have similar frequencies, or those where the words' frequencies
> differ by orders of magnitude?
>
It's not possible to answer this :(

Real user queries, and the real documents those users were querying, are
by far the best, but they are not easy to come by.

In the nightly wikipedia benchmark, e.g.
http://home.apache.org/~mikemccand/lucenebench/Phrase.html , I use
synthetically generated queries derived from an index to try to mix up the
relative frequencies of the terms.
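
As a toy illustration of that kind of generator (hypothetical code, not
the actual benchmark tooling): bucket the indexed terms by document
frequency, then pair terms either from the same bucket or from opposite
ends:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    // Hypothetical sketch: emit two-term queries whose terms come either
    // from the same frequency bucket (similar frequencies) or from
    // opposite ends (frequencies differing by orders of magnitude).
    class SyntheticQueries {
        static List<String> generate(String[] terms, int[] docFreqs,
                                     int count, long seed) {
            List<String> rare = new ArrayList<>(), common = new ArrayList<>();
            for (int i = 0; i < terms.length; i++) {
                if (docFreqs[i] < 100) rare.add(terms[i]);             // low-frequency bucket
                else if (docFreqs[i] >= 100_000) common.add(terms[i]); // high-frequency bucket
            }
            Random rnd = new Random(seed);
            List<String> queries = new ArrayList<>(count);
            for (int i = 0; i < count; i++) {          // assumes both buckets are non-empty
                String a = rare.get(rnd.nextInt(rare.size()));
                String b = (i % 2 == 0)
                    ? common.get(rnd.nextInt(common.size())) // mixed frequencies
                    : rare.get(rnd.nextInt(rare.size()));    // similar frequencies
                queries.add(a + " " + b);
            }
            return queries;
        }
    }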

Mike McCandless

http://blog.mikemccandless.com

Re: Compaction logic

Posted by Konstantin <ac...@gmail.com>.
Thanks
I'm aware of the current implementation of merging in Lucene on a high
level.
Yes, Rhinodog uses a B-tree for storing everything; it is a bottleneck on
writes, but it's almost as fast on reads as direct access to a location on
disk. (With a cold cache, while using an SSD, reads take less time than
decoding blocks.) But maybe there is a way to decouple merging/storing +
codecs from everything else? Just quickly looking over the sources, it
actually seems like a hard task to me, with yet unclear benefits. I'll
compare these compaction strategies.
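
To give a feel for the layout, here is a simplified stand-in (TreeMap in
place of the on-disk tree, and a made-up (termId, blockId) key packing -
not Rhinodog's actual code):

    import java.util.NavigableMap;
    import java.util.TreeMap;

    // Because keys are ordered, one term's blocks sit next to each other,
    // so a per-term merge is a range scan plus in-place rewrites -
    // something a write-once segment file cannot offer.
    class BTreePostingsStore {
        private final NavigableMap<Long, byte[]> tree = new TreeMap<>();

        private static long key(int termId, int blockId) {
            return ((long) termId << 32) | (blockId & 0xFFFFFFFFL);
        }

        void putBlock(int termId, int blockId, byte[] encodedPostings) {
            tree.put(key(termId, blockId), encodedPostings); // update in place
        }

        // All blocks of one term, in order - the unit a per-term
        // compaction would read, merge, and write back.
        NavigableMap<Long, byte[]> termBlocks(int termId) {
            return tree.subMap(key(termId, 0), true,
                               key(termId, Integer.MAX_VALUE), true);
        }
    }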

Also, I have a question about search performance - I'm most likely
testing it in the wrong way - do you test performance on real user
queries? What kinds of queries are more likely? Those where the query
words have similar frequencies, or those where the words' frequencies
differ by orders of magnitude?
On Jul 10, 2016, at 0:22, "Michael McCandless" <
lucene@mikemccandless.com> wrote:

> Rhinodog looks neat!  Impressively small sources :)
>
> Merging postings per term would be hard for Lucene, because it's write
> once, i.e. once a segment's terms and postings are written, we cannot
> update them in place.  Instead, we merge N segments together to a new
> larger (also write once) segment.
>
> Whereas it looks like Rhinodog's term dictionary uses on-disk data
> structures (btree?) that can be updated in place?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Fri, Jul 8, 2016 at 10:13 AM, Konstantin <ac...@gmail.com>
> wrote:
>
>> Hello, my name is Konstantin. I'm currently reading Lucene's sources and
>> wondering why particular technical decisions were made.
>>
>> Full disclosure - I'm writing my own inverted index implementation as a
>> pet project, https://github.com/kk00ss/Rhinodog . It's about 4 kloc of
>> Scala, and there are tests comparing it with Lucene on a wiki dump (I
>> actually run it only on a small part of it, ~500MB).
>>
>> Most interesting to me is why the compaction algorithm is implemented
>> the way it is - it's clear and simple, but wouldn't it be better to merge
>> postings lists on a per-term basis? The current Lucene implementation is
>> probably better for HDDs, and the proposed one would need an SSD to show
>> adequate performance. But it would mean more, smaller compactions, each
>> much cheaper. Sometimes, when a term has a small postings list, a
>> per-term merge would be inefficient, but I think some threshold could be
>> used.
>> This idea comes from the assumption that when half of the documents have
>> been removed from a segment, not all terms might need compaction,
>> assuming a non-uniform distribution of terms among documents (which seems
>> likely to me, an amateur ;-) ).
>>
>> Does it make any sense?
>> BTW, any input about Rhinodog and its benchmarks vs Lucene would be
>> appreciated.
>>
>>
>

Re: Compaction logic

Posted by Michael McCandless <lu...@mikemccandless.com>.
Rhinodog looks neat!  Impressively small sources :)

Merging postings per term would be hard for Lucene, because it's write
once, i.e. once a segment's terms and postings are written, we cannot
update them in place.  Instead, we merge N segments together to a new
larger (also write once) segment.

Whereas it looks like Rhinodog's term dictionary uses on-disk data
structures (btree?) that can be updated in place?
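
In code terms, a stripped-down sketch of that write-once model (not
Lucene's actual implementation; it assumes one shared term dictionary
and global doc IDs, which real Lucene does not have):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.IntPredicate;

    // A "segment" is immutable once written; a merge reads N old segments
    // and writes one brand-new segment, never editing the old files.
    final class Segment {
        final List<int[]> postings; // postings.get(termOrd) = sorted doc IDs
        Segment(List<int[]> postings) { this.postings = List.copyOf(postings); }
    }

    class SegmentMerger {
        static Segment merge(List<Segment> inputs, int termCount,
                             IntPredicate isLive) {
            List<int[]> merged = new ArrayList<>(termCount);
            for (int term = 0; term < termCount; term++) {
                List<Integer> docs = new ArrayList<>();
                for (Segment s : inputs)                     // oldest segment first
                    for (int doc : s.postings.get(term))
                        if (isLive.test(doc)) docs.add(doc); // deletions simply
                                                             // aren't copied over
                merged.add(docs.stream().mapToInt(Integer::intValue).toArray());
            }
            return new Segment(merged); // a new, again write-once segment
        }
    }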

Mike McCandless

http://blog.mikemccandless.com

On Fri, Jul 8, 2016 at 10:13 AM, Konstantin <ac...@gmail.com>
wrote:

> Hello, my name is Konstantin. I'm currently reading Lucene's sources and
> wondering why particular technical decisions were made.
>
> Full disclosure - I'm writing my own inverted index implementation as a
> pet project, https://github.com/kk00ss/Rhinodog . It's about 4 kloc of
> Scala, and there are tests comparing it with Lucene on a wiki dump (I
> actually run it only on a small part of it, ~500MB).
>
> Most interesting to me is why the compaction algorithm is implemented
> the way it is - it's clear and simple, but wouldn't it be better to merge
> postings lists on a per-term basis? The current Lucene implementation is
> probably better for HDDs, and the proposed one would need an SSD to show
> adequate performance. But it would mean more, smaller compactions, each
> much cheaper. Sometimes, when a term has a small postings list, a
> per-term merge would be inefficient, but I think some threshold could be
> used.
> This idea comes from the assumption that when half of the documents have
> been removed from a segment, not all terms might need compaction,
> assuming a non-uniform distribution of terms among documents (which seems
> likely to me, an amateur ;-) ).
>
> Does it make any sense?
> BTW, any input about Rhinodog and its benchmarks vs Lucene would be
> appreciated.
>
>