You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by Mikhail Khludnev <mk...@griddynamics.com> on 2014/02/12 22:12:16 UTC

yet another one (mad) approach for block join

Hello,

Some time ago Uwe defined the problem of making block-join more cute.
https://issues.apache.org/jira/browse/LUCENE-5092?focusedCommentId=13736713&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13736713
I'm not sure I got him right, but recently I thought (what to talk about in
Washington) about comprehensive relations modeling cases. Anyway, I started
from simple test for alternative block join implementation. The overall
idea is:
 - to keep blocks as-is, they are cute;
 - to use term enum for looping parents while enumerating children on
nextDoc(), hence these terms should be equal to docnums;
 - to use a single element doclist to jump back to the previous parent for
advance().

Now you can see that I just tried to reuse trendy Lucene data-structures to
get rid of rewindable bit-set. Right now, the code is ugly because I
reusing them by plain document indexing, later it can be done better with a
specialized codec/enum api. It makes no sense as just a block join
replacement, but it might work out as general modelling approach.

Here is the code
https://github.com/m-khl/solr-patches/blob/af089475ec122630e231dbba397d5639013668e5/lucene/join/src/test/org/apache/lucene/search/join/TestBlockRelations.java?source=cc#L131

Here it the scratches which might explain the current implementation
http://goo.gl/yS1VZN

Your feedback is appreciated.
-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
 <mk...@griddynamics.com>

Re: yet another one (mad) approach for block join

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Thu, Feb 13, 2014 at 11:49 AM, Mikhail Khludnev
<mk...@griddynamics.com> wrote:
> Mike,
> Thanks for the clue. It raises a lot of questions:
>  - Is this cost caused by random access nature of the seekCeil() /** The
> target term may be before or after the current term. */? Is there any chance
> to make it more efficient by requesting "forward only" TermEnum?

Maybe we could get some gains with a "forward only" mode ... not sure.
 The enum already shares state, i.e. it checks for the common prefix
b/w the term it's on now and the term you're seeking to.

But also traversing the FST is not cheap.

>  - will it be faster with 'entirely memory residend term dictionary'?

Likely ...

>  - or the overall idea of using TermEnum just complies with the sub, and
> it's worth to experiment with writing the previous parent docnum (or current
> block size) in payload and reading it when we need to jump back on
> advance()?

Maybe?

But I think using FBS is an OK solution too... that's a very fast way
to find prev/next parent.

>  - once again, would you mind to remind why making DocEnum capable to jump
> back is so hard? Can you recommend any starting point for hacking?

Well, all the impls are heavily "forward only", e.g. we store the doc
deltas in an int[128] and sum as we go.

But anything we can do to improve block join, or query time join,
would be great. E.g. for query time join I think this issue would be a
big win: https://issues.apache.org/jira/browse/LUCENE-4771

Mike McCandless

http://blog.mikemccandless.com

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: yet another one (mad) approach for block join

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
Mike,
Thanks for the clue. It raises a lot of questions:
 - Is this cost caused by random access nature of the seekCeil() /** The
target term may be before or after the current term. */? Is there any
chance to make it more efficient by requesting "forward only" TermEnum?
 - will it be faster with 'entirely memory residend term dictionary'?
 - or the overall idea of using TermEnum just complies with the sub, and
it's worth to experiment with writing the previous parent docnum (or
current block size) in payload and reading it when we need to jump back on
advance()?
 - once again, would you mind to remind why making DocEnum capable to jump
back is so hard? Can you recommend any starting point for hacking?

Your answers are really appreciated.


On Thu, Feb 13, 2014 at 3:10 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> Unfortunately, the terms dict is quite costly, so e.g. doing a
> TermsEnum.seekCeil inside a DocsEnum.advance will probably really hurt
> performance?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Wed, Feb 12, 2014 at 4:12 PM, Mikhail Khludnev
> <mk...@griddynamics.com> wrote:
> > Hello,
> >
> > Some time ago Uwe defined the problem of making block-join more cute.
> >
> https://issues.apache.org/jira/browse/LUCENE-5092?focusedCommentId=13736713&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13736713
> > I'm not sure I got him right, but recently I thought (what to talk about
> in
> > Washington) about comprehensive relations modeling cases. Anyway, I
> started
> > from simple test for alternative block join implementation. The overall
> idea
> > is:
> >  - to keep blocks as-is, they are cute;
> >  - to use term enum for looping parents while enumerating children on
> > nextDoc(), hence these terms should be equal to docnums;
> >  - to use a single element doclist to jump back to the previous parent
> for
> > advance().
> >
> > Now you can see that I just tried to reuse trendy Lucene data-structures
> to
> > get rid of rewindable bit-set. Right now, the code is ugly because I
> reusing
> > them by plain document indexing, later it can be done better with a
> > specialized codec/enum api. It makes no sense as just a block join
> > replacement, but it might work out as general modelling approach.
> >
> > Here is the code
> >
> https://github.com/m-khl/solr-patches/blob/af089475ec122630e231dbba397d5639013668e5/lucene/join/src/test/org/apache/lucene/search/join/TestBlockRelations.java?source=cc#L131
> >
> > Here it the scratches which might explain the current implementation
> > http://goo.gl/yS1VZN
> >
> > Your feedback is appreciated.
> > --
> > Sincerely yours
> > Mikhail Khludnev
> > Principal Engineer,
> > Grid Dynamics
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>


-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
 <mk...@griddynamics.com>

Re: yet another one (mad) approach for block join

Posted by Michael McCandless <lu...@mikemccandless.com>.
Unfortunately, the terms dict is quite costly, so e.g. doing a
TermsEnum.seekCeil inside a DocsEnum.advance will probably really hurt
performance?

Mike McCandless

http://blog.mikemccandless.com


On Wed, Feb 12, 2014 at 4:12 PM, Mikhail Khludnev
<mk...@griddynamics.com> wrote:
> Hello,
>
> Some time ago Uwe defined the problem of making block-join more cute.
> https://issues.apache.org/jira/browse/LUCENE-5092?focusedCommentId=13736713&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13736713
> I'm not sure I got him right, but recently I thought (what to talk about in
> Washington) about comprehensive relations modeling cases. Anyway, I started
> from simple test for alternative block join implementation. The overall idea
> is:
>  - to keep blocks as-is, they are cute;
>  - to use term enum for looping parents while enumerating children on
> nextDoc(), hence these terms should be equal to docnums;
>  - to use a single element doclist to jump back to the previous parent for
> advance().
>
> Now you can see that I just tried to reuse trendy Lucene data-structures to
> get rid of rewindable bit-set. Right now, the code is ugly because I reusing
> them by plain document indexing, later it can be done better with a
> specialized codec/enum api. It makes no sense as just a block join
> replacement, but it might work out as general modelling approach.
>
> Here is the code
> https://github.com/m-khl/solr-patches/blob/af089475ec122630e231dbba397d5639013668e5/lucene/join/src/test/org/apache/lucene/search/join/TestBlockRelations.java?source=cc#L131
>
> Here it the scratches which might explain the current implementation
> http://goo.gl/yS1VZN
>
> Your feedback is appreciated.
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: yet another one (mad) approach for block join

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
Hello Paul,

Definitely no. TestBlockRelations.testSimple() is a copy of block join
"smoke test" TestBlockJoin.testSimple() with some changes.

Once again it's NOT a functionally complete replacement, but possible
approach (with a lot of challenges so far
).



On Thu, Feb 13, 2014 at 3:09 AM, Paul Elschot <pa...@gmail.com>wrote:

> Mikhail,
>
> Does this pass the existing join module tests?
>
> Regards,
> Paul Elschot




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
 <mk...@griddynamics.com>

Re: yet another one (mad) approach for block join

Posted by Paul Elschot <pa...@gmail.com>.
Mikhail,

Does this pass the existing join module tests?

Regards,
Paul Elschot