Posted to dev@lucene.apache.org by Dawid Weiss <da...@gmail.com> on 2017/10/01 11:01:45 UTC

Re: Question concerning refs on TestDemoParallelLeafReader

> I'll have to think about the first 2 questions still, but MDW stands for
> MockDirectoryWrapper!

Ah, sure thing. For what it's worth, I locally removed this delayed
'delete' list and removed the leaf folder immediately -- the tests
passed without any problems on my Windows machine. Could be I didn't
hit the corner case, so I'm interested in any follow-up you might
have, Mike.

Dawid

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: Question concerning refs on TestDemoParallelLeafReader

Posted by Dawid Weiss <da...@gmail.com>.

Thanks again for the explanation, Mike. I understand it now.

Dawid

Re: Question concerning refs on TestDemoParallelLeafReader

Posted by Michael McCandless <lu...@mikemccandless.com>.

On Mon, Oct 2, 2017 at 2:25 PM, Dawid Weiss <da...@gmail.com> wrote:

> I think the delayed deletes might have to do w/ segment warming?
>
> I'll have to digest the scenario you described tomorrow. I didn't hit
> any exceptions when running those modified code snippets (which I'd be
> very grateful to see -- they'd provide an immediate proof something is
> wrong...).

Yeah, it's disappointing the test didn't fail when you removed it.  If my
theory is right (and I'm not sure it is!), removing that code would cause
much higher NRT latency after a big merge finished, because the refresh
thread would pay the price of going off and building the parallel index for
the newly merged segment, instead of the bg merge thread.

> I am glad you're finding a use for this crazy class!
>
> It's super-useful for people who wish to low-level tweak the index
> format. I dreaded this for a long time, but for us it'd provide many
> benefits. We have a scenario where documents can be indexed once (and
> stay in the primary index) and certain derived indexes (features
> indexed on top of those documents) can be placed in the secondary
> index. The benefit here is that our data used to index features can
> change from time to time (as new documents emerge); then we can simply
> drop those existing secondary indexes and provide up-to-date ones.
> This saves disk I/O and is still fairly transparent to the rest of the
> application (because fields never clash between the primary and the
> secondary index and documents are always aligned).
>

Great!  That's exactly what it should work well for!

> Your 'demo' class is a great example of how this can be done. The
> class is surely advanced. Read: it crams way too many aspects into one
> class :) Each of these could be a separate demo:
>

Sorry :)  This is why it's a test class.

If you have ideas to make it easier to use, please refactor away!  I think
it can open up all sorts of unexpected use cases for Lucene, letting you
change your mind / experiment later about how exactly to index your raw
content.

> - splitting indexes into parallel ones (primary/secondary), with
> automatic secondary index creation on merges and startup.
> - folding back secondary index data into the primary index on merges
> (we don't need it, but I imagine there exists a scenario for this),
> - keeping multiple versions of the secondary index (those "generations").
>

I agree these are separate concerns if we can tease them out.

> And probably lots more. It's a very interesting advanced use case.
>
> > And how did you find this test :)
>
> I've been looking at ParallelCompositeReader for some time; as I was
> scanning it internally for its use cases within the code I somehow
> came across that "demo" class which leveraged its lower-level
> internals. It did take me some time to go through the class's internal
> workings because of confusingly named variables (I ended up renaming
> them to 'primary' and 'secondary' index instead of the original
> 'parallel'). But hey, I don't complain -- it's still an awesome piece
> of code!

Thanks :)  Keep up the renaming/refactoring!

I am still unsure why I tracked ref counts at the leaf reader level; did
this somehow enable re-using the parallel leaf readers on each refresh vs.
opening all leaves on each reopen?
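
To make the question concrete, here is a minimal plain-Java sketch of how
per-leaf ref counts could allow reuse across refreshes. It has no Lucene
dependency, every name in it (LeafReuseSketch, Leaf, acquire, release) is
hypothetical, and it is not confirmed to be what the test actually does. It
is also not thread-safe.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model, not Lucene code: one cached "parallel leaf" per segment,
// shared between point-in-time views through a reference count.
class LeafReuseSketch {
    static final class Leaf {
        final String segment;
        int refCount = 1;      // the view that opened it holds the first reference
        boolean closed;
        Leaf(String segment) { this.segment = segment; }
    }

    private final Map<String, Leaf> cache = new HashMap<>();
    int opens = 0;             // how many leaves had to be opened from scratch

    // Builds a view over the given segments, reusing cached leaves where possible.
    List<Leaf> acquire(List<String> segments) {
        List<Leaf> view = new ArrayList<>();
        for (String segment : segments) {
            Leaf leaf = cache.get(segment);
            if (leaf == null) {
                leaf = new Leaf(segment);
                cache.put(segment, leaf);
                opens++;
            } else {
                leaf.refCount++;
            }
            view.add(leaf);
        }
        return view;
    }

    // Drops one reference per leaf; a leaf closes when its last view releases it.
    void release(List<Leaf> view) {
        for (Leaf leaf : view) {
            if (--leaf.refCount == 0) {
                leaf.closed = true;
                cache.remove(leaf.segment);
            }
        }
    }

    public static void main(String[] args) {
        LeafReuseSketch readers = new LeafReuseSketch();
        List<Leaf> before = readers.acquire(List.of("_0", "_1"));
        List<Leaf> after = readers.acquire(List.of("_0", "_1", "_2")); // refresh
        readers.release(before);
        System.out.println("leaves opened: " + readers.opens);
        System.out.println("_0 still open: " + !after.get(0).closed);
    }
}
```

In this model the refresh opens only the new leaf "_2"; "_0" and "_1" are
shared with the previous view and stay open until both views release them.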

Mike McCandless

http://blog.mikemccandless.com

Re: Question concerning refs on TestDemoParallelLeafReader

Posted by Dawid Weiss <da...@gmail.com>.

Hi Mike,

Thanks for the feedback.

> I think the delayed deletes might have to do w/ segment warming?

I'll have to digest the scenario you described tomorrow. I didn't hit
any exceptions when running those modified code snippets (which I'd be
very grateful to see -- they'd provide an immediate proof something is
wrong...).

> I am glad you're finding a use for this crazy class!

It's super-useful for people who wish to low-level tweak the index
format. I dreaded this for a long time, but for us it'd provide many
benefits. We have a scenario where documents can be indexed once (and
stay in the primary index) and certain derived indexes (features
indexed on top of those documents) can be placed in the secondary
index. The benefit here is that our data used to index features can
change from time to time (as new documents emerge); then we can simply
drop those existing secondary indexes and provide up-to-date ones.
This saves disk I/O and is still fairly transparent to the rest of the
application (because fields never clash between the primary and the
secondary index and documents are always aligned).
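
As a rough illustration of that layout, here is a minimal plain-Java sketch.
It does not use Lucene; the class AlignedIndexSketch, its methods and the
field names are all hypothetical, and each "index" is just a list of
per-document field maps addressed by document id.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model, not Lucene code: a primary index that is written once
// and a replaceable secondary index, aligned by document id.
class AlignedIndexSketch {
    private final List<Map<String, String>> primary;
    private List<Map<String, String>> secondary;

    AlignedIndexSketch(List<Map<String, String>> primary,
                       List<Map<String, String>> secondary) {
        this.primary = primary;
        replaceSecondary(secondary);
    }

    // Swaps in a new secondary index; the primary index is left untouched.
    void replaceSecondary(List<Map<String, String>> fresh) {
        if (fresh.size() != primary.size()) {
            throw new IllegalArgumentException("document counts differ");
        }
        for (int docId = 0; docId < fresh.size(); docId++) {
            for (String field : fresh.get(docId).keySet()) {
                if (primary.get(docId).containsKey(field)) {
                    throw new IllegalArgumentException("field clash: " + field);
                }
            }
        }
        secondary = fresh;
    }

    // Combined view of one document: primary fields plus derived fields.
    Map<String, String> document(int docId) {
        Map<String, String> merged = new TreeMap<>(primary.get(docId));
        merged.putAll(secondary.get(docId));
        return merged;
    }

    public static void main(String[] args) {
        AlignedIndexSketch index = new AlignedIndexSketch(
            List.of(Map.of("body", "first doc"), Map.of("body", "second doc")),
            List.of(Map.of("feature", "v1-a"), Map.of("feature", "v1-b")));
        System.out.println(index.document(0));
        // New feature data arrived: drop the old secondary index, keep the primary.
        index.replaceSecondary(
            List.of(Map.of("feature", "v2-a"), Map.of("feature", "v2-b")));
        System.out.println(index.document(0));
    }
}
```

Rejecting clashing field names and mismatched document counts is a choice
made for this sketch; it mirrors the two properties mentioned above (fields
never clash, documents are always aligned).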

Your 'demo' class is a great example of how this can be done. The
class is surely advanced. Read: it crams way too many aspects into one
class :) Each of these could be a separate demo:

- splitting indexes into parallel ones (primary/secondary), with
automatic secondary index creation on merges and startup.
- folding back secondary index data into the primary index on merges
(we don't need it, but I imagine there exists a scenario for this),
- keeping multiple versions of the secondary index (those "generations").

And probably lots more. It's a very interesting advanced use case.

> And how did you find this test :)

I've been looking at ParallelCompositeReader for some time; as I was
scanning it internally for its use cases within the code I somehow
came across that "demo" class which leveraged its lower-level
internals. It did take me some time to go through the class's internal
workings because of confusingly named variables (I ended up renaming
them to 'primary' and 'secondary' index instead of the original
'parallel'). But hey, I don't complain -- it's still an awesome piece
of code!

Dawid

Re: Question concerning refs on TestDemoParallelLeafReader

Posted by David Smiley <da...@gmail.com>.

On Mon, Oct 2, 2017 at 9:34 AM Michael McCandless <lu...@mikemccandless.com>
wrote:

> I am glad you're finding a use for this crazy class!  I think it is a
> powerful way for Lucene to efficiently add "derived fields" at search time.
>

+1 agreed!   Could be used for NRT updates as well.  But very expert; it'd
be nice if it were easier to use to achieve higher-level goals.
-- 
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com

Re: Question concerning refs on TestDemoParallelLeafReader

Posted by Michael McCandless <lu...@mikemccandless.com>.

I think the delayed deletes might have to do w/ segment warming?

I.e., after a merge finishes, but before IW exposes that segment in the
current SIS, it's warmed, at which point (via the merged segment warmer the
test installs) we build its parallel index, but then I think (maybe!) its
parallel reader is closed?  But we don't want to rm its index directory,
because on the next NRT refresh the merged segment becomes live and we will
open that parallel index.  This ensures that it's the BG merge thread that
pays the cost to build the parallel index, not the NRT reopen thread,
keeping NRT reopen latency low (ish).
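
A minimal plain-Java sketch of that theory, assuming it is right. Nothing
here is Lucene code or the test's actual code: the class
DeferredDeleteSketch, its methods and the segment name "_5" are made up for
illustration, and each parallel index directory is modelled as a name in a
set.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model, not Lucene code: each segment has at most one parallel
// index directory, represented here by its segment name.
class DeferredDeleteSketch {
    private final boolean deleteOnClose;                  // true = eager policy
    private final Set<String> onDisk = new HashSet<>();   // built parallel indexes
    private final Set<String> closed = new HashSet<>();   // closed, deletion deferred
    final List<String> builders = new ArrayList<>();      // which thread paid to build

    DeferredDeleteSketch(boolean deleteOnClose) {
        this.deleteOnClose = deleteOnClose;
    }

    // Opens the parallel reader for a segment, building its index if missing.
    void open(String segment, String thread) {
        if (onDisk.add(segment)) {
            builders.add(thread);
        }
        closed.remove(segment);
    }

    // Closes the parallel reader; only the eager policy removes the directory now.
    void close(String segment) {
        if (deleteOnClose) {
            onDisk.remove(segment);
        } else {
            closed.add(segment);
        }
    }

    // Deferred policy: delete closed directories whose segment is not live.
    void prune(Set<String> liveSegments) {
        closed.removeIf(segment -> {
            if (liveSegments.contains(segment)) {
                return false;
            }
            onDisk.remove(segment);
            return true;
        });
    }

    boolean exists(String segment) {
        return onDisk.contains(segment);
    }

    // Warm on the merge thread, then open again on the NRT refresh thread.
    static List<String> run(boolean deleteOnClose) {
        DeferredDeleteSketch dirs = new DeferredDeleteSketch(deleteOnClose);
        dirs.open("_5", "merge");    // merged segment warmer builds the index
        dirs.close("_5");            // warmer is done with its reader
        dirs.open("_5", "refresh");  // merged segment becomes live
        return dirs.builders;
    }

    public static void main(String[] args) {
        System.out.println("eager:    " + run(true));
        System.out.println("deferred: " + run(false));
    }
}
```

With eager deletion the refresh thread has to rebuild the parallel index
(builders are "merge" and then "refresh"); with deferred deletion only the
merge thread builds it, which is the latency argument above.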

I am glad you're finding a use for this crazy class!  I think it is a
powerful way for Lucene to efficiently add "derived fields" at search
time.  Can you share any details on how you are using it?
And how did you find this test :)

Mike McCandless

http://blog.mikemccandless.com

On Sun, Oct 1, 2017 at 7:01 AM, Dawid Weiss <da...@gmail.com> wrote:

> > I'll have to think about the first 2 questions still, but MDW stands for
> > MockDirectoryWrapper!
>
> Ah, sure thing. For what it's worth, I locally removed this delayed
> 'delete' list and removed the leaf folder immediately -- the tests
> passed without any problems on my Windows machine. Could be I didn't
> hit the corner case, so I'm interested in any follow-up you might
> have, Mike.
>
> Dawid