You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucy.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2013/10/10 06:06:49 UTC

[lucy-dev] Safety of SortFieldWriter changes

On Sat, Sep 14, 2013 at 12:29 PM,  <nw...@apache.org> wrote:
> Rework previous change to SortFieldWriter
>
> I'm still not sure whether this is safe.

> Project: http://git-wip-us.apache.org/repos/asf/lucy/repo
> Commit: http://git-wip-us.apache.org/repos/asf/lucy/commit/3c9c7da1
> Tree: http://git-wip-us.apache.org/repos/asf/lucy/tree/3c9c7da1
> Diff: http://git-wip-us.apache.org/repos/asf/lucy/diff/3c9c7da1

Heh.  I'll finally dive in and give it a thorough review tomorrow.

Marvin Humphrey

Re: [lucy-dev] Safety of SortFieldWriter changes

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Thu, Oct 10, 2013 at 12:57 PM, Nick Wellnhofer <we...@aevum.de> wrote:
> On Oct 10, 2013, at 06:06 , Marvin Humphrey <ma...@rectangular.com> wrote:

> Now that I understand the code in SortFieldWriter, I think this is the proper fix:
>
> https://git-wip-us.apache.org/repos/asf?p=lucy.git;a=commitdiff;h=80b15686ae1cde8238fa707a809ece5ee8e7b30c;hp=fde94c411c7c73ad35171bb19295f781ed48e0dd

+1

As you have grasped, the value popped off the top of a child SortFieldWriter
may get freed on the next call to Fetch(), because Fetch() may invoke Refill()
which frees all the keys from the ZombieKeyedHash at once.  Your solution of
storing the value in the child SortFieldWriter's ivars rather than a local
variable, replacing it with a non-Zombie copy when necessary, is a solid fix.

All recent changes to SortFieldWriter.c look good.

> Some other things I noticed when looking at SortFieldWriter:
>
> 1. There are a couple of opportunities for optimizations in
> SortFieldWriter_Refill. Currently, the elements are sorted twice which is
> unnecessary. Sorting can be further improved by using a counting sort with
> linear runtime at the expense of additional memory.
>
> 2. The logic that checks that we stay below the memory threshold in
> SortFieldWriter_Refill looks wrong to me. In effect, the limit is ignored in
> almost all cases. This could be a major problem for huge indexes with
> sortable fields.
>
> 3. I don’t see much benefit in using a hash to detect duplicate keys. I
> think the idea is to save memory. But a single HashEntry already takes 24
> bytes on x64, so there would have to be quite a lot of duplicates for a net
> gain. Also, it would be nice to get rid of ZombieKeyedHash.
>
> 4. Adding lots of new documents in a single commit (exceeding the memory
> threshold) could be optimized with a custom temp file format.

I see no problem with any of these.

With regards to deduping keys using a hash, what I would say is that
having many duplicates in a field is common ('city', 'genre', 'domain', etc),
so I think that optimizing for that case is reasonable -- but not a slam dunk
of course, since fields with high cardinality are also common (UID, title,
etc).  I would be happy if we were to trade away some of SortFieldWriter's
optimization for some simplification.  The serious bug in the memory
thresholding you describe is the natural consequence of excessive complexity.

SortFieldWriter is one of those projects which took a long time but which was
not perfected.  Your fresh perspective on how its implementation might change
is much appreciated.

Marvin Humphrey

Re: [lucy-dev] Safety of SortFieldWriter changes

Posted by Nick Wellnhofer <we...@aevum.de>.
On Oct 10, 2013, at 06:06 , Marvin Humphrey <ma...@rectangular.com> wrote:

> On Sat, Sep 14, 2013 at 12:29 PM,  <nw...@apache.org> wrote:
>> Rework previous change to SortFieldWriter
>> 
>> I'm still not sure whether this is safe.
> 
>> Project: http://git-wip-us.apache.org/repos/asf/lucy/repo
>> Commit: http://git-wip-us.apache.org/repos/asf/lucy/commit/3c9c7da1
>> Tree: http://git-wip-us.apache.org/repos/asf/lucy/tree/3c9c7da1
>> Diff: http://git-wip-us.apache.org/repos/asf/lucy/diff/3c9c7da1
> 
> Heh.  I'll finally dive in and give it a thorough review tomorrow.

Now that I understand the code in SortFieldWriter, I think this is the proper fix:

https://git-wip-us.apache.org/repos/asf?p=lucy.git;a=commitdiff;h=80b15686ae1cde8238fa707a809ece5ee8e7b30c;hp=fde94c411c7c73ad35171bb19295f781ed48e0dd

Some other things I noticed when looking at SortFieldWriter:

1. There are a couple of opportunities for optimizations in SortFieldWriter_Refill. Currently, the elements are sorted twice which is unnecessary. Sorting can be further improved by using a counting sort with linear runtime at the expense of additional memory.

2. The logic that checks that we stay below the memory threshold in SortFieldWriter_Refill looks wrong to me. In effect, the limit is ignored in almost all cases. This could be a major problem for huge indexes with sortable fields.

3. I don’t see much benefit in using a hash to detect duplicate keys. I think the idea is to save memory. But a single HashEntry already takes 24 bytes on x64, so there would have to be quite a lot of duplicates for a net gain. Also, it would be nice to get rid of ZombieKeyedHash.

4. Adding lots of new documents in a single commit (exceeding the memory threshold) could be optimized with a custom temp file format.

Nick