Posted to dev@lucy.apache.org by Nathan Kurz <na...@verse.com> on 2009/03/13 20:48:11 UTC

Re: real time updates

[moved from private to lucy-dev in case others are interested]

On Thu, Mar 12, 2009 at 5:13 PM, Marvin Humphrey <ma...@rectangular.com> wrote:
> On Fri, Mar 06, 2009 at 10:17:55AM -0700, Nathan Kurz wrote:
>> There was an article recently that might be relevant to your desire
>> for real time updates of KinoSearch databases:
>> http://news.ycombinator.com/item?id=497039
>
> It looks like that system can handle a much greater change rate than KS.  KS
> has a slow best-case write speed, but I'm not worried about that.  The problem
> the Lucene folks and I are trying to address under the topic heading of
> "real-time indexing" is *worst-case* write speed: most of the time you're
> fine, but every once in a while you trigger a large merge and you wait a
> loooooong time.  That problem has a different solution.

What is the actual case that would trigger such a problem?  My
instinct is that while there is no way to avoid the long merge,
there are schemes where only the update is slow and the readers can
continue at more or less full speed.

> Except on NFS, KS doesn't have much in the way of lock contention issues
> because index files are never modified.  But regardless of its applicability
> to KS, this "sneaky lock" trick is pretty nice:
> ...
> We couldn't do that because we can't reach out across the system to active
> IndexReaders in other processes, but still: nice.

I realize this, but I'm wondering if a locking approach might be
preferable.  Would the equivalent of row-level locking allow you to
modify the index files in real time, instead of adding addenda and
tombstones?  I'm not necessarily suggesting that the default index
format should do this, but that it might be worth considering whether
a proposed API would support such a real-time format.


> I'm reminded of this presentation on a lock-free hash table:
>
>    http://video.google.com/videoplay?docid=2139967204534450862
>    http://www.azulsystems.com/events/javaone_2007/2007_LockFreeHash.pdf

Thanks, I took a glance, and it does seem interesting.  In addition to
the shared-memory relevance, I've been looking at CUDA programming on
GPUs, and am interested in lock-free data structures such as this.

> This was well put:
>
>    RAM can be viewed as a 16 Gig L4 cache, and disk as a multi-Terabyte L5.
>    Just as one currently writes code that doesn't distinguish between a fetch
>    from L1 and a fetch from main memory, mmap() allows extending this syntax
>    all the way to a fetch from disk.

Thanks.  While I certainly think that mmap() can have great
performance advantages, it's the simplification it provides that
really appeals to me.  Instead of fighting with the OS, use it!
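
(To make that simplification concrete, here's a minimal C sketch --
assuming a POSIX system, with error handling omitted -- where a byte on
disk is read through an ordinary pointer once the file is mapped:)

    #include <stddef.h>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>

    /* Read byte N of a file as if the file were an array in RAM. */
    char
    byte_at(const char *path, size_t n)
    {
        int fd = open(path, O_RDONLY);
        struct stat info;
        fstat(fd, &info);
        char *base = (char*)mmap(NULL, info.st_size, PROT_READ,
                                 MAP_SHARED, fd, 0);
        return base[n];  /* the OS pages this in from disk as needed */
    }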

Nathan Kurz
nate@verse.com

Re: Reference counting inside a GC host (was "real time updates")

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Sun, Mar 29, 2009 at 9:11 AM, Marvin Humphrey <ma...@rectangular.com> wrote:
> On Sun, Mar 29, 2009 at 06:01:16AM -0400, Michael McCandless wrote:
>
>> You might consider adding the hooks for a tracing collector, but not
>> creating such a collector now.  IE have each Lucy obj expose a method
>> to visit the other objs it refers to.
>
> We can probably have Boilerplater autogenerate that code.  That way we only
> have to write one compiler module instead of manually maintaining methods
> for X hundred classes.

Excellent!  I guess Boilerplater knows enough about the members of an
object to determine which are objects vs. which are atomic (and don't
need visiting by the tracer).
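
(For concreteness, a hypothetical autogenerated visitor might look like
the sketch below; visit_t and VA_visit are illustrative names, not
actual Boilerplater output.)

    typedef void (*visit_t)(Obj *obj, void *context);

    void
    VA_visit(VArray *self, visit_t callback, void *context)
    {
        u32_t i;
        for (i = 0; i < self->size; i++) {
            /* Visit object members only; atomic members like size and
             * capacity need no visiting. */
            if (self->elems[i]) { callback(self->elems[i], context); }
        }
    }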

>> I think you'd be surprised at how easily cycles are accidentally created.
>>
>> It's sort of like saying programmers are accustomed to writing high
>> performance code, but then an O(N^2) slips in somewhere and goes
>> undetected until N happens to get large deep in production.
>
> Heh.  Or a reference cycle leak slips in that happens to get large and causes
> a performance hit in production.  Been there done that.

Tell me about it.  We all have our battle scars ;)  I've had great
"fun" writing tools in Python to trace the sneaky cycles... once an
app gets big enough it's practically impossible to prevent accidental
cycles.

Mike

Re: Reference counting inside a GC host (was "real time updates")

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Tue, Mar 31, 2009 at 12:06 AM, Marvin Humphrey
<ma...@rectangular.com> wrote:
> On Mon, Mar 30, 2009 at 06:19:09PM -0400, Michael McCandless wrote:
>
>> > Analyzer: maybe not so much because the analysis method would be non-trivial.
>> > (Assuming that we settle on something like the KS "Inversion"-style Analyzers
>> > rather than the method-call-orgy of Lucene TokenStream.)
>>
>> What's an Inversion-style analyzer?
>
> Sorry, that wasn't a good way of putting it.  KS passes around Tokens as
> arrays rather than as iterators; Inversion is the class that holds Tokens, and
> it actually descends from VArray.
>
> It's hard to write an Analyzer in pure Perl that operates on multiple tokens
> and isn't a sloth.  Because of method call overhead, having to call next()
> for every token makes that even harder.

Ahh, OK.

>> OK, I'm convinced... it seems like we should stick with your approach
>> (make & cache host wrapper when requested or RC becomes 2).
>
> Thanks to you and Peter for driving the discussion towards an improved version
> of that model.

It's been great fun :)  Lucy looks to have a great infrastructure design.

Mike

Re: Reference counting inside a GC host (was "real time updates")

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Mon, Mar 30, 2009 at 06:19:09PM -0400, Michael McCandless wrote:

> > Analyzer: maybe not so much because the analysis method would be non-trivial.
> > (Assuming that we settle on something like the KS "Inversion"-style Analyzers
> > rather than the method-call-orgy of Lucene TokenStream.)
> 
> What's an Inversion-style analyzer?

Sorry, that wasn't a good way of putting it.  KS passes around Tokens as
arrays rather than as iterators; Inversion is the class that holds Tokens, and
it actually descends from VArray.

It's hard to write an Analyzer in pure Perl that operates on multiple tokens
and isn't a sloth.  Because of method call overhead, having to call next()
for every token makes that even harder.
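
(As a sketch of the array-based style -- with Inversion_Get_Size() and
Inversion_Fetch() as assumed accessor names -- an Analyzer transforms a
whole batch of Tokens in one C-level loop instead of paying a method
call per token:)

    Inversion*
    Analyzer_transform(Analyzer *self, Inversion *inversion)
    {
        u32_t i;
        u32_t size = Inversion_Get_Size(inversion);
        for (i = 0; i < size; i++) {
            Token *token = Inversion_Fetch(inversion, i);
            /* Modify the token in place, e.g. downcase its text. */
        }
        return inversion;
    }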

> OK, I'm convinced... it seems like we should stick with your approach
> (make & cache host wrapper when requested or RC becomes 2).

Thanks to you and Peter for driving the discussion towards an improved version
of that model.

Marvin Humphrey


Re: Reference counting inside a GC host (was "real time updates")

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Sun, Mar 29, 2009 at 9:20 PM, Marvin Humphrey <ma...@rectangular.com> wrote:
> On Sun, Mar 29, 2009 at 05:55:11AM -0400, Michael McCandless wrote:
>
>> > FWIW, the cached-host-object algo wins the tight loop performance contest.
>>
>> Which makes sense, but how often are such loops [expected to be] used
>> in practice?  EG we already thought Collector inside host would be
>> doable but expected poor performance.
>
> The performance hit will matter whenever someone overrides a tight-loop
> method.  How much it matters will vary from host to host.
>
> Probably any Scorer or Collector would feel it.  DataReader: it depends on
> whether there are Scorer implementations which call on it in tight loops.
>
> InStream and OutStream: definitely would have, except that they're final
> classes so you can't override any methods.

Yeah OK.

> Analyzer: maybe not so much because the analysis method would be non-trivial.
> (Assuming that we settle on something like the KS "Inversion"-style Analyzers
> rather than the method-call-orgy of Lucene TokenStream.)

What's an Inversion-style analyzer?

> Searcher, Indexer, Query, QueryParser, DataWriter, IndexManager, Lock: no,
> because they don't run lightweight methods within tight loops.  FieldSpec,
> Schema: I don't think so because so far it's been possible to cache the return
> values of certain methods which would otherwise be called often.
> (KinoSearch::Index::Inverter performs such caching.)
>
>> How much work is being done when you create a new host-wrapper object?
>
> That will vary from Host to Host.  In Perl, we're using blessed scalars, which
> are about as cheap as Perl objects get.
>
> However, Perl method dispatch is comparatively slow; if method calls were
> faster, the effect would loom larger.  In Java, for instance, method calls are
> very fast, and creating a host wrapper object on each loop will probably cause
> a significant degradation much sooner.
>
> Put another way: caching a host wrapper object probably matters a lot more
> when method dispatch is fast.

OK.

>> Could you somehow re-use such objects?  (Sort of like how when you go
>> to a children's museum, to the water section, they provide a bunch of
>> yellow frocks that the kids can wear when they play with the water,
>> and return when they are done).
>
> Evocative metaphor. :)  So, who manages the pool?

It's a massively concurrent free for all with no locking :)  And
sometimes frocks get lost for a little while and a background sweeping
thread eventually reclaims them.

> Ideally, when we know we're going to be calling back to the host in a tight
> loop, we'd want the caller to cache the host object.  But that's not easy.
>
> Here's the basic implementation of Matcher_Collect():
>
>  void
>  Matcher_collect(Matcher *self, Collector *collector)
>  {
>    i32_t doc_num;
>    Collector_Set_Matcher(collector, self);
>    while (0 != (doc_num = Matcher_Next(self))) {
>      Coll_Collect(collector, doc_num);
>    }
>  }
>
> Say that the user has overridden Collector's Collect() method in the host, so
> it's slow.  To get best performance, we need to change how *Matcher* works,
> even though it's *Collector* that's the source of the slowdown.
>
>  void
>  Matcher_collect(Matcher *self, Collector *collector)
>  {
>    i32_t  doc_num;
>    void  *host_obj = Coll_To_Host(collector); /* <--- Cache wrapper here */
>
>    Collector_Set_Matcher(collector, self);
>    while (0 != (doc_num = Matcher_Next(self))) {
>      Host_callback(host_obj, "collect", 1, ARG_I32("doc_num", doc_num));
>    }
>
>    Coll_Destroy_Host(collector, host_obj); /* <--- Destroy wrapper here */
>  }
>
> (Host_callback() ordinarily takes a Lucy object as its first argument, but
> pretend that it takes a host wrapper object for now.)
>
> Coll_Collect doesn't know it's being called in a loop.  Even if it did, it
> could only cache host object wrappers using static or global vars. (Bye-bye
> thread safety, and what on earth do we do about subclasses?)

OK.

>  static void *Coll_collect_HOST_OBJ = NULL;
>  void
>  Coll_collect_OVERRIDE(Collector* self, i32_t doc_num)
>  {
>    /* Bleah. */
>    Coll_collect_HOST_OBJ = Coll_Validate_Or_Make_Wrapper(self,
>        Coll_collect_HOST_OBJ);
>    Host_callback(Coll_collect_HOST_OBJ, "collect", 1,
>        ARG_I32("doc_num", doc_num));
>  }
>
>> > We lose convenient use of the "inside-out-object" pattern. :(
>>
>> Neat.... I've seen a similar approach (though likely someone has given
>> / would give it a different named pattern)
>
> The "FlyWeight" design pattern is very similar.
>
>> In your example, shouldn't/couldn't the hash key somehow be the id of
>> the *Lucy* object, not of its fleeting host wrapper obj?
>
> Actually, that's what $$self is.
>
>   $self          <---- address of Perl object[1]
>   $$self         <---- address of Lucy object
>
> However, invoking $self->DESTROY on multiple wrappers around the same Lucy
> object is still a problem.

Hmmmm true.

>  sub DESTROY {
>      my $self = shift;
>      delete $foo{$$self};
>      $self->SUPER::DESTROY;  # Calls Dec_RefCount(), not Destroy() as it
>                              # would under the cached-object model.
>  }
>
> As soon as the wrapper has its refcount fall to 0, Perl invokes DESTROY, and
> despite the fact that the Lucy object doesn't get zapped yet, $foo{$$self}
> gets deleted.

OK, I'm convinced... it seems like we should stick with your approach
(make & cache host wrapper when requested or RC becomes 2).

Mike

Re: Reference counting inside a GC host (was "real time updates")

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Tue, Mar 31, 2009 at 12:52 AM, Marvin Humphrey
<ma...@rectangular.com> wrote:
> On Mon, Mar 30, 2009 at 06:34:18PM -0400, Michael McCandless wrote:
>
>> What is the vtable for the VTable vtable?  Itself?
>
> This is going to get a bit "meta" -- prepare yourself.  :)
>
> The "vtable" member for VTable objects is &VTABLE.
>
> So, all of the following are true:
>
>  OBJ.vtable    == &VTABLE;
>  VARRAY.vtable == &VTABLE;
>  VTABLE.vtable == &VTABLE;

OK so VTABLE's vtable is simply itself.

> We need VTABLE because VTables are Lucy objects, and not just arrays of
> function pointers.  They have methods, and they also contain a couple very
> important pieces of class data, e.g. obj_alloc_size which is sizeof(Obj) for
> Obj, sizeof(VArray) for VArray, etc.

Got it.

>> >> Do VTables only store methods?  Or can they store fields as well?
>> >
>> > You mean could they store class data?  I suppose they could.  Initializing
>> > might get a little messy, and I'd want to make sure that we locked them down
>> > and made them stateless before threading starts.
>>
>> I actually meant data fields on the object, but class data ("static"
>> in Java) is also good.
>
> Oh, I think I understand now.  Each VTable object does have a few member vars:
> obj_alloc_size, vt_alloc_size, etc.

Actually I meant that e.g. SegMetaData (not sure on the name) would have a
String data member segmentName in it.  Is there an entry in VTable
for members that are data, not methods?

>> Hmm...  Python's cyclic collector won't collect cycles involving
>> classes that have __del__ since it can't guess a safe order to run the
>> __del__ methods.  (I know we're expecting people to just avoid making
>> cycles, but still important to know).
>
> That's a bummer.  But I think we'd need __del__ no matter what, since we need
> to trigger C-level destructors.

Yeah.  The whole "finalizers cannot be run in a deterministic manner
so we simply don't collect classes with finalizers" thing is scary...
all tracing collectors struggle with this.

>> you could probably simply use that host_obj
>> field to hold low value ref counts (which are not valid pointers).
>> Though that's scary-C-hack-territory :)
>
> If we could guarantee that, say, 0x0 (NULL), 0x1, 0x2, and 0x3 could never
> represent a pointer to a real host object for every single system that Lucy
> might be compiled on, we could probably prevent a few unnecessary host object
> creations that way -- more than 0x3 and we're at the point of diminishing
> returns.  And I think we can, because malloc() always returns word-aligned
> pointers, which 0x1, 0x2, and 0x3 would not be for any 32-bit or 64-bit system
> where NULL was 0x0.
>
> So, this is almost certainly safe, but makes the library less maintainable
> because the hack has to be explained.
>
> I think we should do it.
>
> If we're really paranoid, we can check the return value for To_Host() and
> throw an exception if it's less than 4.

Nice!  The scary things you can do in C...

Mike

Re: Reference counting inside a GC host (was "real time updates")

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Mon, Mar 30, 2009 at 06:34:18PM -0400, Michael McCandless wrote:

> What is the vtable for the VTable vtable?  Itself?

This is going to get a bit "meta" -- prepare yourself.  :)

The "vtable" member for VTable objects is &VTABLE.

So, all of the following are true:

  OBJ.vtable    == &VTABLE;
  VARRAY.vtable == &VTABLE;
  VTABLE.vtable == &VTABLE;

We need VTABLE because VTables are Lucy objects, and not just arrays of
function pointers.  They have methods, and they also contain a couple very
important pieces of class data, e.g. obj_alloc_size which is sizeof(Obj) for
Obj, sizeof(VArray) for VArray, etc.
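
(That per-class size is what lets one generic allocator serve every
class.  A sketch only -- assuming the CREATE invocation seen in the
constructors elsewhere in this thread boils down to something like
this:)

    Obj*
    Obj_create(VTable *vtable)
    {
        Obj *self = (Obj*)calloc(1, vtable->obj_alloc_size);
        self->vtable   = (VTable*)VTable_Inc_RefCount(vtable);
        self->host_obj = NULL;
        return self;
    }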

> >> Do VTables only store methods?  Or can they store fields as well?
> >
> > You mean could they store class data?  I suppose they could.  Initializing
> > might get a little messy, and I'd want to make sure that we locked them down
> > and made them stateless before threading starts.
> 
> I actually meant data fields on the object, but class data ("static"
> in Java) is also good.

Oh, I think I understand now.  Each VTable object does have a few member vars:
obj_alloc_size, vt_alloc_size, etc.

> Hmm...  Python's cyclic collector won't collect cycles involving
> classes that have __del__ since it can't guess a safe order to run the
> __del__ methods.  (I know we're expecting people to just avoid making
> cycles, but still important to know).

That's a bummer.  But I think we'd need __del__ no matter what, since we need
to trigger C-level destructors.

> you could probably simply use that host_obj
> field to hold low value ref counts (which are not valid pointers).
> Though that's scary-C-hack-territory :)

If we could guarantee that, say, 0x0 (NULL), 0x1, 0x2, and 0x3 could never
represent a pointer to a real host object for every single system that Lucy
might be compiled on, we could probably prevent a few unnecessary host object
creations that way -- more than 0x3 and we're at the point of diminishing
returns.  And I think we can, because malloc() always returns word-aligned
pointers, which 0x1, 0x2, and 0x3 would not be for any 32-bit or 64-bit system
where NULL was 0x0.

So, this is almost certainly safe, but makes the library less maintainable
because the hack has to be explained.

I think we should do it.  

If we're really paranoid, we can check the return value for To_Host() and
throw an exception if it's less than 4.
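
(A sketch of the hack -- assuming C99's uintptr_t and a Python host;
values 1 through 3 in host_obj stand for bare refcounts, and anything 4
or above is treated as a real, word-aligned pointer:)

    u32_t
    Obj_get_refcount(Obj *self)
    {
        uintptr_t bits = (uintptr_t)self->host_obj;
        if (bits == 0) {
            return 1;            /* NULL still implies a refcount of 1 */
        }
        else if (bits < 4) {
            return (u32_t)bits;  /* small refcount, no host object yet */
        }
        else {
            return ((PyObject*)self->host_obj)->ob_refcnt;
        }
    }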

Marvin Humphrey


Re: Reference counting inside a GC host (was "real time updates")

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Sun, Mar 29, 2009 at 11:48 PM, Marvin Humphrey
<ma...@rectangular.com> wrote:
> On Sun, Mar 29, 2009 at 08:36:57AM -0400, Michael McCandless wrote:
>
>> Are VTables also considered Obj's?
>
> Yes.  They are structs, with the last member being a "flexible array member"
> which can hold any number of "method_t" pointers.  Here's the list of members
> (omitting host_obj and vtable, which are inherited).
>
>    VTable            *parent;
>    CharBuf           *name;         /* class name */
>    u32_t              flags;
>    size_t             obj_alloc_size;
>    size_t             vt_alloc_size;
>    Callback         **callbacks;
>    boil_method_t[1]   methods;      /* flexible array */
>
> I've included the generated code for the VARRAY VTable below my sig.  It's a
> little messy because some mild hacks are needed to satisfy C89 and portability
> constraints.

What is the vtable for the VTable vtable?  Itself?

>> Do VTables only store methods?  Or can they store fields as well?
>
> You mean could they store class data?  I suppose they could.  Initializing
> might get a little messy, and I'd want to make sure that we locked them down
> and made them stateless before threading starts.

I actually meant data fields on the object, but class data ("static"
in Java) is also good.

>> Can an arbitrary Obj at runtime become a VTable for another Obj?
>> (True "prototype" programming language).  Seems like "no", because an
>> arbitrary Obj is not allowed to add new members wrt its parent (only
>> new VTables can do so).
>
> Correct.
>
>> Does VARRAY (VTable for VArray objects) hold a reference to OBJ?
>
> Yes: self->vtable->parent.  The Obj_Is_A() method, similar to Java's
> "instanceof" follows the chain upwards:
>
>    bool_t
>    Obj_is_a(Obj *self, VTable *target_vtable)
>    {
>        VTable *vtable = self ? self->vtable : NULL;
>        while (vtable != NULL) {
>            if (vtable == target_vtable) { return true; }
>            vtable = vtable->parent;
>        }
>        return false;
>    }

OK.

>> How are these trees of VTables init'd?
>
> All core classes have VTables which are global structs initialized at compile
> time.  Boilerplater spits them out into a file called "boil.c".
>
> User subclass VTables are allocated on the fly at runtime.

OK.

>> And Lucy objs are single inheritance.
>
> Correct.
>
>> So a VArray is allowed to have C NULLs in its elems array (vs say Java
>> which always inits the array to hold Java null's).
>
> That's how it's implemented now.  Changing it over to something like Java
> null's might be a good idea.

Saves having to NULL check, though it's wasteful if the code that
created the VArray will immediately then fill it in.

>> Is there an explicit object in Lucy that represents null (java), None
>> (Python), etc.?
>
> Yes: UNDEF, a singleton belonging to the class Undefined.
> However, it's not used very much.

OK.

>> > First, note that the destructor for VArray invokes the destructor of its
>> > parent class, Obj.  This superclass call makes it possible for us to add
>> > members to a base class without having to manually edit all subclasses.
>>
>> Great.
>
> The same is true at construction time; each class implements two functions,
> new() and init(), and subclasses call their parent class's init():
>
>    TermQuery*
>    TermQuery_new(const CharBuf *field, const Obj *term)
>    {
>        TermQuery *self = (TermQuery*)CREATE(NULL, TERMQUERY);
>        return TermQuery_init(self, field, term);
>    }
>
>    TermQuery*
>    TermQuery_init(TermQuery *self, const CharBuf *field, const Obj *term)
>    {
>        Query_init((Query*)self, 1.0f);
>        self->field       = CB_Clone(field);
>        self->term        = Obj_Clone(term);
>        return self;
>    }

OK.

>> I'm a bit confused: what if you have a Lucy obj, that's got a cached
>> host obj, such that the host obj is not referred to anywhere in the
>> host language, but is referred to in Lucy, and Lucy finally decrefs
>> its last reference.
>
> At that point, the Lucy object and the Python object share a single refcount,
> which resides in the Python object's ob_refcnt member.  When you call PyDECREF
> on the Python object and ob_refcnt falls to 0, Python then invokes the
> "__del__" method.  We will define "__del__" so that it invokes Destroy().

AHH, sorry, I see now.

Hmm...  Python's cyclic collector won't collect cycles involving
classes that have __del__ since it can't guess a safe order to run the
__del__ methods.  (I know we're expecting people to just avoid making
cycles, but still important to know).

>> How is the cycle broken in that case?
>
> There's no reference cycle because the Lucy object and the Python object share
> a single unified refcount.  They have C pointers which point at each other, but
> they don't hold refcounts open against each other.
>
>> (Ie, Destroy should be invoked via Lucy).
>
> The call sequence will be:
>
>    Obj_Dec_Refcount(lucy_obj);
>    PyDECREF(lucy_obj->host_obj);
>    python_obj.__del__()
>    Obj_Destroy(lucy_obj);

OK.

>> OK.  That 1 refCount "belonging" to Lucy.  This is essentially an
>> efficient way to represent the common case of "only Lucy has a single
>> reference to this object".
>
> Yes.  In addition, it allows us to initialize structs at compile-time.
>
>   CharBuf snapshot_new = {
>       (VTable*)&CHARBUF, /* vtable */
>       NULL,              /* <------ NULL host_obj */
>       "snapshot.new",    /* ptr */
>       12,                /* size */
>       13                 /* capacity */
>   };
>
> Since Python objects live on the heap, we can't create them at compile-time.
> The NULL conveniently stands in for them.

Great.

>> >  void*
>> >  Obj_to_host(Obj *self)
>> >  {
>> >      if (self->host_obj) {
>> >          /* The Python object is already cached, so incref it and return. */
>> >          PyINCREF((PyObject*)self->host_obj);
>> >          return self->host_obj;
>> >      }
>> >      else {
>> >          /* Lazily create Python object. */
>> >          self->host_obj = PyCObject_FromVoidPtr((void*)self, Obj_Destroy)
>> >      }
>> >  }
>>
>> (missing a return on the else clause, but I get it).
>
> Heh.  Why won't my email client test my code for me? :)

Someday :)

>> So this is a great approach, in that a host obj
>> is not created immediately on creating a Lucy obj.  However, it's
>> still "falsely" created, in order to track refCount > 1 from within
>> Lucy, even when the obj never crosses the bridge.
>
> Yes.  But if we are parsimonious about creating objects in the first place, it
> doesn't matter so much if constant costs per object are large.

Yes.

>> If only host languages let us override what decRef does for a given
>> obj... then we could break the tight cycles ourself and only allocate
>> a host obj when needed.
>
> There's an alternative.  We can waste an extra 4 bytes per object to hold an
> integer refcount, and use that unless a host object has been cached.  The
> instant that the host object has been cached, we set its refcount and use that
> instead.
>
> I'm not sure that's either clearer or faster, but it does mean that we
> wouldn't ever needlessly create host objects.

That's interesting... you could probably simply use that host_obj
field to hold low value ref counts (which are not valid pointers).
Though that's scary-C-hack-territory :)

Mike

Re: Reference counting inside a GC host (was "real time updates")

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Sun, Mar 29, 2009 at 08:36:57AM -0400, Michael McCandless wrote:

> Are VTables also considered Obj's?  

Yes.  They are structs, with the last member being a "flexible array member"
which can hold any number of "method_t" pointers.  Here's the list of members
(omitting host_obj and vtable, which are inherited).

    VTable            *parent;
    CharBuf           *name;         /* class name */
    u32_t              flags;
    size_t             obj_alloc_size;
    size_t             vt_alloc_size;
    Callback         **callbacks;
    boil_method_t[1]   methods;      /* flexible array */

I've included the generated code for the VARRAY VTable below my sig.  It's a
little messy because some mild hacks are needed to satisfy C89 and portability
constraints.

> Do VTables only store methods?  Or can they store fields as well?

You mean could they store class data?  I suppose they could.  Initializing
might get a little messy, and I'd want to make sure that we locked them down
and made them stateless before threading starts.

> Can an arbitrary Obj at runtime become a VTable for another Obj?
> (True "prototype" programming language).  Seems like "no", because an
> arbitrary Obj is not allowed to add new members wrt its parent (only
> new VTables can do so).

Correct.

> Does VARRAY (VTable for VArray objects) hold a reference to OBJ?  

Yes: self->vtable->parent.  The Obj_Is_A() method, similar to Java's
"instanceof" follows the chain upwards:

    bool_t
    Obj_is_a(Obj *self, VTable *target_vtable)
    {
        VTable *vtable = self ? self->vtable : NULL;
        while (vtable != NULL) {
            if (vtable == target_vtable) { return true; }
            vtable = vtable->parent;
        }   
        return false;
    }
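
(A usage sketch; the names are the simplified, unprefixed forms used
throughout this mail:)

    VArray *array = VA_new(10);
    Obj_Is_A((Obj*)array, (VTable*)&VARRAY);  /* true */
    Obj_Is_A((Obj*)array, (VTable*)&OBJ);     /* true, via parent */
    Obj_Is_A((Obj*)array, (VTable*)&VTABLE);  /* false */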

> How are these trees of VTables init'd?

All core classes have VTables which are global structs initialized at compile
time.  Boilerplater spits them out into a file called "boil.c".

User subclass VTables are allocated on the fly at runtime.

> And Lucy objs are single inheritance.

Correct.

> So a VArray is allowed to have C NULLs in its elems array (vs say Java
> which always inits the array to hold Java null's).  

That's how it's implemented now.  Changing it over to something like Java
null's might be a good idea.

> Is there an explicit object in Lucy that represents null (java), None
> (Python), etc.?

Yes: UNDEF, a singleton belonging to the class Undefined.
However, it's not used very much.
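
(If we did switch to Java-style nulls, a fetch might map empty slots to
the singleton -- a sketch only, not current KS behavior:)

    Obj*
    VA_fetch(VArray *self, u32_t index)
    {
        Obj *elem = index < self->size ? self->elems[index] : NULL;
        /* Hand back the UNDEF singleton rather than a raw C NULL. */
        return elem ? elem : (Obj*)UNDEF;
    }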

> > First, note that the destructor for VArray invokes the destructor of its
> > parent class, Obj.  This superclass call makes it possible for us to add
> > members to a base class without having to manually edit all subclasses.
> 
> Great.

The same is true at construction time; each class implements two functions,
new() and init(), and subclasses call their parent class's init():

    TermQuery*
    TermQuery_new(const CharBuf *field, const Obj *term)
    {
        TermQuery *self = (TermQuery*)CREATE(NULL, TERMQUERY);
        return TermQuery_init(self, field, term);
    }

    TermQuery*
    TermQuery_init(TermQuery *self, const CharBuf *field, const Obj *term)
    {
        Query_init((Query*)self, 1.0f);
        self->field       = CB_Clone(field);
        self->term        = Obj_Clone(term);
        return self;
    }

> I'm a bit confused: what if you have a Lucy obj, that's got a cached
> host obj, such that the host obj is not referred to anywhere in the
> host language, but is referred to in Lucy, and Lucy finally decrefs
> its last reference.  

At that point, the Lucy object and the Python object share a single refcount,
which resides in the Python object's ob_refcnt member.  When you call PyDECREF
on the Python object and ob_refcnt falls to 0, Python then invokes the
"__del__" method.  We will define "__del__" so that it invokes Destroy().

> How is the cycle broken in that case?  

There's no reference cycle because the Lucy object and the Python object share
a single unified refcount.  They have C pointers which point at each other, but
they don't hold refcounts open against each other.

> (Ie, Destroy should be invoked via Lucy).

The call sequence will be: 

    Obj_Dec_Refcount(lucy_obj);
    PyDECREF(lucy_obj->host_obj);
    python_obj.__del__()
    Obj_Destroy(lucy_obj);

> OK.  That 1 refCount "belonging" to Lucy.  This is essentially an
> efficient way to represent the common case of "only Lucy has a single
> reference to this object".

Yes.  In addition, it allows us to initialize structs at compile-time.

   CharBuf snapshot_new = {
       (VTable*)&CHARBUF, /* vtable */
       NULL,              /* <------ NULL host_obj */
       "snapshot.new",    /* ptr */
       12,                /* size */
       13                 /* capacity */
   };

Since Python objects live on the heap, we can't create them at compile-time.
The NULL conveniently stands in for them.

> >  void*
> >  Obj_to_host(Obj *self)
> >  {
> >      if (self->host_obj) {
> >          /* The Python object is already cached, so incref it and return. */
> >          PyINCREF((PyObject*)self->host_obj);
> >          return self->host_obj;
> >      }
> >      else {
> >          /* Lazily create Python object. */
> >          self->host_obj = PyCObject_FromVoidPtr((void*)self, Obj_Destroy)
> >      }
> >  }
> 
> (missing a return on the else clause, but I get it).

Heh.  Why won't my email client test my code for me? :)

> So this is a great approach, in that a host obj
> is not created immediately on creating a Lucy obj.  However, it's
> still "falsely" created, in order to track refCount > 1 from within
> Lucy, even when the obj never crosses the bridge.  

Yes.  But if we are parsimonious about creating objects in the first place, it
doesn't matter so much if constant costs per object are large.

> If only host languages let us override what decRef does for a given
> obj... then we could break the tight cycles ourself and only allocate
> a host obj when needed.

There's an alternative.  We can waste an extra 4 bytes per object to hold an
integer refcount, and use that unless a host object has been cached.  The
instant that the host object has been cached, we set its refcount and use that
instead.

I'm not sure that's either clearer or faster, but it does mean that we
wouldn't ever needlessly create host objects.
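
(Sketched out, that alternative would look something like this -- an
illustration, not code from KS:)

    struct Obj {
        VTable   *vtable;
        void     *host_obj;
        u32_t     refcount;   /* authoritative only while host_obj is NULL */
    };

    u32_t
    Obj_get_refcount(Obj *self)
    {
        return self->host_obj
            ? ((PyObject*)self->host_obj)->ob_refcnt
            : self->refcount;
    }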

Marvin Humphrey

/************************** VARRAY VTable definition ************************/

KINO_VARRAY_VT KINO_VARRAY = {
    (kino_VTable*)&KINO_VTABLE, /* vtable vtable */
    NULL, /* host_obj */
    (kino_VTable*)&KINO_OBJ, /* parent */
    (kino_CharBuf*)&KINO_VARRAY_CLASS_NAME, /* class name */
    KINO_VTABLE_F_IMMORTAL, /* flags */
    sizeof(kino_VArray), /* obj_alloc_size */
    offsetof(kino_VTable, methods) 
        + 38 * sizeof(boil_method_t), /* vt_alloc_size */
    (kino_Callback**)&KINO_VARRAY_CALLBACKS,  /* callbacks */
    {
        (boil_method_t)kino_Obj_create,
        (boil_method_t)kino_Obj_foster,
        (boil_method_t)kino_Obj_make,
        (boil_method_t)kino_Obj_get_refcount,
        (boil_method_t)kino_Obj_inc_refcount,
        (boil_method_t)kino_Obj_dec_refcount,
        (boil_method_t)kino_Obj_to_host,
        (boil_method_t)kino_VA_clone,
        (boil_method_t)kino_VA_destroy,
        (boil_method_t)kino_VA_equals,
        (boil_method_t)kino_Obj_compare_to,
        (boil_method_t)kino_Obj_hash_code,
        (boil_method_t)kino_Obj_get_vtable,
        (boil_method_t)kino_Obj_get_class_name,
        (boil_method_t)kino_Obj_is_a,
        (boil_method_t)kino_Obj_to_string,
        (boil_method_t)kino_Obj_to_i64,
        (boil_method_t)kino_Obj_to_f64,
        (boil_method_t)kino_VA_serialize,
        (boil_method_t)kino_VA_deserialize,
        (boil_method_t)kino_VA_dump,
        (boil_method_t)kino_VA_load,
        (boil_method_t)kino_VA_push,
        (boil_method_t)kino_VA_push_varray,
        (boil_method_t)kino_VA_pop,
        (boil_method_t)kino_VA_unshift,
        (boil_method_t)kino_VA_shift,
        (boil_method_t)kino_VA_grow,
        (boil_method_t)kino_VA_fetch,
        (boil_method_t)kino_VA_store,
        (boil_method_t)kino_VA_delete,
        (boil_method_t)kino_VA_splice,
        (boil_method_t)kino_VA_shallow_copy,
        (boil_method_t)kino_VA_sort,
        (boil_method_t)kino_VA_resize,
        (boil_method_t)kino_VA_clear,
        (boil_method_t)kino_VA_get_size,
        (boil_method_t)kino_VA_grep
    }
};



Re: Reference counting inside a GC host (was "real time updates")

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Sat, Mar 28, 2009 at 8:24 PM, Marvin Humphrey <ma...@rectangular.com> wrote:
> On Fri, Mar 27, 2009 at 08:21:54AM -0400, Michael McCandless wrote:
>
>> On Fri, Mar 27, 2009 at 1:08 AM, Marvin Humphrey <ma...@rectangular.com>
>> wrote:
>>
>> > I think I have an approach that's going to allow us to eliminate FastObj:
>> > We lazily create the host object, and treat a NULL host_obj as
>> > semantically equivalent to a refcount of 1.
>
> I'm happy to report that this approach succeeded.  FastObj is now history.  :)

Most excellent!  Though we do need to figure out how tracing GC languages play.

>> Much of this is beyond me, but...
>
> Hopefully we will soon get to the point where that's no longer the case.
>
> Our KS prototype now has only one object model.  I'll describe how it works
> for refcounting hosts like Perl and Python.
>
> ---
>
> Every Obj is a struct with a VTable and a host object as its first two members:
>
>  struct Obj {
>      VTable   *vtable;
>      void     *host_obj;
>  };
>
>  struct VArray {
>      VTable   *vtable;
>      void     *host_obj;
>      Obj     **elems;
>      u32_t     size;
>      u32_t     capacity;
>  };
>
> When any Lucy object is created, self->host_obj is NULL.
>
> Here are some simplified sample constructors for Lucy::Obj and our
> variable-sized array class Lucy::Util::VArray:
>
>  Obj*
>  Obj_new() {
>      Obj *self      = (Obj*)malloc(sizeof(Obj));
>      self->vtable   = (VTable*)VTable_Inc_RefCount(&OBJ);
>      self->host_obj = NULL;
>      return self;
>  }
>
>  VArray*
>  VA_new(u32_t capacity)
>  {
>      VArray *self   = (VArray*)malloc(sizeof(VArray));
>      self->vtable   = (VTable*)VTable_Inc_RefCount(&VARRAY);
>      self->host_obj = NULL;
>      self->elems    = (Obj**)calloc(capacity, sizeof(Obj*));
>      self->size     = 0;
>      self->capacity = capacity;
>      return self;
>  }
>
> Note that the VTable for the Obj class is OBJ, and the VTable for VArray is
> VARRAY.  The same pattern holds true for other classes: TermScorer's VTable is
> TERMSCORER, etc.

OK.

Are VTables also considered Obj's?  Ie, you are Inc/DecRef'ing them --
will a VTable be destroyed when it DecRefs to 0?

Do VTables only store methods?  Or can they store fields as well?

Can an arbitrary Obj at runtime become a VTable for another Obj?
(True "prototype" programming language).  Seems like "no", because an
arbitrary Obj is not allowed to add new members wrt its parent (only
new VTables can do so).

Does VARRAY (VTable for VArray objects) hold a reference to OBJ?  How
are these trees of VTables init'd?

And Lucy objs are single inheritance.

(NOTE: I'm just "probing" with these questions... I'm certainly not
implying Lucy should have / needs all of these capabilities).

> Here are corresponding destructors for Obj and VArray:
>
>  void
>  Obj_destroy(Obj *self)
>  {
>      VTable_Dec_RefCount(self->vtable);
>      free(self);
>  }
>
>  void
>  VA_destroy(VArray *self)
>  {
>      u32_t i;
>      for (i = 0; i < self->size; i++) {
>          if (self->elems[i]) {
>              Obj_Dec_RefCount(self->elems[i]);
>          }
>      }
>      free(self->elems);
>      Obj_destroy((Obj*)self); /* super */
>  }

OK, it's making sense.

So a VArray is allowed to have C NULLs in its elems array (vs say Java
which always inits the array to hold Java null's).  Is there an
explicit object in Lucy that represents null (java), None (Python),
etc.?

> Two items of note about the destructors:
>
> First, note that the destructor for VArray invokes the destructor of its
> parent class, Obj.  This superclass call makes it possible for us to add
> members to a base class without having to manually edit all subclasses.

Great.

> Second, there is no mention whatsoever of self->host_obj in the destructor.
> That's because there are only two paths into the destructor, and both of them
> avoid the need for Lucy core code to worry about the cached host object.
>
>  1) The cached host object was never created so it doesn't need to be
>     cleaned up.
>  2) Destroy() is being invoked from host-space via e.g. Python's "__del__"
>     method, and after it returns the host will clean up the host object
>     itself.

I'm a bit confused: what if you have a Lucy obj, that's got a cached
host obj, such that the host obj is not referred to anywhere in the
host language, but is referred to in Lucy, and Lucy finally decrefs
its last reference.  How is the cycle broken in that case?  (Ie,
Destroy should be invoked via Lucy).

> Obj declares four methods which each host must implement:
>
>   Get_RefCount
>   Inc_RefCount
>   Dec_RefCount
>   To_Host
>
> Mike, since you're familiar with Python, I'll have a go at implementing those
> methods for the Python bindings.
>
> First, the accessor for the object's refcount, which is shared by the Lucy
> object and the Python object.  If self->host_obj is NULL, then the refcount is
> 1.  Otherwise, we delegate responsibility for tracking the refcount to the
> Python object cached in self->host_obj.
>
>  u32_t
>  Obj_get_refcount(Obj *self)
>  {
>      if (self->host_obj == NULL) {
>          return 1;  /* NULL host_obj implies a refcount of 1. */
>      }
>      else {
>          PyObject *py_object = (PyObject*)self->host_obj;
>          return py_object->ob_refcnt;
>      }
>  }
>
> Next, the method which increments the refcount.  Calling this method even once
> guarantees that a Python object will be created, since the first time it is
> called, the refcount will progress from 1 to 2, and we need a place to put
> that number.
>
> This means that there are two ways to indicate a refcount of 1.  Either we
> have a newly created Lucy object with a NULL self->host_obj which *implies* a
> refcount of 1, or we have a cached host object which had a refcount of 2 or
> more at some point, but which has fallen back down to an *explicit* refcount
> of 1.

OK.  That 1 refCount "belonging" to Lucy.  This is essentially an
efficient way to represent the common case of "only Lucy has a single
reference to this object".

>  Obj*
>  Obj_inc_refcount(Obj *self)
>  {
>      if (self->host_obj == NULL) {
>        self->host_obj = Obj_To_Host(self);
>      }
>      PyINCREF((PyObject*)self->host_obj);
>      return self;
>  }
>
> Once the host object is cached, it never goes away -- it's there for the life
> of the Lucy object.
>
> Next, the method to decrement the refcount.  Note that we only call Destroy()
> directly if self->host_obj is NULL.  If we've created a Python object, then we
> count on it to invoke the __del__ method when its refcount falls to 0; we will
> have defined __del__ to invoke Destroy().

I'm still confused because it seems like there should be times when
Lucy needs to destroy a Lucy obj that had at some point crossed the
bridge, yet the host retained no reference.  Ie the destruction may
need to initiate from the Lucy side of the bridge.

>  u32_t
>  Obj_dec_refcount(Obj *self)
>  {
>      if (self->host_obj == NULL) {
>          /* NULL host object implies a refcount of 1.  That's dropping to 0
>           * as a result of this call, so it's time to invoke Destroy(). */
>          Obj_Destroy(self);
>      }
>      else {
>          /* If the PyObject's ob_refcnt falls to 0, then the destructor will
>           * be invoked from Python-space via the "__del__" method */
>          PyDECREF((PyObject*)self->host_obj);
>      }
>  }
>
> The last method we need to define is To_Host(), which, in the parlance of the
> Python C API docs, will return a "new reference".
>
> (I'm not sure that this implementation is correct, but it should convey
> the gist.)
>
>  void*
>  Obj_to_host(Obj *self)
>  {
>      if (self->host_obj) {
>          /* The Python object is already cached, so incref it and return. */
>          PyINCREF((PyObject*)self->host_obj);
>          return self->host_obj;
>      }
>      else {
>          /* Lazily create Python object. */
>          self->host_obj = PyCObject_FromVoidPtr((void*)self, Obj_Destroy)
>      }
>  }

(missing a return on the else clause, but I get it).

>> won't there be multiple references in C to a given Lucy object, each of
>> which would need to incRef the RC?
>
> Yes.  As soon as the refcount has to be increased above 1, we lazily create a
> Host object to hold the refcount.
>
> ---
>
> Leaving aside the question of tracing GC hosts for now... does the
> cached-host-object delegated refcounting model seem sufficiently clear to you
> for use within Lucy Python bindings?
>
> The rest of the Lucy library doesn't need to know about the host object
> caching -- it just uses the opaque refcounting API, which looks like plain
> old integer refcounting from the outside.

Yes, makes sense (except for the "Lucy destroys object that has cached
host object" case).  So this is a great approach, in that a host obj
is not created immediately on creating a Lucy obj.  However, it's
still "falsely" created, in order to track refCount > 1 from within
Lucy, even when the obj never crosses the bridge.  I wonder in
practice how often that actually happens -- my guess is the
RefCount==1 case is the "long tail" in a typical snapshot of a future
running Lucy app.

If only host languages let us override what decRef does for a given
obj... then we could break the tight cycles ourself and only allocate
a host obj when needed.

Mike

Re: Reference counting inside a GC host (was "real time updates")

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Fri, Mar 27, 2009 at 08:21:54AM -0400, Michael McCandless wrote:

> On Fri, Mar 27, 2009 at 1:08 AM, Marvin Humphrey <ma...@rectangular.com>
> wrote:
> 
> > I think I have an approach that's going to allow us to eliminate FastObj:
> > We lazily create the host object, and treat a NULL host_obj as
> > semantically equivalent to a refcount of 1.

I'm happy to report that this approach succeeded.  FastObj is now history.  :)

> Much of this is beyond me, but... 

Hopefully we will soon get to the point where that's no longer the case.  

Our KS prototype now has only one object model.  I'll describe how it works
for refcounting hosts like Perl and Python.

---

Every Obj is a struct with a VTable and a host object as its first two members:

  struct Obj {
      VTable   *vtable;
      void     *host_obj;
  };

  struct VArray {
      VTable   *vtable;
      void     *host_obj;
      Obj     **elems;
      u32_t     size;
      u32_t     capacity;
  };

When any Lucy object is created, self->host_obj is NULL.  

Here are some simplified sample constructors for Lucy::Obj and our
variable-sized array class Lucy::Util::VArray:

  Obj*
  Obj_new() {
      Obj *self      = (Obj*)malloc(sizeof(Obj));
      self->vtable   = (VTable*)VTable_Inc_RefCount(&OBJ);
      self->host_obj = NULL;
      return self;
  }

  VArray*
  VA_new(u32_t capacity) 
  {
      VArray *self   = (VArray*)malloc(sizeof(VArray));
      self->vtable   = (VTable*)VTable_Inc_RefCount(&VARRAY);
      self->host_obj = NULL;
      self->elems    = (Obj**)calloc(capacity, sizeof(Obj*));
      self->size     = 0;
      self->capacity = capacity;
      return self;
  }

Note that the VTable for the Obj class is OBJ, and the VTable for VArray is
VARRAY.  The same pattern holds true for other classes: TermScorer's VTable is
TERMSCORER, etc.

Here are corresponding destructors for Obj and VArray:  

  void
  Obj_destroy(Obj *self)
  {
      VTable_Dec_RefCount(self->vtable);
      free(self);
  }

  void
  VA_destroy(VArray *self)
  {
      u32_t i;
      for (i = 0; i < self->size; i++) {
          if (self->elems[i]) {
              Obj_Dec_RefCount(self->elems[i]);
          }
      }
      free(self->elems);
      Obj_destroy((Obj*)self); /* super */
  }

Two items of note about the destructors:

First, note that the destructor for VArray invokes the destructor of its
parent class, Obj.  This superclass call makes it possible for us to add
members to a base class without having to manually edit all subclasses.

Second, there is no mention whatsoever of self->host_obj in the destructor.
That's because there are only two paths into the destructor, and both of them
avoid the need for Lucy core code to worry about the cached host object.
    
  1) The cached host object was never created so it doesn't need to be 
     cleaned up.
  2) Destroy() is being invoked from host-space via e.g. Python's "__del__"
     method, and after it returns the host will clean up the host object
     itself.

Obj declares four methods which each host must implement: 

   Get_RefCount
   Inc_RefCount
   Dec_RefCount
   To_Host

Mike, since you're familiar with Python, I'll have a go at implementing those
methods for the Python bindings.

First, the accessor for the object's refcount, which is shared by the Lucy
object and the Python object.  If self->host_obj is NULL, then the refcount is
1.  Otherwise, we delegate responsibility for tracking the refcount to the
Python object cached in self->host_obj.

  u32_t
  Obj_get_refcount(Obj *self) 
  {
      if (self->host_obj == NULL) {
          return 1;  /* NULL host_obj implies a refcount of 1. */
      }
      else {
          PyObject *py_object = (PyObject*)self->host_obj;
          return py_object->ob_refcnt;
      }
  }

Next, the method which increments the refcount.  Calling this method even once
guarantees that a Python object will be created, since the first time it is
called, the refcount will progress from 1 to 2, and we need a place to put
that number.

This means that there are two ways to indicate a refcount of 1.  Either we
have a newly created Lucy object with a NULL self->host_obj which *implies* a
refcount of 1, or we have a cached host object which had a refcount of 2 or
more at some point, but which has fallen back down to an *explicit* refcount
of 1.

  Obj*
  Obj_inc_refcount(Obj *self)
  {
      if (self->host_obj == NULL) {
        self->host_obj = Obj_To_Host(self);
      }
      PyINCREF((PyObject*)self->host_obj);
      return self;
  }

Once the host object is cached, it never goes away -- it's there for the life
of the Lucy object.

Next, the method to decrement the refcount.  Note that we only call Destroy()
directly if self->host_obj is NULL.  If we've created a Python object, then we
count on it to invoke the __del__ method when its refcount falls to 0; we will
have defined __del__ to invoke Destroy().

  u32_t
  Obj_dec_refcount(Obj *self)
  {
      if (self->host_obj == NULL) {
          /* NULL host object implies a refcount of 1.  That's dropping to 0
           * as a result of this call, so it's time to invoke Destroy(). */
          Obj_Destroy(self);
      }
      else {
          /* If the PyObject's ob_refcnt falls to 0, then the destructor will
           * be invoked from Python-space via the "__del__" method */
          PyDECREF((PyObject*)self->host_obj);
      }
  }

The last method we need to define is To_Host(), which, in the parlance of the
Python C API docs, will return a "new reference".  

(I'm not sure that this implementation is correct, but it should convey
the gist.)

  void*
  Obj_to_host(Obj *self)
  {
      if (self->host_obj) {
          /* The Python object is already cached, so incref it and return. */
          PyINCREF((PyObject*)self->host_obj);
          return self->host_obj;
      }
      else {
          /* Lazily create Python object. */
          self->host_obj = PyCObject_FromVoidPtr((void*)self, Obj_Destroy)
      }
  }

> won't there be multiple references in C to a given Lucy object, each of
> which would need to incRef the RC?

Yes.  As soon as the refcount has to be increased above 1, we lazily create a
Host object to hold the refcount.  

---

Leaving aside the question of tracing GC hosts for now... does the
cached-host-object delegated refcounting model seem sufficiently clear to you
for use within Lucy Python bindings?

The rest of the Lucy library doesn't need to know about the host object
caching -- it just uses the opaque refcounting API, which looks like plain
old integer refcounting from the outside.
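
(From the library's side it reads like ordinary refcounting -- a usage
sketch, with VA_Push assumed not to take its own reference:)

    VArray *array = VA_new(10);
    Obj    *elem  = Obj_new();

    VA_Push(array, (Obj*)Obj_Inc_RefCount(elem)); /* array holds one ref */
    Obj_Dec_RefCount(elem);                       /* caller drops its ref */
    Obj_Dec_RefCount((Obj*)array);                /* frees array and elem */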

Marvin Humphrey


Re: Reference counting inside a GC host (was "real time updates")

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Sun, Mar 29, 2009 at 05:55:11AM -0400, Michael McCandless wrote:

> > FWIW, the cached-host-object algo wins the tight loop performance contest.
> 
> Which makes sense, but how often are such loops [expected to be] used
> in practice?  EG we already thought Collector inside host would be
> doable but expected poor performance.

The performance hit will matter whenever someone overrides a tight-loop
method.  How much it matters will vary from host to host.

Probably any Scorer or Collector would feel it.  DataReader: it depends on
whether there are Scorer implementations which call on it in tight loops.

InStream and OutStream: definitely would have, except that they're final
classes so you can't override any methods.

Analyzer: maybe not so much because the analysis method would be non-trivial.
(Assuming that we settle on something like the KS "Inversion"-style Analyzers
rather than the method-call-orgy of Lucene TokenStream.)

Searcher, Indexer, Query, QueryParser, DataWriter, IndexManager, Lock: no,
because they don't run lightweight methods within tight loops.  FieldSpec,
Schema: I don't think so because so far it's been possible to cache the return
values of certain methods which would otherwise be called often.
(KinoSearch::Index::Inverter performs such caching.)

> How much work is being done when you create a new host-wrapper object?

That will vary from Host to Host.  In Perl, we're using blessed scalars, which
are about as cheap as Perl objects get.

However, Perl method dispatch is comparatively slow; if method calls were
faster, the effect would loom larger.  In Java, for instance, method calls are
very fast, and creating a host wrapper object on each loop will probably cause
a significant degradation much sooner.  

Put another way: caching a host wrapper object probably matters a lot more
when method dispatch is fast.

> Could you somehow re-use such objects?  (Sort of like how when you go
> to a children's museum, to the water section, they provide a bunch of
> yellow frocks that the kids can wear when they play with the water,
> and return when they are done).  

Evocative metaphor. :)  So, who manages the pool?  

Ideally, when we know we're going to be calling back to the host in a tight
loop, we'd want the caller to cache the host object.  But that's not easy.

Here's the basic implementation of Matcher_Collect():

  void
  Matcher_collect(Matcher *self, Collector *collector)
  {
    i32_t doc_num;
    Collector_Set_Matcher(collector, self);
    while (0 != (doc_num = Matcher_Next(self))) {
      Coll_Collect(collector, doc_num);
    }
  }

Say that the user has overridden Collector's Collect() method in the host, so
it's slow.  To get best performance, we need to change how *Matcher* works,
even though it's *Collector* that's the source of the slowdown.

  void
  Matcher_collect(Matcher *self, Collector *collector)
  {
    i32_t  doc_num;
    void  *host_obj = Coll_To_Host(collector); /* <--- Cache wrapper here */

    Collector_Set_Matcher(collector, self);
    while (0 != (doc_num = Matcher_Next(self))) {
      Host_callback(host_obj, "collect", 1, ARG_I32("doc_num", doc_num));
    }

    Coll_Destroy_Host(collector, host_obj); /* <--- Destroy wrapper here */
  }

(Host_callback() ordinarily takes a Lucy object as its first argument, but
pretend that it takes a host wrapper object for now.)

Coll_Collect doesn't know it's being called in a loop.  Even if it did, it
could only cache host object wrappers using static or global vars. (Bye-bye
thread safety, and what on earth do we do about subclasses?)

  static void *Coll_collect_HOST_OBJ = NULL;
  void
  Coll_collect_OVERRIDE(Collector* self, i32_t doc_num)
  {
    /* Bleah. */
    Coll_collect_HOST_OBJ = Coll_Validate_Or_Make_Wrapper(self, 
        Coll_collect_HOST_OBJ);  
    Host_callback(Coll_collect_HOST_OBJ, "collect", 1, 
        ARG_I32("doc_num", doc_num));
  }

> > We lose convenient use of the "inside-out-object" pattern. :(
> 
> Neat.... I've seen a similar approach (though likely someone has given
> / would give it a different named pattern) 

The "FlyWeight" design pattern is very similar.

> In your example, shouldn't/couldn't the hash key somehow be the id of
> the *Lucy* object, not of its fleeting host wrapper obj?  

Actually, that's what $$self is.

   $self          <---- address of Perl object[1]
   $$self         <---- address of Lucy object

However, invoking $self->DESTROY on multiple wrappers around the same Lucy
object is still a problem.

  sub DESTROY {
      my $self = shift;
      delete $foo{$$self};
      $self->SUPER::DESTROY;  # Calls Dec_RefCount(), not Destroy() as it
                              # would under the cached-object model.
  }

As soon as the wrapper has its refcount fall to 0, Perl invokes DESTROY, and
despite the fact that the Lucy object doesn't get zapped yet, $foo{$$self}
gets deleted.

Marvin Humphrey

[1] Slight oversimplification.

Re: Reference counting inside a GC host (was "real time updates")

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Sun, Mar 29, 2009 at 06:01:16AM -0400, Michael McCandless wrote:

> You might consider adding the hooks for a tracing collector, but not
> creating such a collector now.  IE have each Lucy obj expose a method
> to visit the other objs it refers to.

We can probably have Boilerplater autogenerate that code.  That way we only
have to write one compiler module instead of manually maintaining methods
for X hundred classes.

> I think you'd be surprised at how easily cycles are accidentally created.
> 
> It's sort of like saying programmers are accustomed to writing high
> performance code, but then an O(N^2) slips in somewhere and goes
> undetected until N happens to get large deep in production.

Heh.  Or a reference cycle leak slips in that happens to get large and causes
a performance hit in production.  Been there done that.

Marvin Humphrey


Re: Reference counting inside a GC host (was "real time updates")

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Sat, Mar 28, 2009 at 6:55 PM, Marvin Humphrey <ma...@rectangular.com> wrote:
> On Fri, Mar 27, 2009 at 08:23:45AM -0400, Michael McCandless wrote:
>
>> since Lucy will use reference counting, how will you deal with cycles?
>
> I don't think we should do anything about them.  It will be up to the user to
> avoid them.
>
> There *are* limits to my ambitions for the Lucy object model.  :)

OK :)  Intentionally leaky abstraction.  FWIW Python took that same
approach for many years, before sprouting a cycle detector addition to
its GC.  But it's awkward, because sometimes things are collected
immediately (when they decref to 0), but at other times, only when the
cyclic collector sweeps.  And of course, because it's tracing, it
requires each class to expose a "visit all things I refer to" API.

You might consider adding the hooks for a tracing collector, but not
creating such a collector now.  IE have each Lucy obj expose a method
to visit the other objs it refers to.

>> since the host language is involved, a cycle could easily reach out to the
>> host language and wrap back around into Lucy?
>
> Possible.  I doubt it will be much of a problem for programmers accustomed to
> working within a refcounted environment.

I think you'd be surprised at how easily cycles are accidentally created.

It's sort of like saying programmers are accustomed to writing high
performance code, but then an O(N^2) slips in somewhere and goes
undetected until N happens to get large deep in production.
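
For instance, here's a self-contained toy in C -- two refcounted nodes,
nothing exotic, and the leak is already baked in:

    #include <stdlib.h>

    /* Toy refcounted node -- not Lucy code, just the minimal trap. */
    typedef struct Node {
        struct Node *other;
        int          refcount;
    } Node;

    static Node*
    Node_new(void) {
        Node *self = calloc(1, sizeof(Node));
        self->refcount = 1;
        return self;
    }

    static Node* incref(Node *self) { self->refcount++; return self; }

    static void
    decref(Node *self) {
        if (--self->refcount == 0) {
            if (self->other) { decref(self->other); }
            free(self);
        }
    }

    int main(void) {
        Node *a = Node_new();
        Node *b = Node_new();
        a->other = incref(b);  /* a -> b */
        b->other = incref(a);  /* b -> a */
        decref(a);             /* falls to 1, not 0 */
        decref(b);             /* falls to 1, not 0: both nodes leak */
        return 0;
    }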

> Programmers who normally work within a tracing GC environment may be more
> prone to create reference cycles, because they're not trained to avoid
> cyclical data structures and alarms might not go off in their heads when they
> see them.
>
> But what are our options?  Our class hierarchy is sophisticated and large
> enough that we have to use either tracing GC or refcounting.  Writing our own
> tracing GC, now *that* would be ambitious. ;)  And though I haven't studied
> cycle detection, I imagine it has to be both involved and costly.

Agreed.  I wonder if there's a decent tracing GC library somehow out
there, that only requires certain limited methods to be exposed for
your objects...

And note that I'm really playing devil's advocate here -- I don't have
good solutions to offer for these hard problems.

Mike

Re: Reference counting inside a GC host (was "real time updates")

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Sat, Mar 28, 2009 at 6:08 PM, Marvin Humphrey <ma...@rectangular.com> wrote:
> On Fri, Mar 27, 2009 at 08:21:54AM -0400, Michael McCandless wrote:
>
>> what if the Lucy obj never held onto the reference to the mirror host obj?
>> (And the Lucy obj did its own reference counting, separately from the
>> host's GC).
>
> What you've laid out is *exactly* the same scheme Peter Karman used in SWISH3,
> the same scheme Dave Balmain used in Ferret, and the same scheme I used in
> KinoSearch for objects which subclass FastObj.

OK, excellent.  I'm playing catch up over here...

> Since this algorithm seems to be an intuitive winner, and since I'd like to
> just solve this problem and move on, I took a stab at converting KS over to
> using it exclusively.  Switching over to something that everybody else seems to
> grok would be great -- indeed, I was rooting for my cached-host-object algo to
> lose.  Sacrificing a little performance for the sake of an implementation that
> other people might have less difficulty understanding, maintaining, and
> modifying would be just fine with me.
>
> The results were disappointing.  The cached-host-object algo has certain
> advantages we should be reluctant to discard.

Interesting.

>> When we cross the bridge, Lucy to host, we would make a new host
>> wrapper obj each time.  These host mirror objs are very lightweight
>> wrappers, right?
>
> FWIW, the cached-host-object algo wins the tight loop performance contest.

Which makes sense, but how often are such loops [expected to be] used
in practice?  EG we already thought Collector inside host would be
doable but expected poor performance.

> Below my sig, you'll find a patch that applies against KinoSearch svn revision
> 4361, and a benchmarking script. The patch gives TermQuery a member var named
> "thing", plus Set_Thing() and Get_Thing() methods.  The loop compares the
> performance of calling $term_query->get_thing from Perl-space when "thing" is a
> VArray (which subclasses FastObj) vs. when "thing" is a Hash (which subclasses
> Obj).
>
> The only difference in those loops is the behavior of FastObj_to_host (which
> creates a new host object every time) vs.  Obj_to_host (which exploits a cached
> host object).  The results:
>
>  marvin@smokey:~/projects/ks4361/perl $ perl -Mblib time_callbacks.pl
>                     Rate no_cached_object    cached_object
>  no_cached_object 1.93/s               --             -66%
>  cached_object    5.71/s             196%               --
>
> On its own, this performance gap probably doesn't justify sacrificing clarity
> of implementation.  It only matters for host-language subclassing, where
> performance is going to be heavily compromised anyway -- and if your method
> does something non-trivial, this effect is going to get swamped.

How much work is being done when you create a new host-wrapper object?
 Could you somehow re-use such objects?  (Sort of like how when you go
to a children's museum, to the water section, they provide a bunch of
yellow frocks that the kids can wear when they play with the water,
and return when they are done).  I want to make sure we've fully
explored performance improvements for the "simpler"
non-cached-host-object approach before making the decision.
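
Something like this, say -- every name below is invented, and a real version
would need locking or thread-local pools, given the thread-safety gripe
earlier in the thread:

    /* One frock rack, fixed size: */
    typedef struct HostWrapper {
        lucy_Obj *lucy_obj;
        /* ... host-specific fields ... */
    } HostWrapper;

    #define POOL_MAX 32
    static HostWrapper *pool[POOL_MAX];
    static int          pool_top = 0;

    static HostWrapper*
    Wrapper_checkout(lucy_Obj *obj)
    {
        HostWrapper *w = pool_top > 0
            ? pool[--pool_top]         /* grab a frock off the rack */
            : HostWrapper_alloc();     /* rack empty: make a new one */
        w->lucy_obj = obj;
        return w;
    }

    static void
    Wrapper_checkin(HostWrapper *w)
    {
        w->lucy_obj = NULL;
        if (pool_top < POOL_MAX) { pool[pool_top++] = w; }
        else                     { HostWrapper_free(w); }
    }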

> However...
>
> But what do we lose by not retaining a permanent host obj wrapper?
>
> We lose convenient use of the "inside-out-object" pattern. :(

Neat.... I've seen a similar approach (though likely someone has given
/ would give it a different pattern name) applied to fine-grained
locks, too, where you want to make a lock for objects where you may
have zillions of them, so instead of actually making a lock per
object, you make a single shared array of locks and each object is
hashed into one lock in that array.
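
In C with pthreads the trick is only a few lines (the stripe count and the
hash are arbitrary):

    #include <pthread.h>
    #include <stdint.h>

    #define N_STRIPES 64
    static pthread_mutex_t stripes[N_STRIPES];

    static void
    stripes_init(void)
    {
        for (int i = 0; i < N_STRIPES; i++) {
            pthread_mutex_init(&stripes[i], NULL);
        }
    }

    /* Hash an object's address into one of the shared locks.  The >> 4
     * discards low bits that are identical due to alignment. */
    static pthread_mutex_t*
    lock_for(const void *obj)
    {
        return &stripes[((uintptr_t)obj >> 4) % N_STRIPES];
    }

    /* Usage:
     *     pthread_mutex_lock(lock_for(obj));
     *     ...mutate obj...
     *     pthread_mutex_unlock(lock_for(obj));
     */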

In your example, shouldn't/couldn't the hash key somehow be the id of
the *Lucy* object, not of its fleeting host wrapper obj?  I'm not sure
how easy this'd be ($$self likely is not something you could override
to return the pointer to the Lucy object instead of the host
wrapper?).  And then when the Lucy obj is really destroyed, you'd have
to cross the bridge and call DESTROY in the host.

>> (Note that if it's a real performance problem, then in certain cases
>> you could eg explicitly retain a reference at the top of the loop, do
>> the loop, then drop the reference at the end).

I view this as being similar to the "thou shalt not create cycles"
approach.  We are expecting the user (not end-user) to understand some
limitations of how Lucy works and to properly work around them.  It's
an intentionally leaky abstraction...

Mike

Re: Reference counting inside a GC host (was "real time updates")

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Fri, Mar 27, 2009 at 08:21:54AM -0400, Michael McCandless wrote:

> what if the Lucy obj never held onto the reference to the mirror host obj?
> (And the Lucy obj did its own reference counting, separately from the
> host's GC).

What you've laid out is *exactly* the same scheme Peter Karman used in SWISH3,
the same scheme Dave Balmain used in Ferret, and the same scheme I used in
KinoSearch for objects which subclass FastObj.

Since this algorithm seems to be an intuitive winner, and since I'd like to
just solve this problem and move on, I took a stab at converting KS over to
using it exclusively.  Switching over to something that everybody else seems to
grok would be great -- indeed, I was rooting for my cached-host-object algo to
lose.  Sacrificing a little performance for the sake of an implementation that
other people might have less difficulty understanding, maintaining, and
modifying would be just fine with me.

The results were disappointing.  The cached-host-object algo has certain
advantages we should be reluctant to discard.

> When we cross the bridge, Lucy to host, we would make a new host
> wrapper obj each time.  These host mirror objs are very lightweight
> wrappers, right?  

FWIW, the cached-host-object algo wins the tight loop performance contest.

Below my sig, you'll find a patch that applies against KinoSearch svn revision
4361, and a benchmarking script. The patch gives TermQuery a member var named
"thing", plus Set_Thing() and Get_Thing() methods.  The loop compares the
performance of calling $term_query->get_thing from Perl-space when "thing" is a
VArray (which subclasses FastObj) vs. when "thing" is a Hash (which subclasses
Obj).

The only difference in those loops is the behavior of FastObj_to_host (which
creates a new host object every time) vs.  Obj_to_host (which exploits a cached
host object).  The results:

  marvin@smokey:~/projects/ks4361/perl $ perl -Mblib time_callbacks.pl 
                     Rate no_cached_object    cached_object
  no_cached_object 1.93/s               --             -66%
  cached_object    5.71/s             196%               --

On its own, this performance gap probably doesn't justify sacrificing clarity
of implementation.  It only matters for host-language subclassing, where
performance is going to be heavily compromised anyway -- and if your method
does something non-trivial, this effect is going to get swamped.  

However...

> But what do we lose by not retaining a permanent host obj wrapper?

We lose convenient use of the "inside-out-object" pattern. :(

The "inside-out object" design pattern is a technique that has been increasing
in popularity within the Perl community for a few years now -- Perl 5.10
provides the Hash::Util::FieldHash class specifically to support it:

    http://perldoc.perl.org/Hash/Util/FieldHash.html#The-Inside-out-Technique

The inside-out technique is the *only* way we will be able to attach arbitrary
data to a pure-Perl subclass of a Lucy object.  Within the KinoSearch
distribution, all of the pure-Perl KSx classes depend on it --
KSx::Search::MockScorer, KSx::Simple, KSx::Remote::SearchServer,
KSx::Remote::SearchClient, etc -- and it is essential for making pure-Perl
subclassing of Lucy feasible.

Here's example code from the docs for KinoSearch::Obj which illustrates the
technique, attaching a "foo" variable to our "MyObj" subclass.

    package MyObj;
    use base qw( KinoSearch::Obj );
    
    # Inside-out member var.
    my %foo;
    
    sub new {
        my ( $class, %args ) = @_;
        my $foo = delete $args{foo};
        my $self = $class->SUPER::new(%args);
        $foo{$$self} = $foo;
        return $self;
    }
    
    sub get_foo {
        my $self = shift;
        return $foo{$$self};
    }
    
    sub DESTROY {
        my $self = shift;
        delete $foo{$$self};
        $self->SUPER::DESTROY;
    }

The problem is that DESTROY method.  It is vital, as the Hash::Util::FieldHash
documentation explains:

    When a normal object dies, anything stored in the object body is
    garbage-collected by perl. With inside-out objects, Perl knows nothing
    about the data stored in field hashes by a class, but these must be
    deleted when the object goes out of scope. Thus the class must provide a
    DESTROY method to take care of that.

Under the cached-host-object algorithm, that DESTROY method is called only
once, when the Lucy object has really had its refcount fall to 0.  However, if
we create host wrappers for each callback, DESTROY gets called *multiple*
times, and the attached data gets zapped too soon.

It is technically possible to address the problem like so:

    sub DESTROY {
        my $self = shift;
        if ( $self->get_refcount == 1 ) {
            delete $foo{$$self};
        }
        $self->SUPER::DESTROY;
    }

However, having to explain to the user that DESTROY will be called many times
for each Lucy object -- and having to expose get_refcount and explain why it's
needed -- makes for a stupid and ugly API.

> (Note that if it's a real performance problem, then in certain cases
> you could eg explicitly retain a reference at the top of the loop, do
> the loop, then drop the reference at the end).

This is a novel idea.  I'd really like to get to a place where someone besides
me would feel comfortable going in and making such an optimization.

I've at least made some progress on the simplification front, which I'll
explain in another post.

Marvin Humphrey

##############################################
# Benchmark Obj_to_host vs. FastObj_to_host. #
##############################################

use strict;
use warnings;

use KinoSearch;
use Benchmark qw( cmpthese );

my $foo_query = KinoSearch::Search::TermQuery->new(
    field => 'content',
    term  => 'foo',
);
my $bar_query = KinoSearch::Search::TermQuery->new(
    field => 'content',
    term  => 'bar',
);
$foo_query->set_thing( KinoSearch::Util::VArray->new( capacity => 16 ) );
$bar_query->set_thing( KinoSearch::Util::Hash->new( capacity => 16 ) );

cmpthese(
    10, 
    {   no_cached_object => sub { $foo_query->get_thing for 1 .. 100000 },
        cached_object    => sub { $bar_query->get_thing for 1 .. 100000 },
    }   
);

#############################################
# Patch to add "thing" member to TermQuery. #
#############################################

Index: ../core/KinoSearch/Search/TermQuery.bp
===================================================================
--- ../core/KinoSearch/Search/TermQuery.bp  (revision 4361)
+++ ../core/KinoSearch/Search/TermQuery.bp  (working copy)
@@ -12,6 +12,7 @@
 
     CharBuf *field;
     Obj     *term;
+    Obj     *thing;
 
     static incremented TermQuery*
     new(const CharBuf *field, const Obj *term);
@@ -33,6 +34,12 @@
     Obj*
     Get_Term(TermQuery *self);
 
+    public void
+    Set_Thing(TermQuery *self, Obj *thing);
+
+    public Obj*
+    Get_Thing(TermQuery *self);
+
     public incremented Compiler*
     Make_Compiler(TermQuery *self, Searchable *searchable, float boost);
 
Index: ../core/KinoSearch/Search/TermQuery.c
===================================================================
--- ../core/KinoSearch/Search/TermQuery.c   (revision 4361)
+++ ../core/KinoSearch/Search/TermQuery.c   (working copy)
@@ -29,6 +29,7 @@
     Query_init((Query*)self, 1.0f);
     self->field       = CB_Clone(field);
     self->term        = Obj_Clone(term);
+    self->thing       = NULL;
     return self;
 }
 
@@ -55,6 +56,7 @@
 {
     DECREF(self->field);
     DECREF(self->term);
+    DECREF(self->thing);
     Query_destroy((Query*)self);
 }
 
@@ -63,6 +65,15 @@
 Obj*
 TermQuery_get_term(TermQuery *self)  { return self->term; }
 
+Obj*
+TermQuery_get_thing(TermQuery *self)  { return self->thing; }
+void
+TermQuery_set_thing(TermQuery *self, Obj *thing)  
+{
+    DECREF(self->thing);
+    self->thing = thing ? INCREF(thing) : NULL;
+}
+
 bool_t
 TermQuery_equals(TermQuery *self, Obj *other)
 {
Index: ../perl/lib/KinoSearch/Search/TermQuery.pm
===================================================================
--- ../perl/lib/KinoSearch/Search/TermQuery.pm  (revision 4361)
+++ ../perl/lib/KinoSearch/Search/TermQuery.pm  (working copy)
@@ -41,7 +41,7 @@
 END_CONSTRUCTOR
 
 {   "KinoSearch::Search::TermQuery" => {
-        bind_methods      => [qw( Get_Field )],
+        bind_methods      => [qw( Get_Field Set_Thing Get_Thing )],
         make_constructors => ["new"],
         make_pod          => {
             synopsis    => $synopsis,


Re: Reference counting inside a GC host (was "real time updates")

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Fri, Mar 27, 2009 at 08:23:45AM -0400, Michael McCandless wrote:

> since Lucy will use reference counting, how will you deal with cycles?

I don't think we should do anything about them.  It will be up to the user to
avoid them.

There *are* limits to my ambitions for the Lucy object model.  :)

> since the host language is involved, a cycle could easily reach out to the
> host language and wrap back around into Lucy?

Possible.  I doubt it will be much of a problem for programmers accustomed to
working within a refcounted environment.  

Programmers who normally work within a tracing GC environment may be more
prone to create reference cycles, because they're not trained to avoid
cyclical data structures and alarms might not go off in their heads when they
see them.

But what are our options?  Our class hierarchy is sophisticated and large
enough that we have to use either tracing GC or refcounting.  Writing our own
tracing GC, now *that* would be ambitious. ;)  And though I haven't studied
cycle detection, I imagine it has to be both involved and costly.

Marvin Humphrey


Re: Reference counting inside a GC host (was "real time updates")

Posted by Michael McCandless <lu...@mikemccandless.com>.
A new top-level question: since Lucy will use reference counting, how
will you deal with cycles?

You could try to ensure the Lucy core doesn't create cycles of its
own, but.... since the host language is involved, a cycle could easily
reach out to the host language and wrap back around into Lucy?

Mike

On Fri, Mar 27, 2009 at 8:21 AM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> On Fri, Mar 27, 2009 at 1:08 AM, Marvin Humphrey <ma...@rectangular.com> wrote:
>
>> I think I have an approach that's going to allow us to eliminate FastObj: We
>> lazily create the host object, and treat a NULL host_obj as semantically
>> equivalent to a refcount of 1.
>
> Much of this is beyond me, but... won't there be multiple references
> in C to a given Lucy object, each of which would need to incRef the
> RC?
>
> Expanding on/restating my "sort of like autoboxing" idea...: what if
> the Lucy obj never held onto the reference to the mirror host obj?
> (And the Lucy obj did its own reference counting, separately from the
> host's GC).
>
> But, the reverse is allowed (the host can hold onto the Lucy mirror
> obj that has incRef's (refers to) the Lucy obj).
>
> When we cross the bridge, Lucy to host, we would make a new host
> wrapper obj each time.  These host mirror objs are very lightweight
> wrappers, right?  When that call returns across the bridge, Lucy drops
> the reference to it.
>
> The host is allowed to keep a reference somewhere to that host mirror
> obj, and later cross the bridge in the reverse direction, using it.
>
> Note that Lucy can still clearly store "other" host objects.  It's
> just these thin "host mirror of a Lucy object" wrappers that would not be
> retained in Lucy.
>
> This then breaks the circular ref, and gives complete freedom for Lucy
> to do whatever GC it needs, separate from the Host's GC.
>
> But what do we lose by not retaining a permanent host obj wrapper?
> (Note that if it's a real performance problem, then in certain cases
> you could eg explicitly retain a reference at the top of the loop, do
> the loop, then drop the reference at the end).
>
> Mike
>

Re: Reference counting inside a GC host (was "real time updates")

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Fri, Mar 27, 2009 at 1:08 AM, Marvin Humphrey <ma...@rectangular.com> wrote:

> I think I have an approach that's going to allow us to eliminate FastObj: We
> lazily create the host object, and treat a NULL host_obj as semantically
> equivalent to a refcount of 1.

Much of this is beyond me, but... won't there be multiple references
in C to a given Lucy object, each of which would need to incRef the
RC?

Expanding on/restating my "sort of like autoboxing" idea...: what if
the Lucy obj never held onto the reference to the mirror host obj?
(And the Lucy obj did its own reference counting, separately from the
host's GC).

But, the reverse is allowed (the host can hold onto the Lucy mirror
obj that has incRef's (refers to) the Lucy obj).

When we cross the bridge, Lucy to host, we would make a new host
wrapper obj each time.  These host mirror objs are very lightweight
wrappers, right?  When that call returns across the bridge, Lucy drops
the reference to it.

The host is allowed to keep a reference somewhere to that host mirror
obj, and later cross the bridge in the reverse direction, using it.

Note that Lucy can still clearly store "other" host objects.  It's
just these thin "host mirror of a Lucy object" wrappers that would not be
retained in Lucy.

This then breaks the circular ref, and gives complete freedom for Lucy
to do whatever GC it needs, separate from the Host's GC.

But what do we lose by not retaining a permanent host obj wrapper?
(Note that if it's a real performance problem, then in certain cases
you could eg explicitly retain a reference at the top of the loop, do
the loop, then drop the reference at the end).
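
Roughly, reusing names from earlier in the thread (Host_drop_ref() is a
made-up stand-in for whatever releases the host reference):

    /* Pin one wrapper for the whole loop instead of re-creating it on
     * every callback.  Assume lucy_Obj *obj and i32_t max_doc in scope. */
    void *wrapper = Obj_To_Host(obj);         /* create/retain once */
    for (i32_t doc_num = 1; doc_num <= max_doc; doc_num++) {
        Host_callback(wrapper, "collect", 1,
            ARG_I32("doc_num", doc_num));     /* no per-call wrapper */
    }
    Host_drop_ref(wrapper);                   /* release at the end */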

Mike

Re: Reference counting inside a GC host (was "real time updates")

Posted by Marvin Humphrey <ma...@rectangular.com>.
> >> > I'd really like to kill off FastObj just for the sake of simplicity,
> >> > though.
> >>
> >> What objects are planned to subclass FastObj?
> >
> > VTable, plus the CharBuf family.
> >
> > Right now in KS, the following classes also descend from FastObj:
> >
> >    VArray
> >    Posting
> >    Token
> >    Inversion (nee TokenBatch)
> >    ByteBuf
> >    ViewByteBuf
> >    Undefined
> >
> > However, none of them have the declaration/bootstrapping problem presented by
> > VTable and CharBuf.  

I think I have an approach that's going to allow us to eliminate FastObj: We
lazily create the host object, and treat a NULL host_obj as semantically
equivalent to a refcount of 1.

chy_u32_t
lucy_Obj_dec_refcount(lucy_Obj *self)
{
    if (self->ref.host_obj) {
        chy_u32_t modified_refcount = SvREFCNT((SV*)self->ref.host_obj) - 1;
        /* If the SV's refcount falls to 0, DESTROY will be invoked from
         * Perl-space.
         */
        SvREFCNT_dec((SV*)self->ref.host_obj);
        return modified_refcount;
    }
    else {
        /* NULL host_obj implies a refcount of 1, so invoke Destroy(). */
        Lucy_Obj_Destroy(self);
        return 0;
    }
}

That allows us to declare VTable and CharBuf structs at compile time, by
inserting NULL where the host object would be.

I've switched over VTable and CharBuf with either zero or maybe 1% impact on
performance (the numbers are noisy).

The only thing that's weird is that I can't figure out what's happening to the
Perl objects these persistent objects must be acquiring.  I think they're
getting zapped during Perl's "everyone out of the pool" global destruction
phase, but I'm not sure yet -- I tried to make VTables leak by incrementing the
refcount a whole bunch of times in the VTable_singleton() constructor, but
couldn't get Valgrind to show anything.

Marvin Humphrey

Re: Reference counting inside a GC host (was "real time updates")

Posted by Michael McCandless <lu...@mikemccandless.com>.
Marvin Humphrey <ma...@rectangular.com> wrote:

>> So... one alternative would be to separately track a private-to-Lucy
>> refCount from the host object's refCount?
>
> Since you refer to the "host object's refCount", I take it the following
> passage applies to refcounted systems like Perl and Python, and not to GC
> systems like Ruby and Java:

Sorry, let me refine the proposal: let the host GC the host object,
and Lucy GC's the C object, and cross-link (circular ref) the first
time the bridge is crossed.

I think Java (JNI) actually exposes a refCount-like API, ie you can
say "I (C) am referring to Java object X", which I think just adds a
new global ref that the tracing collector includes.  Presumably Ruby
exposes something similar?  And RefCount host languages obviously let
you incRef/decRef.

>> Then, for Lucy objects that
>> never cross the bridge you wouldn't have to make a "false" host
>> object.  But you'd need to take care to destroy an object when both
>> Lucy's Obj & the host's wrapper obj drop to refCount 1.
>
> OK, if I understand you correctly, then there would be two refcounts.
> In essence, we intentionally create a circular reference between the Lucy
> object and the host wrapper object.  However, we avoid the memory leak because
> whenever either of the two refcounts falls to one, we see whether the other
> refcount is also at 1 -- and if that's the case, we destroy the object.
>
> The problem I see here is that we don't necessarily have control over what
> happens during the host refcounting.  In Perl at least, I don't know of a way
> to override what happens during SvREFCNT_dec, because it's not a method.  So
> we can't set up a trigger that fires when the Perl object wrapper's refcount
> falls to 1 -- the only event we can count on is the call to $obj->DESTROY()
> when the perl reference count falls to 0.

Hmm, right, that's a problem.... I think Python also doesn't let you
override incRef/decRef for an object.

Maybe you could not keep a permanent ref to the host object?  Eg say a
Term crosses the bridge; at that point you could make a host object
for it, but when the term comes back across the bridge, remove that
reference.  If the host language keeps a ref and later crosses the
bridge backwards, it would hold a ref to the Lucy obj.

Ie things that make a stack-only foray into the host language would
get a temporary host object wrapper; sort of like autoboxing.

I haven't really thought through the implications of that though...

>> This may also be better for non-refCount languages (Java).
>
> How would maintaining a refcount for a Java object work?  Under a tracing GC
> system, I think you could say that the refcount is "true" until the object
> becomes unreachable, at which point it's "false".  Other than the fact that
> the object is reachable, you don't know how many things are pointing at it.
> Because of that, there's no way to detect when the only reference left is the
> one from the Lucy object.

Sorry, I meant you'd use JNI's API to say "I am now referring to Java
object X" and "I have stopped referring to Java object X".  Basically,
a refCount.  But we still have the same tight-cycle challenge.

>> > However, that's not a major problem unless we're creating and
>> > destroying a boatload of small objects.
>
> ---->8 snip 8<----
>
>> I think Lucene has improved here too, especially on the indexing side
>> (though the searching side doesn't create too many tiny objects I
>> think).
>
> Nothing like indexing.  The only spot I can think of is the term dictionary
> index, with all those Term and TermInfo objects.

Yeah... LUCENE-1458 addresses that (no more TermInfo & Term instances
created for the terms index... though it's still creates String
instances).

> In Java, you could address that by turning a 1000-object array of 5-element
> TermInfo objects into 5 primitive-type arrays with 1000 slots each.  We want
> to do something like that anyway because of flexible indexing.

Right, column stride instead of row stride.

> In KS, we no longer generate a lot of Term/TermInfo objects because that data
> structure has been modified for use with mmap.  However, we'll want to mod
> that system further for flexible indexing and PFOR.

Right, LUCENE-1458 takes some similar steps forward -- it stores less
in RAM, eg we don't need most of the TermInfo for each indexed term;
just the long pointer into the tis file.

> I look forward to taking that up in the future.  :)

There's a lot to take up in the future!!

>> > literals, and such in a bootstrap routine, but it's enough of a pain
>> > to set something like that up that I haven't gone and made such a
>> > change in KS.
>> >
>> > I'd really like to kill off FastObj just for the sake of simplicity,
>> > though.
>>
>> What objects are planned to subclass FastObj?
>
> VTable, plus the CharBuf family.
>
> Right now in KS, the following classes also descend from FastObj:
>
>    VArray
>    Posting
>    Token
>    Inversion (nee TokenBatch)
>    ByteBuf
>    ViewByteBuf
>    Undefined
>
> However, none of them have the declaration/bootstrapping problem presented by
> VTable and CharBuf.  Additionally, I hope to eliminate Token (by expanding the
> capabilities of Inversion), Posting is due for an index-time overhaul but
> isn't a problem at search-time, Undefined isn't important, and ByteBuf and
> ViewByteBuf don't present speed problems.
>
> Rebasing VArray and Inversion (which is a subclass of VArray) to be children
> of Obj rather than FastObj causes a performance dip of about 3% on the
> indexing benchmark.  I imagine that's as bad as it would be for any binding
> because Perl objects are pretty heavy.

OK

>> > The Ferret scheme won't cause problems with light usage of the
>> > library, because most of Lucy's work will be done within tight loops
>> > in the C core.
>>
>> What about a HitCollector in the host language?  Can you efficiently
>> re-use an object from the host language?  (Python has such tricks, eg
>> to re-use a TupleObject during iteration).
>
> Oh, definitely a HitCollector would be a problem.
>
> Restating the original assertion: the Ferret scheme won't cause problems if
> you avoid host-language subclasses.

OK

>> > Do we need to
>> > add all new objects to a giant Hash that we tell the host about, and
>> > yank C stack vars out of that Hash before the C function returns?
>>
>> What does Ruby's C-embedding API expose to interact with its GC?  I
>> would imagine it'd be similar to Java's (ie "here's a new global root
>> object")?
>
> I understand you can do something like that.

OK

>> I don't understand the C stack vars / hash question.
>
> Consider the following routine, set up for tracing garbage collection:
>
>    void
>    FSFolder_delete(FSFolder *self, const CharBuf *filename)
>    {
>        CharBuf *fullpath = (CharBuf*)GC_add(
>            CB_newf("%o/%o", self->path, filename));
>        bool_t result = Host_callback_i(self, "do_delete", 1,
>            ARG_STR("path", fullpath));
>        if (!result) {
>            THROW("failed to delete %o", fullpath);
>        }
>        GC_remove(fullpath);
>    }
>
> Assume that the CharBuf formatted constructor CB_newf() caches a Host obj
> within the CharBuf, and that no reference counting is performed because we're
> going to rely upon the host's tracing garbage collector.
>
> When "fullpath" is created at the top of our function, it's unreferenced as
> far as the Host's garbage collector is concerned.  The only place it exists is
> on the C stack -- but the garbage collector doesn't walk the C stack looking
> for roots.  It walks the host's stack, but not the C stack.
>
> As soon as fullpath is born, it is a candidate for garbage collection, and
> could be reclaimed at any moment.  Therefore, we *have* to call GC_add() to
> inform the Host about our new CharBuf.

You're going to need to register all threads with the host language
for this reason?  Doesn't Java need the threads to stop mutating
things, at least to get the roots for tracing?

> If we *don't* call GC_add(), we could potentially get an invalid read error
> while composing the message for THROW.  Say that Host_callback_i doesn't send
> "fullpath" itself into host-space, but instead copies its contents into a host
> string type.  (This is actually what happens in the current KS implementation.)
> Now, say that a GC takes place during Host_callback_i().  Uh-oh, "fullpath" is
> toast -- it was unreferenced, and got reclaimed.  If THROW needs it, we'll be
> reading from freed memory.
>
> Our only remedy is to call GC_add() early on, protecting fullpath until we
> know we don't need it any more.  But calling GC_add() means that we have to
> call GC_remove() somewhere, or we'll leak memory.
>
> That call to GC_remove() at the end was what I meant by "yank C stack vars out
> of that Hash before the C function returns".
>
> Now, it turns out that the same function set up for a refcounting host looks
> pretty similar... eliminate the call to GC_add(), and change the GC_remove() to
> a DECREF():
>
>    void
>    FSFolder_delete(FSFolder *self, const CharBuf *filename)
>    {
>        CharBuf *fullpath = CB_newf("%o/%o", self->path, filename);
>        bool_t result = Host_callback_i(self, "do_delete", 1,
>            ARG_STR("path", fullpath));
>        if (!result) {
>            THROW("failed to delete %o", fullpath);
>        }
>        DECREF(fullpath);
>    }
>
> Perhaps it's possible to somehow have the DECREF() op perform double duty and
> trigger GC_remove() on a garbage collecting host.  I dunno, that seems likely
> to cause double-free problems, but I'll try thinking that through in a future
> post.

This is devilishly tricky stuff!

Mike

Re: Reference counting inside a GC host (was "real time updates")

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Tue, Mar 24, 2009 at 08:23:49PM -0500, Peter Karman wrote:

> I tried all the patterns thus far mentioned in the
> libswish3 Perl/XS bindings (SWISH::3) and finally (through much trial and error)
> ended up with the same pattern that Ferret (and I believe KS) uses: create host
> objects at the C boundaries, refcounting only the internal C structs, and then
> in the perl DESTROY() method doing something like:
> 
>   if (--c_obj->ref_cnt < 1) {
>       swish_destroy(c_obj);
>   }
> 
> I've chosen to mitigate the host object expense by doing all the required heavy
> lifting (tight loops, etc) in native C, exposing only those methods that the
> Perl user really needs.

That's a good system.  And it may be that we can't improve on it in the
context of a garbage collecting host.

KS does something a little different these days, though.   Instead of creating
Perl objects at the Perl/C boundary, KS creates a Perl object and a C object
at the same time, and they are mated for life.[1]

HOWEVER, despite the fact that the C object and the Perl object point at each
other, we don't have a circular reference problem because the two objects
*share* a single, unified refcount.  

  Obj *self = Obj_new();        // SvREFCNT((SV*)self->ref.host_obj) == 1
  Obj_Inc_RefCount(self);       // SvREFCNT((SV*)self->ref.host_obj) == 2
  SV *pobj = Obj_To_Host(self); // SvREFCNT((SV*)self->ref.host_obj) == 3
  SV *copy = newSVsv(pobj);     // SvREFCNT((SV*)self->ref.host_obj) == 4
  Obj_Dec_RefCount(self)        // SvREFCNT((SV*)self->ref.host_obj) == 3
  Obj_Dec_RefCount(self)        // SvREFCNT((SV*)self->ref.host_obj) == 2
  Obj_Dec_RefCount(self)        // SvREFCNT((SV*)self->ref.host_obj) == 1
  Obj_Dec_RefCount(self)        // $obj->DESTROY invoked from Perl-space

Unlike SWISH3, KinoSearch::Obj::DESTROY knows that it will only be invoked
when the refcount is 0, so it can invoke the C destructor every time.

    void
    DESTROY(self)
        kino_Obj *self;
    PPCODE:
        Kino_Obj_Destroy(self);

In fact, every time a KinoSearch::Obj has its refcount fall to 0, we venture
back into Perl-space.  That's because Obj_Dec_RefCount() is a wrapper around
SvREFCNT_dec(), and when SvREFCNT_dec() hits 0, it invokes the Perl DESTROY
method.

The advantage of this scheme is that we create fewer objects, and thus incur
less overhead at the host/C boundary.

The question I'm trying to answer is: Is there any way to make this work in
the context of a host like Ruby or Java, which use tracing garbage collection
instead of a refcounting scheme that we can piggyback on?

Marvin Humphrey

[1] OK, I'm lying.  KinoSearch classes which subclass  
    KinoSearch::Obj::FastObj actually behave exactly the same way as Ferret
    and SWISH.  It would be really nice if that were not the case, because if
    even really really smart people like Peter find this hard to follow, the
    system needs to be simplified.


Re: Reference counting inside a GC host (was "real time updates")

Posted by Peter Karman <pe...@peknet.com>.
Marvin Humphrey wrote on 3/24/09 7:12 PM:
> On Fri, Mar 20, 2009 at 04:30:21PM -0400, Michael McCandless wrote:
> 
>> So... one alternative would be to separately track a private-to-Lucy
>> refCount from the host object's refCount?  
> 
> Since you refer to the "host object's refCount", I take it the following
> passage applies to refcounted systems like Perl and Python, and not to GC
> systems like Ruby and Java:
> 
>> Then, for Lucy objects that
>> never cross the bridge you wouldn't have to make a "false" host
>> object.  But you'd need to take care to destroy an object when both
>> Lucy's Obj & the host's wrapper obj drop to refCount 1.
> 
> OK, if I understand you correctly, then there would be two refcounts.
> In essence, we intentionally create a circular reference between the Lucy
> object and the host wrapper object.  However, we avoid the memory leak because
> whenever either of the two refcounts falls to one, we see whether the other
> refcount is also at 1 -- and if that's the case, we destroy the object.
> 
> The problem I see here is that we don't necessarily have control over what
> happens during the host refcounting.  In Perl at least, I don't know of a way
> to override what happens during SvREFCNT_dec, because it's not a method.  So
> we can't set up a trigger that fires when the Perl object wrapper's refcount
> falls to 1 -- the only event we can count on is the call to $obj->DESTROY()
> when the perl reference count falls to 0.
> 

I'm chiming in late on this thread, but I want to echo what Marvin says here
about the host refcounting. I tried all the patterns thus far mentioned in the
libswish3 Perl/XS bindings (SWISH::3) and finally (through much trial and error)
ended up with the same pattern that Ferret (and I believe KS) uses: create host
objects at the C boundaries, refcounting only the internal C structs, and then
in the perl DESTROY() method doing something like:

  if (--c_obj->ref_cnt < 1) {
      swish_destroy(c_obj);
  }

I've chosen to mitigate the host object expense by doing all the required heavy
lifting (tight loops, etc) in native C, exposing only those methods that the
Perl user really needs.

>>> The Ferret scheme won't cause problems with light usage of the
>>> library, because most of Lucy's work will be done within tight loops
>>> in the C core.
>> What about a HitCollector in the host language?  Can you efficiently
>> re-use an object from the host language?  (Python has such tricks, eg
>> to re-use a TupleObject during iteration).
>
> Oh, definitely a HitCollector would be a problem.
>
> Restating the original assertion: the Ferret scheme won't cause problems if
> you avoid host-language subclasses.
>

Amen. I had to create a subclassing scheme for SWISH::3 for just that reason.
It's not as elegant as the Boilerplater VTable stuff, but does address the same
problem.

-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: Reference counting inside a GC host (was "real time updates")

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Fri, Mar 20, 2009 at 04:30:21PM -0400, Michael McCandless wrote:

> So... one alternative would be to separately track a private-to-Lucy
> refCount from the host object's refCount?  

Since you refer to the "host object's refCount", I take it the following
passage applies to refcounted systems like Perl and Python, and not to GC
systems like Ruby and Java:

> Then, for Lucy objects that
> never cross the bridge you wouldn't have to make a "false" host
> object.  But you'd need to take care to destroy an object when both
> Lucy's Obj & the host's wrapper obj drop to refCount 1.

OK, if I understand you correctly, then there would be two refcounts.
In essence, we intentionally create a circular reference between the Lucy
object and the host wrapper object.  However, we avoid the memory leak because
whenever either of the two refcounts falls to one, we see whether the other
refcount is also at 1 -- and if that's the case, we destroy the object.

The problem I see here is that we don't necessarily have control over what
happens during the host refcounting.  In Perl at least, I don't know of a way
to override what happens during SvREFCNT_dec, because it's not a method.  So
we can't set up a trigger that fires when the Perl object wrapper's refcount
falls to 1 -- the only event we can count on is the call to $obj->DESTROY()
when the perl reference count falls to 0.

> This may also be better for non-refCount languages (Java).

How would maintaining a refcount for a Java object work?  Under a tracing GC
system, I think you could say that the refcount is "true" until the object
becomes unreachable, at which point it's "false".  Other than the fact that
the object is reachable, you don't know how many things are pointing at it.
Because of that, there's no way to detect when the only reference left is the
one from the Lucy object.

> > However, that's not a major problem unless we're creating and
> > destroying a boatload of small objects.

---->8 snip 8<----

> I think Lucene has improved here too, especially on the indexing side
> (though the searching side doesn't create too many tiny objects I
> think).

Nothing like indexing.  The only spot I can think of is the term dictionary
index, with all those Term and TermInfo objects.  

In Java, you could address that by turning a 1000-object array of 5-element
TermInfo objects into 5 primitive-type arrays with 1000 slots each.  We want
to do something like that anyway because of flexible indexing.
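
In C terms, it's the usual array-of-structs to parallel-arrays transform --
the field names and widths below are illustrative, not the real TermInfo
layout:

    /* Row stride: one array of 1000 five-field records. */
    typedef struct TermInfoRow {
        u32_t doc_freq;
        u64_t post_ptr;
        u64_t skip_ptr;
        u64_t lex_ptr;
        u32_t field_num;
    } TermInfoRow;
    static TermInfoRow term_infos[1000];

    /* Column stride: five primitive arrays of 1000 slots each.  A scan
     * that needs only doc_freq now walks one densely packed array. */
    static u32_t doc_freqs[1000];
    static u64_t post_ptrs[1000];
    static u64_t skip_ptrs[1000];
    static u64_t lex_ptrs[1000];
    static u32_t field_nums[1000];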

In KS, we no longer generate a lot of Term/TermInfo objects because that data
structure has been modified for use with mmap.  However, we'll want to mod
that system further for flexible indexing and PFOR.  

I look forward to taking that up in the future.  :)

> > literals, and such in a bootstrap routine, but it's enough of a pain
> > to set something like that up that I haven't gone and made such a
> > change in KS.
> >
> > I'd really like to kill off FastObj just for the sake of simplicity,
> > though.
> 
> What objects are planned to subclass FastObj?

VTable, plus the CharBuf family.  

Right now in KS, the following classes also descend from FastObj:

    VArray 
    Posting
    Token
    Inversion (nee TokenBatch)
    ByteBuf
    ViewByteBuf
    Undefined 
    
However, none of them have the declaration/bootstrapping problem presented by
VTable and CharBuf.  Additionally, I hope to eliminate Token (by expanding the
capabilities of Inversion), Posting is due for an index-time overhaul but
isn't a problem at search-time, Undefined isn't important, and ByteBuf and
ViewByteBuf don't present speed problems.

Rebasing VArray and Inversion (which is a subclass of VArray) to be children
of Obj rather than FastObj causes a performance dip of about 3% on the
indexing benchmark.  I imagine that's as bad as it would be for any binding
because Perl objects are pretty heavy.

> > The Ferret scheme won't cause problems with light usage of the
> > library, because most of Lucy's work will be done within tight loops
> > in the C core.
> 
> What about a HitCollector in the host language?  Can you efficiently
> re-use an object from the host language?  (Python has such tricks, eg
> to re-use a TupleObject during iteration).

Oh, definitely a HitCollector would be a problem.

Restating the original assertion: the Ferret scheme won't cause problems if
you avoid host-language subclasses.

> > Do we need to
> > add all new objects to a giant Hash that we tell the host about, and
> > yank C stack vars out of that Hash before the C function returns?
>
> What does Ruby's C-embedding API expose to interact with its GC?  I
> would imagine it'd be similar to Java's (ie "here's a new global root
> object")?

I understand you can do something like that.

> I don't understand the C stack vars / hash question.

Consider the following routine, set up for tracing garbage collection:

    void
    FSFolder_delete(FSFolder *self, const CharBuf *filename)
    {
        CharBuf *fullpath = (CharBuf*)GC_add(
            CB_newf("%o/%o", self->path, filename));
        bool_t result = Host_callback_i(self, "do_delete", 1, 
            ARG_STR("path", fullpath));
        if (!result) {
            THROW("failed to delete %o", fullpath);
        }
        GC_remove(fullpath);
    }

Assume that the CharBuf formatted constructor CB_newf() caches a Host obj
within the CharBuf, and that no reference counting is performed because we're
going to rely upon the host's tracing garbage collector.  

When "fullpath" is created at the top of our function, it's unreferenced as
far as the Host's garbage collector is concerned.  The only place it exists is
on the C stack -- but the garbage collector doesn't walk the C stack looking
for roots.  It walks the host's stack, but not the C stack.  
    
As soon as fullpath is born, it is a candidate for garbage collection, and
could be reclaimed at any moment.  Therefore, we *have* to call GC_add() to
inform the Host about our new CharBuf.

If we *don't* call GC_add(), we could potentially get an invalid read error
while composing the message for THROW.  Say that Host_callback_i doesn't send
"fullpath" itself into host-space, but instead copies its contents into a host
string type.  (This is actually what happens in the current KS implementation.)
Now, say that a GC takes place during Host_callback_i().  Uh-oh, "fullpath" is
toast -- it was unreferenced, and got reclaimed.  If THROW needs it, we'll be
reading from freed memory.

Our only remedy is to call GC_add() early on, protecting fullpath until we
know we don't need it any more.  But calling GC_add() means that we have to
call GC_remove() somewhere, or we'll leak memory.

That call to GC_remove() at the end was what I meant by "yank C stack vars out
of that Hash before the C function returns".

Now, it turns out that the same function set up for a refcounting host looks
pretty similar... eliminate the call to GC_add(), and change the GC_remove() to
a DECREF():

    void
    FSFolder_delete(FSFolder *self, const CharBuf *filename)
    {
        CharBuf *fullpath = CB_newf("%o/%o", self->path, filename);
        bool_t result = Host_callback_i(self, "do_delete", 1, 
            ARG_STR("path", fullpath));
        if (!result) {
            THROW("failed to delete %o", fullpath);
        }
        DECREF(fullpath);
    }

Perhaps it's possible to somehow have the DECREF() op perform double duty and
trigger GC_remove() on a garbage collecting host.  I dunno, that seems likely
to cause double-free problems, but I'll try thinking that through in a future
post.
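
The naive version would be nothing more than a compile-time switch -- purely
speculative, with an invented flag name, and the double-free question still
stands:

    /* Speculative: compile the same source-level op down to whichever
     * memory discipline the host binding uses. */
    #ifdef LUCY_HOST_HAS_TRACING_GC
      #define DECREF(obj)  GC_remove(obj)
    #else
      #define DECREF(obj)  lucy_Obj_dec_refcount((lucy_Obj*)(obj))
    #endif

That only works if every DECREF'd object was GC_add'ed in the first place,
which is exactly where the double-free worry comes in.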

Marvin Humphrey


Re: Reference counting inside a GC host (was "real time updates")

Posted by Michael McCandless <lu...@mikemccandless.com>.
Marvin Humphrey <ma...@rectangular.com> wrote:

>> > There are some quirks, though, with how it manages host objects.
>> > The default behavior is to create a host language object at the
>> > same time as the Boilerplater object, and have the host object
>> > manage the refcount.
>>
>> Hmm, sounds tricky... because there are typically consumers in C
>> and in the Host language and both need to incRef/decRef.
>
> Indeed, we need to accommodate refcounting ops both in the Lucy core
> and in the Host.  For a refcounted Host like Perl or Python, all of
> these ops will affect a single, *unified* refcount which resides in
> the cached Perl/Python object at self->ref.host_obj.  That's what I
> meant by "have the host manage the refcount" -- I wasn't clear
> enough that Lucy would be able to manipulate that host refcount
> using wrapper methods.
>
> The Lucy::Obj header at trunk/core/Lucy/Obj.bp will declare
> Inc_RefCount() and Dec_RefCount() methods:
>
>    /** Increment an object's refcount.
>     *
>     * @return the object, allowing an assignment idiom.
>     */
>    public incremented Obj*
>    Inc_RefCount(Obj *self);
>
>    /** Decrement an object's refcount, calling Destroy() if it hits 0.
>     *
>     * @return the modified refcount.
>     */
>    u32_t
>    Dec_RefCount(Obj *self);
>
> However, no implementation for these methods is provided in
> trunk/core/Lucy/Obj.c.  It will be up to the bindings to provide an
> implementation, or a linking error will occur.
>
> For the Perl bindings, we'll provide a second
> Obj.c at trunk/perl/xs/Lucy/Obj.c which will contain the following:
>
>    lucy_Obj*
>    lucy_Obj_inc_refcount(lucy_Obj *self)
>    {
>        SvREFCNT_inc_simple_void_NN((SV*)self->ref.host_obj);
>        return self;
>    }
>
>    chy_u32_t
>    lucy_Obj_dec_refcount(lucy_Obj *self)
>    {
>        chy_u32_t modified_refcount = SvREFCNT((SV*)self->ref.host_obj) - 1;
>        /* If the SV's refcount falls to 0, DESTROY will be invoked from
>         * Perl-space.
>         */
>        SvREFCNT_dec((SV*)self->ref.host_obj);
>        return modified_refcount;
>    }
>
> That's how most objects in Lucy will be managed.  However, that
> approach isn't ideal for all of them.
>
> The first, obvious objection to caching a host object inside every
> single Lucy object is that it wastes memory for those objects which
> never venture into Host-space; an integer refcount would require
> less overhead.  The "FastObj" class was originally written to
> address this concern.

So... one alternative would be to separately track a private-to-Lucy
refCount from the host object's refCount?  Then, for Lucy objects that
never cross the bridge you wouldn't have to make a "false" host
object.  But you'd need to take care to destroy an object when both
Lucy's Obj & the host's wrapper obj drop to refCount 1.

This may also be better for non-refCount languages (Java).
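
In rough C, maybe something like this -- all names below are invented, and
the trigger on the host side is the hard part:

    /* A C-side count lives next to the (possibly absent) host wrapper. */
    typedef struct lucy_Ref {
        chy_u32_t  lucy_count;   /* references held from C side     */
        void      *host_obj;     /* NULL until we cross the bridge  */
    } lucy_Ref;

    void
    lucy_Obj_dec_lucy_refcount(lucy_Obj *self)
    {
        self->ref.lucy_count--;
        if (self->ref.host_obj == NULL) {
            /* Never crossed the bridge: plain refcounting. */
            if (self->ref.lucy_count == 0) { Lucy_Obj_Destroy(self); }
        }
        else if (self->ref.lucy_count == 1
                 && Host_get_refcount(self->ref.host_obj) == 1) {
            /* Only the two cross-links remain; break the cycle. */
            Lucy_Obj_Destroy(self);
        }
    }

The host side would need a mirror-image trigger when its count falls to 1,
which is the part I'm least sure the host will give us.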

> However, that's not a major problem unless we're creating and
> destroying a boatload of small objects.  Lucene 1.4.3 was a
> profligate wastrel in this regard, but KinoSearch's basic
> architecture has gotten pretty lean and has room to get leaner
> still.  If memory use and speed were the only reasons to use
> FastObj, I think we could kill it off.

I think Lucene has improved here too, especially on the indexing side
(though the searching side doesn't create too many tiny objects I
think).

> However, there's a second, more annoying problem.  It's not possible
> to declare static structs which contain e.g. a Perl object, because
> all Perl objects are malloc'd at runtime.  That's inconvenient for
> declaring things like CharBuf literals or VTables:
>
>   /* Can't do this unless CharBuf is a subclass of FastObj. */
>   static CharBuf foo = {
>        (VTable*)&CHARBUF,
>        1,      /* ref.count */
>        "foo",  /* character data */
>        3,      /* size */
>        4       /* capacity (includes terminating NULL) */
>   };
>
> It's probably possible to initialize all of our VTables, CharBuf
> literals, and such in a bootstrap routine, but it's enough of a pain
> to set something like that up that I haven't gone and made such a
> change in KS.
>
> I'd really like to kill off FastObj just for the sake of simplicity,
> though.

What objects are planned to subclass FastObj?

>> > I've tried searching the web for resources on how to make
>> > refcounting and GC coexist happily, but I haven't found anything
>> > so far.  If anybody's got strong google-fu today or actually
>> > knows and can recommend some literature, I'm all ears.
>>
>> This is tricky!
>
> There's one scheme that I know will at least work under a tracing
> garbage collector: the one used by Ferret.
>
>    * Within the C portion of Lucy, perform integer refcounting.
>    * Every time a unique host wrapper object is created, increment
>      the refcount.
>    * Every time a host wrapper is destroyed, decrement the
>      refcount.
>
> In other words, for Hosts that use tracing garbage collection, all
> Lucy objects would use an integer refcount, and nobody would cache
> any host objects.

I think, similarly, Java (JNI) lets you add a global reference to a
Java object which is analogous to incref-ing.  But caching a host
object would be fine?  Why lose that? (I don't know Ferret/Ruby very
well...).
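
The JNI calls themselves are real; the wrapper functions around them are just
for illustration:

    #include <jni.h>

    /* "I (C) am referring to Java object X": pin it against collection. */
    static jobject
    lucy_host_incref(JNIEnv *env, jobject host_obj)
    {
        return (*env)->NewGlobalRef(env, host_obj);
    }

    /* "I have stopped referring to X": let the collector reclaim it. */
    static void
    lucy_host_decref(JNIEnv *env, jobject global_ref)
    {
        (*env)->DeleteGlobalRef(env, global_ref);
    }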

> However, the Ferret approach has a drawback: You create and destroy
> host wrappers every time you cross the host/C boundary.  That'll
> create a performance drag in some situations.

Yeah.

> The Ferret scheme won't cause problems with light usage of the
> library, because most of Lucy's work will be done within tight loops
> in the C core.

What about a HitCollector in the host language?  Can you efficiently
re-use an object from the host language?  (Python has such tricks, eg
to re-use a TupleObject during iteration).

> It also doesn't stop you from attaching host data to the C object
> using the "flyweight" design pattern, a.k.a. the "inside-out object"
> pattern, because you can still key data off of the unchanging C
> object memory address.

OK.

> However, once you start writing subclasses, all that OO overhead at
> the host/C boundary is going to slow down tight loops like
> Scorer_Next.
>
> Caching a host object for the life of the the Lucy object solves
> that problem, but I'm not sure how to do that within the context of
> a tracing garbage collector.
>
> We can assume that the program initiates within Host-space; most
> Lucy objects will be able to trace back to the host.  However,
> independent objects that we create as statics, globals, or C stack
> vars won't be visible to the garbage collector and will get
> reclaimed prematurely.
>
> Any ideas on how to pull off the caching trick?  Is there something
> we can do if we allocate space within all of our Lucy objects for
> *both* an integer refcount and a cached host object?  Do we need to
> add all new objects to a giant Hash that we tell the host about, and
> yank C stack vars out of that Hash before the C function returns?

What does Ruby's C-embedding API expose to interact with its GC?  I
would imagine it'd be similar to Java's (ie "here's a new global root
object")?

I don't understand the C stack vars / hash question.

Mike

Re: real time updates

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Wed, Mar 18, 2009 at 06:41:25PM -0400, Michael McCandless wrote:

> But at the end of the session you still insist on merging down to one
> segment before closing right?  So if my added docs all fit in RAM,
> closing is fast, but if I go a bit over 1 RAM buffer's worth, then
> closing is suddenly slower since it must merge those two runs before
> closing?

There's definitely a threshold effect.  But the time is still small relative
to the total run time of the indexer app.  10 or 15%, I think?

> I think having a close that's unexpectedly slow is irksome.  

It's not unexpectedly slow.  It's consistently slow.  :D  

Since KS *always* performs merging of runs at the end, the cost doesn't come
as a surprise.

Moving all the hard work into a Prepare_Commit() method would solve this and
would be a piece of cake to implement.  Then we could have a Commit() which
was always fast. 
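
Sketched out, with invented class and method names:

    /* All the slow work: merge runs, flush new segment files.
     * Nothing is visible to readers yet. */
    void
    Indexer_Prepare_Commit(Indexer *self)
    {
        Indexer_Merge_Runs(self);
        Indexer_Write_Segment_Files(self);
        self->prepared = true;
    }

    /* Always fast: atomically publish the new snapshot, e.g. by
     * renaming a snapshot file into place. */
    void
    Indexer_Commit(Indexer *self)
    {
        if (!self->prepared) { Indexer_Prepare_Commit(self); }
        Indexer_Publish_Snapshot(self);
        self->prepared = false;
    }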

Marvin Humphrey


Re: real time updates

Posted by Michael McCandless <lu...@mikemccandless.com>.
Marvin Humphrey <ma...@rectangular.com> wrote:
> Mike McCandless:
>
>> It seems like redundant code to merge a run and to merge segments.
>
>
> There's not that much duplication any more.  The stable branch of KS
> can only merge postings and lexicon data by dumping everything into
> the external sorter, adding an inefficient extra step.  However, the
> dev branch can read directly from existing content.
>
> The runs which are merged by the external sorter are represented by
> individual "PostingPool" objects, which can be fed either from run
> data written during the present indexing session, or by live segment
> postings and lexicon files.  Flushing a run to disk uses the same
> format, so we only need one read class.

OK sounds like the code is well shared.

> I think that algorithmically, this means KS has moved towards Lucene -- though
> the class infrastructure is still completely different.

But at the end of the session you still insist on merging down to one
segment before closing right?  So if my added docs all fit in RAM,
closing is fast, but if I go a bit over 1 RAM buffer's worth, then
closing is suddenly slower since it must merge those two runs before
closing?

I think having a close that's unexpectedly slow is irksome.  I'd
prefer to pay my cost as I go...

Though, Lucene's close can also be unexpectedly slow, if a big BG
merge is still running and you have to wait.

>> I'm not sure I like tying single-search runtime concurrency directly
>> to the index structure (this was broached on a Jira issue somewhere).
>> I think I'd prefer to have each thread skipTo its starting docID and
>> process its chunk, instead of wiring one-thread-per-segment, but I'm
>> really not sure yet.
>
> That would work fine if Lucene indexes were block-based.  Does the
> format of the current "multi-level skip data" allow you to jump into
> the middle of a segment?  The Lucene file format documentation is
> gnarly in general, and the skip data spec was the gnarliest section
> last I looked, so I don't want to go check myself.  :) You wouldn't
> have been able to skip forward efficiently with the 1.4.3 format I
> managed to get working with KinoSearch 0.05, but maybe it would work
> now.

Gnarly is a good word to describe it.  That conjures up these vivid
images for me :)

The multi-level skip list should let you quickly jump to any spot
(though I haven't specifically tested this).

> The nice thing about your approach is that it wouldn't require any
> special intervention at the indexing stage.

Right, and you're free to change your mind at search time about how
much concurrency you want to spend.
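
Eg, roughly how the chunking might be computed -- pure sketch, independent
of segment boundaries; the Scorer_Skip_To in the trailing comment is
shorthand, not a settled API:

    /* Divide [0, doc_max) into n_threads contiguous docID ranges. */
    typedef struct { int start; int end; } DocRange;

    DocRange
    thread_chunk(int doc_max, int n_threads, int i)
    {
        int per_thread = (doc_max + n_threads - 1) / n_threads;
        DocRange range;
        range.start = i * per_thread;
        range.end   = range.start + per_thread;
        if (range.end > doc_max) { range.end = doc_max; }
        return range;
    }

    /* Thread i would then Scorer_Skip_To(scorer, range.start) and collect
     * hits until the candidate docID reaches range.end. */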

> However, it would still be possible to tune the merge algorithm so
> that it tried to keep the segments mostly equal in size without
> requiring the user to call indexWriter.optimize(int numSegments).

True; they're not mutually exclusive.

Mike

Reference counting inside a GC host (was "real time updates")

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Tue, Mar 17, 2009 at 05:50:14AM -0400, Michael McCandless wrote:

> > There are some quirks, though, with how it manages host objects.
> > The default behavior is to create a host language object at the same
> > time as the Boilerplater object, and have the host object manage the
> > refcount.
> 
> Hmm, sounds tricky... because there are typically consumers in C and
> in the Host language and both need to incRef/decRef.

Indeed, we need to accommodate refcounting ops both in the Lucy core and in the
Host.  For a refcounted Host like Perl or Python, all of these ops will affect
a single, *unified* refcount which resides in the cached Perl/Python object at
self->ref.host_obj.  That's what I meant by "have the host manage the
refcount" -- I wasn't clear enough that Lucy would be able to manipulate that
host refcount using wrapper methods.

The Lucy::Obj header at trunk/core/Lucy/Obj.bp will declare Inc_RefCount() and
Dec_RefCount() methods:

    /** Increment an object's refcount.
     * 
     * @return the object, allowing an assignment idiom.
     */
    public incremented Obj* 
    Inc_RefCount(Obj *self);

    /** Decrement an object's refcount, calling Destroy() if it hits 0.
     * 
     * @return the modified refcount.
     */
    u32_t
    Dec_RefCount(Obj *self);

However, no implementation for these methods is provided in
trunk/core/Lucy/Obj.c.  It will be up to the bindings to provide an
implementation, or a linking error will occur.

For the Perl bindings, we'll provide a second
Obj.c at trunk/perl/xs/Lucy/Obj.c which will contain the following:

    lucy_Obj*
    lucy_Obj_inc_refcount(lucy_Obj *self)
    {
        SvREFCNT_inc_simple_void_NN((SV*)self->ref.host_obj);
        return self;
    }

    chy_u32_t
    lucy_Obj_dec_refcount(lucy_Obj *self)
    {
        chy_u32_t modified_refcount = SvREFCNT((SV*)self->ref.host_obj) - 1;
        /* If the SV's refcount falls to 0, DESTROY will be invoked from
         * Perl-space.
         */
        SvREFCNT_dec((SV*)self->ref.host_obj);
        return modified_refcount;
    }

That's how most objects in Lucy will be managed.  However, that approach isn't
ideal for all of them.

The first, obvious objection to caching a host object inside every single Lucy
object is that it wastes memory for those objects which never venture into
Host-space; an integer refcount would require less overhead.  The "FastObj"
class was originally written to address this concern.

However, that's not a major problem unless we're creating and destroying a
boatload of small objects.  Lucene 1.4.3 was a profligate wastrel in this
regard, but KinoSearch's basic architecture has gotten pretty lean and has
room to get leaner still.  If memory use and speed were the only reasons to
use FastObj, I think we could kill it off.

However, there's a second, more annoying problem.  It's not possible to
declare static structs which contain e.g. a Perl object, because all Perl
objects are malloc'd at runtime.  That's inconvenient for declaring things
like CharBuf literals or VTables:

   /* Can't do this unless CharBuf is a subclass of FastObj. */
   static CharBuf foo = {
        (VTable*)&CHARBUF,
        1,      /* ref.count */
        "foo",  /* character data */
        3,      /* size */
        4       /* capacity (includes terminating NULL) */
   };

It's probably possible to initialize all of our VTables, CharBuf literals, and
such in a bootstrap routine, but it's enough of a pain to set something like
that up that I haven't gone and made such a change in KS.
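
For the record, the bootstrap routine would look something like this --
CB_new() and the pinning call are invented names, and the CharBuf/Obj
declarations are assumed from the snippet above:

    static CharBuf *foo = NULL;  /* formerly a static struct literal */

    void
    lucy_bootstrap_literals(void)
    {
        /* Allocate at load time, once the host can create wrapper objects. */
        foo = CB_new("foo");
        /* Pin it for the life of the process so it never gets reclaimed. */
        Obj_Inc_RefCount((Obj*)foo);
    }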

I'd really like to kill off FastObj just for the sake of simplicity, though.

> > I've tried searching the web for resources on how to make
> > refcounting and GC coexist happily, but I haven't found anything so
> > far.  If anybody's got strong google-fu today or actually knows and
> > can recommend some literature, I'm all ears.
> 
> This is tricky!

There's one scheme that I know will at least work under a tracing garbage
collector: the one used by Ferret.

    * Within the C portion of Lucy, perform integer refcounting.
    * Every time a unique host wrapper object is created, increment the
      refcount.
    * Every time a host wrapper is destroyed, decrement the refcount.

In other words, for Hosts that use tracing garbage collection, all Lucy
objects would use an integer refcount, and nobody would cache any host
objects.
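
A minimal sketch of the refcounting ops under that scheme, assuming the
ref.count member of the boil_ref_t union quoted earlier (Obj_Destroy is an
assumed method name):

    lucy_Obj*
    lucy_Obj_inc_refcount(lucy_Obj *self)
    {
        self->ref.count++;
        return self;
    }

    chy_u32_t
    lucy_Obj_dec_refcount(lucy_Obj *self)
    {
        chy_u32_t modified_refcount = --self->ref.count;
        if (modified_refcount == 0) {
            /* Free C resources immediately; no host wrapper is cached. */
            Obj_Destroy(self);
        }
        return modified_refcount;
    }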

However, the Ferret approach has a drawback: You create and destroy host
wrappers every time you cross the host/C boundary.  That'll create a
performance drag in some situations.

The Ferret scheme won't cause problems with light usage of the library,
because most of Lucy's work will be done within tight loops in the C core.  It
also doesn't stop you from attaching host data to the C object using the
"flyweight" design pattern, a.k.a. the "inside-out object" pattern, because
you can still key data off of the unchanging C object memory address.

However, once you start writing subclasses, all that OO overhead at the host/C
boundary is going to slow down tight loops like Scorer_Next.  

Caching a host object for the life of the Lucy object solves that problem,
but I'm not sure how to do that within the context of a tracing garbage
collector.  

We can assume that the program initiates within Host-space; most Lucy objects
will be able to trace back to the host.  However, independent objects that we
create as statics, globals, or C stack vars won't be visible to the garbage
collector and will get reclaimed prematurely.

Any ideas on how to pull off the caching trick?  Is there something we can do
if we allocate space within all of our Lucy objects for *both* an integer
refcount and a cached host object?  Do we need to add all new objects to a
giant Hash that we tell the host about, and yank C stack vars out of that Hash
before the C function returns?

Marvin Humphrey


Re: real time updates

Posted by Marvin Humphrey <ma...@rectangular.com>.
Mike McCandless: 

> It seems like redundant code to merge a run and to merge segments.  

There's not that much duplication any more.  The stable branch of KS can only
merge postings and lexicon data by dumping everything into the external sorter,
adding an inefficient extra step.  However, the dev branch can read directly
from existing content.

The runs which are merged by the external sorter are represented by individual
"PostingPool" objects, which can be fed either from run data written during the
present indexing session, or by live segment postings and lexicon files.
Flushing a run to disk uses the same format, so we only need one read class.

I think that algorithmically, this means KS has moved towards Lucene -- though
the class infrastructure is still completely different.

> I'm not sure I like tying single-search runtime concurrency directly
> to the index structure (this was broached on a Jira issue somewhere).
> I think I'd prefer to have each thread skipTo its starting docID and
> process its chunk, instead of wiring one-thread-per-segment, but I'm
> really not sure yet.

That would work fine if Lucene indexes were block-based.  Does the format of
the current "multi-level skip data" allow you to jump into the middle of a
segment?  The Lucene file format documentation is gnarly in general, and the
skip data spec was the gnarliest section last I looked, so I don't want to go
check myself.  :) You wouldn't have been able to skip forward efficiently with
the 1.4.3 format I managed to get working with KinoSearch 0.05, but maybe it
would work now.

The nice thing about your approach is that it wouldn't require any special
intervention at the indexing stage.  However, it would still be possible to
tune the merge algorithm so that it tried to keep the segments mostly equal in
size without requiring the user to call indexWriter.optimize(int numSegments).

Marvin Humphrey




Re: real time updates

Posted by Michael McCandless <lu...@mikemccandless.com>.
Marvin Humphrey <ma...@rectangular.com> wrote:

>> OK; I guess the salient question is how important is it, really,
>> for deletes to apply to docs added in the current session.  Do
>> users ever miss that?
>
> It's been noted.  I don't think anyone thinks the current KS
> behavior is better.

OK

>> Right, OK.  Though Lucene takes a similar approach to KS, it's just
>> that we allow the intermediate segments (runs?) to be committed as
>> segments, and then normal merging merges them (ie, the doc store
>> files are shared across segments for a single session).
>
> I think we can make that work for Lucy, too.  The KS external sorter
> now writes temporary files using the standard lexicon and postings
> formats.  It's flushing these to "runs", but there's not much of a
> difference between these runs and actual segments.

Right.  It seems like redundant code to merge a run and to merge
segments.  And it makes close() unexpectedly costly when you force
merging of the runs to complete.

> Plus, the Snapshot class I'm about to propose for Lucy is
> specifically designed to accommodate extra-segment files like the
> docs stores.
>
> Cross-pollination FTW. :)

Yes!

>> > Yeah.  I think if we cap the deletion percentage at 10% and merge
>> > any segment that exceeds that rate, plus apply deletions at the
>> > Scorer_Collect level, tombstones won't absolutely murder us.  But
>> > there's some work to do there.
>>
>> Run some C perf tests first!
>
> For now I'll just put it off and use bit vector deletions.  I only need one
> writer and one consolidator at present.  Total throughput isn't a problem --
> worst-case performance is what's driving this feature.

With Lucene, since we carry deletes in RAM and committing is expected
to be relatively rare, our "worst case" measure is the time it takes
to reopen a reader after making a bunch of adds/deletes to the index.

To reopen the reader we need to clone the norms/deletes, and that's
where our equivalent worst-case cost comes in.

But since we're in RAM we can consider other interesting transactional
data structures.

>> Interesting, Lucy will do GC of the search objects?  Is this used
>> for all objects, eg Document, Field, Term, Query, etc.?
>
> It's basically a straight-up refcounting model: each object keeps
> its own refcount, call Destroy() as soon as the refcount hits zero.
> All full-fledged objects are refcounted.
>
> There are some quirks, though, with how it manages host objects.
> The default behavior is to create a host language object at the same
> time as the Boilerplater object, and have the host object manage the
> refcount.

Hmm, sounds tricky... because there are typically consumers in C and
in the Host language and both need to incRef/decRef.

> However, there are a few classes -- CharBuf, VTable, and Posting
> among them -- that subclass "FastObj", which has its own refcount,
> and creates a host wrapper each time it crosses into host space.

OK

> The second member in the Obj struct is a union:
>
>    typedef union boil_ref_t {
>        chy_u32_t count;
>        void *host_obj;
>    } boil_ref_t;
>
> FastObj keeps track of its refcounts using self->ref.count; Obj delegates to
> self->ref.host_obj.  It's up to the binding to provide implementations for
> Obj_Inc_RefCount and Obj_Dec_RefCount.
>
> I haven't worked out the details on how the refcounted model is going to
> function within GC systems like Java and Ruby.  However, I think writing a
> full-featured tracing garbage collector that's supposed to work within a
> reference counted system like Perl's would be harder.
>
> I've tried searching the web for resources on how to make
> refcounting and GC coexist happily, but I haven't found anything so
> far.  If anybody's got strong google-fu today or actually knows and
> can recommend some literature, I'm all ears.

This is tricky!

>> I've been wondering how we can take advantage of concurrency within
>> a single search in Lucene.
>
> Well, doesn't it at least make logical sense to dedicate one thread
> per segment during search?  Then you just have to adopt a different
> definition for "optimize": instead of an "optimized" index always
> meaning "one segment", it means "one segment per thread".

I'm not sure I like tying single-search runtime concurrency directly
to the index structure (this was broached on a Jira issue somewhere).
I think I'd prefer to have each thread skipTo its starting docID and
process its chunk, instead of wiring one-thread-per-segment, but I'm
really not sure yet.

Mike

Re: real time updates

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Mon, Mar 16, 2009 at 07:02:05PM -0400, Michael McCandless wrote:

> OK; I guess the salient question is how important is it, really, for
> deletes to apply to docs added in the current session.  Do users ever
> miss that?

It's been noted.  I don't think anyone thinks the current KS behavior is
better.

> Right, OK.  Though Lucene takes a similar approach to KS, it's just
> that we allow the intermediate segments (runs?) to be committed as
> segments, and then normal merging merges them (ie, the doc store files
> are shared across segments for a single session).

I think we can make that work for Lucy, too.  The KS external sorter now
writes temporary files using the standard lexicon and postings formats.  It's
flushing these to "runs", but there's not much of a difference between these
runs and actual segments.

Plus, the Snapshot class I'm about to propose for Lucy is specifically
designed to accommodate extra-segment files like the docs stores.

Cross-pollination FTW. :)

> > Yeah.  I think if we cap the deletion percentage at 10% and merge
> > any segment that exceeds that rate, plus apply deletions at the
> > Scorer_Collect level, tombstones won't absolutely murder us.  But
> > there's some work to do there.
> 
> Run some C perf tests first!

For now I'll just put it off and use bit vector deletions.  I only need one
writer and one consolidator at present.  Total throughput isn't a problem --
worst-case performance is what's driving this feature.

> Interesting, Lucy will do GC of the search objects?  Is this used for
> all objects, eg Document, Field, Term, Query, etc.?

It's basically a straight-up refcounting model: each object keeps its own
refcount, call Destroy() as soon as the refcount hits zero.  All full-fledged
objects are refcounted.

There are some quirks, though, with how it manages host objects.  The default
behavior is to create a host language object at the same time as the
Boilerplater object, and have the host object manage the refcount.  

However, there are a few classes -- CharBuf, VTable, and Posting among them --
that subclass "FastObj", which has its own refcount, and creates a host
wrapper each time it crosses into host space.

The second member in the Obj struct is a union:

    typedef union boil_ref_t {
        chy_u32_t count;
        void *host_obj;
    } boil_ref_t;

FastObj keeps track of its refcounts using self->ref.count; Obj delegates to
self->ref.host_obj.  It's up to the binding to provide implementations for
Obj_Inc_RefCount and Obj_Dec_RefCount.

I haven't worked out the details on how the refcounted model is going to
function within GC systems like Java and Ruby.  However, I think writing a
full-featured tracing garbage collector that's supposed to work within a
reference counted system like Perl's would be harder.

I've tried searching the web for resources on how to make refcounting and GC
coexist happily, but I haven't found anything so far.  If anybody's got strong
google-fu today or actually knows and can recommend some literature, I'm all
ears.

> I've been wondering how we can take advantage of concurrency within a single
> search in Lucene.

Well, doesn't it at least make logical sense to dedicate one thread per
segment during search?  Then you just have to adopt a different definition for
"optimize": instead of an "optimized" index always meaning "one segment", it
means "one segment per thread".

Marvin Humphrey


Re: real time updates

Posted by Michael McCandless <lu...@mikemccandless.com>.
Marvin Humphrey <ma...@rectangular.com> wrote:

>> Remind me again how KS/Lucy does the deletes?  It must buffer if
>> you've buffered add docs?  Or is it not possible to delete a
>> recently added doc?
>
> The present implementation in KS proceeds commit by commit.  You
> can't delete a doc that hasn't been committed yet.

OK; I guess the salient question is how important is it, really, for
deletes to apply to docs added in the current session.  Do users ever
miss that?

I think with realtime search (at least how Lucene is planning to do
it), it actually is important because one could keep a writer session
open arbitrarily long, reopening readers as you go.

> A common technique when you are either adding or updating but don't
> know which is to always delete based on a primary key field, then
> add:
>
>  while ( my $doc = give_me_another_doc() ) {
>    $indexer->delete_by_term( field => 'key', term => $doc->{key} );
>    $indexer->add_doc($doc);
>  }
>
> If you use an app with that loop to add a very large number of docs,
> the buffered deletes will take up a lot of space in RAM.

Lucene also flushes based on number of buffered deletes.

> I think in Lucene, you flush more often to a new segment.  In KS
> that only happens once per indexing session.

Right, OK.  Though Lucene takes a similar approach to KS, it's just
that we allow the intermediate segments (runs?) to be committed as
segments, and then normal merging merges them (ie, the doc store files
are shared across segments for a single session).

>> > Claiming a new segment directory and committing a new master file
>> > (segments_XXX in Lucene, snapshot_XXX.json in KS) wouldn't require
>> > synchronization: if those ops fail because your process lost out in
>> > the race condition, you just retry.  The only time we have true
>> > synchronization requirements is during merging.
>>
>> Wouldn't you need to synchronize here?  If two writers try to suddenly
>> write to the same snapshot_XXX.json, is that really safe (ie, one
>> always wins, and the other always loses and can reliably detect that
>> it lost)?
>
> You're right; I hadn't thought things through well enough.  There's
> no C API for a rename() op with the semantics of POSIX open() with
> O_EXCL.  No big deal, though -- we can use a commit.lock file for
> synchronizing those two ops.

OK.

>> I think this (reader ORs the multiple deleted docs) would be
>> workable; though it makes realtime search challenging.
>
> Yeah.  I think if we cap the deletion percentage at 10% and merge
> any segment that exceeds that rate, plus apply deletions at the
> Scorer_Collect level, tombstones won't absolutely murder us.  But
> there's some work to do there.

Run some C perf tests first!

>> > > Ugh, lock starvation.  Really the OS should provide a FIFO lock
>> > > queue of some sort.
>
> Heh, I think I just figured out how to solve the lock starvation
> issue.  :)
>
> When the consolidator completes, it will attempt once to acquire
> write.lock.  If it fails, it will leave behind the consolidate.lock
> file, plus write a "consolidate.done" file containing the name of
> the newly added segment and maybe some other info, then exit.
>
> Whenever one of the write processes finishes, it will look for
> "consolidate.done".  If it's found, the write process will pick up
> where the consolidator left off, finishing all merge-related
> operations and then releasing consolidate.lock.

Ahh, nice.  Drop a continuation into the filesystem :)

>> How, in general, are you approaching threads with Lucy?  It seems a
>> shame to forego threads entirely.
>
> We've never formally addressed the issue of threads.  The approach
> I've taken so far with KS is to try to keep my options open.  I've
> tried to keep most classes thread-safe when it's not too much work.
> I think the overall architecture should still be compatible with
> threads.  The only gaping holes are that KS lacks a lock-free hash
> for storing VTables and perhaps interned strings, and that the
> reference-counting GC model might be a synchronization bottleneck.

Interesting, Lucy will do GC of the search objects?  Is this used for
all objects, eg Document, Field, Term, Query, etc.?

Watch out for cycles with reference counting...

> However, I'm of two minds about this.  Threads are a PITA, and
> thanks to mmap and the OS cache, we might be able to use OS
> processes as our exclusive concurrency model.  Launching any number
> of search processes won't be a problem, and it looks like we might
> be able to pull off multiple write processes as well.
>
> There's a theoretical search-time performance issue on the horizon
> relating to threads: it would be nice to have a large index divided
> up into several segments of equal size, allowing us to dedicate one
> thread to each segment and finish a single search faster.  But even
> that is theoretically achievable using multiple processes under a
> MultiSearcher.

Yes, being mmap driven should mean processes are a fine substitute for
threads.

Yes I've been wondering how we can take advantage of concurrency
within a single search in Lucene.  I think it's important, especially
as we all switch to SSDs.

> It would be nice, though, if we could make Lucy thread-safe as a
> library, so that we don't break other people's threading models.
> That way, you could, say, run several indexers and one consolidator
> within a single multi-threaded app -- realizing the benefit of
> threads even if no individual Lucy object knows how to exploit them.

True.

Mike

Re: real time updates

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Mon, Mar 16, 2009 at 05:51:22AM -0400, Michael McCandless wrote:

> Remind me again how KS/Lucy does the deletes?  It must buffer if
> you've buffered add docs?  Or is it not possible to delete a recently
> added doc?

The present implementation in KS proceeds commit by commit.  You can't delete
a doc that hasn't been committed yet.

Lucy need not adopt that model; we could go the buffering route.  However,
that path has implementation drawbacks.  A common technique when you are
either adding or updating but don't know which is to always delete based on a
primary key field, then add:

  while ( my $doc = give_me_another_doc() ) { 
    $indexer->delete_by_term( field => 'key', term => $doc->{key} );
    $indexer->add_doc($doc);
  }

If you use an app with that loop to add a very large number of docs, the 
buffered deletes will take up a lot of space in RAM.

I think in Lucene, you flush more often to a new segment.  In KS that only
happens once per indexing session.  Lucy is ostensibly supposed to model its
internals after the KS design because of the OO overhead streamlining, so we'd
have to find a way to address that issue.

> > Claiming a new segment directory and committing a new master file
> > (segments_XXX in Lucene, snapshot_XXX.json in KS) wouldn't require
> > synchronization: if those ops fail because your process lost out in
> > the race condition, you just retry.  The only time we have true
> > synchronization requirements is during merging.
> 
> Wouldn't you need to synchronize here?  If two writers try to suddenly
> write to the same snapshot_XXX.json, is that really safe (ie, one
> always wins, and the other always loses and can reliably detect that
> it lost)?

You're right; I hadn't thought things through well enough.  There's no C API
for a rename() op with the semantics of POSIX open() with O_EXCL.  No big
deal, though -- we can use a commit.lock file for synchronizing those two ops.
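
For reference, a bare-bones sketch of lock file acquisition built on POSIX
open() with O_CREAT | O_EXCL, which fails atomically if the file already
exists (on local filesystems, anyway -- NFS is its own adventure):

    #include <fcntl.h>
    #include <unistd.h>

    /* Returns 0 on success, -1 if another process holds the lock. */
    int
    acquire_commit_lock(const char *path)
    {
        int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0644);
        if (fd < 0) { return -1; }  /* lost the race -- retry later */
        close(fd);                  /* real code might record the pid first */
        return 0;
    }

    void
    release_commit_lock(const char *path)
    {
        unlink(path);
    }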

> I think this (reader ORs the multiple deleted docs) would be workable;
> though it makes realtime search challenging.

Yeah.  I think if we cap the deletion percentage at 10% and merge any segment
that exceeds that rate, plus apply deletions at the Scorer_Collect level,
tombstones won't absolutely murder us.  But there's some work to do there.

> > > Ugh, lock starvation.  Really the OS should provide a FIFO lock
> > > queue of some sort.

Heh, I think I just figured out how to solve the lock starvation issue.  :)

When the consolidator completes, it will attempt once to acquire write.lock.
If it fails, it will leave behind the consolidate.lock file, plus write a
"consolidate.done" file containing the name of the newly added segment and
maybe some other info, then exit.

Whenever one of the write processes finishes, it will look for
"consolidate.done".  If it's found, the write process will pick up where the
consolidator left off, finishing all merge-related operations and then
releasing consolidate.lock.
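
A sketch of that handoff in C -- the file name comes from this discussion,
but the contents and helper names are strictly illustrative:

    #include <stdio.h>

    /* Consolidator: record the new segment's name before exiting. */
    void
    write_done_file(const char *seg_name)
    {
        FILE *fh = fopen("consolidate.done", "w");
        if (fh != NULL) {
            fprintf(fh, "%s\n", seg_name);
            fclose(fh);
        }
    }

    /* Finishing writer: pick up the continuation, if one was left behind.
     * Returns 1 and fills seg_name on success, 0 if no file exists. */
    int
    read_done_file(char *seg_name, int len)
    {
        FILE *fh = fopen("consolidate.done", "r");
        int found = 0;
        if (fh != NULL) {
            found = fgets(seg_name, len, fh) != NULL;
            fclose(fh);
        }
        return found;
    }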

> How, in general, are you approaching threads with Lucy?  It seems
> a shame to forego threads entirely.  

We've never formally addressed the issue of threads.  The approach I've taken
so far with KS is to try to keep my options open.  I've tried to keep most
classes thread-safe when it's not too much work.  I think the overall
architecture should still be compatible with threads.  The only gaping holes
are that KS lacks a lock-free hash for storing VTables and perhaps interned
strings, and that the reference-counting GC model might be a synchronization
bottleneck.

However, I'm of two minds about this.  Threads are a PITA, and thanks to mmap
and the OS cache, we might be able to use OS processes as our exclusive
concurrency model.  Launching any number of search processes won't be a
problem, and it looks like we might be able to pull off multiple write
processes as well.

There's a theoretical search-time performance issue on the horizon relating to
threads: it would be nice to have a large index divided up into several
segments of equal size, allowing us to dedicate one thread to each segment and
finish a single search faster.  But even that is theoretically achievable
using multiple processes under a MultiSearcher.

It would be nice, though, if we could make Lucy thread-safe as a library, so
that we don't break other people's threading models.  That way, you could,
say, run several indexers and one consolidator within a single multi-threaded
app -- realizing the benefit of threads even if no individual Lucy object
knows how to exploit them.

> > PS: FYI, your messages today have premature line-wrapping issues --
> > your original text, not just the quotes.
> 
> GRRR.  Thanks for pointing it out.  Why can't everything just
> work?  Is this email better?

Yes, it was perfect.  Thanks.  :)

Marvin Humphrey


Re: real time updates

Posted by Michael McCandless <lu...@mikemccandless.com>.
Marvin Humphrey wrote:

> > Right.  I guess it's because Lucene buffers up deletes that it can
> > continue to accept adds & deletes even during the blip.  But it
> > cannot write a new segment (materialize the adds & deletes) during
> > the blip.
>
> OK, I think that makes sense.  Lucene isn't so much performing
> deletions as promising to perform deletions at some point in the
> future.  There's still a window where no new deletions are being
> performed (the "blip"), and the process of reconciling deletions
> finishes during this window.

Right.  But added docs are in fact indexed during the blip (just no
segment can be written).

Remind me again how KS/Lucy does the deletes?  It must buffer if
you've buffered add docs?  Or is it not possible to delete a recently
added doc?

> > (Though... that's tricky, with deletes; oh maybe because you store
> > new deletes for an old segment along with the new segment that's
> > OK?  Hmm, it still seems like you'd have a staleness problem).
>
> What if we have the deletions reader OR together all bit vectors
> against a given segment?  Search-time performance would dive of
> course, but I believe we'd get logically correct results.
>
> Under the Lucene bit-vector naming scheme, you'd need to keep every
> deletions file around for the life of a given segment -- at least
> until you had a consolidator process lock everything down and write
> an authoritative bit vector.  With the current KS bit-vector naming
> scheme, out of date bit-vector files would be zapped by the merging
> process (which in this case means the consolidator).  I don't think
> it's any more efficient, though it's arguably cleaner.

You mean if we were to implement such a multi-writer approach in
Lucene, we'd need to keep all the _X_N.del's around?  Actually, I
think we'd also zap them on completing the merge (& carrying over any
new deletes).  But I don't think Lucene will do multi-process writing
any time soon.  We have good concurrency (I think -- there has been at
least one user report to the contrary) w/ multiple threads in a single
IndexWriter.

> The tombstone approach would work for the same reason.  It doesn't
> matter if multiple tombstone rows contain a tombstone for the same
> document, because the priority queue ORs together the results.
> Therefore, you don't need to coordinate the addition of new
> tombstones.

Yes.

> Claiming a new segment directory and committing a new master file
> (segments_XXX in Lucene, snapshot_XXX.json in KS) wouldn't require
> synchronization: if those ops fail because your process lost out in
> the race condition, you just retry.  The only time we have true
> synchronization requirements is during merging.

Wouldn't you need to synchronize here?  If two writers try to suddenly
write to the same snapshot_XXX.json, is that really safe (ie, one
always wins, and the other always loses and can reliably detect that
it lost)?

> So... if we were to somehow make tombstones perform adequately at
> search-time, I think we could make a many-writers-single-merger
> model work.

I think this (reader ORs the multiple deleted docs) would be workable;
though it makes realtime search challenging.

> > Ugh, lock starvation.  Really the OS should provide a FIFO lock
> > queue of some sort.
>
> Well, I think this would be less of a headache if we didn't need
> portability.  It's just that the locking and IPC mechanisms provided
> by various operating systems out there are wildly incompatible.

Tell me about it... we've had that discussion before.

> Unfortunately, I don't think there's any other way to implement
> background merging for all Lucy target hosts besides the
> multiple-process approach.  Lucy will never work with Perl ithreads.

How, in general, are you approaching threads with Lucy?  It seems
a shame to forego threads entirely.  What's common in Python, Perl,
Ruby, etc.'s thread semantics?

EG Python is great about threads, in that they are native threads to
the platform.  Python itself has a giant lock that prevents multiple
threads from running in the interpreter at once; but once the threads
enter Lucy code they would release this lock and gain concurrency.

> PS: FYI, your messages today have premature line-wrapping issues --
> your original text, not just the quotes.

GRRR.  Thanks for pointing it out.  Why can't everything just
work?  Is this email better?

Mike

Re: real time updates

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Sun, Mar 15, 2009 at 07:21:13PM -0400, Michael McCandless wrote:

> Right.  I guess it's because Lucene buffers up deletes that it can continue
> to accept adds & deletes even during the blip.  But it cannot write a new
> segment (materialize the adds & deletes) during the blip.

OK, I think that makes sense.  Lucene isn't so much performing deletions as
promising to perform deletions at some point in the future.  There's still a
window where no new deletions are being performed (the "blip"), and the
process of reconciling deletions finishes during this window.

> Does this mean you can run multiple writers against the same index, to gain
> concurrency?  

That would be fab.  I hadn't thought it would be possible, but maybe we can
get there...

> (Though... that's tricky, with deletes; oh maybe  because you store new
> deletes for an old segment along with the new segment  that's OK?  Hmm, it
> still seems like you'd have a staleness problem).

What if we have the deletions reader OR together all bit vectors against a
given segment?  Search-time performance would dive of course, but I believe
we'd get logically correct results.
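
The core of such a deletions reader is trivial -- per segment, something
like the OR below; the real cost is the extra I/O and cache pressure, not
the loop:

    #include <stddef.h>

    /* OR one deletions bit vector into the merged view for a segment. */
    void
    deletions_or(unsigned char *merged, const unsigned char *bit_vec,
                 size_t n_bytes)
    {
        size_t i;
        for (i = 0; i < n_bytes; i++) {
            merged[i] |= bit_vec[i];
        }
    }

    /* A doc is deleted if any bit vector flagged it:
     *     merged[doc_num >> 3] & (1 << (doc_num & 0x7))
     */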

Under the Lucene bit-vector naming scheme, you'd need to keep every deletions
file around for the life of a given segment -- at least until you had a
consolidator process lock everything down and write an authoritative bit
vector.  With the current KS bit-vector naming scheme, out of date bit-vector
files would be zapped by the merging process (which in this case means the
consolidator).  I don't think it's any more efficient, though it's arguably
cleaner.

The tombstone approach would work for the same reason.  It doesn't matter if
multiple tombstone rows contain a tombstone for the same document, because the
priority queue ORs together the results.  Therefore, you don't need to
coordinate the addition of new tombstones.

Claiming a new segment directory and committing a new master file
(segments_XXX in Lucene, snapshot_XXX.json in KS) wouldn't require
synchronization: if those ops fail because your process lost out in the race
condition, you just retry.  The only time we have true synchronization
requirements is during merging.

So... if we were to somehow make tombstones perform adequately at search-time,
I think we could make a many-writers-single-merger model work.

> Ugh, lock starvation.  Really the OS should provide a FIFO lock queue of
> some sort.

Well, I think this would be less of a headache if we didn't need portability.
It's just that the locking and IPC mechanisms provided by various operating
systems out there are wildly incompatible.

Unfortunately, I don't think there's any other way to implement background
merging for all Lucy target hosts besides the multiple-process approach.  Lucy
will never work with Perl ithreads.

PS: FYI, your messages today have premature line-wrapping issues -- your
original text, not just the quotes.

Marvin Humphrey


Re: real time updates

Posted by Michael McCandless <lu...@mikemccandless.com>.
Marvin Humphrey wrote:

>> Lucene also has a blip, but it's different because Lucene will  
>> still accept
>> added/deleted documents; but, one cannot reopen a new realtime  
>> (LUCENE-1516)
>> reader during the blip.
>
> The consolidator process has to block while carrying forward  
> deletes, because
> otherwise new deletions may get dropped.
>
> If seg_2 is getting merged away and a new writer adds deletions  
> against seg_2
> that the consolidator never sees, then once the consolidator  
> finishes, those
> deletes will vanish without a trace and the "deleted" document will  
> suddenly
> reappear in the newly consolidated segment.

Right.  I guess it's because Lucene buffers up deletes that it can  
continue
to accept adds & deletes even during the blip.  But it cannot write a  
new
segment (materialize the adds & deletes) during the blip.

>> Hang on: does your writer process hold onto the write lock the whole
>> time it's open?  Or it only grabs it when it needs to commit a  
>> change?
>
> The consolidator grabs consolidate.lock as soon as it launches.   
> While it's
> working in the background (so to speak), write processes continually
> grab
> and release write.lock.  At the very end of the consolidation  
> process, the
> consolidator grabs write.lock so that it can carry forward recent  
> deletions --
> but hopefully that doesn't take very long.

OK.  Does this mean you can run multiple writers against the same index,
to gain concurrency?  (Though... that's tricky, with deletes; oh maybe  
because
you store new deletes for an old segment along with the new segment  
that's
OK?  Hmm, it still seems like you'd have a staleness problem).

> Unfortunately, we have an annoying IPC issue to deal with.  (Lucene
> wouldn't
> have this problem.)  When it's time for the consolidator to grab  
> write.lock,
> it will try to obtain it once per second for X seconds, sleeping in  
> between.
> But if index mods are flying fast and furious, write processes may  
> continually
> cut in front and the consolidator may have difficulty obtaining the
> write.lock.
>
> We'd like to be able to signal the waiting consolidator process when  
> a write
> process finishes up so that it can try for write.lock right away,  
> but AFAIK
> there's no portable way to communicate that from one process to  
> another.
> Probably the only workaround is to add yet another lock file, e.g.
> consolidator_is_waiting.lock, that blocks further write processes.   
> Yuck.  We
> may also want to have the consolidator try more often than once per  
> second.


Ugh, lock starvation.  Really the OS should provide a FIFO lock queue of
some sort.

Mike

Re: real time updates

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Sun, Mar 15, 2009 at 06:22:42PM -0400, Michael McCandless wrote:
> You basically split the index segments into 2 sets; the first set, only
> consolidator can change; the 2nd set, only your  primary writer can change.

Exactly.

> Lucene also has a blip, but it's different because Lucene will still accept
> added/deleted documents; but, one cannot reopen a new realtime (LUCENE-1516)
> reader during the blip.

The consolidator process has to block while carrying forward deletes, because
otherwise new deletions may get dropped.  

If seg_2 is getting merged away and a new writer adds deletions against seg_2
that the consolidator never sees, then once the consolidator finishes, those
deletes will vanish without a trace and the "deleted" document will suddenly
reappear in the newly consolidated segment.

> Hang on: does your writer process hold onto the write lock the whole
> time it's open?  Or it only grabs it when it needs to commit a change?

The consolidator grabs consolidate.lock as soon as it launches.  While it's
working in the background (so to speak), write processes continually grab
and release write.lock.  At the very end of the consolidation process, the
consolidator grabs write.lock so that it can carry forward recent deletions --
but hopefully that doesn't take very long.

Unfortunately, we have an annoying IPC issue to deal with.  (Lucene wouldn't
have this problem.)  When it's time for the consolidator to grab write.lock,
it will try to obtain it once per second for X seconds, sleeping in between.
But if index mods are flying fast and furious, write processes may continually
cut in front and the consolidator may have difficulty obtaining the
write.lock.

We'd like to be able to signal the waiting consolidator process when a write
process finishes up so that it can try for write.lock right away, but AFAIK
there's no portable way to communicate that from one process to another.
Probably the only workaround is to add yet another lock file, e.g.
consolidator_is_waiting.lock, that blocks further write processes.  Yuck.  We
may also want to have the consolidator try more often than once per second.

Marvin Humphrey


Re: real time updates

Posted by Michael McCandless <lu...@mikemccandless.com>.
Marvin Humphrey wrote:

>>> These processes are forbidden from merging any files that pre-date  
>>> the
>>> establishment of "consolidate.lock".
>>
>> Why?  It seems like it needs to merge segments created before it  
>> acquired
>> that lock (that's why it was launched).
>
> I was unclear.  By "these processes", I meant write processes  
> launched while
> the consolidation process is active.

OK makes sense.  You basically split the index segments into 2 sets;
the first set, only consolidator can change; the 2nd set, only your  
primary writer
can change.

>>> It finishes that task, commits, releases "write.lock", releases
>>> "consolidate.lock",then exits.
>
>> That, and update the master "segments" file to actually record the  
>> merge,
>> and incRef/decRef to delete files.
>
> By "commits", I was referring to updating the master file; I  
> probably should
> have used more precise language.  You're right that I'd left off file
> deletion.

OK

>> But, what if while a large merge is happening, and enough segments  
>> have
>> been written to warrant a small merge to kick off & finish?
>
> That should work OK.  Write processes launched during consolidation  
> will be
> allowed to perform any mods or merges they like on segments that
> were
> written *after* the consolidator grabbed consolidate.lock.  They
> just
> can't touch stuff that the consolidator might be operating on.

OK, so writer can choose to do merging if it wants (I was under the  
impression
only consolidator could).

> The only thing I'm concerned about is the time that it takes to  
> carry recent
> deletions against merged segments forward so that they apply against  
> the new,
> consolidated segment.  The consolidator process has to obtain the  
> write.lock
> for that, blocking any new mods to the index.  Hopefully that  
> doesn't cause
> too big a blip.

Ahh, right.  Lucene also has a blip, but it's different because Lucene
will still accept added/deleted documents; but, one cannot reopen
a new realtime (LUCENE-1516) reader during the blip.

Hang on: does your writer process hold onto the write lock the whole
time it's open?  Or it only grabs it when it needs to commit a change?

Mike

Re: real time updates

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Sun, Mar 15, 2009 at 05:44:43PM -0400, Michael McCandless wrote:

> So eg you could merge 2 sets of segments at once (like Lucene)?

Yes.  

>> These processes are forbidden from merging any files that pre-date the
>> establishment of "consolidate.lock".
> 
> Why?  It seems like it needs to merge segments created before it acquired
> that lock (that's why it was launched).

I was unclear.  By "these processes", I meant write processes launched while
the consolidation process is active.  

>> It finishes that task, commits, releases "write.lock", releases
>> "consolidate.lock",then exits.

> That, and update the master "segments" file to actually record the merge,
> and incRef/decRef to delete files.

By "commits", I was referring to updating the master file; I probably should
have used more precise language.  You're right that I'd left off file
deletion.

> But, what if while a large merge is happening, and enough segments have
> been written to warrant a small merge to kick off & finish?

That should work OK.  Write processes launched during consolidation will be
allowed to perform any mods or merges they like on segments that were
written *after* the consolidator grabbed consolidate.lock.  They just
can't touch stuff that the consolidator might be operating on.

The only thing I'm concerned about is the time that it takes to carry recent
deletions against merged segments forward so that they apply against the new,
consolidated segment.  The consolidator process has to obtain the write.lock
for that, blocking any new mods to the index.  Hopefully that doesn't cause
too big a blip.

Marvin Humphrey


Re: real time updates

Posted by Michael McCandless <lu...@mikemccandless.com>.
Marvin Humphrey wrote:

> On Sat, Mar 14, 2009 at 05:51:43AM -0400, Michael McCandless wrote:
>> Even w/ background merging, which allows new segments to be written &
>> reopened in a reader even while the big merge is running in the BG,
>> Lucene still has the challenge of warming a reader on the [large]
>> newly merged segment before using the reader "for real".
>
> Lucy doesn't have to worry about the warming aspect; given  
> sufficient RAM, all
> the files in the recently written segment will still be "hot" in the  
> OS file
> cache.
>
> The trick we need to master is the coordination of two concurrent  
> write
> processes.  I think it goes something like this:
>
>  * The background consolidator writer grabs "consolidate.lock".  It  
> starts
>    writing its own segment based on the state of the index at that  
> moment.
>  * Meanwhile, an indeterminate number of consolidator-aware write  
> processes
>    launch and complete.

So eg you could merge 2 sets of segments at once (like Lucene)?

> These processes are forbidden from merging any files
>  that pre-date the establishment of "consolidate.lock".

Why?  It seems like it needs to merge segments created before it  
acquired
that lock (that's why it was launched).

>  * Once the consolidator finishes most of what it's doing, it waits  
> to obtain
>    a write lock.  The only task left is to carry forward new  
> deletions which
>    have been made since the establishment of "consolidate.lock"  
> against the
>    segments which the consolidator has just merged away.  It  
> finishes that
>    task, commits, releases "write.lock", releases  
> "consolidate.lock",then
>    exits.

That, and update the master "segments" file to actually record the  
merge, and
incRef/decRef to delete files.

> Does that sound similar to the Lucene implementation?

Yes.

But, what if while a large merge is happening, and enough segments have
been written to warrant a small merge to kick off & finish?

>> We need an incremental copy-on-write solution (eg only the "page"  
>> that's
>> changed gets copied when a new deletion arrives).  We need this for
>> changes
>> to norms too.
>
> Norms, huh?  That's weird.  Do you have to do that because a field  
> definition
> has been modified?

No, it's to handle someone calling IndexReader.setNorm, eg if they are
doing "realtime boosting".

>
>> But then does deletions-seg_2.bv contain all deletes for seg_2?   
>> In  which
>> case this is just like the "generation" Lucene increments & tacks  
>> on  when
>> it saves a del; just a different naming scheme.
>
> That's right, it's just a different naming scheme.  In fact, it's  
> marginally
> less efficient because the bit vector must be copied a little more  
> often.
>
> However, with that change, segment directories are truly never  
> modified once
> written.  For somewhat esoteric reasons, that made it easier to  
> factor a
> sensible DeletionsWriter out of the existing KinoSearch indexing  
> code so that
> we could plug in alternative implementations.

OK.

Mike

Re: real time updates

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Sat, Mar 14, 2009 at 05:51:43AM -0400, Michael McCandless wrote:
> Even w/ background merging, which allows new segments to be written &
> reopened in a reader even while the big merge is running in the BG,
> Lucene still has the challenge of warming a reader on the [large]
> newly merged segment before using the reader "for real".  

Lucy doesn't have to worry about the warming aspect; given sufficient RAM, all
the files in the recently written segment will still be "hot" in the OS file
cache.
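
To make that concrete, a minimal sketch of mapping a segment file read-only
(the error handling is thinner than real code would want):

    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Map a segment file into memory; returns NULL on failure. */
    const unsigned char*
    map_index_file(const char *path, size_t *len)
    {
        struct stat st;
        void *base;
        int fd = open(path, O_RDONLY);
        if (fd < 0) { return NULL; }
        if (fstat(fd, &st) != 0) { close(fd); return NULL; }
        base = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        close(fd);  /* the mapping survives the close */
        if (base == MAP_FAILED) { return NULL; }
        *len = (size_t)st.st_size;
        return (const unsigned char*)base;
    }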

The trick we need to master is the coordination of two concurrent write
processes.  I think it goes something like this:
 
  * The background consolidator writer grabs "consolidate.lock".  It starts  
    writing its own segment based on the state of the index at that moment.
  * Meanwhile, an indeterminate number of consolidator-aware write processes
    launch and complete. These processes are forbidden from merging any files
    that pre-date the establishment of "consolidate.lock".
  * Once the consolidator finishes most of what it's doing, it waits to obtain
    a write lock.  The only task left is to carry forward new deletions which
    have been made since the establishment of "consolidate.lock" against the
    segments which the consolidator has just merged away.  It finishes that
    task, commits, releases "write.lock", releases "consolidate.lock", then
    exits.

Does that sound similar to the Lucene implementation?

> We need an incremental copy-on-write solution (eg only the "page" that's
> changed gets copied when a new deletion arrives).  We need this for changes
> to norms too.

Norms, huh?  That's weird.  Do you have to do that because a field definition
has been modified?

> But then does deletions-seg_2.bv contain all deletes for seg_2?  In  which
> case this is just like the "generation" Lucene increments & tacks on  when
> it saves a del; just a different naming scheme.  

That's right, it's just a different naming scheme.  In fact, it's marginally
less efficient because the bit vector must be copied a little more often.

However, with that change, segment directories are truly never modified once
written.  For somewhat esoteric reasons, that made it easier to factor a
sensible DeletionsWriter out of the existing KinoSearch indexing code so that
we could plug in alternative implementations.  

Marvin Humphrey

Re: real time updates

Posted by Michael McCandless <lu...@mikemccandless.com>.
Marvin Humphrey wrote:

> On Fri, Mar 13, 2009 at 12:48:11PM -0700, Nathan Kurz wrote:
>> [moved from private to lucy-dev in case others are interested]
>>> KS has a slow best-case write speed, but I'm not worried about  
>>> that.  The
>>> problem me and the Lucene folks are trying to address under the  
>>> topic
>>> heading of "real-time indexing" is *worst-case* write speed: most  
>>> of the
>>> time you're fine, but every once in a while you trigger a large  
>>> merge and
>>> you wait a loooooong time.  That problem has a different solution.
>>
>> What is the actual case that would trigger such a problem?
>
> The "big merge" problem is easy to trigger: repeatedly open an  
> indexer, add a
> small number of documents, and close it.  Each time you do that, a  
> new segment
> is added, and search-time performance degrades.  Every once in a  
> while, the
> segment merging algorithm decides that it needs to perform a big
> consolidation, and you have to wait while it finishes.
>
> Solving the big-merge problem requires setting up a background  
> consolidation
> routine that allows new segments to be written while it's running.   
> We'll need
> to make some mods to write locking, commit locking, segment naming,  
> snapshot
> file naming, and merging policies; I'm currently working on an  
> IndexManager
> class which bundles all of those concerns together.
>
> Lucene does background merging already using a single process with  
> multiple
> threads.  I'm thinking that we'd try a background process approach  
> instead, so
> that additional write processes could be launched which add new  
> segments while
> the background merger churns.

Even w/ background merging, which allows new segments to be written &
reopened in a reader even while the big merge is running in the BG,
Lucene still has the challenge of warming a reader on the [large]
newly merged segment before using the reader "for real".  Such warming
cannot block newly added segments, either.

I think for Lucene the solution is simple: don't allow the merge to be
committed until a reader has been warmed on that segment (assuming
you've opened a reader from IndexWriter).

> We also have a second major problem: the amount of time it takes to  
> load
> certain caches in IndexReader from disk into process memory --  
> especially sort
> caches.  This one we fix by 1) designing index file formats that can
> be
> mmap'd, and 2) implementing "segment-centric sorted search", which  
> uses
> per-segment mmap'd caches rather than a per-index cache which has to  
> be
> rebuilt from scratch in process memory each time the index is  
> modified.

Lucene (trunk) is now doing segment-centric sort caches, but still
since we load the full FieldCache into RAM (slowly -- by uninverting
the field), warming is still costly.  But as long as we "background"
the warming, it's relatively OK.

> Lastly, we have a problem regarding writing out deletions, minor in  
> comparison
> to the other two but still worth thinking over: the size of the  
> deletions files
> we write out scales with the size of the index rather than number of  
> deletions
> we make.  If you delete a single document, but that document happens  
> to reside
> in a segment with 1 million documents, we have to write out a new 1  
> MB bit
> vector file.
>
> I thought we had that licked with the "tombstones" approach, but the
> performance of iterated deletions turns out to be worse than  
> expected, even
> before we get a priority queue involved to merge multiple tombstone  
> rows.

With Lucene we intend to carry the deletions in RAM, not through disk;
this mean we can leave our on-disk format as the "relatively"
inefficient write-the-whole-vector-out-each-time approach we have
now.

Still, we have the same challenge in RAM: how to make an efficient
transactional bit set so that if there are N readers all opened after
different add/deletes, we don't need a full 1 MB bit set for each of
them.  We need an incremental copy-on-write solution (eg only the
"page" that's change gets copied when a new deletion arrives).  We
need this for changes to norms too.  Some discussions have happened
here:

   https://issues.apache.org/jira/browse/LUCENE-1526
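
For what it's worth, here's the shape of the idea -- a paged bit set where
a snapshot shares every page and a delete clones only the page it touches.
Illustrative only (not the data structure from that issue); page freeing
and malloc error handling are elided:

    #include <stdlib.h>
    #include <string.h>

    #define PAGE_BYTES 4096  /* 32768 docs per page */

    typedef struct {
        unsigned char **pages;
        size_t          n_pages;
    } DelSet;

    /* New snapshot sharing every page with `base`. */
    DelSet
    delset_snapshot(const DelSet *base)
    {
        DelSet snap;
        snap.n_pages = base->n_pages;
        snap.pages   = malloc(snap.n_pages * sizeof(unsigned char*));
        memcpy(snap.pages, base->pages,
               snap.n_pages * sizeof(unsigned char*));
        return snap;
    }

    /* Mark a doc deleted, cloning only its page on first write. */
    void
    delset_delete(DelSet *set, const DelSet *base, size_t doc_num)
    {
        size_t page_num = doc_num / (PAGE_BYTES * 8);
        size_t bit      = doc_num % (PAGE_BYTES * 8);
        if (set->pages[page_num] == base->pages[page_num]) {
            unsigned char *clone = malloc(PAGE_BYTES);
            memcpy(clone, base->pages[page_num], PAGE_BYTES);
            set->pages[page_num] = clone;
        }
        set->pages[page_num][bit >> 3] |= (unsigned char)(1u << (bit & 0x7));
    }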

>> Would the equivalent of row-level-locking allow you to
>> modify the index files in real time, instead of adding addenda and
>> tombstones?   I'm not necessarily suggesting that the default index
>> format should do this, but that it might be worth considering whether
>> a proposed API would support such a real-time format.
>
> I absolutely agree that deletions support should be pluggable.
>
> How does this interface look?
>
>  package MyDelWriter;
>  use base qw( Lucy::Index::DeletionsWriter );
>
>  ...
>
>  package MyDelReader;
>  use base qw( Lucy::Index::DeletionsReader );
>
>  ...
>
>  package MyArchitecture;
>  use base qw( Lucy::Architecture );
>
>  sub make_deletions_writer {
>    shift;
>    return MyDelWriter->new(@_);
>  }
>
>  sub make_deletions_reader {
>    shift;
>    return MyDelReader->new(@_);
>  }
>
>  package MySchema;
>  use base qw( Lucy::Schema );
>
>  sub make_architecture {
>    return MyArchitecture->new;
>  }
>
> That's the interface I've worked into KS over the last few weeks.
>
> The default deletions implementation uses BitVecDelWriter/ 
> BitVecDelReader.  It
> behaves slightly differently from Lucene and old KS; new bit vector  
> files
> belong to new segments but reference old ones (so "seg_8/deletions- 
> seg_2.bv"
> contains deletions which should be applied against segment 2).

But then does deletions-seg_2.bv contain all deletes for seg_2?  In  
which
case this is just like the "generation" Lucene increments & tacks on  
when it
saves a del; just a different naming scheme.  Or is it somehow  
incremental?

Mike

Pluggable IndexReader (was "real time updates")

Posted by Marvin Humphrey <ma...@rectangular.com>.
I wrote:

> Without Lexicon(), sort caches, and the like, IndexReader becomes a
> very generic class.  So generic that it's not really useful?  I don't know.
> 
> We're going to have to publish a class that looks like IndexReader --
> Doc_Freq, Lexicon, Doc_Max, etc -- in order to support the segmented inverted
> index engine.  It should probably be named "IndexReader".  :)  Do you think
> there ought to be a more generic super class above that one?

How about stripping down IndexReader and turning SegReader into a bucket of
components?

Remove Lexicon(), PostingList(), Deletions(), Doc_Freq(), and most other
methods from IndexReader.  The DocReader, PostingsReader, DeletionsReader,
LexReader, and TermVectorsReader members of SegReader, which implement those
methods, all get the heave ho.

What's left:

   Segment  *segment;
   Snapshot *snapshot;
   Schema   *schema;
   Folder   *folder;
   Hash     *components; /* <------ !!! */

Here's the new sequence for fetching a PostingList:

  PostingsReader *postings_reader 
    = SegReader_Fetch_Component(seg_reader, "postings");
  PostingList *plist = postings_reader
    ? PostReader_Posting_List(postings_reader, field, term)
    : NULL;

Doc_Freq() would also require an extra step:

  LexReader *lex_reader
    = SegReader_Fetch_Component(seg_reader, "lexicon");
  u32_t doc_freq = lex_reader
    ? LexReader_Doc_Freq(lex_reader, field, term)
    : 0;

Going this direction would make IndexReader less user-friendly but more
pluggable.

I think we're going to need something along the lines of Fetch_Component()
anyway.  I can't think of another way to add arbitrary components.

Say the user has supplied an RTreeQuery to Searcher:

  Hits *hits = Searcher_Hits(searcher, rtree_query);

At some point, the RTreeQuery is going to need data supplied by an RTreeReader
so that it can compile an RTreeScorer.  However, IndexReader doesn't define an
interface for accessing R-tree data.  

In theory, the LucyX::RTree package could supply a
LucyX::RTree::RTreeIndexReader class which extends IndexReader -- but that's
cumbersome and won't work well if you want to plug in more than one component.
In contrast, the hash-based Fetch_Component() scheme allows us to extend
SegReader without subclassing anything other than Architecture.
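As a rough sketch in the same pseudocode style, here is how a plugin Query
might obtain its companion reader.  RTreeReader, RTreeScorer_new, and the
"rtree" component key are all invented names:

  /* Compile a Scorer for one segment.  Returns NULL when the index wasn't
   * built with the R-tree component plugged in. */
  Scorer*
  RTreeQuery_Make_Scorer(RTreeQuery *self, SegReader *seg_reader) {
      RTreeReader *rtree_reader
          = (RTreeReader*)SegReader_Fetch_Component(seg_reader, "rtree");
      return rtree_reader
          ? (Scorer*)RTreeScorer_new(self, rtree_reader)
          : NULL;
  }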

Strictly speaking, we don't *have* to strip down IndexReader if we add
Fetch_Component().  Having core classes interact with IndexReader exclusively
through the Fetch_Component interface wouldn't be a problem, but it's awkward
for users who might want to interact with IndexReader directly.  
    
But how many people really need that, and what's their profile?  The majority
of users will just go through the Searcher interface; only power-users need
IndexReader, and they should be able to handle going through Fetch_Component()
to deal with the plugins directly.

There are details to work out regarding conflating data from multiple
segments, but that's the gist.

Marvin Humphrey


Re: real time updates

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Mon, Mar 16, 2009 at 12:09:43AM -0700, Nathan Kurz wrote:

> > Since Lucy's internal doc numbers will be ephemeral, they wouldn't work.
> > We'd need to add a primary key field.
> 
> While I understand this is mostly true, I had assumed that it was
> possible to create some synthetic primary key, say, 'segment - doc'.

We have ephemeral, private primary keys, if you want to think about it that
way.  For a given Snapshot/PolyReader, we order the segments by age, laying
them end to end.  If we had 3 segments with exactly 100 docs each, then from
the perspective of the PolyReader, there is a document number 150 which
belongs to the second segment.

However, once the first, oldest segment in that group gets merged away, the
second segment moves leftward in the array, and what was formerly document
number 150 becomes number 50. 
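A minimal sketch of that mapping, with invented names standing in for whatever
PolyReader actually stores:

  #include <stdint.h>

  /* Map a PolyReader-level doc number onto (segment index, local doc num).
   * seg_doc_counts holds the doc count of each segment, oldest first. */
  int32_t
  poly_to_local(const int32_t *seg_doc_counts, int32_t num_segs,
                int32_t doc_num, int32_t *seg_out) {
      int32_t offset = 0;
      for (int32_t i = 0; i < num_segs; i++) {
          int32_t next = offset + seg_doc_counts[i];
          if (doc_num < next) {
              *seg_out = i;
              return doc_num - offset;  /* doc number within that segment */
          }
          offset = next;
      }
      return -1;                        /* doc_num out of range */
  }

With counts {100, 100, 100}, global doc 150 resolves to local doc 50 of the
second segment; drop the first segment from the array and the same physical
document reappears as global doc 50.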

Is that good enough for your purposes?  I don't think it is.  You can't use a
Lucy/KS document number to recall a specific document over time from a naive
external caller.

> Yes, I'm suggesting that updating each and every one of those posting lists
> is both feasible and possibly the best approach.

That sounds kind of crazy to me -- but in my experience awesome sometimes
hides behind what I think must be crazy, so let's stumble onwards. :)

The problem is making space for the inserted postings.  Say you were to start
off a book's index with only the first chapter.  As you add chapters, that
index needs to grow.  To make room for each inserted posting, you're going to
have to shift stuff rightwards continuously in the postings file.  Ultimately,
it's a lot more work doing that incrementally than just writing out a new
file.  Even a single move is expensive.

Other data structures are optimized for insertions (e.g. B+ tree), but even so
you end up with a lot of churn.
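A sketch of the cost, assuming the postings live in one packed, contiguous
buffer (a simplification of any real file format; capacity checks omitted):

  #include <stdint.h>
  #include <string.h>

  /* Insert one posting entry into a packed buffer.  Everything after the
   * insertion point must shift right, so a single insert near the front of
   * a large postings region costs O(bytes_used) -- which is why writing a
   * fresh file wins. */
  void
  insert_posting(uint8_t *buf, size_t bytes_used, size_t insert_at,
                 const uint8_t *entry, size_t entry_len) {
      memmove(buf + insert_at + entry_len,  /* dest: tail, shifted right */
              buf + insert_at,              /* src: everything after the gap */
              bytes_used - insert_at);
      memcpy(buf + insert_at, entry, entry_len);
  }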

> It only becomes impossible if the update rate exceeds the available memory
> bandwidth, which for a modern processor is mighty high.  Given a reasonable
> read/write ratio, I think it's a win, at least for the cases I'm interested
> in.  I'm not suggesting you move to such an approach now, but it's a route I
> strongly want left open until conclusively proven to be unworkable.

Well, it would be great if we could keep the basic Indexer class as lean and 
mean as possible.  Then implementing an alternative which supports update
semantics rather than just delete/add is super simple.

> > Getting back to the point... I think IndexReader has to have Doc_Max() and
> > Doc_Freq().  How do we avoid those and still support TF/IDF?
> 
> You are right that we need these, but it's just a question of where
> the support should go.  As a non-OO thinker, I view these as being
> data members within some standardized Posting struct.  Thus I'd guess
> that these should be part of the Posting object, accessible only
> during the Scoring phase, rather than on the IndexReader.   But likely
> I'm not understanding you here.

We need to know Doc_Freq() and Doc_Max() before scoring commences, because
they are used to calculate a float multiplier that's applied against the score
of each hit.  Rarer terms get a higher multiplier.  

Furthermore, IDF is calculated using Doc_Freq for the entire collection, which
may span several searchers in a cluster, while Scorer and Posting are local
only and so don't know enough about the whole collection.

In contrast, Term Frequency (the "TF" in TF/IDF) is per-document, and is
stored in the postings files as per your impression.
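As a concrete illustration, here's the classic IDF multiplier computed from
those two numbers before scoring starts -- the textbook Lucene-style formula,
offered as a sketch rather than a statement of exactly what KS computes:

  #include <math.h>
  #include <stdint.h>

  /* Inverse Document Frequency: rarer terms (low doc_freq relative to
   * doc_max) earn a larger score multiplier.  The +1 guards against
   * division by zero for terms that match no documents. */
  float
  idf_weight(uint32_t doc_freq, uint32_t doc_max) {
      return 1.0f + logf((float)doc_max / (float)(doc_freq + 1));
  }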

> > It's also hard to imagine life without Lexicon(), or support for sort caches.
> 
> Again, it's just a question of where these should go.  While you view
> the Lexicon as being part of the Index (and commented appropriately on
> the overloading of this term), I view them as independent.  

Without Lexicon(), sort caches, and the like, IndexReader becomes a
very generic class.  So generic that it's not really useful?  I don't know.

We're going to have to publish a class that looks like IndexReader --
Doc_Freq, Lexicon, Doc_Max, etc -- in order to support the segmented inverted
index engine.  It should probably be named "IndexReader".  :)  Do you think
there ought to be a more generic super class above that one?

> I can envision cases where different inverted indexes might share a common
> lexicon in the same way that segments currently share one, perhaps even with
> the Lexicon being generated and stored by an unrelated app (say, for user
> applied tags).

Lexicon is an abstract class, and SegReader is pluggable.  If you override
Arch_Make_Lex_Reader, you can supply an implementation of LexReader that
reads from an external app.

> > If you were to write an Architecture class to support pluggability, what would
> > it look like?
> 
> I guess that's what I'm hoping to answer.  I don't currently know.  My
> instinct is that it might be best to treat the problem as one of data
> exchange between levels rather than pluggability, but that might just
> be my poor OO intuition.

It'll probably be easier to offer a critique once I can provide links to HTML
documentation for the entire pluggability apparatus: SegWriter, SegReader,
Segment, Snapshot, DataReader, DataWriter, and various subclasses of
DataWriter/DataReader like BitVecDelWriter, HashRowWriter (the default doc
writer), etc.

I think what you will object to about the current implementation of
Architecture is that it is dedicated to plugins for SegWriter and SegReader,
and is thus segment-centric, making it insufficiently generic.  However, I'm
not sure how to write something that's more generic that would still be
useful.

Marvin Humphrey


Re: real time updates

Posted by Nathan Kurz <na...@verse.com>.
On Sat, Mar 14, 2009 at 5:41 PM, Marvin Humphrey <ma...@rectangular.com> wrote:
> That sounds like a fun exercise.  Let's start from a clean slate, and try to
> build up interfaces for Indexer and Searchable.  Just so that it's clear that
> nothing final is being decided, let's call our experimental project "Luser".

Thanks for playing along!  I think this could be a very useful thought
experiment, at least for me.

> Since Lucy's internal
> doc numbers will be ephemeral, they wouldn't work.  We'd need to add a primary
> key field.

While I understand this is mostly true, I had assumed that it was
possible to create some synthetic primary key, say, 'segment - doc'.
If not, how does one know which record to mark for deletion with a
delete/add style of update?  Does one save some unique identifier as a
field value, and then search for it?

> The very nature of an inverted index is that a single document gets broken up
> and listed in many posting lists.  Updating all those posting lists is
> basically impossible without rebuilding the index.

While I routinely mangle everything related to the Lucy object
hierarchy, the physical layout of the inverted index is one area where
I feel comfortable.  Yes, I'm suggesting that updating each and every
one of those posting lists is both feasible and possibly the best
approach.  It only becomes impossible if the update rate exceeds the
available memory bandwidth, which for a modern processor is mighty
high.  Given a reasonable read/write ratio, I think it's a win, at
least for the cases I'm interested in.  I'm not suggesting you move to
such an approach now, but it's a route I strongly want left open until
conclusively proven to be unworkable.

> Getting back to the point... I think IndexReader has to have Doc_Max() and
> Doc_Freq().  How do we avoid those and still support TF/IDF?

You are right that we need these, but it's just a question of where
the support should go.  As a non-OO thinker, I view these as being
data members within some standardized Posting struct.  Thus I'd guess
that these should be part of the Posting object, accessible only
during the Scoring phase, rather than on the IndexReader.   But likely
I'm not understanding you here.

> It's also hard to imagine life without Lexicon(), or support for sort caches.

Again, it's just a question of where these should go.  While you view
the Lexicon as being part of the Index (and commented appropriately on
the overloading of this term), I view them as independent.  I can envision
cases where different inverted indexes might share a common lexicon in
the same way that segments currently share one, perhaps even with the
Lexicon being generated and stored by an unrelated app (say, for user
applied tags).

> If you were to write an Architecture class to support pluggability, what would
> it look like?

I guess that's what I'm hoping to answer.  I don't currently know.  My
instinct is that it might be best to treat the problem as one of data
exchange between levels rather than pluggability, but that might just
be my poor OO intuition.

Nathan Kurz
nate@verse.com

Re: real time updates

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Sat, Mar 14, 2009 at 12:21:10PM -0700, Nathan Kurz wrote:
> For indexing, I'd love to see the same agnostic behaviour.  The
> Indexer knows only about a single function like
> UpdatePosting(docID, newPostings).

While this interface tries hard to be "agnostic" and highly abstract, it in
fact imposes a requirement that neither Lucene nor KinoSearch nor SQLite could
satisfy by default: "docID" has to be a primary key.  Since Lucy's internal
doc numbers will be ephemeral, they wouldn't work.  We'd need to add a primary
key field.

> My canonical example for this is that I want to be able to store my
> index in SQLite, and write a thin layer of interface between it and
> the rest of Lucy.  But my real desire is to substitute a custom mmap()
> solution such as the fast graph database referenced earlier.

That sounds like a fun exercise.  Let's start from a clean slate, and try to
build up interfaces for Indexer and Searchable.  Just so that it's clear that
nothing final is being decided, let's call our experimental project "Luser".

Our Luser APIs must support two engines:

  * A segmented inverted index back end.
  * An SQLite back end.

We will assume the following:

  * A Doc class that supports multiple fields.
  * An opaque Hits class.
  * A Query class that somehow compiles down to a Scorer given access
    to a Searchable.
  * Global field semantics, supported by an opaque Schema class.
  * An Engine class which does all the work.

I'm going to use a slightly simplified version of the current ".bp" syntax as
pseudocode.

Let's start with Indexer:

  public class Luser::Index::Indexer extends Luser::Obj {
    public Indexer*
    new(Indexer *self, Schema *schema, Engine *engine);

    public void
    Add_Doc(Indexer *self, Doc *doc);

    public void
    Delete_By_Term(Indexer *self, const CharBuf *field, Obj *term);

    public void
    Commit(Indexer *self);
  }

I'm dissatisfied with that constructor, and we're leaving off Prepare_Commit,
the destructor and a bunch of other important methods... but let's just stumble
past all that.

I'd like to add this to Indexer:

    public void
    Replace_Doc(Indexer *self, const CharBuf *field, Obj *term, Doc *doc);

"Replace_Doc" is superior to "Update_Doc" because the method name informs the
user that they have to supply the entire document and not just the updated
fields.  

Unfortunately, absent a primary key constraint they both suck, because it's
not clear whether you'll end up replacing many docs with one.  
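Spelled out with the Luser method names from above, Replace_Doc presumably
reduces to delete-then-add -- which is exactly where the ambiguity lives (the
reduction is my assumption, not settled API):

  void
  Indexer_Replace_Doc(Indexer *self, const CharBuf *field, Obj *term,
                      Doc *doc) {
      /* Deletes *every* doc matching <field, term>, then adds one doc.
       * Without a primary key constraint there's no guarantee of a
       * one-for-one swap. */
      Indexer_Delete_By_Term(self, field, term);
      Indexer_Add_Doc(self, doc);
  }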

I think the only way to work in update semantics is to put them in an
IndexUpdater class that forces you to specify a primary key.  Otherwise we'll
end up fielding waaaay too many questions from confused noobs on the user
list.

Next up, Searchable.

  public abstract class Luser::Search::Searchable extends Luser::Obj {
    public Searchable*
    init(Searchable *self, Schema *schema);

    public Hits*
    Hits(Searchable *self, Query *query, SortSpec *sort_spec = NULL,
         u32_t offset = 0, u32_t wanted = 10);

    public abstract void
    Collect(Searchable *self, Query *query, HitCollector *collector);

    public abstract Doc*
    Fetch_Doc(Searchable *self, i32_t doc_num);
  }

Hmm.  We need to add Doc_Max and Doc_Freq, otherwise TF/IDF scoring won't
work.  However, they aren't needed by the SQLite back end.

We've run into abstraction difficulties with both Indexer and Searchable.  We
can't reduce our interfaces to a pure intersection without crippling one of
the engines.  

The alternative is to define some abstract methods which wouldn't always be
needed or implemented.  However, Luser can't support every possible engine
without bloating up.  We have to make choices about who we are.

Marvin Humphrey


Re: real time updates

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Sat, Mar 14, 2009 at 12:21:10PM -0700, Nathan Kurz wrote:
> > Every once in a while, the segment merging algorithm decides that it needs
> > to perform a big consolidation, and you have to wait while it finishes.
> 
> Yes, but that's an artifact of the current approach of adding segments rather
> than making real-time replacements.  

Perhaps we can abstract things enough to plug in a completely different
engine, but we're still going to have to solve this problem for the segmented
inverted index engine.

The very nature of an inverted index is that a single document gets broken up
and listed in many posting lists.  Updating all those posting lists is
basically impossible without rebuilding the index.  

Say you were to replace a chapter in a book; now the index at the back of the
book is out of date and needs to be updated.  You can't just swap out a couple
pages in the book's index, because references to the old chapter are
interspersed with references to all the other chapters.  You can go in with a
black marker and cross out all the references to the old chapter, but
inserting references to the new chapter is going to be very difficult.  The
only realistic approach is to tear out the whole index and start from scratch.

At least Lucene/KinoSearch/Lucy don't force you to rebuild the index from
scratch every time you make a change.  Instead they use a segmented approach,
which works pretty well, but has this annoying problem with worst-case write
time...

> I was more wondering whether there is anything inherent about the required
> rate of change that would prevent a fully incremental update from working.

In-place updating would work for deletions: we just give every segment its own
dedicated, permanent deletions bit vector file and flip bits in it.  

That can't be the default behavior because it breaks IndexReader's promise to
provide a point-in-time view of the index.  However, I think it would be fine
to accommodate users who don't need IndexReader consistency by making it
possible to plug in a QuickAndDirtyDeletionsWriter /
QuickAndDirtyDeletionsReader combo.
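The quick-and-dirty variant is about as simple as index mutation gets: flip
the bit directly in a shared mapping of the segment's dedicated deletions
file.  A sketch, with the mmap setup omitted:

  #include <stdint.h>

  /* Flip doc_num's deletion bit in an mmap'd deletions file.  Readers
   * sharing the mapping see the change immediately -- which is precisely
   * why this can't be the default: it forfeits the point-in-time view. */
  void
  quick_and_dirty_delete(uint8_t *mapped_bits, uint32_t doc_num) {
      mapped_bits[doc_num >> 3] |= (uint8_t)(1u << (doc_num & 7));
  }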

In-place updating could also work for subclasses of DocWriter.  Again, it
couldn't be the default implementation, but we could, say, plug in a
FlatFileDocWriter that wrote to a flat file with fixed-length rows;
fixed-length rows are easy to update in place.
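A sketch of why fixed-length rows make that easy -- the byte offset of any
document is just a multiply.  FlatFileDocWriter's storage format is
hypothetical here, and ROW_LEN is an arbitrary choice:

  #include <stdio.h>
  #include <string.h>

  #define ROW_LEN 256  /* bytes per document row */

  /* Overwrite one document's row in place.  Returns 0 on success. */
  int
  flatfile_update_doc(FILE *fh, long doc_num, const char *row_data) {
      char row[ROW_LEN] = { 0 };
      strncpy(row, row_data, ROW_LEN - 1);         /* pad/truncate to fit */
      if (fseek(fh, doc_num * ROW_LEN, SEEK_SET))  /* seek straight to the row */
          return -1;
      return fwrite(row, 1, ROW_LEN, fh) == ROW_LEN ? 0 : -1;
  }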

In theory, in-place updating could even work for PostingsWriter, if your
PostingsWriter didn't write an inverted index but instead wrote something else
entirely.  But again, that couldn't be the default behavior.

> > How does this interface look?
> >
> >  package MyDelWriter;
> >  use base qw( Lucy::Index::DeletionsWriter );
> >  ...
> 
> This feels to me like it is solving the wrong problem.  There's
> nothing wrong with it, but DeletionsWriter and DeletionsReader seem
> like internal implementation details of a particular type of Index.

The Architecture class is meant to help with this problem.  

I've been wrestling with exactly what to put into Architecture.  The default
type of index will be a segmented inverted index, and the factory methods in
Architecture (Make_Deletions_Writer, Make_Doc_Writer, Make_Postings_Writer,
etc.) are designed to accommodate that.

Perhaps we ought to have an InvertedIndexArchitecture which serves as the
default, and have Architecture be an abstract base class which could also fit
well over an SQLite or graph-based back end.  But then what would we put in
Architecture that's common to all of those?  How do we avoid abstracting it so
thoroughly that there's nothing left?  

If you were to write an Architecture class to support pluggability, what would
it look like?

> I'd hope that the interface between a Scorer and an Index could be
> very simple, probably just a single function to get a PostingList.

As an aside... I'd prefer that we never define a core class named "Index".
The word "index" is horrendously overloaded and thus unclear.  Same goes for
"Search".  They're fine as package names, but poor as class names.

Getting back to the point... I think IndexReader has to have Doc_Max() and
Doc_Freq().  How do we avoid those and still support TF/IDF?  It's also hard
to imagine life without Lexicon(), or support for sort caches.

Perhaps an Engine class with a smaller interface as IndexReader's parent?

> That PostingList would provide navigation by docID, but deletions
> would be handled internally and never be seen by the Scorer.

Actually, this is controversial.  We're hoping to see gains by applying
deletions at the Scorer_Collect() level, so PostingList wouldn't know about
them.  However, the matter is unresolved.

Marvin Humphrey



Re: real time updates

Posted by Nathan Kurz <na...@verse.com>.
On Fri, Mar 13, 2009 at 4:03 PM, Marvin Humphrey <ma...@rectangular.com> wrote:
>  Every once in a while, the
> segment merging algorithm decides that it needs to perform a big
> consolidation, and you have to wait while it finishes.

Yes, but that's an artifact of the current approach of adding segments
rather than making real-time replacements.  I was more wondering whether
there is anything inherent about the required rate of change that would
prevent a fully incremental update from working.

If it could be pulled off, I think the advantages are large:  no
degradation due to accumulated changes, and no periodic long merges.
There's also the benefit that any changes written are likely to be hot
in the cache, so no warm up is needed.

> How does this interface look?
>
>  package MyDelWriter;
>  use base qw( Lucy::Index::DeletionsWriter );
>  ...

This feels to me like it is solving the wrong problem.  There's
nothing wrong with it, but DeletionsWriter and DeletionsReader seem
like internal implementation details of a particular type of Index.
Should the callers even have to know about their existence?

I'd hope that the interface between a Scorer and an Index could be
very simple, probably just a single function to get a PostingList.
That PostingList would provide navigation by docID, but deletions
would be handled internally and never be seen by the Scorer.

For indexing, I'd love to see the same agnostic behaviour.  The
Indexer knows only about a single function like
UpdatePosting(docID, newPostings).  Whether this is done internally
via tombstones, real-time updates or carrier pigeon is hidden from the
caller.

So while the interface you propose is probably great for making small
modifications to the current Index, I'd rather it not be part of the
official API that all Index formats must support.  I want each
component to make as few assumptions as possible about the internals
of other components.

My canonical example for this is that I want to be able to store my
index in SQLite, and write a thin layer of interface between it and
the rest of Lucy.  But my real desire is to substitute a custom mmap()
solution such as the fast graph database referenced earlier.

I think the easiest way to make this possible is to reduce the points
of intersection between the components to the simplest set possible.
Instead of specifying a full internal API for each component, specify
(and restrict) only the portions visible to the rest of the
program.

Nathan Kurz
nate@verse.com

Re: real time updates

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Fri, Mar 13, 2009 at 12:48:11PM -0700, Nathan Kurz wrote:
> [moved from private to lucy-dev in case others are interested]
> > KS has a slow best-case write speed, but I'm not worried about that.  The
> > problem the Lucene folks and I are trying to address under the topic
> > heading of "real-time indexing" is *worst-case* write speed: most of the
> > time you're fine, but every once in a while you trigger a large merge and
> > you wait a loooooong time.  That problem has a different solution.
> 
> What is the actual case that would trigger such a problem?  

The "big merge" problem is easy to trigger: repeatedly open an indexer, add a
small number of documents, and close it.  Each time you do that, a new segment
is added, and search-time performance degrades.  Every once in a while, the
segment merging algorithm decides that it needs to perform a big
consolidation, and you have to wait while it finishes.

Solving the big-merge problem requires setting up a background consolidation
routine that allows new segments to be written while it's running.  We'll need
to make some mods to write locking, commit locking, segment naming, snapshot
file naming, and merging policies; I'm currently working on an IndexManager
class which bundles all of those concerns together.

Lucene does background merging already using a single process with multiple
threads.  I'm thinking that we'd try a background process approach instead, so
that additional write processes could be launched which add new segments while
the background merger churns.

We also have a second major problem: the amount of time it takes to load
certain caches in IndexReader from disk into process memory -- especially sort
caches.  This one we bix by 1) designing index file formats that can be
mmap'd, and 2) implementing "segment-centric sorted search", which uses
per-segment mmap'd caches rather than a per-index cache which has to be
rebuilt from scratch in process memory each time the index is modified.
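A sketch of the mmap side of that, assuming a per-segment cache file of packed
32-bit ordinals (the file layout is an assumption):

  #include <fcntl.h>
  #include <stdint.h>
  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <unistd.h>

  /* Map a per-segment sort cache.  Opening a reader costs a page-table
   * setup rather than a read of the whole cache into process memory, and
   * the OS shares the pages across processes. */
  const int32_t*
  map_sort_cache(const char *path, size_t *num_ords) {
      int fd = open(path, O_RDONLY);
      if (fd < 0) { return NULL; }
      struct stat st;
      if (fstat(fd, &st) != 0) { close(fd); return NULL; }
      void *ptr = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
      close(fd);                    /* the mapping outlives the fd */
      if (ptr == MAP_FAILED) { return NULL; }
      *num_ords = (size_t)st.st_size / sizeof(int32_t);
      return (const int32_t*)ptr;
  }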

Lastly, we have a problem regarding writing out deletions, minor in comparison
to the other two but still worth thinking over: the size of the deletions files
we write out scales with the size of the index rather than the number of
deletions we make.  If you delete a single document, but that document happens
to reside in a segment with 1 million documents, we have to write out a new
bit vector file covering all 1 million documents (roughly 125 KB).

I thought we had that licked with the "tombstones" approach, but the
performance of iterated deletions turns out to be worse than expected, even
before we get a priority queue involved to merge multiple tombstone rows.

> Would the equivalent of row-level-locking allow you to
> modify the index files in real time, instead of adding addenda and
> tombstones?   I'm not necessarily suggesting that the default index
> format should do this, but that it might be worth considering whether
> a proposed API would support such a real-time format.

I absolutely agree that deletions support should be pluggable.

How does this interface look?

  package MyDelWriter;
  use base qw( Lucy::Index::DeletionsWriter );

  ...

  package MyDelReader;
  use base qw( Lucy::Index::DeletionsReader );

  ...

  package MyArchitecture;
  use base qw( Lucy::Architecture );

  sub make_deletions_writer { 
    shift;
    return MyDelWriter->new(@_);
  }

  sub make_deletions_reader { 
    shift;
    return MyDelReader->new(@_);
  }

  package MySchema;
  use base qw( Lucy::Schema );

  sub make_architecture {
    return MyArchitecture->new;
  }

That's the interface I've worked into KS over the last few weeks.  

The default deletions implementation uses BitVecDelWriter/BitVecDelReader.  It
behaves slightly differently from Lucene and old KS; new bit vector files
belong to new segments but reference old ones (so "seg_8/deletions-seg_2.bv"
contains deletions which should be applied against segment 2).

I haven't yet tested that API to verify that an alternative implementation
works.  Perhaps a pure-Perl, hash-based MockDeletions class is in order?

Marvin Humphrey


Re: real time updates

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Sat, Mar 14, 2009 at 05:34:02AM -0400, Michael McCandless wrote:

> What is this "sneaky lock" trick?

From the original article:

    Now, the magic happens when a write lock comes in. We don’t want our
    writes or our reads to block on these Very Important Pieces Of Data, so
    what the write lock does is first copy the old data from that record into
    memory and then go through the list of active readers and do a swaperoo on
    the data pointer. The readers are now pointing to an in-memory copy of the
    data, rather than the disk-based version. It can then go on writing and
    moving and generally doing whatever it wants with the disk based copy. The
    lock also keeps a copy of the pre-write version of the data in memory so
    that all readers that come in and try to get a lock on that record get the
    in-memory version until the write is done.

http://blog.directededge.com/2009/02/27/on-building-a-stupidly-fast-graph-database/
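A bare-bones sketch of that swaperoo using C11 atomics; the Record struct is
an invention to illustrate the idea, not code from the article or from Lucy:

  #include <stdatomic.h>
  #include <stdlib.h>
  #include <string.h>

  typedef struct {
      _Atomic(void*) data;  /* readers load and dereference this pointer */
      size_t         size;
  } Record;

  /* Before writing, copy the record's pre-write bytes into memory and
   * atomically repoint readers at the copy.  The writer is then free to
   * scribble on the disk-based version; readers never block. */
  void*
  begin_sneaky_write(Record *rec, void *disk_data) {
      void *copy = malloc(rec->size);
      memcpy(copy, disk_data, rec->size);
      atomic_store(&rec->data, copy);   /* the "swaperoo" */
      return disk_data;                 /* writer's now-private copy */
  }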

The thing I liked was being able to swap out data in the reader at any time no
matter what the reader's present state; that's similar to the CAS-based algo
(CAS = Compare And Swap) Cliff Click describes with regard to
rebuilding/rehashing the lock-free hash table.  

Code for the lock-free hash table:

http://sourceforge.net/projects/high-scale-lib/

Marvin Humphrey

Re: real time updates

Posted by Michael McCandless <lu...@mikemccandless.com>.
Nathan Kurz wrote:

>> Except on NFS, KS doesn't have much in the way of lock contention issues
>> because index files are never modified.  But regardless of its applicability
>> to KS, this "sneaky lock" trick is pretty nice:
>> ...
>> We couldn't do that because we can't reach out across the system to active
>> IndexReaders in other processes, but still: nice.
>
> I realize this, but I'm wondering if a locking approach might be
> preferable.  Would the equivalent of row-level-locking allow you to
> modify the index files in real time, instead of adding addenda and
> tombstones?   I'm not necessarily suggesting that the default index
> format should do this, but that it might be worth considering whether
> a proposed API would support such a real-time format.

What is this "sneaky lock" trick?

Mike