
Posted to dev@lucy.apache.org by "Marvin Humphrey (JIRA)" <ji...@apache.org> on 2009/03/10 19:42:50 UTC

[jira] Created: (LUCY-5) Boilerplater compiler

Boilerplater compiler
---------------------

                 Key: LUCY-5
                 URL: https://issues.apache.org/jira/browse/LUCY-5
             Project: Lucy
          Issue Type: New Feature
          Components: Boilerplater
            Reporter: Marvin Humphrey
            Assignee: Marvin Humphrey


Boilerplater is a small compiler which supports a vtable-based object model.
The output is C code which adheres to the design that Dave Balmain and I
hammered out a while back; the input is a collection of ".bp" header files.

Our original intent was to pepper traditional C ".h" header files with no-op
macros to define each class's interface; the code generator would understand
these macros but the C compiler would ignore them.  C source code files would
then pound-include both the ".h" header and the auxiliary, generated ".bp"
file.

The problem with this approach is that C syntax is too constraining.  Because
C does not support namespacing, every symbol has to be prepended with a prefix
to avoid conflicts.  Furthermore, adding metadata to declarations (such as
default values for arguments, or whether NULL is an acceptable value) is
awkward.  The result is ".h" header files that are excessively verbose,
cumbersome to edit, and challenging to parse visually and to grok.

The solution is to make the ".bp" file the master header file, and write it in
a small, purpose-built, declaration-only language.  The
code-generator/compiler chews this ".bp" file and spits out a single ".h"
header file for pound-inclusion in ".c" source code files.

This isn't really that great a divergence from the original plan.  There's no
fixed point at which a "code generator" becomes a "compiler", and while the
declaration-only header language has a few conventions that core developers
will have to familiarize themselves with, the same was true for the no-op
macro scheme.  Furthermore, the Boilerplater compiler itself is merely an
implementation detail; it is not publicly exposed and thus can be modified at
will.  Users who access Lucy via Perl, Ruby, Java, etc. will never see it.
Even Lucy's C users will never see it, because the public C API itself will be
defined by a lightweight binding and generated documentation.

The important thing for us to focus on is the *output* code generated by
Boilerplater.  We must nail the object model.  It has to be fast.  It has to
live happily as a symbiote within each host.  It has to support callbacks into
the host language, so that users may define custom subclasses and override
methods easily.  It has to present a robust ABI that makes it possible to
recompile an updated core without breaking compiled extensions (like Java,
unlike C++).  

The present implementation of the Boilerplater compiler is a collection of
Perl modules: Boilerplater::Type, Boilerplater::Variable,
Boilerplater::Method, Boilerplater::Class, and so on.  One CPAN module is
required, Parse::RecDescent; however, only core developers will need either
Perl or Parse::RecDescent, since public distributions of Lucy will 
contain pre-generated code.  Some of Boilerplater's modules have kludgy 
internals, but on the whole they seem to do a good job of throwing errors rather 
than failing subtly.

I expect to submit individual Boilerplater modules using JIRA sub-issues which
reference this one, to allow room for adequate commentary.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (LUCY-5) Boilerplater compiler

Posted by "Marvin Humphrey (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCY-5?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682041#action_12682041 ] 

Marvin Humphrey commented on LUCY-5:
------------------------------------

> Would the core default impls for classes like default MergePolicy,
> MergeScheduler, IndexDeletionPolicy, HitCollector, Analyzer,
> Tokenizer, TokenFilter, etc all be implemented in C?

Yes.  Actually, Balmain ultimately persuaded me that the entire core should be
in C.  I'm cool with that since Boilerplater makes pure-LanguageX subclassing
easy.

> OK so it sounds like calling functions/methods fit into various
> categories:

>     * Entirely Lucy internal --> just call the function directly, so
>       normal C compilation handles this.

That's rarely the case.  The only time that a method invocation resolves to a
function address is when the method is declared as "final".  In that case, the
method symbol is just an alias for the function name.  (Boilerplater has
supported "final" methods and classes from very early on.)
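
As a rough sketch of what "alias" means here (the Counter names below are
hypothetical, not actual Boilerplater output), a final method symbol can
simply be a macro for the implementing function, so the call is resolved at
compile time with no vtable lookup:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical class used only for illustration. */
typedef struct Counter {
    int count;
} Counter;

/* The implementing function. */
static int
counter_get(Counter *self)
{
    assert(self != NULL);
    return self->count;
}

/* A "final" method cannot be overridden, so the method symbol can be a
 * plain alias for the function name. */
#define Counter_Get counter_get
```

Invoking Counter_Get(&c) then compiles to an ordinary direct C function call.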

>     * Lucy invokes "dynamically dispatched" API (ie API that could be
>       implemented in the host language, eg when I subclass Analyzer,
>       HitCollector, IndexDeletionPolicy, etc.), but in the current
>       context we are using an object in C and so we bypass the dynamic
>       dispatch. This path remains fast?

Yes, this is the standard way of doing things, and it's plenty fast.  A
function pointer is found by looking it up in the vtable:

{noformat}
    char *const method_address = (char*)self->_ + Lucy_Scorer_Next_OFFSET;
    const lucy_Matcher_next_t method = *((lucy_Matcher_next_t*)method_address);
{noformat}

("self->_" is a pointer to a VTable object, a naming convention I copied from
the Axel Tobias Schreiner book "Object Oriented Programming With ANSI C".
Perhaps "self->vtable" would be better.)

For non-abstract core methods, that function pointer points directly at a core
C function. For abstract methods, or public methods that have been overridden 
in the host, that function pointer points at an auto-generated callback function
that invokes one of the lucy_Host_callback_xxxx routines.

>     * Lucy invokes "dynamically dispatched" API, and in fact its impl in
>       the current context is defined in the host language, so we go
>       through the full dynamic dispatch.

I think you and I are using the term "dynamic dispatch" to mean different
things.  I'm using it in the sense of "resolved at run-time", so any virtual
method qualifies -- including C++ and Java virtual methods, even though C++
and Java aren't typically called "dynamic languages".
[http://en.wikipedia.org/wiki/Dynamic_dispatch]


[jira] Commented: (LUCY-5) Boilerplater compiler

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCY-5?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681691#action_12681691 ] 

Michael McCandless commented on LUCY-5:
---------------------------------------


Thanks for all these details Marvin!  I have a better picture now.

I know there are issues with it, but... have you considered simply
using C++, which has already created OO over C (vtables, etc.)?  Or
are there hopeless problems with its approach for Lucy?

Another question: it seems like you are going to great lengths to
achieve "no recompilation back compatibility".  Meaning, eg, if
someone has built Python bindings to version X of Lucy, and you've
made some otherwise-back-compatible changes to the exposed API and
released version X+1, you'd like for those Python bindings to continue
to work w/o recompilation when someone drops in Lucy X+1 (as a dynamic
library), right?

Is this feature really necessary?  Couldn't you require that the
bindings are rebuilt & recompiled when Lucy X+1 is released?

EG, Lucene just released 2.4.1, and so PyLucene went and regen'd its
bindings (using JCC) and recompiled and [almost] released PyLucene
2.4.1.

{quote}
This causes
severe runtime memory errors when a compiled extension expects to find
a function pointer with a certain signature at a given hard-coded offset, but
finds something unexpected and incompatible there instead. However, if we
store the offsets into the vtable as variables - a change which seems to
have minimal/negligible performance impact - then a compiled extension can
adapt to a new vtable layout presented by a recompiled core. 
{quote}

But if added methods always went to the end of the vtable, wouldn't
things work fine, as long as you had bounds checking so that if new
code tried to look up a new method on old compiled code it would see
it's not there?

bq. Here's the method-invocation wrapper for Scorer_Next.

This seems like a fair amount of overhead per-invocation.  Is it
possible/OK for the caller to grab the next method up front and then
invoke it itself?

Would "core" scorers be able to somehow bypass this lookup?

{quote}
Each binding will have to implement lucy_Native_callback_i() and a few other
methods declared by Native.
{quote}

Native in this case means the dynamic language, right?  Ie,
lucy_Native_callback_i would invoke my Python method for "next", when
I've defined next in Python in my Matcher subclass?



[jira] Commented: (LUCY-5) Boilerplater compiler

Posted by "Marvin Humphrey (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCY-5?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681506#action_12681506 ] 

Marvin Humphrey commented on LUCY-5:
------------------------------------


> I could use a little more big picture here -- how does this compiler "fit in"?  

C is typically labeled a "procedural" language, but people use it for object
oriented programming all the time, from lightweight applications to projects
like GTK+/Gnome (<http://developer.gnome.org/doc/GGAD/cha-objects.html>).  One
common, lightweight technique is to store function pointers directly in struct
based objects.  Both KinoSearch 0.1x and Ferret use this approach, as does
GTK+.  

However, the methods-as-members approach doesn't scale well.  If you have a
large number of methods, objects become unreasonably large.  Furthermore, you
can't add more virtual methods to the end of a base class struct definition
without changing the struct definitions of all of its subclasses.  That causes
the ABI to break, so compiled extensions blow up.  You are left with the
choice of either breaking backwards compatibility or accepting severe
constraints on core library development.

To solve the bloating issue, methods can be stored in a shared vtable, a la
C++ and Java.  Single inheritance is sufficient for our needs, so we don't
have to worry about "fixups" and such complications.  However, absent a JIT
compiler, we still have the problem of not being able to add virtual methods
to a base class without breaking binary compatibility in its subclasses (see
<http://techbase.kde.org/Policies/Binary_Compatibility_Issues_With_C%2B%2B>).
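
To make the size difference concrete, here is a minimal sketch contrasting the
two layouts.  The Dog types and function names are hypothetical stand-ins, not
Lucy code:

```c
#include <assert.h>
#include <stddef.h>

/* Methods-as-members: every object carries its own function pointers,
 * so each added method grows every instance. */
typedef struct FatDog {
    int (*bark)(void *self);
    int (*bite)(void *self);
    int volume;
} FatDog;

/* Shared vtable: one table per class, one pointer per object. */
typedef struct DogVTable {
    int (*bark)(void *self);
    int (*bite)(void *self);
} DogVTable;

typedef struct SlimDog {
    const DogVTable *vtable;
    int volume;
} SlimDog;

static int dog_bark(void *self) { return ((SlimDog*)self)->volume; }
static int dog_bite(void *self) { (void)self; return 1; }

/* The single table shared by all SlimDog instances. */
static const DogVTable DOG_VTABLE = { dog_bark, dog_bite };

/* Method invocation goes through the shared table. */
static int
SlimDog_bark(SlimDog *dog)
{
    assert(dog != NULL && dog->vtable != NULL);
    return dog->vtable->bark(dog);
}
```

With the shared table, adding a method grows DogVTable once per class rather
than once per object.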

To address the virtual method ABI problem, we can use what I call the
"inside-out vtable" approach.  Normally, when compiling virtual method
invocations, the compiler hard-codes the offset into the vtable.  This causes
severe runtime memory errors when a compiled extension expects to find
a function pointer with a certain signature at a given hard-coded offset, but
finds something unexpected and incompatible there instead.  However, if we
store the offsets into the vtable as *variables* -- a change which seems to
have minimal/negligible performance impact -- then a compiled extension can
adapt to a new vtable layout presented by a recompiled core.  We still can't
remove methods, rename them, or change their signatures, but we can add new
ones.
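
Here is a small self-contained sketch of the idea.  All of the Dog names are
hypothetical, and both "old" and "new" layouts are crammed into one file
purely for illustration; in practice the offset variables would live in the
core and be read by the separately compiled extension:

```c
#include <assert.h>
#include <stddef.h>

typedef struct Dog Dog;
typedef int (*dog_method_t)(Dog *self);

/* Layout the extension was compiled against (old core). */
typedef struct OldDogVTable {
    dog_method_t bark;
    dog_method_t bite;
} OldDogVTable;

/* Layout of the recompiled core: growl() was added ahead of bite(). */
typedef struct DogVTable {
    dog_method_t bark;
    dog_method_t growl;
    dog_method_t bite;
} DogVTable;

struct Dog { DogVTable *vtable; };

static int dog_bark(Dog *self)  { (void)self; return 1; }
static int dog_growl(Dog *self) { (void)self; return 2; }
static int dog_bite(Dog *self)  { (void)self; return 3; }

static DogVTable DOG_VTABLE = { dog_bark, dog_growl, dog_bite };

/* The core exports each offset as a variable, so its value is read at
 * run time rather than being baked into the extension's object code. */
size_t Dog_Bark_OFFSET = offsetof(DogVTable, bark);
size_t Dog_Bite_OFFSET = offsetof(DogVTable, bite);

/* Invoke whatever method pointer lives at the given byte offset. */
static int
Dog_invoke_at(Dog *self, size_t offset)
{
    assert(self != NULL && self->vtable != NULL);
    char *const method_address = (char*)self->vtable + offset;
    const dog_method_t method = *((dog_method_t*)method_address);
    return method(self);
}

/* Extension-side wrapper: consults the offset variable on every call. */
static int
Dog_Bite(Dog *self)
{
    return Dog_invoke_at(self, Dog_Bite_OFFSET);
}
```

In this sketch, an extension that had hard-coded offsetof(OldDogVTable, bite)
would land on growl() in the new table, while Dog_Bite() still reaches bite()
because it picks up the new value of Dog_Bite_OFFSET.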

The primary function of the Boilerplater compiler is to generate "boilerplate"
C code which supports this OO model.

> Is this analogous to SWIG (used to easily autogenerate bindings in dynamic
> languages X, Y and Z)? 

That's Boilerplater's secondary function.

We already need Boilerplater to parse headers and build up a representation of
the Lucy OO tree (using Boilerplater::Method, Boilerplater::Class, etc), so
that we can generate our "boilerplate" OO support code.  If we're already
doing that much, it's not that hard to add a few additional modules to
autogenerate binding code.

However, the bindings we can generate with Boilerplater are much more powerful
and integrated into our custom OO model than what we could achieve with SWIG.
SWIG bindings allow you to invoke the C library from the host via wrappers.
Bindings generated by Boilerplater, on the other hand, allow you to write
subclasses entirely in the host language which override methods defined in the
C core.

When you create a pure-Perl subclass, e.g. "MockScorer", a lookup is performed
against the VTable_registry hash to see whether a VTable object exists which
corresponds to that class name.  If not, we dupe the parent class's VTable,
modifying the dupe by swapping out its class name and storing a reference to
the new parent.  Then we walk the Perl symbol table for "MockScorer" looking
for method names which match up with the public methods defined by the parent
class Scorer.  For each one that we find, we replace the function pointer at
that slot in the vtable with a custom-tailored function which calls back to
Perl and invokes the pure-Perl method.
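
The dupe-and-override step can be sketched roughly as follows.  This is a
simplification under stated assumptions: the VTable layout, the HostMethod
array standing in for the Perl symbol table, and every function name are
hypothetical, and the registry lookup is left out:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NUM_METHODS 2

typedef struct VTable VTable;
typedef struct Obj { VTable *vtable; } Obj;
typedef int (*method_t)(Obj *self);

struct VTable {
    const char *class_name;
    VTable     *parent;
    const char *method_names[NUM_METHODS];
    method_t    methods[NUM_METHODS];
};

/* Stand-in for one entry in the host class's symbol table. */
typedef struct HostMethod {
    const char *name;
    method_t    callback;   /* function which calls back into the host */
} HostMethod;

static int scorer_next(Obj *self) { (void)self; return 1; }
static int scorer_doc(Obj *self)  { (void)self; return 10; }

/* Stand-in for a generated callback; a real one would invoke the host. */
static int host_callback_next(Obj *self) { (void)self; return 42; }

static VTable SCORER = {
    "Scorer", NULL, { "next", "doc" }, { scorer_next, scorer_doc }
};

/* Dupe the parent's vtable, swap in the new class name and parent, then
 * replace each slot whose method name the host class defines.
 * Returns NULL if allocation fails; the caller frees the result. */
static VTable*
VTable_subclass(VTable *parent, const char *class_name,
                const HostMethod *host, size_t num_host)
{
    assert(parent != NULL && class_name != NULL);
    VTable *dupe = (VTable*)malloc(sizeof(VTable));
    if (dupe == NULL) { return NULL; }
    memcpy(dupe, parent, sizeof(VTable));
    dupe->class_name = class_name;
    dupe->parent     = parent;
    for (size_t i = 0; i < NUM_METHODS; i++) {
        for (size_t j = 0; j < num_host; j++) {
            if (strcmp(dupe->method_names[i], host[j].name) == 0) {
                dupe->methods[i] = host[j].callback;
            }
        }
    }
    return dupe;
}
```

Host methods whose names match no slot in the parent are ignored by this
sketch, and slots the host does not override keep pointing at the parent's
C functions.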

> Can you post an example of the output code generated? 

Here's the method-invocation wrapper for Scorer_Next.

{code}
extern size_t Lucy_Scorer_Next_OFFSET;
static CHY_INLINE chy_i32_t
Lucy_Scorer_Next(const void *vself)
{
    lucy_Matcher *const self = (lucy_Matcher*)vself;
    char *const method_address = (char*)self->_ + Lucy_Scorer_Next_OFFSET;
    const lucy_Matcher_next_t method = *((lucy_Matcher_next_t*)method_address);
    return method(self);
}
{code}

Here's the callback which gets installed in the VTable when we discover that
the pure-Perl class "MockScorer" has defined a method named "next".

{code}
chy_i32_t
lucy_Matcher_next(lucy_Matcher* self) 
{
    return (chy_i32_t)lucy_Native_callback_i(self, "next", 0);
}
{code}

Each binding will have to implement lucy_Native_callback_i() and a few other
methods declared by Native.



[jira] Commented: (LUCY-5) Boilerplater compiler

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCY-5?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682265#action_12682265 ] 

Michael McCandless commented on LUCY-5:
---------------------------------------

bq. The stub class won't be necessary. You can write your subclass in pure Python.

OK, nice.

{quote}
> OK so it sounds like calling functions/methods fit into various
> categories:

> * Entirely Lucy internal --> just call the function directly, so
> normal C compilation handles this.

That seems to me like a weird way of putting it, so maybe I'm not grokking
you.
{quote}
More likely the reverse!

{quote}
I think the answer is yes - but isn't that true for vtable-based
subclassing in general? Only invocations of "final" methods can be resolved
to a function address at compile-time. All other method invocations have to
go through the double-dereference to find the function address in the vtable.
{quote}

Sorry, I meant: are there many internal -> internal function calls
(ie, "normal" C function calls)?

EG say Lucy were up and running, and you ran a trace on indexing N
docs, gathering counters for how many times 1) a "normal" C function
was invoked (eg say calling sqrt()), 2) a "dynamic vtable" method was
invoked but the target method was implemented in C, and 3) a "dynamic
vtable" method was invoked that has been implemented in the host
language, so you dispatched to its runtime.

Those are the 3 categories I was wondering about; it sounds like
category 1) is actually rather small in Lucy?  Which means, most APIs
are in theory overridable in the host language?  (The "dynamic vtable"
API surface area is relatively high).

Eg in IndexWriter, Lucene has various internal (private/protected)
methods for doing merging - mergeInit, mergeMiddle, mergeCommit,
mergeFinish, etc. - that are not meant to be overridden.  These would
be category 1).

{quote}
Yes. Actually, Balmain ultimately persuaded me that the entire core should be
in C. I'm cool with that since Boilerplater makes pure-LanguageX subclassing
easy.
{quote}

OK.

{quote}
> * Lucy invokes "dynamically dispatched" API, and in fact its impl in
> the current context is defined in the host language, so we go
> through the full dynamic dispatch.

I think you and I are using the term "dynamic dispatch" to mean different
things. I'm using it in the sense of "resolved at run-time", so any virtual
method qualifies - including C++ and Java virtual methods, even though C++
and Java aren't typically called "dynamic languages".
http://en.wikipedia.org/wiki/Dynamic_dispatch
{quote}

I actually intended my usage to be this definition, ie your specific
implementation of dynamic dispatch in Lucy.



[jira] Commented: (LUCY-5) Boilerplater compiler

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCY-5?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682011#action_12682011 ] 

Michael McCandless commented on LUCY-5:
---------------------------------------


{quote}
"Native" may not be the best name for that module, especially since it has
exactly the opposite meaning in Java.  How about "Host", instead?
{quote}

I like Host!

{quote}
The idea is that if you install an independent third-party compiled extension
like "LucyX::RTree", it should still work after you upgrade the Lucy
core.
{quote}

I guess the assumption is if I install N packages that use Lucy core
for a given Host language X (eg Python) and presumably given major
version of Lucy core, they will centrally reference the core shared
lib (vs compiling statically, or using a "private" shared lib)?

I can understand why KDE needs to use shared libs -- a zillion
installed apps link against those libs.  For Lucy, it's less clear
this should be a requirement, though it's certainly nice.


{quote}
Actually, I don't think we're going to be able to use the same shared object
with multiple bindings. Python will need its own, C will need its own, Perl
will need its own, etc - and therefore, there will never be a normal case
where the bindings and the core library are out of sync.
{quote}

I see -- the compiled Lucy core shared lib will be host-language
specific.  So somewhere centrally (/usr/local/lib or something) I'd
have something like this:

  lucy-python-1.so
  lucy-python-2.so
  lucy-perl-1.so
  lucy-c-1.so

(Where Python had two major releases, 1 and 2, of Lucy installed).
Hmm actually you'd also have to separate out the version of the host
language each of these was built against.

These may symlink to the particular minor releases for each of those
that are currently installed.

{quote}
That won't work. Say that we have a core class "Dog" with two methods, bark()
and bite(), and an externally compiled subclass "Boxer" which overrides bark()
and adds drool().
{quote}

OK -- the binary compatibility challenge makes sense now -- thanks for
the tutorial (I should've gone and read that KDE doc the first time
around!).

{quote}
> This seems like a fair amount of overhead per-invocation.

Not true.   The C code is verbose, but the assembler is compact and the
machine instructions are cheap.
{quote}

Interesting and strange!  And a pleasant surprise...

{quote}

> Would "core" scorers be able to somehow bypass this lookup?

Yes.

In addition... Since all vtable offsets are constant for a given core compile,
we could actually define our method invocation symbols differently if e.g.
LUCY_CORE is defined, avoiding the extra variable lookup.
{quote}

Would the core default impls for classes like default MergePolicy,
MergeScheduler, IndexDeletionPolicy, HitCollector, Analyzer,
Tokenizer, TokenFilter, etc all be implemented in C?

OK so it sounds like function/method calls fall into various
categories:

  * Entirely Lucy internal --> just call the function directly, so
    normal C compilation handles this.

  * Lucy invokes "dynamically dispatched" API (ie API that could be
    implemented in the host language, eg when I subclass Analyzer,
    HitCollector, IndexDeletionPolicy, etc.), but in the current
    context we are using an object in C and so we bypass the dynamic
    dispatch.  This path remains fast?

  * Lucy invokes "dynamically dispatched" API, and in fact its impl in
    the current context is defined in the host language, so we go
    through the full dynamic dispatch.

How easy will it be to subclass in the host language?  EG, for
PyLucene I have to make a 'stub' class in Java first:

  http://lucene.apache.org/pylucene/jcc/documentation/readme.html#extensions

I assume Lucy has a similar requirement, ie we must decide up front
which methods are "dynamically dispatchable" and ensure Lucy always
invokes those methods dynamically.



[jira] Resolved: (LUCY-5) Boilerplater compiler

Posted by "Marvin Humphrey (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCY-5?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marvin Humphrey resolved LUCY-5.
--------------------------------

    Resolution: Fixed


[jira] Commented: (LUCY-5) Boilerplater compiler

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCY-5?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681233#action_12681233 ] 

Michael McCandless commented on LUCY-5:
---------------------------------------

I could use a little more big picture here -- how does this compiler
"fit in"?  Is this analogous to SWIG (used to easily autogenerate
bindings in dynamic languages X, Y and Z)?  Can you post an
example of the output code generated?



[jira] Commented: (LUCY-5) Boilerplater compiler

Posted by "Marvin Humphrey (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCY-5?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681836#action_12681836 ] 

Marvin Humphrey commented on LUCY-5:
------------------------------------

> I know there are issues with it, but... have you considered simply using
> C++, which has already created OO over C (vtables, etc.)? Or are there
> hopeless problems with its approach for Lucy?

Yes, we considered it.  First, C++ would severely constrain core library
development, because of the vtable ABI issue -- applying that KDE
compatibility document to Lucy would be unacceptable.  Second, we would not be
able to achieve the same level of integration between the bindings and the
core, because the C++ standard does not specify how dynamic dispatch should be
implemented. 

> Another question: it seems like you are going to great lengths to achieve
> "no recompilation back compatibility". Meaning, eg, if someone has
> built Python bindings to version X of Lucy, and you've made some
> otherwise-back-compatible changes to the exposed API and release version
> X+1, you'd like for those Python bindings to continue to work w/o
> recompilation when someone drops in Lucy X+1 (as a dynamic library), right?

No, that's not the use case we're concerned with.

The idea is that if you install an independent third-party compiled extension
like "LucyX::RTree", it should still work after you upgrade the Lucy core.  

Using Perl/CPAN as an example, consider the following sequence of events:

  1. Install Lucy 1.00 via CPAN.
  2. Install LucyX::RTree via CPAN.
  3. Upgrade Lucy to version 1.01 via CPAN.

If we do not preserve Lucy's binary compatibility from version 1.00 to 1.01,
apps which use LucyX::RTree will suddenly start crashing hard immediately
after the upgrade finishes.  That's not acceptable.

> Couldn't you require that the bindings are rebuilt & recompiled when Lucy
> X+1 is released?

Yes, I think that's a reasonable requirement.  

Actually, I don't think we're going to be able to use the same shared object
with multiple bindings.  Python will need its own, C will need its own, Perl
will need its own, etc -- and therefore, there will never be a normal case
where the bindings and the core library are out of sync.

The reason the core cannot be shared is that each binding has to implement
some functions which are declared by the core but left unimplemented -- for
example, the functions which implement callbacks to the host.  The object code
from those implementations will end up in the shared object.

> But if added methods always went to the end of the vtable, wouldn't things
> work fine, as long as you had bounds checking so that if new code tried to
> look up a new method on old compiled code it would see it's not there?

That won't work.  Say that we have a core class "Dog" with two methods, bark()
and bite(), and an externally compiled subclass "Boxer" which overrides bark()
and adds drool().

{code}
Dog_vtable = {
    Dog_bark,
    Dog_bite
};

Boxer_vtable = {
    Boxer_bark,
    Dog_bite,
    Boxer_drool
};
{code}

Now say that we add eat(Food *food) to the base class Dog:

{code}
Dog_vtable = {
    Dog_bark,
    Dog_bite,
    Dog_eat
};
{code}

Unfortunately, the externally compiled Boxer_vtable has a fixed layout, and it
puts Boxer_drool in the slot where the core expects to find eat().  When the
core tries to call eat() on a Boxer object, chaos will ensue.
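To make the collision concrete, here is a minimal C sketch. It is not Lucy code: the `MethodID` enum, the `invoke()` helper, and the uniform method signature are assumptions made so the example stays small and well-defined. Each function just reports which implementation actually ran.

```c
#include <assert.h>
#include <stddef.h>

/* Every method shares one signature here so the sketch stays well-defined C;
 * each returns an ID naming the function that actually ran. */
typedef enum { DOG_BARK, DOG_BITE, DOG_EAT, BOXER_BARK, BOXER_DROOL } MethodID;
typedef MethodID (*method_t)(void);

static MethodID Dog_bark(void)    { return DOG_BARK; }
static MethodID Dog_bite(void)    { return DOG_BITE; }
static MethodID Dog_eat(void)     { return DOG_EAT; }
static MethodID Boxer_bark(void)  { return BOXER_BARK; }
static MethodID Boxer_drool(void) { return BOXER_DROOL; }

/* Slot numbers as the upgraded core sees them. */
enum { SLOT_BARK = 0, SLOT_BITE = 1, SLOT_EAT = 2, NUM_SLOTS = 3 };

/* New core vtable, with eat() appended. */
static const method_t Dog_vtable[NUM_SLOTS] = { Dog_bark, Dog_bite, Dog_eat };

/* Boxer was compiled against the old core, so its third slot is drool(). */
static const method_t Boxer_vtable[NUM_SLOTS] = {
    Boxer_bark, Dog_bite, Boxer_drool
};

/* The core dispatches by fixed slot number. */
static MethodID invoke(const method_t *vtable, size_t slot) {
    assert(slot < NUM_SLOTS);
    return vtable[slot]();
}
```

Calling `invoke(Boxer_vtable, SLOT_EAT)` runs `Boxer_drool`, not an eat() implementation. In real code the two methods would also have different signatures, so the failure would be a crash rather than a tidy wrong answer.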

>> Here's the method-invocation wrapper for Scorer_Next.

> This seems like a fair amount of overhead per-invocation. 

Not true.  :)  The C code is verbose, but the assembler is compact and the
machine instructions are cheap.  From
[http://mail-archives.apache.org/mod_mbox/lucene-lucy-dev/200711.mbox/%3C5C8D8968-9788-4679-BB17-A975383C4A6E@rectangular.com%3E]:

{noformat}
    ... I can detect no impact on performance using the indexing  
    benchmark script, even after changing InStream and OutStream from  
    FINAL_CLASS to CLASS so that their methods go through the dispatch  
    table rather than resolve to function addresses.  I speculate that  
    because all the extra instructions are pipeline-able, they're nearly  
    indistinguishable from free.
{noformat}

Double dereference vtables are a standard technique for implementing
dynamic dispatch in C++, Java, etc.  The only thing we're doing differently is
loading the offset from a variable.
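As a rough illustration of the double dereference with a variable offset, consider the following sketch. The names (`Dog`, `Dog_bark_OFFSET`, the `_IMP` functions) are hypothetical and the generated Lucy wrappers may look quite different; the point is only the shape of the lookup.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Dog Dog;
typedef void (*generic_fn)(void);        /* vtables hold generic pointers */
typedef int (*Dog_bark_t)(Dog *self);    /* hypothetical method signature */

struct Dog {
    const generic_fn *vtable;   /* first dereference: object -> vtable */
    int volume;
};

/* "Inside-out" offset: a byte offset stored in a variable instead of being
 * baked into every caller as a compile-time constant. */
size_t Dog_bark_OFFSET = 0 * sizeof(generic_fn);

/* Method-invocation wrapper: load the offset, then fetch the function
 * pointer from the vtable (second dereference), then call it. */
static int Dog_bark(Dog *self) {
    assert(self != NULL);
    const char *base = (const char *)self->vtable;
    generic_fn fn = *(const generic_fn *)(base + Dog_bark_OFFSET);
    return ((Dog_bark_t)fn)(self);
}

/* Two implementations, one for the base class and one for a subclass. */
static int Dog_bark_IMP(Dog *self)   { return self->volume; }
static int Boxer_bark_IMP(Dog *self) { return self->volume * 2; }

static const generic_fn DOG_VTABLE[]   = { (generic_fn)Dog_bark_IMP };
static const generic_fn BOXER_VTABLE[] = { (generic_fn)Boxer_bark_IMP };
```

Because callers read `Dog_bark_OFFSET` at run time, the core is free to assign a different offset in a later release without recompiling them.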

The "inside-out" aspect of using individual variables to hold the offsets was
inspired by the "inside-out object" technique drawn from Perl culture.
However, the idea of using variable vtable offsets has been studied before,
and is actually implemented in GCJ.  

See "Supporting Binary Compatibility with Static Compilation" by Dachuan Yu,
Zhong Shao, and Valery Trifonov, at
[http://www.usenix.org/events/javavm02/yu/yu_html/index.html].

> Is it possible/OK for the caller to grab the next method up front and then
> invoke it itself?

Yes.  In fact, I don't think there's any harm in making that part of the
public API, because we're already committed by the ABI requirements.
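A sketch of what "grab the method up front" could look like from the caller's side. Everything here is illustrative: `Scorer_next_method()` is a hypothetical accessor, and the one-slot vtable exists only to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Scorer Scorer;
typedef int (*Scorer_next_t)(Scorer *self);  /* returns 0 when exhausted */

struct Scorer {
    const Scorer_next_t *vtable;  /* single-method vtable for brevity */
    int remaining;
};

size_t Scorer_next_OFFSET = 0;    /* byte offset of next() in the vtable */

/* Resolve the method once... */
static Scorer_next_t Scorer_next_method(const Scorer *self) {
    assert(self != NULL);
    const char *base = (const char *)self->vtable;
    return *(const Scorer_next_t *)(base + Scorer_next_OFFSET);
}

/* A trivial implementation that yields `remaining` hits. */
static int countdown_next(Scorer *self) {
    if (self->remaining == 0) { return 0; }
    self->remaining--;
    return 1;
}

static const Scorer_next_t COUNTDOWN_VTABLE[] = { countdown_next };

/* ...then call through the cached pointer inside the hot loop. */
static int count_hits(Scorer *scorer) {
    Scorer_next_t next = Scorer_next_method(scorer);
    int hits = 0;
    while (next(scorer)) { hits++; }
    return hits;
}
```

The cached pointer is only valid for objects of the same class, so a caller iterating over mixed Scorer types would have to resolve it per object.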

> Would "core" scorers be able to somehow bypass this lookup?

Yes.

In addition... Since all vtable offsets are constant for a given core compile,
we could actually define our method invocation symbols differently if e.g.
LUCY_CORE is defined, avoiding the extra variable lookup.  
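One way this could be arranged with the preprocessor, sketched under the assumption that offsets are byte offsets into an array of function pointers. `SCORER_NEXT_OFFSET` and the struct layout are made up for the example; only the `LUCY_CORE` symbol comes from the discussion above.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Scorer Scorer;
typedef int (*Scorer_next_t)(Scorer *self);

struct Scorer {
    const Scorer_next_t *vtable;
    int doc_id;
};

/* Extensions read the offset from this variable at run time. */
size_t Scorer_next_OFFSET = 0 * sizeof(Scorer_next_t);

#ifdef LUCY_CORE
  /* Inside the core the layout is known at compile time: use a constant. */
  #define SCORER_NEXT_OFFSET (0 * sizeof(Scorer_next_t))
#else
  /* Outside the core, pay for one extra variable load. */
  #define SCORER_NEXT_OFFSET Scorer_next_OFFSET
#endif

/* Same invocation wrapper either way; only the offset source differs. */
static inline int Scorer_next(Scorer *self) {
    assert(self != NULL);
    const char *base = (const char *)self->vtable;
    return (*(const Scorer_next_t *)(base + SCORER_NEXT_OFFSET))(self);
}

/* Trivial implementation: advance to the next doc id. */
static int simple_next(Scorer *self) { return ++self->doc_id; }

static const Scorer_next_t SIMPLE_VTABLE[] = { simple_next };
```

Compiling with or without `-DLUCY_CORE` gives the same behavior; the core build simply folds the offset into the instruction stream.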

>> Each binding will have to implement lucy_Native_callback_i() and a few other
>> methods declared by Native.
>
> Native in this case means the dynamic language, right? Ie,
> lucy_Native_callback_i would invoke my Python method for "next", when I've
> defined next in Python in my Matcher subclass?

Yes, that's the idea.  

"Native" may not be the best name for that module, especially since it has
exactly the opposite meaning in Java. :)  How about "Host", instead?


[jira] Commented: (LUCY-5) Boilerplater compiler

Posted by "Marvin Humphrey (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCY-5?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682334#action_12682334 ] 

Marvin Humphrey commented on LUCY-5:
------------------------------------

> are there many internal -> internal function calls (ie, "normal" C function
> calls)?

InStream and OutStream are "final" classes in KS, so all of their "method
calls" are actually straight up "normal" C function calls.  I specifically
chose to make those two "final" because they get hit more often than anybody
else.

However, there seems to be little performance difference one way or the other.
The old KS indexing benchmarker shows maybe a 1% difference at most when I
flip the "final" flags on InStream and OutStream, close to the noise floor of
the test on my Mac.

I think this is partly because minimizing OO overhead has been a KS design
goal for a very long time; whenever possible, we avoid creating objects or
making method calls.  For instance, since InStream is final, InStream_Read_C32
performs all of its operations within a single function call (provided the
buffer doesn't need refilling), instead of needing to invoke
InStream_Read_Byte 1-5 times.  (C32 = Compressed 32-bit integer, analogous to
a Lucene VInt).  Because there are fewer method invocations overall, messing
with the method-invocation apparatus has less of an effect than it might have
on other libraries.
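For illustration, here is a sketch of decoding a compressed integer in one function call rather than one call per byte. The byte layout is an assumption borrowed from Lucene's VInt (7 payload bits per byte, low-order group first, high bit marking continuation); the actual C32 layout in KS may differ, and buffer refilling is left out entirely. `read_c32` is a made-up name, not the real InStream_Read_C32.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one variable-length 32-bit integer from buf in a single call.
 * Assumed layout (Lucene VInt style): 7 payload bits per byte, low-order
 * group first, high bit set on every byte except the last.
 * Returns the number of bytes consumed, or 0 if the input is truncated
 * or runs past 5 bytes without terminating. */
static size_t read_c32(const uint8_t *buf, size_t len, uint32_t *out) {
    assert(buf != NULL && out != NULL);
    uint32_t value = 0;
    for (size_t i = 0; i < len && i < 5; i++) {
        value |= (uint32_t)(buf[i] & 0x7F) << (7 * i);
        if ((buf[i] & 0x80) == 0) {   /* last byte of this integer */
            *out = value;
            return i + 1;
        }
    }
    return 0;
}
```

The loop touches the buffer directly, so a 1-5 byte integer costs one call instead of up to five dispatched byte reads.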

Another thing to bear in mind is that the "indirect dispatch" technique used
by inside-out vtables just isn't all that expensive.  Take a look at the GCJ
performance evaluations at
[http://www.usenix.org/events/javavm02/yu/yu_html/node29.html] -- they reveal
a 1-2% maximum difference on some tests which are far more method-call
intensive than what Lucy would be doing.

> most APIs are in theory overridable in the host language?

Yes -- anything which hasn't been declared "final", which means most APIs.

There are a handful of non-public methods and classes which could in theory be
declared final, but I haven't bothered because it probably wouldn't make much
difference.

The Perl test files in KS take advantage of the subclassing API all the time.
All the combining Scorers (ORScorer, ANDScorer, etc) use pure-Perl MockScorer
instances as their subscorers.  Come to think of it, the current implementation of 
Schema *requires* you to subclass it, though that's about to change.

Of course calling back to a dynamic language host causes a big performance
degradation on tight loops, but it's still good enough for small data sets and
rapid prototyping.


[jira] Commented: (LUCY-5) Boilerplater compiler

Posted by "Marvin Humphrey (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCY-5?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682036#action_12682036 ] 

Marvin Humphrey commented on LUCY-5:
------------------------------------

> How easy will it be to subclass in the host language? EG, for
> PyLucene I have to make a 'stub' class in Java first:

The stub class won't be necessary.  You can write your subclass in pure Python.

At core build time, Boilerplater auto-generates all the routines that the
"stub" class would have provided.  You bloat up the shared object somewhat
that way, but it's not the kind of bloat that would matter in the context of a
dynamic language.

Then at run-time, the VTable class's constructor installs these callback pointers
whenever it finds that you've overridden a public method.

> I assume Lucy has a similar requirement, ie we must decide up front
> which methods are "dynamically dispatchable" and ensure Lucy always
> invokes those methods dynamically.

That seems to me like a weird way of putting it, so maybe I'm not grokking
you.  I think the answer is yes -- but isn't that true for vtable-based
subclassing in general?  Only invocations of "final" methods can be resolved
to a function address at compile-time.  All other method invocations have to
go through the double dereference to find the function address in the vtable.
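
The difference between the two kinds of invocation can be sketched in C as
follows.  The struct layout and all names here are assumptions made for
illustration, not the layout Boilerplater emits.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Obj Obj;
typedef float (*score_t)(Obj *self);

typedef struct VTable {
    score_t score;          /* one slot per overridable method */
} VTable;

struct Obj {
    VTable *vtable;         /* first dereference: object -> vtable */
    float   weight;
};

/* A "final" method: no subclass may override it, so callers name the
 * function directly and its address is resolved at compile/link time. */
float Obj_get_weight(Obj *self) { return self->weight; }

/* Two implementations of the same overridable method. */
float Base_score(Obj *self)    { return self->weight; }
float Doubled_score(Obj *self) { return self->weight * 2.0f; }

VTable BASE_VTABLE    = { Base_score };
VTable DOUBLED_VTABLE = { Doubled_score };

/* A non-final method: object -> vtable, vtable -> slot, then the call. */
float Obj_score(Obj *self) {
    assert(self != NULL && self->vtable != NULL);
    return self->vtable->score(self);
}
```

Calling Obj_score on two objects with different vtables runs different
functions, while Obj_get_weight always runs the same one.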

However, it isn't necessary for the core to define methods that always call
back into the host.  Abstract methods do that by default, but other methods
don't have to:

   * The score() slot in the VTable for Scorer points at a function that
     invokes lucy_Host_callback_i to call back to the host.
   * The score() slot in the VTable for TermScorer points directly at the core
     C function lucy_TermScorer_score.
   * The score() slot in the dynamically generated VTable for the user-defined
     pure-Python subclass "MyTermScorer" points at a function that invokes
     lucy_Host_callback_i to call back to the host.
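
The three cases above can be sketched in C like this.  It is only an
illustration under stated assumptions: host_callback_stub is a hypothetical
stand-in for lucy_Host_callback_i (whose real signature is not shown here),
the struct layout is invented, and the MyTermScorer vtable is written as a
static initializer even though the real one would be generated at run-time.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Scorer Scorer;
typedef float (*score_t)(Scorer *self);

typedef struct VTable {
    const char *name;
    score_t     score;      /* the score() slot */
} VTable;

struct Scorer {
    VTable *vtable;
    float   host_value;     /* what the simulated host method returns */
};

int host_calls = 0;         /* counts trips into the simulated host */

/* Hypothetical stand-in for the host callback; a real one would invoke
 * the method on the host-language object. */
float host_callback_stub(Scorer *self) {
    host_calls++;
    return self->host_value;
}

/* Core C implementation for TermScorer. */
float TermScorer_score(Scorer *self) { (void)self; return 0.5f; }

/* Generated wrapper that routes score() back to the host. */
float Scorer_score_callback(Scorer *self) { return host_callback_stub(self); }

/* Abstract base class: the slot calls back to the host. */
VTable SCORER       = { "Scorer",       Scorer_score_callback };
/* Core subclass: the slot points directly at the C function. */
VTable TERMSCORER   = { "TermScorer",   TermScorer_score };
/* Host-defined subclass overriding score(): callback again. */
VTable MYTERMSCORER = { "MyTermScorer", Scorer_score_callback };

/* Every invocation dispatches through the slot, whatever it points at. */
float Scorer_score(Scorer *self) {
    assert(self != NULL && self->vtable != NULL);
    return self->vtable->score(self);
}
```

The caller is identical in all three cases; only the slot contents decide
whether the call stays in C or crosses into the host.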

