You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucy.apache.org by Nick Wellnhofer <we...@aevum.de> on 2013/02/16 00:23:17 UTC

[lucy-dev] Rework test infrastructure for C bindings

Hello lucy_dev,

I just started a new branch 'clownfish-test' with a rework of Lucy's test infrastructure. It's basically the same architecture that I used for the CFC tests.

The class Clownfish::Test::TestEngine incorporates the features of Lucy::Test::TestBatch but with pluggable output formats which use the abstract class Clownfish::Test::TestFormatter. Every test class now inherits from Clownfish::Test::TestBatch instead of being an inert class.

Lucy::Test::TestEngineLucy inherits from Clownfish::Test::TestEngine and populates a VArray with all the TestBatch classes. This makes it possible to lookup tests by name so we don't need Perl bindings for every single test class.

For now, I only ported a single test to the new layout. If there arent't any objections, I'll port the rest of the tests and switch to the new test infrastructure.

Nick

Re: [lucy-dev] Rework test infrastructure for C bindings

Posted by Marvin Humphrey <ma...@rectangular.com>.

On Sat, Feb 16, 2013 at 11:01 AM, Marvin Humphrey
<ma...@rectangular.com> wrote:

> For instance, we agree on these two things:
>
> *   `TEST_TRUE` should be renamed `OK`.
> *   The first argument to `OK` should be hidden.

Oops -- I was trying to propose something completely non-controversial, but
I've just realized that the implementation strategy for how we hide the first
argument is not a settled issue.

Maybe we can start by getting rid of the varidic stuff?

Marvin Humphrey

Re: [lucy-dev] Iterating through CharBufs

Posted by Marvin Humphrey <ma...@rectangular.com>.

On Fri, Feb 22, 2013 at 8:40 AM, Kurt Starsinic <ks...@gmail.com> wrote:
> For a little context, the history constraint isn't mentioned in Liskov's
> original paper <http://doi.acm.org/10.1145/62139.62141> (1987); it first
> appears in Liskov and Wing <http://doi.acm.org/10.1145/197320.197383>(1994).

Material for the Book Club?

After a brief skim, my impression is that while both are interesting, the
second paper is more cohesive and a better read.

> And, having said all that, it seems to me that making CharBuf a subclass of
> String is a square-peg/round-hole situation.

The reverse -- having String subclass CharBuf -- is problematic as well.  If
the subtype is immutable, then casting to the mutable superclass is akin to
casting away const-ness.

This issue comes up when wrapping a host argument string to pass to a function
in Clownfish-space.

If the Clownfish-space function declares that it accepts a `const CharBuf*` as
an argument, then it's safe to provide string content owned by the host object
within a stack-allocated wrapper.

    // Implemetation of Clownfish-space function.
    void
    Foo_set_name(Foo *self, const CharBuf *name) { // <--- `const`
        CharBuf *temp = self->name;
        self->name = CB_Clone(name);  // <--- Clone(), not INCREF.
        DECREF(temp);
    }

    // Wrapper function.
    void
    wrap_Foo_set_name(SV *foo_sv, SV *name_sv) {
        Foo *self = (Foo*)extract_cfish_obj_from_sv(foo_sv, FOO);
        STRLEN len;
        const char *name_str = SvPVutf8(name_sv, len);
        void *stack_memory = alloca(ZCB_size());
        ZombieCharBuf *name = ZCB_wrap(stack_memory, name_str, len);
        Foo_set_name(self, name);
    }

The Clownfish-space function cannot store a reference to the `const CharBuf*`,
because that requires taking ownership of a refcount -- and incrementing a the
refcount requires mutating the object, violating the `const` constraint.
(The rules for passing around a `const CharBuf*` are the same as for passing
around a `const char*`.)

However, if the Clownfish-space function declares that it takes a non-const
`CharBuf*` as an argument, we need to malloc a whole new CharBuf and copy the
host string into it -- and then after the Clownfish-space function returns, we
need to release its refcount[1].

Allocating on the heap is essential because the Clownfish-space code may
INCREF the CharBuf and store a reference which outlives a stack-allocated
wrapper.

    // Implemetation of Clownfish-space function.
    void
    Foo_set_name(Foo *self, CharBuf *name) { // <--- not `const`
        CharBuf *temp = self->name;
        self->name = (CharBuf*)INCREF(name); // <--- INCREF, not Clone().
        DECREF(temp);
    }

    // Wrapper function.
    void
    wrap_Foo_set_name(SV *foo_sv, SV *name_sv) {
        Foo *self = (Foo*)extract_cfish_obj_from_sv(foo_sv, FOO);
        STRLEN len;
        const char *name_str = SvPVutf8(name_sv, len);
        CharBuf *name = CB_new_from_trusted_utf8(name_str, len);
        Foo_set_name(self, name);
        DECREF(name);  // <--- Dispose of refcount.
    }

Surprisingly, the need to wrap host string arguments, which is a significant
aspect of Clownfish, does not argue for an immutable String class -- it argues
for `const`!

Only the _content_ of a non-const String is immutable, not its refcount -- so
only a `const String*` can replace `const CharBuf*` in our type-mapping
scheme.

Marvin Humphrey

[1] The refcount will leak if an exception is thrown in Clownfish-space
    because we do not have control over unwinding the stack.

Re: [lucy-dev] Iterating through CharBufs

Posted by Kurt Starsinic <ks...@gmail.com>.

For a little context, the history constraint isn't mentioned in Liskov's
original paper <http://doi.acm.org/10.1145/62139.62141> (1987); it first
appears in Liskov and Wing <http://doi.acm.org/10.1145/197320.197383>(1994).

Having said that (my apologies if this is well-trodden ground, I'm New
Here), I don't think Liskov Substitution is best thought of as the law of
the land, but rather as a useful guideline to be violated only with eyes
wide open.

And, having said all that, it seems to me that making CharBuf a subclass of
String is a square-peg/round-hole situation. I'm gonna read some code now.

- Kurt

On Fri, Feb 22, 2013 at 6:27 AM, Nick Wellnhofer <we...@aevum.de>wrote:

> On 22/02/2013 06:38, Marvin Humphrey wrote:
>
>> On Thu, Feb 21, 2013 at 4:43 AM, Nick Wellnhofer <we...@aevum.de>
>> wrote:
>> How about a three-prong strategy which uses both of those approaches and
>> adds one more?
>>
>> *   Externally, expose iterators as full-blown, opaque objects.
>> *   Internally, allocate iterators on the stack using alloca() and access
>> the
>>      struct members directly.
>> *   To accommodate highly performance-sensitive client code, provide
>> access to
>>      raw string data, so that the client can operate on it using its own
>> highly
>>      optimized and customized routines.
>>
>
> +1
>
>
>  This remains a problem if we make CharBuf a subclass of String.  If safe
>> iteration is part of String's interface, then a mutable subclass which
>> cannot
>> support safe iteration violates the Liskov Substitution Principle[1].
>>
>
> Doesn't a mutable subclass of an immutable class already violate the
> Liskov Substitution Principle? See "history constraint" on the Wikipedia
> page.
>
> Nick
>
>

Re: [lucy-dev] Iterating through CharBufs

Posted by Nick Wellnhofer <we...@aevum.de>.

On 22/02/2013 06:38, Marvin Humphrey wrote:
> On Thu, Feb 21, 2013 at 4:43 AM, Nick Wellnhofer <we...@aevum.de> wrote:
> How about a three-prong strategy which uses both of those approaches and
> adds one more?
>
> *   Externally, expose iterators as full-blown, opaque objects.
> *   Internally, allocate iterators on the stack using alloca() and access the
>      struct members directly.
> *   To accommodate highly performance-sensitive client code, provide access to
>      raw string data, so that the client can operate on it using its own highly
>      optimized and customized routines.

+1

> This remains a problem if we make CharBuf a subclass of String.  If safe
> iteration is part of String's interface, then a mutable subclass which cannot
> support safe iteration violates the Liskov Substitution Principle[1].

Doesn't a mutable subclass of an immutable class already violate the 
Liskov Substitution Principle? See "history constraint" on the Wikipedia 
page.

Nick

Re: [lucy-dev] Iterating through CharBufs

Posted by Marvin Humphrey <ma...@rectangular.com>.

On Thu, Feb 21, 2013 at 4:43 AM, Nick Wellnhofer <we...@aevum.de> wrote:
> IMO, the best way to solve these problems is to introduce string iterators.

+1

> In a basic form, they are simply the collection of a byte offset and a code
> point offset into a string.
>
>     struct CharBufIterator {
>         CharBuf *cb;
>         size_t   byte_offset;
>         size_t   code_point_offset;
>     };
>
> Useful operations on a string iterator are:
>
>     * Move the iterator forward or backward by a number of
>       code points.
>     * Get the code point at the current position or at an
>       offset relative to the current position.
>     * Get a substring between two string iterators.
>
> I used a very basic form of string iterators in my implementation of the
> StandardTokenizer:
>
>     http://s.apache.org/DCH
>
> You can also make string iterators into full-blown classes. But typically
> they're short-lived and only used as local variables inside a single
> function.

How about a three-prong strategy which uses both of those approaches and
adds one more?

*   Externally, expose iterators as full-blown, opaque objects.
*   Internally, allocate iterators on the stack using alloca() and access the
    struct members directly.
*   To accommodate highly performance-sensitive client code, provide access to
    raw string data, so that the client can operate on it using its own highly
    optimized and customized routines.

Here's some trivial sample code using CharBufIterator:

    CharBuf *hello = CB_newf("hello world");
    CharBufIterator *iterator = CB_Make_Iterator(hello);
    while (CBIter_Next(iterator)) {
        printf("%c | %d\n", CBIter_Get_Code_Point(iterator),
               (int)CBIter_Get_Byte_Offset(iterator));
    }

Returning to your earlier connection between iteration and mutability, the
fly in the ointment for exposing iteration as a public API on CharBuf is that
iterating over a mutable object like a CharBuf is an unsafe operation.
In contrast, iterating over an immutable String would be safe.

This remains a problem if we make CharBuf a subclass of String.  If safe
iteration is part of String's interface, then a mutable subclass which cannot
support safe iteration violates the Liskov Substitution Principle[1].
Arguably, a String class and a mutable character buffer serve different
purposes.

A similar issue exists when using CharBufs as hash keys: it is possible to get
at a CharBuf key and mutate it, changing its hash sum, potentially colliding
with another key and generally wreaking havoc.  Python only allows immutable
types to serve as hash keys to avoid such problems.

>> Bear in mind that one requirement for Clownfish strings going forward is to
>> support UTF-16 as an internal encoding in addition to UTF-8.
>
> Actually, you can implement every string operation in terms of string
> iterators. This concept is, for example, heavily used in the Parrot VM which
> transparently supports strings in ASCII, UTF-8, UCS-2, UTF-16, and UCS-4
> encodings. FWIW, I refactored large parts of Parrot's string subsystem in
> 2010 and I'd be happy to share my experiences.

We're fortunate to be able to draw on your experience. :)

The Clownfish core, it seems to me, has goals which are inverted in
comparison to Parrot.  While Parrot's data types have to provide exhaustive
support for all features in all target languages, the primary objective for
Clownfish is to provide least-common-denominator data types which convert to
and from host data types sensibly and with minimal resistance.  That leaves us
with a lot of freedom to pursue the secondary objective of giving the core
Clownfish data types a coherent, user-friendly programming API.

Marvin Humphrey

[1] http://en.wikipedia.org/wiki/Liskov_substitution_principle

Re: [lucy-dev] Iterating through CharBufs

Posted by Nick Wellnhofer <we...@aevum.de>.

On Feb 24, 2013, at 06:59 , Marvin Humphrey <ma...@rectangular.com> wrote:

> I'd be interested whether you have anything to add to this argument from Tom
> Christiansen as to why iteration is the best model for string processing, as
> opposed to random access:
> 
>    http://bugs.python.org/issue12729#msg142036

I fully agree with Tom. Random access is only useful if you deal with fixed-length records which are rarely used these days.

This is a very interesting thread, BTW. It taught me some things I didn't know about Unicode yet. Thanks for sharing it.

Another nice thing about iterators is that if we have to support multiple encodings, the encoding can be abstracted behind the iterator interface. So we can share the implementations of String methods across encodings except for performance-critical stuff like Hash_Sum.

> Christiansen also argues for UTF-8 as a native encoding, like Perl and Go.
> Clownfish doesn't have that option -- but if we make iteration our primary
> string processing model, we can avoid problems associated with random
> access, such as splitting logical characters.

UTF-8 is certainly superior in almost all aspects. The fact that UTF-16 is still used so much has mainly historical reasons. Many implementations originally started out with UCS-2 and later upgraded to UTF-16 being the obvious but not really ideal choice. Switching from a fixed-width to a variable-width encoding has a lot of implications which have been overlooked in some programming languages as Tom Christiansen points out in the thread mentioned above.

Nick

Re: [lucy-dev] Iterating through CharBufs

Posted by Marvin Humphrey <ma...@rectangular.com>.

On Thu, Feb 21, 2013 at 4:43 AM, Nick Wellnhofer <we...@aevum.de> wrote:
> Actually, you can implement every string operation in terms of string
> iterators. This concept is, for example, heavily used in the Parrot VM which
> transparently supports strings in ASCII, UTF-8, UCS-2, UTF-16, and UCS-4
> encodings. FWIW, I refactored large parts of Parrot's string subsystem in
> 2010 and I'd be happy to share my experiences.

I'd be interested whether you have anything to add to this argument from Tom
Christiansen as to why iteration is the best model for string processing, as
opposed to random access:

    http://bugs.python.org/issue12729#msg142036

    I may be wrong here, not least because I can think of possible extenuating
    circumstances, but it is my impression that there there is an underlying
    assumption in the Python community and many others that being able to
    access the Nth character in a string in constant time for arbitrary N is
    the most important of all possible considerations.

    I don't believe that makes as much sense as people think, because I don't
    believe character strings really are accessed in that fashion very often
    at all.  Sure, if you have a 2D matrix of strings where a given row-column
    pair yields one character and you're taking the diagonal you might want
    that, but how often does that actually happen?  Virtually never: these are
    strings and not matrices we're running FFTs on after all.  We don't need
    to be able to load them into vector registers or anything the way the
    number-crunching people do.

    That's because strings are a sequence of characters: they're text.
    Whether reading text left to right, right to left, or even
    boustrophedonically, you're always going one past the character you're
    currently at.  You aren't going to the Nth character forward or back for
    arbitrary N.  That isn't how people deal with text.  Sometimes they do
    look at the end, or a few in from the far end, but even that can be
    handled in other fashions.

Christiansen also argues for UTF-8 as a native encoding, like Perl and Go.
Clownfish doesn't have that option -- but if we make iteration our primary
string processing model, we can avoid problems associated with random
access, such as splitting logical characters.

Marvin Humphrey

[lucy-dev] Iterating through CharBufs

Posted by Nick Wellnhofer <we...@aevum.de>.

On 21/02/2013 04:33, Marvin Humphrey wrote:
> Iterating through strings seems orthogonal to mutability.  What is it that you
> find objectionable about the current iteration support?

AFAICS, the current way to efficiently iterate through a string is to 
create a ViewCharBuf and use the ViewCB_Nip_One method, resulting in the 
following code:

     ViewCB_Assign(iterator, string);
     while (ViewCB_Get_Size(iterator)) {
         uint32_t code_point = ViewCB_Code_Point_At(iterator, 0);
         // Do something with code_point
         ViewCB_Nip_One(iterator);
     }

ViewCharBuf seems to be an immutable reference to a substring of a 
CharBuf. ViewCB_Nip_One advances the start of the substring by a single 
code point.

I can see the following drawbacks with this approach:

1. It's not possible to safely iterate backwards with a ViewCharBuf 
because the ViewCharBuf doesn't know where the original string starts. 
Stepping backwards from a certain position in a string is rarely needed 
in practice but the Highlighter is an example where exactly this 
operation is used.

2. It's hard to keep track of multiple positions in a string and extract 
a substring between two positions. These operations are primarily needed 
when splitting and tokenizing strings. You could remember a previous 
position in a string by simply copying a whole ViewCharBuf but 
extracting the substring between the start of two ViewCharBufs seems 
extremely messy to me.

IMO, the best way to solve these problems is to introduce string 
iterators. In a basic form, they are simply the collection of a byte 
offset and a code point offset into a string.

     struct CharBufIterator {
         CharBuf *cb;
         size_t   byte_offset;
         size_t   code_point_offset;
     };

Useful operations on a string iterator are:

     * Move the iterator forward or backward by a number of
       code points.
     * Get the code point at the current position or at an
       offset relative to the current position.
     * Get a substring between two string iterators.

I used a very basic form of string iterators in my implementation of the 
StandardTokenizer:

     http://s.apache.org/DCH

You can also make string iterators into full-blown classes. But 
typically they're short-lived and only used as local variables inside a 
single function.

> Bear in mind that one requirement for Clownfish strings going forward is to
> support UTF-16 as an internal encoding in addition to UTF-8.

Actually, you can implement every string operation in terms of string 
iterators. This concept is, for example, heavily used in the Parrot VM 
which transparently supports strings in ASCII, UTF-8, UCS-2, UTF-16, and 
UCS-4 encodings. FWIW, I refactored large parts of Parrot's string 
subsystem in 2010 and I'd be happy to share my experiences.

Nick

Re: [lucy-dev] Rework test infrastructure for C bindings

Posted by Marvin Humphrey <ma...@rectangular.com>.

On Sun, Feb 17, 2013 at 3:22 PM, Marvin Humphrey <ma...@rectangular.com> wrote:
> On Sun, Feb 17, 2013 at 12:08 PM, Nick Wellnhofer <we...@aevum.de> wrote:
>> On Feb 17, 2013, at 19:48 , Marvin Humphrey <ma...@rectangular.com> wrote:
>>> For a number of reasons, Clownfish needs an immutable string type, which I
>>> think should be named Clownfish::String.  CharBuf would become a mutable
>>> subclass of String (and would be significantly rarer than it is now).
>>
>> Oh, I'd love to help working on this. One thing I don't like about the
>> current CharBuf implementation is that the way to iterate through strings
>> is rather limiting. The highlighter code in particular could benefit from a
>> few changes.

After giving the matter some thought I've decided to try another approach
first for the most pressing problem I want to solve, which is related to
`const` method invocants.

Iterating through strings seems orthogonal to mutability.  What is it that you
find objectionable about the current iteration support?

Bear in mind that one requirement for Clownfish strings going forward is to
support UTF-16 as an internal encoding in addition to UTF-8.

    // Example of wrapping UTF-8 content.  We'll have to do something similar
    // for UTF-16 strings, like those in Python.  We want to avoid malloc() in
    // wrapper functions both for the sake of speed and to avoid leaking
    // memory during exceptions, so we use alloca() to allocate the object on
    // the stack and then wrap the original string content rather than copy
    // it.  (So long as the amount requested is finite and small, alloca() is
    // safe.)
    int32_t
    call_hash_sum(SV *perl_string) {
        STRLEN len;
        char *utf8 = SvPVutf8(perl_string, len);
        void *stack_memory = alloca(ZCB_size()); // sizeof(ZombieCharBuf);
        ZombieCharBuf *wrapper = ZCB_wrap(stack_memory, utf8, len);
        return ZCB_Hash_Sum(wrapper);
    }

CharBuf's iteration facilities were designed to proceed code-point by
code-point in order to work with multiple Unicode encodings.  Does that
help to explain why things are as they are?

Marvin Humphrey

Re: [lucy-dev] Rework test infrastructure for C bindings

Posted by Marvin Humphrey <ma...@rectangular.com>.

On Sun, Feb 17, 2013 at 12:08 PM, Nick Wellnhofer <we...@aevum.de> wrote:
> On Feb 17, 2013, at 19:48 , Marvin Humphrey <ma...@rectangular.com> wrote:
>> For a number of reasons, Clownfish needs an immutable string type, which I
>> think should be named Clownfish::String.  CharBuf would become a mutable
>> subclass of String (and would be significantly rarer than it is now).
>
> Oh, I'd love to help working on this. One thing I don't like about the
> current CharBuf implementation is that the way to iterate through strings is
> rather limiting. The highlighter code in particular could benefit from a few
> changes.

I'll reply to this in a new thread later.

> OK, then I continue the work I started in 'clownfish-test-v2'.

Nick++

> To recap, there seems to be consensus on the following items:
>
>     * Use the following macros locally in the Lucy tests:
>         * OK
>         * INT_EQ
>         * FLOAT_EQ
>         * STR_EQ
>         * CB_EQ
>         * CB_EQ_STR
>     * Store the current test batch in a global variable so we can
>       do away with the first macro argument.

+1

> WRT the printf-style messages, I'd simply keep them.

+1

Aside from using the "vself hack"[1] to support an approximation of variadic
methods[2], the printf-style doesn't make the Clownfish::Test interface more
brittle or the implementation much harder to maintain.  It was only worth
deleting on the basis of YAGNI and for the sake of less code -- but if it's
actually being used, let's keep it.

Marvin Humphrey

[1] The "vself hack" involves creating a variadic inert function which wraps a
    method which takes a `va_list`.  The first argument to the inert wrapper
    function is `void *vself`, which is a "hack" because we don't get the
    benefit of type checking `self` the way we do with our method invocations.

[2] I think we may be able to kill off the "vself hack".  It was necessary to
    avoid variadic macros when Clownfish method invocations were implemented
    as macros, but now that they're inline functions, we should be able to
    support variadic methods without any problem.

Re: [lucy-dev] Rework test infrastructure for C bindings

Posted by Nick Wellnhofer <we...@aevum.de>.

On Feb 17, 2013, at 19:48 , Marvin Humphrey <ma...@rectangular.com> wrote:

> For a number of reasons, Clownfish needs an immutable string type, which I
> think should be named Clownfish::String.  CharBuf would become a mutable
> subclass of String (and would be significantly rarer than it is now).

Oh, I'd love to help working on this. One thing I don't like about the current CharBuf implementation is that the way to iterate through strings is rather limiting. The highlighter code in particular could benefit from a few changes.

> How about CSTR_EQ for C strings and STR_EQ for CharBufs (and eventually,
> Clownfish::Strings)?
> 
> IMO, claiming primacy for CharBufs over C strings is already justified based
> on stats:
> 
>    $ grep -pir "cb_equals\b" core/Lucy/Test/ | wc -l
>          58
>    $ grep -pir "cb_equals_str\b" core/Lucy/Test/ | wc -l
>          42
>    $ grep -pir "str_eq\b" core/Lucy/Test/ | wc -l
>           3
> 
> Or alternatively, can we punt and define the three test macros you enumerate
> above in Lucy::Test but not Clownfish::Test?

+1 for punting the decision, then.

> I'd thought the printf-style would have been taken advantage of more often,
> but on inspection I think there were 4 usages total between the Charmonizer
> test suite and the Lucy test suite.  I'm a little surprised that you actually
> used it more in CFC's tests.

That's interesting. I used the printf style about 35 times in the CFC tests:

    $ grep '%' clownfish/compiler/src/CFCTest[A-Z]*.c |grep -v sprintf |wc -l
          35

It's probably due to the nature of the CFC tests where a lot of stuff is tested in 'for' loops.

> Hmm, I hope the give-and-take of consensus building hasn't been too
> demotivating. :\

No, not at all.

> If it's just coding fatigue, though, then congratulations on
> getting a huge amount done before burning out.  I'm happy to do some work on
> this area, and we might be able to queue up some easy issues as well for
> newcomers looking to get involved.

Refactoring and porting tests is simply boring stuff and I have done a lot of that lately.

> We might consider looking to xUnit once again for inspiration and
> encapsulating the list of TestBatches within a TestSuite -- but that's
> something we could tackle down the road.

OK, then I continue the work I started in 'clownfish-test-v2'.

To recap, there seems to be consensus on the following items:

    * Use the following macros locally in the Lucy tests:
        * OK
        * INT_EQ
        * FLOAT_EQ
        * STR_EQ
        * CB_EQ
        * CB_EQ_STR
    * Store the current test batch in a global variable so we can
      do away with the first macro argument.

WRT the printf-style messages, I'd simply keep them.

Nick

Re: [lucy-dev] Rework test infrastructure for C bindings

Posted by Marvin Humphrey <ma...@rectangular.com>.

On Sat, Feb 16, 2013 at 5:30 PM, Nick Wellnhofer <we...@aevum.de> wrote:
> See my new branch 'clownfish-test-v2' for a solution based on this idea. I
> agree that it's better than my previous approach.

I like the new branch very much.  Thank you for taking the time to refine it.

> OK, I thought a TestCase was a single "assertion". If it's equivalent to a
> test script, I didn't understand the xUnit terminology.

Interesting -- that's the same misunderstanding I had when I first encountered
xUnit.

>> `STR_EQ` is good as implemented in CFC, but the main string type in Clownfish
>> is an object (named "CharBuf" at this time) rather than a NUL-terminated C
>> string.  If we have `STR_EQ` take two CharBuf objects, we'll find a lot more
>> use for it.
>
> Maybe we should have STR_EQ, CB_EQ and even CB_EQ_STR?

For a number of reasons, Clownfish needs an immutable string type, which I
think should be named Clownfish::String.  CharBuf would become a mutable
subclass of String (and would be significantly rarer than it is now).

I'll omit the rationale since that's a long tangent, and as of this moment I
don't intend to prioritize work on an immutable string type until after we
have at least Python and Ruby bindings -- but I'd like to stake a claim on the
high-value "str" abbreviation for Clownfish::String.

How about CSTR_EQ for C strings and STR_EQ for CharBufs (and eventually,
Clownfish::Strings)?

IMO, claiming primacy for CharBufs over C strings is already justified based
on stats:

    $ grep -pir "cb_equals\b" core/Lucy/Test/ | wc -l
          58
    $ grep -pir "cb_equals_str\b" core/Lucy/Test/ | wc -l
          42
    $ grep -pir "str_eq\b" core/Lucy/Test/ | wc -l
           3

Or alternatively, can we punt and define the three test macros you enumerate
above in Lucy::Test but not Clownfish::Test?

> In the CFC tests, the printf-style is used quite a lot. IMO it's quite
> convenient so I'd like to keep it.

I'd thought the printf-style would have been taken advantage of more often,
but on inspection I think there were 4 usages total between the Charmonizer
test suite and the Lucy test suite.  I'm a little surprised that you actually
used it more in CFC's tests.

> My main concern for now is to adapt the tests for the C bindings.  My second
> approach in 'clownfish-test-v2' is less invasive than the first one and to
> be honest I lost a bit of my enthusiasm of making bigger changes to the test
> infrastructure ;)

Hmm, I hope the give-and-take of consensus building hasn't been too
demotivating. :\  If it's just coding fatigue, though, then congratulations on
getting a huge amount done before burning out.  I'm happy to do some work on
this area, and we might be able to queue up some easy issues as well for
newcomers looking to get involved.

> The only thing I really care about is that adding a new test class and
> making it work with every host language requires the minimum amount of
> changes to the rest of the code. For the C bindings, a VArray with all the
> test classes would be ideal. In my new branch, I moved that code from
> TestEngine to Lucy::Test but I'm not sure if that's the better idea.

We might consider looking to xUnit once again for inspiration and
encapsulating the list of TestBatches within a TestSuite -- but that's
something we could tackle down the road.

Marvin Humphrey

Re: [lucy-dev] Rework test infrastructure for C bindings

Posted by Nick Wellnhofer <we...@aevum.de>.

On Feb 16, 2013, at 20:01 , Marvin Humphrey <ma...@rectangular.com> wrote:

> OK, got it.
> 
> The rationale for running all the tests within a single process, despite the
> potential for interference, is that we won't have to write a TAP parser.
> 
> And the rationale for passing around the TestEngine as the first argument to
> individual test subroutines rather than the TestBatch is that the TestEngine
> is maintaining pass/fail statistics for the entire suite.
> 
> Right?

Exactly.

>> TestEngine only has to be subclassed once per parcel.
> 
> That's true, but under the proposed changes TestEngine becomes mandatory -- so
> you have to deal with it up front.  In other words, the startup development
> costs are higher, though they are amortized over time.
> 
> That's the reverse of what we ought to be doing -- we should strive to
> minimize the amount of effort it takes to write and run a single test case in
> isolation.

I don't think my previous implementation of TestEngine would increase startup costs considerably more than other solutions.

> Another downside is that it would be nice to customize the behavior of the
> test routines per test class, but if the first argument to the test routine is
> a TestEngine rather than a TestBatch, we can only customize the behavior at
> the suite level.

This is a valid concern.

> I appreciate that we want a convenient way to run all the tests for the C
> bindings.  However, I'd like to avoid changing the first argument to our
> test routines.
> 
> I believe that it's possible to solve this dilemma by having the test runner
> inspect the TestBatch after Run_Tests() completes.
> 
>    TestBatch_Run_Tests(batch);
>    test_runner->num_passed  += TestBatch_Get_Num_Passed(batch);
>    test_runner->num_failed  += TestBatch_Get_Num_Failed(batch);
>    test_runner->num_skipped += TestBatch_Get_Num_Skipped(batch);
>    // …

See my new branch 'clownfish-test-v2' for a solution based on this idea. I agree that it's better than my previous approach.

> OK, I think I see what you mean.  The individual test subroutines are
> "assertions" in xUnit parlance, but an xUnit "TestCase" corresponds roughly to
> a Perl test script -- which is what I think you mean by "a single test".

OK, I thought a TestCase was a single "assertion". If it's equivalent to a test script, I didn't understand the xUnit terminology.

>> But xUnit's TestSuite can also contain a collection of other TestSuites
>> which is what I'm trying achieve with the registry in TestEngine.
> 
> Hmm, I don't really grok why that's needed right away.  Why isn't a flat
> collection of test subclasses good enough for now?

A flat collection of subclasses is OK. The confusion comes from my misunderstanding of the xUnit terminology.

> `INT_EQ` is good.  I'd argue that it should start taking an `int64_t` rather
> than a `long`.

+1

> `STR_EQ` is good as implemented in CFC, but the main string type in Clownfish
> is an object (named "CharBuf" at this time) rather than a NUL-terminated C
> string.  If we have `STR_EQ` take two CharBuf objects, we'll find a lot more
> use for it.

Maybe we should have STR_EQ, CB_EQ and even CB_EQ_STR?

> A `PTR_EQ` which takes two `void*` arguments might also come in handy.

I don't think this would be useful. It doesn't save typing and a diagnosis like "Got pointer 0xABCD, expected 0x1234" doesn't really help.

>> I'd also like to get rid of the first parameter of the test macros. IIRC,
>> you already proposed a global test object.  Another solution could be based
>> on variadic macros but these aren't supported on all platforms, are they?
> 
> Variadic macros are not a blocker, as I'll explain in a moment.  However, I'm
> not enthusiastic about macros which would rely on having local variables named
> according to conventions -- too fragile and fiddly.

Fair enough.

> It turns out that we almost never need the printf-style formatted printing in
> test routines -- a single "message" argument suffices in almost all cases.  On
> those rare occasions where a message needs to be customized, it can be
> prepared in a separate sprintf() operation before the test routine is called.
> 
> I got rid of the variadic routines in the Charmonizer test harness a while
> back:
> 
>    http://s.apache.org/gK
> 
> We can likewise get rid of all the "V" routines in TestBatch, replacing e.g.
> the inert routine `pass` and the method `VPass` with a single `Pass`
> method.

In the CFC tests, the printf-style is used quite a lot. IMO it's quite convenient so I'd like to keep it.

> I feel like there are a lot of good ideas in this thread, but I'm concerned
> about trying to do too much at once.  Does it make sense to go after some
> low-hanging fruit?
> 
> For instance, we agree on these two things:
> 
> *   `TEST_TRUE` should be renamed `OK`.
> *   The first argument to `OK` should be hidden.
> 
> Maybe we can start there?  It would make the other changes under consideration
> less consequential.

My main concern for now is to adapt the tests for the C bindings. My second approach in 'clownfish-test-v2' is less invasive than the first one and to be honest I lost a bit of my enthusiasm of making bigger changes to the test infrastructure ;)

The only thing I really care about is that adding a new test class and making it work with every host language requires the minimum amount of changes to the rest of the code. For the C bindings, a VArray with all the test classes would be ideal. In my new branch, I moved that code from TestEngine to Lucy::Test but I'm not sure if that's the better idea.

Nick

Re: [lucy-dev] Rework test infrastructure for C bindings

Posted by Marvin Humphrey <ma...@rectangular.com>.

On Sat, Feb 16, 2013 at 3:30 AM, Nick Wellnhofer <we...@aevum.de> wrote:
> The idea of the TestEngine class is to provide a replacement for Perl's
> 'prove' tool.  For the Perl bindings we only ever run a single test batch
> with TestEngine but it's also possible to run multiple batches and print a
> summary at the end. I plan to use this feature for the C bindings.

OK, got it.

The rationale for running all the tests within a single process, despite the
potential for interference, is that we won't have to write a TAP parser.

And the rationale for passing around the TestEngine as the first argument to
individual test subroutines rather than the TestBatch is that the TestEngine
is maintaining pass/fail statistics for the entire suite.

Right?

As a bonus, it's nice not to have to create stub .t files to drive the test
runner under the C bindings as we have to under Perl.  (FWIW, there are ways
to do away with those stub .t files under Perl using recent versions of
TAP::Parser, but it's not worth either the surprise factor of the
unconventional configuration or the prereq.)

>> It seems unfortunate that users must provide *both* a subclass of TestBatch
>> *and* a subclass of TestEngine up front.  For the sake of simplicity, I think
>> we should only be asking people to learn one class when starting out -- so
>> subclassing Clownfish::Test::TestBatch ought to be enough.
>
> TestEngine only has to be subclassed once per parcel.

That's true, but under the proposed changes TestEngine becomes mandatory -- so
you have to deal with it up front.  In other words, the startup development
costs are higher, though they are amortized over time.

That's the reverse of what we ought to be doing -- we should strive to
minimize the amount of effort it takes to write and run a single test case in
isolation.

Another downside is that it would be nice to customize the behavior of the
test routines per test class, but if the first argument to the test routine is
a TestEngine rather than a TestBatch, we can only customize the behavior at
the suite level.

> It's basically a registry of all test batches replacing the huge 'if … else
> if'-chain in Lucy/Build/Binding/Misc.pm. There are other ways to achieve
> this like adding a Register_Test_Batch method to TestEngine or simply using
> an 'inert' function per parcel like you proposed.

I appreciate that we want a convenient way to run all the tests for the C
bindings.  However, I'd like to avoid changing the first argument to our
test routines.

I believe that it's possible to solve this dilemma by having the test runner
inspect the TestBatch after Run_Tests() completes.

    TestBatch_Run_Tests(batch);
    test_runner->num_passed  += TestBatch_Get_Num_Passed(batch);
    test_runner->num_failed  += TestBatch_Get_Num_Failed(batch);
    test_runner->num_skipped += TestBatch_Get_Num_Skipped(batch);
    // ...

>> FWIW, take away the test methods, and TestEngine becomes effectively an xUnit
>> TestSuite.  The xUnit system isn't perfect, but the division of roles into
>> TestCase (TestBatch for us), TestSuite and TestRunner makes sense IMO.  Shall
>> we formalize our own variants?
>
> I'm not familiar with xUnit, but it seems that my TestBatch class actually
> corresponds to xUnit's TestSuite, and my TestEngine to xUnit's TestRunner.
> xUnit's TestCase seems to encapsulate only a single test.

OK, I think I see what you mean.  The individual test subroutines are
"assertions" in xUnit parlance, but an xUnit "TestCase" corresponds roughly to
a Perl test script -- which is what I think you mean by "a single test".

For the record, the name "TestBatch" originated under the Charmonizer test
harness as an approximation of a Perl test script, inspired partly by
famililarity I'd gained with Test::Simple's guts by hacking on David Wheeler's
JavaScript port, Test.Simple.  "TestBatch" was supposed to convey "a batch of
test subroutines" --  but if the name is causing confusion because it seems to
imply "a batch of test *files*", perhaps our base test class should be named
TestCase instead, a la xUnit.

The parallel between a Perl test script and an xUnit TestCase is not perfect,
as there is more of an emphasis on setting up and tearing down the "fixture"
or "test context" in xUnit.  However, those are nice features, and I think it
makes sense to move in that direction.

> But xUnit's TestSuite can also contain a collection of other TestSuites
> which is what I'm trying achieve with the registry in TestEngine.

Hmm, I don't really grok why that's needed right away.  Why isn't a flat
collection of test subclasses good enough for now?

(FWIW, what you describe sounds similar to the "subtest" feature available in
recent versions of Test::More.)

>> PS: One other thing you did in the CFC tests is you used `OK` and friends
>>    instead of `TEST_TRUE` and so on.  Would you prefer to see Clownfish::Test
>>    and thus all of Lucy's tests do the same?  I wouldn't mind.
>
> Yes, I'd prefer 'OK' inspired by Perl's Test::More. For the equality test
> macros, I used 'INT_EQ' and 'STR_EQ' in CFC which don't correspond to other
> Perlish names.

`INT_EQ` is good.  I'd argue that it should start taking an `int64_t` rather
than a `long`.

`STR_EQ` is good as implemented in CFC, but the main string type in Clownfish
is an object (named "CharBuf" at this time) rather than a NUL-terminated C
string.  If we have `STR_EQ` take two CharBuf objects, we'll find a lot more
use for it.

A `PTR_EQ` which takes two `void*` arguments might also come in handy.

> I also found that in many places the Lucy test suite uses 'TEST_TRUE' where
> the equality test macros could be used. Now would be a good opportunity to
> fix this.

If I recall correctly, that's often because of precision limits, or because an
Equals() method is being invoked.

> I'd also like to get rid of the first parameter of the test macros. IIRC,
> you already proposed a global test object.  Another solution could be based
> on variadic macros but these aren't supported on all platforms, are they?

Variadic macros are not a blocker, as I'll explain in a moment.  However, I'm
not enthusiastic about macros which would rely on having local variables named
according to conventions -- too fragile and fiddly.

It turns out that we almost never need the printf-style formatted printing in
test routines -- a single "message" argument suffices in almost all cases.  On
those rare occasions where a message needs to be customized, it can be
prepared in a separate sprintf() operation before the test routine is called.

I got rid of the variadic routines in the Charmonizer test harness a while
back:

    http://s.apache.org/gK

We can likewise get rid of all the "V" routines in TestBatch, replacing e.g.
the inert routine `pass` and the method `VPass` with a single `Pass`
method.

...

I feel like there are a lot of good ideas in this thread, but I'm concerned
about trying to do too much at once.  Does it make sense to go after some
low-hanging fruit?

For instance, we agree on these two things:

*   `TEST_TRUE` should be renamed `OK`.
*   The first argument to `OK` should be hidden.

Maybe we can start there?  It would make the other changes under consideration
less consequential.

Marvin Humphrey

Re: [lucy-dev] Rework test infrastructure for C bindings

Posted by Nick Wellnhofer <we...@aevum.de>.

On Feb 16, 2013, at 07:12 , Marvin Humphrey <ma...@rectangular.com> wrote:

> I'm more concerned about extracting all the test routines from TestBatch into
> the new class TestEngine.  (Is the idea to support alternative TestEngine
> implementations, e.g. one that provides xUnit-style assertions?)

The idea of the TestEngine class is to provide a replacement for Perl's 'prove' tool. For the Perl bindings we only ever run a single test batch with TestEngine but it's also possible to run multiple batches and print a summary at the end. I plan to use this feature for the C bindings.

> It seems unfortunate that users must provide *both* a subclass of TestBatch
> *and* a subclass of TestEngine up front.  For the sake of simplicity, I think
> we should only be asking people to learn one class when starting out -- so
> subclassing Clownfish::Test::TestBatch ought to be enough.

TestEngine only has to be subclassed once per parcel. For every new test batch, one only has to add a single line with a VA_Push to TestEngine here (line 36):

    http://s.apache.org/6EL

It's basically a registry of all test batches replacing the huge 'if … else if'-chain in Lucy/Build/Binding/Misc.pm. There are other ways to achieve this like adding a Register_Test_Batch method to TestEngine or simply using an 'inert' function per parcel like you proposed.

> FWIW, take away the test methods, and TestEngine becomes effectively an xUnit
> TestSuite.  The xUnit system isn't perfect, but the division of roles into
> TestCase (TestBatch for us), TestSuite and TestRunner makes sense IMO.  Shall
> we formalize our own variants?

I'm not familiar with xUnit, but it seems that my TestBatch class actually corresponds to xUnit's TestSuite, and my TestEngine to xUnit's TestRunner. xUnit's TestCase seems to encapsulate only a single test. But xUnit's TestSuite can also contain a collection of other TestSuites which is what I'm trying achieve with the registry in TestEngine.

> PS: One other thing you did in the CFC tests is you used `OK` and friends
>    instead of `TEST_TRUE` and so on.  Would you prefer to see Clownfish::Test
>    and thus all of Lucy's tests do the same?  I wouldn't mind.

Yes, I'd prefer 'OK' inspired by Perl's Test::More. For the equality test macros, I used 'INT_EQ' and 'STR_EQ' in CFC which don't correspond to other Perlish names.

I also found that in many places the Lucy test suite uses 'TEST_TRUE' where the equality test macros could be used. Now would be a good opportunity to fox this.

I'd also like to get rid of the first parameter of the test macros. IIRC, you already proposed a global test object. Another solution could be based on variadic macros but these aren't supported on all platforms, are they?

Nick

Re: [lucy-dev] Rework test infrastructure for C bindings

Posted by Marvin Humphrey <ma...@rectangular.com>.

On Fri, Feb 15, 2013 at 3:23 PM, Nick Wellnhofer <we...@aevum.de> wrote:

> The class Clownfish::Test::TestEngine incorporates the features of
> Lucy::Test::TestBatch but with pluggable output formats which use the
> abstract class Clownfish::Test::TestFormatter.

The TestFormatter class seems like a nice step forward.

> Every test class now inherits from Clownfish::Test::TestBatch instead of
> being an inert class.

+1

> Lucy::Test::TestEngineLucy inherits from Clownfish::Test::TestEngine and
> populates a VArray with all the TestBatch classes. This makes it possible to
> lookup tests by name so we don't need Perl bindings for every single test
> class.

I see why it makes sense to avoid specifying bindings for every test class.
It's worse than you think -- each test batch needs a binding not just for
Perl, but for every host language!

Working around that problem is straightforward, though -- see code for an ad
hoc run_test_batch() function below.

I'm more concerned about extracting all the test routines from TestBatch into
the new class TestEngine.  (Is the idea to support alternative TestEngine
implementations, e.g. one that provides xUnit-style assertions?)

It seems unfortunate that users must provide *both* a subclass of TestBatch
*and* a subclass of TestEngine up front.  For the sake of simplicity, I think
we should only be asking people to learn one class when starting out -- so
subclassing Clownfish::Test::TestBatch ought to be enough.

FWIW, take away the test methods, and TestEngine becomes effectively an xUnit
TestSuite.  The xUnit system isn't perfect, but the division of roles into
TestCase (TestBatch for us), TestSuite and TestRunner makes sense IMO.  Shall
we formalize our own variants?

> For now, I only ported a single test to the new layout. If there aren't any
> objections, I'll port the rest of the tests and switch to the new test
> infrastructure.

Thanks for bringing the issue to the list and building consensus.

PS: One other thing you did in the CFC tests is you used `OK` and friends
    instead of `TEST_TRUE` and so on.  Would you prefer to see Clownfish::Test
    and thus all of Lucy's tests do the same?  I wouldn't mind.

Marvin Humphrey

int32_t
LucyTest_run_test_batch(const char *name, const char *format) {
    // Obtain a TestFormatter.
    TestFormatter *formatter = NULL;
    if (strcmp (format, "tap") == 0) {
        formatter = (TestFormatter*)TestFormatterTAP_new();
    }
    // ...
    else {
        THROW(ERR, "Can't find formatter for '%s'", format);
    }

    // Find the test class to run.
    TestBatch *batch = NULL;
    if (strcmp(name, "Lucy::Test::Highlighter::TestHeatMap") == 0) {
        batch = (TestBatch*)TestHeatMap_new(formatter);
    }
    // ...
    else {
        THROW(ERR, "Can't find test class '%s'", name);
    }

    // Run tests, clean up, and return success/failure.
    int32_t result = TestBatch_Run_Tests(batch);
    DECREF(batch);
    DECREF(formatter);
    return result;
}