You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucy.apache.org by Nick Wellnhofer <we...@aevum.de> on 2013/02/21 13:43:56 UTC

[lucy-dev] Iterating through CharBufs

On 21/02/2013 04:33, Marvin Humphrey wrote:
> Iterating through strings seems orthogonal to mutability.  What is it that you
> find objectionable about the current iteration support?

AFAICS, the current way to efficiently iterate through a string is to 
create a ViewCharBuf and use the ViewCB_Nip_One method, resulting in the 
following code:

     ViewCB_Assign(iterator, string);
     while (ViewCB_Get_Size(iterator)) {
         uint32_t code_point = ViewCB_Code_Point_At(iterator, 0);
         // Do something with code_point
         ViewCB_Nip_One(iterator);
     }

ViewCharBuf seems to be an immutable reference to a substring of a 
CharBuf. ViewCB_Nip_One advances the start of the substring by a single 
code point.

I can see the following drawbacks with this approach:

1. It's not possible to safely iterate backwards with a ViewCharBuf 
because the ViewCharBuf doesn't know where the original string starts. 
Stepping backwards from a certain position in a string is rarely needed 
in practice but the Highlighter is an example where exactly this 
operation is used.

2. It's hard to keep track of multiple positions in a string and extract 
a substring between two positions. These operations are primarily needed 
when splitting and tokenizing strings. You could remember a previous 
position in a string by simply copying a whole ViewCharBuf but 
extracting the substring between the start of two ViewCharBufs seems 
extremely messy to me.

IMO, the best way to solve these problems is to introduce string 
iterators. In a basic form, they are simply the collection of a byte 
offset and a code point offset into a string.

     struct CharBufIterator {
         CharBuf *cb;
         size_t   byte_offset;
         size_t   code_point_offset;
     };

Useful operations on a string iterator are:

     * Move the iterator forward or backward by a number of
       code points.
     * Get the code point at the current position or at an
       offset relative to the current position.
     * Get a substring between two string iterators.

I used a very basic form of string iterators in my implementation of the 
StandardTokenizer:

     http://s.apache.org/DCH

You can also make string iterators into full-blown classes. But 
typically they're short-lived and only used as local variables inside a 
single function.

> Bear in mind that one requirement for Clownfish strings going forward is to
> support UTF-16 as an internal encoding in addition to UTF-8.

Actually, you can implement every string operation in terms of string 
iterators. This concept is, for example, heavily used in the Parrot VM 
which transparently supports strings in ASCII, UTF-8, UCS-2, UTF-16, and 
UCS-4 encodings. FWIW, I refactored large parts of Parrot's string 
subsystem in 2010 and I'd be happy to share my experiences.

Nick

Re: [lucy-dev] Iterating through CharBufs

Posted by Marvin Humphrey <ma...@rectangular.com>.

On Fri, Feb 22, 2013 at 8:40 AM, Kurt Starsinic <ks...@gmail.com> wrote:
> For a little context, the history constraint isn't mentioned in Liskov's
> original paper <http://doi.acm.org/10.1145/62139.62141> (1987); it first
> appears in Liskov and Wing <http://doi.acm.org/10.1145/197320.197383>(1994).

Material for the Book Club?

After a brief skim, my impression is that while both are interesting, the
second paper is more cohesive and a better read.

> And, having said all that, it seems to me that making CharBuf a subclass of
> String is a square-peg/round-hole situation.

The reverse -- having String subclass CharBuf -- is problematic as well.  If
the subtype is immutable, then casting to the mutable superclass is akin to
casting away const-ness.

This issue comes up when wrapping a host argument string to pass to a function
in Clownfish-space.

If the Clownfish-space function declares that it accepts a `const CharBuf*` as
an argument, then it's safe to provide string content owned by the host object
within a stack-allocated wrapper.

    // Implemetation of Clownfish-space function.
    void
    Foo_set_name(Foo *self, const CharBuf *name) { // <--- `const`
        CharBuf *temp = self->name;
        self->name = CB_Clone(name);  // <--- Clone(), not INCREF.
        DECREF(temp);
    }

    // Wrapper function.
    void
    wrap_Foo_set_name(SV *foo_sv, SV *name_sv) {
        Foo *self = (Foo*)extract_cfish_obj_from_sv(foo_sv, FOO);
        STRLEN len;
        const char *name_str = SvPVutf8(name_sv, len);
        void *stack_memory = alloca(ZCB_size());
        ZombieCharBuf *name = ZCB_wrap(stack_memory, name_str, len);
        Foo_set_name(self, name);
    }

The Clownfish-space function cannot store a reference to the `const CharBuf*`,
because that requires taking ownership of a refcount -- and incrementing a the
refcount requires mutating the object, violating the `const` constraint.
(The rules for passing around a `const CharBuf*` are the same as for passing
around a `const char*`.)

However, if the Clownfish-space function declares that it takes a non-const
`CharBuf*` as an argument, we need to malloc a whole new CharBuf and copy the
host string into it -- and then after the Clownfish-space function returns, we
need to release its refcount[1].

Allocating on the heap is essential because the Clownfish-space code may
INCREF the CharBuf and store a reference which outlives a stack-allocated
wrapper.

    // Implemetation of Clownfish-space function.
    void
    Foo_set_name(Foo *self, CharBuf *name) { // <--- not `const`
        CharBuf *temp = self->name;
        self->name = (CharBuf*)INCREF(name); // <--- INCREF, not Clone().
        DECREF(temp);
    }

    // Wrapper function.
    void
    wrap_Foo_set_name(SV *foo_sv, SV *name_sv) {
        Foo *self = (Foo*)extract_cfish_obj_from_sv(foo_sv, FOO);
        STRLEN len;
        const char *name_str = SvPVutf8(name_sv, len);
        CharBuf *name = CB_new_from_trusted_utf8(name_str, len);
        Foo_set_name(self, name);
        DECREF(name);  // <--- Dispose of refcount.
    }

Surprisingly, the need to wrap host string arguments, which is a significant
aspect of Clownfish, does not argue for an immutable String class -- it argues
for `const`!

Only the _content_ of a non-const String is immutable, not its refcount -- so
only a `const String*` can replace `const CharBuf*` in our type-mapping
scheme.

Marvin Humphrey

[1] The refcount will leak if an exception is thrown in Clownfish-space
    because we do not have control over unwinding the stack.

Re: [lucy-dev] Iterating through CharBufs

Posted by Kurt Starsinic <ks...@gmail.com>.

For a little context, the history constraint isn't mentioned in Liskov's
original paper <http://doi.acm.org/10.1145/62139.62141> (1987); it first
appears in Liskov and Wing <http://doi.acm.org/10.1145/197320.197383>(1994).

Having said that (my apologies if this is well-trodden ground, I'm New
Here), I don't think Liskov Substitution is best thought of as the law of
the land, but rather as a useful guideline to be violated only with eyes
wide open.

And, having said all that, it seems to me that making CharBuf a subclass of
String is a square-peg/round-hole situation. I'm gonna read some code now.

- Kurt

On Fri, Feb 22, 2013 at 6:27 AM, Nick Wellnhofer <we...@aevum.de>wrote:

> On 22/02/2013 06:38, Marvin Humphrey wrote:
>
>> On Thu, Feb 21, 2013 at 4:43 AM, Nick Wellnhofer <we...@aevum.de>
>> wrote:
>> How about a three-prong strategy which uses both of those approaches and
>> adds one more?
>>
>> *   Externally, expose iterators as full-blown, opaque objects.
>> *   Internally, allocate iterators on the stack using alloca() and access
>> the
>>      struct members directly.
>> *   To accommodate highly performance-sensitive client code, provide
>> access to
>>      raw string data, so that the client can operate on it using its own
>> highly
>>      optimized and customized routines.
>>
>
> +1
>
>
>  This remains a problem if we make CharBuf a subclass of String.  If safe
>> iteration is part of String's interface, then a mutable subclass which
>> cannot
>> support safe iteration violates the Liskov Substitution Principle[1].
>>
>
> Doesn't a mutable subclass of an immutable class already violate the
> Liskov Substitution Principle? See "history constraint" on the Wikipedia
> page.
>
> Nick
>
>

Re: [lucy-dev] Iterating through CharBufs

Posted by Nick Wellnhofer <we...@aevum.de>.

On 22/02/2013 06:38, Marvin Humphrey wrote:
> On Thu, Feb 21, 2013 at 4:43 AM, Nick Wellnhofer <we...@aevum.de> wrote:
> How about a three-prong strategy which uses both of those approaches and
> adds one more?
>
> *   Externally, expose iterators as full-blown, opaque objects.
> *   Internally, allocate iterators on the stack using alloca() and access the
>      struct members directly.
> *   To accommodate highly performance-sensitive client code, provide access to
>      raw string data, so that the client can operate on it using its own highly
>      optimized and customized routines.

+1

> This remains a problem if we make CharBuf a subclass of String.  If safe
> iteration is part of String's interface, then a mutable subclass which cannot
> support safe iteration violates the Liskov Substitution Principle[1].

Doesn't a mutable subclass of an immutable class already violate the 
Liskov Substitution Principle? See "history constraint" on the Wikipedia 
page.

Nick

Re: [lucy-dev] Iterating through CharBufs

Posted by Marvin Humphrey <ma...@rectangular.com>.

On Thu, Feb 21, 2013 at 4:43 AM, Nick Wellnhofer <we...@aevum.de> wrote:
> IMO, the best way to solve these problems is to introduce string iterators.

+1

> In a basic form, they are simply the collection of a byte offset and a code
> point offset into a string.
>
>     struct CharBufIterator {
>         CharBuf *cb;
>         size_t   byte_offset;
>         size_t   code_point_offset;
>     };
>
> Useful operations on a string iterator are:
>
>     * Move the iterator forward or backward by a number of
>       code points.
>     * Get the code point at the current position or at an
>       offset relative to the current position.
>     * Get a substring between two string iterators.
>
> I used a very basic form of string iterators in my implementation of the
> StandardTokenizer:
>
>     http://s.apache.org/DCH
>
> You can also make string iterators into full-blown classes. But typically
> they're short-lived and only used as local variables inside a single
> function.

How about a three-prong strategy which uses both of those approaches and
adds one more?

*   Externally, expose iterators as full-blown, opaque objects.
*   Internally, allocate iterators on the stack using alloca() and access the
    struct members directly.
*   To accommodate highly performance-sensitive client code, provide access to
    raw string data, so that the client can operate on it using its own highly
    optimized and customized routines.

Here's some trivial sample code using CharBufIterator:

    CharBuf *hello = CB_newf("hello world");
    CharBufIterator *iterator = CB_Make_Iterator(hello);
    while (CBIter_Next(iterator)) {
        printf("%c | %d\n", CBIter_Get_Code_Point(iterator),
               (int)CBIter_Get_Byte_Offset(iterator));
    }

Returning to your earlier connection between iteration and mutability, the
fly in the ointment for exposing iteration as a public API on CharBuf is that
iterating over a mutable object like a CharBuf is an unsafe operation.
In contrast, iterating over an immutable String would be safe.

This remains a problem if we make CharBuf a subclass of String.  If safe
iteration is part of String's interface, then a mutable subclass which cannot
support safe iteration violates the Liskov Substitution Principle[1].
Arguably, a String class and a mutable character buffer serve different
purposes.

A similar issue exists when using CharBufs as hash keys: it is possible to get
at a CharBuf key and mutate it, changing its hash sum, potentially colliding
with another key and generally wreaking havoc.  Python only allows immutable
types to serve as hash keys to avoid such problems.

>> Bear in mind that one requirement for Clownfish strings going forward is to
>> support UTF-16 as an internal encoding in addition to UTF-8.
>
> Actually, you can implement every string operation in terms of string
> iterators. This concept is, for example, heavily used in the Parrot VM which
> transparently supports strings in ASCII, UTF-8, UCS-2, UTF-16, and UCS-4
> encodings. FWIW, I refactored large parts of Parrot's string subsystem in
> 2010 and I'd be happy to share my experiences.

We're fortunate to be able to draw on your experience. :)

The Clownfish core, it seems to me, has goals which are inverted in
comparison to Parrot.  While Parrot's data types have to provide exhaustive
support for all features in all target languages, the primary objective for
Clownfish is to provide least-common-denominator data types which convert to
and from host data types sensibly and with minimal resistance.  That leaves us
with a lot of freedom to pursue the secondary objective of giving the core
Clownfish data types a coherent, user-friendly programming API.

Marvin Humphrey

[1] http://en.wikipedia.org/wiki/Liskov_substitution_principle

Re: [lucy-dev] Iterating through CharBufs

Posted by Nick Wellnhofer <we...@aevum.de>.

On Feb 24, 2013, at 06:59 , Marvin Humphrey <ma...@rectangular.com> wrote:

> I'd be interested whether you have anything to add to this argument from Tom
> Christiansen as to why iteration is the best model for string processing, as
> opposed to random access:
> 
>    http://bugs.python.org/issue12729#msg142036

I fully agree with Tom. Random access is only useful if you deal with fixed-length records which are rarely used these days.

This is a very interesting thread, BTW. It taught me some things I didn't know about Unicode yet. Thanks for sharing it.

Another nice thing about iterators is that if we have to support multiple encodings, the encoding can be abstracted behind the iterator interface. So we can share the implementations of String methods across encodings except for performance-critical stuff like Hash_Sum.

> Christiansen also argues for UTF-8 as a native encoding, like Perl and Go.
> Clownfish doesn't have that option -- but if we make iteration our primary
> string processing model, we can avoid problems associated with random
> access, such as splitting logical characters.

UTF-8 is certainly superior in almost all aspects. The fact that UTF-16 is still used so much has mainly historical reasons. Many implementations originally started out with UCS-2 and later upgraded to UTF-16 being the obvious but not really ideal choice. Switching from a fixed-width to a variable-width encoding has a lot of implications which have been overlooked in some programming languages as Tom Christiansen points out in the thread mentioned above.

Nick

Re: [lucy-dev] Iterating through CharBufs

Posted by Marvin Humphrey <ma...@rectangular.com>.

On Thu, Feb 21, 2013 at 4:43 AM, Nick Wellnhofer <we...@aevum.de> wrote:
> Actually, you can implement every string operation in terms of string
> iterators. This concept is, for example, heavily used in the Parrot VM which
> transparently supports strings in ASCII, UTF-8, UCS-2, UTF-16, and UCS-4
> encodings. FWIW, I refactored large parts of Parrot's string subsystem in
> 2010 and I'd be happy to share my experiences.

I'd be interested whether you have anything to add to this argument from Tom
Christiansen as to why iteration is the best model for string processing, as
opposed to random access:

    http://bugs.python.org/issue12729#msg142036

    I may be wrong here, not least because I can think of possible extenuating
    circumstances, but it is my impression that there there is an underlying
    assumption in the Python community and many others that being able to
    access the Nth character in a string in constant time for arbitrary N is
    the most important of all possible considerations.

    I don't believe that makes as much sense as people think, because I don't
    believe character strings really are accessed in that fashion very often
    at all.  Sure, if you have a 2D matrix of strings where a given row-column
    pair yields one character and you're taking the diagonal you might want
    that, but how often does that actually happen?  Virtually never: these are
    strings and not matrices we're running FFTs on after all.  We don't need
    to be able to load them into vector registers or anything the way the
    number-crunching people do.

    That's because strings are a sequence of characters: they're text.
    Whether reading text left to right, right to left, or even
    boustrophedonically, you're always going one past the character you're
    currently at.  You aren't going to the Nth character forward or back for
    arbitrary N.  That isn't how people deal with text.  Sometimes they do
    look at the end, or a few in from the far end, but even that can be
    handled in other fashions.

Christiansen also argues for UTF-8 as a native encoding, like Perl and Go.
Clownfish doesn't have that option -- but if we make iteration our primary
string processing model, we can avoid problems associated with random
access, such as splitting logical characters.

Marvin Humphrey