You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucy.apache.org by Nick Wellnhofer <we...@aevum.de> on 2015/10/07 13:10:49 UTC
[lucy-dev] Clownfish String API
Lucifers,
here are my condensed thoughts on cleaning up the Clownfish String API:
String
- Rename `new_stack_string` to `init_stack_string`.
- Make all `new_*`, `init_wrap_*` functions public except:
- `init_stack_string` (only called from macros)
- Remove `Cat_Utf8` and `Cat_Trusted_Utf8`?
Callers should probably use CharBufs instead.
Only keep `Cat`.
- Remove underused `Swap_Chars`.
Maybe add a `Replace` method instead.
- Make stack strings public?
- Add macro to create stack string from null-terminated UTF-8
without passing the string size:
String *str = SSTR_WRAP_C("A string.");
- Make `Code_Point_*` return special value instead of 0
if outside the string.
- Make `Find` return a size_t.
Requires special value for "not found".
- Make all methods public except:
- `Is_Copy_On_IncRef`
- `Hash_Sum`
StringIterator
- Rename StrIter_substring to Str_new_from_iter?
- Make `Starts_With`, `Ends_With` skip over strings?
I think this better matches current usage patterns.
- Find better names for `Skip_{Next|Prev}_Whitespace`?
- Make all methods public.
Nick
Re: [lucy-dev] Str_Find return type
Posted by Nick Wellnhofer <we...@aevum.de>.
On 22/10/2015 23:21, Marvin Humphrey wrote:
> You're absolutely right, avoiding an index which counts code points is
> consistent with our iterator-centric model for string processing. Good
> insight, and nice API proposal!
I'm also tempted to remove Str_Code_Point_At and make string iterators the
only way to access character data in a string.
Nick
Re: [lucy-dev] Str_Find return type
Posted by Marvin Humphrey <ma...@rectangular.com>.
On Thu, Oct 22, 2015 at 8:06 AM, Nick Wellnhofer <we...@aevum.de> wrote:
> On 18/10/2015 14:01, Nick Wellnhofer wrote:
>>>> - Make `Find` return a size_t.
>>>> Requires special value for "not found".
>>>
>>> Hmm, that's a toughie. If the sentinel is SIZE_MAX, that might not fit in
>>> all host numeric types.
>>
>>
>> Good point. The problem is not so much the byte size of the return type but
>> the fact that a host language might not support unsigned integers. Maybe we
>> should limit string sizes to SSIZE_MAX and make `Find` return an ssize_t.
>> But this requires to emulate the ssize_t type on non-POSIX platforms.
>
> Here's another idea. Most of the time, users of Str_Find don't care about
> the position of the substring and only want to know whether the substring is
> contained or not. For this use case, a method like Str_Contains returning a
> bool is a more appropriate interface.
>
> If someone is interested in the exact position of the substring, it might
> make more sense to return a string iterator pointing to the first occurrence
> of the substring. So what about:
>
> public bool
> Contains(String *self, String *substring);
>
> public incremented nullable StringIterator*
> Find(String *self, String *substring);
+1
You're absolutely right, avoiding an index which counts code points is
consistent with our iterator-centric model for string processing. Good
insight, and nice API proposal!
Marvin Humphrey
[lucy-dev] Str_Find return type
Posted by Nick Wellnhofer <we...@aevum.de>.
On 18/10/2015 14:01, Nick Wellnhofer wrote:
>>> - Make `Find` return a size_t.
>>> Requires special value for "not found".
>>
>> Hmm, that's a toughie. If the sentinel is SIZE_MAX, that might not fit in all
>> host numeric types.
>
> Good point. The problem is not so much the byte size of the return type but
> the fact that a host language might not support unsigned integers. Maybe we
> should limit string sizes to SSIZE_MAX and make `Find` return an ssize_t. But
> this requires to emulate the ssize_t type on non-POSIX platforms.
Here's another idea. Most of the time, users of Str_Find don't care about the
position of the substring and only want to know whether the substring is
contained or not. For this use case, a method like Str_Contains returning a
bool is a more appropriate interface.
If someone is interested in the exact position of the substring, it might make
more sense to return a string iterator pointing to the first occurrence of the
substring. So what about:
public bool
Contains(String *self, String *substring);
public incremented nullable StringIterator*
Find(String *self, String *substring);
Nick
Re: [lucy-dev] Clownfish String API
Posted by Marvin Humphrey <ma...@rectangular.com>.
On Thu, Oct 22, 2015 at 3:13 PM, Nathan Kurz <na...@verse.com> wrote:
> Perhaps consider the modifier to be adverb rather than an adjective?
> That is, rather than specifying which whitespace, specify how you are
> skipping.
>
> For example:
>
> Skip_Forward_Whitespace / Skip_Backward_Whitespace.
>
> Or put it at the end:
>
> Skip_Whitespace_Forward / Skip_Whitespace_Backward
>
> Or as you suggest, keep the common case short and modify only the rare one:
>
> Skip_Whitespace / Skip_Whitespace_[Back|Backward|Reverse]
Thanks for the fresh perspective, Nate! I'm fine with any of these;
if I were to pick one combo it would be
`Skip_Whitespace`/`Skip_Whitespace_Back`.
Marvin Humphrey
Re: [lucy-dev] Clownfish String API
Posted by Nathan Kurz <na...@verse.com>.
On Thu, Oct 22, 2015 at 2:57 PM, Marvin Humphrey <ma...@rectangular.com> wrote:
> I think `Skip_Whitespace` is the best option for forward skipping. I can't
> think of a good option for moving backwards past any whitespace before the
> current iterator position. It's not a common operation, though, and I can
> live with a suboptimal name. `Skip_Prev_Whitespace` might be the clearest,
> since it can be contrasted with `Skip_Whitespace` -- and it's less
> gramatically awkward than `Recede_Whitespace`.
Hi Guys --
Perhaps consider the modifier to be adverb rather than an adjective?
That is, rather than specifying which whitespace, specify how you are
skipping.
For example:
Skip_Forward_Whitespace / Skip_Backward_Whitespace.
Or put it at the end:
Skip_Whitespace_Forward / Skip_Whitespace_Backward
Or as you suggest, keep the common case short and modify only the rare one:
Skip_Whitespace / Skip_Whitespace_[Back|Backward|Reverse]
I think all of these are slightly better than Skip_Prev_Whitespace,
which in turn is indeed much better than Recede_Whitespace.
--nate
Re: [lucy-dev] Clownfish String API
Posted by Marvin Humphrey <ma...@rectangular.com>.
On Sun, Oct 18, 2015 at 5:01 AM, Nick Wellnhofer <we...@aevum.de> wrote:
> On 15/10/2015 21:21, Marvin Humphrey wrote:
>> * Remove `new_from_char`.
>
>
> -0.5
>
>> * Remove `BaseX_To_I64`, optionally adding its functionality to
>> StringHelper.
>
>
> +0
I don't feel that strongly about either of these, so I'm amenable to your
plan.
>>> StringIterator
>>>
>>> - Rename StrIter_substring to Str_new_from_iter?
>>
>>
>> Hmm. I see where you're going with that, but new_from_iter doesn't
>> communicate that there must be a single backing String for the two
>> StringIterator arguments. Maybe `StrIter_crop`?
>
> I'm also OK with either StrIter_substring ot StrIter_crop. I only thought it
> might be clearer to make this function a constructor of the String class.
Then how about `Str_crop`? I think that would be OK because it correctly
implies that we're cropping a single source String.
>>> - Make `Starts_With`, `Ends_With` skip over strings?
>>> I think this better matches current usage patterns.
>>
>>
>> Somehow I don't understand what you mean.
>
>
> I mean that `Starts_With` should be named something like `Advance_If_Match`
> and make the iterator move past the prefix if it matches. Currently, most
> code that calls StrIter_Start_With looks like
>
> if (StrIter_Starts_With(iter, prefix)) {
> StrIter_Advance(iter, prefix_len);
> ...
> }
>
> This could be replaced with a simpler and more efficient
>
> if (StrIter_Advance_If_Match(iter, prefix)) {
> ...
> }
How about `Skip_Match`?
if (!StrIter_Skip_Match(iter, prefix)) {
return 0;
}
>>> - Find better names for `Skip_{Next|Prev}_Whitespace`?
>>
>> `Advance_Whitespace` and `Recede_Whitespace`?
>
> As a non-native speaker, I'm not really sure.
I think `Skip_Whitespace` is the best option for forward skipping. I can't
think of a good option for moving backwards past any whitespace before the
current iterator position. It's not a common operation, though, and I can
live with a suboptimal name. `Skip_Prev_Whitespace` might be the clearest,
since it can be contrasted with `Skip_Whitespace` -- and it's less
gramatically awkward than `Recede_Whitespace`.
Marvin Humphrey
Re: [lucy-dev] Clownfish String API
Posted by Nick Wellnhofer <we...@aevum.de>.
On 15/10/2015 21:21, Marvin Humphrey wrote:
> On Wed, Oct 7, 2015 at 4:10 AM, Nick Wellnhofer <we...@aevum.de> wrote:
>> - Make stack strings public?
>
> I'd say wait for now. They're truly an expert API -- it's too easy to blow
> the C stack by using them in a loop, etc.
OK.
>> - Make `Find` return a size_t.
>> Requires special value for "not found".
>
> Hmm, that's a toughie. If the sentinel is SIZE_MAX, that might not fit in all
> host numeric types.
Good point. The problem is not so much the byte size of the return type but
the fact that a host language might not support unsigned integers. Maybe we
should limit string sizes to SSIZE_MAX and make `Find` return an ssize_t. But
this requires to emulate the ssize_t type on non-POSIX platforms.
> In addition, I'd suggest:
>
> * Remove `Str_less_than`, replacing it in PolyLexicon with
> `Str_Compare_To(a, b) < 0`.
> * Remove `Str_compare` from the public API.
+1
> * Remove `new_from_char`.
-0.5
> * Remove `BaseX_To_I64`, optionally adding its functionality to
> StringHelper.
+0
> * Rename the argument to `Ends_With` from `postfix` to `suffix`. Same with
> `Ends_With_Utf8`.
> * The return values for `Trim`, `Trim_Top`, and `Trim_Tail` should be
> `incremented`.
+1
>> StringIterator
>>
>> - Rename StrIter_substring to Str_new_from_iter?
>
> Hmm. I see where you're going with that, but new_from_iter doesn't
> communicate that there must be a single backing String for the two
> StringIterator arguments. Maybe `StrIter_crop`?
I'm also OK with either StrIter_substring ot StrIter_crop. I only thought it
might be clearer to make this function a constructor of the String class.
>> - Make `Starts_With`, `Ends_With` skip over strings?
>> I think this better matches current usage patterns.
>
> Somehow I don't understand what you mean.
I mean that `Starts_With` should be named something like `Advance_If_Match`
and make the iterator move past the prefix if it matches. Currently, most code
that calls StrIter_Start_With looks like
if (StrIter_Starts_With(iter, prefix)) {
StrIter_Advance(iter, prefix_len);
...
}
This could be replaced with a simpler and more efficient
if (StrIter_Advance_If_Match(iter, prefix)) {
...
}
>> - Find better names for `Skip_{Next|Prev}_Whitespace`?
>
> `Advance_Whitespace` and `Recede_Whitespace`?
As a non-native speaker, I'm not really sure.
> While preparing this reply, I also created the patch below with some doc
> tweaks.
Looks good. I will start off with your patch.
Nick
Re: [lucy-dev] Clownfish String API
Posted by Marvin Humphrey <ma...@apache.org>.
Hi, Nick,
Thanks for thinking this through!
On Wed, Oct 7, 2015 at 4:10 AM, Nick Wellnhofer <we...@aevum.de> wrote:
> Lucifers,
>
> here are my condensed thoughts on cleaning up the Clownfish String API:
>
> String
>
> - Rename `new_stack_string` to `init_stack_string`.
> - Make all `new_*`, `init_wrap_*` functions public except:
> - `init_stack_string` (only called from macros)
> - Remove `Cat_Utf8` and `Cat_Trusted_Utf8`?
> Callers should probably use CharBufs instead.
> Only keep `Cat`.
+1
> - Remove underused `Swap_Chars`.
> Maybe add a `Replace` method instead.
+1 to remove Swap_Chars. If you get serious about adding Replace, let's
contemplate the interface.
> - Make stack strings public?
I'd say wait for now. They're truly an expert API -- it's too easy to blow
the C stack by using them in a loop, etc.
> - Add macro to create stack string from null-terminated UTF-8
> without passing the string size:
> String *str = SSTR_WRAP_C("A string.");
> - Make `Code_Point_*` return special value instead of 0
> if outside the string.
+1
> - Make `Find` return a size_t.
> Requires special value for "not found".
Hmm, that's a toughie. If the sentinel is SIZE_MAX, that might not fit in all
host numeric types.
> - Make all methods public except:
> - `Is_Copy_On_IncRef`
> - `Hash_Sum`
In addition, I'd suggest:
* Remove `Str_less_than`, replacing it in PolyLexicon with
`Str_Compare_To(a, b) < 0`.
* Remove `Str_compare` from the public API.
* Remove `new_from_char`.
* Remove `BaseX_To_I64`, optionally adding its functionality to
StringHelper.
* Rename the argument to `Ends_With` from `postfix` to `suffix`. Same with
`Ends_With_Utf8`.
* The return values for `Trim`, `Trim_Top`, and `Trim_Tail` should be
`incremented`.
> StringIterator
>
> - Rename StrIter_substring to Str_new_from_iter?
Hmm. I see where you're going with that, but new_from_iter doesn't
communicate that there must be a single backing String for the two
StringIterator arguments. Maybe `StrIter_crop`?
> - Make `Starts_With`, `Ends_With` skip over strings?
> I think this better matches current usage patterns.
Somehow I don't understand what you mean.
Also, as with String, the argument to Ends_With should be `suffix` rather than
`postfix`.
> - Find better names for `Skip_{Next|Prev}_Whitespace`?
`Advance_Whitespace` and `Recede_Whitespace`?
> - Make all methods public.
+1
...
While preparing this reply, I also created the patch below with some doc
tweaks.
Marvin Humphrey
diff --git a/runtime/core/Clownfish/String.cfh b/runtime/core/Clownfish/String.cfh
index e422cbf..b49a0df 100644
--- a/runtime/core/Clownfish/String.cfh
+++ b/runtime/core/Clownfish/String.cfh
@@ -34,52 +34,52 @@ public final class Clownfish::String nickname Str
size_t size;
String *origin;
- /** Return a new String which holds a copy of the passed-in string.
- * Check for UTF-8 validity.
+ /** Return a String which holds a copy of the supplied UTF-8 character
+ * data after checking for validity.
*/
inert incremented String*
new_from_utf8(const char *utf8, size_t size);
- /** Return a new String which holds a copy of the passed-in string. No
- * validity checking is performed.
+ /** Return a String which holds a copy of the supplied UTF-8 character
+ * data, skipping validity checks.
*/
inert incremented String*
new_from_trusted_utf8(const char *utf8, size_t size);
- /** Initialize the String using the passed-in string. Do not check
- * validity of supplied UTF-8.
+ /** Initialize a String which holds a copy of the supplied UTF-8 character
+ * data, skipping validity checks.
*/
inert String*
init_from_trusted_utf8(String *self, const char *utf8, size_t size);
- /** Return a pointer to a new String which assumes ownership of the
- * passed-in string. Check validity of supplied UTF-8.
+ /** Return a String which assumes ownership of the supplied buffer
+ * containing UTF-8 character data after checking for validity.
*/
inert incremented String*
new_steal_utf8(char *utf8, size_t size);
- /** Return a pointer to a new String which assumes ownership of the
- * passed-in string. Do not check validity of supplied UTF-8.
+ /** Return a String which assumes ownership of the supplied buffer
+ * containing UTF-8 character data, skipping validity checks.
*/
inert incremented String*
new_steal_trusted_utf8(char *utf8, size_t size);
- /** Initialize the String using the passed-in string. Do not check
- * validity of supplied UTF-8.
+ /** Initialize a String which assumes ownership of the supplied buffer
+ * containing UTF-8 character data, skipping validity checks.
*/
public inert String*
init_steal_trusted_utf8(String *self, char *utf8, size_t size);
- /** Return a pointer to a new String which wraps an external buffer
- * containing UTF-8. The buffer must stay unchanged for the lifetime
- * of the String. Check validity of supplied UTF-8.
+ /** Return a String which wraps an external buffer containing UTF-8
+ * character data after checking for validity. The buffer must stay
+ * unchanged for the lifetime of the String.
*/
inert incremented String*
new_wrap_utf8(const char *utf8, size_t size);
- /** Return a pointer to a new String which wraps an external buffer
- * containing UTF-8. The buffer must stay unchanged for the lifetime
- * of the String. Do not check validity of supplied UTF-8.
+ /** Return a String which wraps an external buffer containing UTF-8
+ * character data, skipping validity checks. The buffer must stay
+ * unchanged for the lifetime of the String.
*/
inert incremented String*
new_wrap_trusted_utf8(const char *utf8, size_t size);
@@ -87,8 +87,8 @@ public final class Clownfish::String nickname Str
inert incremented String*
new_stack_string(void *allocation, const char *utf8, size_t size);
- /** Initialize the String which wraps an external buffer containing
- * UTF-8. Do not check validity of supplied UTF-8.
+ /** Initialize a String which wraps an external buffer containing UTF-8
+ * character data after checking for validity.
*/
public inert String*
init_wrap_trusted_utf8(String *self, const char *utf8, size_t size);
@@ -98,8 +98,8 @@ public final class Clownfish::String nickname Str
inert incremented String*
new_from_char(int32_t code_point);
- /** Return a pointer to a new String which contains formatted data
- * expanded according to CB_VCatF.
+ /** Return a String with content expanded from a pattern and arguments
+ * conforming to the spec defined by CharBuf.
*
* Note: a user-supplied `pattern` string is a security hole
* and must not be allowed.
@@ -128,13 +128,14 @@ public final class Clownfish::String nickname Str
incremented String*
Cat(String *self, String *other);
- /** Return the concatenation of the String and the passed-in raw UTF-8.
+ /** Return the concatenation of the String and the supplied UTF-8
+ * character data after checking for validity.
*/
incremented String*
Cat_Utf8(String *self, const char *ptr, size_t size);
- /** Return the concatenation of the String and the passed-in raw UTF-8.
- * Don't check for UTF-8 validity.
+ /** Return the concatenation of the String and the supplied UTF-8
+ * character data, skipping validity checks.
*/
incremented String*
Cat_Trusted_Utf8(String *self, const char *ptr, size_t size);
@@ -157,22 +158,22 @@ public final class Clownfish::String nickname Str
public double
To_F64(String *self);
- /** Test whether the String starts with the content of another.
+ /** Test whether the String starts with `prefix`.
*/
bool
Starts_With(String *self, String *prefix);
- /** Test whether the String starts with the passed-in string.
+ /** Test whether the String starts with `prefix`.
*/
bool
Starts_With_Utf8(String *self, const char *prefix, size_t size);
- /** Test whether the String ends with the content of another.
+ /** Test whether the String ends with `postfix`.
*/
bool
Ends_With(String *self, String *postfix);
- /** Test whether the String ends with the passed-in string.
+ /** Test whether the String ends with `postfix`.
*/
bool
Ends_With_Utf8(String *self, const char *postfix, size_t size);
@@ -186,17 +187,17 @@ public final class Clownfish::String nickname Str
int64_t
Find_Utf8(String *self, const char *ptr, size_t size);
- /** Test whether the String matches the passed-in string.
+ /** Test whether the String matches the supplied UTF-8 character data.
*/
bool
Equals_Utf8(String *self, const char *ptr, size_t size);
- /** Return the number of Unicode code points in the object's string.
+ /** Return the number of Unicode code points the String contains.
*/
size_t
Length(String *self);
- /** Get the String's `size` attribute.
+ /** Return the number of bytes occupied by the String's internal content.
*/
size_t
Get_Size(String *self);
@@ -251,35 +252,33 @@ public final class Clownfish::String nickname Str
String*
Trim_Tail(String *self);
- /** Return the Unicode code point at the specified number of code points
- * in. Return 0 if the string length is exceeded. (XXX It would be
+ /** Return the Unicode code point located `tick` code points in from the
+ * top. Return 0 if out of bounds. (XXX It would be
* better to throw an exception, but that's not practical with UTF-8 and
* no cached length.)
*/
int32_t
Code_Point_At(String *self, size_t tick);
- /** Return the Unicode code point at the specified number of code points
- * counted backwards from the end of the string. Return 0 if outside the
- * string.
+ /** Return the Unicode code point located `tick` code points counting
+ * backwards from the end. Return 0 if out of bounds.
*/
int32_t
Code_Point_From(String *self, size_t tick);
- /** Return a newly allocated String containing a copy of the indicated
- * substring.
+ /** Return a new String containing a copy of the specified substring.
* @param offset Offset from the top, in code points.
* @param len The desired length of the substring, in code points.
*/
incremented String*
SubString(String *self, size_t offset, size_t len);
- /** Return an iterator to the start of the string.
+ /** Return an iterator initialized to the start of the string.
*/
incremented StringIterator*
Top(String *self);
- /** Return an iterator to the end of the string.
+ /** Return an iterator initialized to the end of the string.
*/
incremented StringIterator*
Tail(String *self);
@@ -363,14 +362,12 @@ public final class Clownfish::StringIterator nickname StrIter
public size_t
Skip_Prev_Whitespace(StringIterator *self);
- /** Test whether the content after the iterator starts with
- * `prefix`.
+ /** Test whether the content after the iterator starts with `prefix`.
*/
bool
Starts_With(StringIterator *self, String *prefix);
- /** Test whether the content after the iterator starts with the passed-in
- * string.
+ /** Test whether the content after the iterator starts with `prefix`.
*/
bool
Starts_With_Utf8(StringIterator *self, const char *prefix, size_t size);
@@ -381,8 +378,7 @@ public final class Clownfish::StringIterator nickname StrIter
bool
Ends_With(StringIterator *self, String *postfix);
- /** Test whether the content before the iterator ends with the passed-in
- * string.
+ /** Test whether the content before the iterator ends with `postfix`.
*/
bool
Ends_With_Utf8(StringIterator *self, const char *postfix, size_t size);