You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@arrow.apache.org by "Benjamin Kietzman (JIRA)" <ji...@apache.org> on 2018/12/18 16:39:00 UTC

[jira] [Updated] (ARROW-4067) [C++] RFC: standardize ArrayBuilder subclasses

     [ https://issues.apache.org/jira/browse/ARROW-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Kietzman updated ARROW-4067:
-------------------------------------
    Description: 
Each builder supports different and frequently differently named methods for appending. It should be possible to establish a more consistent convention, which would alleviate dev confusion and simplify generics.



For example, let all Builders be required to define at minimum:
 * {{Reserve(int64_t)}}
 * a nested type named {{Scalar}}, which is the canonical scalar appended to this builder. Append other types may be supported for convenience.
 * {{UnsafeAppend(Scalar)}}
 * {{UnsafeAppendNull()}}

The other methods described below can be overridden if an optimization is available or left defaulted (a CRTP helper can contain the default implementations, for example {{Append(Scalar)}} would simply be a call to Reserve then UnsafeAppend.

In addition to their unsafe equivalents, {{Append(Scalar)}} and {{AppendNull()}} should be available for appending without manual capacity maintenance.

It is not necessary for the rest of this RFC, but it would simplify builders further if scalar append methods always had a single argument. For example, this would mean abolishing {{BinaryBuilder::Append(const uint8_t*, int32_t)}} in favor of {{BinaryBuilder::Append(basic_string_view<uint8_t>)}}. There's no runtime overhead involved in this change, and developers who have a pointer and a length instead of a view can just construct one without boilerplate using brace initialization: {code}b->Append({pointer, length});{code}

Unsafe and safe methods should be provided for appending multiple values as well. The default implementation will be a trivial loop but if optimizations are available then this could be overridden (for example instead of copying bits one by one into a BooleanBuilder, bytes could be memcpy'd). Append methods for multiple values should accept two arguments, the first of which contains values and the second of which defines validity. The canonical multiple append method has signature {{Status(array_view<Scalar> values, const uint8_t* valid_bytes)}}, but other overloads and helpers could be provided as well:

{code}
b->Append({{1, 3, 4}}, all_valid); // append values with no nulls
b->Append({{1, 3, 4}}, bool_vector); // use the elements of a vector<bool> for validity
b->Append({{1, 3, 4}}, bits(ptr)); // interpret ptr as a buffer of valid bits, rather than valid bytes
{code}

Builders of nested types currently require developers to write boilerplate wrangling the child builders. This could be alleviated by letting nested builders' append methods return a helper as an output argument:

{code}
ListBuilder::List<Int32Type> lst;
RETURN_NOT_OK(list_builder.Append(&lst)); // ListBuilder::Scalar == ListBuilder::ListBase*
RETURN_NOT_OK(lst->Append(3));
RETURN_NOT_OK(lst->Append(4));

StructBuilder::Struct strct;
RETURN_NOT_OK(struct_builder.Append(&strct));
RETURN_NOT_OK(strct.Set<StringType>(0, "uuid"));
RETURN_NOT_OK(strct.Set<Int32Type>(2, 47));
RETURN_NOT_OK(strct->Finish()); // appends null to unspecified fields
{code}

  was:
Each builder supports different and frequently differently named methods for appending. It should be possible to establish a more consistent convention, which would alleviate dev confusion and simplify generics.

For example, let all Builders be required to define at minimum:
 * {{Reserve(int64_t)}}
 * a nested type named {{Scalar}}, which is the canonical scalar appended to this builder. Append other types may be supported for convenience.
 * {{UnsafeAppend(Scalar)}}
 * {{UnsafeAppendNull()}}

The other methods described below can be overridden if an optimization is available or left defaulted (a CRTP helper can contain the default implementations, for example {{Append(Scalar)}} would simply be a call to Reserve then UnsafeAppend.

In addition to their unsafe equivalents, {{Append(Scalar)}} and {{AppendNull()}} should be available for appending without manual capacity maintenance.

It is not necessary for the rest of this RFC, but it would simplify builders further if scalar append methods always had a single argument. For example, this would mean abolishing {{BinaryBuilder::Append(const uint8_t*, int32_t)}} in favor of {{BinaryBuilder::Append(basic_string_view<uint8_t>)}}. There's no runtime overhead involved in this change, and developers who have a pointer and a length instead of a view can just construct one without boilerplate using brace initialization: {code}b->Append({pointer, length});{code}

Unsafe and safe methods should be provided for appending multiple values as well. The default implementation will be a trivial loop but if optimizations are available then this could be overridden (for example instead of copying bits one by one into a BooleanBuilder, bytes could be memcpy'd). Append methods for multiple values should accept two arguments, the first of which contains values and the second of which defines validity. The canonical multiple append method has signature {{Status(array_view<Scalar> values, const uint8_t* valid_bytes)}}, but other overloads and helpers could be provided as well:

{code}
b->Append({{1, 3, 4}}, all_valid); // append values with no nulls
b->Append({{1, 3, 4}}, bool_vector); // use the elements of a vector<bool> for validity
b->Append({{1, 3, 4}}, bits(ptr)); // interpret ptr as a buffer of valid bits, rather than valid bytes
{code}

Builders of nested types currently require developers to write boilerplate wrangling the child builders. This could be alleviated by letting nested builders' append methods return a helper as an output argument:

{code}
ListBuilder::List<Int32Type> lst;
RETURN_NOT_OK(list_builder.Append(&lst)); // ListBuilder::Scalar == ListBuilder::ListBase*
RETURN_NOT_OK(lst->Append(3));
RETURN_NOT_OK(lst->Append(4));

StructBuilder::Struct strct;
RETURN_NOT_OK(struct_builder.Append(&strct));
RETURN_NOT_OK(strct.Set<StringType>(0, "uuid"));
RETURN_NOT_OK(strct.Set<Int32Type>(2, 47));
RETURN_NOT_OK(strct->Finish()); // appends null to unspecified fields
{code}


> [C++] RFC: standardize ArrayBuilder subclasses
> ----------------------------------------------
>
>                 Key: ARROW-4067
>                 URL: https://issues.apache.org/jira/browse/ARROW-4067
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Benjamin Kietzman
>            Priority: Minor
>              Labels: usability
>
> Each builder supports different and frequently differently named methods for appending. It should be possible to establish a more consistent convention, which would alleviate dev confusion and simplify generics.
> For example, let all Builders be required to define at minimum:
>  * {{Reserve(int64_t)}}
>  * a nested type named {{Scalar}}, which is the canonical scalar appended to this builder. Append other types may be supported for convenience.
>  * {{UnsafeAppend(Scalar)}}
>  * {{UnsafeAppendNull()}}
> The other methods described below can be overridden if an optimization is available or left defaulted (a CRTP helper can contain the default implementations, for example {{Append(Scalar)}} would simply be a call to Reserve then UnsafeAppend.
> In addition to their unsafe equivalents, {{Append(Scalar)}} and {{AppendNull()}} should be available for appending without manual capacity maintenance.
> It is not necessary for the rest of this RFC, but it would simplify builders further if scalar append methods always had a single argument. For example, this would mean abolishing {{BinaryBuilder::Append(const uint8_t*, int32_t)}} in favor of {{BinaryBuilder::Append(basic_string_view<uint8_t>)}}. There's no runtime overhead involved in this change, and developers who have a pointer and a length instead of a view can just construct one without boilerplate using brace initialization: {code}b->Append({pointer, length});{code}
> Unsafe and safe methods should be provided for appending multiple values as well. The default implementation will be a trivial loop but if optimizations are available then this could be overridden (for example instead of copying bits one by one into a BooleanBuilder, bytes could be memcpy'd). Append methods for multiple values should accept two arguments, the first of which contains values and the second of which defines validity. The canonical multiple append method has signature {{Status(array_view<Scalar> values, const uint8_t* valid_bytes)}}, but other overloads and helpers could be provided as well:
> {code}
> b->Append({{1, 3, 4}}, all_valid); // append values with no nulls
> b->Append({{1, 3, 4}}, bool_vector); // use the elements of a vector<bool> for validity
> b->Append({{1, 3, 4}}, bits(ptr)); // interpret ptr as a buffer of valid bits, rather than valid bytes
> {code}
> Builders of nested types currently require developers to write boilerplate wrangling the child builders. This could be alleviated by letting nested builders' append methods return a helper as an output argument:
> {code}
> ListBuilder::List<Int32Type> lst;
> RETURN_NOT_OK(list_builder.Append(&lst)); // ListBuilder::Scalar == ListBuilder::ListBase*
> RETURN_NOT_OK(lst->Append(3));
> RETURN_NOT_OK(lst->Append(4));
> StructBuilder::Struct strct;
> RETURN_NOT_OK(struct_builder.Append(&strct));
> RETURN_NOT_OK(strct.Set<StringType>(0, "uuid"));
> RETURN_NOT_OK(strct.Set<Int32Type>(2, 47));
> RETURN_NOT_OK(strct->Finish()); // appends null to unspecified fields
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)