Posted to dev@arrow.apache.org by Francois Saint-Jacques <fs...@gmail.com> on 2020/03/13 20:04:02 UTC

[DISCUSS] Field reference ambiguity

Hello,

the recent dataset and compute work has forced us to think about
schema projection. One problem that surfaced is referencing fields in
nested schemas and/or schemas where duplicate column names exist. We
currently have (C++) APIs that take either a vector<int> or a
vector<std::string> to represent a subset of fields; both approaches
pose challenges:

- Referencing a column by index can't access sub-fields of a nested type.
- Referencing a column by name can return more than one field.
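
To make the second point concrete, here is a minimal sketch in plain C++. It does not use any Arrow API: `IndicesOf` and `ExampleColumns` are hypothetical names invented for illustration, and the schema is reduced to a flat list of column names. It only shows that a lookup by name has to be able to return several matches.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (not an Arrow API): return the index of every
// top-level column whose name equals `name`.
std::vector<int> IndicesOf(const std::vector<std::string>& column_names,
                           const std::string& name) {
  std::vector<int> matches;
  for (int i = 0; i < static_cast<int>(column_names.size()); ++i) {
    if (column_names[i] == name) matches.push_back(i);
  }
  return matches;
}

// Illustrative column list in which the name "id" appears twice.
std::vector<std::string> ExampleColumns() {
  return {"id", "value", "id"};
}
```

With these columns, `IndicesOf(ExampleColumns(), "id")` yields two indices, so an API that accepts only names cannot say which "id" is meant. A flat index picks one column unambiguously but has no way to descend into a nested type.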

Thus, Ben drafted a PR [1] to allow referencing fields in a (hopefully)
unambiguous way. It is divided into two concepts:

- FieldPath: A stack of indices pointing into nested structures. It
points to exactly one field, or none if ill-formed. If the depth is
one, it is equivalent to referencing a field by index.
- FieldRef: A friendlier version that supports referencing by names
and/or a tiny string DSL similar to JSONPath. One can "dereference" a
FieldRef into a FieldPath given a schema. Because it supports name
components, a FieldRef can expand to more than one FieldPath.
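
The relationship between the two concepts can be sketched with toy types. This is not the code from the PR: `ToyField`, `ToyPath`, `Get`, `Resolve` and `ExampleSchema` are hypothetical names, and matching a sequence of names level by level is an assumption of this sketch rather than the DSL the PR defines.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-in for a schema field; NOT the Arrow Field class.
struct ToyField {
  std::string name;
  std::string kind;                // e.g. "int64", "struct"
  std::vector<ToyField> children;  // empty for non-nested types
};

// A path is a stack of child indices, one per nesting level.
using ToyPath = std::vector<int>;

// Follow the indices. Returns nullptr if the path is empty or ill-formed.
const ToyField* Get(const std::vector<ToyField>& fields, const ToyPath& path) {
  const std::vector<ToyField>* level = &fields;
  const ToyField* found = nullptr;
  for (int index : path) {
    if (index < 0 || static_cast<std::size_t>(index) >= level->size()) {
      return nullptr;
    }
    found = &(*level)[index];
    level = &found->children;
  }
  return found;
}

// Recursive helper: match names[depth] against the fields at this level.
void ResolveImpl(const std::vector<ToyField>& fields,
                 const std::vector<std::string>& names, std::size_t depth,
                 ToyPath* prefix, std::vector<ToyPath>* out) {
  for (int i = 0; i < static_cast<int>(fields.size()); ++i) {
    if (fields[i].name != names[depth]) continue;
    prefix->push_back(i);
    if (depth + 1 == names.size()) {
      out->push_back(*prefix);  // every name matched: record this path
    } else {
      ResolveImpl(fields[i].children, names, depth + 1, prefix, out);
    }
    prefix->pop_back();
  }
}

// "Dereference" a sequence of names into zero, one, or many index paths.
std::vector<ToyPath> Resolve(const std::vector<ToyField>& fields,
                             const std::vector<std::string>& names) {
  std::vector<ToyPath> out;
  if (names.empty()) return out;
  ToyPath prefix;
  ResolveImpl(fields, names, 0, &prefix, &out);
  return out;
}

// Illustrative schema: a: int64, b: struct<c: float64>, a: string
std::vector<ToyField> ExampleSchema() {
  return {
      {"a", "int64", {}},
      {"b", "struct", {{"c", "float64", {}}}},
      {"a", "string", {}},
  };
}
```

On this schema, resolving the single name "a" gives two paths, one per column called "a", while resolving "b" then "c" gives the single path {1, 0}. An index path always denotes at most one field, which is the property the FieldPath concept relies on.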

We'd like to standardise on this new facility in most C++ APIs that
currently take a vector of indices (or names) to indicate a subset of
columns. For this reason, we'd like feedback on the implementation. I
encourage other language developers to look at this as they'll likely
face the same issues.

Thank you,
François

[1] https://github.com/apache/arrow/pull/6545

Re: [DISCUSS] Field reference ambiguity

Posted by Wes McKinney <we...@gmail.com>.
It seems like there are two common patterns for projection from a record batch:

* Selecting top-level fields by index
* Selecting a collection of column paths.

I'm on board with deprecating std::vector<std::string>-based APIs
since these are a special case of selecting a collection of column
paths that include all children of nested types.

Suppose we have the following schema:

a: int64
b: struct<f0: list<item: string>, f1: float64, f2: struct<f3: int8, f4: binary>>

What would be the proposed syntax for projecting this to

a: int64
b: struct<f0: list<item: string>, f2: struct<f3: int8>>

?

Probably something like

{
  FieldRef("a"),
  FieldRef("b", {FieldRef("f0"), FieldRef("f2", {FieldRef("f3")})})
}
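
For comparison, the same projection can be written as a collection of index paths. The following is a rough sketch with toy types, not the PR's API: `ToyField`, `ToyPath`, `ToString`, `Project` and `ExampleSchema` are hypothetical names, and the rule that a path ending at a nested field keeps all of its children is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-in for a schema field; NOT the Arrow Field class.
struct ToyField {
  std::string name;
  std::string kind;                // e.g. "int64", "struct", "list"
  std::vector<ToyField> children;  // empty for non-nested types
};

using ToyPath = std::vector<int>;  // stack of child indices

// Render a field as "name: kind<child, ...>".
std::string ToString(const ToyField& field) {
  std::string out = field.name + ": " + field.kind;
  if (!field.children.empty()) {
    out += "<";
    for (std::size_t i = 0; i < field.children.size(); ++i) {
      if (i > 0) out += ", ";
      out += ToString(field.children[i]);
    }
    out += ">";
  }
  return out;
}

// Keep only the fields reached by `paths`. A path that ends at a nested
// field keeps that field with all of its children.
std::vector<ToyField> Project(const std::vector<ToyField>& fields,
                              const std::vector<ToyPath>& paths) {
  std::vector<ToyField> out;
  for (int i = 0; i < static_cast<int>(fields.size()); ++i) {
    bool keep_whole = false;
    std::vector<ToyPath> tails;  // remainders of paths going through field i
    for (const ToyPath& path : paths) {
      if (path.empty() || path[0] != i) continue;
      if (path.size() == 1) {
        keep_whole = true;
      } else {
        tails.emplace_back(path.begin() + 1, path.end());
      }
    }
    if (keep_whole) {
      out.push_back(fields[i]);
    } else if (!tails.empty()) {
      // Keep the parent, but recurse to prune its children.
      ToyField pruned{fields[i].name, fields[i].kind,
                      Project(fields[i].children, tails)};
      out.push_back(pruned);
    }
  }
  return out;
}

// The schema from this message, expressed with the toy types.
std::vector<ToyField> ExampleSchema() {
  return {
      {"a", "int64", {}},
      {"b", "struct",
       {{"f0", "list", {{"item", "string", {}}}},
        {"f1", "float64", {}},
        {"f2", "struct", {{"f3", "int8", {}}, {"f4", "binary", {}}}}}},
  };
}
```

Here the requested subset corresponds to the paths {0}, {1, 0} and {1, 2, 0}: "a", "b.f0" with all of its children, and "b.f2.f3". Projecting with those three paths drops f1 and f4 and leaves the two-field schema shown above.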

(I apologize if this is already addressed in the PR, I will certainly
take a closer look)

- Wes
