Posted to dev@arrow.apache.org by Simon Perkins <si...@gmail.com> on 2021/06/08 08:26:49 UTC

Complex Number support in Arrow

Greetings Apache Dev Mailing List

I'm interested in adding complex number support to Arrow. The use case is
Radio Astronomy data, which is represented by complex values.

xref https://issues.apache.org/jira/browse/ARROW-638
xref https://github.com/apache/arrow/pull/10452

It's fairly easy to support Complex Numbers as a Python Extension -- see,
for example, how I've done it here using a list(float{32,64}):

https://github.com/ska-sa/dask-ms/blob/a5bd8538ea3de9fabb8fe74e89c3a75c4043f813/daskms/experimental/arrow/extension_types.py#L144-L173

The above seems to work with the standard NumPy complex memory layout
(consecutive pairs of [real, imag] values) and should work with the C++
std::complex layout. Note that C complex and C++ std::complex should also
have the same layout https://stackoverflow.com/a/10540346.
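
As a rough illustration of that interleaved layout, here is a minimal sketch
using only the Python standard library. It is not the dask-ms or Arrow
implementation: the helper names pack_complex128 and unpack_complex128 are
hypothetical, and little-endian float64 storage is assumed.

```python
import struct


def pack_complex128(values):
    """Pack complex numbers as consecutive [real, imag] float64 pairs."""
    flat = []
    for z in values:
        flat.extend((z.real, z.imag))
    # "<" = little-endian, "d" = float64 (assumed here for illustration)
    return struct.pack(f"<{len(flat)}d", *flat)


def unpack_complex128(buf):
    """Rebuild complex numbers from an interleaved float64 buffer."""
    flat = struct.unpack(f"<{len(buf) // 8}d", buf)
    # Even slots hold real parts, odd slots hold imaginary parts.
    return [complex(re, im) for re, im in zip(flat[0::2], flat[1::2])]


values = [1 + 2j, 3 - 4j]
buf = pack_complex128(values)       # 2 values * 2 floats * 8 bytes
roundtrip = unpack_complex128(buf)
```

Each complex value occupies two adjacent float slots, which is the property
a list-of-floats storage type relies on.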

However, this constrains this representation of Complex Numbers to
dask-ms only. I think that it would be better to add support for this at a
base level in Arrow, especially since this will open up the ability for
other packages to understand the Complex Number Type. For example, it would
be useful to:

   1. Have a clearly defined Pandas -> Arrow -> Parquet -> Arrow -> Pandas
   roundtrip. Currently there's no Pandas -> Arrow conversion for
   np.complex{64, 128}.
   2. Support complex number types in query engines like DataFusion and
   BlazingSQL, if only initially via selection on indexing columns.


I started up a PR in https://github.com/apache/arrow/pull/10452 adding
Complex Numbers as a first-class Arrow type, although I note that
https://issues.apache.org/jira/browse/ARROW-638?focusedCommentId=16912456&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16912456
suggests implementing this as a C++ Extension Type on a first pass. Initial
experiments suggest this is pretty doable -- I've got some test cases
running already.

I have some questions going forward:

   - Adding first class complex types seems to involve modifying
   cpp/src/arrow/ipc/feather.fbs which may change the protocol and introduce
   breaking changes. I'm not sure about this and seek advice on how invasive
   this approach is and whether it's worth pursuing.
   - list(float{32,64}) seems to work fine as an ExtensionType, but I'd
   imagine a struct([real, imag]) might offer more in terms of affordance to
   the user. I'd imagine the underlying memory layout would be the same.
   - I don't have a clear understanding of whether adding either a
   First-Class or ExtensionType involves supporting numeric operations on that
   type (e.g. Complex Exponential, Absolutes, Min or Max operations) or
   whether Arrow is merely concerned with the underlying data representation.

Thanks for considering this.
  Simon Perkins

Re: Complex Number support in Arrow

Posted by Antoine Pitrou <an...@python.org>.
Le 10/06/2021 à 09:20, Simon Perkins a écrit :
> 
> Ah so Arrow Structs are represented as a Struct of Arrays (SoA) vs an Array
> of Structs (AoS)?

If you are not familiar with the Arrow format, I would suggest you start 
by reading https://arrow.apache.org/docs/format/Columnar.html

(see "Struct layout" in particular, but the rest is useful as well)

> I don't immediately see a Packed Struct type. Would this need to be
> implemented?

Not necessarily (*).  But before thinking about implementation, this 
proposal must be accepted into the format.

(*) see 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.h#L1291 
for an example

Regards

Antoine.

Re: Complex Number support in Arrow

Posted by Simon Perkins <si...@gmail.com>.
Hi Micah

> Please see a recent discussion on adding new types [1]
>

Thanks, this is useful.



> My understanding is that feather.fbs is for V1 feather files and probably
> shouldn't be touched.  Only updating schema.fbs should be required and the
> type should be doable in a backwards/forwards compatible way (we've added
> types without bumping the metadata version and are in the process of adding
> more).
>

This is good to know. I'm still getting to know the code base, but will work
from Schema.fbs going forward.



> >    - list(float{32,64}) seems to work fine as an ExtensionType, but I'd
> >    imagine a struct([real, imag]) might offer more in terms of affordance to
> >    the user. I'd imagine the underlying memory layout would be the same.
>
>
> What notation is this using (are 32, 64 meant to be substitution
> parameters)?  I would think FixedSizeList might be more appropriate than
> list.
>

This should read list(float32()) or list(float64()) for a Python/C++
notation.
As you say, fixed_size_list(float32(), 2), fixed_size_list(float64(), 2)
are more appropriate.


> It seems like what we would want for this is a "Packed Struct" type and
> then have an extension type to wrap it. The existing structs in arrow have
> a very different memory layout than lists (the real and imaginary
> components would not be adjacent in memory with Structs).  All the
> representations also have trade-offs on how they would be mapped to parquet
> and the relevant feature set there.
>

Ah so Arrow Structs are represented as a Struct of Arrays (SoA) vs an Array
of Structs (AoS)?
I don't immediately see a Packed Struct type. Would this need to be
implemented?
Alternatively, std::complex<float> and std::complex<double> seem to work and
implicitly provide a Packed Struct.
The base C types "float complex" and "double complex" don't seem to be
accepted by the C++ templating system as template parameters in types.h.
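
To make the SoA/AoS distinction concrete, here is a small sketch in
standard-library Python. It involves no Arrow API, and all names are made up
for illustration. The struct-style layout keeps one buffer per field, while
the packed layout interleaves the fields in a single buffer.

```python
from array import array

values = [1 + 2j, 3 - 4j, 5 + 6j]

# Struct of Arrays (how an Arrow Struct is laid out): one buffer per field.
soa_real = array("d", (z.real for z in values))
soa_imag = array("d", (z.imag for z in values))

# Array of Structs (packed, like std::complex): one interleaved buffer.
aos = array("d")
for z in values:
    aos.extend((z.real, z.imag))

# Reading the real parts: the SoA form already has them contiguous, whereas
# the AoS form needs a strided read that copies every other element.
real_from_aos = aos[0::2]
```

The two layouts hold the same numbers, but only the interleaved one matches
the NumPy and std::complex byte order without a copy.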


> Adding a new first-class type in Arrow requires working integration tests
> between C++ and Java libraries (once the idea is informally agreed upon)
> and then a final vote for approval.  We haven't formalized extension types
> but I imagine a similar cross language requirement would be agreed upon.
> Implementation of computation wouldn't be required for adding a new type.
> Different language bindings have taken different approaches on how much
> additional computational elements are packaged in them.
>

Agreed, Complex Types should be covered by integration tests.

regards,

Simon





Re: Complex Number support in Arrow

Posted by Simon Perkins <si...@gmail.com>.
On Wed, Jun 9, 2021 at 11:25 PM Wes McKinney <we...@gmail.com> wrote:

> I think that having a top-level type for complex numbers would be
> nicer than extension types


Agreed. As Micah mentioned, adding these types doesn't seem to interfere with
any existing protocol, so I'd like to take this approach going forward.



> , so it would look like
>
> table Complex {
>   precision: Precision;
> }
>
> and the representation is a packed tuple of two floating point numbers
> of the indicated precision (I think this is the standard way that
> people do complex numbers, but would be good to know if there are any
> variations out there)
>

I believe that this is the binary representation in C, C++, native Python
and NumPy.
Does Arrow support adapters if the native representations in other
languages (Java/Rust/R/Julia) don't match this binary layout?
Does the C/C++ representation take precedence?


   - Java does not seem to natively support complex numbers:
   https://stackoverflow.com/questions/2997053/does-java-have-a-class-for-complex-numbers.
   There's an Apache Commons class, but the imaginary part may be packed
   before the real:
   https://github.com/apache/commons-numbers/blob/9b67b8e6890a47dcfc26388da2b4ee03758a9a94/commons-numbers-complex/src/main/java/org/apache/commons/numbers/complex/Complex.java#L228-L231
   - Rust does not natively support complex numbers:
   https://users.rust-lang.org/t/complex-number-in-rust-language/41081/3,
   but there's a num_complex crate that supports the C/C++ style packed
   struct:
   https://autumnai.github.io/cuticula/num/complex/struct.Complex.html
   - Julia seems to support the C/C++ style packed struct:
   https://github.com/JuliaLang/julia/blob/f1174888e8b9351a76996db328db13f130c23af8/base/complex.jl#L13-L16
   - I don't know R at all, but I'd imagine its stance towards data is
   Python/Numpy-like. Can anyone provide input here?
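
To show what such an adapter might involve, here is a minimal sketch in
standard-library Python. It assumes, purely for illustration, a producer that
stores each pair as [imag, real]; that ordering has not been confirmed for
any of the libraries above. The function name swap_pair_order is hypothetical.

```python
from array import array


def swap_pair_order(buf):
    """Return a copy of an interleaved buffer with each pair's slots swapped."""
    if len(buf) % 2:
        raise ValueError("expected an even number of floats")
    out = array(buf.typecode, buf)
    # Both right-hand slices are read from buf before out is modified.
    out[0::2], out[1::2] = buf[1::2], buf[0::2]
    return out


# Hypothetical [imag, real] pairs for the values 1+2j and 3-4j.
imag_first = array("d", [2.0, 1.0, -4.0, 3.0])
real_first = swap_pair_order(imag_first)
```

An adapter of this kind costs a full copy, which is one reason agreeing on a
single [real, imag] ordering in the format would be preferable.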

Regards,


Simon




> On Wed, Jun 9, 2021 at 12:56 PM Antoine Pitrou <an...@python.org> wrote:
> >
> >
> > Le 09/06/2021 à 17:52, Micah Kornfield a écrit :
> > >
> > > Adding a new first-class type in Arrow requires working integration
> tests
> > > between C++ and Java libraries (once the idea is informally agreed
> upon)
> > > and then a final vote for approval.  We haven't formalized extension
> types
> > > but I imagine a similar cross language requirement would be agreed
> upon.
> > > Implementation of computation wouldn't be required for adding a new
> type.
> > > Different language bindings have taken different approaches on how much
> > > additional computational elements are packaged in them.
> >
> > While dedicated types are not strictly required, compute functions would
> > be much easier to add for a first-class dedicated complex datatype
> > rather than for an extension type.
> >
> > Since complex numbers are quite common in some domains, and since they
> > are conceptually simply, IMHO it would make sense to add them to the
> > native Arrow datatypes (at least COMPLEX64 and COMPLEX128).
> >
> > Regards
> >
> > Antoine.
>

Re: Complex Number support in Arrow

Posted by Antoine Pitrou <an...@python.org>.
On Wed, 9 Jun 2021 15:34:41 -0700
Micah Kornfield <em...@gmail.com> wrote:
> Hi Antoine,
> In regards to conceptual simplicity, I might have misinterpreted when you
> wrote:
> 
> > Since complex numbers are quite common in some domains, and since they
> > are conceptually simple,
> 
> 
> It seemed like a justification for adding them as a first class type.

Ah, indeed.  I misexpressed myself, then, as I was really thinking
about implementation simplicity.  Sorry.

Regards

Antoine.



Re: Complex Number support in Arrow

Posted by Micah Kornfield <em...@gmail.com>.
Hi Antoine,
In regards to conceptual simplicity, I might have misinterpreted when you
wrote:

> Since complex numbers are quite common in some domains, and since they
> are conceptually simple,


It seemed like a justification for adding them as a first class type.

Thanks,
Micah


On Wed, Jun 9, 2021 at 3:16 PM Antoine Pitrou <an...@python.org> wrote:

>
> Le 10/06/2021 à 00:05, Micah Kornfield a écrit :
> >>
> >> While dedicated types are not strictly required, compute functions would
> >> be much easier to add for a first-class dedicated complex datatype
> >> rather than for an extension type.
> >
> > It seems like maybe this is an area to focus on?  I'm not sure
> conceptually
> > simple is the right criteria to apply here.
>
> Hmm... who talked about conceptual simplicity? I was alluding to
> potentially adding compute kernels to Arrow C++ applying to complex data.
>
> Regards
>
> Antoine.
>

Re: Complex Number support in Arrow

Posted by Antoine Pitrou <an...@python.org>.
Le 10/06/2021 à 00:05, Micah Kornfield a écrit :
>>
>> While dedicated types are not strictly required, compute functions would
>> be much easier to add for a first-class dedicated complex datatype
>> rather than for an extension type.
> 
> It seems like maybe this is an area to focus on?  I'm not sure conceptually
> simple is the right criteria to apply here.

Hmm... who talked about conceptual simplicity? I was alluding to
potentially adding compute kernels to Arrow C++ applying to complex data.

Regards

Antoine.

Re: Complex Number support in Arrow

Posted by Micah Kornfield <em...@gmail.com>.
>
> While dedicated types are not strictly required, compute functions would
> be much easier to add for a first-class dedicated complex datatype
> rather than for an extension type.


It seems like maybe this is an area to focus on?  I'm not sure "conceptually
simple" is the right criterion to apply here.  For instance, complex numbers
appear to be a user-defined type in Postgres [1].

[1] https://www.postgresql.org/docs/9.0/xoper.html

On Wed, Jun 9, 2021 at 2:25 PM Wes McKinney <we...@gmail.com> wrote:

> I think that having a top-level type for complex numbers would be
> nicer than extension types, so it would look like
>
> table Complex {
>   precision: Precision;
> }
>
> and the representation is a packed tuple of two floating point numbers
> of the indicated precision (I think this is the standard way that
> people do complex numbers, but would be good to know if there are any
> variations out there)
>
> On Wed, Jun 9, 2021 at 12:56 PM Antoine Pitrou <an...@python.org> wrote:
> >
> >
> > Le 09/06/2021 à 17:52, Micah Kornfield a écrit :
> > >
> > > Adding a new first-class type in Arrow requires working integration
> tests
> > > between C++ and Java libraries (once the idea is informally agreed
> upon)
> > > and then a final vote for approval.  We haven't formalized extension
> types
> > > but I imagine a similar cross language requirement would be agreed
> upon.
> > > Implementation of computation wouldn't be required for adding a new
> type.
> > > Different language bindings have taken different approaches on how much
> > > additional computational elements are packaged in them.
> >
> > While dedicated types are not strictly required, compute functions would
> > be much easier to add for a first-class dedicated complex datatype
> > rather than for an extension type.
> >
> > Since complex numbers are quite common in some domains, and since they
> > are conceptually simply, IMHO it would make sense to add them to the
> > native Arrow datatypes (at least COMPLEX64 and COMPLEX128).
> >
> > Regards
> >
> > Antoine.
>

Re: Complex Number support in Arrow

Posted by Wes McKinney <we...@gmail.com>.
I think that having a top-level type for complex numbers would be
nicer than extension types, so it would look like

table Complex {
  precision: Precision;
}

and the representation is a packed tuple of two floating point numbers
of the indicated precision (I think this is the standard way that
people do complex numbers, but would be good to know if there are any
variations out there)
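
As a sketch of what "a packed tuple of two floating point numbers of the
indicated precision" means for storage width, the following uses only the
Python standard library. The Precision names are chosen to mirror the
existing floating-point precision enum, and mapping them to struct format
codes is an assumption made for this example; none of this is an Arrow API.

```python
import struct
from enum import Enum


class Precision(Enum):
    # Values are struct format codes, chosen for this illustration only.
    HALF = "e"    # 16-bit float
    SINGLE = "f"  # 32-bit float
    DOUBLE = "d"  # 64-bit float


def complex_byte_width(precision):
    """Bytes per complex value: two floats of the given precision."""
    return struct.calcsize(f"<2{precision.value}")


def pack_complex(z, precision):
    """Pack one complex value as a little-endian (real, imag) pair."""
    return struct.pack(f"<2{precision.value}", z.real, z.imag)


width_single = complex_byte_width(Precision.SINGLE)  # COMPLEX64 analogue
width_double = complex_byte_width(Precision.DOUBLE)  # COMPLEX128 analogue
packed = pack_complex(1 + 2j, Precision.DOUBLE)
```

Under these assumptions a single-precision complex value takes 8 bytes and a
double-precision one takes 16, matching the COMPLEX64 and COMPLEX128 names
used elsewhere in the thread.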

On Wed, Jun 9, 2021 at 12:56 PM Antoine Pitrou <an...@python.org> wrote:
>
>
> Le 09/06/2021 à 17:52, Micah Kornfield a écrit :
> >
> > Adding a new first-class type in Arrow requires working integration tests
> > between C++ and Java libraries (once the idea is informally agreed upon)
> > and then a final vote for approval.  We haven't formalized extension types
> > but I imagine a similar cross language requirement would be agreed upon.
> > Implementation of computation wouldn't be required for adding a new type.
> > Different language bindings have taken different approaches on how much
> > additional computational elements are packaged in them.
>
> While dedicated types are not strictly required, compute functions would
> be much easier to add for a first-class dedicated complex datatype
> rather than for an extension type.
>
> Since complex numbers are quite common in some domains, and since they
> are conceptually simply, IMHO it would make sense to add them to the
> native Arrow datatypes (at least COMPLEX64 and COMPLEX128).
>
> Regards
>
> Antoine.

Re: Complex Number support in Arrow

Posted by Neal Richardson <ne...@gmail.com>.
Thanks Micah. Those criteria seem reasonable (and that discussion was
recent enough that my memory of it should have been sharper). I've created
https://issues.apache.org/jira/browse/ARROW-13055 so that we can document
this decision. IMO we don't need a vote on these criteria--seems like there
was consensus before.

Neal

On Thu, Jun 10, 2021 at 9:52 PM Micah Kornfield <em...@gmail.com>
wrote:

> >
> >  It might help this discussion and future discussions like it if we could
> > define how it is determined whether a type should be part of the Arrow
> > format, an extension type (and what does it mean to say there is a
> > "canonical" extension type), or just something that a language
> > implementation or downstream library builds for itself with metadata. I
> > feel like this has come up before but I don't recall a resolution.
>
>
> There seemed to be  consensus, but I guess we never formally voted on the
> decision points here:
>
> https://lists.apache.org/thread.html/r7ba08aed2809fa64537e6f44bce38b2cf740acbef0e91cfaa7c19767%40%3Cdev.arrow.apache.org%3E
>
> Applying the criteria to complex types:
> 1.  Is the type a new parameterization of an existing type?  No
>
> 2.  Does the type itself have its own specification for processing (e.g.
> JSON, BSON, Thrift, Avro, Protobuf)? No
>
> 3.  Is the underlying encoding of the type already semantically supported
> by a type?  Yes.  Two have been mentioned in this thread and I would also
> support adding a new packed struct type, but it appears that isn't necessary
> for this. Note that FixedSizeLists have some limitations in regards to parquet
> compatibility around nullability, and there might be a few other sharp edges.
>
> So if we use these criteria we would lean towards an extension type.
>
> We never converged on a standard for "canonical" extension types.  I would
> propose it roughly be the same criteria as a first class type:
> 1.  Specification/document update PR that describes the representation
> 2.  Implementation showing working integration tests across two languages
> (for canonical types I think this can be any 2 languages instead of C++ and
> Java)
> 3.  Formal vote accepting the canonical type.
>
> Thanks,
> Micah
>
>
>
> On Thu, Jun 10, 2021 at 9:34 PM Jorge Cardoso Leitão <
> jorgecarleitao@gmail.com> wrote:
>
> > Isn't an array of complexes represented by what Arrow already supports? In
> > particular, I see at least two valid in-memory representations to use,
> > depending on what we are going to do with it:
> >
> > * Struct[re, im]
> > * FixedList[2]
> >
> > In the first case, we have two buffers, [x0, x1, ...] and [y0, y1, ...]; in
> > the second case we have 1 buffer, [x0, y0, x1, y1, ...].
> >
> > The first representation is useful for column-based operations (e.g. taking
> > the real part in case 1 is trivial; it requires a copy in the second case),
> > the second representation is useful for row-based operations (e.g. "take"
> > and "filter" require a single pass over buffer 1). Case 2 does not support
> > Re and Im of different physical types (arguably an issue). Both cases
> > support nullability of individual items or combined.
> >
> > What I conclude is that this does not seem to be a problem about a base
> > in-memory representation, but rather on whether we agree on a
> > representation that justifies adding associated metadata to the spec.
> >
> > The case for the complex interval type recently proposed [1] is more
> > compelling to me because complex ops over intervals usually require all
> > parts of the interval (and thus the "FixedList" representation is more
> > compelling), but each part has a different type. I.e. it is like a
> > "FixedTypedList[int32, int32, int64]", which we do not natively support.
> >
> > [1] https://github.com/apache/arrow/pull/10177
> >
> > Best,
> > Jorge
> >
> >
> >
> > On Fri, Jun 11, 2021 at 1:48 AM Neal Richardson <
> > neal.p.richardson@gmail.com>
> > wrote:
> >
> > >  It might help this discussion and future discussions like it if we
> could
> > > define how it is determined whether a type should be part of the Arrow
> > > format, an extension type (and what does it mean to say there is a
> > > "canonical" extension type), or just something that a language
> > > implementation or downstream library builds for itself with metadata. I
> > > feel like this has come up before but I don't recall a resolution.
> > >
> > > Examples might also help: are there examples of "canonical extension
> > > types"?
> > >
> > > Neal
> > >
> > > On Thu, Jun 10, 2021 at 4:20 PM Micah Kornfield <emkornfield@gmail.com
> >
> > > wrote:
> > >
> > > > >
> > > > > My understanding is that it means having COMPLEX as an entry in the
> > > > > arrow/type_fwd.h Type enum. I agree this would make implementation
> > > > > work in the C++ library much more straightforward.
> > > >
> > > > One idea I proposed would be to do that, and implement the
> > > > > serialization of the complex metadata using Extension types.
> > > >
> > > >
> > > > If this is a maintainable strategy for Canonical types it sounds good
> > to
> > > > me.
> > > >
> > > > On Thu, Jun 10, 2021 at 4:02 PM Wes McKinney <we...@gmail.com>
> > > wrote:
> > > >
> > > > > My understanding is that it means having COMPLEX as an entry in the
> > > > > arrow/type_fwd.h Type enum. I agree this would make implementation
> > > > > work in the C++ library much more straightforward.
> > > > >
> > > > > One idea I proposed would be to do that, and implement the
> > > > > serialization of the complex metadata using Extension types.
> > > > >
> > > > > On Thu, Jun 10, 2021 at 5:47 PM Weston Pace <weston.pace@gmail.com
> >
> > > > wrote:
> > > > > >
> > > > > > > While dedicated types are not strictly required, compute
> > functions
> > > > > would
> > > > > > > be much easier to add for a first-class dedicated complex
> > datatype
> > > > > > > rather than for an extension type.
> > > > > > @pitrou
> > > > > >
> > > > > > This is perhaps a naive question (and admittedly, I'm not up to
> > speed
> > > > > > on my compute kernels) but why is this the case?  For example, if
> > > > > > adding a complex addition kernel it seems we would be talking
> > > about...
> > > > > >
> > > > > > dest_scalar.real = scalar1.real + scalar2.real;
> > > > > > dest_scalar.im = scalar1.im + scalar2.im;
> > > > > >
> > > > > > vs...
> > > > > >
> > > > > > dest_scalar[0] = scalar1[0] + scalar2[0];
> > > > > > dest_scalar[1] = scalar1[1] + scalar2[1];
> > > > > >
> > > > > > On Thu, Jun 10, 2021 at 11:27 AM Wes McKinney <
> wesmckinn@gmail.com
> > >
> > > > > wrote:
> > > > > > >
> > > > > > > I'd be supportive of starting with this as a "canonical"
> > extension
> > > > > > > type so that all implementations are not expected to support
> > > complex
> > > > > > > types — this would encourage us to build sufficient integration
> > > e.g.
> > > > > > > with NumPy to get things working end-to-end with the on-wire
> > > > > > > representation being an extension type. We could certainly
> choose
> > > to
> > > > > > > treat the type as "first class" in the C++ library without it
> > being
> > > > > > > "top level" in the Type union in Flatbuffers.
> > > > > > >
> > > > > > > I agree that the use cases are more specialized, and the fact
> > that
> > > we
> > > > > > > haven't needed it until now (or at least, its absence suggests
> > > this)
> > > > > > > shows that this is the case.
> > > > > > >
> > > > > > > On Thu, Jun 10, 2021 at 4:17 PM Micah Kornfield <
> > > > emkornfield@gmail.com>
> > > > > wrote:
> > > > > > > >
> > > > > > > > >
> > > > > > > > > I'm convinced now that  first-class types seem to be the
> way
> > to
> > > > go
> > > > > and I'm
> > > > > > > > > happy to take this approach.
> > > > > > > >
> > > > > > > > I agree from an implementation effort it is simpler, but I'm
> > > still
> > > > > not
> > > > > > > > convinced that we should be adding this as a first class
> type.
> > > As
> > > > > noted in
> > > > > > > > the survey below it appears Complex numbers are not a core
> > > concept
> > > > > in many
> > > > > > > > general purpose coding languages and it doesn't appear to be
> a
> > > > > common type
> > > > > > > > in SQL systems either.
> > > > > > > >
> > > > > > > > The reason why I am being nit-picky here is I think that
> > having a
> > > > > first
> > > > > > > > class type indicates that it should eventually be supported
> by
> > > all
> > > > > > > > reference implementations.  An "well known" extension type I
> > > think
> > > > > offers
> > > > > > > > less guarantees which makes it seem more suitable for niche
> > > types.
> > > > > > > >
> > > > > > > > > I don't immediately see a Packed Struct type. Would this
> need
> > > to
> > > > be
> > > > > > > > > > implemented?
> > > > > > > > > Not necessarily (*).  But before thinking about
> > implementation,
> > > > > this
> > > > > > > > > proposal must be accepted into the format.
> > > > > > > >
> > > > > > > >
> > > > > > > > Yes, this is a type that has been proposed in the past and I
> > > think
> > > > > handles
> > > > > > > > a lot of  types not yet in Arrow but have been requested
> (e.g.
> > IP
> > > > > > > > Addresses, Geo coordinates), etc.
> > > > > > > >
> > > > > > > > On Thu, Jun 10, 2021 at 1:06 AM Simon Perkins <
> > > > > simon.perkins@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > On Wed, Jun 9, 2021 at 7:56 PM Antoine Pitrou <
> > > > antoine@python.org>
> > > > > wrote:
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Le 09/06/2021 à 17:52, Micah Kornfield a écrit :
> > > > > > > > > > >
> > > > > > > > > > > Adding a new first-class type in Arrow requires working
> > > > > integration
> > > > > > > > > tests
> > > > > > > > > > > between C++ and Java libraries (once the idea is
> > informally
> > > > > agreed
> > > > > > > > > upon)
> > > > > > > > > > > and then a final vote for approval.  We haven't
> > formalized
> > > > > extension
> > > > > > > > > > types
> > > > > > > > > > > but I imagine a similar cross language requirement
> would
> > be
> > > > > agreed
> > > > > > > > > upon.
> > > > > > > > > > > Implementation of computation wouldn't be required for
> > > adding
> > > > > a new
> > > > > > > > > type.
> > > > > > > > > > > Different language bindings have taken different
> > approaches
> > > > on
> > > > > how much
> > > > > > > > > > > additional computational elements are packaged in them.
> > > > > > > > > >
> > > > > > > > > > While dedicated types are not strictly required, compute
> > > > > functions would
> > > > > > > > > > be much easier to add for a first-class dedicated complex
> > > > > datatype
> > > > > > > > > > rather than for an extension type.
> > > > > > > > > >
> > > > > > > > > > Since complex numbers are quite common in some domains,
> and
> > > > > since they
> > > > > > > > > > are conceptually simply, IMHO it would make sense to add
> > them
> > > > to
> > > > > the
> > > > > > > > > > native Arrow datatypes (at least COMPLEX64 and
> COMPLEX128).
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > I'm convinced now that  first-class types seem to be the
> way
> > to
> > > > go
> > > > > and I'm
> > > > > > > > > happy to take this approach.
> > > > > > > > > Regarding compute functions, it looks like the standard set
> > of
> > > > > scalar
> > > > > > > > > arithmetic and reduction functionality
> > > > > > > > > is desirable for complex numbers:
> > > > > > > > > https://arrow.apache.org/docs/cpp/compute.html#
> > > > > > > > > Perhaps it would be better to split the addition of the
> Types
> > > and
> > > > > addition
> > > > > > > > > Compute functionality into separate PRs?
> > > > > > > > >
> > > > > > > > > Regarding the process for managing this PR, it sounds like
> a
> > > > > proposal must
> > > > > > > > > be voted on?
> > > > > > > > > i.e. is this proposal still in this phase
> > > > > > > > >
> > > > >
> > > >
> > >
> >
> http://arrow.apache.org/docs/developers/contributing.html#before-starting
> > > > > > > > > Regards
> > > > > > > > >
> > > > > > > > > Simon
> > > > > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Complex Number support in Arrow

Posted by Simon Perkins <si...@gmail.com>.
On Mon, Jun 21, 2021 at 4:58 PM Antoine Pitrou <an...@python.org> wrote:

>
> I certainly don't think we should have extension types with a different
> type id.  IMHO, it's a recipe for confusion.
>

Thanks, I think I got confused by the different perspectives in the thread.
I'll do some more exploratory coding with the pure ExtensionType, which
should certainly make life easier!



>
> Le 21/06/2021 à 15:54, Simon Perkins a écrit :
> > To put it another way, an Extension Type technically has Type::EXTENSION,
> > but now there's Type::COMPLEX_FLOAT and Type::COMPLEX_DOUBLE.
> >
> > When checking enums, the code sees a Type::COMPLEX_FLOAT and seems to
> > mismatch on ComplexFloatType::Type::type_id, as the latter is
> > Type::EXTENSION?
> >
> > On Mon, Jun 21, 2021 at 3:11 PM Simon Perkins <si...@gmail.com>
> > wrote:
> >
> >> I did some exploratory coding adding Complex Numbers as ExtensionTypes in
> >> this PR: https://github.com/apache/arrow/pull/10565
> >>
> >>> My understanding is that it means having COMPLEX as an entry in the
> >> arrow/type_fwd.h Type enum. I agree this would make implementation
> >> work in the C++ library much more straightforward.
> >>
> >> I implemented this approach, adding COMPLEX_FLOAT and COMPLEX_DOUBLE
> >> entries to the Type enum.
> >> One thing I noted is that at least some portion of the code base
> >> (visitor_inline.h) expects the ExtensionTypes
> >> to have first-class type-like interfaces (i.e. needs a TypeTraits entry,
> >> Visitor definitions).
> >>
> >> My impression at this point, is that the work in implementing a hybrid
> >> approach
> >> (i.e. ExtensionType with a Type enum entry) replicates that of adding a
> >> first-class type.
> >> As I am not extensively familiar with the code base, I thought I'd just
> >> check whether this impression
> >> is correct.
> >>
> >> regards
> >>    Simon
> >>
> >>
> >>
> >>
> >>
> >> On Fri, Jun 11, 2021 at 6:52 AM Micah Kornfield <em...@gmail.com>
> >> wrote:
> >>
> >>>>
> >>>>   It might help this discussion and future discussions like it if we
> >>> could
> >>>> define how it is determined whether a type should be part of the Arrow
> >>>> format, an extension type (and what does it mean to say there is a
> >>>> "canonical" extension type), or just something that a language
> >>>> implementation or downstream library builds for itself with metadata.
> I
> >>>> feel like this has come up before but I don't recall a resolution.
> >>>
> >>>
> >>> There seemed to be  consensus, but I guess we never formally voted on
> the
> >>> decision points here:
> >>>
> >>>
> https://lists.apache.org/thread.html/r7ba08aed2809fa64537e6f44bce38b2cf740acbef0e91cfaa7c19767%40%3Cdev.arrow.apache.org%3E
> >>>
> >>> Applying the criteria to complex types:
> >>> 1.  Is the type a new parameterization of an existing type?  No
> >>>
> >>> 2.  Does the type itself have its own specification for processing
> (e.g.
> >>> JSON, BSON, Thrift, Avro, Protobuf)? No
> >>>
> >>> 3.  Is the underlying encoding of the type already semantically
> supported
> >>> by a type?  Yes.  Two have been mentioned in this thread and I would
> also
> >>> support adding a new packed struct type, but it appears isn't necessary
> >>> for
> >>> this. Note that FixedSizeLists have some limitations in regards to
> parquet
> >>> compatibility around nullability, there might be a few other sharp
> edges.
> >>>
> >>> So if we use this criteria we would lean towards an extension type.
> >>>
> >>> We never converged on a standard for "canonical" extension types.  I
> would
> >>> propose it roughly be the same criteria as a first class type:
> >>> 1.  Specification/document update PR that describes the representation
> >>> 2.  Implementation showing working integration tests across two
> languages
> >>> (for canonical types I think this can be any 2 languages instead of C++
> >>> and
> >>> Java)
> >>> 3.  Formal vote accepting the canonical type.
> >>>
> >>> Thanks,
> >>> Micah
> >>>
> >>>
> >>>
> >>> On Thu, Jun 10, 2021 at 9:34 PM Jorge Cardoso Leitão <
> >>> jorgecarleitao@gmail.com> wrote:
> >>>
> >>>> Isn't an array of complexes represented by what arrow already
> supports?
> >>> In
> >>>> particular, I see at least two valid in-memory representations to use,
> >>> that
> >>>> depend on what we are going to do with it:
> >>>>
> >>>> * Struct[re, im]
> >>>> * FixedList[2]
> >>>>
> >>>> In the first case, we have two buffers, [x0, x1, ...] and [y0, y1,
> >>> ...], in
> >>>> the second case we have 1 buffer, [x0, y0, x1, y1, ...].
> >>>>
> >>>> The first representation is useful for column-based operations (e.g.
> >>> taking
> >>>> the real part in case 1 is trivial; requires a copy in the second
> case),
> >>>> the second representation is useful for row-based operations (e.g.
> "take"
> >>>> and "filter" require a single pass over buffer 1). Case 2 does not
> >>> support
> >>>> Re and Im of different physical types (arguably an issue). Both cases
> >>>> support nullability of individual items or combined.
> >>>>
> >>>> What I conclude is that this does not seem to be a problem about a
> base
> >>>> in-memory representation, but rather on whether we agree on a
> >>>> representation that justifies adding associated metadata to the spec.
> >>>>
> >>>> The case for the complex interval type recently proposed [1] is more
> >>>> compelling to me because a complex ops over intervals usually required
> >>> all
> >>>> parts of the interval (and thus the "FixedList" representation is more
> >>>> compelling), but each part has a different type. I.e. it is like a
> >>>> "FixedTypedList[int32, int32, int64]", which we do not natively
> support.
> >>>>
> >>>> [1] https://github.com/apache/arrow/pull/10177
> >>>>
> >>>> Best,
> >>>> Jorge
> >>>>
> >>>>
> >>>>
> >>>> On Fri, Jun 11, 2021 at 1:48 AM Neal Richardson <
> >>>> neal.p.richardson@gmail.com>
> >>>> wrote:
> >>>>
> >>>>>   It might help this discussion and future discussions like it if we
> >>> could
> >>>>> define how it is determined whether a type should be part of the
> Arrow
> >>>>> format, an extension type (and what does it mean to say there is a
> >>>>> "canonical" extension type), or just something that a language
> >>>>> implementation or downstream library builds for itself with metadata.
> >>> I
> >>>>> feel like this has come up before but I don't recall a resolution.
> >>>>>
> >>>>> Examples might also help: are there examples of "canonical extension
> >>>>> types"?
> >>>>>
> >>>>> Neal
> >>>>>
> >>>>> On Thu, Jun 10, 2021 at 4:20 PM Micah Kornfield <
> >>> emkornfield@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>>>
> >>>>>>> My understanding is that it means having COMPLEX as an entry in
> >>> the
> >>>>>>> arrow/type_fwd.h Type enum. I agree this would make implementation
> >>>>>>> work in the C++ library much more straightforward.
> >>>>>>
> >>>>>> One idea I proposed would be to do that, and implement the
> >>>>>>> serialization of the complex metadata using Extension types.
> >>>>>>
> >>>>>>
> >>>>>> If this is a maintainable strategy for Canonical types it sounds
> >>> good
> >>>> to
> >>>>>> me.
> >>>>>>
> >>>>>> On Thu, Jun 10, 2021 at 4:02 PM Wes McKinney <we...@gmail.com>
> >>>>> wrote:
> >>>>>>
> >>>>>>> My understanding is that it means having COMPLEX as an entry in
> >>> the
> >>>>>>> arrow/type_fwd.h Type enum. I agree this would make implementation
> >>>>>>> work in the C++ library much more straightforward.
> >>>>>>>
> >>>>>>> One idea I proposed would be to do that, and implement the
> >>>>>>> serialization of the complex metadata using Extension types.
> >>>>>>>
> >>>>>>> On Thu, Jun 10, 2021 at 5:47 PM Weston Pace <
> >>> weston.pace@gmail.com>
> >>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> While dedicated types are not strictly required, compute
> >>>> functions
> >>>>>>> would
> >>>>>>>>> be much easier to add for a first-class dedicated complex
> >>>> datatype
> >>>>>>>>> rather than for an extension type.
> >>>>>>>> @pitrou
> >>>>>>>>
> >>>>>>>> This is perhaps a naive question (and admittedly, I'm not up to
> >>>> speed
> >>>>>>>> on my compute kernels) but why is this the case?  For example,
> >>> if
> >>>>>>>> adding a complex addition kernel it seems we would be talking
> >>>>> about...
> >>>>>>>>
> >>>>>>>> dest_scalar.real = scalar1.real + scalar2.real;
> >>>>>>>> dest_scalar.im = scalar1.im + scalar2.im;
> >>>>>>>>
> >>>>>>>> vs...
> >>>>>>>>
> >>>>>>>> dest_scalar[0] = scalar1[0] + scalar2[0];
> >>>>>>>> dest_scalar[1] = scalar1[1] + scalar2[1];
> >>>>>>>>
> >>>>>>>> On Thu, Jun 10, 2021 at 11:27 AM Wes McKinney <
> >>> wesmckinn@gmail.com
> >>>>>
> >>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>> I'd be supportive of starting with this as a "canonical"
> >>>> extension
> >>>>>>>>> type so that all implementations are not expected to support
> >>>>> complex
> >>>>>>>>> types — this would encourage us to build sufficient
> >>> integration
> >>>>> e.g.
> >>>>>>>>> with NumPy to get things working end-to-end with the on-wire
> >>>>>>>>> representation being an extension type. We could certainly
> >>> choose
> >>>>> to
> >>>>>>>>> treat the type as "first class" in the C++ library without it
> >>>> being
> >>>>>>>>> "top level" in the Type union in Flatbuffers.
> >>>>>>>>>
> >>>>>>>>> I agree that the use cases are more specialized, and the fact
> >>>> that
> >>>>> we
> >>>>>>>>> haven't needed it until now (or at least, its absence suggests
> >>>>> this)
> >>>>>>>>> shows that this is the case.
> >>>>>>>>>
> >>>>>>>>> On Thu, Jun 10, 2021 at 4:17 PM Micah Kornfield <
> >>>>>> emkornfield@gmail.com>
> >>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> I'm convinced now that  first-class types seem to be the
> >>> way
> >>>> to
> >>>>>> go
> >>>>>>> and I'm
> >>>>>>>>>>> happy to take this approach.
> >>>>>>>>>>
> >>>>>>>>>> I agree from an implementation effort it is simpler, but I'm
> >>>>> still
> >>>>>>> not
> >>>>>>>>>> convinced that we should be adding this as a first class
> >>> type.
> >>>>> As
> >>>>>>> noted in
> >>>>>>>>>> the survey below it appears Complex numbers are not a core
> >>>>> concept
> >>>>>>> in many
> >>>>>>>>>> general purpose coding languages and it doesn't appear to
> >>> be a
> >>>>>>> common type
> >>>>>>>>>> in SQL systems either.
> >>>>>>>>>>
> >>>>>>>>>> The reason why I am being nit-picky here is I think that
> >>>> having a
> >>>>>>> first
> >>>>>>>>>> class type indicates that it should eventually be supported
> >>> by
> >>>>> all
> >>>>>>>>>> reference implementations.  A "well known" extension type I
> >>>>> think
> >>>>>>> offers
> >>>>>>>>>> less guarantees which makes it seem more suitable for niche
> >>>>> types.
> >>>>>>>>>>
> >>>>>>>>>>> I don't immediately see a Packed Struct type. Would this
> >>> need
> >>>>> to
> >>>>>> be
> >>>>>>>>>>>> implemented?
> >>>>>>>>>>> Not necessarily (*).  But before thinking about
> >>>> implementation,
> >>>>>>> this
> >>>>>>>>>>> proposal must be accepted into the format.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Yes, this is a type that has been proposed in the past and I
> >>>>> think
> >>>>>>> handles
> >>>>>>>>>> a lot of  types not yet in Arrow but have been requested
> >>> (e.g.
> >>>> IP
> >>>>>>>>>> Addresses, Geo coordinates), etc.
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Jun 10, 2021 at 1:06 AM Simon Perkins <
> >>>>>>> simon.perkins@gmail.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> On Wed, Jun 9, 2021 at 7:56 PM Antoine Pitrou <
> >>>>>> antoine@python.org>
> >>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Le 09/06/2021 à 17:52, Micah Kornfield a écrit :
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Adding a new first-class type in Arrow requires
> >>> working
> >>>>>>> integration
> >>>>>>>>>>> tests
> >>>>>>>>>>>>> between C++ and Java libraries (once the idea is
> >>>> informally
> >>>>>>> agreed
> >>>>>>>>>>> upon)
> >>>>>>>>>>>>> and then a final vote for approval.  We haven't
> >>>> formalized
> >>>>>>> extension
> >>>>>>>>>>>> types
> >>>>>>>>>>>>> but I imagine a similar cross language requirement
> >>> would
> >>>> be
> >>>>>>> agreed
> >>>>>>>>>>> upon.
> >>>>>>>>>>>>> Implementation of computation wouldn't be required for
> >>>>> adding
> >>>>>>> a new
> >>>>>>>>>>> type.
> >>>>>>>>>>>>> Different language bindings have taken different
> >>>> approaches
> >>>>>> on
> >>>>>>> how much
> >>>>>>>>>>>>> additional computational elements are packaged in
> >>> them.
> >>>>>>>>>>>>
> >>>>>>>>>>>> While dedicated types are not strictly required, compute
> >>>>>>> functions would
> >>>>>>>>>>>> be much easier to add for a first-class dedicated
> >>> complex
> >>>>>>> datatype
> >>>>>>>>>>>> rather than for an extension type.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Since complex numbers are quite common in some domains,
> >>> and
> >>>>>>> since they
> >>>>>>>>>>>> are conceptually simple, IMHO it would make sense to add
> >>>> them
> >>>>>> to
> >>>>>>> the
> >>>>>>>>>>>> native Arrow datatypes (at least COMPLEX64 and
> >>> COMPLEX128).
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> I'm convinced now that  first-class types seem to be the
> >>> way
> >>>> to
> >>>>>> go
> >>>>>>> and I'm
> >>>>>>>>>>> happy to take this approach.
> >>>>>>>>>>> Regarding compute functions, it looks like the standard
> >>> set
> >>>> of
> >>>>>>> scalar
> >>>>>>>>>>> arithmetic and reduction functionality
> >>>>>>>>>>> is desirable for complex numbers:
> >>>>>>>>>>> https://arrow.apache.org/docs/cpp/compute.html#
> >>>>>>>>>>> Perhaps it would be better to split the addition of the Types and
> >>>>>>>>>>> addition Compute functionality into separate PRs?
> >>>>>>>>>>>
> >>>>>>>>>>> Regarding the process for managing this PR, it sounds like a
> >>>>>>>>>>> proposal must be voted on?
> >>>>>>>>>>> i.e. is this proposal still in this phase
> >>>>>>>>>>> http://arrow.apache.org/docs/developers/contributing.html#before-starting
> >>>>>>>>>>> Regards
> >>>>>>>>>>>
> >>>>>>>>>>> Simon
> >>>>>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
>

Re: Complex Number support in Arrow

Posted by Antoine Pitrou <an...@python.org>.
I certainly don't think we should have extension types with a different 
type id.  IMHO, it's a recipe for confusion.

Regards

Antoine.


Le 21/06/2021 à 15:54, Simon Perkins a écrit :
> To put it another way, an Extension Type technically has Type::EXTENSION,
> but now there's Type::COMPLEX_FLOAT and Type::COMPLEX_DOUBLE.
> 
> When checking enums, the code sees a Type::COMPLEX_FLOAT and seems to
> mismatch on ComplexFloatType::Type::type_id, as the latter is
> Type::EXTENSION?
> 
> On Mon, Jun 21, 2021 at 3:11 PM Simon Perkins <si...@gmail.com>
> wrote:
> 
>> I did some exploratory coding adding Complex Numbers as ExtensionTypes in
>> this PR: https://github.com/apache/arrow/pull/10565
>>
>>> My understanding is that it means having COMPLEX as an entry in the
>> arrow/type_fwd.h Type enum. I agree this would make implementation
>> work in the C++ library much more straightforward.
>>
>> I implemented this approach, adding COMPLEX_FLOAT and COMPLEX_DOUBLE
>> entries to the Type enum.
>> One thing I noted is that at least some portion of the code base
>> (visitor_inline.h) expects the ExtensionTypes
>> to have first-class type-like interfaces (i.e. needs a TypeTraits entry,
>> Visitor definitions).
>>
>> My impression at this point, is that the work in implementing a hybrid
>> approach
>> (i.e. ExtensionType with a Type enum entry) replicates that of adding a
>> first-class type.
>> As I am not extensively familiar with the code base, I thought I'd just
>> check whether this impression
>> is correct.
>>
>> regards
>>    Simon
>>
>>
>>
>>
>>
>> On Fri, Jun 11, 2021 at 6:52 AM Micah Kornfield <em...@gmail.com>
>> wrote:
>>
>>>>
>>>>   It might help this discussion and future discussions like it if we
>>> could
>>>> define how it is determined whether a type should be part of the Arrow
>>>> format, an extension type (and what does it mean to say there is a
>>>> "canonical" extension type), or just something that a language
>>>> implementation or downstream library builds for itself with metadata. I
>>>> feel like this has come up before but I don't recall a resolution.
>>>
>>>
>>> There seemed to be  consensus, but I guess we never formally voted on the
>>> decision points here:
>>>
>>> https://lists.apache.org/thread.html/r7ba08aed2809fa64537e6f44bce38b2cf740acbef0e91cfaa7c19767%40%3Cdev.arrow.apache.org%3E
>>>
>>> Applying the criteria to complex types:
>>> 1.  Is the type a new parameterization of an existing type?  No
>>>
>>> 2.  Does the type itself have its own specification for processing (e.g.
>>> JSON, BSON, Thrift, Avro, Protobuf)? No
>>>
>>> 3.  Is the underlying encoding of the type already semantically supported
>>> by a type?  Yes.  Two have been mentioned in this thread and I would also
>>> support adding a new packed struct type, but it appears isn't necessary
>>> for
>>> this. Note that FixedSizeLists have some limitations in regards to parquet
>>> compatibility around nullability, there might be a few other sharp edges.
>>>
>>> So if we use this criteria we would lean towards an extension type.
>>>
>>> We never converged on a standard for "canonical" extension types.  I would
>>> propose it roughly be the same criteria as a first class type:
>>> 1.  Specification/document update PR that describes the representation
>>> 2.  Implementation showing working integration tests across two languages
>>> (for canonical types I think this can be any 2 languages instead of C++
>>> and
>>> Java)
>>> 3.  Formal vote accepting the canonical type.
>>>
>>> Thanks,
>>> Micah
>>>
>>>
>>>
>>> On Thu, Jun 10, 2021 at 9:34 PM Jorge Cardoso Leitão <
>>> jorgecarleitao@gmail.com> wrote:
>>>
>>>> Isn't an array of complexes represented by what arrow already supports?
>>> In
>>>> particular, I see at least two valid in-memory representations to use,
>>> that
>>>> depend on what we are going to do with it:
>>>>
>>>> * Struct[re, im]
>>>> * FixedList[2]
>>>>
>>>> In the first case, we have two buffers, [x0, x1, ...] and [y0, y1,
>>> ...], in
>>>> the second case we have 1 buffer, [x0, y0, x1, y1, ...].
>>>>
>>>> The first representation is useful for column-based operations (e.g.
>>> taking
>>>> the real part in case 1 is trivial; requires a copy in the second case),
>>>> the second representation is useful for row-based operations (e.g. "take"
>>>> and "filter" require a single pass over buffer 1). Case 2 does not
>>> support
>>>> Re and Im of different physical types (arguably an issue). Both cases
>>>> support nullability of individual items or combined.
>>>>
>>>> What I conclude is that this does not seem to be a problem about a base
>>>> in-memory representation, but rather on whether we agree on a
>>>> representation that justifies adding associated metadata to the spec.
>>>>
>>>> The case for the complex interval type recently proposed [1] is more
>>>> compelling to me because a complex ops over intervals usually required
>>> all
>>>> parts of the interval (and thus the "FixedList" representation is more
>>>> compelling), but each part has a different type. I.e. it is like a
>>>> "FixedTypedList[int32, int32, int64]", which we do not natively support.
>>>>
>>>> [1] https://github.com/apache/arrow/pull/10177
>>>>
>>>> Best,
>>>> Jorge
>>>>
>>>>
>>>>
>>>> On Fri, Jun 11, 2021 at 1:48 AM Neal Richardson <
>>>> neal.p.richardson@gmail.com>
>>>> wrote:
>>>>
>>>>>   It might help this discussion and future discussions like it if we
>>> could
>>>>> define how it is determined whether a type should be part of the Arrow
>>>>> format, an extension type (and what does it mean to say there is a
>>>>> "canonical" extension type), or just something that a language
>>>>> implementation or downstream library builds for itself with metadata.
>>> I
>>>>> feel like this has come up before but I don't recall a resolution.
>>>>>
>>>>> Examples might also help: are there examples of "canonical extension
>>>>> types"?
>>>>>
>>>>> Neal
>>>>>
>>>>> On Thu, Jun 10, 2021 at 4:20 PM Micah Kornfield <
>>> emkornfield@gmail.com>
>>>>> wrote:
>>>>>
>>>>>>>
>>>>>>> My understanding is that it means having COMPLEX as an entry in
>>> the
>>>>>>> arrow/type_fwd.h Type enum. I agree this would make implementation
>>>>>>> work in the C++ library much more straightforward.
>>>>>>
>>>>>> One idea I proposed would be to do that, and implement the
>>>>>>> serialization of the complex metadata using Extension types.
>>>>>>
>>>>>>
>>>>>> If this is a maintainable strategy for Canonical types it sounds
>>> good
>>>> to
>>>>>> me.
>>>>>>
>>>>>> On Thu, Jun 10, 2021 at 4:02 PM Wes McKinney <we...@gmail.com>
>>>>> wrote:
>>>>>>
>>>>>>> My understanding is that it means having COMPLEX as an entry in
>>> the
>>>>>>> arrow/type_fwd.h Type enum. I agree this would make implementation
>>>>>>> work in the C++ library much more straightforward.
>>>>>>>
>>>>>>> One idea I proposed would be to do that, and implement the
>>>>>>> serialization of the complex metadata using Extension types.
>>>>>>>
>>>>>>> On Thu, Jun 10, 2021 at 5:47 PM Weston Pace <
>>> weston.pace@gmail.com>
>>>>>> wrote:
>>>>>>>>
>>>>>>>>> While dedicated types are not strictly required, compute
>>>> functions
>>>>>>> would
>>>>>>>>> be much easier to add for a first-class dedicated complex
>>>> datatype
>>>>>>>>> rather than for an extension type.
>>>>>>>> @pitrou
>>>>>>>>
>>>>>>>> This is perhaps a naive question (and admittedly, I'm not up to
>>>> speed
>>>>>>>> on my compute kernels) but why is this the case?  For example,
>>> if
>>>>>>>> adding a complex addition kernel it seems we would be talking
>>>>> about...
>>>>>>>>
>>>>>>>> dest_scalar.real = scalar1.real + scalar2.real;
>>>>>>>> dest_scalar.im = scalar1.im + scalar2.im;
>>>>>>>>
>>>>>>>> vs...
>>>>>>>>
>>>>>>>> dest_scalar[0] = scalar1[0] + scalar2[0];
>>>>>>>> dest_scalar[1] = scalar1[1] + scalar2[1];
>>>>>>>>
>>>>>>>> On Thu, Jun 10, 2021 at 11:27 AM Wes McKinney <
>>> wesmckinn@gmail.com
>>>>>
>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> I'd be supportive of starting with this as a "canonical"
>>>> extension
>>>>>>>>> type so that all implementations are not expected to support
>>>>> complex
>>>>>>>>> types — this would encourage us to build sufficient
>>> integration
>>>>> e.g.
>>>>>>>>> with NumPy to get things working end-to-end with the on-wire
>>>>>>>>> representation being an extension type. We could certainly
>>> choose
>>>>> to
>>>>>>>>> treat the type as "first class" in the C++ library without it
>>>> being
>>>>>>>>> "top level" in the Type union in Flatbuffers.
>>>>>>>>>
>>>>>>>>> I agree that the use cases are more specialized, and the fact
>>>> that
>>>>> we
>>>>>>>>> haven't needed it until now (or at least, its absence suggests
>>>>> this)
>>>>>>>>> shows that this is the case.
>>>>>>>>>
>>>>>>>>> On Thu, Jun 10, 2021 at 4:17 PM Micah Kornfield <
>>>>>> emkornfield@gmail.com>
>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I'm convinced now that  first-class types seem to be the
>>> way
>>>> to
>>>>>> go
>>>>>>> and I'm
>>>>>>>>>>> happy to take this approach.
>>>>>>>>>>
>>>>>>>>>> I agree from an implementation effort it is simpler, but I'm
>>>>> still
>>>>>>> not
>>>>>>>>>> convinced that we should be adding this as a first class
>>> type.
>>>>> As
>>>>>>> noted in
>>>>>>>>>> the survey below it appears Complex numbers are not a core
>>>>> concept
>>>>>>> in many
>>>>>>>>>> general purpose coding languages and it doesn't appear to
>>> be a
>>>>>>> common type
>>>>>>>>>> in SQL systems either.
>>>>>>>>>>
>>>>>>>>>> The reason why I am being nit-picky here is I think that
>>>> having a
>>>>>>> first
>>>>>>>>>> class type indicates that it should eventually be supported
>>> by
>>>>> all
>>>>>>>>>>> reference implementations.  A "well known" extension type I
>>>>> think
>>>>>>> offers
>>>>>>>>>> less guarantees which makes it seem more suitable for niche
>>>>> types.
>>>>>>>>>>
>>>>>>>>>>> I don't immediately see a Packed Struct type. Would this
>>> need
>>>>> to
>>>>>> be
>>>>>>>>>>>> implemented?
>>>>>>>>>>> Not necessarily (*).  But before thinking about
>>>> implementation,
>>>>>>> this
>>>>>>>>>>> proposal must be accepted into the format.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Yes, this is a type that has been proposed in the past and I
>>>>> think
>>>>>>> handles
>>>>>>>>>> a lot of  types not yet in Arrow but have been requested
>>> (e.g.
>>>> IP
>>>>>>>>>> Addresses, Geo coordinates), etc.
>>>>>>>>>>
>>>>>>>>>> On Thu, Jun 10, 2021 at 1:06 AM Simon Perkins <
>>>>>>> simon.perkins@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> On Wed, Jun 9, 2021 at 7:56 PM Antoine Pitrou <
>>>>>> antoine@python.org>
>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Le 09/06/2021 à 17:52, Micah Kornfield a écrit :
>>>>>>>>>>>>>
>>>>>>>>>>>>> Adding a new first-class type in Arrow requires
>>> working
>>>>>>> integration
>>>>>>>>>>> tests
>>>>>>>>>>>>> between C++ and Java libraries (once the idea is
>>>> informally
>>>>>>> agreed
>>>>>>>>>>> upon)
>>>>>>>>>>>>> and then a final vote for approval.  We haven't
>>>> formalized
>>>>>>> extension
>>>>>>>>>>>> types
>>>>>>>>>>>>> but I imagine a similar cross language requirement
>>> would
>>>> be
>>>>>>> agreed
>>>>>>>>>>> upon.
>>>>>>>>>>>>> Implementation of computation wouldn't be required for
>>>>> adding
>>>>>>> a new
>>>>>>>>>>> type.
>>>>>>>>>>>>> Different language bindings have taken different
>>>> approaches
>>>>>> on
>>>>>>> how much
>>>>>>>>>>>>> additional computational elements are packaged in
>>> them.
>>>>>>>>>>>>
>>>>>>>>>>>> While dedicated types are not strictly required, compute
>>>>>>> functions would
>>>>>>>>>>>> be much easier to add for a first-class dedicated
>>> complex
>>>>>>> datatype
>>>>>>>>>>>> rather than for an extension type.
>>>>>>>>>>>>
>>>>>>>>>>>> Since complex numbers are quite common in some domains,
>>> and
>>>>>>> since they
>>>>>>>>>>>>> are conceptually simple, IMHO it would make sense to add
>>>> them
>>>>>> to
>>>>>>> the
>>>>>>>>>>>> native Arrow datatypes (at least COMPLEX64 and
>>> COMPLEX128).
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I'm convinced now that  first-class types seem to be the
>>> way
>>>> to
>>>>>> go
>>>>>>> and I'm
>>>>>>>>>>> happy to take this approach.
>>>>>>>>>>> Regarding compute functions, it looks like the standard
>>> set
>>>> of
>>>>>>> scalar
>>>>>>>>>>> arithmetic and reduction functionality
>>>>>>>>>>> is desirable for complex numbers:
>>>>>>>>>>> https://arrow.apache.org/docs/cpp/compute.html#
>>>>>>>>>>>> Perhaps it would be better to split the addition of the Types and
>>>>>>>>>>>> addition Compute functionality into separate PRs?
>>>>>>>>>>>>
>>>>>>>>>>>> Regarding the process for managing this PR, it sounds like a
>>>>>>>>>>>> proposal must be voted on?
>>>>>>>>>>>> i.e. is this proposal still in this phase
>>>>>>>>>>>> http://arrow.apache.org/docs/developers/contributing.html#before-starting
>>>>>>>>>>> Regards
>>>>>>>>>>>
>>>>>>>>>>> Simon
>>>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
> 

Re: Complex Number support in Arrow

Posted by Simon Perkins <si...@gmail.com>.
To put it another way, an Extension Type technically has Type::EXTENSION,
but now there's Type::COMPLEX_FLOAT and Type::COMPLEX_DOUBLE.

When checking enums, the code sees a Type::COMPLEX_FLOAT and seems to
mismatch on ComplexFloatType::Type::type_id, as the latter is
Type::EXTENSION?
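
A toy model of the mismatch, in plain Python rather than the Arrow C++ code:
the `Type` enum, the classes and the `dispatch` function below are all
hypothetical stand-ins made up for this illustration, and only mimic the shape
of the problem as I understand it.

```python
from enum import Enum, auto


class Type(Enum):
    """Stand-in for a type-id enum; not Arrow's real enum."""
    FLOAT = auto()
    EXTENSION = auto()
    COMPLEX_FLOAT = auto()  # hypothetical extra entry, as in the hybrid approach


class ExtensionType:
    # Every extension type reports the generic EXTENSION id.
    type_id = Type.EXTENSION


class ComplexFloatType(ExtensionType):
    """Hypothetical complex type implemented as an extension type."""
    name = "complex_float"


def dispatch(type_id):
    """Pick a handler purely from the enum value, as a visitor-style switch would."""
    handlers = {
        Type.FLOAT: "float handler",
        Type.EXTENSION: "generic extension handler",
        Type.COMPLEX_FLOAT: "complex handler",
    }
    return handlers[type_id]


# The class itself still reports EXTENSION, so code that compares it against
# COMPLEX_FLOAT never matches, and dispatch lands on the generic handler.
ids_agree = ComplexFloatType.type_id == Type.COMPLEX_FLOAT
chosen = dispatch(ComplexFloatType.type_id)
```

In this sketch the dedicated "complex handler" is only reachable if something
hands `dispatch` the COMPLEX_FLOAT id directly, which is the two-ids-for-one-type
situation described above.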

On Mon, Jun 21, 2021 at 3:11 PM Simon Perkins <si...@gmail.com>
wrote:

> I did some exploratory coding adding Complex Numbers as ExtensionTypes in
> this PR: https://github.com/apache/arrow/pull/10565
>
> > My understanding is that it means having COMPLEX as an entry in the
> arrow/type_fwd.h Type enum. I agree this would make implementation
> work in the C++ library much more straightforward.
>
> I implemented this approach, adding COMPLEX_FLOAT and COMPLEX_DOUBLE
> entries to the Type enum.
> One thing I noted is that at least some portion of the code base
> (visitor_inline.h) expects the ExtensionTypes
> to have first-class type-like interfaces (i.e. needs a TypeTraits entry,
> Visitor definitions).
>
> My impression at this point, is that the work in implementing a hybrid
> approach
> (i.e. ExtensionType with a Type enum entry) replicates that of adding a
> first-class type.
> As I am not extensively familiar with the code base, I thought I'd just
> check whether this impression
> is correct.
>
> regards
>   Simon
>
>
>
>
>
> On Fri, Jun 11, 2021 at 6:52 AM Micah Kornfield <em...@gmail.com>
> wrote:
>
>> >
>> >  It might help this discussion and future discussions like it if we
>> could
>> > define how it is determined whether a type should be part of the Arrow
>> > format, an extension type (and what does it mean to say there is a
>> > "canonical" extension type), or just something that a language
>> > implementation or downstream library builds for itself with metadata. I
>> > feel like this has come up before but I don't recall a resolution.
>>
>>
>> There seemed to be  consensus, but I guess we never formally voted on the
>> decision points here:
>>
>> https://lists.apache.org/thread.html/r7ba08aed2809fa64537e6f44bce38b2cf740acbef0e91cfaa7c19767%40%3Cdev.arrow.apache.org%3E
>>
>> Applying the criteria to complex types:
>> 1.  Is the type a new parameterization of an existing type?  No
>>
>> 2.  Does the type itself have its own specification for processing (e.g.
>> JSON, BSON, Thrift, Avro, Protobuf)? No
>>
>> 3.  Is the underlying encoding of the type already semantically supported
>> by a type?  Yes.  Two have been mentioned in this thread and I would also
>> support adding a new packed struct type, but it appears isn't necessary
>> for
>> this. Note that FixedSizeLists have some limitations in regards to parquet
>> compatibility around nullability, there might be a few other sharp edges.
>>
>> So if we use this criteria we would lean towards an extension type.
>>
>> We never converged on a standard for "canonical" extension types.  I would
>> propose it roughly be the same criteria as a first class type:
>> 1.  Specification/document update PR that describes the representation
>> 2.  Implementation showing working integration tests across two languages
>> (for canonical types I think this can be any 2 languages instead of C++
>> and
>> Java)
>> 3.  Formal vote accepting the canonical type.
>>
>> Thanks,
>> Micah
>>
>>
>>
>> On Thu, Jun 10, 2021 at 9:34 PM Jorge Cardoso Leitão <
>> jorgecarleitao@gmail.com> wrote:
>>
>> > Isn't an array of complexes represented by what arrow already supports?
>> In
>> > particular, I see at least two valid in-memory representations to use,
>> that
>> > depend on what we are going to do with it:
>> >
>> > * Struct[re, im]
>> > * FixedList[2]
>> >
>> > In the first case, we have two buffers, [x0, x1, ...] and [y0, y1,
>> ...], in
>> > the second case we have 1 buffer, [x0, y0, x1, y1, ...].
>> >
>> > The first representation is useful for column-based operations (e.g.
>> taking
>> > the real part in case 1 is trivial; requires a copy in the second case),
>> > the second representation is useful for row-base operations (e.g. "take"
>> > and "filter" require a single pass over buffer 1). Case 2 does not
>> support
>> > Re and Im of different physical types (arguably an issue). Both cases
>> > support nullability of individual items or combined.
>> >
>> > What I conclude is that this does not seem to be a problem about a base
>> > in-memory representation, but rather on whether we agree on a
>> > representation that justifies adding associated metadata to the spec.
>> >
>> > The case for the complex interval type recently proposed [1] is more
>> > compelling to me because a complex ops over intervals usually required
>> all
>> > parts of the interval (and thus the "FixedList" representation is more
>> > compelling), but each part has a different type. I.e. it is like a
>> > "FixedTypedList[int32, int32, int64]", which we do not natively support.
>> >
>> > [1] https://github.com/apache/arrow/pull/10177
>> >
>> > Best,
>> > Jorge

Re: Complex Number support in Arrow

Posted by Simon Perkins <si...@gmail.com>.
I did some exploratory coding adding Complex Numbers as ExtensionTypes in
this PR: https://github.com/apache/arrow/pull/10565

> My understanding is that it means having COMPLEX as an entry in the
> arrow/type_fwd.h Type enum. I agree this would make implementation
> work in the C++ library much more straightforward.

I implemented this approach, adding COMPLEX_FLOAT and COMPLEX_DOUBLE
entries to the Type enum. One thing I noted is that at least some portion
of the code base (visitor_inline.h) expects such ExtensionTypes to have
first-class type-like interfaces (i.e. they need a TypeTraits entry and
Visitor definitions).

My impression at this point is that the work involved in implementing a
hybrid approach (i.e. an ExtensionType with a Type enum entry) replicates
that of adding a first-class type. As I am not extensively familiar with
the code base, I thought I'd just check whether this impression is correct.
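
To make the concern concrete, here is a toy model in Python. It is not
Arrow's C++ code: TypeId, TYPE_TRAITS, NameVisitor, visit_type_id and
missing_support are all hypothetical names invented for this sketch. It only
assumes that generic dispatch looks up a traits entry and a visit method for
every enum entry, so new enum entries are unusable until both are supplied.

```python
from enum import Enum, auto


class TypeId(Enum):
    """Toy stand-in for a type-id enum; the names are illustrative only."""
    FLOAT = auto()
    DOUBLE = auto()
    EXTENSION = auto()
    COMPLEX_FLOAT = auto()   # entry added by the hybrid approach
    COMPLEX_DOUBLE = auto()  # entry added by the hybrid approach


# Toy stand-in for per-type traits, keyed by enum entry.
TYPE_TRAITS = {
    TypeId.FLOAT: {"byte_width": 4},
    TypeId.DOUBLE: {"byte_width": 8},
    TypeId.EXTENSION: {"byte_width": None},
}


class NameVisitor:
    """Visitor that only knows about the pre-existing types."""

    def visit_float(self, traits):
        return "float"

    def visit_double(self, traits):
        return "double"

    def visit_extension(self, traits):
        return "extension"


def visit_type_id(type_id, visitor):
    """Dispatch on the enum entry, the way a switch over type ids would."""
    traits = TYPE_TRAITS.get(type_id)
    method = getattr(visitor, f"visit_{type_id.name.lower()}", None)
    if traits is None or method is None:
        raise NotImplementedError(f"{type_id.name} lacks traits or a visit method")
    return method(traits)


def missing_support(visitor):
    """List the enum entries that generic dispatch cannot handle yet."""
    missing = []
    for type_id in TypeId:
        try:
            visit_type_id(type_id, visitor)
        except NotImplementedError:
            missing.append(type_id.name)
    return missing
```

In this sketch, missing_support(NameVisitor()) reports the two complex
entries until a traits entry and a visit method are added for each, which is
roughly the same work a first-class type would need.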

regards
  Simon





On Fri, Jun 11, 2021 at 6:52 AM Micah Kornfield <em...@gmail.com>
wrote:

> >
> >  It might help this discussion and future discussions like it if we could
> > define how it is determined whether a type should be part of the Arrow
> > format, an extension type (and what does it mean to say there is a
> > "canonical" extension type), or just something that a language
> > implementation or downstream library builds for itself with metadata. I
> > feel like this has come up before but I don't recall a resolution.
>
>
> There seemed to be consensus, but I guess we never formally voted on the
> decision points here:
>
> https://lists.apache.org/thread.html/r7ba08aed2809fa64537e6f44bce38b2cf740acbef0e91cfaa7c19767%40%3Cdev.arrow.apache.org%3E
>
> Applying the criteria to complex types:
> 1.  Is the type a new parameterization of an existing type?  No
>
> 2.  Does the type itself have its own specification for processing (e.g.
> JSON, BSON, Thrift, Avro, Protobuf)? No
>
> 3.  Is the underlying encoding of the type already semantically supported
> by a type?  Yes.  Two have been mentioned in this thread and I would also
> support adding a new packed struct type, but it appears it isn't necessary
> for this. Note that FixedSizeLists have some limitations with regard to
> Parquet compatibility around nullability; there might be a few other sharp
> edges.
>
> So if we use these criteria we would lean towards an extension type.
>
> We never converged on a standard for "canonical" extension types.  I would
> propose it roughly be the same criteria as a first class type:
> 1.  Specification/document update PR that describes the representation
> 2.  Implementation showing working integration tests across two languages
> (for canonical types I think this can be any 2 languages instead of C++ and
> Java)
> 3.  Formal vote accepting the canonical type.
>
> Thanks,
> Micah

Re: Complex Number support in Arrow

Posted by Micah Kornfield <em...@gmail.com>.
>
>  It might help this discussion and future discussions like it if we could
> define how it is determined whether a type should be part of the Arrow
> format, an extension type (and what does it mean to say there is a
> "canonical" extension type), or just something that a language
> implementation or downstream library builds for itself with metadata. I
> feel like this has come up before but I don't recall a resolution.


There seemed to be consensus, but I guess we never formally voted on the
decision points here:
https://lists.apache.org/thread.html/r7ba08aed2809fa64537e6f44bce38b2cf740acbef0e91cfaa7c19767%40%3Cdev.arrow.apache.org%3E

Applying the criteria to complex types:
1.  Is the type a new parameterization of an existing type?  No

2.  Does the type itself have its own specification for processing (e.g.
JSON, BSON, Thrift, Avro, Protobuf)? No

3.  Is the underlying encoding of the type already semantically supported
by a type?  Yes.  Two have been mentioned in this thread and I would also
support adding a new packed struct type, but it appears it isn't necessary
for this. Note that FixedSizeLists have some limitations with regard to
Parquet compatibility around nullability; there might be a few other sharp
edges.

So if we use these criteria we would lean towards an extension type.

We never converged on a standard for "canonical" extension types.  I would
propose it roughly be the same criteria as a first class type:
1.  Specification/document update PR that describes the representation
2.  Implementation showing working integration tests across two languages
(for canonical types I think this can be any 2 languages instead of C++ and
Java)
3.  Formal vote accepting the canonical type.

Thanks,
Micah



On Thu, Jun 10, 2021 at 9:34 PM Jorge Cardoso Leitão <
jorgecarleitao@gmail.com> wrote:

> Isn't an array of complexes represented by what arrow already supports? In
> particular, I see at least two valid in-memory representations to use, that
> depend on what we are going to do with it:
>
> * Struct[re, im]
> * FixedList[2]
>
> In the first case, we have two buffers, [x0, x1, ...] and [y0, y1, ...], in
> the second case we have 1 buffer, [x0, y0, x1, y1, ...].
>
> The first representation is useful for column-based operations (e.g. taking
> the real part in case 1 is trivial; requires a copy in the second case),
> the second representation is useful for row-based operations (e.g. "take"
> and "filter" require a single pass over buffer 1). Case 2 does not support
> Re and Im of different physical types (arguably an issue). Both cases
> support nullability of individual items or combined.
>
> What I conclude is that this does not seem to be a problem about a base
> in-memory representation, but rather on whether we agree on a
> representation that justifies adding associated metadata to the spec.
>
> The case for the complex interval type recently proposed [1] is more
> compelling to me because complex ops over intervals usually require all
> parts of the interval (and thus the "FixedList" representation is more
> compelling), but each part has a different type. I.e. it is like a
> "FixedTypedList[int32, int32, int64]", which we do not natively support.
>
> [1] https://github.com/apache/arrow/pull/10177
>
> Best,
> Jorge
>
>
>
> On Fri, Jun 11, 2021 at 1:48 AM Neal Richardson <neal.p.richardson@gmail.com>
> wrote:
>
> >  It might help this discussion and future discussions like it if we could
> > define how it is determined whether a type should be part of the Arrow
> > format, an extension type (and what does it mean to say there is a
> > "canonical" extension type), or just something that a language
> > implementation or downstream library builds for itself with metadata. I
> > feel like this has come up before but I don't recall a resolution.
> >
> > Examples might also help: are there examples of "canonical extension
> > types"?
> >
> > Neal
> >
> > On Thu, Jun 10, 2021 at 4:20 PM Micah Kornfield <em...@gmail.com>
> > wrote:
> >
> > > >
> > > > My understanding is that it means having COMPLEX as an entry in the
> > > > arrow/type_fwd.h Type enum. I agree this would make implementation
> > > > work in the C++ library much more straightforward.
> > > >
> > > > One idea I proposed would be to do that, and implement the
> > > > serialization of the complex metadata using Extension types.
> > >
> > >
> > > If this is a maintainable strategy for Canonical types it sounds good
> > > to me.
> > >
> > > On Thu, Jun 10, 2021 at 4:02 PM Wes McKinney <we...@gmail.com> wrote:
> > >
> > > > My understanding is that it means having COMPLEX as an entry in the
> > > > arrow/type_fwd.h Type enum. I agree this would make implementation
> > > > work in the C++ library much more straightforward.
> > > >
> > > > One idea I proposed would be to do that, and implement the
> > > > serialization of the complex metadata using Extension types.
> > > >
> > > > On Thu, Jun 10, 2021 at 5:47 PM Weston Pace <we...@gmail.com> wrote:
> > > > >
> > > > > > While dedicated types are not strictly required, compute functions
> > > > > > would be much easier to add for a first-class dedicated complex
> > > > > > datatype rather than for an extension type.
> > > > > @pitrou
> > > > >
> > > > > This is perhaps a naive question (and admittedly, I'm not up to speed
> > > > > on my compute kernels) but why is this the case?  For example, if
> > > > > adding a complex addition kernel it seems we would be talking about...
> > > > >
> > > > > dest_scalar.real = scalar1.real + scalar2.real;
> > > > > dest_scalar.im = scalar1.im + scalar2.im;
> > > > >
> > > > > vs...
> > > > >
> > > > > dest_scalar[0] = scalar1[0] + scalar2[0];
> > > > > dest_scalar[1] = scalar1[1] + scalar2[1];
> > > > >
> > > > > On Thu, Jun 10, 2021 at 11:27 AM Wes McKinney <wesmckinn@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > I'd be supportive of starting with this as a "canonical" extension
> > > > > > type so that all implementations are not expected to support complex
> > > > > > types -- this would encourage us to build sufficient integration e.g.
> > > > > > with NumPy to get things working end-to-end with the on-wire
> > > > > > representation being an extension type. We could certainly choose to
> > > > > > treat the type as "first class" in the C++ library without it being
> > > > > > "top level" in the Type union in Flatbuffers.
> > > > > >
> > > > > > I agree that the use cases are more specialized, and the fact that we
> > > > > > haven't needed it until now (or at least, its absence suggests this)
> > > > > > shows that this is the case.
> > > > > >
> > > > > > On Thu, Jun 10, 2021 at 4:17 PM Micah Kornfield <emkornfield@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > >
> > > > > > > > I'm convinced now that first-class types seem to be the way to go
> > > > > > > > and I'm happy to take this approach.
> > > > > > >
> > > > > > > I agree from an implementation effort it is simpler, but I'm still
> > > > > > > not convinced that we should be adding this as a first class type.
> > > > > > > As noted in the survey below it appears Complex numbers are not a
> > > > > > > core concept in many general purpose coding languages and it doesn't
> > > > > > > appear to be a common type in SQL systems either.
> > > > > > >
> > > > > > > The reason why I am being nit-picky here is I think that having a
> > > > > > > first class type indicates that it should eventually be supported by
> > > > > > > all reference implementations.  A "well known" extension type I
> > > > > > > think offers fewer guarantees which makes it seem more suitable for
> > > > > > > niche types.
> > > > > > >
> > > > > > > > > > I don't immediately see a Packed Struct type. Would this need
> > > > > > > > > > to be implemented?
> > > > > > > > > Not necessarily (*).  But before thinking about implementation,
> > > > > > > > > this proposal must be accepted into the format.
> > > > > > >
> > > > > > >
> > > > > > > Yes, this is a type that has been proposed in the past and I think
> > > > > > > handles a lot of types not yet in Arrow but have been requested
> > > > > > > (e.g. IP Addresses, Geo coordinates), etc.
> > > > > > >
> > > > > > > On Thu, Jun 10, 2021 at 1:06 AM Simon Perkins <simon.perkins@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > On Wed, Jun 9, 2021 at 7:56 PM Antoine Pitrou <antoine@python.org>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > >
> > > > > > > > > On 09/06/2021 at 17:52, Micah Kornfield wrote:
> > > > > > > > > >
> > > > > > > > > > Adding a new first-class type in Arrow requires working
> > > > > > > > > > integration tests between C++ and Java libraries (once the idea
> > > > > > > > > > is informally agreed upon) and then a final vote for approval.
> > > > > > > > > > We haven't formalized extension types but I imagine a similar
> > > > > > > > > > cross language requirement would be agreed upon.
> > > > > > > > > > Implementation of computation wouldn't be required for adding a
> > > > > > > > > > new type. Different language bindings have taken different
> > > > > > > > > > approaches on how much additional computational elements are
> > > > > > > > > > packaged in them.
> > > > > > > > >
> > > > > > > > > While dedicated types are not strictly required, compute
> > > > > > > > > functions would be much easier to add for a first-class
> > > > > > > > > dedicated complex datatype rather than for an extension type.
> > > > > > > > >
> > > > > > > > > Since complex numbers are quite common in some domains, and
> > > > > > > > > since they are conceptually simple, IMHO it would make sense to
> > > > > > > > > add them to the native Arrow datatypes (at least COMPLEX64 and
> > > > > > > > > COMPLEX128).
> > > > > > > > >
> > > > > > > >
> > > > > > > > I'm convinced now that first-class types seem to be the way to go
> > > > > > > > and I'm happy to take this approach.
> > > > > > > > Regarding compute functions, it looks like the standard set of
> > > > > > > > scalar arithmetic and reduction functionality
> > > > > > > > is desirable for complex numbers:
> > > > > > > > https://arrow.apache.org/docs/cpp/compute.html#
> > > > > > > > Perhaps it would be better to split the addition of the Types and
> > > > > > > > addition Compute functionality into separate PRs?
> > > > > > > >
> > > > > > > > Regarding the process for managing this PR, it sounds like a
> > > > > > > > proposal must be voted on?
> > > > > > > > i.e. is this proposal still in this phase
> > > > > > > > http://arrow.apache.org/docs/developers/contributing.html#before-starting
> > > > > > > > Regards
> > > > > > > >
> > > > > > > > Simon
> > > > > > > >

Re: Complex Number support in Arrow

Posted by Simon Perkins <si...@gmail.com>.
> What I conclude is that this does not seem to be a problem about a base
> in-memory representation, but rather on whether we agree on a
> representation that justifies adding associated metadata to the spec.
>

It would still be desirable to keep the C/C++/NumPy memory layout so that
conversion stays zero-copy. FixedList[2] preserves this interleaved layout,
while a Struct[re, im] does not, because it stores the real and imaginary
parts in separate buffers.
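
As a rough illustration of the two layouts, here is a sketch using only the
Python standard library (no Arrow or NumPy calls). The names values,
interleaved, re_buf, im_buf, view and interleave are made up for this
example, and float32 components are assumed.

```python
import struct

# Three complex values given as (real, imag) pairs.
values = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]

# Interleaved layout: one buffer [re0, im0, re1, im1, ...] of native
# float32 values, i.e. what a FixedList[2] of float32 would hold.
interleaved = struct.pack("6f", *(part for pair in values for part in pair))

# Split layout: one buffer per field, i.e. what a Struct[re, im] would hold.
re_buf = struct.pack("3f", *(re for re, _ in values))
im_buf = struct.pack("3f", *(im for _, im in values))

# The interleaved bytes can be reinterpreted as floats without a copy.
view = memoryview(interleaved).cast("f")


def interleave(re_bytes, im_bytes):
    """Build the interleaved buffer from split buffers; this copies."""
    re = memoryview(re_bytes).cast("f")
    im = memoryview(im_bytes).cast("f")
    parts = [part for pair in zip(re, im) for part in pair]
    return struct.pack(f"{len(parts)}f", *parts)
```

Here view shares memory with interleaved, whereas getting from re_buf and
im_buf to the same bytes requires interleave(), which allocates a new
buffer. That copy is what a Struct[re, im] representation would force on
conversion to or from NumPy-style complex arrays.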


On Fri, Jun 11, 2021 at 6:34 AM Jorge Cardoso Leitão <
jorgecarleitao@gmail.com> wrote:

> Isn't an array of complexes represented by what arrow already supports? In
> particular, I see at least two valid in-memory representations to use, that
> depend on what we are going to do with it:
>
> * Struct[re, im]
> * FixedList[2]
>
> In the first case, we have two buffers, [x0, x1, ...] and [y0, y1, ...], in
> the second case we have 1 buffer, [x0, y0, x1, y1, ...].
>
> The first representation is useful for column-based operations (e.g. taking
> the real part in case 1 is trivial; requires a copy in the second case),
> the second representation is useful for row-base operations (e.g. "take"
> and "filter" require a single pass over buffer 1). Case 2 does not support
> Re and Im of different physical types (arguably an issue). Both cases
> support nullability of individual items or combined.
>
> What I conclude is that this does not seem to be a problem about a base
> in-memory representation, but rather on whether we agree on a
> representation that justifies adding associated metadata to the spec.
>
> The case for the complex interval type recently proposed [1] is more
> compelling to me because a complex ops over intervals usually required all
> parts of the interval (and thus the "FixedList" representation is more
> compelling), but each part has a different type. I.e. it is like a
> "FixedTypedList[int32, int32, int64]", which we do not natively support.
>
> [1] https://github.com/apache/arrow/pull/10177
>
> Best,
> Jorge
>
>
>

Re: Complex Number support in Arrow

Posted by Jorge Cardoso Leitão <jo...@gmail.com>.
Isn't an array of complexes represented by what arrow already supports? In
particular, I see at least two valid in-memory representations to use, that
depend on what we are going to do with it:

* Struct[re, im]
* FixedList[2]

In the first case, we have two buffers, [x0, x1, ...] and [y0, y1, ...], in
the second case we have 1 buffer, [x0, y0, x1, y1, ...].

The first representation is useful for column-based operations (e.g. taking
the real part is trivial in case 1 but requires a copy in case 2), while
the second representation is useful for row-based operations (e.g. "take"
and "filter" require a single pass over the one buffer). Case 2 does not
support Re and Im of different physical types (arguably an issue). Both cases
support nullability of individual items or combined.
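
As a rough illustration of this trade-off, the following C++ sketch models the
two layouts as plain vectors. It is not Arrow code and it ignores validity
bitmaps. The names `ComplexStruct`, `ComplexFixedList`, `real_part` and `take`
are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Case 1, Struct[re, im]: two child buffers.
struct ComplexStruct {
  std::vector<double> re;  // [x0, x1, ...]
  std::vector<double> im;  // [y0, y1, ...]
};

// Case 2, FixedList[2]: one interleaved buffer.
struct ComplexFixedList {
  std::vector<double> values;  // [x0, y0, x1, y1, ...]
};

// Real part in case 1: the child buffer is already the answer, no copy.
inline const std::vector<double>& real_part(const ComplexStruct& a) { return a.re; }

// Real part in case 2: a strided copy over the interleaved buffer.
inline std::vector<double> real_part(const ComplexFixedList& a) {
  std::vector<double> out;
  out.reserve(a.values.size() / 2);
  for (std::size_t i = 0; i + 1 < a.values.size(); i += 2) {
    out.push_back(a.values[i]);
  }
  return out;
}

// "take" in case 2: one pass over one buffer, since both parts of each
// selected element sit next to each other.
inline ComplexFixedList take(const ComplexFixedList& a,
                             const std::vector<std::size_t>& indices) {
  ComplexFixedList out;
  out.values.reserve(indices.size() * 2);
  for (std::size_t i : indices) {
    out.values.push_back(a.values[2 * i]);
    out.values.push_back(a.values[2 * i + 1]);
  }
  return out;
}
```

A "take" on the struct layout would need the same gather applied twice, once
per child buffer.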

What I conclude is that this does not seem to be a problem about a base
in-memory representation, but rather on whether we agree on a
representation that justifies adding associated metadata to the spec.

The case for the complex interval type recently proposed [1] is more
compelling to me because complex operations over intervals usually require
all parts of the interval (and thus the "FixedList" representation is more
compelling), but each part has a different type. I.e. it is like a
"FixedTypedList[int32, int32, int64]", which we do not natively support.
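
One way to picture a "FixedTypedList[int32, int32, int64]" element is as a
fixed-width 16-byte record packed into a single byte buffer. The sketch below
is only an illustration of that idea in standard C++. The struct name, the
field names `a`, `b`, `c`, and the helpers `append_record` and `read_record`
are assumptions and are not taken from the proposal in [1].

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical element with three parts of different physical types.
struct IntervalRecord {
  std::int32_t a;
  std::int32_t b;
  std::int64_t c;
};

// 4 + 4 + 8 bytes per element, stored back to back in one buffer.
constexpr std::size_t kRecordWidth = 16;

// Append one element to the packed buffer, field by field.
inline void append_record(std::vector<std::uint8_t>& buffer, const IntervalRecord& r) {
  const std::size_t offset = buffer.size();
  buffer.resize(offset + kRecordWidth);
  std::memcpy(buffer.data() + offset, &r.a, sizeof(r.a));
  std::memcpy(buffer.data() + offset + 4, &r.b, sizeof(r.b));
  std::memcpy(buffer.data() + offset + 8, &r.c, sizeof(r.c));
}

// Read the element at `index`; all three parts come from adjacent bytes.
inline IntervalRecord read_record(const std::vector<std::uint8_t>& buffer, std::size_t index) {
  IntervalRecord r{};
  const std::uint8_t* p = buffer.data() + index * kRecordWidth;
  std::memcpy(&r.a, p, sizeof(r.a));
  std::memcpy(&r.b, p + 4, sizeof(r.b));
  std::memcpy(&r.c, p + 8, sizeof(r.c));
  return r;
}
```

As with FixedList[2], an operation that needs every part of an element touches
one contiguous region, but here the parts do not share a type.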

[1] https://github.com/apache/arrow/pull/10177

Best,
Jorge



On Fri, Jun 11, 2021 at 1:48 AM Neal Richardson <ne...@gmail.com>
wrote:

>  It might help this discussion and future discussions like it if we could
> define how it is determined whether a type should be part of the Arrow
> format, an extension type (and what does it mean to say there is a
> "canonical" extension type), or just something that a language
> implementation or downstream library builds for itself with metadata. I
> feel like this has come up before but I don't recall a resolution.
>
> Examples might also help: are there examples of "canonical extension
> types"?
>
> Neal
>

Re: Complex Number support in Arrow

Posted by Neal Richardson <ne...@gmail.com>.
 It might help this discussion and future discussions like it if we could
define how it is determined whether a type should be part of the Arrow
format, an extension type (and what does it mean to say there is a
"canonical" extension type), or just something that a language
implementation or downstream library builds for itself with metadata. I
feel like this has come up before but I don't recall a resolution.

Examples might also help: are there examples of "canonical extension types"?

Neal

On Thu, Jun 10, 2021 at 4:20 PM Micah Kornfield <em...@gmail.com>
wrote:

> >
> > My understanding is that it means having COMPLEX as an entry in the
> > arrow/type_fwd.h Type enum. I agree this would make implementation
> > work in the C++ library much more straightforward.
>
> One idea I proposed would be to do that, and implement the
> > serialization of the complex metadata using Extension types.
>
>
> If this is a maintainable strategy for Canonical types it sounds good to
> me.
>
> On Thu, Jun 10, 2021 at 4:02 PM Wes McKinney <we...@gmail.com> wrote:
>
> > My understanding is that it means having COMPLEX as an entry in the
> > arrow/type_fwd.h Type enum. I agree this would make implementation
> > work in the C++ library much more straightforward.
> >
> > One idea I proposed would be to do that, and implement the
> > serialization of the complex metadata using Extension types.
> >
> > On Thu, Jun 10, 2021 at 5:47 PM Weston Pace <we...@gmail.com>
> wrote:
> > >
> > > > While dedicated types are not strictly required, compute functions
> > would
> > > > be much easier to add for a first-class dedicated complex datatype
> > > > rather than for an extension type.
> > > @pitrou
> > >
> > > This is perhaps a naive question (and admittedly, I'm not up to speed
> > > on my compute kernels) but why is this the case?  For example, if
> > > adding a complex addition kernel it seems we would be talking about...
> > >
> > > dest_scalar.real = scalar1.real + scalar2.real;
> > > dest_scalar.im = scalar1.im + scalar2.im;
> > >
> > > vs...
> > >
> > > dest_scalar[0] = scalar1[0] + scalar2[0];
> > > dest_scalar[1] = scalar1[1] + scalar2[1];
> > >
> > > On Thu, Jun 10, 2021 at 11:27 AM Wes McKinney <we...@gmail.com>
> > wrote:
> > > >
> > > > I'd be supportive of starting with this as a "canonical" extension
> > > > type so that all implementations are not expected to support complex
> > > > types — this would encourage us to build sufficient integration e.g.
> > > > with NumPy to get things working end-to-end with the on-wire
> > > > representation being an extension type. We could certainly choose to
> > > > treat the type as "first class" in the C++ library without it being
> > > > "top level" in the Type union in Flatbuffers.
> > > >
> > > > I agree that the use cases are more specialized, and the fact that we
> > > > haven't needed it until now (or at least, its absence suggests this)
> > > > shows that this is the case.
> > > >
> > > > On Thu, Jun 10, 2021 at 4:17 PM Micah Kornfield <
> emkornfield@gmail.com>
> > wrote:
> > > > >
> > > > > >
> > > > > > I'm convinced now that  first-class types seem to be the way to
> go
> > and I'm
> > > > > > happy to take this approach.
> > > > >
> > > > > I agree from an implementation effort it is simpler, but I'm still
> > not
> > > > > convinced that we should be adding this as a first class type.  As
> > noted in
> > > > > the survey below it appears Complex numbers are not a core concept
> > in many
> > > > > general purpose coding languages and it doesn't appear to be a
> > common type
> > > > > in SQL systems either.
> > > > >
> > > > > The reason why I am being nit-picky here is I think that having a
> > first
> > > > > class type indicates that it should eventually be supported by all
> > > > > reference implementations.  An "well known" extension type I think
> > offers
> > > > > less guarantees which makes it seem more suitable for niche types.
> > > > >
> > > > > > I don't immediately see a Packed Struct type. Would this need to
> be
> > > > > > > implemented?
> > > > > > Not necessarily (*).  But before thinking about implementation,
> > this
> > > > > > proposal must be accepted into the format.
> > > > >
> > > > >
> > > > > Yes, this is a type that has been proposed in the past and I think
> > handles
> > > > > a lot of  types not yet in Arrow but have been requested (e.g. IP
> > > > > Addresses, Geo coordinates), etc.
> > > > >
> > > > > On Thu, Jun 10, 2021 at 1:06 AM Simon Perkins <
> > simon.perkins@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > On Wed, Jun 9, 2021 at 7:56 PM Antoine Pitrou <
> antoine@python.org>
> > wrote:
> > > > > >
> > > > > > >
> > > > > > > Le 09/06/2021 à 17:52, Micah Kornfield a écrit :
> > > > > > > >
> > > > > > > > Adding a new first-class type in Arrow requires working
> > integration
> > > > > > tests
> > > > > > > > between C++ and Java libraries (once the idea is informally
> > agreed
> > > > > > upon)
> > > > > > > > and then a final vote for approval.  We haven't formalized
> > extension
> > > > > > > types
> > > > > > > > but I imagine a similar cross language requirement would be
> > agreed
> > > > > > upon.
> > > > > > > > Implementation of computation wouldn't be required for adding
> > a new
> > > > > > type.
> > > > > > > > Different language bindings have taken different approaches
> on
> > how much
> > > > > > > > additional computational elements are packaged in them.
> > > > > > >
> > > > > > > While dedicated types are not strictly required, compute
> > functions would
> > > > > > > be much easier to add for a first-class dedicated complex
> > datatype
> > > > > > > rather than for an extension type.
> > > > > > >
> > > > > > > Since complex numbers are quite common in some domains, and
> > since they
> > > > > > > are conceptually simply, IMHO it would make sense to add them
> to
> > the
> > > > > > > native Arrow datatypes (at least COMPLEX64 and COMPLEX128).
> > > > > > >
> > > > > >
> > > > > > I'm convinced now that  first-class types seem to be the way to
> go
> > and I'm
> > > > > > happy to take this approach.
> > > > > > Regarding compute functions, it looks like the standard set of
> > scalar
> > > > > > arithmetic and reduction functionality
> > > > > > is desirable for complex numbers:
> > > > > > https://arrow.apache.org/docs/cpp/compute.html#
> > > > > > Perhaps it would be better to split the addition of the Types and
> > addition
> > > > > > Compute functionality into separate PRs?
> > > > > >
> > > > > > Regarding the process for managing this PR, it sounds like a
> > proposal must
> > > > > > be voted on?
> > > > > > i.e. is this proposal still in this phase
> > > > > >
> >
> http://arrow.apache.org/docs/developers/contributing.html#before-starting
> > > > > > Regards
> > > > > >
> > > > > > Simon
> > > > > >
> >
>

Re: Complex Number support in Arrow

Posted by Micah Kornfield <em...@gmail.com>.
>
> It would still be desirable to maintain the memory layout of C/C++/NumPy to
> maintain zero-copy.
> FixedList[2] maintains this layout, while a Struct[re, im] does not.


I noted this before, but there are some gaps in Parquet support for
FixedSizeList around null handling.  Just something to be aware of if this
is a use case that we think will be encountered.

On Mon, Jun 14, 2021 at 2:02 AM Antoine Pitrou <an...@python.org> wrote:

>
> Le 14/06/2021 à 10:54, Simon Perkins a écrit :
> >   > The reason why I am being nit-picky here is I think that having a
> first
> > class type indicates that it should eventually be supported by all
> > reference implementations.  An "well known" extension type I think offers
> > less guarantees which makes it seem more suitable for niche types.
> >
> > What are the requirements imposed on downstream projects by adding new
> types
> > such as Complex Numbers and Intervals? Hypothetically, does a new
> > first-class
> > type impose a requirement to provide full support for it downstream?
>
> This is not a requirement for downstream projects (for example a
> dataframe implementation or a database connector), but for Arrow
> implementations, that is (mostly) the code that lives in the Apache
> Arrow repository.  There are a number of them:
> https://arrow.apache.org/docs/status.html
>
> > Or, does adding a type simply involve exposing a new Arrow Type (the
> > representation)
> > in the respective language (C++/Java/Rust) that downstream projects may
> > choose to support or ignore?
>
> Yes.
>
> > Java/Rust may not have a native Complex Type for example, but this isn't
> > Arrow's responsibility -- it simply provides it's own
> > type that the language/project should interpret.
>
> Having a native complex type isn't a requirement. A complex type is
> trivially expressed in (almost?) any language as a struct, anyway.
>
> Regards
>
> Antoine.
>

Re: Complex Number support in Arrow

Posted by Antoine Pitrou <an...@python.org>.
Le 14/06/2021 à 10:54, Simon Perkins a écrit :
>   > The reason why I am being nit-picky here is I think that having a first
> class type indicates that it should eventually be supported by all
> reference implementations.  An "well known" extension type I think offers
> less guarantees which makes it seem more suitable for niche types.
> 
> What are the requirements imposed on downstream projects by adding new types
> such as Complex Numbers and Intervals? Hypothetically, does a new
> first-class
> type impose a requirement to provide full support for it downstream?

This is not a requirement for downstream projects (for example a 
dataframe implementation or a database connector), but for Arrow 
implementations, that is (mostly) the code that lives in the Apache 
Arrow repository.  There are a number of them:
https://arrow.apache.org/docs/status.html

> Or, does adding a type simply involve exposing a new Arrow Type (the
> representation)
> in the respective language (C++/Java/Rust) that downstream projects may
> choose to support or ignore?

Yes.

> Java/Rust may not have a native Complex Type for example, but this isn't
> Arrow's responsibility -- it simply provides it's own
> type that the language/project should interpret.

Having a native complex type isn't a requirement. A complex type is 
trivially expressed in (almost?) any language as a struct, anyway.

Regards

Antoine.

Re: Complex Number support in Arrow

Posted by Simon Perkins <si...@gmail.com>.
> The reason why I am being nit-picky here is I think that having a first
> class type indicates that it should eventually be supported by all
> reference implementations.  An "well known" extension type I think offers
> less guarantees which makes it seem more suitable for niche types.

What are the requirements imposed on downstream projects by adding new types
such as Complex Numbers and Intervals? Hypothetically, does a new first-class
type impose a requirement to provide full support for it downstream?
In other words, does full support include an understanding of both the
representation (i.e. an Arrow Type) *and* expressions on the representation?
That does seem onerous.

Or does adding a type simply involve exposing a new Arrow Type (the
representation) in the respective language (C++/Java/Rust) that downstream
projects may choose to support or ignore?
Java/Rust may not have a native Complex Type, for example, but this isn't
Arrow's responsibility -- it simply provides its own type that the
language/project should interpret.
For example, cuDF [1] performs a switch on the Arrow types and fails when
encountering a type it doesn't understand (including extension types).

[1]:
https://github.com/rapidsai/cudf/blob/306ae4ffe584fdf50114875f64ba552f496e13fa/cpp/src/interop/from_arrow.cu#L41-L87
Practically speaking, taking cuDF as an example, the handling might change
as follows:

switch (arrow_type.id()) {
   case arrow::Type::FLOAT:
       ...
       break;
   ...
   case arrow::Type::EXTENSION: {
       // C++ cannot switch on strings, so compare the extension name instead.
       const std::string name =
           static_cast<const arrow::ExtensionType&>(arrow_type).extension_name();

       if (name == "complex_float") {
           ....
       } else if (name == "complex_double") {
           ....
       } else {
           CUDF_FAIL("Unsupported Extension Type");
       }
       break;
   }
   default:
       CUDF_FAIL("Unsupported Type");
}

Thus, practically speaking, the difference between handling a First-Class
Type and an Extension Type is a second level of dispatch on the extension name.
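
A self-contained sketch of this two-level dispatch follows. It uses a mock
`TypeId` enum and `MockType` struct in place of the real Arrow classes, and
exceptions in place of CUDF_FAIL. All of these names, and the `describe`
function, are hypothetical stand-ins. Only the extension names "complex_float"
and "complex_double" come from the example above.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Mock of a type-id enum; not the real arrow::Type.
enum class TypeId { FLOAT, DOUBLE, EXTENSION, OTHER };

// Mock of a data type carrying an id and, for extensions, a name.
struct MockType {
  TypeId id;
  std::string extension_name;  // only meaningful when id == TypeId::EXTENSION
};

// First level: switch on the type id.
// Second level (extensions only): compare the extension name.
inline std::string describe(const MockType& type) {
  switch (type.id) {
    case TypeId::FLOAT:
      return "float32";
    case TypeId::DOUBLE:
      return "float64";
    case TypeId::EXTENSION:
      if (type.extension_name == "complex_float") return "complex64";
      if (type.extension_name == "complex_double") return "complex128";
      throw std::invalid_argument("Unsupported Extension Type: " + type.extension_name);
    default:
      throw std::invalid_argument("Unsupported Type");
  }
}
```

With a first-class type, the two complex cases would simply be two more labels
in the outer switch.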

> > > We could certainly choose to treat the type as "first class" in the
> > > C++ library without it being "top level" in the Type union in
> > > Flatbuffers.

> > My understanding is that it means having COMPLEX as an entry in the
> > arrow/type_fwd.h Type enum. I agree this would make implementation
> > work in the C++ library much more straightforward.

> > One idea I proposed would be to do that, and implement the
> > serialization of the complex metadata using Extension types.

> If this is a maintainable strategy for Canonical types it sounds good to
> me.

Based on the example above, handling of Canonical Extension Types
will add an extra layer of indirection to the type identification logic.
Are downstream projects simply able to fail on or ignore first-class types
they don't support in any case?

I think what's not clear to me is the contract between the Arrow API and
downstream projects that use the API. Are downstream projects obligated
to respect all first-class types?

Simon




On Fri, Jun 11, 2021 at 1:20 AM Micah Kornfield <em...@gmail.com>
wrote:

> > My understanding is that it means having COMPLEX as an entry in the
> > arrow/type_fwd.h Type enum. I agree this would make implementation
> > work in the C++ library much more straightforward.
> >
> > One idea I proposed would be to do that, and implement the
> > serialization of the complex metadata using Extension types.
>
> If this is a maintainable strategy for Canonical types it sounds good to
> me.
>
> On Thu, Jun 10, 2021 at 4:02 PM Wes McKinney <we...@gmail.com> wrote:
>
> > My understanding is that it means having COMPLEX as an entry in the
> > arrow/type_fwd.h Type enum. I agree this would make implementation
> > work in the C++ library much more straightforward.
> >
> > One idea I proposed would be to do that, and implement the
> > serialization of the complex metadata using Extension types.
> >
> > On Thu, Jun 10, 2021 at 5:47 PM Weston Pace <we...@gmail.com> wrote:
> > >
> > > > While dedicated types are not strictly required, compute functions would
> > > > be much easier to add for a first-class dedicated complex datatype
> > > > rather than for an extension type.
> > > @pitrou
> > >
> > > This is perhaps a naive question (and admittedly, I'm not up to speed
> > > on my compute kernels) but why is this the case?  For example, if
> > > adding a complex addition kernel it seems we would be talking about...
> > >
> > > dest_scalar.real = scalar1.real + scalar2.real;
> > > dest_scalar.im = scalar1.im + scalar2.im;
> > >
> > > vs...
> > >
> > > dest_scalar[0] = scalar1[0] + scalar2[0];
> > > dest_scalar[1] = scalar1[1] + scalar2[1];
> > >
> > > On Thu, Jun 10, 2021 at 11:27 AM Wes McKinney <we...@gmail.com> wrote:
> > > >
> > > > I'd be supportive of starting with this as a "canonical" extension
> > > > type so that all implementations are not expected to support complex
> > > > types -- this would encourage us to build sufficient integration e.g.
> > > > with NumPy to get things working end-to-end with the on-wire
> > > > representation being an extension type. We could certainly choose to
> > > > treat the type as "first class" in the C++ library without it being
> > > > "top level" in the Type union in Flatbuffers.
> > > >
> > > > I agree that the use cases are more specialized, and the fact that we
> > > > haven't needed it until now (or at least, its absence suggests this)
> > > > shows that this is the case.
> > > >
> > > > On Thu, Jun 10, 2021 at 4:17 PM Micah Kornfield <emkornfield@gmail.com> wrote:
> > > > >
> > > > > > I'm convinced now that first-class types seem to be the way to go and I'm
> > > > > > happy to take this approach.
> > > > >
> > > > > I agree from an implementation effort it is simpler, but I'm still not
> > > > > convinced that we should be adding this as a first class type.  As noted in
> > > > > the survey below it appears Complex numbers are not a core concept in many
> > > > > general purpose coding languages and it doesn't appear to be a common type
> > > > > in SQL systems either.
> > > > >
> > > > > The reason why I am being nit-picky here is I think that having a first
> > > > > class type indicates that it should eventually be supported by all
> > > > > reference implementations.  A "well known" extension type I think offers
> > > > > fewer guarantees which makes it seem more suitable for niche types.
> > > > >
> > > > > > > I don't immediately see a Packed Struct type. Would this need to be
> > > > > > > implemented?
> > > > > > Not necessarily (*).  But before thinking about implementation, this
> > > > > > proposal must be accepted into the format.
> > > > >
> > > > > Yes, this is a type that has been proposed in the past and I think handles
> > > > > a lot of types that are not yet in Arrow but have been requested (e.g. IP
> > > > > addresses, geo coordinates, etc.).
> > > > >
> > > > > On Thu, Jun 10, 2021 at 1:06 AM Simon Perkins <simon.perkins@gmail.com> wrote:
> > > > >
> > > > > > On Wed, Jun 9, 2021 at 7:56 PM Antoine Pitrou <antoine@python.org> wrote:
> > > > > >
> > > > > > > On 09/06/2021 at 17:52, Micah Kornfield wrote:
> > > > > > > >
> > > > > > > > Adding a new first-class type in Arrow requires working integration tests
> > > > > > > > between C++ and Java libraries (once the idea is informally agreed upon)
> > > > > > > > and then a final vote for approval.  We haven't formalized extension types
> > > > > > > > but I imagine a similar cross language requirement would be agreed upon.
> > > > > > > > Implementation of computation wouldn't be required for adding a new type.
> > > > > > > > Different language bindings have taken different approaches on how much
> > > > > > > > additional computational elements are packaged in them.
> > > > > > >
> > > > > > > While dedicated types are not strictly required, compute functions would
> > > > > > > be much easier to add for a first-class dedicated complex datatype
> > > > > > > rather than for an extension type.
> > > > > > >
> > > > > > > Since complex numbers are quite common in some domains, and since they
> > > > > > > are conceptually simple, IMHO it would make sense to add them to the
> > > > > > > native Arrow datatypes (at least COMPLEX64 and COMPLEX128).
> > > > > >
> > > > > > I'm convinced now that first-class types seem to be the way to go and I'm
> > > > > > happy to take this approach.
> > > > > > Regarding compute functions, it looks like the standard set of scalar
> > > > > > arithmetic and reduction functionality
> > > > > > is desirable for complex numbers:
> > > > > > https://arrow.apache.org/docs/cpp/compute.html#
> > > > > > Perhaps it would be better to split the addition of the Types and the
> > > > > > addition of Compute functionality into separate PRs?
> > > > > >
> > > > > > Regarding the process for managing this PR, it sounds like a proposal must
> > > > > > be voted on?
> > > > > > i.e. is this proposal still in this phase?
> > > > > > http://arrow.apache.org/docs/developers/contributing.html#before-starting
> > > > > > Regards
> > > > > >
> > > > > > Simon

Re: Complex Number support in Arrow

Posted by Micah Kornfield <em...@gmail.com>.
> My understanding is that it means having COMPLEX as an entry in the
> arrow/type_fwd.h Type enum. I agree this would make implementation
> work in the C++ library much more straightforward.
>
> One idea I proposed would be to do that, and implement the
> serialization of the complex metadata using Extension types.


If this is a maintainable strategy for Canonical types it sounds good to
me.

Re: Complex Number support in Arrow

Posted by Wes McKinney <we...@gmail.com>.
My understanding is that it means having COMPLEX as an entry in the
arrow/type_fwd.h Type enum. I agree this would make implementation
work in the C++ library much more straightforward.

One idea I proposed would be to do that, and implement the
serialization of the complex metadata using Extension types.

Re: Complex Number support in Arrow

Posted by Weston Pace <we...@gmail.com>.
> While dedicated types are not strictly required, compute functions would
> be much easier to add for a first-class dedicated complex datatype
> rather than for an extension type.
@pitrou

This is perhaps a naive question (and admittedly, I'm not up to speed
on my compute kernels) but why is this the case?  For example, if
adding a complex addition kernel it seems we would be talking about...

dest_scalar.real = scalar1.real + scalar2.real;
dest_scalar.im = scalar1.im + scalar2.im;

vs...

dest_scalar[0] = scalar1[0] + scalar2[0];
dest_scalar[1] = scalar1[1] + scalar2[1];
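
For what it's worth, the two variants above can be sketched in standalone
C++ roughly as follows. This is only an illustration under assumptions:
std::complex stands in for a hypothetical first-class complex value, a
flat vector of interleaved [real, imag] floats stands in for list-style
storage, and AddComplex / AddInterleaved are made-up names, not Arrow
compute kernels.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Variant 1: a dedicated complex value type.
std::vector<std::complex<float>> AddComplex(
    const std::vector<std::complex<float>>& a,
    const std::vector<std::complex<float>>& b) {
  assert(a.size() == b.size());
  std::vector<std::complex<float>> out(a.size());
  for (std::size_t i = 0; i < a.size(); ++i) out[i] = a[i] + b[i];
  return out;
}

// Variant 2: the same values as interleaved [real, imag] float pairs.
std::vector<float> AddInterleaved(const std::vector<float>& a,
                                  const std::vector<float>& b) {
  assert(a.size() == b.size() && a.size() % 2 == 0);
  std::vector<float> out(a.size());
  for (std::size_t i = 0; i < a.size(); i += 2) {
    out[i] = a[i] + b[i];              // real parts
    out[i + 1] = a[i + 1] + b[i + 1];  // imaginary parts
  }
  return out;
}
```

Addition is component-wise, so the two loops look nearly identical. An
operation such as multiplication mixes the real and imaginary parts, so
the index-based variant would have to spell that arithmetic out by hand.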

Re: Complex Number support in Arrow

Posted by Wes McKinney <we...@gmail.com>.
I'd be supportive of starting with this as a "canonical" extension
type so that all implementations are not expected to support complex
types — this would encourage us to build sufficient integration e.g.
with NumPy to get things working end-to-end with the on-wire
representation being an extension type. We could certainly choose to
treat the type as "first class" in the C++ library without it being
"top level" in the Type union in Flatbuffers.

I agree that the use cases are more specialized, and the fact that we
haven't needed it until now (or at least, its absence suggests this)
shows that this is the case.

Re: Complex Number support in Arrow

Posted by Micah Kornfield <em...@gmail.com>.
> I'm convinced now that first-class types seem to be the way to go and I'm
> happy to take this approach.

I agree from an implementation effort it is simpler, but I'm still not
convinced that we should be adding this as a first class type.  As noted in
the survey below it appears Complex numbers are not a core concept in many
general purpose coding languages and it doesn't appear to be a common type
in SQL systems either.

The reason why I am being nit-picky here is I think that having a first
class type indicates that it should eventually be supported by all
reference implementations.  A "well known" extension type I think offers
fewer guarantees which makes it seem more suitable for niche types.

> > I don't immediately see a Packed Struct type. Would this need to be
> > implemented?
> Not necessarily (*).  But before thinking about implementation, this
> proposal must be accepted into the format.


Yes, this is a type that has been proposed in the past and I think handles
a lot of types that are not yet in Arrow but have been requested (e.g. IP
addresses, geo coordinates, etc.).

Re: Complex Number support in Arrow

Posted by Simon Perkins <si...@gmail.com>.
On Wed, Jun 9, 2021 at 7:56 PM Antoine Pitrou <an...@python.org> wrote:

> On 09/06/2021 at 17:52, Micah Kornfield wrote:
> >
> > Adding a new first-class type in Arrow requires working integration tests
> > between C++ and Java libraries (once the idea is informally agreed upon)
> > and then a final vote for approval.  We haven't formalized extension types
> > but I imagine a similar cross language requirement would be agreed upon.
> > Implementation of computation wouldn't be required for adding a new type.
> > Different language bindings have taken different approaches on how much
> > additional computational elements are packaged in them.
>
> While dedicated types are not strictly required, compute functions would
> be much easier to add for a first-class dedicated complex datatype
> rather than for an extension type.
>
> Since complex numbers are quite common in some domains, and since they
> are conceptually simple, IMHO it would make sense to add them to the
> native Arrow datatypes (at least COMPLEX64 and COMPLEX128).

I'm convinced now that first-class types seem to be the way to go and I'm
happy to take this approach.
Regarding compute functions, it looks like the standard set of scalar
arithmetic and reduction functionality
is desirable for complex numbers:
https://arrow.apache.org/docs/cpp/compute.html#
Perhaps it would be better to split the addition of the Types and the
addition of Compute functionality into separate PRs?

Regarding the process for managing this PR, it sounds like a proposal must
be voted on?
i.e. is this proposal still in this phase?
http://arrow.apache.org/docs/developers/contributing.html#before-starting
Regards

Simon

Re: Complex Number support in Arrow

Posted by Antoine Pitrou <an...@python.org>.
On 09/06/2021 at 17:52, Micah Kornfield wrote:
> 
> Adding a new first-class type in Arrow requires working integration tests
> between C++ and Java libraries (once the idea is informally agreed upon)
> and then a final vote for approval.  We haven't formalized extension types
> but I imagine a similar cross language requirement would be agreed upon.
> Implementation of computation wouldn't be required for adding a new type.
> Different language bindings have taken different approaches on how much
> additional computational elements are packaged in them.

While dedicated types are not strictly required, compute functions would 
be much easier to add for a first-class dedicated complex datatype 
rather than for an extension type.

Since complex numbers are quite common in some domains, and since they
are conceptually simple, IMHO it would make sense to add them to the
native Arrow datatypes (at least COMPLEX64 and COMPLEX128).

Regards

Antoine.

Re: Complex Number support in Arrow

Posted by Micah Kornfield <em...@gmail.com>.
Hi Simon,

Please see a recent discussion on adding new types [1]

>    - Adding first class complex types seems to involve modifying
>    cpp/src/arrow/ipc/feather.fbs which may change the protocol and
>    introduce breaking changes. I'm not sure about this and seek advice on
>    how invasive this approach is and whether it's worth pursuing.


My understanding is that feather.fbs is for V1 feather files and probably
shouldn't be touched.  Only updating schema.fbs should be required and the
type should be doable in a backwards/forwards compatible way (we've added
types without bumping the metadata version and are in the process of adding
more).

>    - list(float{32,64}) seems to work fine as an ExtensionType, but I'd
>    imagine a struct([real, imag]) might offer more in terms of affordance to
>    the user. I'd imagine the underlying memory layout would be the same.


What notation is this using (are 32, 64 meant to be substitutable
parameters)?  I would think FixedSizeList might be more appropriate than
list.

It seems like what we would want for this is a "Packed Struct" type and
then have an extension type to wrap it. The existing structs in arrow have
a very different memory layout than lists (the real and imaginary
components would not be adjacent in memory with Structs).  All the
representations also have trade-offs on how they would be mapped to parquet
and the relevant feature set there.
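
To illustrate the layout difference, here is a rough sketch using only
Python's standard array module rather than Arrow itself. It assumes a
fixed-size list keeps its values in one contiguous child buffer, while a
struct keeps one separate buffer per field. The variable names are
illustrative only:

```python
from array import array

values = [1 + 2j, 3 - 4j, 5 + 6j]

# Fixed-size-list-like layout: one buffer with real and imaginary parts
# interleaved, [re0, im0, re1, im1, ...]. This matches the pair layout
# used by NumPy complex arrays and C++ std::complex.
interleaved = array("d")
for z in values:
    interleaved.extend((z.real, z.imag))

# Struct-like layout: one buffer per field, so the real and imaginary
# parts of a given value are not adjacent in memory.
real = array("d", (z.real for z in values))
imag = array("d", (z.imag for z in values))

# Same numbers, different byte order in memory.
same_bytes = interleaved.tobytes() == real.tobytes() + imag.tobytes()
```

With these values same_bytes is False, which is why a struct-based
representation could not simply be reinterpreted as an array of
std::complex without copying.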

>    - I don't have a clear understanding of whether adding either a
>    First-Class or ExtensionType involves supporting numeric operations on that
>    type (e.g. Complex Exponential, Absolutes, Min or Max operations) or
>    whether Arrow is merely concerned with the underlying data representation.


Adding a new first-class type in Arrow requires working integration tests
between C++ and Java libraries (once the idea is informally agreed upon)
and then a final vote for approval.  We haven't formalized extension types
but I imagine a similar cross language requirement would be agreed upon.
Implementation of computation wouldn't be required for adding a new type.
Different language bindings have taken different approaches on how much
additional computational elements are packaged in them.

-Micah

[1]
https://lists.apache.org/thread.html/r7ba08aed2809fa64537e6f44bce38b2cf740acbef0e91cfaa7c19767%40%3Cdev.arrow.apache.org%3E

On Tue, Jun 8, 2021 at 1:27 AM Simon Perkins <si...@gmail.com>
wrote:

> Greetings Apache Dev Mailing List
>
> I'm interested in adding complex number support to Arrow. The use case is
> Radio Astronomy data, which is represented by complex values.
>
> xref https://issues.apache.org/jira/browse/ARROW-638
> xref https://github.com/apache/arrow/pull/10452
>
> It's fairly easy to support Complex Numbers as a Python Extension -- see
> for e.g. how I've done it here using a list(float{32,64}):
>
>
> https://github.com/ska-sa/dask-ms/blob/a5bd8538ea3de9fabb8fe74e89c3a75c4043f813/daskms/experimental/arrow/extension_types.py#L144-L173
>
> The above seems to work with the standard NumPy complex memory layout
> (consecutive pairs of [real, imag] values) and should work with the C++
> std::complex layout. Note that C complex and C++ std::complex should also
> have the same layout https://stackoverflow.com/a/10540346.
>
> However, this constrains this representation of Complex Numbers to
> dask-ms only. I think that it would be better to add support for this at a
> base level in Arrow, especially since this will open up the ability for
> other packages to understand the Complex Number Type. For example, it would
> be useful to:
>
>    1. Have a clearly defined Pandas -> Arrow -> Parquet -> Arrow -> Pandas
>    roundtrip. Currently there's no Pandas -> Arrow conversion for
>    np.complex{64, 128}.
>    2. Support complex number types in query engines like DataFusion and
>    BlazingSQL, if only initially via selection on indexing columns.
>
>
> I started up a PR in https://github.com/apache/arrow/pull/10452 adding
> Complex Numbers as a first-class Arrow type, although I note that
>
> https://issues.apache.org/jira/browse/ARROW-638?focusedCommentId=16912456&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16912456
> suggests implementing this as a C++ Extension Type on a first pass. Initial
> experiments suggest this is pretty doable -- I've got some test cases
> running already.
>
> I have some questions going forward:
>
>    - Adding first class complex types seems to involve modifying
>    cpp/src/arrow/ipc/feather.fbs which may change the protocol and introduce
>    breaking changes. I'm not sure about this and seek advice on how invasive
>    this approach is and whether it's worth pursuing.
>    - list(float{32,64}) seems to work fine as an ExtensionType, but I'd
>    imagine a struct([real, imag]) might offer more in terms of affordance to
>    the user. I'd imagine the underlying memory layout would be the same.
>    - I don't have a clear understanding of whether adding either a
>    First-Class or ExtensionType involves supporting numeric operations on that
>    type (e.g. Complex Exponential, Absolutes, Min or Max operations) or
>    whether Arrow is merely concerned with the underlying data representation.
>
> Thanks for considering this.
>   Simon Perkins
>