Posted to dev@arrow.apache.org by Yibo Cai <yi...@arm.com> on 2019/10/30 07:07:55 UTC

some questions, please help

Hi,

I'm new to Arrow and would like to ask for help with a few questions. Any comments are welcome.

- About the source code tree: my understanding is that "cpp" is the core Arrow library, and "c_glib, go, python, ..." are language bindings that ease integrating Arrow into apps developed in those languages. Is that correct?

- Arrow implements many data types and aggregation functions (sum, mean, ...). [1]
   IMO, more functions and types should be supported, such as min/max, vector/tensor operations, big numbers, etc. I'm not sure whether this is in Arrow's scope, or whether apps using Arrow should handle it themselves.

- I see some SIMD optimizations in the Arrow Go binding, such as a vectorized sum. [2]
   But the Arrow C++ library doesn't leverage SIMD. [3]
   Why not optimize it in the C++ library so all languages can benefit?

[1] https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute/kernels
[2] https://github.com/apache/arrow/blob/master/go/arrow/math/float64_avx2_amd64.s
[3] https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/sum_internal.h#L99-L111

Yibo

Re: some questions, please help

Posted by Micah Kornfield <em...@gmail.com>.
>
> I wonder how arrow deals with gaps among different implementations? Say,
> C++ lib implements some features go lib doesn't support. Is there a
> consistent API document, or documents for each language implementation?


It is important to distinguish between two types of functionality:
1.  Supporting all the features of the interchange format(s). In this
case the canonical document is the format specification [1].
2.  Additional functionality for processing Arrow data (e.g. query engines,
slicing, etc.).

For 1 we have integration tests [2] and known gaps for some implementations
(search for skip.add in datagen.py), which should all have JIRAs associated
with them. Some of the implementations (e.g. C#) have not been added to the
integration tests at all.

For 2 the community has not been concerned with keeping feature parity.
For instance, the Java library has a substantially different class
naming/hierarchy than C++. Also, at least at the moment, no one has
expressed interest in implementing a query engine/dataframe library as part
of the Arrow project in Java (work has mostly been focused on
performance improvements and some algorithms that contributors have found
useful).

Hope this helps.

-Micah

[1]
https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst
[2]
https://github.com/apache/arrow/blob/5ca85922ae90bacb96d939503e53e83e6ec47f8c/dev/archery/archery/integration/datagen.py

On Thu, Nov 7, 2019 at 11:25 PM Yibo Cai <yi...@arm.com> wrote:

> Hi Wes,
>
> On 10/30/19 10:24 PM, Wes McKinney wrote:
> > hi Yibo
> >
> > On Wed, Oct 30, 2019 at 2:16 AM Yibo Cai <yi...@arm.com> wrote:
> >>
> >> Hi,
> >>
> >> I'm new to Arrow. Would like to seek for help about some questions. Any
> comment is welcomed.
> >>
> >> - About source code tree, my understand is that "cpp" is the core arrow
> libraries, "c_glib, go, python, ..." are language bindings to ease
> integrating arrow into apps developed by that language. Is that correct?
> >
> > No. We have 6 core implementations: C++, C#, Go, Java, JavaScript, and
> Rust
> >
> > * C/GLib, MATLAB, Python, R bind to C++
> > * Ruby binds to GLib
> >
>
> I wonder how arrow deals with gaps among different implementations? Say,
> C++ lib implements some features go lib doesn't support. Is there a
> consistent API document, or documents for each language implementation?

Re: some questions, please help

Posted by Yibo Cai <yi...@arm.com>.
Hi Wes,

On 10/30/19 10:24 PM, Wes McKinney wrote:
> hi Yibo
> 
> On Wed, Oct 30, 2019 at 2:16 AM Yibo Cai <yi...@arm.com> wrote:
>>
>> Hi,
>>
>> I'm new to Arrow. Would like to seek for help about some questions. Any comment is welcomed.
>>
>> - About source code tree, my understand is that "cpp" is the core arrow libraries, "c_glib, go, python, ..." are language bindings to ease integrating arrow into apps developed by that language. Is that correct?
> 
> No. We have 6 core implementations: C++, C#, Go, Java, JavaScript, and Rust
> 
> * C/GLib, MATLAB, Python, R bind to C++
> * Ruby binds to GLib
> 

I wonder how Arrow deals with gaps among different implementations? Say, the C++ lib implements some features the Go lib doesn't support. Is there a consistent API document, or are there documents for each language implementation?


Re: some questions, please help

Posted by Yibo Cai <yi...@arm.com>.
Thanks Wes, Micah, your comments are very helpful.

Yibo

On 10/30/19 10:45 PM, Wes McKinney wrote:
> On Wed, Oct 30, 2019 at 9:32 AM Micah Kornfield <em...@gmail.com> wrote:
>>
>>>
>>>> - I see some SIMD optimizations in arrow go binding, such as vectored
>>> sum. [2]
>>>>     But arrow cpp lib doesn't leverage SIMD. [3]
>>>>     Why not optimize it in cpp lib so all languages can benefit?
>>> You're welcome to contribute such optimizations to the C++ library
>>
>>
>> Note that even though C++ doesn't use explicit SIMD intrinsics often times
>> the compiler will generate SIMD code because it can auto-vectorize the
>> code.
> 
> Note it will likely be important to have explicit dynamic/runtime SIMD
> dispatching on certain hot paths as we build binaries that need to be
> able to run on both newer and older CPUs

Re: some questions, please help

Posted by Wes McKinney <we...@gmail.com>.
On Wed, Oct 30, 2019 at 9:32 AM Micah Kornfield <em...@gmail.com> wrote:
>
> >
> > > - I see some SIMD optimizations in arrow go binding, such as vectored
> > sum. [2]
> > >    But arrow cpp lib doesn't leverage SIMD. [3]
> > >    Why not optimize it in cpp lib so all languages can benefit?
> > You're welcome to contribute such optimizations to the C++ library
>
>
> Note that even though C++ doesn't use explicit SIMD intrinsics often times
> the compiler will generate SIMD code because it can auto-vectorize the
> code.

Note it will likely be important to have explicit dynamic/runtime SIMD
dispatching on certain hot paths, as we build binaries that need to be
able to run on both newer and older CPUs.
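
As a rough illustration of what runtime dispatching can look like (a minimal sketch, not Arrow's actual implementation): the code below assumes GCC or Clang on x86-64 and relies on their `__builtin_cpu_supports` builtin and `target` function attribute. The names SumScalar, SumAvx2, ResolveSum and Sum are hypothetical. On other compilers or architectures the sketch simply falls back to the baseline loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Baseline version: compiled for the default target, so it runs on any CPU
// the binary itself supports.
int64_t SumScalar(const int64_t* values, std::size_t length) {
  int64_t sum = 0;
  for (std::size_t i = 0; i < length; ++i) sum += values[i];
  return sum;
}

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#define HAVE_AVX2_VARIANT 1
// Same loop, but the compiler is allowed to emit AVX2 instructions for this
// one function only. It must not be called on CPUs without AVX2.
__attribute__((target("avx2")))
int64_t SumAvx2(const int64_t* values, std::size_t length) {
  int64_t sum = 0;
  for (std::size_t i = 0; i < length; ++i) sum += values[i];
  return sum;
}
#endif

using SumFn = int64_t (*)(const int64_t*, std::size_t);

// Choose an implementation based on what the running CPU reports.
SumFn ResolveSum() {
#ifdef HAVE_AVX2_VARIANT
  __builtin_cpu_init();
  if (__builtin_cpu_supports("avx2")) return SumAvx2;
#endif
  return SumScalar;
}

// Public entry point: the implementation is resolved once, on first call.
int64_t Sum(const int64_t* values, std::size_t length) {
  static const SumFn impl = ResolveSum();
  assert(values != nullptr || length == 0);
  return impl(values, length);
}
```

The point of the design is that a single binary carries both code paths, and callers only ever see Sum; the CPU check happens once rather than on every call.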


Re: some questions, please help

Posted by Micah Kornfield <em...@gmail.com>.
>
> > - I see some SIMD optimizations in arrow go binding, such as vectored
> sum. [2]
> >    But arrow cpp lib doesn't leverage SIMD. [3]
> >    Why not optimize it in cpp lib so all languages can benefit?
> You're welcome to contribute such optimizations to the C++ library


Note that even though the C++ library doesn't use explicit SIMD intrinsics,
the compiler will often generate SIMD code because it can auto-vectorize the
code.
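
For illustration only (a minimal sketch, not the kernel linked in [3]): a plain reduction loop like the one below contains no intrinsics, yet optimizing compilers can typically turn it into SIMD instructions. SumInt64 is a hypothetical helper name, and whether vectorization actually happens depends on the compiler, its version, and flags such as -O3.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Plain scalar loop with no SIMD intrinsics. With optimization enabled,
// compilers can typically auto-vectorize an integer reduction like this one.
// SumInt64 is a hypothetical helper, not an Arrow API.
int64_t SumInt64(const int64_t* values, std::size_t length) {
  assert(values != nullptr || length == 0);
  int64_t sum = 0;
  for (std::size_t i = 0; i < length; ++i) {
    sum += values[i];
  }
  return sum;
}
```

One caveat: a floating-point sum is usually harder for the compiler to vectorize, because reordering the additions can change the result; it generally needs permission to reassociate (for example via -ffast-math).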

On Wed, Oct 30, 2019 at 7:25 AM Wes McKinney <we...@gmail.com> wrote:

> hi Yibo
>
> On Wed, Oct 30, 2019 at 2:16 AM Yibo Cai <yi...@arm.com> wrote:
> >
> > Hi,
> >
> > I'm new to Arrow. Would like to seek for help about some questions. Any
> comment is welcomed.
> >
> > - About source code tree, my understand is that "cpp" is the core arrow
> libraries, "c_glib, go, python, ..." are language bindings to ease
> integrating arrow into apps developed by that language. Is that correct?
>
> No. We have 6 core implementations: C++, C#, Go, Java, JavaScript, and Rust
>
> * C/GLib, MATLAB, Python, R bind to C++
> * Ruby binds to GLib
>
> > - Arrow implements many data types and aggregation functions(sum, mean,
> ...). [1]
> >    IMO, more functions and types should be supported, like min/max,
> vector/tensor operations, big number, etc. I'm not sure if this is in
> arrow's scope, or the apps using arrow should deal with it themselves.
>
> Our objective at least in the C++ library is to have a generally
> useful "standard library" that handles common application concerns.
> Whether or not something is thought to be in scope may vary on a case
> by case basis -- if you can't find a JIRA issue for something in
> particular, please go ahead and open one.
>
> > - I see some SIMD optimizations in arrow go binding, such as vectored
> sum. [2]
> >    But arrow cpp lib doesn't leverage SIMD. [3]
> >    Why not optimize it in cpp lib so all languages can benefit?
>
> You're welcome to contribute such optimizations to the C++ library
>
>
> - Wes
>
> > [1]
> https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute/kernels
> > [2]
> https://github.com/apache/arrow/blob/master/go/arrow/math/float64_avx2_amd64.s
> > [3]
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/sum_internal.h#L99-L111
> >
> > Yibo
>

Re: some questions, please help

Posted by Wes McKinney <we...@gmail.com>.
hi Yibo

On Wed, Oct 30, 2019 at 2:16 AM Yibo Cai <yi...@arm.com> wrote:
>
> Hi,
>
> I'm new to Arrow. Would like to seek for help about some questions. Any comment is welcomed.
>
> - About source code tree, my understand is that "cpp" is the core arrow libraries, "c_glib, go, python, ..." are language bindings to ease integrating arrow into apps developed by that language. Is that correct?

No. We have 6 core implementations: C++, C#, Go, Java, JavaScript, and Rust

* C/GLib, MATLAB, Python, R bind to C++
* Ruby binds to GLib

> - Arrow implements many data types and aggregation functions(sum, mean, ...). [1]
>    IMO, more functions and types should be supported, like min/max, vector/tensor operations, big number, etc. I'm not sure if this is in arrow's scope, or the apps using arrow should deal with it themselves.

Our objective, at least in the C++ library, is to have a generally
useful "standard library" that handles common application concerns.
Whether or not something is thought to be in scope may vary on a
case-by-case basis -- if you can't find a JIRA issue for something in
particular, please go ahead and open one.

> - I see some SIMD optimizations in arrow go binding, such as vectored sum. [2]
>    But arrow cpp lib doesn't leverage SIMD. [3]
>    Why not optimize it in cpp lib so all languages can benefit?

You're welcome to contribute such optimizations to the C++ library.


- Wes

> [1] https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute/kernels
> [2] https://github.com/apache/arrow/blob/master/go/arrow/math/float64_avx2_amd64.s
> [3] https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/sum_internal.h#L99-L111
>
> Yibo