Posted to dev@arrow.apache.org by Yibo Cai <yi...@arm.com> on 2019/10/30 07:07:55 UTC
some questions, please help
Hi,
I'm new to Arrow and would like to ask for help with a few questions. Any comments are welcome.
- About the source tree: my understanding is that "cpp" is the core Arrow library, and "c_glib, go, python, ..." are language bindings that ease integrating Arrow into apps written in those languages. Is that correct?
- Arrow implements many data types and aggregation functions (sum, mean, ...). [1]
IMO, more functions and types should be supported, such as min/max, vector/tensor operations, big numbers, etc. I'm not sure whether this is in Arrow's scope, or whether apps using Arrow should handle it themselves.
- I see some SIMD optimizations in the Arrow Go binding, such as a vectorized sum. [2]
But the Arrow C++ library doesn't leverage SIMD. [3]
Why not optimize it in the C++ library so all languages can benefit?
[1] https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute/kernels
[2] https://github.com/apache/arrow/blob/master/go/arrow/math/float64_avx2_amd64.s
[3] https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/sum_internal.h#L99-L111
Yibo
Re: some questions, please help
Posted by Micah Kornfield <em...@gmail.com>.
On Thu, Nov 7, 2019 at 11:25 PM Yibo Cai <yi...@arm.com> wrote:
> I wonder how Arrow deals with gaps among different implementations? Say,
> the C++ lib implements some features the Go lib doesn't support. Is there a
> consistent API document, or documents for each language implementation?
It is important to distinguish between two types of functionality:
1. Supporting all the features of the interchange format(s). In this
case the canonical document is the format specification [1].
2. Additional functionality for processing Arrow data (e.g. query engines,
slicing, etc.).
For 1, we have integration tests [2] and known gaps for some implementations
(search for skip.add in datagen.py), which should all have JIRAs associated
with them. Some of the implementations (e.g. C#) have not been added to the
integration tests at all.
For 2, the community has not been concerned with keeping feature parity.
For instance, the Java library has a substantially different class
naming/hierarchy than C++. Also, at least at the moment, no one has
expressed interest in implementing a query engine/dataframe library as part
of the Arrow project in Java (work has mostly been focused on
performance improvements and algorithms that contributors have found
useful).
Hope this helps.
-Micah
[1] https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst
[2] https://github.com/apache/arrow/blob/5ca85922ae90bacb96d939503e53e83e6ec47f8c/dev/archery/archery/integration/datagen.py
Re: some questions, please help
Posted by Yibo Cai <yi...@arm.com>.
Hi Wes,
On 10/30/19 10:24 PM, Wes McKinney wrote:
> hi Yibo
>
> On Wed, Oct 30, 2019 at 2:16 AM Yibo Cai <yi...@arm.com> wrote:
>>
>> Hi,
>>
>> I'm new to Arrow and would like to ask for help with a few questions. Any comments are welcome.
>>
>> - About the source tree: my understanding is that "cpp" is the core Arrow library, and "c_glib, go, python, ..." are language bindings that ease integrating Arrow into apps written in those languages. Is that correct?
>
> No. We have 6 core implementations: C++, C#, Go, Java, JavaScript, and Rust.
>
> * C/GLib, MATLAB, Python, R bind to C++
> * Ruby binds to GLib
>
I wonder how Arrow deals with gaps among different implementations? Say, the C++ lib implements some features the Go lib doesn't support. Is there a consistent API document, or documents for each language implementation?
Re: some questions, please help
Posted by Yibo Cai <yi...@arm.com>.
Thanks Wes, Micah, your comments are very helpful.
Yibo
On 10/30/19 10:45 PM, Wes McKinney wrote:
> On Wed, Oct 30, 2019 at 9:32 AM Micah Kornfield <em...@gmail.com> wrote:
>>
>>>
>>>> - I see some SIMD optimizations in the Arrow Go binding, such as a
>>>> vectorized sum. [2]
>>>> But the Arrow C++ library doesn't leverage SIMD. [3]
>>>> Why not optimize it in the C++ library so all languages can benefit?
>>> You're welcome to contribute such optimizations to the C++ library.
>>
>>
>> Note that even though C++ doesn't use explicit SIMD intrinsics, the
>> compiler will often generate SIMD code because it can auto-vectorize the
>> code.
>
> Note it will likely be important to have explicit dynamic/runtime SIMD
> dispatching on certain hot paths, as we build binaries that need to be
> able to run on both newer and older CPUs.
>
Re: some questions, please help
Posted by Wes McKinney <we...@gmail.com>.
On Wed, Oct 30, 2019 at 9:32 AM Micah Kornfield <em...@gmail.com> wrote:
>
> >
> > > - I see some SIMD optimizations in the Arrow Go binding, such as a
> > > vectorized sum. [2]
> > > But the Arrow C++ library doesn't leverage SIMD. [3]
> > > Why not optimize it in the C++ library so all languages can benefit?
> > You're welcome to contribute such optimizations to the C++ library.
>
>
> Note that even though C++ doesn't use explicit SIMD intrinsics, the
> compiler will often generate SIMD code because it can auto-vectorize the
> code.
Note it will likely be important to have explicit dynamic/runtime SIMD
dispatching on certain hot paths, as we build binaries that need to be
able to run on both newer and older CPUs.
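As a rough illustration of the runtime-dispatch idea, the sketch below builds a baseline routine plus an AVX2-enabled variant and picks one when the function is first called. This is not Arrow's actual dispatch mechanism: the names SumBaseline, SumAvx2, ResolveSum, Sum and the macro HAVE_AVX2_VARIANT are hypothetical, and it assumes GCC or Clang on x86 for the `target` attribute and `__builtin_cpu_supports`. On any other compiler or architecture it simply falls back to the baseline.

```cpp
#include <cstddef>
#include <cstdint>

// Signature shared by all variants (hypothetical names, for illustration).
using SumFn = int64_t (*)(const int64_t*, std::size_t);

// Baseline variant: compiled for the default target, runs on any CPU.
int64_t SumBaseline(const int64_t* values, std::size_t length) {
  int64_t total = 0;
  for (std::size_t i = 0; i < length; ++i) total += values[i];
  return total;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_AVX2_VARIANT 1
// Same loop, but the compiler may emit AVX2 instructions for this function
// only. It must not be called on a CPU without AVX2.
__attribute__((target("avx2")))
int64_t SumAvx2(const int64_t* values, std::size_t length) {
  int64_t total = 0;
  for (std::size_t i = 0; i < length; ++i) total += values[i];
  return total;
}
#endif

// Choose a variant based on what the running CPU reports.
SumFn ResolveSum() {
#ifdef HAVE_AVX2_VARIANT
  if (__builtin_cpu_supports("avx2")) return SumAvx2;
#endif
  return SumBaseline;
}

// Public entry point: resolves once (function-local statics are initialized
// in a thread-safe way since C++11), then forwards through the pointer.
int64_t Sum(const int64_t* values, std::size_t length) {
  static const SumFn impl = ResolveSum();
  return impl(values, length);
}
```

Callers only ever see Sum, so one binary can ship both code paths and still run on older CPUs. The cost is an indirect call per invocation, which is why this pattern tends to be reserved for hot paths that process many values per call.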
Re: some questions, please help
Posted by Micah Kornfield <em...@gmail.com>.
>
> > - I see some SIMD optimizations in the Arrow Go binding, such as a
> > vectorized sum. [2]
> > But the Arrow C++ library doesn't leverage SIMD. [3]
> > Why not optimize it in the C++ library so all languages can benefit?
> You're welcome to contribute such optimizations to the C++ library.
Note that even though C++ doesn't use explicit SIMD intrinsics, the
compiler will often generate SIMD code because it can auto-vectorize the
code.
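To make the auto-vectorization point concrete, here is a minimal sketch of a scalar reduction with no intrinsics at all. The function name SumInt64 is hypothetical and is not the Arrow kernel referenced in [3]. The assumption is an optimizing build (for example -O3 on GCC or Clang), where compilers can often turn a loop of this shape into SIMD instructions for the target CPU.

```cpp
#include <cstddef>
#include <cstdint>

// Plain scalar sum over a contiguous buffer (hypothetical helper).
// No SIMD intrinsics appear here; any vectorization is left to the
// compiler's optimizer.
int64_t SumInt64(const int64_t* values, std::size_t length) {
  int64_t total = 0;
  for (std::size_t i = 0; i < length; ++i) {
    total += values[i];
  }
  return total;
}
```

Integer addition can be reordered freely, which is what lets the compiler split the loop into SIMD lanes. A floating-point sum is harder: reordering changes rounding, so compilers generally will not vectorize it unless told that is acceptable (for example with -ffast-math). Inspecting the generated assembly is the reliable way to see what a given compiler actually did.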
Re: some questions, please help
Posted by Wes McKinney <we...@gmail.com>.
hi Yibo
On Wed, Oct 30, 2019 at 2:16 AM Yibo Cai <yi...@arm.com> wrote:
>
> Hi,
>
> I'm new to Arrow and would like to ask for help with a few questions. Any comments are welcome.
>
> - About the source tree: my understanding is that "cpp" is the core Arrow library, and "c_glib, go, python, ..." are language bindings that ease integrating Arrow into apps written in those languages. Is that correct?
No. We have 6 core implementations: C++, C#, Go, Java, JavaScript, and Rust.
* C/GLib, MATLAB, Python, R bind to C++
* Ruby binds to GLib
> - Arrow implements many data types and aggregation functions (sum, mean, ...). [1]
> IMO, more functions and types should be supported, such as min/max, vector/tensor operations, big numbers, etc. I'm not sure whether this is in Arrow's scope, or whether apps using Arrow should handle it themselves.
Our objective, at least in the C++ library, is to have a generally
useful "standard library" that handles common application concerns.
Whether or not something is thought to be in scope may vary on a
case-by-case basis -- if you can't find a JIRA issue for something in
particular, please go ahead and open one.
> - I see some SIMD optimizations in the Arrow Go binding, such as a vectorized sum. [2]
> But the Arrow C++ library doesn't leverage SIMD. [3]
> Why not optimize it in the C++ library so all languages can benefit?
You're welcome to contribute such optimizations to the C++ library.
- Wes
> [1] https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute/kernels
> [2] https://github.com/apache/arrow/blob/master/go/arrow/math/float64_avx2_amd64.s
> [3] https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/sum_internal.h#L99-L111
>
> Yibo