Posted to dev@arrow.apache.org by Micah Kornfield <em...@gmail.com> on 2020/08/14 23:35:43 UTC
[Java] Supporting Big Endian
Kazuaki Ishizaki has started working on Big Endian support in Java
(including setting up CI for it). Thank you!
We previously discussed support for Big Endian architectures in C++ [1] and
generally agreed that it was a reasonable thing to do.
Similar to C++, I think as long as we have a working CI setup it is
reasonable for Java to support Big Endian machines.
But I think there might be differing opinions, so it is worth a discussion
to see if there are technical blockers or other reasons for not supporting
Big Endian architectures in the existing Java implementation.
Thanks,
Micah
[1]
https://lists.apache.org/thread.html/rcae745f1d848981bb5e8dddacfc4554641aba62e3c949b96bfd8b019%40%3Cdev.arrow.apache.org%3E
Re: [DISCUSS] Big Endian support in Arrow (was: Re: [Java] Supporting Big Endian)
Posted by Kazuaki Ishizaki <IS...@jp.ibm.com>.
Hi Micah,
Thank you for expanding the scope for Big Endian support in Arrow. I am
glad to see this now that I am back from a one-week vacation.
I agree with this since we have just seen the kickoff of BE support in Go.
Hi Wes,
Thank you for your positive comments. We should carefully implement BE
support without performance overhead.
Let me respond to Micah's comments and questions.
> 1. As long as there is CI in place to catch regressions (right now I
> think the CI is fairly unreliable?)
I agree. While TravisCI on s390x is unreliable, each platform can set up
its own CI script.
> 2. No degradation in performance for little-endian architectures
> (verified by additional micro benchmarks)
Yes, we can do it. @kou suggested using the existing mechanism like
[1]. For now, it supports C++. We could expand this to other languages.
> 3. Not a large amount of invasive code to distinguish between
> platforms.
Yes, I will prepare the current holistic strategy for endian-independent
implementation in another mail.
> Kazuaki Ishizaki I asked question previously, but could you give some
> data points around:
Sorry for the delay. I was waiting for comments during my one-week
vacation.
> 1. The current state of C++ support (how much code needed to change)?
Except for the Parquet support, in my understanding [2] is the last PR
needed to support big-endian for intra-process and inter-process use in C++.
> 2. How many more PRs you expect to need for Java (and approximate
> size)?
I have just submitted two PRs [5, 6] today to pass the current test using
“mvn install”. The total size of the four PRs [3, 4, 5, 6] is around
200 lines. In addition, I will submit another PR (~500 lines by my
estimate) to pass the integration test between different endians, which
corresponds to the PR [2] in C++.
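For context, the cross-endian work described here amounts to reversing the byte order of each fixed-width value in a buffer when a record batch's declared endianness differs from the host's. A minimal hedged sketch of the idea (hypothetical helper, not the actual PR code):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch of per-value byte swapping for an 8-byte-wide
// value buffer, as a reader might do when a record batch's declared
// endianness differs from the host's. Not the actual PR code.
public final class SwapSketch {
    static void swapLongs(ByteBuffer buf) {
        for (int i = 0; i + 8 <= buf.limit(); i += 8) {
            buf.putLong(i, Long.reverseBytes(buf.getLong(i)));
        }
    }

    public static void main(String[] args) {
        // Write two values in big-endian order, then swap them in place.
        ByteBuffer buf = ByteBuffer.allocate(16).order(ByteOrder.BIG_ENDIAN);
        buf.putLong(0, 1L).putLong(8, 2L);
        swapLongs(buf);
        // The buffer now holds the little-endian representation.
        System.out.println(buf.order(ByteOrder.LITTLE_ENDIAN).getLong(0)); // prints 1
    }
}
```

A real implementation would dispatch on the element width recorded in the schema (1, 2, 4, or 8 bytes) rather than assuming 8-byte values.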
For CI for Java, I submitted one draft PR [7] (~100 lines).
BTW, looking at some test cases, I think it would be good to increase
test coverage for little-endian and big-endian by adding test cases (e.g.
increasing the types in a schema to be tested in flight-core, as in [8]). If
that is acceptable, we can submit more PRs regardless of endianness.
[1] https://github.com/apache/arrow/pull/7940#issuecomment-672690540
[2] https://github.com/apache/arrow/pull/7507
[3] https://github.com/apache/arrow/pull/7942
[4] https://github.com/apache/arrow/pull/7944
[5] https://github.com/apache/arrow/pull/8056
[6] https://github.com/apache/arrow/pull/8057
[7] https://github.com/apache/arrow/pull/7938
[8] https://github.com/apache/arrow/pull/7555
Best Regards,
Kazuaki Ishizaki
Wes McKinney <we...@gmail.com> wrote on 2020/08/26 21:27:49:
> From: Wes McKinney <we...@gmail.com>
> To: dev <de...@arrow.apache.org>, Micah Kornfield <em...@gmail.com>
> Cc: Fan Liya <li...@gmail.com>
> Date: 2020/08/26 21:28
> Subject: [EXTERNAL] Re: [DISCUSS] Big Endian support in Arrow (was:
> Re: [Java] Supporting Big Endian)
>
> hi Micah,
>
> I agree with your reasoning. If supporting BE in some languages (e.g.
> Java) is impractical due to performance regressions on LE platforms,
> then I don't think it's worth it. But if it can be handled at compile
> time or without runtime overhead, and tested / maintained properly on
> an ongoing basis, then it seems reasonable to me. It seems that the
> number of Arrow stakeholders will only increase from here so I would
> hope that there will be more people invested in helping maintain BE in
> the future.
>
> - Wes
>
> On Tue, Aug 25, 2020 at 11:33 PM Micah Kornfield
> <em...@gmail.com> wrote:
> >
> > I'm expanding the scope of this thread since it looks like work has
> > also started for making golang support BigEndian architectures.
> >
> > I think as a community we should come to a consensus on whether we
> > want to support Big Endian architectures in general. I don't think it
> > is a good outcome if some implementations accept PRs for Big Endian
> > fixes and some don't.
> >
> > But maybe this is OK with others?
> >
> > My current opinion on the matter is that we should support it under
> > the following conditions:
> >
> > 1. As long as there is CI in place to catch regressions (right now I
> > think the CI is fairly unreliable?)
> > 2. No degradation in performance for little-endian architectures
> > (verified by additional micro benchmarks)
> > 3. Not a large amount of invasive code to distinguish between
> > platforms.
> >
> > Kazuaki Ishizaki I asked question previously, but could you give some
> > data points around:
> > 1. The current state of C++ support (how much code needed to change)?
> > 2. How many more PRs you expect to need for Java (and approximate
> > size)?
> >
> > I think this would help myself and others in the decision making
> > process.
> >
> > Thanks,
> > Micah
> >
> > On Tue, Aug 18, 2020 at 9:15 AM Micah Kornfield <em...@gmail.com>
> > wrote:
> >
> > > My thoughts on the points raised so far:
> > >
> > > * Does supporting Big Endian increase the reach of Arrow by a lot?
> > >
> > > Probably not a significant amount, but it does provide one more
> > > avenue of adoption.
> > >
> > > * Does it increase code complexity?
> > >
> > > Yes. I agree this is a concern. The PR in question did not seem too
> > > bad to me but this is subjective. I think the remaining question is
> > > how many more places need to be fixed up in the code base and how
> > > invasive the changes are. In C++ IIUC it turned out to be a
> > > relatively small number of places.
> > >
> > > Kazuaki Ishizaki have you been able to get the Java implementation
> > > working fully locally? How many additional PRs will be needed and
> > > what do they look like (I think there are already a few more in the
> > > queue)?
> > >
> > > * Will it introduce performance regressions?
> > >
> > > If done properly I suspect no, but I think if we continue with
> > > BigEndian support the places that need to be touched should have
> > > benchmarks added to confirm this (including for PRs already merged).
> > >
> > > Thanks,
> > > Micah
> > >
> > > On Sun, Aug 16, 2020 at 7:37 PM Fan Liya <li...@gmail.com> wrote:
> > >
> > >> Thank Kazuaki Ishizaki for working on this.
> > >> IMO, supporting the big-endian should be a large change, as in many
> > >> places of the code base, we have implicitly assumed the little-endian
> > >> platform (e.g.
> > >> https://github.com/apache/arrow/blob/master/java/memory/memory-core/src/main/java/org/apache/arrow/memory/util/ByteFunctionHelpers.java
> > >> ).
> > >> Supporting the big-endian platform may introduce branches in such
> > >> places (or virtual calls) which will affect the performance.
> > >> So it would be helpful to evaluate the performance impact.
> > >>
> > >> Best,
> > >> Liya Fan
> > >>
> > >>
> > >> On Sat, Aug 15, 2020 at 7:54 AM Jacques Nadeau <ja...@apache.org>
> > >> wrote:
> > >>
> > >>> Hey Micah, thanks for starting the discussion.
> > >>>
> > >>> I just skimmed that thread and it isn't entirely clear that there was
> > >>> a conclusion that the overhead was worth it. I think everybody agrees
> > >>> that it would be nice to have the code work on both platforms. On the
> > >>> flipside, the code noise for a rare case makes the cost-benefit
> > >>> questionable.
> > >>>
> > >>> In the Java code, we wrote the code to explicitly disallow big endian
> > >>> platforms and put preconditions checks in. I definitely think if we
> > >>> want to support this, it should be done holistically across the code
> > >>> with an appropriate test plan (both functional and perf).
> > >>>
> > >>> To me, the question is really about how many use cases are blocked by
> > >>> this. I'm not sure I've heard anyone say that the limiting factor to
> > >>> leveraging Java Arrow was the block on endianness. Keep in mind that
> > >>> until very recently, using any Arrow Java code would throw a
> > >>> preconditions check before you could even get started on big-endian
> > >>> and I don't think we've seen a bunch of messages on that exception.
> > >>> Adding if conditions throughout the codebase like this patch: [1]
> > >>> isn't exactly awesome and it can also risk performance impacts
> > >>> depending on how carefully it is done.
> > >>>
> > >>> If there isn't a preponderance of evidence of many users being blocked
> > >>> by this capability, I don't think we should accept the code. We
> > >>> already have a backlog of items that we need to address just to ensure
> > >>> existing use cases work well. Expanding to new use cases that there is
> > >>> no clear demand for will likely just increase code development cost at
> > >>> little benefit.
> > >>>
> > >>> What do others think?
> > >>>
> > >>> [1] https://github.com/apache/arrow/pull/7923#issuecomment-674311119
> > >>>
> > >>> On Fri, Aug 14, 2020 at 4:36 PM Micah Kornfield <em...@gmail.com>
> > >>> wrote:
> > >>>
> > >>> > Kazuaki Ishizaki has started working on Big Endian support in Java
> > >>> > (including setting up CI for it). Thank you!
> > >>> >
> > >>> > We previously discussed support for Big Endian architectures in
> > >>> > C++ [1] and generally agreed that it was a reasonable thing to do.
> > >>> >
> > >>> > Similar to C++ I think as long as we have a working CI setup it is
> > >>> > reasonable for Java to support Big Endian machines.
> > >>> >
> > >>> > But I think there might be differing opinions so it is worth a
> > >>> > discussion to see if there are technical blockers or other reasons
> > >>> > for not supporting Big Endian architectures in the existing java
> > >>> > implementation.
> > >>> >
> > >>> > Thanks,
> > >>> > Micah
> > >>> >
> > >>> > [1]
> > >>> > https://lists.apache.org/thread.html/rcae745f1d848981bb5e8dddacfc4554641aba62e3c949b96bfd8b019%40%3Cdev.arrow.apache.org%3E
> > >>> >
> > >>>
> > >>
>
Re: [DISCUSS] Big Endian support in Arrow
Posted by Antoine Pitrou <an...@python.org>.
On 01/09/2020 at 23:26, Sutou Kouhei wrote:
>
> Note that we need qemu-user-static 2.12 or later and
> binfmt-support 2.1.7 or later. Ubuntu 18.10 or later ships
> them.
Ah, unfortunately I'm on Ubuntu 18.04.
Regards
Antoine.
Re: [DISCUSS] Big Endian support in Arrow
Posted by Sutou Kouhei <ko...@clear-code.com>.
Hi,
In <4c...@python.org>
"Re: [DISCUSS] Big Endian support in Arrow" on Tue, 1 Sep 2020 19:53:18 +0200,
Antoine Pitrou <an...@python.org> wrote:
> I don't know about debian-java, but when I try e.g. debian-cpp, it fails
> with the following error:
>
> $ ARCH=s390x docker-compose run --rm debian-cpp
> Building debian-cpp
> Step 1/7 : ARG arch=amd64
> Step 2/7 : FROM ${arch}/debian:10
> 10: Pulling from s390x/debian
> f2b199a6d9ad: Pull complete
> Digest:
> sha256:306d71cf97b70f08cd86f48c5dfdf2a4efbb901b1ac96f829bb4879a6ea94a2e
> Status: Downloaded newer image for s390x/debian:10
> ---> 3d4e9be33cfe
> Step 3/7 : ENV DEBIAN_FRONTEND noninteractive
> ---> Running in 2154fbbf1f11
> Removing intermediate container 2154fbbf1f11
> ---> 4b87c2ffacbf
> Step 4/7 : RUN echo "deb http://deb.debian.org/debian buster-backports
> main" > /etc/apt/sources.list.d/backports.list
> ---> Running in 313fbeae3132
> standard_init_linux.go:211: exec user process caused "exec format error"
> ERROR: Service 'debian-cpp' failed to build: The command '/bin/sh -c
> echo "deb http://deb.debian.org/debian buster-backports main" >
> /etc/apt/sources.list.d/backports.list' returned a non-zero code: 1
Did you run this command on Ubuntu?
We need to install qemu-user-static (and binfmt-support, which is a
recommended package of qemu-user-static) to run
this command.
Note that we need qemu-user-static 2.12 or later and
binfmt-support 2.1.7 or later. Ubuntu 18.10 or later ships
them.
Thanks,
--
kou
Re: [DISCUSS] Big Endian support in Arrow
Posted by Antoine Pitrou <an...@python.org>.
On 01/09/2020 at 04:47, Sutou Kouhei wrote:
>
> We can use "ARCH=s390x docker-compose run debian-java" (or
> "ARCH=s390x archery docker run debian-java") to debug on
> a local amd64 machine with qemu-user-static.
I don't know about debian-java, but when I try e.g. debian-cpp, it fails
with the following error:
$ ARCH=s390x docker-compose run --rm debian-cpp
Building debian-cpp
Step 1/7 : ARG arch=amd64
Step 2/7 : FROM ${arch}/debian:10
10: Pulling from s390x/debian
f2b199a6d9ad: Pull complete
Digest:
sha256:306d71cf97b70f08cd86f48c5dfdf2a4efbb901b1ac96f829bb4879a6ea94a2e
Status: Downloaded newer image for s390x/debian:10
---> 3d4e9be33cfe
Step 3/7 : ENV DEBIAN_FRONTEND noninteractive
---> Running in 2154fbbf1f11
Removing intermediate container 2154fbbf1f11
---> 4b87c2ffacbf
Step 4/7 : RUN echo "deb http://deb.debian.org/debian buster-backports
main" > /etc/apt/sources.list.d/backports.list
---> Running in 313fbeae3132
standard_init_linux.go:211: exec user process caused "exec format error"
ERROR: Service 'debian-cpp' failed to build: The command '/bin/sh -c
echo "deb http://deb.debian.org/debian buster-backports main" >
/etc/apt/sources.list.d/backports.list' returned a non-zero code: 1
Regards
Antoine.
Re: [DISCUSS] Big Endian support in Arrow
Posted by Sutou Kouhei <ko...@clear-code.com>.
Hi,
In <CA...@mail.gmail.com>
"Re: [DISCUSS] Big Endian support in Arrow (was: Re: [Java] Supporting Big Endian)" on Sun, 30 Aug 2020 22:11:46 -0700,
Jacques Nadeau <ja...@apache.org> wrote:
> I know I don't want to have to go and debug on a remote BE
> system if some tests start failing for that platform...
We can use "ARCH=s390x docker-compose run debian-java" (or
"ARCH=s390x archery docker run debian-java") to debug on
a local amd64 machine with qemu-user-static.
Thanks,
--
kou
Re: [DISCUSS] Big Endian support in Arrow (was: Re: [Java] Supporting Big Endian)
Posted by Kazuaki Ishizaki <IS...@jp.ibm.com>.
Hi Micah,
Thank you. Your proposal also sounds reasonable to me.
Best Regards,
Kazuaki Ishizaki
Fan Liya <li...@gmail.com> wrote on 2020/09/22 15:51:58:
> From: Fan Liya <li...@gmail.com>
> To: dev <de...@arrow.apache.org>, Micah Kornfield <em...@gmail.com>
> Date: 2020/09/22 15:52
> Subject: [EXTERNAL] Re: [DISCUSS] Big Endian support in Arrow (was:
> Re: [Java] Supporting Big Endian)
>
> Hi Micah,
>
> Thanks for your summary. Your proposal sounds reasonable to me.
>
> Best,
> Liya Fan
>
>
> On Tue, Sep 22, 2020 at 1:16 PM Micah Kornfield <em...@gmail.com>
> wrote:
>
> > I wanted to give this thread a bump, does the proposal I made below
> > sound reasonable?
> >
> > On Sun, Sep 13, 2020 at 9:57 PM Micah Kornfield <em...@gmail.com>
> > wrote:
> >
> > > If I read the responses so far it seems like the following might be a
> > > good compromise/summary:
> > >
> > > 1. It does not seem too invasive to support native endianness in
> > > implementation libraries. As long as there is appropriate performance
> > > testing and CI infrastructure to demonstrate the changes work.
> > > 2. It is up to implementation maintainers if they wish to accept PRs
> > > that handle byte swapping between different architectures. (Right now
> > > it sounds like C++ is potentially OK with it and for Java at least
> > > Jacques is opposed to it?
> > >
> > > Testing changes that break big-endian can be a potential drag on
> > > developer productivity but there are methods to run locally (at least
> > > on more recent OSes).
> > > Thoughts?
> > >
> > > Thanks,
> > > Micah
> > >
> > > On Mon, Aug 31, 2020 at 7:08 PM Fan Liya <li...@gmail.com>
> > > wrote:
> > >
> > >> Thank Kazuaki for the survey and thank Micah for starting the
> > >> discussion.
> > >>
> > >> I do not oppose supporting BE. In fact, I am in general optimistic
> > >> about the performance impact (for Java).
> > >> IMO, this is going to be a painful way (many byte order related
> > >> problems are tricky to debug), so I hope we can make it short.
> > >>
> > >> It is good that someone is willing to take this on, and I would
> > >> like to provide help if needed.
> > >>
> > >>
> > >> Best,
> > >> Liya Fan
> > >>
> > >>
> > >>
> > >> On Tue, Sep 1, 2020 at 7:25 AM Bryan Cutler <cu...@gmail.com>
> > >> wrote:
> > >>
> > >> > I also think this would be a worthwhile addition and help the
> > >> > project expand in more areas. Beyond the Apache Spark optimization
> > >> > use case, having Arrow interoperability with the Python data
> > >> > science stack on BE would be very useful. I have looked at the
> > >> > remaining PRs for Java and they seem pretty minimal and
> > >> > straightforward. Implementing the equivalent record batch swapping
> > >> > as done in C++ at [1] would be a little more involved, but still
> > >> > reasonable. Would it make sense to create a branch to apply all
> > >> > remaining changes with CI to get a better picture before deciding
> > >> > on bringing into master branch? I could help out with shepherding
> > >> > this effort and assist in maintenance, if we decide to accept.
> > >> >
> > >> > Bryan
> > >> >
> > >> > [1] https://github.com/apache/arrow/pull/7507
> > >> >
> > >> > > On Mon, Aug 31, 2020 at 1:42 PM Wes McKinney <we...@gmail.com>
> > >> > > wrote:
> > >> > >
> > >> > > I think it's well within the right of an implementation to
> > >> > > reject BE data (or non-native-endian), but if an implementation
> > >> > > chooses to implement and maintain the endianness conversions,
> > >> > > then it does not seem so bad to me.
> > >> > >
> > >> > > On Mon, Aug 31, 2020 at 3:33 PM Jacques Nadeau <ja...@apache.org>
> > >> > > wrote:
> > >> > > >
> > >> > > > And yes, for those of you looking closely, I commented on
> > >> > > > ARROW-245 when it was committed. I just forgot about it.
> > >> > > >
> > >> > > > It looks like I had mostly the same concerns then that I do
> > >> > > > now :) Now I'm just more worried about format sprawl...
> > >> > > >
> > >> > > > On Mon, Aug 31, 2020 at 1:30 PM Jacques Nadeau
> > >> > > > <jacques@apache.org> wrote:
> > >> > > >
> > >> > > > >> What do you mean? The Endianness field (a Big|Little enum)
> > >> > > > >> was added 4 years ago:
> > >> > > > >> https://issues.apache.org/jira/browse/ARROW-245
> > >> > > > >
> > >> > > > >
> > >> > > > > I didn't realize that was done, my bad. Good example of
> > >> > > > > format rot from my pov.
> > >> > > > >
> > >> > > > >
> > >> > > > >
> > >> > >
> > >> >
> > >>
> > >
> >
Re: [DISCUSS] Big Endian support in Arrow (was: Re: [Java] Supporting Big Endian)
Posted by Micah Kornfield <em...@gmail.com>.
In case anyone wants to comment further, I've opened
https://github.com/apache/arrow/pull/8374 to
canonicalize the details.
On Mon, Sep 28, 2020 at 9:08 PM Micah Kornfield <em...@gmail.com>
wrote:
> OK, I will try to update documentation reflecting this in the next few
> days (in particular it would be good to document which implementations are
> willing to support byte flipping).
>
> On Tue, Sep 22, 2020 at 3:30 AM Antoine Pitrou <an...@python.org> wrote:
>
>>
>>
>> On 22/09/2020 at 06:36, Micah Kornfield wrote:
>> > I wanted to give this thread a bump, does the proposal I made below
>> > sound reasonable?
>>
>> It does!
>>
>> Regards
>>
>> Antoine.
Re: [DISCUSS] Big Endian support in Arrow (was: Re: [Java] Supporting Big Endian)
Posted by Micah Kornfield <em...@gmail.com>.
OK, I will try to update documentation reflecting this in the next few days
(in particular it would be good to document which implementations are
willing to support byte flipping).
On Tue, Sep 22, 2020 at 3:30 AM Antoine Pitrou <an...@python.org> wrote:
>
>
> On 22/09/2020 at 06:36, Micah Kornfield wrote:
> > I wanted to give this thread a bump, does the proposal I made below sound
> > reasonable?
>
> It does!
>
> Regards
>
> Antoine.
Re: [DISCUSS] Big Endian support in Arrow (was: Re: [Java] Supporting Big Endian)
Posted by Antoine Pitrou <an...@python.org>.
On 22/09/2020 at 06:36, Micah Kornfield wrote:
> I wanted to give this thread a bump, does the proposal I made below sound
> reasonable?
It does!
Regards
Antoine.
Re: [DISCUSS] Big Endian support in Arrow (was: Re: [Java] Supporting
Big Endian)
Posted by Fan Liya <li...@gmail.com>.
Hi Micah,
Thanks for your summary. Your proposal sounds reasonable to me.
Best,
Liya Fan
On Tue, Sep 22, 2020 at 1:16 PM Micah Kornfield <em...@gmail.com>
wrote:
> I wanted to give this thread a bump, does the proposal I made below sound
> reasonable?
>
Re: [DISCUSS] Big Endian support in Arrow (was: Re: [Java] Supporting
Big Endian)
Posted by Micah Kornfield <em...@gmail.com>.
I wanted to give this thread a bump, does the proposal I made below sound
reasonable?
Re: [DISCUSS] Big Endian support in Arrow (was: Re: [Java] Supporting
Big Endian)
Posted by Micah Kornfield <em...@gmail.com>.
If I read the responses so far it seems like the following might be a good
compromise/summary:
1. It does not seem too invasive to support native endianness in
implementation libraries. As long as there is appropriate performance
testing and CI infrastructure to demonstrate the changes work.
2. It is up to implementation maintainers if they wish to accept PRs that
handle byte swapping between different architectures. (Right now it sounds
like C++ is potentially OK with it and for Java at least Jacques is opposed
to it?)
Testing changes that break big-endian can be a potential drag on developer
productivity but there are methods to run locally (at least on more recent
OSes).
Thoughts?
Thanks,
Micah
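[Editorial note: as a concrete illustration of what "byte swapping between different architectures" in point 2 means at the buffer level, here is a minimal, self-contained Java sketch of a one-time in-place swap over a buffer of 32-bit values. The class and method names are hypothetical, not Arrow's actual API.]

```java
import java.nio.ByteBuffer;

public class Int32Swap {
    // Reverse the byte order of one 32-bit value.
    static int swapInt(int v) {
        return Integer.reverseBytes(v);
    }

    // One-time, in-place byte swap over a buffer of 4-byte values -- the kind
    // of single pass a receiver could run on foreign-endian data, after which
    // all subsequent loads use the native order.
    static void swapInt32Buffer(ByteBuffer buf) {
        for (int i = 0; i + 4 <= buf.limit(); i += 4) {
            buf.putInt(i, swapInt(buf.getInt(i)));
        }
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(8);
        buf.putInt(0, 1).putInt(4, 2);
        swapInt32Buffer(buf); // bytes reversed within each 4-byte lane
        swapInt32Buffer(buf); // swapping twice restores the original values
        System.out.println(buf.getInt(0) + " " + buf.getInt(4)); // prints "1 2"
    }
}
```

Because the swap runs once on receipt, the steady-state read path stays branch-free, which is the basis for the "no degradation on little-endian" condition in point 1.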
Re: [DISCUSS] Big Endian support in Arrow (was: Re: [Java] Supporting
Big Endian)
Posted by Fan Liya <li...@gmail.com>.
Thank Kazuaki for the survey and thank Micah for starting the discussion.
I do not oppose supporting BE. In fact, I am in general optimistic about
the performance impact (for Java).
IMO, this is going to be a painful way (many byte order related problems
are tricky to debug), so I hope we can make it short.
It is good that someone is willing to take this on, and I would like to
provide help if needed.
Best,
Liya Fan
Re: [DISCUSS] Big Endian support in Arrow (was: Re: [Java] Supporting
Big Endian)
Posted by Bryan Cutler <cu...@gmail.com>.
I also think this would be a worthwhile addition and help the project
expand in more areas. Beyond the Apache Spark optimization use case, having
Arrow interoperability with the Python data science stack on BE would be
very useful. I have looked at the remaining PRs for Java and they seem
pretty minimal and straightforward. Implementing the equivalent record
batch swapping as done in C++ at [1] would be a little more involved, but
still reasonable. Would it make sense to create a branch to apply all
remaining changes with CI to get a better picture before deciding on
bringing into master branch? I could help out with shepherding this effort
and assist in maintenance, if we decide to accept.
Bryan
[1] https://github.com/apache/arrow/pull/7507
Re: [DISCUSS] Big Endian support in Arrow (was: Re: [Java] Supporting
Big Endian)
Posted by Wes McKinney <we...@gmail.com>.
I think it's well within the right of an implementation to reject BE
data (or non-native-endian), but if an implementation chooses to
implement and maintain the endianness conversions, then it does not
seem so bad to me.
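[Editorial note: the first option Wes mentions -- rejecting non-native-endian data outright -- is roughly the precondition-style check sketched below. The names are hypothetical; the Java library's actual checks differ.]

```java
import java.nio.ByteOrder;

public class EndiannessGuard {
    // Hypothetical mirror of the schema's endianness field (a Big|Little enum).
    enum Endianness { LITTLE, BIG }

    // Reject a batch whose declared endianness differs from the host's,
    // instead of converting it -- the behavior Wes describes as within an
    // implementation's rights.
    static void checkNativeEndian(Endianness batchEndianness) {
        boolean batchIsLittle = batchEndianness == Endianness.LITTLE;
        boolean hostIsLittle = ByteOrder.nativeOrder().equals(ByteOrder.LITTLE_ENDIAN);
        if (batchIsLittle != hostIsLittle) {
            throw new IllegalArgumentException(
                    "record batch endianness does not match host; conversion unsupported");
        }
    }

    public static void main(String[] args) {
        Endianness host = ByteOrder.nativeOrder().equals(ByteOrder.LITTLE_ENDIAN)
                ? Endianness.LITTLE : Endianness.BIG;
        checkNativeEndian(host); // native-endian data always passes
        System.out.println("native-endian batch accepted");
    }
}
```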
Re: [DISCUSS] Big Endian support in Arrow (was: Re: [Java] Supporting
Big Endian)
Posted by Jacques Nadeau <ja...@apache.org>.
And yes, for those of you looking closely, I commented on ARROW-245 when it
was committed. I just forgot about it.
It looks like I had mostly the same concerns then that I do now :) Now I'm
just more worried about format sprawl...
Re: [DISCUSS] Big Endian support in Arrow (was: Re: [Java] Supporting
Big Endian)
Posted by Jacques Nadeau <ja...@apache.org>.
>
> What do you mean? The Endianness field (a Big|Little enum) was added 4
> years ago:
> https://issues.apache.org/jira/browse/ARROW-245
I didn't realize that was done, my bad. Good example of format rot from my
pov.
Re: [DISCUSS] Big Endian support in Arrow (was: Re: [Java] Supporting
Big Endian)
Posted by Antoine Pitrou <an...@python.org>.
On 31/08/2020 at 07:11, Jacques Nadeau wrote:
> I didn't realize that Ishizaki isn't just proposing a BE platform support,
> he is proposing a new BE version of the format.
What do you mean? The Endianness field (a Big|Little enum) was added 4
years ago:
https://issues.apache.org/jira/browse/ARROW-245
The format doc has the following wording:
"""
Serialized Schema metadata has an endianness field indicating endianness
of RecordBatches. Typically this is the endianness of the system where
the RecordBatch was generated. [...] The reference implementation is
focused on Little Endian and provides tests for it. Eventually we may
provide automatic conversion via byte swapping.
"""
So here is how I read it:
* the format allows conveying big-endian as well as little-endian data
(since it's "typically the endianness of the system where the
RecordBatch was generated")
* the reference implementations may or may not choose to support reading
foreign endian data by byte swapping on reception
Regards
Antoine.
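[Editorial note: a minimal Java sketch of the semantics Antoine describes -- the schema carries the batch's endianness, and a reader compares it against the host's native order to decide whether byte swapping is needed. The `Endianness` enum and `needsSwap` helper are illustrative assumptions, not Arrow's actual API.]

```java
import java.nio.ByteOrder;

public class EndianCheck {
    // Hypothetical mirror of the schema's endianness field (a Big|Little enum).
    enum Endianness { LITTLE, BIG }

    // A reader compares the batch's declared endianness with the host's
    // native byte order; only a mismatch requires a byte-swapping pass.
    static boolean needsSwap(Endianness batchEndianness) {
        ByteOrder batchOrder = (batchEndianness == Endianness.LITTLE)
                ? ByteOrder.LITTLE_ENDIAN
                : ByteOrder.BIG_ENDIAN;
        return !batchOrder.equals(ByteOrder.nativeOrder());
    }

    public static void main(String[] args) {
        // Exactly one of the two declared orders mismatches any given host.
        System.out.println("swap little-endian batch: " + needsSwap(Endianness.LITTLE));
        System.out.println("swap big-endian batch: " + needsSwap(Endianness.BIG));
    }
}
```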
>
> In this situation computers
> speaking Arrow potentially have to convert from one version to the other
> version. For example two machines communicating with Arrow flight now have
> to do order swaps if they are speaking different endianess. This makes me
> feel even less positive about the proposal.
>
> In terms of my main concern: it was around demand versus cost. I don't
> believe that has been addressed. There have been ways discussed to try to
> minimize cost but that cost is non-zero whereas the demand feels very close
> to zero. As far as I can tell, there was one ask for big-endian support in
> August of 2016 (Sanjay) and none since [1]. Even in these threads I don't
> hear a resounding demand for big endian. There are many features that are
> far less invasive/expensive to support that have much more demand.
>
> This feels like something that could easily rot and/or just become a load
> on the core maintainers of the project. I know I don't want to have to go
> and debug on a remote BE system if some tests starts failing for that
> platform...
>
> [1] Only a skim of archives, there may have been others:
> https://markmail.org/thread/7gbf5est6mcsnc6l
>
>
>
>
> On Sun, Aug 30, 2020 at 8:39 PM Micah Kornfield <em...@gmail.com>
> wrote:
>
>> Looking over the outstanding PRs, while the code isn't necessarily pretty, I
>> don't think they are too invasive.
>>
>> Also it seems that Kazuaki Ishizaki is willing to add benchmarks where
>> necessary to verify the lack of performance regressions. (Please correct
>> me if I misunderstood).
>>
>> Jacques and Liya Fan, does this address your concerns? Are there further
>> details that you would like to discuss? Are you still opposed to support
>> in Java?
>>
>> Do maintainers of other implementations have concerns (in particular Go
>> seems to be the other language in progress)?
>>
>> Thanks,
>> Micah
>>
>> On Wed, Aug 26, 2020 at 6:57 AM Kazuaki Ishizaki <IS...@jp.ibm.com>
>> wrote:
>>
>>> Hi,
>>> I waited for comments regarding Java Big-Endian (BE) support during my
>>> one-week vacation. Thank you for good suggestions and comments.
>>> I already responded to some questions in another mail. This mail
>> addresses
>>> the remaining questions: Use cases, holistic strategy for BE support, and
>>> testing plans
>>>
>>> 1. Use cases
>>> The main use case of Arrow Java is Apache Spark, as already described on
>>> the Arrow blog [1]. This is the typical way to accelerate Spark when it is
>>> used with other languages such as Python [2] and R [3]. In Databricks
>>> notebooks, 68% of commands come from Python [4].
>>>
>>> 2. Holistic strategy of BE support across languages
>>> I mostly completed BE support in C++. This implementation uses the
>>> following strategy:
>>> A. Write and read data in a record batch using platform-native endian
>> (NE)
>>> when the data is created on a host. The endianness is stored in an endian
>>> field in the schema.
>>> B. Send data using the IPC-host endian among processes using IPC.
>>> C. At B, if the IPC-client endian is the same as the received data
>>> endian, the IPC client receives the data without any copy or swap.
>>> D. At B, if the IPC-client endian is different from the received data
>>> endian, the IPC client byte-swaps the received data to match the
>>> IPC-client endian by default.
>>> E. The primitive data types in memory (e.g. Decimal128 in C++ and
>>> UnsafeDirectLittleEndian in Java) are read/written using the NE.
>>>
>>> A and B-C are the typical use cases in Apache Arrow, so no endian swap
>>> occurs in these cases and there is no performance overhead. B-D is rarely
>>> used (e.g. sending data from x86_64 to s390x); there, the data swap occurs
>>> only once, at receive time, and no further swaps occur afterwards. For some
>>> use cases, this swap can be disabled by an option; in these cases, Arrow
>>> will not process the data.
>>> E allows us to access primitive data (e.g. int32, double, decimal128)
>>> without performance loss by using platform-native endian loads/stores.
>>>
>>> 2-1. Implementation strategy in Java Language
>>> The existing primitive data structures such as UnsafeDirectLittleEndian,
>>> ArrowBuf, and ValueVector should handle platform-native endian for the
>>> strategies A, B-C, and E without performance overhead.
>>> In the remaining strategy D, the method
>>> MessageSerializer.deserializeRecordBatch() will handle the data swap when the
>>> endian of the host is different from that of the client, which
>> corresponds
>>> to the PR [6] in C++.
>>>
>>> 3. Testing plan
>>> For testing strategies A, B-C, and E, it would be good to increase
>>> the test coverage regardless of endianness (e.g. increase the types of a
>>> schema to be tested in flight-core).
>>> For testing strategy D, I already prepared data for BE and LE. When a
>>> PR enables the data swap, it will also enable the integration test.
>>> For performance testing, we can use the existing framework [7] by
>>> extending the support for other languages. We can run performance
>>> benchmarks on a little-endian platform to avoid performance regression.
>>>
>>> [1] https://arrow.apache.org/blog/2017/07/26/spark-arrow/
>>> [2]
>>>
>> https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
>>> [3]
>>>
>> https://databricks.com/jp/blog/2020/06/01/vectorized-r-i-o-in-upcoming-apache-spark-3-0.html
>>> [4] https://databricks.com/jp/session_na20/wednesday-morning-keynotes
>>> [5] https://github.com/apache/arrow/pull/7507#discussion_r46819873
>>> [6] https://github.com/apache/arrow/pull/7507
>>> [7] https://github.com/apache/arrow/pull/7940#issuecomment-672690540
>>>
>>> Best Regards,
>>> Kazuaki Ishizaki
>>>
>>> Wes McKinney <we...@gmail.com> wrote on 2020/08/26 21:27:49:
>>>
>>>> From: Wes McKinney <we...@gmail.com>
>>>> To: dev <de...@arrow.apache.org>, Micah Kornfield <emkornfield@gmail.com
>>>
>>>> Cc: Fan Liya <li...@gmail.com>
>>>> Date: 2020/08/26 21:28
>>>> Subject: [EXTERNAL] Re: [DISCUSS] Big Endian support in Arrow (was:
>>>> Re: [Java] Supporting Big Endian)
>>>>
>>>> hi Micah,
>>>>
>>>> I agree with your reasoning. If supporting BE in some languages (e.g.
>>>> Java) is impractical due to performance regressions on LE platforms,
>>>> then I don't think it's worth it. But if it can be handled at compile
>>>> time or without runtime overhead, and tested / maintained properly on
>>>> an ongoing basis, then it seems reasonable to me. It seems that the
>>>> number of Arrow stakeholders will only increase from here so I would
>>>> hope that there will be more people invested in helping maintain BE in
>>>> the future.
>>>>
>>>> - Wes
>>>>
>>>> On Tue, Aug 25, 2020 at 11:33 PM Micah Kornfield
>>>> <em...@gmail.com> wrote:
>>>>>
>>>>> I'm expanding the scope of this thread since it looks like work has
>>> also
>>>>> started for making golang support BigEndian architectures.
>>>>>
>>>>> I think as a community we should come to a consensus on whether we
>>> want to
>>>>> support Big Endian architectures in general. I don't think it is a
>>> good
>>>>> outcome if some implementations accept PRs for Big Endian fixes and
>>> some
>>>>> don't.
>>>>>
>>>>> But maybe this is OK with others?
>>>>>
>>>>> My current opinion on the matter is that we should support it under
>> the
>>>>> following conditions:
>>>>>
>>>>> 1. As long as there is CI in place to catch regressions (right now I
>>> think
>>>>> the CI is fairly unreliable?)
>>>>> 2. No degradation in performance for little-endian architectures
>>> (verified
>>>>> by additional micro benchmarks)
>>>>> 3. Not a large amount of invasive code to distinguish between
>>> platforms.
>>>>>
>>>>> Kazuaki Ishizaki, I asked a question previously, but could you give some
>>> data
>>>>> points around:
>>>>> 1. The current state of C++ support (how much code needed to
>> change)?
>>>>> 2. How many more PRs you expect to need for Java (and approximate
>>> size)?
>>>>>
>>>>> I think this would help myself and others in the decision making
>>> process.
>>>>>
>>>>> Thanks,
>>>>> Micah
>>>>>
>>>>> On Tue, Aug 18, 2020 at 9:15 AM Micah Kornfield <
>> emkornfield@gmail.com
>>>>
>>>>> wrote:
>>>>>
>>>>>> My thoughts on the points raised so far:
>>>>>>
>>>>>> * Does supporting Big Endian increase the reach of Arrow by a lot?
>>>>>>
>>>>>> Probably not a significant amount, but it does provide one more
>>> avenue of
>>>>>> adoption.
>>>>>>
>>>>>> * Does it increase code complexity?
>>>>>>
>>>>>> Yes. I agree this is a concern. The PR in question did not seem
>>> too bad
>>>>>> to me but this is subjective. I think the remaining question is
>> how
>>> many
>>>>>> more places need to be fixed up in the code base and how invasive
>>> are the
>>>>>> changes. In C++ IIUC it turned out to be a relatively small number
>>> of
>>>>>> places.
>>>>>>
>>>>>> Kazuaki Ishizaki, have you been able to get the Java implementation
>>> working
>>>>>> fully locally? How many additional PRs will be needed and what do
>>>>>> they look like (I think there already a few more in the queue)?
>>>>>>
>>>>>> * Will it introduce performance regressions?
>>>>>>
>>>>>> If done properly I suspect no, but I think if we continue with
>>> BigEndian
>>>>>> support the places that need to be touched should have benchmarks
>>> added to
>>>>>> confirm this (including for PRs already merged).
>>>>>>
>>>>>> Thanks,
>>>>>> Micah
>>>>>>
>>>>>> On Sun, Aug 16, 2020 at 7:37 PM Fan Liya <li...@gmail.com>
>>> wrote:
>>>>>>
>>>>>>> Thank Kazuaki Ishizaki for working on this.
>>>>>>> IMO, supporting the big-endian should be a large change, as in
>> many
>>>>>>> places of the code base, we have implicitly assumed the
>>> little-endian
>>>>>>> platform (e.g.
>>>>>>> https://github.com/apache/arrow/blob/master/java/memory/memory-core/src/main/java/org/apache/arrow/memory/util/ByteFunctionHelpers.java
>>>>>>> ).
>>>>>>> Supporting the big-endian platform may introduce branches in such
>>> places
>>>>>>> (or virtual calls) which will affect the performance.
>>>>>>> So it would be helpful to evaluate the performance impact.
>>>>>>>
>>>>>>> Best,
>>>>>>> Liya Fan
>>>>>>>
>>>>>>>
>>>>>>> On Sat, Aug 15, 2020 at 7:54 AM Jacques Nadeau <
>> jacques@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hey Micah, thanks for starting the discussion.
>>>>>>>>
>>>>>>>> I just skimmed that thread and it isn't entirely clear that there
>>> was a
>>>>>>>> conclusion that the overhead was worth it. I think everybody
>>> agrees that
>>>>>>>> it
>>>>>>>> would be nice to have the code work on both platforms. On the
>>> flipside,
>>>>>>>> the
>>>>>>>> code noise for a rare case makes the cost-benefit questionable.
>>>>>>>>
>>>>>>>> In the Java code, we wrote the code to explicitly disallow big
>>> endian
>>>>>>>> platforms and put preconditions checks in. I definitely think if
>>> we want
>>>>>>>> to
>>>>>>>> support this, it should be done holistically across the code with
>>>>>>>> appropriate test plan (both functional and perf).
>>>>>>>>
>>>>>>>> To me, the question is really about how many use cases are
>> blocked
>>> by
>>>>>>>> this.
>>>>>>>> I'm not sure I've heard anyone say that the limiting factor
>>> to leveraging
>>>>>>>> Java Arrow was the block on endianness. Keep in mind that until
>> very
>>>>>>>> recently, using any Arrow Java code would throw a preconditions
>>> check
>>>>>>>> before you could even get started on big-endian and I don't think
>>> we've
>>>>>>>> seen a bunch of messages on that exception. Adding if conditions
>>>>>>>> throughout
>>>>>>>> the codebase like this patch: [1] isn't exactly awesome and it
>> can
>>> also
>>>>>>>> risk performance impacts depending on how carefully it is done.
>>>>>>>>
>>>>>>>> If there isn't a preponderance of evidence of many users
>>> being blocked by
>>>>>>>> this capability, I don't think we should accept the code. We
>>> already
>>>>>>>> have a
>>>>>>>> backlog of items that we need to address just ensure existing use
>>> cases
>>>>>>>> work well. Expanding to new use cases that there is no clear
>>> demand for
>>>>>>>> will likely just increase code development cost at little
>> benefit.
>>>>>>>>
>>>>>>>> What do others think?
>>>>>>>>
>>>>>>>> [1] https://github.com/apache/arrow/pull/7923#issuecomment-674311119
>>>>>>>>
>>>>>>>> On Fri, Aug 14, 2020 at 4:36 PM Micah Kornfield <
>>> emkornfield@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Kazuaki Ishizaki has started working on Big Endian support in
>> Java
>>>>>>>>> (including setting up CI for it). Thank you!
>>>>>>>>>
>>>>>>>>> We previously discussed support for Big Endian architectures in
>>> C++
>>>>>>>> [1] and
>>>>>>>>> generally agreed that it was a reasonable thing to do.
>>>>>>>>>
>>>>>>>>> Similar to C++ I think as long as we have a working CI setup it
>>> is
>>>>>>>>> reasonable for Java to support Big Endian machines.
>>>>>>>>>
>>>>>>>>> But I think there might be differing opinions so it is worth a
>>>>>>>> discussion
>>>>>>>>> to see if there are technical blockers or other reasons for not
>>>>>>>> supporting
>>>>>>>>> Big Endian architectures in the existing java implementation.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Micah
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>>
>>>>>>>>>
>>>>>>>> https://lists.apache.org/thread.html/rcae745f1d848981bb5e8dddacfc4554641aba62e3c949b96bfd8b019%40%3Cdev.arrow.apache.org%3E
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>
>>>
>>>
>>
>
Re: [DISCUSS] Big Endian support in Arrow (was: Re: [Java] Supporting
Big Endian)
Posted by Jacques Nadeau <ja...@apache.org>.
I didn't realize that Ishizaki isn't just proposing BE platform support;
he is proposing a new BE version of the format. In this situation, computers
speaking Arrow potentially have to convert from one version to the other.
For example, two machines communicating over Arrow Flight now have to do
byte-order swaps if they are speaking different endiannesses. This makes me
feel even less positive about the proposal.
In terms of my main concern: it was around demand versus cost. I don't
believe that has been addressed. There have been ways discussed to try to
minimize cost but that cost is non-zero whereas the demand feels very close
to zero. As far as I can tell, there was one ask for big-endian support in
August of 2016 (Sanjay) and none since [1]. Even in these threads I don't
hear a resounding demand for big endian. There are many features that are
far less invasive/expensive to support that have much more demand.
This feels like something that could easily rot and/or just become a load
on the core maintainers of the project. I know I don't want to have to go
and debug on a remote BE system if some tests start failing for that
platform...
[1] Only a skim of archives, there may have been others:
https://markmail.org/thread/7gbf5est6mcsnc6l
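The order swap described here is a per-value byte reversal. A minimal Java
sketch (illustrative JDK code only, not Arrow code) of the operation every
fixed-width value would incur when two peers disagree on endianness:

```java
import java.nio.ByteOrder;

public class EndianSwapCost {
    public static void main(String[] args) {
        // Arrow IPC carries the producer's endianness in its schema; a consumer
        // on the opposite endianness must reverse the bytes of every value.
        System.out.println("native order: " + ByteOrder.nativeOrder());

        int value = 0x01020304;
        int swapped = Integer.reverseBytes(value); // one reversal per 32-bit value
        System.out.printf("0x%08X -> 0x%08X%n", value, swapped); // 0x01020304 -> 0x04030201
    }
}
```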
Re: [DISCUSS] Big Endian support in Arrow (was: Re: [Java] Supporting
Big Endian)
Posted by Micah Kornfield <em...@gmail.com>.
Looking over the outstanding PRs: while the code isn't necessarily pretty, I
don't think they are too invasive.
Also, it seems that Kazuaki Ishizaki is willing to add benchmarks where
necessary to verify the lack of performance regressions. (Please correct
me if I misunderstood.)
Jacques and Liya Fan, does this address your concerns? Are there further
details that you would like to discuss? Are you still opposed to supporting
it in Java?
Do maintainers of other implementations have concerns (in particular Go
seems to be the other language in progress)?
Thanks,
Micah
RE: [DISCUSS] Big Endian support in Arrow (was: Re: [Java] Supporting Big
Endian)
Posted by Kazuaki Ishizaki <IS...@jp.ibm.com>.
Hi,
I waited for comments regarding Java Big-Endian (BE) support during my
one-week vacation. Thank you for good suggestions and comments.
I already responded to some questions in another mail. This mail addresses
the remaining questions: Use cases, holistic strategy for BE support, and
testing plans.
1. Use cases
The main use case of Arrow Java is in Apache Spark, which was already
described on the Arrow blog [1]. Arrow is used to accelerate data exchange
between Spark and other languages such as Python [2] and R [3]. In
Databricks notebooks, 68% of commands come from Python [4].
2. Holistic strategy of BE support across languages
I mostly completed BE support in C++. This implementation uses the
following strategy:
A. Write and read data in a record batch using platform-native endian (NE)
when the data is created on a host. The endianness is stored in an endian
field in the schema.
B. Send data using the IPC-host endian among processes using IPC.
C. At B, if the IPC-client endian is the same as the received-data endian,
the IPC client receives the data without a data copy.
D. At B, if the IPC-client endian is different from the received-data
endian, the IPC client by default swaps the endianness of the received
data to match its own.
E. The primitive data types in memory (e.g. Decimal128 in C++ and
UnsafeDirectLittleEndian in Java) are read/written using the NE.
A and B-C are the typical use cases in Apache Arrow, so no endian swap
occurs in these cases and there is no performance overhead. B-D is rarely
used (e.g. sending data from x86_64 to s390x); there, the data swap occurs
only once, at receive time, and no further swaps occur afterward. For some
use cases, this swap can be disabled with an option; in that case, Arrow
will not process the data.
E allows us to access primitive data (e.g. int32, double, decimal128)
without performance loss by using platform-native endian loads/stores.
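Strategies A and E can be sketched with plain JDK buffers (illustrative
only, not Arrow's actual ArrowBuf/ValueVector classes): values are written
and read in the platform-native order, so same-endian producers and
consumers never pay for a swap.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class NativeEndianAccess {
    public static void main(String[] args) {
        // Strategy A: data is created in the platform-native byte order.
        ByteBuffer buf = ByteBuffer.allocateDirect(8).order(ByteOrder.nativeOrder());
        buf.putLong(0, 123456789L);  // stored in native order
        // Strategy E: loaded back with native-order loads, no byte swap needed.
        long value = buf.getLong(0);
        System.out.println(value);   // 123456789
    }
}
```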
2-1. Implementation strategy in Java Language
The existing primitive data structures such as UnsafeDirectLittleEndian,
ArrowBuf, and ValueVector should handle platform-native endian for the
strategies A, B-C, and E without performance overhead.
For the remaining strategy D, the method
MessageSerializer.deserializeRecordBatch() will handle the data swap when
the endianness of the host differs from that of the client, corresponding
to the C++ PR [6].
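A hypothetical sketch of the strategy-D swap (all names here are
illustrative; the real logic would live in
MessageSerializer.deserializeRecordBatch() and operate on ArrowBuf):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class RecordBatchSwapSketch {
    // If the schema's declared endianness differs from the native order,
    // reverse the bytes of each fixed-width value once at receive time.
    public static void swapInt32BufferIfNeeded(ByteBuffer data, ByteOrder declared) {
        if (declared.equals(ByteOrder.nativeOrder())) {
            return; // same endianness: zero-copy receive, no swap (strategies B-C)
        }
        for (int i = 0; i + 4 <= data.limit(); i += 4) {
            data.putInt(i, Integer.reverseBytes(data.getInt(i)));
        }
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(8).order(ByteOrder.nativeOrder());
        buf.putInt(0, 1).putInt(4, 2);
        ByteOrder foreign = ByteOrder.nativeOrder() == ByteOrder.BIG_ENDIAN
                ? ByteOrder.LITTLE_ENDIAN : ByteOrder.BIG_ENDIAN;
        swapInt32BufferIfNeeded(buf, foreign); // declared order differs -> swap once
        System.out.println(buf.getInt(0));     // 16777216 (0x01000000)
    }
}
```

After this one-time swap, all later reads use native-order loads, which is
what keeps the steady-state path free of per-value branches.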
3. Testing plan
For testing strategies A, B-C, and E, it would be good to increase the
test coverage regardless of endianness (e.g. increase the types in a
schema tested in flight-core).
For testing strategy D, I have already prepared BE and LE test data. The
PR that enables the data swap will also enable the integration test.
For performance testing, we can use the existing framework [7], extending
its support to other languages. We can run the performance benchmarks on a
little-endian platform to guard against regressions.
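For instance, one property such cross-endian tests could assert (sketched
here with plain java.nio rather than Arrow's vector classes, and with
invented helper names) is that the same logical values decode identically
from LE-encoded and BE-encoded buffers:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Illustrative endianness round-trip check, not Arrow's integration suite.
public class EndianRoundTripTest {
    // Encode int32 values into a byte array with the given byte order.
    static byte[] encode(int[] values, ByteOrder order) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * 4).order(order);
        for (int v : values) buf.putInt(v);
        return buf.array();
    }

    // Decode the byte array back using the same declared byte order.
    static int[] decode(byte[] bytes, ByteOrder order) {
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(order);
        int[] out = new int[bytes.length / 4];
        for (int i = 0; i < out.length; i++) out[i] = buf.getInt(i * 4);
        return out;
    }

    public static void main(String[] args) {
        int[] values = {1, -2, 300000};
        int[] fromLe = decode(encode(values, ByteOrder.LITTLE_ENDIAN),
                              ByteOrder.LITTLE_ENDIAN);
        int[] fromBe = decode(encode(values, ByteOrder.BIG_ENDIAN),
                              ByteOrder.BIG_ENDIAN);
        // The logical values must survive either wire endianness.
        System.out.println(Arrays.equals(fromLe, fromBe)); // prints "true"
    }
}
```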
[1] https://arrow.apache.org/blog/2017/07/26/spark-arrow/
[2]
https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
[3]
https://databricks.com/jp/blog/2020/06/01/vectorized-r-i-o-in-upcoming-apache-spark-3-0.html
[4] https://databricks.com/jp/session_na20/wednesday-morning-keynotes
[5] https://github.com/apache/arrow/pull/7507#discussion_r46819873
[6] https://github.com/apache/arrow/pull/7507
[7] https://github.com/apache/arrow/pull/7940#issuecomment-672690540
Best Regards,
Kazuaki Ishizaki
Wes McKinney <we...@gmail.com> wrote on 2020/08/26 21:27:49:
> From: Wes McKinney <we...@gmail.com>
> To: dev <de...@arrow.apache.org>, Micah Kornfield <em...@gmail.com>
> Cc: Fan Liya <li...@gmail.com>
> Date: 2020/08/26 21:28
> Subject: [EXTERNAL] Re: [DISCUSS] Big Endian support in Arrow (was:
> Re: [Java] Supporting Big Endian)
>
> hi Micah,
>
> I agree with your reasoning. If supporting BE in some languages (e.g.
> Java) is impractical due to performance regressions on LE platforms,
> then I don't think it's worth it. But if it can be handled at compile
> time or without runtime overhead, and tested / maintained properly on
> an ongoing basis, then it seems reasonable to me. It seems that the
> number of Arrow stakeholders will only increase from here so I would
> hope that there will be more people invested in helping maintain BE in
> the future.
>
> - Wes
>
> On Tue, Aug 25, 2020 at 11:33 PM Micah Kornfield
> <em...@gmail.com> wrote:
> >
> > I'm expanding the scope of this thread since it looks like work has
also
> > started for making golang support BigEndian architectures.
> >
> > I think as a community we should come to a consensus on whether we
want to
> > support Big Endian architectures in general. I don't think it is a
good
> > outcome if some implementations accept PRs for Big Endian fixes and
some
> > don't.
> >
> > But maybe this is OK with others?
> >
> > My current opinion on the matter is that we should support it under
the
> > following conditions:
> >
> > 1. As long as there is CI in place to catch regressions (right now I
think
> > the CI is fairly unreliable?)
> > 2. No degradation in performance for little-endian architectures
(verified
> > by additional micro benchmarks)
> > 3. Not a large amount of invasive code to distinguish between
platforms.
> >
> > Kazuaki Ishizaki, I asked a question previously, but could you give some
data
> > points around:
> > 1. The current state of C++ support (how much code needed to change)?
> > 2. How many more PRs you expect to need for Java (and approximate
size)?
> >
> > I think this would help myself and others in the decision making
process.
> >
> > Thanks,
> > Micah
> >
> > On Tue, Aug 18, 2020 at 9:15 AM Micah Kornfield
<em...@gmail.com>
> > wrote:
> >
> > > My thoughts on the points raised so far:
> > >
> > > * Does supporting Big Endian increase the reach of Arrow by a lot?
> > >
> > > Probably not a significant amount, but it does provide one more
avenue of
> > > adoption.
> > >
> > > * Does it increase code complexity?
> > >
> > > Yes. I agree this is a concern. The PR in question did not seem
too bad
> > > to me but this is subjective. I think the remaining question is how
many
> > > more places need to be fixed up in the code base and how invasive
are the
> > > changes. In C++ IIUC it turned out to be a relatively small number
of
> > > places.
> > >
> > > Kazuaki Ishizaki have you been able to get the Java implementation
working
> > > fully locally? How many additional PRs will be needed and what do
> > > they look like (I think there already a few more in the queue)?
> > >
> > > * Will it introduce performance regressions?
> > >
> > > If done properly I suspect no, but I think if we continue with
BigEndian
> > > support the places that need to be touched should have benchmarks
added to
> > > confirm this (including for PRs already merged).
> > >
> > > Thanks,
> > > Micah
> > >
> > > On Sun, Aug 16, 2020 at 7:37 PM Fan Liya <li...@gmail.com>
wrote:
> > >
> > >> Thank Kazuaki Ishizaki for working on this.
> > >> IMO, supporting the big-endian should be a large change, as in many
> > >> places of the code base, we have implicitly assumed the
little-endian
> > >> platform (e.g.
> > >> https://github.com/apache/arrow/blob/master/java/memory/memory-core/src/main/java/org/apache/arrow/memory/util/ByteFunctionHelpers.java
> > >> ).
> > >> Supporting the big-endian platform may introduce branches in such
places
> > >> (or virtual calls) which will affect the performance.
> > >> So it would be helpful to evaluate the performance impact.
> > >>
> > >> Best,
> > >> Liya Fan
> > >>
> > >>
> > >> On Sat, Aug 15, 2020 at 7:54 AM Jacques Nadeau <ja...@apache.org>
> > >> wrote:
> > >>
> > >>> Hey Micah, thanks for starting the discussion.
> > >>>
> > >>> I just skimmed that thread and it isn't entirely clear that there
was a
> > >>> conclusion that the overhead was worth it. I think everybody
agrees that
> > >>> it
> > >>> would be nice to have the code work on both platforms. On the
flipside,
> > >>> the
> > >>> code noise for a rare case makes the cost-benefit questionable.
> > >>>
> > >>> In the Java code, we wrote the code to explicitly disallow big
endian
> > >>> platforms and put preconditions checks in. I definitely think if
we want
> > >>> to
> > >>> support this, it should be done holistically across the code with
> > >>> appropriate test plan (both functional and perf).
> > >>>
> > >>> To me, the question is really about how many use cases are blocked
by
> > >>> this.
> > >>> I'm not sure I've heard anyone say that the limiting factor to leveraging
> > >>> Java Arrow was the block on endianness. Keep in mind that until
very
> > >>> recently, using any Arrow Java code would throw a preconditions
check
> > >>> before you could even get started on big-endian and I don't think
we've
> > >>> seen a bunch of messages on that exception. Adding if conditions
> > >>> throughout
> > >>> the codebase like this patch: [1] isn't exactly awesome and it can
also
> > >>> risk performance impacts depending on how carefully it is done.
> > >>>
> > >>> If there isn't a preponderance of evidence of many users being blocked by
> > >>> this capability, I don't think we should accept the code. We
already
> > >>> have a
> > >>> backlog of items that we need to address just to ensure existing use
cases
> > >>> work well. Expanding to new use cases that there is no clear
demand for
> > >>> will likely just increase code development cost at little benefit.
> > >>>
> > >>> What do others think?
> > >>>
> > >>> [1] https://github.com/apache/arrow/pull/7923#issuecomment-674311119
> > >>>
> > >>> On Fri, Aug 14, 2020 at 4:36 PM Micah Kornfield
<em...@gmail.com>
> > >>> wrote:
> > >>>
> > >>> > Kazuaki Ishizaki has started working on Big Endian support in
Java
> > >>> > (including setting up CI for it). Thank you!
> > >>> >
> > >>> > We previously discussed support for Big Endian architectures in
C++
> > >>> [1] and
> > >>> > generally agreed that it was a reasonable thing to do.
> > >>> >
> > >>> > Similar to C++ I think as long as we have a working CI setup it
is
> > >>> > reasonable for Java to support Big Endian machines.
> > >>> >
> > >>> > But I think there might be differing opinions so it is worth a
> > >>> discussion
> > >>> > to see if there are technical blockers or other reasons for not
> > >>> supporting
> > >>> > Big Endian architectures in the existing java implementation.
> > >>> >
> > >>> > Thanks,
> > >>> > Micah
> > >>> >
> > >>> >
> > >>> > [1]
> > >>> >
> > >>> >
> > >>> https://lists.apache.org/thread.html/rcae745f1d848981bb5e8dddacfc4554641aba62e3c949b96bfd8b019%40%3Cdev.arrow.apache.org%3E
> > >>> >
> > >>>
> > >>
>
RE: [DISCUSS] Big Endian support in Arrow (was: Re: [Java] Supporting Big
Endian)
Posted by Kazuaki Ishizaki <IS...@jp.ibm.com>.
Kazuaki Ishizaki, Ph.D., Senior Technical Staff Member (STSM), IBM
Research - Tokyo
ACM Distinguished Member - Apache Spark committer - IBM Academy of
Technology Member
Re: [DISCUSS] Big Endian support in Arrow (was: Re: [Java] Supporting
Big Endian)
Posted by Wes McKinney <we...@gmail.com>.
hi Micah,
I agree with your reasoning. If supporting BE in some languages (e.g.
Java) is impractical due to performance regressions on LE platforms,
then I don't think it's worth it. But if it can be handled at compile
time or without runtime overhead, and tested / maintained properly on
an ongoing basis, then it seems reasonable to me. It seems that the
number of Arrow stakeholders will only increase from here so I would
hope that there will be more people invested in helping maintain BE in
the future.
- Wes
On Tue, Aug 25, 2020 at 11:33 PM Micah Kornfield <em...@gmail.com> wrote:
>
> I'm expanding the scope of this thread since it looks like work has also
> started for making golang support BigEndian architectures.
>
> I think as a community we should come to a consensus on whether we want to
> support Big Endian architectures in general. I don't think it is a good
> outcome if some implementations accept PRs for Big Endian fixes and some
> don't.
>
> But maybe this is OK with others?
>
> My current opinion on the matter is that we should support it under the
> following conditions:
>
> 1. As long as there is CI in place to catch regressions (right now I think
> the CI is fairly unreliable?)
> 2. No degradation in performance for little-endian architectures (verified
> by additional micro benchmarks)
> 3. Not a large amount of invasive code to distinguish between platforms.
>
> Kazuaki Ishizaki, I asked a question previously, but could you give some data
> points around:
> 1. The current state of C++ support (how much code needed to change)?
> 2. How many more PRs you expect to need for Java (and approximate size)?
>
> I think this would help myself and others in the decision making process.
>
> Thanks,
> Micah
>
> On Tue, Aug 18, 2020 at 9:15 AM Micah Kornfield <em...@gmail.com>
> wrote:
>
> > My thoughts on the points raised so far:
> >
> > * Does supporting Big Endian increase the reach of Arrow by a lot?
> >
> > Probably not a significant amount, but it does provide one more avenue of
> > adoption.
> >
> > * Does it increase code complexity?
> >
> > Yes. I agree this is a concern. The PR in question did not seem too bad
> > to me but this is subjective. I think the remaining question is how many
> > more places need to be fixed up in the code base and how invasive are the
> > changes. In C++ IIUC it turned out to be a relatively small number of
> > places.
> >
> > Kazuaki Ishizaki have you been able to get the Java implementation working
> > fully locally? How many additional PRs will be needed and what do
> > they look like (I think there already a few more in the queue)?
> >
> > * Will it introduce performance regressions?
> >
> > If done properly I suspect no, but I think if we continue with BigEndian
> > support the places that need to be touched should have benchmarks added to
> > confirm this (including for PRs already merged).
> >
> > Thanks,
> > Micah
> >
> > On Sun, Aug 16, 2020 at 7:37 PM Fan Liya <li...@gmail.com> wrote:
> >
> >> Thank Kazuaki Ishizaki for working on this.
> >> IMO, supporting the big-endian should be a large change, as in many
> >> places of the code base, we have implicitly assumed the little-endian
> >> platform (e.g.
> >> https://github.com/apache/arrow/blob/master/java/memory/memory-core/src/main/java/org/apache/arrow/memory/util/ByteFunctionHelpers.java
> >> ).
> >> Supporting the big-endian platform may introduce branches in such places
> >> (or virtual calls) which will affect the performance.
> >> So it would be helpful to evaluate the performance impact.
> >>
> >> Best,
> >> Liya Fan
> >>
> >>
> >> On Sat, Aug 15, 2020 at 7:54 AM Jacques Nadeau <ja...@apache.org>
> >> wrote:
> >>
> >>> Hey Micah, thanks for starting the discussion.
> >>>
> >>> I just skimmed that thread and it isn't entirely clear that there was a
> >>> conclusion that the overhead was worth it. I think everybody agrees that
> >>> it
> >>> would be nice to have the code work on both platforms. On the flipside,
> >>> the
> >>> code noise for a rare case makes the cost-benefit questionable.
> >>>
> >>> In the Java code, we wrote the code to explicitly disallow big endian
> >>> platforms and put preconditions checks in. I definitely think if we want
> >>> to
> >>> support this, it should be done holistically across the code with
> >>> appropriate test plan (both functional and perf).
> >>>
> >>> To me, the question is really about how many use cases are blocked by
> >>> this.
> >>> I'm not sure I've heard anyone say that the limiting factor to leveraging
> >>> Java Arrow was the block on endianness. Keep in mind that until very
> >>> recently, using any Arrow Java code would throw a preconditions check
> >>> before you could even get started on big-endian and I don't think we've
> >>> seen a bunch of messages on that exception. Adding if conditions
> >>> throughout
> >>> the codebase like this patch: [1] isn't exactly awesome and it can also
> >>> risk performance impacts depending on how carefully it is done.
> >>>
> >>> If there isn't a preponderance of evidence of many users being blocked by
> >>> this capability, I don't think we should accept the code. We already
> >>> have a
> >>> backlog of items that we need to address just to ensure existing use cases
> >>> work well. Expanding to new use cases that there is no clear demand for
> >>> will likely just increase code development cost at little benefit.
> >>>
> >>> What do others think?
> >>>
> >>> [1] https://github.com/apache/arrow/pull/7923#issuecomment-674311119
> >>>
> >>> On Fri, Aug 14, 2020 at 4:36 PM Micah Kornfield <em...@gmail.com>
> >>> wrote:
> >>>
> >>> > Kazuaki Ishizaki has started working on Big Endian support in Java
> >>> > (including setting up CI for it). Thank you!
> >>> >
> >>> > We previously discussed support for Big Endian architectures in C++
> >>> [1] and
> >>> > generally agreed that it was a reasonable thing to do.
> >>> >
> >>> > Similar to C++ I think as long as we have a working CI setup it is
> >>> > reasonable for Java to support Big Endian machines.
> >>> >
> >>> > But I think there might be differing opinions so it is worth a
> >>> discussion
> >>> > to see if there are technical blockers or other reasons for not
> >>> supporting
> >>> > Big Endian architectures in the existing java implementation.
> >>> >
> >>> > Thanks,
> >>> > Micah
> >>> >
> >>> >
> >>> > [1]
> >>> >
> >>> >
> >>> https://lists.apache.org/thread.html/rcae745f1d848981bb5e8dddacfc4554641aba62e3c949b96bfd8b019%40%3Cdev.arrow.apache.org%3E
> >>> >
> >>>
> >>
[DISCUSS] Big Endian support in Arrow (was: Re: [Java] Supporting Big Endian)
Posted by Micah Kornfield <em...@gmail.com>.
I'm expanding the scope of this thread since it looks like work has also
started for making golang support BigEndian architectures.
I think as a community we should come to a consensus on whether we want to
support Big Endian architectures in general. I don't think it is a good
outcome if some implementations accept PRs for Big Endian fixes and some
don't.
But maybe this is OK with others?
My current opinion on the matter is that we should support it under the
following conditions:
1. As long as there is CI in place to catch regressions (right now I think
the CI is fairly unreliable?)
2. No degradation in performance for little-endian architectures (verified
by additional micro benchmarks)
3. Not a large amount of invasive code to distinguish between platforms.
Kazuaki Ishizaki, I asked a question previously, but could you give some data
points around:
1. The current state of C++ support (how much code needed to change)?
2. How many more PRs you expect to need for Java (and approximate size)?
I think this would help myself and others in the decision making process.
Thanks,
Micah
On Tue, Aug 18, 2020 at 9:15 AM Micah Kornfield <em...@gmail.com>
wrote:
> My thoughts on the points raised so far:
>
> * Does supporting Big Endian increase the reach of Arrow by a lot?
>
> Probably not a significant amount, but it does provide one more avenue of
> adoption.
>
> * Does it increase code complexity?
>
> Yes. I agree this is a concern. The PR in question did not seem too bad
> to me but this is subjective. I think the remaining question is how many
> more places need to be fixed up in the code base and how invasive are the
> changes. In C++ IIUC it turned out to be a relatively small number of
> places.
>
> Kazuaki Ishizaki have you been able to get the Java implementation working
> fully locally? How many additional PRs will be needed and what do
> they look like (I think there already a few more in the queue)?
>
> * Will it introduce performance regressions?
>
> If done properly I suspect no, but I think if we continue with BigEndian
> support the places that need to be touched should have benchmarks added to
> confirm this (including for PRs already merged).
>
> Thanks,
> Micah
>
> On Sun, Aug 16, 2020 at 7:37 PM Fan Liya <li...@gmail.com> wrote:
>
>> Thank Kazuaki Ishizaki for working on this.
>> IMO, supporting the big-endian should be a large change, as in many
>> places of the code base, we have implicitly assumed the little-endian
>> platform (e.g.
>> https://github.com/apache/arrow/blob/master/java/memory/memory-core/src/main/java/org/apache/arrow/memory/util/ByteFunctionHelpers.java
>> ).
>> Supporting the big-endian platform may introduce branches in such places
>> (or virtual calls) which will affect the performance.
>> So it would be helpful to evaluate the performance impact.
>>
>> Best,
>> Liya Fan
>>
>>
>> On Sat, Aug 15, 2020 at 7:54 AM Jacques Nadeau <ja...@apache.org>
>> wrote:
>>
>>> Hey Micah, thanks for starting the discussion.
>>>
>>> I just skimmed that thread and it isn't entirely clear that there was a
>>> conclusion that the overhead was worth it. I think everybody agrees that
>>> it
>>> would be nice to have the code work on both platforms. On the flipside,
>>> the
>>> code noise for a rare case makes the cost-benefit questionable.
>>>
>>> In the Java code, we wrote the code to explicitly disallow big endian
>>> platforms and put preconditions checks in. I definitely think if we want
>>> to
>>> support this, it should be done holistically across the code with
>>> appropriate test plan (both functional and perf).
>>>
>>> To me, the question is really about how many use cases are blocked by
>>> this.
>>> I'm not sure I've heard anyone say that the limiting factor to leveraging
> >> Java Arrow was the block on endianness. Keep in mind that until very
>>> recently, using any Arrow Java code would throw a preconditions check
>>> before you could even get started on big-endian and I don't think we've
>>> seen a bunch of messages on that exception. Adding if conditions
>>> throughout
>>> the codebase like this patch: [1] isn't exactly awesome and it can also
>>> risk performance impacts depending on how carefully it is done.
>>>
>>> If there isn't a preponderance of evidence of many users being blocked by
>>> this capability, I don't think we should accept the code. We already
>>> have a
> >> backlog of items that we need to address just to ensure existing use cases
>>> work well. Expanding to new use cases that there is no clear demand for
>>> will likely just increase code development cost at little benefit.
>>>
>>> What do others think?
>>>
>>> [1] https://github.com/apache/arrow/pull/7923#issuecomment-674311119
>>>
>>> On Fri, Aug 14, 2020 at 4:36 PM Micah Kornfield <em...@gmail.com>
>>> wrote:
>>>
> >> > Kazuaki Ishizaki has started working on Big Endian support in Java
>>> > (including setting up CI for it). Thank you!
>>> >
>>> > We previously discussed support for Big Endian architectures in C++
>>> [1] and
>>> > generally agreed that it was a reasonable thing to do.
>>> >
>>> > Similar to C++ I think as long as we have a working CI setup it is
>>> > reasonable for Java to support Big Endian machines.
>>> >
>>> > But I think there might be differing opinions so it is worth a
>>> discussion
>>> > to see if there are technical blockers or other reasons for not
>>> supporting
>>> > Big Endian architectures in the existing java implementation.
>>> >
>>> > Thanks,
>>> > Micah
>>> >
>>> >
>>> > [1]
>>> >
>>> >
>>> https://lists.apache.org/thread.html/rcae745f1d848981bb5e8dddacfc4554641aba62e3c949b96bfd8b019%40%3Cdev.arrow.apache.org%3E
>>> >
>>>
>>
Re: [Java] Supporting Big Endian
Posted by Micah Kornfield <em...@gmail.com>.
My thoughts on the points raised so far:
* Does supporting Big Endian increase the reach of Arrow by a lot?
Probably not a significant amount, but it does provide one more avenue of
adoption.
* Does it increase code complexity?
Yes. I agree this is a concern. The PR in question did not seem too bad
to me but this is subjective. I think the remaining question is how many
more places need to be fixed up in the code base and how invasive are the
changes. In C++ IIUC it turned out to be a relatively small number of
places.
Kazuaki Ishizaki have you been able to get the Java implementation working
fully locally? How many additional PRs will be needed and what do
they look like (I think there already a few more in the queue)?
* Will it introduce performance regressions?
If done properly I suspect no, but I think if we continue with BigEndian
support the places that need to be touched should have benchmarks added to
confirm this (including for PRs already merged).
Thanks,
Micah
On Sun, Aug 16, 2020 at 7:37 PM Fan Liya <li...@gmail.com> wrote:
> Thank Kazuaki Ishizaki for working on this.
> IMO, supporting the big-endian should be a large change, as in many places
> of the code base, we have implicitly assumed the little-endian platform
> (e.g.
> https://github.com/apache/arrow/blob/master/java/memory/memory-core/src/main/java/org/apache/arrow/memory/util/ByteFunctionHelpers.java
> ).
> Supporting the big-endian platform may introduce branches in such places
> (or virtual calls) which will affect the performance.
> So it would be helpful to evaluate the performance impact.
>
> Best,
> Liya Fan
>
>
> On Sat, Aug 15, 2020 at 7:54 AM Jacques Nadeau <ja...@apache.org> wrote:
>
>> Hey Micah, thanks for starting the discussion.
>>
>> I just skimmed that thread and it isn't entirely clear that there was a
>> conclusion that the overhead was worth it. I think everybody agrees that
>> it
>> would be nice to have the code work on both platforms. On the flipside,
>> the
>> code noise for a rare case makes the cost-benefit questionable.
>>
>> In the Java code, we wrote the code to explicitly disallow big endian
>> platforms and put preconditions checks in. I definitely think if we want
>> to
>> support this, it should be done holistically across the code with
>> appropriate test plan (both functional and perf).
>>
>> To me, the question is really about how many use cases are blocked by
>> this.
>> I'm not sure I've heard anyone say that the limiting factor to leveraging
> Java Arrow was the block on endianness. Keep in mind that until very
>> recently, using any Arrow Java code would throw a preconditions check
>> before you could even get started on big-endian and I don't think we've
>> seen a bunch of messages on that exception. Adding if conditions
>> throughout
>> the codebase like this patch: [1] isn't exactly awesome and it can also
>> risk performance impacts depending on how carefully it is done.
>>
>> If there isn't a preponderance of evidence of many users being blocked by
>> this capability, I don't think we should accept the code. We already have
>> a
>> backlog of items that we need to address just to ensure existing use
>> cases work well. Expanding to new use cases for which there is no clear
>> demand will likely just increase development cost for little benefit.
>>
>> What do others think?
>>
>> [1] https://github.com/apache/arrow/pull/7923#issuecomment-674311119
>>
>> On Fri, Aug 14, 2020 at 4:36 PM Micah Kornfield <em...@gmail.com>
>> wrote:
>>
>> > Kazuaki Ishizaki has started working on Big Endian support in Java
>> > (including setting up CI for it). Thank you!
>> >
>> > We previously discussed support for Big Endian architectures in C++ [1]
>> and
>> > generally agreed that it was a reasonable thing to do.
>> >
>> > Similar to C++ I think as long as we have a working CI setup it is
>> > reasonable for Java to support Big Endian machines.
>> >
>> > But I think there might be differing opinions so it is worth a
>> discussion
>> > to see if there are technical blockers or other reasons for not
>> supporting
>> > Big Endian architectures in the existing Java implementation.
>> >
>> > Thanks,
>> > Micah
>> >
>> >
>> > [1]
>> >
>> >
>> https://lists.apache.org/thread.html/rcae745f1d848981bb5e8dddacfc4554641aba62e3c949b96bfd8b019%40%3Cdev.arrow.apache.org%3E
>> >
>>
>
Re: [Java] Supporting Big Endian
Posted by Fan Liya <li...@gmail.com>.
Thanks to Kazuaki Ishizaki for working on this.
IMO, supporting big-endian would be a large change, as in many places
of the code base we have implicitly assumed a little-endian platform
(e.g.
https://github.com/apache/arrow/blob/master/java/memory/memory-core/src/main/java/org/apache/arrow/memory/util/ByteFunctionHelpers.java
).
Supporting the big-endian platform may introduce branches in such places
(or virtual calls), which could affect performance.
So it would be helpful to evaluate the performance impact.
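For illustration, the kind of implicit assumption at issue looks like the following standalone sketch (not Arrow's actual code): reading a little-endian buffer with the platform's native byte order only happens to work on little-endian hosts.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class EndianAssumption {
    // Arrow buffers store multi-byte values little-endian, so a portable
    // reader must say so explicitly instead of relying on nativeOrder().
    static int readLittleEndianInt(byte[] raw) {
        return ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN).getInt();
    }

    public static void main(String[] args) {
        byte[] raw = {0x78, 0x56, 0x34, 0x12};

        // Correct on any host.
        System.out.println(Integer.toHexString(readLittleEndianInt(raw))); // prints "12345678"

        // A native-order read gives the same answer only on little-endian
        // hardware; on s390x it would silently produce 0x78563412.
        int nativeRead = ByteBuffer.wrap(raw).order(ByteOrder.nativeOrder()).getInt();
        System.out.println(Integer.toHexString(nativeRead));
    }
}
```

Code that instead reads with `nativeOrder()` (or raw unsafe loads) is where the branches or byte swaps would have to be introduced.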
Best,
Liya Fan
On Sat, Aug 15, 2020 at 7:54 AM Jacques Nadeau <ja...@apache.org> wrote:
> Hey Micah, thanks for starting the discussion.
>
> I just skimmed that thread and it isn't entirely clear that there was a
> conclusion that the overhead was worth it. I think everybody agrees that it
> would be nice to have the code work on both platforms. On the flipside, the
> code noise for a rare case makes the cost-benefit questionable.
>
> In the Java code, we wrote the code to explicitly disallow big endian
> platforms and put preconditions checks in. I definitely think if we want to
> support this, it should be done holistically across the code with
> appropriate test plan (both functional and perf).
>
> To me, the question is really about how many use cases are blocked by this.
> I'm not sure I've heard anyone say that the limiting factor to leveraging
> Java Arrow was the block on endianness. Keep in mind that until very
> recently, using any Arrow Java code would throw a preconditions check
> before you could even get started on big-endian and I don't think we've
> seen a bunch of messages on that exception. Adding if conditions throughout
> the codebase like this patch: [1] isn't exactly awesome and it can also
> risk performance impacts depending on how carefully it is done.
>
> If there isn't a preponderance of evidence of many users being blocked by
> this capability, I don't think we should accept the code. We already have a
> backlog of items that we need to address just to ensure existing use cases
> work well. Expanding to new use cases for which there is no clear demand
> will likely just increase development cost for little benefit.
>
> What do others think?
>
> [1] https://github.com/apache/arrow/pull/7923#issuecomment-674311119
>
> On Fri, Aug 14, 2020 at 4:36 PM Micah Kornfield <em...@gmail.com>
> wrote:
>
> > Kazuaki Ishizaki has started working on Big Endian support in Java
> > (including setting up CI for it). Thank you!
> >
> > We previously discussed support for Big Endian architectures in C++ [1]
> and
> > generally agreed that it was a reasonable thing to do.
> >
> > Similar to C++ I think as long as we have a working CI setup it is
> > reasonable for Java to support Big Endian machines.
> >
> > But I think there might be differing opinions so it is worth a discussion
> > to see if there are technical blockers or other reasons for not
> supporting
> > Big Endian architectures in the existing Java implementation.
> >
> > Thanks,
> > Micah
> >
> >
> > [1]
> >
> >
> https://lists.apache.org/thread.html/rcae745f1d848981bb5e8dddacfc4554641aba62e3c949b96bfd8b019%40%3Cdev.arrow.apache.org%3E
> >
>
Re: [Java] Supporting Big Endian
Posted by Jacques Nadeau <ja...@apache.org>.
Hey Micah, thanks for starting the discussion.
I just skimmed that thread and it isn't entirely clear that there was a
conclusion that the overhead was worth it. I think everybody agrees that it
would be nice to have the code work on both platforms. On the flipside, the
code noise for a rare case makes the cost-benefit questionable.
In the Java code, we wrote the code to explicitly disallow big endian
platforms and put preconditions checks in. I definitely think if we want to
support this, it should be done holistically across the code with
appropriate test plan (both functional and perf).
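For context, the disallow-and-fail-fast approach amounts to a check like this (a simplified sketch with hypothetical names; the real check lives in Arrow's memory module and its exact wording differs):

```java
import java.nio.ByteOrder;

public final class PlatformCheck {
    // Split out so the policy is testable without relying on the host's
    // actual byte order.
    static boolean isSupported(ByteOrder order) {
        return order == ByteOrder.LITTLE_ENDIAN;
    }

    // Fails fast at startup rather than silently corrupting data later.
    static void checkPlatform() {
        if (!isSupported(ByteOrder.nativeOrder())) {
            throw new IllegalStateException(
                    "Arrow Java only supports little-endian platforms");
        }
    }

    public static void main(String[] args) {
        checkPlatform();
        System.out.println("platform ok");
    }
}
```

Removing such a check is the easy part; the hard part is the holistic audit of every read and write path it was guarding.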
To me, the question is really about how many use cases are blocked by this.
I'm not sure I've heard anyone say that the limiting factor to leveraging
Java Arrow was the block on endianness. Keep in mind that until very
recently, using any Arrow Java code would throw a preconditions check
before you could even get started on big-endian and I don't think we've
seen a bunch of messages on that exception. Adding if conditions throughout
the codebase like this patch: [1] isn't exactly awesome and it can also
risk performance impacts depending on how carefully it is done.
If there isn't a preponderance of evidence of many users being blocked by
this capability, I don't think we should accept the code. We already have a
backlog of items that we need to address just to ensure existing use cases
work well. Expanding to new use cases for which there is no clear demand
will likely just increase development cost for little benefit.
What do others think?
[1] https://github.com/apache/arrow/pull/7923#issuecomment-674311119
On Fri, Aug 14, 2020 at 4:36 PM Micah Kornfield <em...@gmail.com>
wrote:
> Kazuaki Ishizaki has started working on Big Endian support in Java
> (including setting up CI for it). Thank you!
>
> We previously discussed support for Big Endian architectures in C++ [1] and
> generally agreed that it was a reasonable thing to do.
>
> Similar to C++ I think as long as we have a working CI setup it is
> reasonable for Java to support Big Endian machines.
>
> But I think there might be differing opinions so it is worth a discussion
> to see if there are technical blockers or other reasons for not supporting
> Big Endian architectures in the existing Java implementation.
>
> Thanks,
> Micah
>
>
> [1]
>
> https://lists.apache.org/thread.html/rcae745f1d848981bb5e8dddacfc4554641aba62e3c949b96bfd8b019%40%3Cdev.arrow.apache.org%3E
>
Re: [Java] Supporting Big Endian
Posted by Kazuaki Ishizaki <IS...@jp.ibm.com>.
Hi Micah,
Thank you for picking up this topic.
It is great to discuss Big Endian support in the Java implementation on
the mailing list. I may be missing some technical blockers.
Any comments are appreciated.
Best Regards,
Kazuaki Ishizaki
From: Micah Kornfield <em...@gmail.com>
To: dev <de...@arrow.apache.org>
Date: 2020/08/15 08:36
Subject: [EXTERNAL] [Java] Supporting Big Endian
Kazuaki Ishizaki has started working on Big Endian support in Java
(including setting up CI for it). Thank you!
We previously discussed support for Big Endian architectures in C++ [1]
and
generally agreed that it was a reasonable thing to do.
Similar to C++ I think as long as we have a working CI setup it is
reasonable for Java to support Big Endian machines.
But I think there might be differing opinions so it is worth a discussion
to see if there are technical blockers or other reasons for not supporting
Big Endian architectures in the existing Java implementation.
Thanks,
Micah
[1]
https://lists.apache.org/thread.html/rcae745f1d848981bb5e8dddacfc4554641aba62e3c949b96bfd8b019%40%3Cdev.arrow.apache.org%3E