You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@arrow.apache.org by Micah Kornfield <em...@gmail.com> on 2019/08/06 07:35:30 UTC

[Discuss][Java] 64-bit lengths for ValueVectors

There have been some previous discussions on the mailing about supporting
64-bit lengths for  Java ValueVectors (this is what the IPC specification
and C++ support).  I created a PR [1] that changes all APIs that I could
find that take an index to take an "long" instead of an "int" (and
similarly change "size/rowcount" APIs).

It is a big change, so I think it is worth discussing if it is something we
still want to move forward with.  It would be nice to come to a conclusion
quickly, ideally in the next few days, to avoid a lot of merge conflicts.

The reason I did this work now is the C++ implementation has added support
for LargeList, LargeBinary and LargeString arrays and based on prior
discussions we need to have similar support in Java before our next
release. Support 64-bit indexes means we can have full compatibility and
make the most use of the types in Java.

Look forward to hearing feedback.

Thanks,
Micah

[1] https://github.com/apache/arrow/pull/5020

Re: [Discuss][Java] 64-bit lengths for ValueVectors

Posted by Wes McKinney <we...@gmail.com>.

My stance on this is that I don't know how important it is for Java to
support vectors over INT32_MAX elements. The use cases enabled by
having very large arrays seem to be concentrated in the native code
world (e.g. C/C++/Rust) -- that could just be implementation-centrism
on my part, though. It's possible there are use cases where Java would
want to be able to produce very large memory regions to be exposed to
native code.

On Sat, Aug 10, 2019 at 7:39 PM Micah Kornfield <em...@gmail.com> wrote:
>
> Hi Jacques,
> I definitely understand these concerns and this change is risky because it
> is so large.  Perhaps, creating a new hierarchy, might be the cleanest way
> of dealing with this.  This could have other benefits like cleaning up some
> cruft around dictionary encode and "orphaned" method.   Per past e-mail
> threads I agree it is beneficial to have 2 separate reference
> implementations that can communicate fully, and my intent here was to close
> that gap.
>
> Trying to
> > determine the ramifications of these changes would be challenging and time
> > consuming against all the different ways we interact with the Arrow Java
> > library.
>
>
> Understood.  I took a quick look at Dremio-OSS it seems like it has a
> simple java build system?  If it is helpful, I can try to get a fork
> running that at least compiles against this PR.  My plan would be to cast
> any place that was changed to return a long back to an int, so in essence
> the Dremio algorithms would reman 32-bit implementations.
>
> I don't  have the infrastructure to test this change properly from a
> distributed systems perspective, so it would still take some time from
> Dremio to validate for regressions.
>
> I'm not saying I'm against this but want to make sure we've
> > explored all less disruptive options before considering changing something
> > this fundamental (especially when I generally hold the view that large cell
> > counts against massive contiguous memory is an anti pattern to scalable
> > analytical processing--purely subjective of course).
>
>
> I'm open to other ideas here, as well. I don't think it is out of the
> question to leave the Java implementation as 32-bit, but if we do, then I
> think we should consider a different strategy for reference implementations.
>
> Thanks,
> Micah
>
> On Sat, Aug 10, 2019 at 5:09 PM Jacques Nadeau <ja...@apache.org> wrote:
>
> > Hey Micah, I didn't have a particular path in mind. Was thinking more along
> > the lines of extra methods as opposed to separate classes.
> >
> > Arrow hasn't historically been a place where we're writing algorithms in
> > Java so the fact that they aren't there doesn't mean they don't exist. We
> > have a large amount of code that depends on the current behavior that is
> > deployed in hundreds of customer clusters (you can peruse our dremio repo
> > to see how extensively we leverage Arrow if interested). Trying to
> > determine the ramifications of these changes would be challenging and time
> > consuming against all the different ways we interact with the Arrow Java
> > library. I'm not saying I'm against this but want to make sure we've
> > explored all less disruptive options before considering changing something
> > this fundamental (especially when I generally hold the view that large cell
> > counts against massive contiguous memory is an anti pattern to scalable
> > analytical processing--purely subjective of course).
> >
> > On Sat, Aug 10, 2019, 4:17 PM Micah Kornfield <em...@gmail.com>
> > wrote:
> >
> > > Hi Jacques,
> > > What avenue were you thinking for supporting both paths?   I didn't want
> > > to pursue a different class hierarchy, because I felt like that would
> > > effectively fork the code base, but that is potentially an option that
> > > would allow us to have a complete reference implementation in Java that
> > can
> > > fully interact with C++, without major changes to this code.
> > >
> > > For supporting both APIs on the same classes/interfaces, I think they
> > > roughly fall into three categories, changes to input parameters, changes
> > to
> > > output parameters and algorithm changes.
> > >
> > > For inputs, changing from int to long is essentially a no-op from the
> > > compiler perspective.  From the limited micro-benchmarking this also
> > > doesn't seem to have a performance impact.  So we could keep two versions
> > > of the methods that only differ on inputs, but it is not clear what the
> > > value of that would be.
> > >
> > > For outputs, we can't support methods "long getLength()" and "int
> > > getLength()" in the same class, so we would be forced into something like
> > > "long getLength(boolean dummy)" which I think is a less desirable.
> > >
> > > For algorithm changes, there did not appear to be too many places where
> > we
> > > actually loop over all elements (it is quite possible I missed something
> > > here), the ones that I did find I was able to mitigate performance
> > > penalties as noted above.  Some of the current implementation will get a
> > > lot slower for "large arrays", but we can likely fix those later or in
> > this
> > > PR with a nested while loop instead of 2 for loops.
> > >
> > > Thanks,
> > > Micah
> > >
> > >
> > > On Saturday, August 10, 2019, Jacques Nadeau <ja...@apache.org> wrote:
> > >
> > >> This is a pretty massive change to the apis. I wonder how nasty it would
> > >> be to just support both paths. Have you evaluated how complex that
> > would be?
> > >>
> > >> On Wed, Aug 7, 2019 at 11:08 PM Micah Kornfield <em...@gmail.com>
> > >> wrote:
> > >>
> > >>> After more investigation, it looks like Float8Benchmarks at least on my
> > >>> machine are within the range of noise.
> > >>>
> > >>> For BitVectorHelper I pushed a new commit [1], seems to bring the
> > >>> BitVectorHelper benchmarks back inline (and even with some improvement
> > >>> for
> > >>> getNullCountBenchmark).
> > >>>
> > >>> Benchmark                                        Mode  Cnt   Score
> > >>>  Error
> > >>>  Units
> > >>> BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5   3.821 ±
> > >>> 0.031
> > >>>  ns/op
> > >>> BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5  14.884 ±
> > >>> 0.141
> > >>>  ns/op
> > >>>
> > >>> I applied the same pattern to other loops that I could find, and for
> > any
> > >>> "for (long" loop on the critical path, I broke it up into two loops.
> > the
> > >>> first loop does iteration by integer, the second finishes off for any
> > >>> long
> > >>> values.  As a side note it seems like optimization for loops using long
> > >>> counters at least have a semi-recent open bug for the JVM [2]
> > >>>
> > >>> Thanks,
> > >>> Micah
> > >>>
> > >>> [1]
> > >>>
> > >>>
> > https://github.com/apache/arrow/pull/5020/commits/2ea2c1ae83e3baa7b9a99a6d06276d968df41797
> > >>> [2] https://bugs.openjdk.java.net/browse/JDK-8223051
> > >>>
> > >>> On Wed, Aug 7, 2019 at 8:11 PM Micah Kornfield <em...@gmail.com>
> > >>> wrote:
> > >>>
> > >>> > Indeed, the BoundChecking and CheckNullForGet variables can make a
> > big
> > >>> > difference.  I didn't initially run the benchmarks with these turned
> > on
> > >>> > (you can see the result from above with Float8Benchmarks).  Here are
> > >>> new
> > >>> > numbers including with the flags enabled.  It looks like using longs
> > >>> might
> > >>> > be a little bit slower, I'll see what I can do to mitigate this.
> > >>> >
> > >>> > Ravindra also volunteered to try to benchmark the changes with
> > Dremio's
> > >>> > code on today's sync call.
> > >>> >
> > >>> > New
> > >>> >
> > >>> > Benchmark                                        Mode  Cnt   Score
> > >>>  Error
> > >>> > Units
> > >>> >
> > >>> > BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5   4.176 ±
> > >>> 1.292
> > >>> > ns/op
> > >>> >
> > >>> > BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5  26.102 ±
> > >>> 0.700
> > >>> > ns/op
> > >>> >
> > >>> > Float8Benchmarks.copyFromBenchmark   avgt    5  7.398 ± 0.084  us/op
> > >>> >
> > >>> > Float8Benchmarks.readWriteBenchmark  avgt    5  2.711 ± 0.057  us/op
> > >>> >
> > >>> >
> > >>> >
> > >>> > old
> > >>> >
> > >>> > BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5   3.828 ±
> > >>> 0.030
> > >>> > ns/op
> > >>> >
> > >>> > BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5  20.611 ±
> > >>> 0.188
> > >>> > ns/op
> > >>> >
> > >>> > Float8Benchmarks.copyFromBenchmark   avgt    5  6.597 ± 0.462  us/op
> > >>> >
> > >>> > Float8Benchmarks.readWriteBenchmark  avgt    5  2.615 ± 0.027  us/op
> > >>> >
> > >>> > On Wed, Aug 7, 2019 at 7:13 PM Fan Liya <li...@gmail.com>
> > wrote:
> > >>> >
> > >>> >> Hi Gonzalo,
> > >>> >>
> > >>> >> Thanks for sharing the performance results.
> > >>> >> I am wondering if you have turned off the flag
> > >>> >> BoundsChecking#BOUNDS_CHECKING_ENABLED.
> > >>> >> If not, the lower throughput should be expected.
> > >>> >>
> > >>> >> Best,
> > >>> >> Liya Fan
> > >>> >>
> > >>> >> On Wed, Aug 7, 2019 at 10:23 PM Micah Kornfield <
> > >>> emkornfield@gmail.com>
> > >>> >> wrote:
> > >>> >>
> > >>> >>> Hi Gonzalo,
> > >>> >>> Thank you for the feedback.  I wasn't aware of the JIT
> > >>> implications.   At
> > >>> >>> least on the benchmark run they don't seem to have an impact.
> > >>> >>>
> > >>> >>> If there are other benchmarks that people have that can validate if
> > >>> this
> > >>> >>> change will be problematic I would appreciate trying to run them
> > >>> with the
> > >>> >>> PR.  I will try to run the ones for zeroing/popcnt tonight to see
> > if
> > >>> >>> there
> > >>> >>> is a change in those.
> > >>> >>>
> > >>> >>> -Micah
> > >>> >>>
> > >>> >>>
> > >>> >>>
> > >>> >>> On Wednesday, August 7, 2019, Gonzalo Ortiz Jaureguizar <
> > >>> >>> golthiryus@gmail.com> wrote:
> > >>> >>>
> > >>> >>> > I would recommend to take care with this kind of changes.
> > >>> >>> >
> > >>> >>> > I didn't try Arrow in more than one year, but by then the
> > >>> performance
> > >>> >>> was
> > >>> >>> > quite bad in comparison with plain byte buffer access
> > >>> >>> > (see http://git.net/apache-arrow-development/msg02353.html *)
> > and
> > >>> >>> > there are several optimizations that the JVM (specifically, C2)
> > >>> does
> > >>> >>> not
> > >>> >>> > apply when dealing with int instead of longs. One of the
> > >>> >>> > most commons is the loop unrolling and vectorization.
> > >>> >>> >
> > >>> >>> > * It doesn't seem the best way to reference an old email on the
> > >>> list,
> > >>> >>> but
> > >>> >>> > it is the only result shown by Google
> > >>> >>> >
> > >>> >>> > El mié., 7 ago. 2019 a las 11:42, Fan Liya (<
> > liya.fan03@gmail.com
> > >>> >)
> > >>> >>> > escribió:
> > >>> >>> >
> > >>> >>> >> Hi Micah,
> > >>> >>> >>
> > >>> >>> >> Thanks for your effort. The performance result looks good.
> > >>> >>> >>
> > >>> >>> >> As you indicated, ArrowBuf will take additional 12 bytes (4
> > bytes
> > >>> for
> > >>> >>> each
> > >>> >>> >> of length, write index, and read index).
> > >>> >>> >> Similar overheads also exist for vectors like
> > >>> BaseFixedWidthVector,
> > >>> >>> >> BaseVariableWidthVector, etc.
> > >>> >>> >>
> > >>> >>> >> IMO, such overheads are small enough to justify the change.
> > >>> >>> >> Let's check if there are other overheads.
> > >>> >>> >>
> > >>> >>> >> Best,
> > >>> >>> >> Liya Fan
> > >>> >>> >>
> > >>> >>> >> On Wed, Aug 7, 2019 at 3:30 PM Micah Kornfield <
> > >>> emkornfield@gmail.com
> > >>> >>> >
> > >>> >>> >> wrote:
> > >>> >>> >>
> > >>> >>> >> > Hi Liya Fan,
> > >>> >>> >> > Based on the Float8Benchmark there does not seem to be any
> > >>> >>> meaningful
> > >>> >>> >> > performance difference on my machine.  At least for me, the
> > >>> >>> benchmarks
> > >>> >>> >> are
> > >>> >>> >> > not stable enough to say one is faster than the other (I've
> > >>> pasted
> > >>> >>> >> results
> > >>> >>> >> > below).  That being said my machine isn't necessarily the most
> > >>> >>> reliable
> > >>> >>> >> for
> > >>> >>> >> > benchmarking.
> > >>> >>> >> >
> > >>> >>> >> > On an intuitive level, this makes sense to me,  for the most
> > >>> part it
> > >>> >>> >> seems
> > >>> >>> >> > like the change just moves casting from "int" to "long"
> > further
> > >>> up
> > >>> >>> the
> > >>> >>> >> > stack  for  "PlatformDepdendent" operations.  If there are
> > other
> > >>> >>> >> benchmarks
> > >>> >>> >> > that you think are worth running let me know.
> > >>> >>> >> >
> > >>> >>> >> > One downside performance wise I think for his change is it
> > >>> >>> increases the
> > >>> >>> >> > size of ArrowBuf objects, which I suppose could influence
> > cache
> > >>> >>> misses
> > >>> >>> >> at
> > >>> >>> >> > some level or increase the size of call-stacks, but this
> > doesn't
> > >>> >>> seem to
> > >>> >>> >> > show up in the benchmark..
> > >>> >>> >> >
> > >>> >>> >> > Thanks,
> > >>> >>> >> > Micah
> > >>> >>> >> >
> > >>> >>> >> > Sample benchmark numbers:
> > >>> >>> >> >
> > >>> >>> >> > [New Code]
> > >>> >>> >> > Benchmark                            Mode  Cnt   Score   Error
> > >>> >>> Units
> > >>> >>> >> > Float8Benchmarks.copyFromBenchmark   avgt    5  15.441 ± 0.469
> > >>> >>> us/op
> > >>> >>> >> > Float8Benchmarks.readWriteBenchmark  avgt    5  14.057 ± 0.115
> > >>> >>> us/op
> > >>> >>> >> >
> > >>> >>> >> >
> > >>> >>> >> > [Old code]
> > >>> >>> >> > Benchmark                            Mode  Cnt   Score   Error
> > >>> >>> Units
> > >>> >>> >> > Float8Benchmarks.copyFromBenchmark   avgt    5  16.248 ± 1.409
> > >>> >>> us/op
> > >>> >>> >> > Float8Benchmarks.readWriteBenchmark  avgt    5  14.150 ± 0.084
> > >>> >>> us/op
> > >>> >>> >> >
> > >>> >>> >> > On Tue, Aug 6, 2019 at 1:18 AM Fan Liya <liya.fan03@gmail.com
> > >
> > >>> >>> wrote:
> > >>> >>> >> >
> > >>> >>> >> >> Hi Micah,
> > >>> >>> >> >>
> > >>> >>> >> >> Thanks a lot for doing this.
> > >>> >>> >> >>
> > >>> >>> >> >> I am a little concerned about if there is any negative
> > >>> performance
> > >>> >>> >> impact
> > >>> >>> >> >> on the current 32-bit-length based applications.
> > >>> >>> >> >> Can we do some performance comparison on our existing
> > >>> benchmarks?
> > >>> >>> >> >>
> > >>> >>> >> >> Best,
> > >>> >>> >> >> Liya Fan
> > >>> >>> >> >>
> > >>> >>> >> >>
> > >>> >>> >> >> On Tue, Aug 6, 2019 at 3:35 PM Micah Kornfield <
> > >>> >>> emkornfield@gmail.com>
> > >>> >>> >> >> wrote:
> > >>> >>> >> >>
> > >>> >>> >> >>> There have been some previous discussions on the mailing
> > about
> > >>> >>> >> supporting
> > >>> >>> >> >>> 64-bit lengths for  Java ValueVectors (this is what the IPC
> > >>> >>> >> specification
> > >>> >>> >> >>> and C++ support).  I created a PR [1] that changes all APIs
> > >>> that I
> > >>> >>> >> could
> > >>> >>> >> >>> find that take an index to take an "long" instead of an
> > "int"
> > >>> (and
> > >>> >>> >> >>> similarly change "size/rowcount" APIs).
> > >>> >>> >> >>>
> > >>> >>> >> >>> It is a big change, so I think it is worth discussing if it
> > is
> > >>> >>> >> something
> > >>> >>> >> >>> we
> > >>> >>> >> >>> still want to move forward with.  It would be nice to come
> > to
> > >>> a
> > >>> >>> >> >>> conclusion
> > >>> >>> >> >>> quickly, ideally in the next few days, to avoid a lot of
> > merge
> > >>> >>> >> conflicts.
> > >>> >>> >> >>>
> > >>> >>> >> >>> The reason I did this work now is the C++ implementation has
> > >>> added
> > >>> >>> >> >>> support
> > >>> >>> >> >>> for LargeList, LargeBinary and LargeString arrays and based
> > on
> > >>> >>> prior
> > >>> >>> >> >>> discussions we need to have similar support in Java before
> > our
> > >>> >>> next
> > >>> >>> >> >>> release. Support 64-bit indexes means we can have full
> > >>> >>> compatibility
> > >>> >>> >> and
> > >>> >>> >> >>> make the most use of the types in Java.
> > >>> >>> >> >>>
> > >>> >>> >> >>> Look forward to hearing feedback.
> > >>> >>> >> >>>
> > >>> >>> >> >>> Thanks,
> > >>> >>> >> >>> Micah
> > >>> >>> >> >>>
> > >>> >>> >> >>> [1] https://github.com/apache/arrow/pull/5020
> > >>> >>> >> >>>
> > >>> >>> >> >>
> > >>> >>> >>
> > >>> >>> >
> > >>> >>>
> > >>> >>
> > >>>
> > >>
> >

Re: [Discuss][Java] 64-bit lengths for ValueVectors

Posted by Jacques Nadeau <ja...@apache.org>.

>
>  Hi Jacques, I hope you had a good rest.


I did, thanks!

On Fri, Aug 23, 2019 at 9:25 AM Jacques Nadeau <ja...@apache.org> wrote:

> I don't think we should couple this discussion with the implementation of
> large list, etc since I think those two concepts are independent.
>
> I've asked some others on my team their opinions on the risk here. I think
> we should probably review some our more complex vector interactions and see
> how the jvm's assembly changes with this kind of change. Using
> microbenchmarking is good but I think we also need to see whether we're
> constantly inserting additional instructions or if in most cases, this
> actually doesn't impact instruction count.
>
>
>
> On Wed, Aug 21, 2019 at 12:18 PM Micah Kornfield <em...@gmail.com>
> wrote:
>
>>
>>> With regards to the reference implementation point. It is a good point.
>>> I'm on vacation this week. Unless you're pushing hard on this, can we pick
>>> this up and discuss more next week?
>>
>>
>> Hi Jacques, I hope you had a good rest.  Any more thoughts on the
>> reference implementation aspect of this?
>>
>>
>>> To copy the sentiments from the 0.15.0 release thread, I think it
>>> would be best to decouple this discussion from the release timeline
>>> given how many people we have relying on regular releases coming out.
>>> We can keep continue making major 0.x releases until we're ready to
>>> release 1.0.0.
>>
>>
>> I'm OK with it as long as other stakeholders are. Timed releases are the
>> way to go.  As stated on the release thread [1] we need a better mechanism
>> to avoid this type of issue arising again.  The release thread also had
>> some more discussion on compatibility.
>>
>> Thanks,
>> Micah
>>
>> [1]
>> https://lists.apache.org/thread.html/d70feeceaf2570906ade117030b29887af7c77ca5c4a976e6d555920@%3Cdev.arrow.apache.org%3E
>>
>>
>> On Wed, Aug 14, 2019 at 3:23 PM Wes McKinney <we...@gmail.com> wrote:
>>
>>> On Sun, Aug 11, 2019 at 9:40 PM Micah Kornfield <em...@gmail.com>
>>> wrote:
>>> >
>>> > Hi Wes and Jacques,
>>> > See responses below.
>>> >
>>> > With regards to the reference implementation point. It is a good
>>> point. I'm
>>> > > on vacation this week. Unless you're pushing hard on this, can we
>>> pick this
>>> > > up and discuss more next week?
>>> >
>>> >
>>> > Sure thing, enjoy your vacation.  I think the only practical
>>> implications
>>> > are it delays choices around implementing LargeList, LargeBinary,
>>> > LargeString in Java, which in turn might push out the 0.15.0 release.
>>> >
>>>
>>> To copy the sentiments from the 0.15.0 release thread, I think it
>>> would be best to decouple this discussion from the release timeline
>>> given how many people we have relying on regular releases coming out.
>>> We can keep continue making major 0.x releases until we're ready to
>>> release 1.0.0.
>>>
>>> > My stance on this is that I don't know how important it is for Java to
>>> > > support vectors over INT32_MAX elements. The use cases enabled by
>>> > > having very large arrays seem to be concentrated in the native code
>>> > > world (e.g. C/C++/Rust) -- that could just be implementation-centrism
>>> > > on my part, though.
>>> >
>>> >
>>> > A data point against this view is Spark has done work to eliminate 2GB
>>> > memory limits on its block sizes [1].  I don't claim to understand the
>>> > implications of this. Bryan might you have any thoughts here?  I'm OK
>>> with
>>> > INT32_MAX, as well, I think we should think about what this means for
>>> > adding Large types to Java and implications for reference
>>> implementations.
>>> >
>>> > Thanks,
>>> > Micah
>>> >
>>> > [1] https://issues.apache.org/jira/browse/SPARK-6235
>>> >
>>> > On Sun, Aug 11, 2019 at 6:31 PM Jacques Nadeau <ja...@apache.org>
>>> wrote:
>>> >
>>> > > Hey Micah,
>>> > >
>>> > > Appreciate the offer on the compiling. The reality is I'm more
>>> concerned
>>> > > about the unknowns than the compiling issue itself. Any time you've
>>> been
>>> > > tuning for a while, changing something like this could be totally
>>> fine or
>>> > > cause a couple of major issues. For example, we've done a very large
>>> amount
>>> > > of work reducing heap memory footprint of the vectors. Are target is
>>> to
>>> > > actually get it down to 24 bytes per ArrowBuf and 24 bytes heap per
>>> vector
>>> > > (not including arrow bufs).
>>> > >
>>> > > With regards to the reference implementation point. It is a good
>>> point.
>>> > > I'm on vacation this week. Unless you're pushing hard on this, can
>>> we pick
>>> > > this up and discuss more next week?
>>> > >
>>> > > thanks,
>>> > > Jacques
>>> > >
>>> > > On Sat, Aug 10, 2019 at 7:39 PM Micah Kornfield <
>>> emkornfield@gmail.com>
>>> > > wrote:
>>> > >
>>> > >> Hi Jacques,
>>> > >> I definitely understand these concerns and this change is risky
>>> because it
>>> > >> is so large.  Perhaps, creating a new hierarchy, might be the
>>> cleanest way
>>> > >> of dealing with this.  This could have other benefits like cleaning
>>> up
>>> > >> some
>>> > >> cruft around dictionary encode and "orphaned" method.   Per past
>>> e-mail
>>> > >> threads I agree it is beneficial to have 2 separate reference
>>> > >> implementations that can communicate fully, and my intent here was
>>> to
>>> > >> close
>>> > >> that gap.
>>> > >>
>>> > >> Trying to
>>> > >> > determine the ramifications of these changes would be challenging
>>> and
>>> > >> time
>>> > >> > consuming against all the different ways we interact with the
>>> Arrow Java
>>> > >> > library.
>>> > >>
>>> > >>
>>> > >> Understood.  I took a quick look at Dremio-OSS it seems like it has
>>> a
>>> > >> simple java build system?  If it is helpful, I can try to get a fork
>>> > >> running that at least compiles against this PR.  My plan would be
>>> to cast
>>> > >> any place that was changed to return a long back to an int, so in
>>> essence
>>> > >> the Dremio algorithms would reman 32-bit implementations.
>>> > >>
>>> > >> I don't  have the infrastructure to test this change properly from a
>>> > >> distributed systems perspective, so it would still take some time
>>> from
>>> > >> Dremio to validate for regressions.
>>> > >>
>>> > >> I'm not saying I'm against this but want to make sure we've
>>> > >> > explored all less disruptive options before considering changing
>>> > >> something
>>> > >> > this fundamental (especially when I generally hold the view that
>>> large
>>> > >> cell
>>> > >> > counts against massive contiguous memory is an anti pattern to
>>> scalable
>>> > >> > analytical processing--purely subjective of course).
>>> > >>
>>> > >>
>>> > >> I'm open to other ideas here, as well. I don't think it is out of
>>> the
>>> > >> question to leave the Java implementation as 32-bit, but if we do,
>>> then I
>>> > >> think we should consider a different strategy for reference
>>> > >> implementations.
>>> > >>
>>> > >> Thanks,
>>> > >> Micah
>>> > >>
>>> > >> On Sat, Aug 10, 2019 at 5:09 PM Jacques Nadeau <ja...@apache.org>
>>> > >> wrote:
>>> > >>
>>> > >> > Hey Micah, I didn't have a particular path in mind. Was thinking
>>> more
>>> > >> along
>>> > >> > the lines of extra methods as opposed to separate classes.
>>> > >> >
>>> > >> > Arrow hasn't historically been a place where we're writing
>>> algorithms in
>>> > >> > Java so the fact that they aren't there doesn't mean they don't
>>> exist.
>>> > >> We
>>> > >> > have a large amount of code that depends on the current behavior
>>> that is
>>> > >> > deployed in hundreds of customer clusters (you can peruse our
>>> dremio
>>> > >> repo
>>> > >> > to see how extensively we leverage Arrow if interested). Trying to
>>> > >> > determine the ramifications of these changes would be challenging
>>> and
>>> > >> time
>>> > >> > consuming against all the different ways we interact with the
>>> Arrow Java
>>> > >> > library. I'm not saying I'm against this but want to make sure
>>> we've
>>> > >> > explored all less disruptive options before considering changing
>>> > >> something
>>> > >> > this fundamental (especially when I generally hold the view that
>>> large
>>> > >> cell
>>> > >> > counts against massive contiguous memory is an anti pattern to
>>> scalable
>>> > >> > analytical processing--purely subjective of course).
>>> > >> >
>>> > >> > On Sat, Aug 10, 2019, 4:17 PM Micah Kornfield <
>>> emkornfield@gmail.com>
>>> > >> > wrote:
>>> > >> >
>>> > >> > > Hi Jacques,
>>> > >> > > What avenue were you thinking for supporting both paths?   I
>>> didn't
>>> > >> want
>>> > >> > > to pursue a different class hierarchy, because I felt like that
>>> would
>>> > >> > > effectively fork the code base, but that is potentially an
>>> option that
>>> > >> > > would allow us to have a complete reference implementation in
>>> Java
>>> > >> that
>>> > >> > can
>>> > >> > > fully interact with C++, without major changes to this code.
>>> > >> > >
>>> > >> > > For supporting both APIs on the same classes/interfaces, I
>>> think they
>>> > >> > > roughly fall into three categories, changes to input parameters,
>>> > >> changes
>>> > >> > to
>>> > >> > > output parameters and algorithm changes.
>>> > >> > >
>>> > >> > > For inputs, changing from int to long is essentially a no-op
>>> from the
>>> > >> > > compiler perspective.  From the limited micro-benchmarking this
>>> also
>>> > >> > > doesn't seem to have a performance impact.  So we could keep two
>>> > >> versions
>>> > >> > > of the methods that only differ on inputs, but it is not clear
>>> what
>>> > >> the
>>> > >> > > value of that would be.
>>> > >> > >
>>> > >> > > For outputs, we can't support methods "long getLength()" and
>>> "int
>>> > >> > > getLength()" in the same class, so we would be forced into
>>> something
>>> > >> like
>>> > >> > > "long getLength(boolean dummy)" which I think is a less
>>> desirable.
>>> > >> > >
>>> > >> > > For algorithm changes, there did not appear to be too many
>>> places
>>> > >> where
>>> > >> > we
>>> > >> > > actually loop over all elements (it is quite possible I missed
>>> > >> something
>>> > >> > > here), the ones that I did find I was able to mitigate
>>> performance
>>> > >> > > penalties as noted above.  Some of the current implementation
>>> will
>>> > >> get a
>>> > >> > > lot slower for "large arrays", but we can likely fix those
>>> later or in
>>> > >> > this
>>> > >> > > PR with a nested while loop instead of 2 for loops.
>>> > >> > >
>>> > >> > > Thanks,
>>> > >> > > Micah
>>> > >> > >
>>> > >> > >
>>> > >> > > On Saturday, August 10, 2019, Jacques Nadeau <
>>> jacques@apache.org>
>>> > >> wrote:
>>> > >> > >
>>> > >> > >> This is a pretty massive change to the apis. I wonder how
>>> nasty it
>>> > >> would
>>> > >> > >> be to just support both paths. Have you evaluated how complex
>>> that
>>> > >> > would be?
>>> > >> > >>
>>> > >> > >> On Wed, Aug 7, 2019 at 11:08 PM Micah Kornfield <
>>> > >> emkornfield@gmail.com>
>>> > >> > >> wrote:
>>> > >> > >>
>>> > >> > >>> After more investigation, it looks like Float8Benchmarks at
>>> least
>>> > >> on my
>>> > >> > >>> machine are within the range of noise.
>>> > >> > >>>
>>> > >> > >>> For BitVectorHelper I pushed a new commit [1], seems to bring
>>> the
>>> > >> > >>> BitVectorHelper benchmarks back inline (and even with some
>>> > >> improvement
>>> > >> > >>> for
>>> > >> > >>> getNullCountBenchmark).
>>> > >> > >>>
>>> > >> > >>> Benchmark                                        Mode  Cnt
>>>  Score
>>> > >> > >>>  Error
>>> > >> > >>>  Units
>>> > >> > >>> BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5
>>>  3.821 ±
>>> > >> > >>> 0.031
>>> > >> > >>>  ns/op
>>> > >> > >>> BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5
>>> 14.884 ±
>>> > >> > >>> 0.141
>>> > >> > >>>  ns/op
>>> > >> > >>>
>>> > >> > >>> I applied the same pattern to other loops that I could find,
>>> and for
>>> > >> > any
>>> > >> > >>> "for (long" loop on the critical path, I broke it up into two
>>> loops.
>>> > >> > the
>>> > >> > >>> first loop does iteration by integer, the second finishes off
>>> for
>>> > >> any
>>> > >> > >>> long
>>> > >> > >>> values.  As a side note it seems like optimization for loops
>>> using
>>> > >> long
>>> > >> > >>> counters at least have a semi-recent open bug for the JVM [2]
>>> > >> > >>>
>>> > >> > >>> Thanks,
>>> > >> > >>> Micah
>>> > >> > >>>
>>> > >> > >>> [1]
>>> > >> > >>>
>>> > >> > >>>
>>> > >> >
>>> > >>
>>> https://github.com/apache/arrow/pull/5020/commits/2ea2c1ae83e3baa7b9a99a6d06276d968df41797
>>> > >> > >>> [2] https://bugs.openjdk.java.net/browse/JDK-8223051
>>> > >> > >>>
>>> > >> > >>> On Wed, Aug 7, 2019 at 8:11 PM Micah Kornfield <
>>> > >> emkornfield@gmail.com>
>>> > >> > >>> wrote:
>>> > >> > >>>
>>> > >> > >>> > Indeed, the BoundChecking and CheckNullForGet variables can
>>> make a
>>> > >> > big
>>> > >> > >>> > difference.  I didn't initially run the benchmarks with
>>> these
>>> > >> turned
>>> > >> > on
>>> > >> > >>> > (you can see the result from above with Float8Benchmarks).
>>> Here
>>> > >> are
>>> > >> > >>> new
>>> > >> > >>> > numbers including with the flags enabled.  It looks like
>>> using
>>> > >> longs
>>> > >> > >>> might
>>> > >> > >>> > be a little bit slower, I'll see what I can do to mitigate
>>> this.
>>> > >> > >>> >
>>> > >> > >>> > Ravindra also volunteered to try to benchmark the changes
>>> with
>>> > >> > Dremio's
>>> > >> > >>> > code on today's sync call.
>>> > >> > >>> >
>>> > >> > >>> > New
>>> > >> > >>> >
>>> > >> > >>> > Benchmark                                        Mode  Cnt
>>>  Score
>>> > >> > >>>  Error
>>> > >> > >>> > Units
>>> > >> > >>> >
>>> > >> > >>> > BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5
>>> > >>  4.176 ±
>>> > >> > >>> 1.292
>>> > >> > >>> > ns/op
>>> > >> > >>> >
>>> > >> > >>> > BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5
>>> > >> 26.102 ±
>>> > >> > >>> 0.700
>>> > >> > >>> > ns/op
>>> > >> > >>> >
>>> > >> > >>> > Float8Benchmarks.copyFromBenchmark   avgt    5  7.398 ±
>>> 0.084
>>> > >> us/op
>>> > >> > >>> >
>>> > >> > >>> > Float8Benchmarks.readWriteBenchmark  avgt    5  2.711 ±
>>> 0.057
>>> > >> us/op
>>> > >> > >>> >
>>> > >> > >>> >
>>> > >> > >>> >
>>> > >> > >>> > old
>>> > >> > >>> >
>>> > >> > >>> > BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5
>>> > >>  3.828 ±
>>> > >> > >>> 0.030
>>> > >> > >>> > ns/op
>>> > >> > >>> >
>>> > >> > >>> > BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5
>>> > >> 20.611 ±
>>> > >> > >>> 0.188
>>> > >> > >>> > ns/op
>>> > >> > >>> >
>>> > >> > >>> > Float8Benchmarks.copyFromBenchmark   avgt    5  6.597 ±
>>> 0.462
>>> > >> us/op
>>> > >> > >>> >
>>> > >> > >>> > Float8Benchmarks.readWriteBenchmark  avgt    5  2.615 ±
>>> 0.027
>>> > >> us/op
>>> > >> > >>> >
>>> > >> > >>> > On Wed, Aug 7, 2019 at 7:13 PM Fan Liya <
>>> liya.fan03@gmail.com>
>>> > >> > wrote:
>>> > >> > >>> >
>>> > >> > >>> >> Hi Gonzalo,
>>> > >> > >>> >>
>>> > >> > >>> >> Thanks for sharing the performance results.
>>> > >> > >>> >> I am wondering if you have turned off the flag
>>> > >> > >>> >> BoundsChecking#BOUNDS_CHECKING_ENABLED.
>>> > >> > >>> >> If not, the lower throughput should be expected.
>>> > >> > >>> >>
>>> > >> > >>> >> Best,
>>> > >> > >>> >> Liya Fan
>>> > >> > >>> >>
>>> > >> > >>> >> On Wed, Aug 7, 2019 at 10:23 PM Micah Kornfield <
>>> > >> > >>> emkornfield@gmail.com>
>>> > >> > >>> >> wrote:
>>> > >> > >>> >>
>>> > >> > >>> >>> Hi Gonzalo,
>>> > >> > >>> >>> Thank you for the feedback.  I wasn't aware of the JIT
>>> > >> > >>> implications.   At
>>> > >> > >>> >>> least on the benchmark run they don't seem to have an
>>> impact.
>>> > >> > >>> >>>
>>> > >> > >>> >>> If there are other benchmarks that people have that can
>>> > >> validate if
>>> > >> > >>> this
>>> > >> > >>> >>> change will be problematic I would appreciate trying to
>>> run them
>>> > >> > >>> with the
>>> > >> > >>> >>> PR.  I will try to run the ones for zeroing/popcnt
>>> tonight to
>>> > >> see
>>> > >> > if
>>> > >> > >>> >>> there
>>> > >> > >>> >>> is a change in those.
>>> > >> > >>> >>>
>>> > >> > >>> >>> -Micah
>>> > >> > >>> >>>
>>> > >> > >>> >>>
>>> > >> > >>> >>>
>>> > >> > >>> >>> On Wednesday, August 7, 2019, Gonzalo Ortiz Jaureguizar <
>>> > >> > >>> >>> golthiryus@gmail.com> wrote:
>>> > >> > >>> >>>
>>> > >> > >>> >>> > I would recommend to take care with this kind of
>>> changes.
>>> > >> > >>> >>> >
>>> > >> > >>> >>> > I didn't try Arrow in more than one year, but by then
>>> the
>>> > >> > >>> performance
>>> > >> > >>> >>> was
>>> > >> > >>> >>> > quite bad in comparison with plain byte buffer access
>>> > >> > >>> >>> > (see
>>> http://git.net/apache-arrow-development/msg02353.html *)
>>> > >> > and
>>> > >> > >>> >>> > there are several optimizations that the JVM
>>> (specifically,
>>> > >> C2)
>>> > >> > >>> does
>>> > >> > >>> >>> not
>>> > >> > >>> >>> > apply when dealing with int instead of longs. One of the
>>> > >> > >>> >>> > most commons is the loop unrolling and vectorization.
>>> > >> > >>> >>> >
>>> > >> > >>> >>> > * It doesn't seem the best way to reference an old
>>> email on
>>> > >> the
>>> > >> > >>> list,
>>> > >> > >>> >>> but
>>> > >> > >>> >>> > it is the only result shown by Google
>>> > >> > >>> >>> >
>>> > >> > >>> >>> > El mié., 7 ago. 2019 a las 11:42, Fan Liya (<
>>> > >> > liya.fan03@gmail.com
>>> > >> > >>> >)
>>> > >> > >>> >>> > escribió:
>>> > >> > >>> >>> >
>>> > >> > >>> >>> >> Hi Micah,
>>> > >> > >>> >>> >>
>>> > >> > >>> >>> >> Thanks for your effort. The performance result looks
>>> good.
>>> > >> > >>> >>> >>
>>> > >> > >>> >>> >> As you indicated, ArrowBuf will take additional 12
>>> bytes (4
>>> > >> > bytes
>>> > >> > >>> for
>>> > >> > >>> >>> each
>>> > >> > >>> >>> >> of length, write index, and read index).
>>> > >> > >>> >>> >> Similar overheads also exist for vectors like
>>> > >> > >>> BaseFixedWidthVector,
>>> > >> > >>> >>> >> BaseVariableWidthVector, etc.
>>> > >> > >>> >>> >>
>>> > >> > >>> >>> >> IMO, such overheads are small enough to justify the
>>> change.
>>> > >> > >>> >>> >> Let's check if there are other overheads.
>>> > >> > >>> >>> >>
>>> > >> > >>> >>> >> Best,
>>> > >> > >>> >>> >> Liya Fan
>>> > >> > >>> >>> >>
>>> > >> > >>> >>> >> On Wed, Aug 7, 2019 at 3:30 PM Micah Kornfield <
>>> > >> > >>> emkornfield@gmail.com
>>> > >> > >>> >>> >
>>> > >> > >>> >>> >> wrote:
>>> > >> > >>> >>> >>
>>> > >> > >>> >>> >> > Hi Liya Fan,
>>> > >> > >>> >>> >> > Based on the Float8Benchmark there does not seem to
>>> be any
>>> > >> > >>> >>> meaningful
>>> > >> > >>> >>> >> > performance difference on my machine.  At least for
>>> me, the
>>> > >> > >>> >>> benchmarks
>>> > >> > >>> >>> >> are
>>> > >> > >>> >>> >> > not stable enough to say one is faster than the
>>> other (I've
>>> > >> > >>> pasted
>>> > >> > >>> >>> >> results
>>> > >> > >>> >>> >> > below).  That being said my machine isn't
>>> necessarily the
>>> > >> most
>>> > >> > >>> >>> reliable
>>> > >> > >>> >>> >> for
>>> > >> > >>> >>> >> > benchmarking.
>>> > >> > >>> >>> >> >
>>> > >> > >>> >>> >> > On an intuitive level, this makes sense to me,  for
>>> the
>>> > >> most
>>> > >> > >>> part it
>>> > >> > >>> >>> >> seems
>>> > >> > >>> >>> >> > like the change just moves casting from "int" to
>>> "long"
>>> > >> > further
>>> > >> > >>> up
>>> > >> > >>> >>> the
>>> > >> > >>> >>> >> > stack  for  "PlatformDepdendent" operations.  If
>>> there are
>>> > >> > other
>>> > >> > >>> >>> >> benchmarks
>>> > >> > >>> >>> >> > that you think are worth running let me know.
>>> > >> > >>> >>> >> >
>>> > >> > >>> >>> >> > One downside performance wise I think for his change
>>> is it
>>> > >> > >>> >>> increases the
>>> > >> > >>> >>> >> > size of ArrowBuf objects, which I suppose could
>>> influence
>>> > >> > cache
>>> > >> > >>> >>> misses
>>> > >> > >>> >>> >> at
>>> > >> > >>> >>> >> > some level or increase the size of call-stacks, but
>>> this
>>> > >> > doesn't
>>> > >> > >>> >>> seem to
>>> > >> > >>> >>> >> > show up in the benchmark..
>>> > >> > >>> >>> >> >
>>> > >> > >>> >>> >> > Thanks,
>>> > >> > >>> >>> >> > Micah
>>> > >> > >>> >>> >> >
>>> > >> > >>> >>> >> > Sample benchmark numbers:
>>> > >> > >>> >>> >> >
>>> > >> > >>> >>> >> > [New Code]
>>> > >> > >>> >>> >> > Benchmark                            Mode  Cnt
>>>  Score
>>> > >>  Error
>>> > >> > >>> >>> Units
>>> > >> > >>> >>> >> > Float8Benchmarks.copyFromBenchmark   avgt    5
>>> 15.441 ±
>>> > >> 0.469
>>> > >> > >>> >>> us/op
>>> > >> > >>> >>> >> > Float8Benchmarks.readWriteBenchmark  avgt    5
>>> 14.057 ±
>>> > >> 0.115
>>> > >> > >>> >>> us/op
>>> > >> > >>> >>> >> >
>>> > >> > >>> >>> >> >
>>> > >> > >>> >>> >> > [Old code]
>>> > >> > >>> >>> >> > Benchmark                            Mode  Cnt
>>>  Score
>>> > >>  Error
>>> > >> > >>> >>> Units
>>> > >> > >>> >>> >> > Float8Benchmarks.copyFromBenchmark   avgt    5
>>> 16.248 ±
>>> > >> 1.409
>>> > >> > >>> >>> us/op
>>> > >> > >>> >>> >> > Float8Benchmarks.readWriteBenchmark  avgt    5
>>> 14.150 ±
>>> > >> 0.084
>>> > >> > >>> >>> us/op
>>> > >> > >>> >>> >> >
>>> > >> > >>> >>> >> > On Tue, Aug 6, 2019 at 1:18 AM Fan Liya <
>>> > >> liya.fan03@gmail.com
>>> > >> > >
>>> > >> > >>> >>> wrote:
>>> > >> > >>> >>> >> >
>>> > >> > >>> >>> >> >> Hi Micah,
>>> > >> > >>> >>> >> >>
>>> > >> > >>> >>> >> >> Thanks a lot for doing this.
>>> > >> > >>> >>> >> >>
>>> > >> > >>> >>> >> >> I am a little concerned about if there is any
>>> negative
>>> > >> > >>> performance
>>> > >> > >>> >>> >> impact
>>> > >> > >>> >>> >> >> on the current 32-bit-length based applications.
>>> > >> > >>> >>> >> >> Can we do some performance comparison on our
>>> existing
>>> > >> > >>> benchmarks?
>>> > >> > >>> >>> >> >>
>>> > >> > >>> >>> >> >> Best,
>>> > >> > >>> >>> >> >> Liya Fan
>>> > >> > >>> >>> >> >>
>>> > >> > >>> >>> >> >>
>>> > >> > >>> >>> >> >> On Tue, Aug 6, 2019 at 3:35 PM Micah Kornfield <
>>> > >> > >>> >>> emkornfield@gmail.com>
>>> > >> > >>> >>> >> >> wrote:
>>> > >> > >>> >>> >> >>
>>> > >> > >>> >>> >> >>> There have been some previous discussions on the
>>> mailing
>>> > >> > about
>>> > >> > >>> >>> >> supporting
>>> > >> > >>> >>> >> >>> 64-bit lengths for  Java ValueVectors (this is
>>> what the
>>> > >> IPC
>>> > >> > >>> >>> >> specification
>>> > >> > >>> >>> >> >>> and C++ support).  I created a PR [1] that changes
>>> all
>>> > >> APIs
>>> > >> > >>> that I
>>> > >> > >>> >>> >> could
>>> > >> > >>> >>> >> >>> find that take an index to take an "long" instead
>>> of an
>>> > >> > "int"
>>> > >> > >>> (and
>>> > >> > >>> >>> >> >>> similarly change "size/rowcount" APIs).
>>> > >> > >>> >>> >> >>>
>>> > >> > >>> >>> >> >>> It is a big change, so I think it is worth
>>> discussing if
>>> > >> it
>>> > >> > is
>>> > >> > >>> >>> >> something
>>> > >> > >>> >>> >> >>> we
>>> > >> > >>> >>> >> >>> still want to move forward with.  It would be nice
>>> to
>>> > >> come
>>> > >> > to
>>> > >> > >>> a
>>> > >> > >>> >>> >> >>> conclusion
>>> > >> > >>> >>> >> >>> quickly, ideally in the next few days, to avoid a
>>> lot of
>>> > >> > merge
>>> > >> > >>> >>> >> conflicts.
>>> > >> > >>> >>> >> >>>
>>> > >> > >>> >>> >> >>> The reason I did this work now is the C++
>>> implementation
>>> > >> has
>>> > >> > >>> added
>>> > >> > >>> >>> >> >>> support
>>> > >> > >>> >>> >> >>> for LargeList, LargeBinary and LargeString arrays
>>> and
>>> > >> based
>>> > >> > on
>>> > >> > >>> >>> prior
>>> > >> > >>> >>> >> >>> discussions we need to have similar support in Java
>>> > >> before
>>> > >> > our
>>> > >> > >>> >>> next
>>> > >> > >>> >>> >> >>> release. Support 64-bit indexes means we can have
>>> full
>>> > >> > >>> >>> compatibility
>>> > >> > >>> >>> >> and
>>> > >> > >>> >>> >> >>> make the most use of the types in Java.
>>> > >> > >>> >>> >> >>>
>>> > >> > >>> >>> >> >>> Look forward to hearing feedback.
>>> > >> > >>> >>> >> >>>
>>> > >> > >>> >>> >> >>> Thanks,
>>> > >> > >>> >>> >> >>> Micah
>>> > >> > >>> >>> >> >>>
>>> > >> > >>> >>> >> >>> [1] https://github.com/apache/arrow/pull/5020
>>> > >> > >>> >>> >> >>>
>>> > >> > >>> >>> >> >>
>>> > >> > >>> >>> >>
>>> > >> > >>> >>> >
>>> > >> > >>> >>>
>>> > >> > >>> >>
>>> > >> > >>>
>>> > >> > >>
>>> > >> >
>>> > >>
>>> > >
>>>
>>

Re: [Discuss][Java] 64-bit lengths for ValueVectors

Posted by Fan Liya <li...@gmail.com>.

I think the 2GB limit is overly restrictive for modern computers.
This is a problem we must face anyway.

Best,
Liya Fan

On Fri, Nov 15, 2019 at 5:07 PM Micah Kornfield <em...@gmail.com>
wrote:

> Apologies for the long delay, I chose to do the minimal work of limiting
> this change [1] to allowing ArrowBuf to 64-bit lengths.  This would unblock
> work on LargeString and LargeBinary.  If this change looks OK, I think
> there is some follow-up work to add more thorough unit/integration tests.
>
> As an aside, it does seem like the 2GB limit is affecting some users in
> Spark [2][3], so hopefully LargeString would help with this.
>
> Allowing more than MAX_INT elements is Vectors/array still a blocker for
> making LargeList useful.
>
> Thanks,
> Micah
>
> [1]  https://github.com/apache/arrow/pull/5020
> [2]
> https://stackoverflow.com/questions/58739888/spark-is-it-possible-to-increase-pyarrow-buffer#comment103812119_58739888
> [3] https://issues.apache.org/jira/browse/ARROW-4890
>
> On Fri, Aug 23, 2019 at 8:33 AM Jacques Nadeau <ja...@apache.org> wrote:
>
>>
>>
>> On Fri, Aug 23, 2019, 8:55 PM Micah Kornfield <em...@gmail.com>
>> wrote:
>>
>>> The vector indexes being limited to 32 bits doesn't limit the addressing
>>>> to 32 bit chunks of memory. For example, you're prime example before was
>>>> image data. Having 2 billion images of 1mb images would still be supported
>>>> without changing the index addressing.
>>>
>>> This might be pre-coffee math, but I think we are limited to
>>> approximately 2000 images because an ArrowBuf only holds up-to 2 billion
>>> bytes [1].  While we have plenty of room for the offsets, we quickly run
>>> out of contiguous data storage space. For LargeString and LargeBinary this
>>> could be fixed by changing ArrowBuf.
>>>
>>> LargeArray faces the same problem only it applies to its child vectors.
>>> Supporting LargeArray properly is really what drove the large-scale
>>> interface change.
>>>
>>
>> My expressed concern about these changes was specifically about the use
>> of long for get/set in the vector interfaces. I'm not saying that we
>> constrain memory/ArrowBufs to 32bits.
>>
>>
>>> Rebase would help if possible.
>>>
>>> I'll try to get to this in the next few days.
>>>
>>> [1]
>>> https://github.com/apache/arrow/blob/95175fe7cb8439eebe6d2f6e0495f551d6864380/java/memory/src/main/java/io/netty/buffer/ArrowBuf.java#L164
>>>
>>> On Fri, Aug 23, 2019 at 4:55 AM Jacques Nadeau <ja...@apache.org>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Fri, Aug 23, 2019, 11:49 AM Micah Kornfield <em...@gmail.com>
>>>> wrote:
>>>>
>>>>> I don't think we should couple this discussion with the implementation
>>>>>> of large list, etc since I think those two concepts are independent.
>>>>>
>>>>> I'm still trying to balance in my mind which is a worse experience for
>>>>> consumers of the libraries for these types.  Claiming that Java supports
>>>>> these types but throwing an exception when the Vectors exceed 32-bits or
>>>>> just say they aren't supported until we have 64-bit support in Java.
>>>>>
>>>>
>>>> The vector indexes being limited to 32 bits doesn't limit the
>>>> addressing to 32 bit chunks of memory. For example, you're prime example
>>>> before was image data. Having 2 billion images of 1mb images would still be
>>>> supported without changing the index addressing.
>>>>
>>>>
>>>>
>>>>>
>>>>>> I've asked some others on my team their opinions on the risk here. I
>>>>>> think we should probably review some our more complex vector interactions
>>>>>> and see how the jvm's assembly changes with this kind of change. Using
>>>>>> microbenchmarking is good but I think we also need to see whether we're
>>>>>> constantly inserting additional instructions or if in most cases, this
>>>>>> actually doesn't impact instruction count.
>>>>>
>>>>>
>>>>> Is this something that your team will take on?
>>>>>
>>>>
>>>>
>>>> Yeah, we need to look at this I think.
>>>>
>>>> Do you need a rebased version of the PR or is the existing one
>>>>> sufficient?
>>>>>
>>>>
>>>> Rebase would help if possible.
>>>>
>>>>
>>>>
>>>>> Thanks,
>>>>> Micah
>>>>>
>>>>> On Thu, Aug 22, 2019 at 8:56 PM Jacques Nadeau <ja...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> I don't think we should couple this discussion with the
>>>>>> implementation of large list, etc since I think those two concepts are
>>>>>> independent.
>>>>>>
>>>>>> I've asked some others on my team their opinions on the risk here. I
>>>>>> think we should probably review some our more complex vector interactions
>>>>>> and see how the jvm's assembly changes with this kind of change. Using
>>>>>> microbenchmarking is good but I think we also need to see whether we're
>>>>>> constantly inserting additional instructions or if in most cases, this
>>>>>> actually doesn't impact instruction count.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Aug 21, 2019 at 12:18 PM Micah Kornfield <
>>>>>> emkornfield@gmail.com> wrote:
>>>>>>
>>>>>>>
>>>>>>>> With regards to the reference implementation point. It is a good
>>>>>>>> point. I'm on vacation this week. Unless you're pushing hard on this, can
>>>>>>>> we pick this up and discuss more next week?
>>>>>>>
>>>>>>>
>>>>>>> Hi Jacques, I hope you had a good rest.  Any more thoughts on the
>>>>>>> reference implementation aspect of this?
>>>>>>>
>>>>>>>
>>>>>>>> To copy the sentiments from the 0.15.0 release thread, I think it
>>>>>>>> would be best to decouple this discussion from the release timeline
>>>>>>>> given how many people we have relying on regular releases coming
>>>>>>>> out.
>>>>>>>> We can keep continue making major 0.x releases until we're ready to
>>>>>>>> release 1.0.0.
>>>>>>>
>>>>>>>
>>>>>>> I'm OK with it as long as other stakeholders are. Timed releases are
>>>>>>> the way to go.  As stated on the release thread [1] we need a better
>>>>>>> mechanism to avoid this type of issue arising again.  The release thread
>>>>>>> also had some more discussion on compatibility.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Micah
>>>>>>>
>>>>>>> [1]
>>>>>>> https://lists.apache.org/thread.html/d70feeceaf2570906ade117030b29887af7c77ca5c4a976e6d555920@%3Cdev.arrow.apache.org%3E
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Aug 14, 2019 at 3:23 PM Wes McKinney <we...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> On Sun, Aug 11, 2019 at 9:40 PM Micah Kornfield <
>>>>>>>> emkornfield@gmail.com> wrote:
>>>>>>>> >
>>>>>>>> > Hi Wes and Jacques,
>>>>>>>> > See responses below.
>>>>>>>> >
>>>>>>>> > With regards to the reference implementation point. It is a good
>>>>>>>> point. I'm
>>>>>>>> > > on vacation this week. Unless you're pushing hard on this, can
>>>>>>>> we pick this
>>>>>>>> > > up and discuss more next week?
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > Sure thing, enjoy your vacation.  I think the only practical
>>>>>>>> implications
>>>>>>>> > are it delays choices around implementing LargeList, LargeBinary,
>>>>>>>> > LargeString in Java, which in turn might push out the 0.15.0
>>>>>>>> release.
>>>>>>>> >
>>>>>>>>
>>>>>>>> To copy the sentiments from the 0.15.0 release thread, I think it
>>>>>>>> would be best to decouple this discussion from the release timeline
>>>>>>>> given how many people we have relying on regular releases coming
>>>>>>>> out.
>>>>>>>> We can keep continue making major 0.x releases until we're ready to
>>>>>>>> release 1.0.0.
>>>>>>>>
>>>>>>>> > My stance on this is that I don't know how important it is for
>>>>>>>> Java to
>>>>>>>> > > support vectors over INT32_MAX elements. The use cases enabled
>>>>>>>> by
>>>>>>>> > > having very large arrays seem to be concentrated in the native
>>>>>>>> code
>>>>>>>> > > world (e.g. C/C++/Rust) -- that could just be
>>>>>>>> implementation-centrism
>>>>>>>> > > on my part, though.
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > A data point against this view is Spark has done work to
>>>>>>>> eliminate 2GB
>>>>>>>> > memory limits on its block sizes [1].  I don't claim to
>>>>>>>> understand the
>>>>>>>> > implications of this. Bryan might you have any thoughts here?
>>>>>>>> I'm OK with
>>>>>>>> > INT32_MAX, as well, I think we should think about what this means
>>>>>>>> for
>>>>>>>> > adding Large types to Java and implications for reference
>>>>>>>> implementations.
>>>>>>>> >
>>>>>>>> > Thanks,
>>>>>>>> > Micah
>>>>>>>> >
>>>>>>>> > [1] https://issues.apache.org/jira/browse/SPARK-6235
>>>>>>>> >
>>>>>>>> > On Sun, Aug 11, 2019 at 6:31 PM Jacques Nadeau <
>>>>>>>> jacques@apache.org> wrote:
>>>>>>>> >
>>>>>>>> > > Hey Micah,
>>>>>>>> > >
>>>>>>>> > > Appreciate the offer on the compiling. The reality is I'm more
>>>>>>>> concerned
>>>>>>>> > > about the unknowns than the compiling issue itself. Any time
>>>>>>>> you've been
>>>>>>>> > > tuning for a while, changing something like this could be
>>>>>>>> totally fine or
>>>>>>>> > > cause a couple of major issues. For example, we've done a very
>>>>>>>> large amount
>>>>>>>> > > of work reducing heap memory footprint of the vectors. Are
>>>>>>>> target is to
>>>>>>>> > > actually get it down to 24 bytes per ArrowBuf and 24 bytes heap
>>>>>>>> per vector
>>>>>>>> > > (not including arrow bufs).
>>>>>>>> > >
>>>>>>>> > > With regards to the reference implementation point. It is a
>>>>>>>> good point.
>>>>>>>> > > I'm on vacation this week. Unless you're pushing hard on this,
>>>>>>>> can we pick
>>>>>>>> > > this up and discuss more next week?
>>>>>>>> > >
>>>>>>>> > > thanks,
>>>>>>>> > > Jacques
>>>>>>>> > >
>>>>>>>> > > On Sat, Aug 10, 2019 at 7:39 PM Micah Kornfield <
>>>>>>>> emkornfield@gmail.com>
>>>>>>>> > > wrote:
>>>>>>>> > >
>>>>>>>> > >> Hi Jacques,
>>>>>>>> > >> I definitely understand these concerns and this change is
>>>>>>>> risky because it
>>>>>>>> > >> is so large.  Perhaps, creating a new hierarchy, might be the
>>>>>>>> cleanest way
>>>>>>>> > >> of dealing with this.  This could have other benefits like
>>>>>>>> cleaning up
>>>>>>>> > >> some
>>>>>>>> > >> cruft around dictionary encode and "orphaned" method.   Per
>>>>>>>> past e-mail
>>>>>>>> > >> threads I agree it is beneficial to have 2 separate reference
>>>>>>>> > >> implementations that can communicate fully, and my intent here
>>>>>>>> was to
>>>>>>>> > >> close
>>>>>>>> > >> that gap.
>>>>>>>> > >>
>>>>>>>> > >> Trying to
>>>>>>>> > >> > determine the ramifications of these changes would be
>>>>>>>> challenging and
>>>>>>>> > >> time
>>>>>>>> > >> > consuming against all the different ways we interact with
>>>>>>>> the Arrow Java
>>>>>>>> > >> > library.
>>>>>>>> > >>
>>>>>>>> > >>
>>>>>>>> > >> Understood.  I took a quick look at Dremio-OSS it seems like
>>>>>>>> it has a
>>>>>>>> > >> simple java build system?  If it is helpful, I can try to get
>>>>>>>> a fork
>>>>>>>> > >> running that at least compiles against this PR.  My plan would
>>>>>>>> be to cast
>>>>>>>> > >> any place that was changed to return a long back to an int, so
>>>>>>>> in essence
>>>>>>>> > >> the Dremio algorithms would reman 32-bit implementations.
>>>>>>>> > >>
>>>>>>>> > >> I don't  have the infrastructure to test this change properly
>>>>>>>> from a
>>>>>>>> > >> distributed systems perspective, so it would still take some
>>>>>>>> time from
>>>>>>>> > >> Dremio to validate for regressions.
>>>>>>>> > >>
>>>>>>>> > >> I'm not saying I'm against this but want to make sure we've
>>>>>>>> > >> > explored all less disruptive options before considering
>>>>>>>> changing
>>>>>>>> > >> something
>>>>>>>> > >> > this fundamental (especially when I generally hold the view
>>>>>>>> that large
>>>>>>>> > >> cell
>>>>>>>> > >> > counts against massive contiguous memory is an anti pattern
>>>>>>>> to scalable
>>>>>>>> > >> > analytical processing--purely subjective of course).
>>>>>>>> > >>
>>>>>>>> > >>
>>>>>>>> > >> I'm open to other ideas here, as well. I don't think it is out
>>>>>>>> of the
>>>>>>>> > >> question to leave the Java implementation as 32-bit, but if we
>>>>>>>> do, then I
>>>>>>>> > >> think we should consider a different strategy for reference
>>>>>>>> > >> implementations.
>>>>>>>> > >>
>>>>>>>> > >> Thanks,
>>>>>>>> > >> Micah
>>>>>>>> > >>
>>>>>>>> > >> On Sat, Aug 10, 2019 at 5:09 PM Jacques Nadeau <
>>>>>>>> jacques@apache.org>
>>>>>>>> > >> wrote:
>>>>>>>> > >>
>>>>>>>> > >> > Hey Micah, I didn't have a particular path in mind. Was
>>>>>>>> thinking more
>>>>>>>> > >> along
>>>>>>>> > >> > the lines of extra methods as opposed to separate classes.
>>>>>>>> > >> >
>>>>>>>> > >> > Arrow hasn't historically been a place where we're writing
>>>>>>>> algorithms in
>>>>>>>> > >> > Java so the fact that they aren't there doesn't mean they
>>>>>>>> don't exist.
>>>>>>>> > >> We
>>>>>>>> > >> > have a large amount of code that depends on the current
>>>>>>>> behavior that is
>>>>>>>> > >> > deployed in hundreds of customer clusters (you can peruse
>>>>>>>> our dremio
>>>>>>>> > >> repo
>>>>>>>> > >> > to see how extensively we leverage Arrow if interested).
>>>>>>>> Trying to
>>>>>>>> > >> > determine the ramifications of these changes would be
>>>>>>>> challenging and
>>>>>>>> > >> time
>>>>>>>> > >> > consuming against all the different ways we interact with
>>>>>>>> the Arrow Java
>>>>>>>> > >> > library. I'm not saying I'm against this but want to make
>>>>>>>> sure we've
>>>>>>>> > >> > explored all less disruptive options before considering
>>>>>>>> changing
>>>>>>>> > >> something
>>>>>>>> > >> > this fundamental (especially when I generally hold the view
>>>>>>>> that large
>>>>>>>> > >> cell
>>>>>>>> > >> > counts against massive contiguous memory is an anti pattern
>>>>>>>> to scalable
>>>>>>>> > >> > analytical processing--purely subjective of course).
>>>>>>>> > >> >
>>>>>>>> > >> > On Sat, Aug 10, 2019, 4:17 PM Micah Kornfield <
>>>>>>>> emkornfield@gmail.com>
>>>>>>>> > >> > wrote:
>>>>>>>> > >> >
>>>>>>>> > >> > > Hi Jacques,
>>>>>>>> > >> > > What avenue were you thinking for supporting both paths?
>>>>>>>>  I didn't
>>>>>>>> > >> want
>>>>>>>> > >> > > to pursue a different class hierarchy, because I felt like
>>>>>>>> that would
>>>>>>>> > >> > > effectively fork the code base, but that is potentially an
>>>>>>>> option that
>>>>>>>> > >> > > would allow us to have a complete reference implementation
>>>>>>>> in Java
>>>>>>>> > >> that
>>>>>>>> > >> > can
>>>>>>>> > >> > > fully interact with C++, without major changes to this
>>>>>>>> code.
>>>>>>>> > >> > >
>>>>>>>> > >> > > For supporting both APIs on the same classes/interfaces, I
>>>>>>>> think they
>>>>>>>> > >> > > roughly fall into three categories, changes to input
>>>>>>>> parameters,
>>>>>>>> > >> changes
>>>>>>>> > >> > to
>>>>>>>> > >> > > output parameters and algorithm changes.
>>>>>>>> > >> > >
>>>>>>>> > >> > > For inputs, changing from int to long is essentially a
>>>>>>>> no-op from the
>>>>>>>> > >> > > compiler perspective.  From the limited micro-benchmarking
>>>>>>>> this also
>>>>>>>> > >> > > doesn't seem to have a performance impact.  So we could
>>>>>>>> keep two
>>>>>>>> > >> versions
>>>>>>>> > >> > > of the methods that only differ on inputs, but it is not
>>>>>>>> clear what
>>>>>>>> > >> the
>>>>>>>> > >> > > value of that would be.
>>>>>>>> > >> > >
>>>>>>>> > >> > > For outputs, we can't support methods "long getLength()"
>>>>>>>> and "int
>>>>>>>> > >> > > getLength()" in the same class, so we would be forced into
>>>>>>>> something
>>>>>>>> > >> like
>>>>>>>> > >> > > "long getLength(boolean dummy)" which I think is a less
>>>>>>>> desirable.
>>>>>>>> > >> > >
>>>>>>>> > >> > > For algorithm changes, there did not appear to be too many
>>>>>>>> places
>>>>>>>> > >> where
>>>>>>>> > >> > we
>>>>>>>> > >> > > actually loop over all elements (it is quite possible I
>>>>>>>> missed
>>>>>>>> > >> something
>>>>>>>> > >> > > here), the ones that I did find I was able to mitigate
>>>>>>>> performance
>>>>>>>> > >> > > penalties as noted above.  Some of the current
>>>>>>>> implementation will
>>>>>>>> > >> get a
>>>>>>>> > >> > > lot slower for "large arrays", but we can likely fix those
>>>>>>>> later or in
>>>>>>>> > >> > this
>>>>>>>> > >> > > PR with a nested while loop instead of 2 for loops.
>>>>>>>> > >> > >
>>>>>>>> > >> > > Thanks,
>>>>>>>> > >> > > Micah
>>>>>>>> > >> > >
>>>>>>>> > >> > >
>>>>>>>> > >> > > On Saturday, August 10, 2019, Jacques Nadeau <
>>>>>>>> jacques@apache.org>
>>>>>>>> > >> wrote:
>>>>>>>> > >> > >
>>>>>>>> > >> > >> This is a pretty massive change to the apis. I wonder how
>>>>>>>> nasty it
>>>>>>>> > >> would
>>>>>>>> > >> > >> be to just support both paths. Have you evaluated how
>>>>>>>> complex that
>>>>>>>> > >> > would be?
>>>>>>>> > >> > >>
>>>>>>>> > >> > >> On Wed, Aug 7, 2019 at 11:08 PM Micah Kornfield <
>>>>>>>> > >> emkornfield@gmail.com>
>>>>>>>> > >> > >> wrote:
>>>>>>>> > >> > >>
>>>>>>>> > >> > >>> After more investigation, it looks like Float8Benchmarks
>>>>>>>> at least
>>>>>>>> > >> on my
>>>>>>>> > >> > >>> machine are within the range of noise.
>>>>>>>> > >> > >>>
>>>>>>>> > >> > >>> For BitVectorHelper I pushed a new commit [1], seems to
>>>>>>>> bring the
>>>>>>>> > >> > >>> BitVectorHelper benchmarks back inline (and even with
>>>>>>>> some
>>>>>>>> > >> improvement
>>>>>>>> > >> > >>> for
>>>>>>>> > >> > >>> getNullCountBenchmark).
>>>>>>>> > >> > >>>
>>>>>>>> > >> > >>> Benchmark                                        Mode
>>>>>>>> Cnt   Score
>>>>>>>> > >> > >>>  Error
>>>>>>>> > >> > >>>  Units
>>>>>>>> > >> > >>> BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt
>>>>>>>> 5   3.821 ±
>>>>>>>> > >> > >>> 0.031
>>>>>>>> > >> > >>>  ns/op
>>>>>>>> > >> > >>> BitVectorHelperBenchmarks.getNullCountBenchmark  avgt
>>>>>>>> 5  14.884 ±
>>>>>>>> > >> > >>> 0.141
>>>>>>>> > >> > >>>  ns/op
>>>>>>>> > >> > >>>
>>>>>>>> > >> > >>> I applied the same pattern to other loops that I could
>>>>>>>> find, and for
>>>>>>>> > >> > any
>>>>>>>> > >> > >>> "for (long" loop on the critical path, I broke it up
>>>>>>>> into two loops.
>>>>>>>> > >> > the
>>>>>>>> > >> > >>> first loop does iteration by integer, the second
>>>>>>>> finishes off for
>>>>>>>> > >> any
>>>>>>>> > >> > >>> long
>>>>>>>> > >> > >>> values.  As a side note it seems like optimization for
>>>>>>>> loops using
>>>>>>>> > >> long
>>>>>>>> > >> > >>> counters at least have a semi-recent open bug for the
>>>>>>>> JVM [2]
>>>>>>>> > >> > >>>
>>>>>>>> > >> > >>> Thanks,
>>>>>>>> > >> > >>> Micah
>>>>>>>> > >> > >>>
>>>>>>>> > >> > >>> [1]
>>>>>>>> > >> > >>>
>>>>>>>> > >> > >>>
>>>>>>>> > >> >
>>>>>>>> > >>
>>>>>>>> https://github.com/apache/arrow/pull/5020/commits/2ea2c1ae83e3baa7b9a99a6d06276d968df41797
>>>>>>>> > >> > >>> [2] https://bugs.openjdk.java.net/browse/JDK-8223051
>>>>>>>> > >> > >>>
>>>>>>>> > >> > >>> On Wed, Aug 7, 2019 at 8:11 PM Micah Kornfield <
>>>>>>>> > >> emkornfield@gmail.com>
>>>>>>>> > >> > >>> wrote:
>>>>>>>> > >> > >>>
>>>>>>>> > >> > >>> > Indeed, the BoundChecking and CheckNullForGet
>>>>>>>> variables can make a
>>>>>>>> > >> > big
>>>>>>>> > >> > >>> > difference.  I didn't initially run the benchmarks
>>>>>>>> with these
>>>>>>>> > >> turned
>>>>>>>> > >> > on
>>>>>>>> > >> > >>> > (you can see the result from above with
>>>>>>>> Float8Benchmarks).  Here
>>>>>>>> > >> are
>>>>>>>> > >> > >>> new
>>>>>>>> > >> > >>> > numbers including with the flags enabled.  It looks
>>>>>>>> like using
>>>>>>>> > >> longs
>>>>>>>> > >> > >>> might
>>>>>>>> > >> > >>> > be a little bit slower, I'll see what I can do to
>>>>>>>> mitigate this.
>>>>>>>> > >> > >>> >
>>>>>>>> > >> > >>> > Ravindra also volunteered to try to benchmark the
>>>>>>>> changes with
>>>>>>>> > >> > Dremio's
>>>>>>>> > >> > >>> > code on today's sync call.
>>>>>>>> > >> > >>> >
>>>>>>>> > >> > >>> > New
>>>>>>>> > >> > >>> >
>>>>>>>> > >> > >>> > Benchmark                                        Mode
>>>>>>>> Cnt   Score
>>>>>>>> > >> > >>>  Error
>>>>>>>> > >> > >>> > Units
>>>>>>>> > >> > >>> >
>>>>>>>> > >> > >>> > BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt
>>>>>>>>   5
>>>>>>>> > >>  4.176 ±
>>>>>>>> > >> > >>> 1.292
>>>>>>>> > >> > >>> > ns/op
>>>>>>>> > >> > >>> >
>>>>>>>> > >> > >>> > BitVectorHelperBenchmarks.getNullCountBenchmark  avgt
>>>>>>>>   5
>>>>>>>> > >> 26.102 ±
>>>>>>>> > >> > >>> 0.700
>>>>>>>> > >> > >>> > ns/op
>>>>>>>> > >> > >>> >
>>>>>>>> > >> > >>> > Float8Benchmarks.copyFromBenchmark   avgt    5  7.398
>>>>>>>> ± 0.084
>>>>>>>> > >> us/op
>>>>>>>> > >> > >>> >
>>>>>>>> > >> > >>> > Float8Benchmarks.readWriteBenchmark  avgt    5  2.711
>>>>>>>> ± 0.057
>>>>>>>> > >> us/op
>>>>>>>> > >> > >>> >
>>>>>>>> > >> > >>> >
>>>>>>>> > >> > >>> >
>>>>>>>> > >> > >>> > old
>>>>>>>> > >> > >>> >
>>>>>>>> > >> > >>> > BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt
>>>>>>>>   5
>>>>>>>> > >>  3.828 ±
>>>>>>>> > >> > >>> 0.030
>>>>>>>> > >> > >>> > ns/op
>>>>>>>> > >> > >>> >
>>>>>>>> > >> > >>> > BitVectorHelperBenchmarks.getNullCountBenchmark  avgt
>>>>>>>>   5
>>>>>>>> > >> 20.611 ±
>>>>>>>> > >> > >>> 0.188
>>>>>>>> > >> > >>> > ns/op
>>>>>>>> > >> > >>> >
>>>>>>>> > >> > >>> > Float8Benchmarks.copyFromBenchmark   avgt    5  6.597
>>>>>>>> ± 0.462
>>>>>>>> > >> us/op
>>>>>>>> > >> > >>> >
>>>>>>>> > >> > >>> > Float8Benchmarks.readWriteBenchmark  avgt    5  2.615
>>>>>>>> ± 0.027
>>>>>>>> > >> us/op
>>>>>>>> > >> > >>> >
>>>>>>>> > >> > >>> > On Wed, Aug 7, 2019 at 7:13 PM Fan Liya <
>>>>>>>> liya.fan03@gmail.com>
>>>>>>>> > >> > wrote:
>>>>>>>> > >> > >>> >
>>>>>>>> > >> > >>> >> Hi Gonzalo,
>>>>>>>> > >> > >>> >>
>>>>>>>> > >> > >>> >> Thanks for sharing the performance results.
>>>>>>>> > >> > >>> >> I am wondering if you have turned off the flag
>>>>>>>> > >> > >>> >> BoundsChecking#BOUNDS_CHECKING_ENABLED.
>>>>>>>> > >> > >>> >> If not, the lower throughput should be expected.
>>>>>>>> > >> > >>> >>
>>>>>>>> > >> > >>> >> Best,
>>>>>>>> > >> > >>> >> Liya Fan
>>>>>>>> > >> > >>> >>
>>>>>>>> > >> > >>> >> On Wed, Aug 7, 2019 at 10:23 PM Micah Kornfield <
>>>>>>>> > >> > >>> emkornfield@gmail.com>
>>>>>>>> > >> > >>> >> wrote:
>>>>>>>> > >> > >>> >>
>>>>>>>> > >> > >>> >>> Hi Gonzalo,
>>>>>>>> > >> > >>> >>> Thank you for the feedback.  I wasn't aware of the
>>>>>>>> JIT
>>>>>>>> > >> > >>> implications.   At
>>>>>>>> > >> > >>> >>> least on the benchmark run they don't seem to have
>>>>>>>> an impact.
>>>>>>>> > >> > >>> >>>
>>>>>>>> > >> > >>> >>> If there are other benchmarks that people have that
>>>>>>>> can
>>>>>>>> > >> validate if
>>>>>>>> > >> > >>> this
>>>>>>>> > >> > >>> >>> change will be problematic I would appreciate trying
>>>>>>>> to run them
>>>>>>>> > >> > >>> with the
>>>>>>>> > >> > >>> >>> PR.  I will try to run the ones for zeroing/popcnt
>>>>>>>> tonight to
>>>>>>>> > >> see
>>>>>>>> > >> > if
>>>>>>>> > >> > >>> >>> there
>>>>>>>> > >> > >>> >>> is a change in those.
>>>>>>>> > >> > >>> >>>
>>>>>>>> > >> > >>> >>> -Micah
>>>>>>>> > >> > >>> >>>
>>>>>>>> > >> > >>> >>>
>>>>>>>> > >> > >>> >>>
>>>>>>>> > >> > >>> >>> On Wednesday, August 7, 2019, Gonzalo Ortiz
>>>>>>>> Jaureguizar <
>>>>>>>> > >> > >>> >>> golthiryus@gmail.com> wrote:
>>>>>>>> > >> > >>> >>>
>>>>>>>> > >> > >>> >>> > I would recommend to take care with this kind of
>>>>>>>> changes.
>>>>>>>> > >> > >>> >>> >
>>>>>>>> > >> > >>> >>> > I didn't try Arrow in more than one year, but by
>>>>>>>> then the
>>>>>>>> > >> > >>> performance
>>>>>>>> > >> > >>> >>> was
>>>>>>>> > >> > >>> >>> > quite bad in comparison with plain byte buffer
>>>>>>>> access
>>>>>>>> > >> > >>> >>> > (see
>>>>>>>> http://git.net/apache-arrow-development/msg02353.html *)
>>>>>>>> > >> > and
>>>>>>>> > >> > >>> >>> > there are several optimizations that the JVM
>>>>>>>> (specifically,
>>>>>>>> > >> C2)
>>>>>>>> > >> > >>> does
>>>>>>>> > >> > >>> >>> not
>>>>>>>> > >> > >>> >>> > apply when dealing with int instead of longs. One
>>>>>>>> of the
>>>>>>>> > >> > >>> >>> > most commons is the loop unrolling and
>>>>>>>> vectorization.
>>>>>>>> > >> > >>> >>> >
>>>>>>>> > >> > >>> >>> > * It doesn't seem the best way to reference an old
>>>>>>>> email on
>>>>>>>> > >> the
>>>>>>>> > >> > >>> list,
>>>>>>>> > >> > >>> >>> but
>>>>>>>> > >> > >>> >>> > it is the only result shown by Google
>>>>>>>> > >> > >>> >>> >
>>>>>>>> > >> > >>> >>> > El mié., 7 ago. 2019 a las 11:42, Fan Liya (<
>>>>>>>> > >> > liya.fan03@gmail.com
>>>>>>>> > >> > >>> >)
>>>>>>>> > >> > >>> >>> > escribió:
>>>>>>>> > >> > >>> >>> >
>>>>>>>> > >> > >>> >>> >> Hi Micah,
>>>>>>>> > >> > >>> >>> >>
>>>>>>>> > >> > >>> >>> >> Thanks for your effort. The performance result
>>>>>>>> looks good.
>>>>>>>> > >> > >>> >>> >>
>>>>>>>> > >> > >>> >>> >> As you indicated, ArrowBuf will take additional
>>>>>>>> 12 bytes (4
>>>>>>>> > >> > bytes
>>>>>>>> > >> > >>> for
>>>>>>>> > >> > >>> >>> each
>>>>>>>> > >> > >>> >>> >> of length, write index, and read index).
>>>>>>>> > >> > >>> >>> >> Similar overheads also exist for vectors like
>>>>>>>> > >> > >>> BaseFixedWidthVector,
>>>>>>>> > >> > >>> >>> >> BaseVariableWidthVector, etc.
>>>>>>>> > >> > >>> >>> >>
>>>>>>>> > >> > >>> >>> >> IMO, such overheads are small enough to justify
>>>>>>>> the change.
>>>>>>>> > >> > >>> >>> >> Let's check if there are other overheads.
>>>>>>>> > >> > >>> >>> >>
>>>>>>>> > >> > >>> >>> >> Best,
>>>>>>>> > >> > >>> >>> >> Liya Fan
>>>>>>>> > >> > >>> >>> >>
>>>>>>>> > >> > >>> >>> >> On Wed, Aug 7, 2019 at 3:30 PM Micah Kornfield <
>>>>>>>> > >> > >>> emkornfield@gmail.com
>>>>>>>> > >> > >>> >>> >
>>>>>>>> > >> > >>> >>> >> wrote:
>>>>>>>> > >> > >>> >>> >>
>>>>>>>> > >> > >>> >>> >> > Hi Liya Fan,
>>>>>>>> > >> > >>> >>> >> > Based on the Float8Benchmark there does not
>>>>>>>> seem to be any
>>>>>>>> > >> > >>> >>> meaningful
>>>>>>>> > >> > >>> >>> >> > performance difference on my machine.  At least
>>>>>>>> for me, the
>>>>>>>> > >> > >>> >>> benchmarks
>>>>>>>> > >> > >>> >>> >> are
>>>>>>>> > >> > >>> >>> >> > not stable enough to say one is faster than the
>>>>>>>> other (I've
>>>>>>>> > >> > >>> pasted
>>>>>>>> > >> > >>> >>> >> results
>>>>>>>> > >> > >>> >>> >> > below).  That being said my machine isn't
>>>>>>>> necessarily the
>>>>>>>> > >> most
>>>>>>>> > >> > >>> >>> reliable
>>>>>>>> > >> > >>> >>> >> for
>>>>>>>> > >> > >>> >>> >> > benchmarking.
>>>>>>>> > >> > >>> >>> >> >
>>>>>>>> > >> > >>> >>> >> > On an intuitive level, this makes sense to me,
>>>>>>>> for the
>>>>>>>> > >> most
>>>>>>>> > >> > >>> part it
>>>>>>>> > >> > >>> >>> >> seems
>>>>>>>> > >> > >>> >>> >> > like the change just moves casting from "int"
>>>>>>>> to "long"
>>>>>>>> > >> > further
>>>>>>>> > >> > >>> up
>>>>>>>> > >> > >>> >>> the
>>>>>>>> > >> > >>> >>> >> > stack  for  "PlatformDepdendent" operations.
>>>>>>>> If there are
>>>>>>>> > >> > other
>>>>>>>> > >> > >>> >>> >> benchmarks
>>>>>>>> > >> > >>> >>> >> > that you think are worth running let me know.
>>>>>>>> > >> > >>> >>> >> >
>>>>>>>> > >> > >>> >>> >> > One downside performance wise I think for his
>>>>>>>> change is it
>>>>>>>> > >> > >>> >>> increases the
>>>>>>>> > >> > >>> >>> >> > size of ArrowBuf objects, which I suppose could
>>>>>>>> influence
>>>>>>>> > >> > cache
>>>>>>>> > >> > >>> >>> misses
>>>>>>>> > >> > >>> >>> >> at
>>>>>>>> > >> > >>> >>> >> > some level or increase the size of call-stacks,
>>>>>>>> but this
>>>>>>>> > >> > doesn't
>>>>>>>> > >> > >>> >>> seem to
>>>>>>>> > >> > >>> >>> >> > show up in the benchmark..
>>>>>>>> > >> > >>> >>> >> >
>>>>>>>> > >> > >>> >>> >> > Thanks,
>>>>>>>> > >> > >>> >>> >> > Micah
>>>>>>>> > >> > >>> >>> >> >
>>>>>>>> > >> > >>> >>> >> > Sample benchmark numbers:
>>>>>>>> > >> > >>> >>> >> >
>>>>>>>> > >> > >>> >>> >> > [New Code]
>>>>>>>> > >> > >>> >>> >> > Benchmark                            Mode  Cnt
>>>>>>>>  Score
>>>>>>>> > >>  Error
>>>>>>>> > >> > >>> >>> Units
>>>>>>>> > >> > >>> >>> >> > Float8Benchmarks.copyFromBenchmark   avgt    5
>>>>>>>> 15.441 ±
>>>>>>>> > >> 0.469
>>>>>>>> > >> > >>> >>> us/op
>>>>>>>> > >> > >>> >>> >> > Float8Benchmarks.readWriteBenchmark  avgt    5
>>>>>>>> 14.057 ±
>>>>>>>> > >> 0.115
>>>>>>>> > >> > >>> >>> us/op
>>>>>>>> > >> > >>> >>> >> >
>>>>>>>> > >> > >>> >>> >> >
>>>>>>>> > >> > >>> >>> >> > [Old code]
>>>>>>>> > >> > >>> >>> >> > Benchmark                            Mode  Cnt
>>>>>>>>  Score
>>>>>>>> > >>  Error
>>>>>>>> > >> > >>> >>> Units
>>>>>>>> > >> > >>> >>> >> > Float8Benchmarks.copyFromBenchmark   avgt    5
>>>>>>>> 16.248 ±
>>>>>>>> > >> 1.409
>>>>>>>> > >> > >>> >>> us/op
>>>>>>>> > >> > >>> >>> >> > Float8Benchmarks.readWriteBenchmark  avgt    5
>>>>>>>> 14.150 ±
>>>>>>>> > >> 0.084
>>>>>>>> > >> > >>> >>> us/op
>>>>>>>> > >> > >>> >>> >> >
>>>>>>>> > >> > >>> >>> >> > On Tue, Aug 6, 2019 at 1:18 AM Fan Liya <
>>>>>>>> > >> liya.fan03@gmail.com
>>>>>>>> > >> > >
>>>>>>>> > >> > >>> >>> wrote:
>>>>>>>> > >> > >>> >>> >> >
>>>>>>>> > >> > >>> >>> >> >> Hi Micah,
>>>>>>>> > >> > >>> >>> >> >>
>>>>>>>> > >> > >>> >>> >> >> Thanks a lot for doing this.
>>>>>>>> > >> > >>> >>> >> >>
>>>>>>>> > >> > >>> >>> >> >> I am a little concerned about if there is any
>>>>>>>> negative
>>>>>>>> > >> > >>> performance
>>>>>>>> > >> > >>> >>> >> impact
>>>>>>>> > >> > >>> >>> >> >> on the current 32-bit-length based
>>>>>>>> applications.
>>>>>>>> > >> > >>> >>> >> >> Can we do some performance comparison on our
>>>>>>>> existing
>>>>>>>> > >> > >>> benchmarks?
>>>>>>>> > >> > >>> >>> >> >>
>>>>>>>> > >> > >>> >>> >> >> Best,
>>>>>>>> > >> > >>> >>> >> >> Liya Fan
>>>>>>>> > >> > >>> >>> >> >>
>>>>>>>> > >> > >>> >>> >> >>
>>>>>>>> > >> > >>> >>> >> >> On Tue, Aug 6, 2019 at 3:35 PM Micah Kornfield
>>>>>>>> <
>>>>>>>> > >> > >>> >>> emkornfield@gmail.com>
>>>>>>>> > >> > >>> >>> >> >> wrote:
>>>>>>>> > >> > >>> >>> >> >>
>>>>>>>> > >> > >>> >>> >> >>> There have been some previous discussions on
>>>>>>>> the mailing
>>>>>>>> > >> > about
>>>>>>>> > >> > >>> >>> >> supporting
>>>>>>>> > >> > >>> >>> >> >>> 64-bit lengths for  Java ValueVectors (this
>>>>>>>> is what the
>>>>>>>> > >> IPC
>>>>>>>> > >> > >>> >>> >> specification
>>>>>>>> > >> > >>> >>> >> >>> and C++ support).  I created a PR [1] that
>>>>>>>> changes all
>>>>>>>> > >> APIs
>>>>>>>> > >> > >>> that I
>>>>>>>> > >> > >>> >>> >> could
>>>>>>>> > >> > >>> >>> >> >>> find that take an index to take an "long"
>>>>>>>> instead of an
>>>>>>>> > >> > "int"
>>>>>>>> > >> > >>> (and
>>>>>>>> > >> > >>> >>> >> >>> similarly change "size/rowcount" APIs).
>>>>>>>> > >> > >>> >>> >> >>>
>>>>>>>> > >> > >>> >>> >> >>> It is a big change, so I think it is worth
>>>>>>>> discussing if
>>>>>>>> > >> it
>>>>>>>> > >> > is
>>>>>>>> > >> > >>> >>> >> something
>>>>>>>> > >> > >>> >>> >> >>> we
>>>>>>>> > >> > >>> >>> >> >>> still want to move forward with.  It would be
>>>>>>>> nice to
>>>>>>>> > >> come
>>>>>>>> > >> > to
>>>>>>>> > >> > >>> a
>>>>>>>> > >> > >>> >>> >> >>> conclusion
>>>>>>>> > >> > >>> >>> >> >>> quickly, ideally in the next few days, to
>>>>>>>> avoid a lot of
>>>>>>>> > >> > merge
>>>>>>>> > >> > >>> >>> >> conflicts.
>>>>>>>> > >> > >>> >>> >> >>>
>>>>>>>> > >> > >>> >>> >> >>> The reason I did this work now is the C++
>>>>>>>> implementation
>>>>>>>> > >> has
>>>>>>>> > >> > >>> added
>>>>>>>> > >> > >>> >>> >> >>> support
>>>>>>>> > >> > >>> >>> >> >>> for LargeList, LargeBinary and LargeString
>>>>>>>> arrays and
>>>>>>>> > >> based
>>>>>>>> > >> > on
>>>>>>>> > >> > >>> >>> prior
>>>>>>>> > >> > >>> >>> >> >>> discussions we need to have similar support
>>>>>>>> in Java
>>>>>>>> > >> before
>>>>>>>> > >> > our
>>>>>>>> > >> > >>> >>> next
>>>>>>>> > >> > >>> >>> >> >>> release. Support 64-bit indexes means we can
>>>>>>>> have full
>>>>>>>> > >> > >>> >>> compatibility
>>>>>>>> > >> > >>> >>> >> and
>>>>>>>> > >> > >>> >>> >> >>> make the most use of the types in Java.
>>>>>>>> > >> > >>> >>> >> >>>
>>>>>>>> > >> > >>> >>> >> >>> Look forward to hearing feedback.
>>>>>>>> > >> > >>> >>> >> >>>
>>>>>>>> > >> > >>> >>> >> >>> Thanks,
>>>>>>>> > >> > >>> >>> >> >>> Micah
>>>>>>>> > >> > >>> >>> >> >>>
>>>>>>>> > >> > >>> >>> >> >>> [1] https://github.com/apache/arrow/pull/5020
>>>>>>>> > >> > >>> >>> >> >>>
>>>>>>>> > >> > >>> >>> >> >>
>>>>>>>> > >> > >>> >>> >>
>>>>>>>> > >> > >>> >>> >
>>>>>>>> > >> > >>> >>>
>>>>>>>> > >> > >>> >>
>>>>>>>> > >> > >>>
>>>>>>>> > >> > >>
>>>>>>>> > >> >
>>>>>>>> > >>
>>>>>>>> > >
>>>>>>>>
>>>>>>>

Re: [Discuss][Java] 64-bit lengths for ValueVectors

Posted by Micah Kornfield <em...@gmail.com>.

I'll also note this isn't quite in final form, I'd still like to add some
more unit tests.

On Fri, Nov 22, 2019 at 11:36 AM Wes McKinney <we...@gmail.com> wrote:

> hi Micah -- it makes sense to limit the scope for the time being to
> permitting LargeString/Binary work to proceed. Jacques, have you had a
> chance to look at this?
>
> On Fri, Nov 15, 2019 at 3:07 AM Micah Kornfield <em...@gmail.com>
> wrote:
> >
> > Apologies for the long delay, I chose to do the minimal work of limiting
> this change [1] to allowing ArrowBuf to 64-bit lengths.  This would unblock
> work on LargeString and LargeBinary.  If this change looks OK, I think
> there is some follow-up work to add more thorough unit/integration tests.
> >
> > As an aside, it does seem like the 2GB limit is affecting some users in
> Spark [2][3], so hopefully LargeString would help with this.
> >
> > Allowing more than MAX_INT elements is Vectors/array still a blocker for
> making LargeList useful.
> >
> > Thanks,
> > Micah
> >
> > [1]  https://github.com/apache/arrow/pull/5020
> > [2]
> https://stackoverflow.com/questions/58739888/spark-is-it-possible-to-increase-pyarrow-buffer#comment103812119_58739888
> > [3] https://issues.apache.org/jira/browse/ARROW-4890
> >
> > On Fri, Aug 23, 2019 at 8:33 AM Jacques Nadeau <ja...@apache.org>
> wrote:
> >>
> >>
> >>
> >> On Fri, Aug 23, 2019, 8:55 PM Micah Kornfield <em...@gmail.com>
> wrote:
> >>>>
> >>>> The vector indexes being limited to 32 bits doesn't limit the
> addressing to 32 bit chunks of memory. For example, you're prime example
> before was image data. Having 2 billion images of 1mb images would still be
> supported without changing the index addressing.
> >>>
> >>> This might be pre-coffee math, but I think we are limited to
> approximately 2000 images because an ArrowBuf only holds up-to 2 billion
> bytes [1].  While we have plenty of room for the offsets, we quickly run
> out of contiguous data storage space. For LargeString and LargeBinary this
> could be fixed by changing ArrowBuf.
> >>>
> >>> LargeArray faces the same problem only it applies to its child
> vectors.  Supporting LargeArray properly is really what drove the
> large-scale interface change.
> >>
> >>
> >> My expressed concern about these changes was specifically about the use
> of long for get/set in the vector interfaces. I'm not saying that we
> constrain memory/ArrowBufs to 32bits.
> >>
> >>>
> >>>> Rebase would help if possible.
> >>>
> >>> I'll try to get to this in the next few days.
> >>>
> >>> [1]
> https://github.com/apache/arrow/blob/95175fe7cb8439eebe6d2f6e0495f551d6864380/java/memory/src/main/java/io/netty/buffer/ArrowBuf.java#L164
> >>>
> >>> On Fri, Aug 23, 2019 at 4:55 AM Jacques Nadeau <ja...@apache.org>
> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On Fri, Aug 23, 2019, 11:49 AM Micah Kornfield <em...@gmail.com>
> wrote:
> >>>>>>
> >>>>>> I don't think we should couple this discussion with the
> implementation of large list, etc since I think those two concepts are
> independent.
> >>>>>
> >>>>> I'm still trying to balance in my mind which is a worse experience
> for consumers of the libraries for these types.  Claiming that Java
> supports these types but throwing an exception when the Vectors exceed
> 32-bits or just say they aren't supported until we have 64-bit support in
> Java.
> >>>>
> >>>>
> >>>> The vector indexes being limited to 32 bits doesn't limit the
> addressing to 32 bit chunks of memory. For example, you're prime example
> before was image data. Having 2 billion images of 1mb images would still be
> supported without changing the index addressing.
> >>>>
> >>>>
> >>>>>
> >>>>>>
> >>>>>> I've asked some others on my team their opinions on the risk here.
> I think we should probably review some our more complex vector interactions
> and see how the jvm's assembly changes with this kind of change. Using
> microbenchmarking is good but I think we also need to see whether we're
> constantly inserting additional instructions or if in most cases, this
> actually doesn't impact instruction count.
> >>>>>
> >>>>>
> >>>>> Is this something that your team will take on?
> >>>>
> >>>>
> >>>>
> >>>> Yeah, we need to look at this I think.
> >>>>
> >>>>> Do you need a rebased version of the PR or is the existing one
> sufficient?
> >>>>
> >>>>
> >>>> Rebase would help if possible.
> >>>>
> >>>>
> >>>>>
> >>>>> Thanks,
> >>>>> Micah
> >>>>>
> >>>>> On Thu, Aug 22, 2019 at 8:56 PM Jacques Nadeau <ja...@apache.org>
> wrote:
> >>>>>>
> >>>>>> I don't think we should couple this discussion with the
> implementation of large list, etc since I think those two concepts are
> independent.
> >>>>>>
> >>>>>> I've asked some others on my team their opinions on the risk here.
> I think we should probably review some our more complex vector interactions
> and see how the jvm's assembly changes with this kind of change. Using
> microbenchmarking is good but I think we also need to see whether we're
> constantly inserting additional instructions or if in most cases, this
> actually doesn't impact instruction count.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Wed, Aug 21, 2019 at 12:18 PM Micah Kornfield <
> emkornfield@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> With regards to the reference implementation point. It is a good
> point. I'm on vacation this week. Unless you're pushing hard on this, can
> we pick this up and discuss more next week?
> >>>>>>>
> >>>>>>>
> >>>>>>> Hi Jacques, I hope you had a good rest.  Any more thoughts on the
> reference implementation aspect of this?
> >>>>>>>
> >>>>>>>>
> >>>>>>>> To copy the sentiments from the 0.15.0 release thread, I think it
> >>>>>>>> would be best to decouple this discussion from the release
> timeline
> >>>>>>>> given how many people we have relying on regular releases coming
> out.
> >>>>>>>> We can keep continue making major 0.x releases until we're ready
> to
> >>>>>>>> release 1.0.0.
> >>>>>>>
> >>>>>>>
> >>>>>>> I'm OK with it as long as other stakeholders are. Timed releases
> are the way to go.  As stated on the release thread [1] we need a better
> mechanism to avoid this type of issue arising again.  The release thread
> also had some more discussion on compatibility.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Micah
> >>>>>>>
> >>>>>>> [1]
> https://lists.apache.org/thread.html/d70feeceaf2570906ade117030b29887af7c77ca5c4a976e6d555920@%3Cdev.arrow.apache.org%3E
> >>>>>>>
> >>>>>>>
> >>>>>>> On Wed, Aug 14, 2019 at 3:23 PM Wes McKinney <we...@gmail.com>
> wrote:
> >>>>>>>>
> >>>>>>>> On Sun, Aug 11, 2019 at 9:40 PM Micah Kornfield <
> emkornfield@gmail.com> wrote:
> >>>>>>>> >
> >>>>>>>> > Hi Wes and Jacques,
> >>>>>>>> > See responses below.
> >>>>>>>> >
> >>>>>>>> > With regards to the reference implementation point. It is a
> good point. I'm
> >>>>>>>> > > on vacation this week. Unless you're pushing hard on this,
> can we pick this
> >>>>>>>> > > up and discuss more next week?
> >>>>>>>> >
> >>>>>>>> >
> >>>>>>>> > Sure thing, enjoy your vacation.  I think the only practical
> implications
> >>>>>>>> > are it delays choices around implementing LargeList,
> LargeBinary,
> >>>>>>>> > LargeString in Java, which in turn might push out the 0.15.0
> release.
> >>>>>>>> >
> >>>>>>>>
> >>>>>>>> To copy the sentiments from the 0.15.0 release thread, I think it
> >>>>>>>> would be best to decouple this discussion from the release
> timeline
> >>>>>>>> given how many people we have relying on regular releases coming
> out.
> >>>>>>>> We can keep continue making major 0.x releases until we're ready
> to
> >>>>>>>> release 1.0.0.
> >>>>>>>>
> >>>>>>>> > My stance on this is that I don't know how important it is for
> Java to
> >>>>>>>> > > support vectors over INT32_MAX elements. The use cases
> enabled by
> >>>>>>>> > > having very large arrays seem to be concentrated in the
> native code
> >>>>>>>> > > world (e.g. C/C++/Rust) -- that could just be
> implementation-centrism
> >>>>>>>> > > on my part, though.
> >>>>>>>> >
> >>>>>>>> >
> >>>>>>>> > A data point against this view is Spark has done work to
> eliminate 2GB
> >>>>>>>> > memory limits on its block sizes [1].  I don't claim to
> understand the
> >>>>>>>> > implications of this. Bryan might you have any thoughts here?
> I'm OK with
> >>>>>>>> > INT32_MAX, as well, I think we should think about what this
> means for
> >>>>>>>> > adding Large types to Java and implications for reference
> implementations.
> >>>>>>>> >
> >>>>>>>> > Thanks,
> >>>>>>>> > Micah
> >>>>>>>> >
> >>>>>>>> > [1] https://issues.apache.org/jira/browse/SPARK-6235
> >>>>>>>> >
> >>>>>>>> > On Sun, Aug 11, 2019 at 6:31 PM Jacques Nadeau <
> jacques@apache.org> wrote:
> >>>>>>>> >
> >>>>>>>> > > Hey Micah,
> >>>>>>>> > >
> >>>>>>>> > > Appreciate the offer on the compiling. The reality is I'm
> more concerned
> >>>>>>>> > > about the unknowns than the compiling issue itself. Any time
> you've been
> >>>>>>>> > > tuning for a while, changing something like this could be
> totally fine or
> >>>>>>>> > > cause a couple of major issues. For example, we've done a
> very large amount
> >>>>>>>> > > of work reducing heap memory footprint of the vectors. Are
> target is to
> >>>>>>>> > > actually get it down to 24 bytes per ArrowBuf and 24 bytes
> heap per vector
> >>>>>>>> > > (not including arrow bufs).
> >>>>>>>> > >
> >>>>>>>> > > With regards to the reference implementation point. It is a
> good point.
> >>>>>>>> > > I'm on vacation this week. Unless you're pushing hard on
> this, can we pick
> >>>>>>>> > > this up and discuss more next week?
> >>>>>>>> > >
> >>>>>>>> > > thanks,
> >>>>>>>> > > Jacques
> >>>>>>>> > >
> >>>>>>>> > > On Sat, Aug 10, 2019 at 7:39 PM Micah Kornfield <
> emkornfield@gmail.com>
> >>>>>>>> > > wrote:
> >>>>>>>> > >
> >>>>>>>> > >> Hi Jacques,
> >>>>>>>> > >> I definitely understand these concerns and this change is
> risky because it
> >>>>>>>> > >> is so large.  Perhaps, creating a new hierarchy, might be
> the cleanest way
> >>>>>>>> > >> of dealing with this.  This could have other benefits like
> cleaning up
> >>>>>>>> > >> some
> >>>>>>>> > >> cruft around dictionary encode and "orphaned" method.   Per
> past e-mail
> >>>>>>>> > >> threads I agree it is beneficial to have 2 separate reference
> >>>>>>>> > >> implementations that can communicate fully, and my intent
> here was to
> >>>>>>>> > >> close
> >>>>>>>> > >> that gap.
> >>>>>>>> > >>
> >>>>>>>> > >> Trying to
> >>>>>>>> > >> > determine the ramifications of these changes would be
> challenging and
> >>>>>>>> > >> time
> >>>>>>>> > >> > consuming against all the different ways we interact with
> the Arrow Java
> >>>>>>>> > >> > library.
> >>>>>>>> > >>
> >>>>>>>> > >>
> >>>>>>>> > >> Understood.  I took a quick look at Dremio-OSS it seems like
> it has a
> >>>>>>>> > >> simple java build system?  If it is helpful, I can try to
> get a fork
> >>>>>>>> > >> running that at least compiles against this PR.  My plan
> would be to cast
> >>>>>>>> > >> any place that was changed to return a long back to an int,
> so in essence
> >>>>>>>> > >> the Dremio algorithms would reman 32-bit implementations.
> >>>>>>>> > >>
> >>>>>>>> > >> I don't  have the infrastructure to test this change
> properly from a
> >>>>>>>> > >> distributed systems perspective, so it would still take some
> time from
> >>>>>>>> > >> Dremio to validate for regressions.
> >>>>>>>> > >>
> >>>>>>>> > >> I'm not saying I'm against this but want to make sure we've
> >>>>>>>> > >> > explored all less disruptive options before considering
> changing
> >>>>>>>> > >> something
> >>>>>>>> > >> > this fundamental (especially when I generally hold the
> view that large
> >>>>>>>> > >> cell
> >>>>>>>> > >> > counts against massive contiguous memory is an anti
> pattern to scalable
> >>>>>>>> > >> > analytical processing--purely subjective of course).
> >>>>>>>> > >>
> >>>>>>>> > >>
> >>>>>>>> > >> I'm open to other ideas here, as well. I don't think it is
> out of the
> >>>>>>>> > >> question to leave the Java implementation as 32-bit, but if
> we do, then I
> >>>>>>>> > >> think we should consider a different strategy for reference
> >>>>>>>> > >> implementations.
> >>>>>>>> > >>
> >>>>>>>> > >> Thanks,
> >>>>>>>> > >> Micah
> >>>>>>>> > >>
> >>>>>>>> > >> On Sat, Aug 10, 2019 at 5:09 PM Jacques Nadeau <
> jacques@apache.org>
> >>>>>>>> > >> wrote:
> >>>>>>>> > >>
> >>>>>>>> > >> > Hey Micah, I didn't have a particular path in mind. Was
> thinking more
> >>>>>>>> > >> along
> >>>>>>>> > >> > the lines of extra methods as opposed to separate classes.
> >>>>>>>> > >> >
> >>>>>>>> > >> > Arrow hasn't historically been a place where we're writing
> algorithms in
> >>>>>>>> > >> > Java so the fact that they aren't there doesn't mean they
> don't exist.
> >>>>>>>> > >> We
> >>>>>>>> > >> > have a large amount of code that depends on the current
> behavior that is
> >>>>>>>> > >> > deployed in hundreds of customer clusters (you can peruse
> our dremio
> >>>>>>>> > >> repo
> >>>>>>>> > >> > to see how extensively we leverage Arrow if interested).
> Trying to
> >>>>>>>> > >> > determine the ramifications of these changes would be
> challenging and
> >>>>>>>> > >> time
> >>>>>>>> > >> > consuming against all the different ways we interact with
> the Arrow Java
> >>>>>>>> > >> > library. I'm not saying I'm against this but want to make
> sure we've
> >>>>>>>> > >> > explored all less disruptive options before considering
> changing
> >>>>>>>> > >> something
> >>>>>>>> > >> > this fundamental (especially when I generally hold the
> view that large
> >>>>>>>> > >> cell
> >>>>>>>> > >> > counts against massive contiguous memory is an anti
> pattern to scalable
> >>>>>>>> > >> > analytical processing--purely subjective of course).
> >>>>>>>> > >> >
> >>>>>>>> > >> > On Sat, Aug 10, 2019, 4:17 PM Micah Kornfield <
> emkornfield@gmail.com>
> >>>>>>>> > >> > wrote:
> >>>>>>>> > >> >
> >>>>>>>> > >> > > Hi Jacques,
> >>>>>>>> > >> > > What avenue were you thinking for supporting both
> paths?   I didn't
> >>>>>>>> > >> want
> >>>>>>>> > >> > > to pursue a different class hierarchy, because I felt
> like that would
> >>>>>>>> > >> > > effectively fork the code base, but that is potentially
> an option that
> >>>>>>>> > >> > > would allow us to have a complete reference
> implementation in Java
> >>>>>>>> > >> that
> >>>>>>>> > >> > can
> >>>>>>>> > >> > > fully interact with C++, without major changes to this
> code.
> >>>>>>>> > >> > >
> >>>>>>>> > >> > > For supporting both APIs on the same classes/interfaces,
> I think they
> >>>>>>>> > >> > > roughly fall into three categories, changes to input
> parameters,
> >>>>>>>> > >> changes
> >>>>>>>> > >> > to
> >>>>>>>> > >> > > output parameters and algorithm changes.
> >>>>>>>> > >> > >
> >>>>>>>> > >> > > For inputs, changing from int to long is essentially a
> no-op from the
> >>>>>>>> > >> > > compiler perspective.  From the limited
> micro-benchmarking this also
> >>>>>>>> > >> > > doesn't seem to have a performance impact.  So we could
> keep two
> >>>>>>>> > >> versions
> >>>>>>>> > >> > > of the methods that only differ on inputs, but it is not
> clear what
> >>>>>>>> > >> the
> >>>>>>>> > >> > > value of that would be.
> >>>>>>>> > >> > >
> >>>>>>>> > >> > > For outputs, we can't support methods "long getLength()"
> and "int
> >>>>>>>> > >> > > getLength()" in the same class, so we would be forced
> into something
> >>>>>>>> > >> like
> >>>>>>>> > >> > > "long getLength(boolean dummy)" which I think is a less
> desirable.
> >>>>>>>> > >> > >
> >>>>>>>> > >> > > For algorithm changes, there did not appear to be too
> many places
> >>>>>>>> > >> where
> >>>>>>>> > >> > we
> >>>>>>>> > >> > > actually loop over all elements (it is quite possible I
> missed
> >>>>>>>> > >> something
> >>>>>>>> > >> > > here), the ones that I did find I was able to mitigate
> performance
> >>>>>>>> > >> > > penalties as noted above.  Some of the current
> implementation will
> >>>>>>>> > >> get a
> >>>>>>>> > >> > > lot slower for "large arrays", but we can likely fix
> those later or in
> >>>>>>>> > >> > this
> >>>>>>>> > >> > > PR with a nested while loop instead of 2 for loops.
> >>>>>>>> > >> > >
> >>>>>>>> > >> > > Thanks,
> >>>>>>>> > >> > > Micah
> >>>>>>>> > >> > >
> >>>>>>>> > >> > >
> >>>>>>>> > >> > > On Saturday, August 10, 2019, Jacques Nadeau <
> jacques@apache.org>
> >>>>>>>> > >> wrote:
> >>>>>>>> > >> > >
> >>>>>>>> > >> > >> This is a pretty massive change to the apis. I wonder
> how nasty it
> >>>>>>>> > >> would
> >>>>>>>> > >> > >> be to just support both paths. Have you evaluated how
> complex that
> >>>>>>>> > >> > would be?
> >>>>>>>> > >> > >>
> >>>>>>>> > >> > >> On Wed, Aug 7, 2019 at 11:08 PM Micah Kornfield <
> >>>>>>>> > >> emkornfield@gmail.com>
> >>>>>>>> > >> > >> wrote:
> >>>>>>>> > >> > >>
> >>>>>>>> > >> > >>> After more investigation, it looks like
> Float8Benchmarks at least
> >>>>>>>> > >> on my
> >>>>>>>> > >> > >>> machine are within the range of noise.
> >>>>>>>> > >> > >>>
> >>>>>>>> > >> > >>> For BitVectorHelper I pushed a new commit [1], seems
> to bring the
> >>>>>>>> > >> > >>> BitVectorHelper benchmarks back inline (and even with
> some
> >>>>>>>> > >> improvement
> >>>>>>>> > >> > >>> for
> >>>>>>>> > >> > >>> getNullCountBenchmark).
> >>>>>>>> > >> > >>>
> >>>>>>>> > >> > >>> Benchmark                                        Mode
> Cnt   Score
> >>>>>>>> > >> > >>>  Error
> >>>>>>>> > >> > >>>  Units
> >>>>>>>> > >> > >>> BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt
>   5   3.821 ±
> >>>>>>>> > >> > >>> 0.031
> >>>>>>>> > >> > >>>  ns/op
> >>>>>>>> > >> > >>> BitVectorHelperBenchmarks.getNullCountBenchmark  avgt
>   5  14.884 ±
> >>>>>>>> > >> > >>> 0.141
> >>>>>>>> > >> > >>>  ns/op
> >>>>>>>> > >> > >>>
> >>>>>>>> > >> > >>> I applied the same pattern to other loops that I could
> find, and for
> >>>>>>>> > >> > any
> >>>>>>>> > >> > >>> "for (long" loop on the critical path, I broke it up
> into two loops.
> >>>>>>>> > >> > the
> >>>>>>>> > >> > >>> first loop does iteration by integer, the second
> finishes off for
> >>>>>>>> > >> any
> >>>>>>>> > >> > >>> long
> >>>>>>>> > >> > >>> values.  As a side note it seems like optimization for
> loops using
> >>>>>>>> > >> long
> >>>>>>>> > >> > >>> counters at least have a semi-recent open bug for the
> JVM [2]
> >>>>>>>> > >> > >>>
> >>>>>>>> > >> > >>> Thanks,
> >>>>>>>> > >> > >>> Micah
> >>>>>>>> > >> > >>>
> >>>>>>>> > >> > >>> [1]
> >>>>>>>> > >> > >>>
> >>>>>>>> > >> > >>>
> >>>>>>>> > >> >
> >>>>>>>> > >>
> https://github.com/apache/arrow/pull/5020/commits/2ea2c1ae83e3baa7b9a99a6d06276d968df41797
> >>>>>>>> > >> > >>> [2] https://bugs.openjdk.java.net/browse/JDK-8223051
> >>>>>>>> > >> > >>>
> >>>>>>>> > >> > >>> On Wed, Aug 7, 2019 at 8:11 PM Micah Kornfield <
> >>>>>>>> > >> emkornfield@gmail.com>
> >>>>>>>> > >> > >>> wrote:
> >>>>>>>> > >> > >>>
> >>>>>>>> > >> > >>> > Indeed, the BoundChecking and CheckNullForGet
> variables can make a
> >>>>>>>> > >> > big
> >>>>>>>> > >> > >>> > difference.  I didn't initially run the benchmarks
> with these
> >>>>>>>> > >> turned
> >>>>>>>> > >> > on
> >>>>>>>> > >> > >>> > (you can see the result from above with
> Float8Benchmarks).  Here
> >>>>>>>> > >> are
> >>>>>>>> > >> > >>> new
> >>>>>>>> > >> > >>> > numbers including with the flags enabled.  It looks
> like using
> >>>>>>>> > >> longs
> >>>>>>>> > >> > >>> might
> >>>>>>>> > >> > >>> > be a little bit slower, I'll see what I can do to
> mitigate this.
> >>>>>>>> > >> > >>> >
> >>>>>>>> > >> > >>> > Ravindra also volunteered to try to benchmark the
> changes with
> >>>>>>>> > >> > Dremio's
> >>>>>>>> > >> > >>> > code on today's sync call.
> >>>>>>>> > >> > >>> >
> >>>>>>>> > >> > >>> > New
> >>>>>>>> > >> > >>> >
> >>>>>>>> > >> > >>> > Benchmark
> Mode  Cnt   Score
> >>>>>>>> > >> > >>>  Error
> >>>>>>>> > >> > >>> > Units
> >>>>>>>> > >> > >>> >
> >>>>>>>> > >> > >>> > BitVectorHelperBenchmarks.allBitsNullBenchmark
>  avgt    5
> >>>>>>>> > >>  4.176 ±
> >>>>>>>> > >> > >>> 1.292
> >>>>>>>> > >> > >>> > ns/op
> >>>>>>>> > >> > >>> >
> >>>>>>>> > >> > >>> > BitVectorHelperBenchmarks.getNullCountBenchmark
> avgt    5
> >>>>>>>> > >> 26.102 ±
> >>>>>>>> > >> > >>> 0.700
> >>>>>>>> > >> > >>> > ns/op
> >>>>>>>> > >> > >>> >
> >>>>>>>> > >> > >>> > Float8Benchmarks.copyFromBenchmark   avgt    5
> 7.398 ± 0.084
> >>>>>>>> > >> us/op
> >>>>>>>> > >> > >>> >
> >>>>>>>> > >> > >>> > Float8Benchmarks.readWriteBenchmark  avgt    5
> 2.711 ± 0.057
> >>>>>>>> > >> us/op
> >>>>>>>> > >> > >>> >
> >>>>>>>> > >> > >>> >
> >>>>>>>> > >> > >>> >
> >>>>>>>> > >> > >>> > old
> >>>>>>>> > >> > >>> >
> >>>>>>>> > >> > >>> > BitVectorHelperBenchmarks.allBitsNullBenchmark
>  avgt    5
> >>>>>>>> > >>  3.828 ±
> >>>>>>>> > >> > >>> 0.030
> >>>>>>>> > >> > >>> > ns/op
> >>>>>>>> > >> > >>> >
> >>>>>>>> > >> > >>> > BitVectorHelperBenchmarks.getNullCountBenchmark
> avgt    5
> >>>>>>>> > >> 20.611 ±
> >>>>>>>> > >> > >>> 0.188
> >>>>>>>> > >> > >>> > ns/op
> >>>>>>>> > >> > >>> >
> >>>>>>>> > >> > >>> > Float8Benchmarks.copyFromBenchmark   avgt    5
> 6.597 ± 0.462
> >>>>>>>> > >> us/op
> >>>>>>>> > >> > >>> >
> >>>>>>>> > >> > >>> > Float8Benchmarks.readWriteBenchmark  avgt    5
> 2.615 ± 0.027
> >>>>>>>> > >> us/op
> >>>>>>>> > >> > >>> >
> >>>>>>>> > >> > >>> > On Wed, Aug 7, 2019 at 7:13 PM Fan Liya <
> liya.fan03@gmail.com>
> >>>>>>>> > >> > wrote:
> >>>>>>>> > >> > >>> >
> >>>>>>>> > >> > >>> >> Hi Gonzalo,
> >>>>>>>> > >> > >>> >>
> >>>>>>>> > >> > >>> >> Thanks for sharing the performance results.
> >>>>>>>> > >> > >>> >> I am wondering if you have turned off the flag
> >>>>>>>> > >> > >>> >> BoundsChecking#BOUNDS_CHECKING_ENABLED.
> >>>>>>>> > >> > >>> >> If not, the lower throughput should be expected.
> >>>>>>>> > >> > >>> >>
> >>>>>>>> > >> > >>> >> Best,
> >>>>>>>> > >> > >>> >> Liya Fan
> >>>>>>>> > >> > >>> >>
> >>>>>>>> > >> > >>> >> On Wed, Aug 7, 2019 at 10:23 PM Micah Kornfield <
> >>>>>>>> > >> > >>> emkornfield@gmail.com>
> >>>>>>>> > >> > >>> >> wrote:
> >>>>>>>> > >> > >>> >>
> >>>>>>>> > >> > >>> >>> Hi Gonzalo,
> >>>>>>>> > >> > >>> >>> Thank you for the feedback.  I wasn't aware of the
> JIT
> >>>>>>>> > >> > >>> implications.   At
> >>>>>>>> > >> > >>> >>> least on the benchmark run they don't seem to have
> an impact.
> >>>>>>>> > >> > >>> >>>
> >>>>>>>> > >> > >>> >>> If there are other benchmarks that people have
> that can
> >>>>>>>> > >> validate if
> >>>>>>>> > >> > >>> this
> >>>>>>>> > >> > >>> >>> change will be problematic I would appreciate
> trying to run them
> >>>>>>>> > >> > >>> with the
> >>>>>>>> > >> > >>> >>> PR.  I will try to run the ones for zeroing/popcnt
> tonight to
> >>>>>>>> > >> see
> >>>>>>>> > >> > if
> >>>>>>>> > >> > >>> >>> there
> >>>>>>>> > >> > >>> >>> is a change in those.
> >>>>>>>> > >> > >>> >>>
> >>>>>>>> > >> > >>> >>> -Micah
> >>>>>>>> > >> > >>> >>>
> >>>>>>>> > >> > >>> >>>
> >>>>>>>> > >> > >>> >>>
> >>>>>>>> > >> > >>> >>> On Wednesday, August 7, 2019, Gonzalo Ortiz
> Jaureguizar <
> >>>>>>>> > >> > >>> >>> golthiryus@gmail.com> wrote:
> >>>>>>>> > >> > >>> >>>
> >>>>>>>> > >> > >>> >>> > I would recommend to take care with this kind of
> changes.
> >>>>>>>> > >> > >>> >>> >
> >>>>>>>> > >> > >>> >>> > I didn't try Arrow in more than one year, but by
> then the
> >>>>>>>> > >> > >>> performance
> >>>>>>>> > >> > >>> >>> was
> >>>>>>>> > >> > >>> >>> > quite bad in comparison with plain byte buffer
> access
> >>>>>>>> > >> > >>> >>> > (see
> http://git.net/apache-arrow-development/msg02353.html *)
> >>>>>>>> > >> > and
> >>>>>>>> > >> > >>> >>> > there are several optimizations that the JVM
> (specifically,
> >>>>>>>> > >> C2)
> >>>>>>>> > >> > >>> does
> >>>>>>>> > >> > >>> >>> not
> >>>>>>>> > >> > >>> >>> > apply when dealing with int instead of longs.
> One of the
> >>>>>>>> > >> > >>> >>> > most commons is the loop unrolling and
> vectorization.
> >>>>>>>> > >> > >>> >>> >
> >>>>>>>> > >> > >>> >>> > * It doesn't seem the best way to reference an
> old email on
> >>>>>>>> > >> the
> >>>>>>>> > >> > >>> list,
> >>>>>>>> > >> > >>> >>> but
> >>>>>>>> > >> > >>> >>> > it is the only result shown by Google
> >>>>>>>> > >> > >>> >>> >
> >>>>>>>> > >> > >>> >>> > El mié., 7 ago. 2019 a las 11:42, Fan Liya (<
> >>>>>>>> > >> > liya.fan03@gmail.com
> >>>>>>>> > >> > >>> >)
> >>>>>>>> > >> > >>> >>> > escribió:
> >>>>>>>> > >> > >>> >>> >
> >>>>>>>> > >> > >>> >>> >> Hi Micah,
> >>>>>>>> > >> > >>> >>> >>
> >>>>>>>> > >> > >>> >>> >> Thanks for your effort. The performance result
> looks good.
> >>>>>>>> > >> > >>> >>> >>
> >>>>>>>> > >> > >>> >>> >> As you indicated, ArrowBuf will take additional
> 12 bytes (4
> >>>>>>>> > >> > bytes
> >>>>>>>> > >> > >>> for
> >>>>>>>> > >> > >>> >>> each
> >>>>>>>> > >> > >>> >>> >> of length, write index, and read index).
> >>>>>>>> > >> > >>> >>> >> Similar overheads also exist for vectors like
> >>>>>>>> > >> > >>> BaseFixedWidthVector,
> >>>>>>>> > >> > >>> >>> >> BaseVariableWidthVector, etc.
> >>>>>>>> > >> > >>> >>> >>
> >>>>>>>> > >> > >>> >>> >> IMO, such overheads are small enough to justify
> the change.
> >>>>>>>> > >> > >>> >>> >> Let's check if there are other overheads.
> >>>>>>>> > >> > >>> >>> >>
> >>>>>>>> > >> > >>> >>> >> Best,
> >>>>>>>> > >> > >>> >>> >> Liya Fan
> >>>>>>>> > >> > >>> >>> >>
> >>>>>>>> > >> > >>> >>> >> On Wed, Aug 7, 2019 at 3:30 PM Micah Kornfield <
> >>>>>>>> > >> > >>> emkornfield@gmail.com
> >>>>>>>> > >> > >>> >>> >
> >>>>>>>> > >> > >>> >>> >> wrote:
> >>>>>>>> > >> > >>> >>> >>
> >>>>>>>> > >> > >>> >>> >> > Hi Liya Fan,
> >>>>>>>> > >> > >>> >>> >> > Based on the Float8Benchmark there does not
> seem to be any
> >>>>>>>> > >> > >>> >>> meaningful
> >>>>>>>> > >> > >>> >>> >> > performance difference on my machine.  At
> least for me, the
> >>>>>>>> > >> > >>> >>> benchmarks
> >>>>>>>> > >> > >>> >>> >> are
> >>>>>>>> > >> > >>> >>> >> > not stable enough to say one is faster than
> the other (I've
> >>>>>>>> > >> > >>> pasted
> >>>>>>>> > >> > >>> >>> >> results
> >>>>>>>> > >> > >>> >>> >> > below).  That being said my machine isn't
> necessarily the
> >>>>>>>> > >> most
> >>>>>>>> > >> > >>> >>> reliable
> >>>>>>>> > >> > >>> >>> >> for
> >>>>>>>> > >> > >>> >>> >> > benchmarking.
> >>>>>>>> > >> > >>> >>> >> >
> >>>>>>>> > >> > >>> >>> >> > On an intuitive level, this makes sense to
> me,  for the
> >>>>>>>> > >> most
> >>>>>>>> > >> > >>> part it
> >>>>>>>> > >> > >>> >>> >> seems
> >>>>>>>> > >> > >>> >>> >> > like the change just moves casting from "int"
> to "long"
> >>>>>>>> > >> > further
> >>>>>>>> > >> > >>> up
> >>>>>>>> > >> > >>> >>> the
> >>>>>>>> > >> > >>> >>> >> > stack  for  "PlatformDepdendent" operations.
> If there are
> >>>>>>>> > >> > other
> >>>>>>>> > >> > >>> >>> >> benchmarks
> >>>>>>>> > >> > >>> >>> >> > that you think are worth running let me know.
> >>>>>>>> > >> > >>> >>> >> >
> >>>>>>>> > >> > >>> >>> >> > One downside performance wise I think for his
> change is it
> >>>>>>>> > >> > >>> >>> increases the
> >>>>>>>> > >> > >>> >>> >> > size of ArrowBuf objects, which I suppose
> could influence
> >>>>>>>> > >> > cache
> >>>>>>>> > >> > >>> >>> misses
> >>>>>>>> > >> > >>> >>> >> at
> >>>>>>>> > >> > >>> >>> >> > some level or increase the size of
> call-stacks, but this
> >>>>>>>> > >> > doesn't
> >>>>>>>> > >> > >>> >>> seem to
> >>>>>>>> > >> > >>> >>> >> > show up in the benchmark..
> >>>>>>>> > >> > >>> >>> >> >
> >>>>>>>> > >> > >>> >>> >> > Thanks,
> >>>>>>>> > >> > >>> >>> >> > Micah
> >>>>>>>> > >> > >>> >>> >> >
> >>>>>>>> > >> > >>> >>> >> > Sample benchmark numbers:
> >>>>>>>> > >> > >>> >>> >> >
> >>>>>>>> > >> > >>> >>> >> > [New Code]
> >>>>>>>> > >> > >>> >>> >> > Benchmark                            Mode
> Cnt   Score
> >>>>>>>> > >>  Error
> >>>>>>>> > >> > >>> >>> Units
> >>>>>>>> > >> > >>> >>> >> > Float8Benchmarks.copyFromBenchmark   avgt
> 5  15.441 ±
> >>>>>>>> > >> 0.469
> >>>>>>>> > >> > >>> >>> us/op
> >>>>>>>> > >> > >>> >>> >> > Float8Benchmarks.readWriteBenchmark  avgt
> 5  14.057 ±
> >>>>>>>> > >> 0.115
> >>>>>>>> > >> > >>> >>> us/op
> >>>>>>>> > >> > >>> >>> >> >
> >>>>>>>> > >> > >>> >>> >> >
> >>>>>>>> > >> > >>> >>> >> > [Old code]
> >>>>>>>> > >> > >>> >>> >> > Benchmark                            Mode
> Cnt   Score
> >>>>>>>> > >>  Error
> >>>>>>>> > >> > >>> >>> Units
> >>>>>>>> > >> > >>> >>> >> > Float8Benchmarks.copyFromBenchmark   avgt
> 5  16.248 ±
> >>>>>>>> > >> 1.409
> >>>>>>>> > >> > >>> >>> us/op
> >>>>>>>> > >> > >>> >>> >> > Float8Benchmarks.readWriteBenchmark  avgt
> 5  14.150 ±
> >>>>>>>> > >> 0.084
> >>>>>>>> > >> > >>> >>> us/op
> >>>>>>>> > >> > >>> >>> >> >
> >>>>>>>> > >> > >>> >>> >> > On Tue, Aug 6, 2019 at 1:18 AM Fan Liya <
> >>>>>>>> > >> liya.fan03@gmail.com
> >>>>>>>> > >> > >
> >>>>>>>> > >> > >>> >>> wrote:
> >>>>>>>> > >> > >>> >>> >> >
> >>>>>>>> > >> > >>> >>> >> >> Hi Micah,
> >>>>>>>> > >> > >>> >>> >> >>
> >>>>>>>> > >> > >>> >>> >> >> Thanks a lot for doing this.
> >>>>>>>> > >> > >>> >>> >> >>
> >>>>>>>> > >> > >>> >>> >> >> I am a little concerned about if there is
> any negative
> >>>>>>>> > >> > >>> performance
> >>>>>>>> > >> > >>> >>> >> impact
> >>>>>>>> > >> > >>> >>> >> >> on the current 32-bit-length based
> applications.
> >>>>>>>> > >> > >>> >>> >> >> Can we do some performance comparison on our
> existing
> >>>>>>>> > >> > >>> benchmarks?
> >>>>>>>> > >> > >>> >>> >> >>
> >>>>>>>> > >> > >>> >>> >> >> Best,
> >>>>>>>> > >> > >>> >>> >> >> Liya Fan
> >>>>>>>> > >> > >>> >>> >> >>
> >>>>>>>> > >> > >>> >>> >> >>
> >>>>>>>> > >> > >>> >>> >> >> On Tue, Aug 6, 2019 at 3:35 PM Micah
> Kornfield <
> >>>>>>>> > >> > >>> >>> emkornfield@gmail.com>
> >>>>>>>> > >> > >>> >>> >> >> wrote:
> >>>>>>>> > >> > >>> >>> >> >>
> >>>>>>>> > >> > >>> >>> >> >>> There have been some previous discussions
> on the mailing
> >>>>>>>> > >> > about
> >>>>>>>> > >> > >>> >>> >> supporting
> >>>>>>>> > >> > >>> >>> >> >>> 64-bit lengths for  Java ValueVectors (this
> is what the
> >>>>>>>> > >> IPC
> >>>>>>>> > >> > >>> >>> >> specification
> >>>>>>>> > >> > >>> >>> >> >>> and C++ support).  I created a PR [1] that
> changes all
> >>>>>>>> > >> APIs
> >>>>>>>> > >> > >>> that I
> >>>>>>>> > >> > >>> >>> >> could
> >>>>>>>> > >> > >>> >>> >> >>> find that take an index to take an "long"
> instead of an
> >>>>>>>> > >> > "int"
> >>>>>>>> > >> > >>> (and
> >>>>>>>> > >> > >>> >>> >> >>> similarly change "size/rowcount" APIs).
> >>>>>>>> > >> > >>> >>> >> >>>
> >>>>>>>> > >> > >>> >>> >> >>> It is a big change, so I think it is worth
> discussing if
> >>>>>>>> > >> it
> >>>>>>>> > >> > is
> >>>>>>>> > >> > >>> >>> >> something
> >>>>>>>> > >> > >>> >>> >> >>> we
> >>>>>>>> > >> > >>> >>> >> >>> still want to move forward with.  It would
> be nice to
> >>>>>>>> > >> come
> >>>>>>>> > >> > to
> >>>>>>>> > >> > >>> a
> >>>>>>>> > >> > >>> >>> >> >>> conclusion
> >>>>>>>> > >> > >>> >>> >> >>> quickly, ideally in the next few days, to
> avoid a lot of
> >>>>>>>> > >> > merge
> >>>>>>>> > >> > >>> >>> >> conflicts.
> >>>>>>>> > >> > >>> >>> >> >>>
> >>>>>>>> > >> > >>> >>> >> >>> The reason I did this work now is the C++
> implementation
> >>>>>>>> > >> has
> >>>>>>>> > >> > >>> added
> >>>>>>>> > >> > >>> >>> >> >>> support
> >>>>>>>> > >> > >>> >>> >> >>> for LargeList, LargeBinary and LargeString
> arrays and
> >>>>>>>> > >> based
> >>>>>>>> > >> > on
> >>>>>>>> > >> > >>> >>> prior
> >>>>>>>> > >> > >>> >>> >> >>> discussions we need to have similar support
> in Java
> >>>>>>>> > >> before
> >>>>>>>> > >> > our
> >>>>>>>> > >> > >>> >>> next
> >>>>>>>> > >> > >>> >>> >> >>> release. Support 64-bit indexes means we
> can have full
> >>>>>>>> > >> > >>> >>> compatibility
> >>>>>>>> > >> > >>> >>> >> and
> >>>>>>>> > >> > >>> >>> >> >>> make the most use of the types in Java.
> >>>>>>>> > >> > >>> >>> >> >>>
> >>>>>>>> > >> > >>> >>> >> >>> Look forward to hearing feedback.
> >>>>>>>> > >> > >>> >>> >> >>>
> >>>>>>>> > >> > >>> >>> >> >>> Thanks,
> >>>>>>>> > >> > >>> >>> >> >>> Micah
> >>>>>>>> > >> > >>> >>> >> >>>
> >>>>>>>> > >> > >>> >>> >> >>> [1]
> https://github.com/apache/arrow/pull/5020
> >>>>>>>> > >> > >>> >>> >> >>>
> >>>>>>>> > >> > >>> >>> >> >>
> >>>>>>>> > >> > >>> >>> >>
> >>>>>>>> > >> > >>> >>> >
> >>>>>>>> > >> > >>> >>>
> >>>>>>>> > >> > >>> >>
> >>>>>>>> > >> > >>>
> >>>>>>>> > >> > >>
> >>>>>>>> > >> >
> >>>>>>>> > >>
> >>>>>>>> > >
>

Re: [Discuss][Java] 64-bit lengths for ValueVectors

Posted by Wes McKinney <we...@gmail.com>.

hi Micah -- it makes sense to limit the scope for the time being to
permitting LargeString/Binary work to proceed. Jacques, have you had a
chance to look at this?

On Fri, Nov 15, 2019 at 3:07 AM Micah Kornfield <em...@gmail.com> wrote:
>
> Apologies for the long delay, I chose to do the minimal work of limiting this change [1] to allowing ArrowBuf to 64-bit lengths.  This would unblock work on LargeString and LargeBinary.  If this change looks OK, I think there is some follow-up work to add more thorough unit/integration tests.
>
> As an aside, it does seem like the 2GB limit is affecting some users in Spark [2][3], so hopefully LargeString would help with this.
>
> Allowing more than MAX_INT elements is Vectors/array still a blocker for making LargeList useful.
>
> Thanks,
> Micah
>
> [1]  https://github.com/apache/arrow/pull/5020
> [2] https://stackoverflow.com/questions/58739888/spark-is-it-possible-to-increase-pyarrow-buffer#comment103812119_58739888
> [3] https://issues.apache.org/jira/browse/ARROW-4890
>
> On Fri, Aug 23, 2019 at 8:33 AM Jacques Nadeau <ja...@apache.org> wrote:
>>
>>
>>
>> On Fri, Aug 23, 2019, 8:55 PM Micah Kornfield <em...@gmail.com> wrote:
>>>>
>>>> The vector indexes being limited to 32 bits doesn't limit the addressing to 32 bit chunks of memory. For example, you're prime example before was image data. Having 2 billion images of 1mb images would still be supported without changing the index addressing.
>>>
>>> This might be pre-coffee math, but I think we are limited to approximately 2000 images because an ArrowBuf only holds up-to 2 billion bytes [1].  While we have plenty of room for the offsets, we quickly run out of contiguous data storage space. For LargeString and LargeBinary this could be fixed by changing ArrowBuf.
>>>
>>> LargeArray faces the same problem only it applies to its child vectors.  Supporting LargeArray properly is really what drove the large-scale interface change.
>>
>>
>> My expressed concern about these changes was specifically about the use of long for get/set in the vector interfaces. I'm not saying that we constrain memory/ArrowBufs to 32bits.
>>
>>>
>>>> Rebase would help if possible.
>>>
>>> I'll try to get to this in the next few days.
>>>
>>> [1] https://github.com/apache/arrow/blob/95175fe7cb8439eebe6d2f6e0495f551d6864380/java/memory/src/main/java/io/netty/buffer/ArrowBuf.java#L164
>>>
>>> On Fri, Aug 23, 2019 at 4:55 AM Jacques Nadeau <ja...@apache.org> wrote:
>>>>
>>>>
>>>>
>>>> On Fri, Aug 23, 2019, 11:49 AM Micah Kornfield <em...@gmail.com> wrote:
>>>>>>
>>>>>> I don't think we should couple this discussion with the implementation of large list, etc since I think those two concepts are independent.
>>>>>
>>>>> I'm still trying to balance in my mind which is a worse experience for consumers of the libraries for these types.  Claiming that Java supports these types but throwing an exception when the Vectors exceed 32-bits or just say they aren't supported until we have 64-bit support in Java.
>>>>
>>>>
>>>> The vector indexes being limited to 32 bits doesn't limit the addressing to 32 bit chunks of memory. For example, you're prime example before was image data. Having 2 billion images of 1mb images would still be supported without changing the index addressing.
>>>>
>>>>
>>>>>
>>>>>>
>>>>>> I've asked some others on my team their opinions on the risk here. I think we should probably review some our more complex vector interactions and see how the jvm's assembly changes with this kind of change. Using microbenchmarking is good but I think we also need to see whether we're constantly inserting additional instructions or if in most cases, this actually doesn't impact instruction count.
>>>>>
>>>>>
>>>>> Is this something that your team will take on?
>>>>
>>>>
>>>>
>>>> Yeah, we need to look at this I think.
>>>>
>>>>> Do you need a rebased version of the PR or is the existing one sufficient?
>>>>
>>>>
>>>> Rebase would help if possible.
>>>>
>>>>
>>>>>
>>>>> Thanks,
>>>>> Micah
>>>>>
>>>>> On Thu, Aug 22, 2019 at 8:56 PM Jacques Nadeau <ja...@apache.org> wrote:
>>>>>>
>>>>>> I don't think we should couple this discussion with the implementation of large list, etc since I think those two concepts are independent.
>>>>>>
>>>>>> I've asked some others on my team their opinions on the risk here. I think we should probably review some our more complex vector interactions and see how the jvm's assembly changes with this kind of change. Using microbenchmarking is good but I think we also need to see whether we're constantly inserting additional instructions or if in most cases, this actually doesn't impact instruction count.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Aug 21, 2019 at 12:18 PM Micah Kornfield <em...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> With regards to the reference implementation point. It is a good point. I'm on vacation this week. Unless you're pushing hard on this, can we pick this up and discuss more next week?
>>>>>>>
>>>>>>>
>>>>>>> Hi Jacques, I hope you had a good rest.  Any more thoughts on the reference implementation aspect of this?
>>>>>>>
>>>>>>>>
>>>>>>>> To copy the sentiments from the 0.15.0 release thread, I think it
>>>>>>>> would be best to decouple this discussion from the release timeline
>>>>>>>> given how many people we have relying on regular releases coming out.
>>>>>>>> We can keep continue making major 0.x releases until we're ready to
>>>>>>>> release 1.0.0.
>>>>>>>
>>>>>>>
>>>>>>> I'm OK with it as long as other stakeholders are. Timed releases are the way to go.  As stated on the release thread [1] we need a better mechanism to avoid this type of issue arising again.  The release thread also had some more discussion on compatibility.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Micah
>>>>>>>
>>>>>>> [1] https://lists.apache.org/thread.html/d70feeceaf2570906ade117030b29887af7c77ca5c4a976e6d555920@%3Cdev.arrow.apache.org%3E
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Aug 14, 2019 at 3:23 PM Wes McKinney <we...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> On Sun, Aug 11, 2019 at 9:40 PM Micah Kornfield <em...@gmail.com> wrote:
>>>>>>>> >
>>>>>>>> > Hi Wes and Jacques,
>>>>>>>> > See responses below.
>>>>>>>> >
>>>>>>>> > With regards to the reference implementation point. It is a good point. I'm
>>>>>>>> > > on vacation this week. Unless you're pushing hard on this, can we pick this
>>>>>>>> > > up and discuss more next week?
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > Sure thing, enjoy your vacation.  I think the only practical implications
>>>>>>>> > are it delays choices around implementing LargeList, LargeBinary,
>>>>>>>> > LargeString in Java, which in turn might push out the 0.15.0 release.
>>>>>>>> >
>>>>>>>>
>>>>>>>> To copy the sentiments from the 0.15.0 release thread, I think it
>>>>>>>> would be best to decouple this discussion from the release timeline
>>>>>>>> given how many people we have relying on regular releases coming out.
>>>>>>>> We can keep continue making major 0.x releases until we're ready to
>>>>>>>> release 1.0.0.
>>>>>>>>
>>>>>>>> > My stance on this is that I don't know how important it is for Java to
>>>>>>>> > > support vectors over INT32_MAX elements. The use cases enabled by
>>>>>>>> > > having very large arrays seem to be concentrated in the native code
>>>>>>>> > > world (e.g. C/C++/Rust) -- that could just be implementation-centrism
>>>>>>>> > > on my part, though.
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > A data point against this view is Spark has done work to eliminate 2GB
>>>>>>>> > memory limits on its block sizes [1].  I don't claim to understand the
>>>>>>>> > implications of this. Bryan might you have any thoughts here?  I'm OK with
>>>>>>>> > INT32_MAX, as well, I think we should think about what this means for
>>>>>>>> > adding Large types to Java and implications for reference implementations.
>>>>>>>> >
>>>>>>>> > Thanks,
>>>>>>>> > Micah
>>>>>>>> >
>>>>>>>> > [1] https://issues.apache.org/jira/browse/SPARK-6235
>>>>>>>> >
>>>>>>>> > On Sun, Aug 11, 2019 at 6:31 PM Jacques Nadeau <ja...@apache.org> wrote:
>>>>>>>> >
>>>>>>>> > > Hey Micah,
>>>>>>>> > >
>>>>>>>> > > Appreciate the offer on the compiling. The reality is I'm more concerned
>>>>>>>> > > about the unknowns than the compiling issue itself. Any time you've been
>>>>>>>> > > tuning for a while, changing something like this could be totally fine or
>>>>>>>> > > cause a couple of major issues. For example, we've done a very large amount
>>>>>>>> > > of work reducing heap memory footprint of the vectors. Are target is to
>>>>>>>> > > actually get it down to 24 bytes per ArrowBuf and 24 bytes heap per vector
>>>>>>>> > > (not including arrow bufs).
>>>>>>>> > >
>>>>>>>> > > With regards to the reference implementation point. It is a good point.
>>>>>>>> > > I'm on vacation this week. Unless you're pushing hard on this, can we pick
>>>>>>>> > > this up and discuss more next week?
>>>>>>>> > >
>>>>>>>> > > thanks,
>>>>>>>> > > Jacques
>>>>>>>> > >
>>>>>>>> > > On Sat, Aug 10, 2019 at 7:39 PM Micah Kornfield <em...@gmail.com>
>>>>>>>> > > wrote:
>>>>>>>> > >
>>>>>>>> > >> Hi Jacques,
>>>>>>>> > >> I definitely understand these concerns and this change is risky because it
>>>>>>>> > >> is so large.  Perhaps, creating a new hierarchy, might be the cleanest way
>>>>>>>> > >> of dealing with this.  This could have other benefits like cleaning up
>>>>>>>> > >> some
>>>>>>>> > >> cruft around dictionary encode and "orphaned" method.   Per past e-mail
>>>>>>>> > >> threads I agree it is beneficial to have 2 separate reference
>>>>>>>> > >> implementations that can communicate fully, and my intent here was to
>>>>>>>> > >> close
>>>>>>>> > >> that gap.
>>>>>>>> > >>
>>>>>>>> > >> Trying to
>>>>>>>> > >> > determine the ramifications of these changes would be challenging and
>>>>>>>> > >> time
>>>>>>>> > >> > consuming against all the different ways we interact with the Arrow Java
>>>>>>>> > >> > library.
>>>>>>>> > >>
>>>>>>>> > >>
>>>>>>>> > >> Understood.  I took a quick look at Dremio-OSS it seems like it has a
>>>>>>>> > >> simple java build system?  If it is helpful, I can try to get a fork
>>>>>>>> > >> running that at least compiles against this PR.  My plan would be to cast
>>>>>>>> > >> any place that was changed to return a long back to an int, so in essence
>>>>>>>> > >> the Dremio algorithms would reman 32-bit implementations.
>>>>>>>> > >>
>>>>>>>> > >> I don't  have the infrastructure to test this change properly from a
>>>>>>>> > >> distributed systems perspective, so it would still take some time from
>>>>>>>> > >> Dremio to validate for regressions.
>>>>>>>> > >>
>>>>>>>> > >> I'm not saying I'm against this but want to make sure we've
>>>>>>>> > >> > explored all less disruptive options before considering changing
>>>>>>>> > >> something
>>>>>>>> > >> > this fundamental (especially when I generally hold the view that large
>>>>>>>> > >> cell
>>>>>>>> > >> > counts against massive contiguous memory is an anti pattern to scalable
>>>>>>>> > >> > analytical processing--purely subjective of course).
>>>>>>>> > >>
>>>>>>>> > >>
>>>>>>>> > >> I'm open to other ideas here, as well. I don't think it is out of the
>>>>>>>> > >> question to leave the Java implementation as 32-bit, but if we do, then I
>>>>>>>> > >> think we should consider a different strategy for reference
>>>>>>>> > >> implementations.
>>>>>>>> > >>
>>>>>>>> > >> Thanks,
>>>>>>>> > >> Micah
>>>>>>>> > >>
>>>>>>>> > >> On Sat, Aug 10, 2019 at 5:09 PM Jacques Nadeau <ja...@apache.org>
>>>>>>>> > >> wrote:
>>>>>>>> > >>
>>>>>>>> > >> > Hey Micah, I didn't have a particular path in mind. Was thinking more
>>>>>>>> > >> along
>>>>>>>> > >> > the lines of extra methods as opposed to separate classes.
>>>>>>>> > >> >
>>>>>>>> > >> > Arrow hasn't historically been a place where we're writing algorithms in
>>>>>>>> > >> > Java so the fact that they aren't there doesn't mean they don't exist.
>>>>>>>> > >> We
>>>>>>>> > >> > have a large amount of code that depends on the current behavior that is
>>>>>>>> > >> > deployed in hundreds of customer clusters (you can peruse our dremio
>>>>>>>> > >> repo
>>>>>>>> > >> > to see how extensively we leverage Arrow if interested). Trying to
>>>>>>>> > >> > determine the ramifications of these changes would be challenging and
>>>>>>>> > >> time
>>>>>>>> > >> > consuming against all the different ways we interact with the Arrow Java
>>>>>>>> > >> > library. I'm not saying I'm against this but want to make sure we've
>>>>>>>> > >> > explored all less disruptive options before considering changing
>>>>>>>> > >> something
>>>>>>>> > >> > this fundamental (especially when I generally hold the view that large
>>>>>>>> > >> cell
>>>>>>>> > >> > counts against massive contiguous memory is an anti pattern to scalable
>>>>>>>> > >> > analytical processing--purely subjective of course).
>>>>>>>> > >> >
>>>>>>>> > >> > On Sat, Aug 10, 2019, 4:17 PM Micah Kornfield <em...@gmail.com>
>>>>>>>> > >> > wrote:
>>>>>>>> > >> >
>>>>>>>> > >> > > Hi Jacques,
>>>>>>>> > >> > > What avenue were you thinking for supporting both paths?   I didn't
>>>>>>>> > >> want
>>>>>>>> > >> > > to pursue a different class hierarchy, because I felt like that would
>>>>>>>> > >> > > effectively fork the code base, but that is potentially an option that
>>>>>>>> > >> > > would allow us to have a complete reference implementation in Java
>>>>>>>> > >> that
>>>>>>>> > >> > can
>>>>>>>> > >> > > fully interact with C++, without major changes to this code.
>>>>>>>> > >> > >
>>>>>>>> > >> > > For supporting both APIs on the same classes/interfaces, I think they
>>>>>>>> > >> > > roughly fall into three categories, changes to input parameters,
>>>>>>>> > >> changes
>>>>>>>> > >> > to
>>>>>>>> > >> > > output parameters and algorithm changes.
>>>>>>>> > >> > >
>>>>>>>> > >> > > For inputs, changing from int to long is essentially a no-op from the
>>>>>>>> > >> > > compiler perspective.  From the limited micro-benchmarking this also
>>>>>>>> > >> > > doesn't seem to have a performance impact.  So we could keep two
>>>>>>>> > >> versions
>>>>>>>> > >> > > of the methods that only differ on inputs, but it is not clear what
>>>>>>>> > >> the
>>>>>>>> > >> > > value of that would be.
>>>>>>>> > >> > >
>>>>>>>> > >> > > For outputs, we can't support methods "long getLength()" and "int
>>>>>>>> > >> > > getLength()" in the same class, so we would be forced into something
>>>>>>>> > >> like
>>>>>>>> > >> > > "long getLength(boolean dummy)" which I think is a less desirable.
>>>>>>>> > >> > >
>>>>>>>> > >> > > For algorithm changes, there did not appear to be too many places
>>>>>>>> > >> where
>>>>>>>> > >> > we
>>>>>>>> > >> > > actually loop over all elements (it is quite possible I missed
>>>>>>>> > >> something
>>>>>>>> > >> > > here), the ones that I did find I was able to mitigate performance
>>>>>>>> > >> > > penalties as noted above.  Some of the current implementation will
>>>>>>>> > >> get a
>>>>>>>> > >> > > lot slower for "large arrays", but we can likely fix those later or in
>>>>>>>> > >> > this
>>>>>>>> > >> > > PR with a nested while loop instead of 2 for loops.
>>>>>>>> > >> > >
>>>>>>>> > >> > > Thanks,
>>>>>>>> > >> > > Micah
>>>>>>>> > >> > >
>>>>>>>> > >> > >
>>>>>>>> > >> > > On Saturday, August 10, 2019, Jacques Nadeau <ja...@apache.org>
>>>>>>>> > >> wrote:
>>>>>>>> > >> > >
>>>>>>>> > >> > >> This is a pretty massive change to the apis. I wonder how nasty it
>>>>>>>> > >> would
>>>>>>>> > >> > >> be to just support both paths. Have you evaluated how complex that
>>>>>>>> > >> > would be?
>>>>>>>> > >> > >>
>>>>>>>> > >> > >> On Wed, Aug 7, 2019 at 11:08 PM Micah Kornfield <
>>>>>>>> > >> emkornfield@gmail.com>
>>>>>>>> > >> > >> wrote:
>>>>>>>> > >> > >>
>>>>>>>> > >> > >>> After more investigation, it looks like Float8Benchmarks at least
>>>>>>>> > >> on my
>>>>>>>> > >> > >>> machine are within the range of noise.
>>>>>>>> > >> > >>>
>>>>>>>> > >> > >>> For BitVectorHelper I pushed a new commit [1], seems to bring the
>>>>>>>> > >> > >>> BitVectorHelper benchmarks back inline (and even with some
>>>>>>>> > >> improvement
>>>>>>>> > >> > >>> for
>>>>>>>> > >> > >>> getNullCountBenchmark).
>>>>>>>> > >> > >>>
>>>>>>>> > >> > >>> Benchmark                                        Mode  Cnt   Score
>>>>>>>> > >> > >>>  Error
>>>>>>>> > >> > >>>  Units
>>>>>>>> > >> > >>> BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5   3.821 ±
>>>>>>>> > >> > >>> 0.031
>>>>>>>> > >> > >>>  ns/op
>>>>>>>> > >> > >>> BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5  14.884 ±
>>>>>>>> > >> > >>> 0.141
>>>>>>>> > >> > >>>  ns/op
>>>>>>>> > >> > >>>
>>>>>>>> > >> > >>> I applied the same pattern to other loops that I could find, and for
>>>>>>>> > >> > any
>>>>>>>> > >> > >>> "for (long" loop on the critical path, I broke it up into two loops.
>>>>>>>> > >> > the
>>>>>>>> > >> > >>> first loop does iteration by integer, the second finishes off for
>>>>>>>> > >> any
>>>>>>>> > >> > >>> long
>>>>>>>> > >> > >>> values.  As a side note it seems like optimization for loops using
>>>>>>>> > >> long
>>>>>>>> > >> > >>> counters at least have a semi-recent open bug for the JVM [2]
>>>>>>>> > >> > >>>
>>>>>>>> > >> > >>> Thanks,
>>>>>>>> > >> > >>> Micah
>>>>>>>> > >> > >>>
>>>>>>>> > >> > >>> [1]
>>>>>>>> > >> > >>>
>>>>>>>> > >> > >>>
>>>>>>>> > >> >
>>>>>>>> > >> https://github.com/apache/arrow/pull/5020/commits/2ea2c1ae83e3baa7b9a99a6d06276d968df41797
>>>>>>>> > >> > >>> [2] https://bugs.openjdk.java.net/browse/JDK-8223051
>>>>>>>> > >> > >>>
>>>>>>>> > >> > >>> On Wed, Aug 7, 2019 at 8:11 PM Micah Kornfield <
>>>>>>>> > >> emkornfield@gmail.com>
>>>>>>>> > >> > >>> wrote:
>>>>>>>> > >> > >>>
>>>>>>>> > >> > >>> > Indeed, the BoundChecking and CheckNullForGet variables can make a
>>>>>>>> > >> > big
>>>>>>>> > >> > >>> > difference.  I didn't initially run the benchmarks with these
>>>>>>>> > >> turned
>>>>>>>> > >> > on
>>>>>>>> > >> > >>> > (you can see the result from above with Float8Benchmarks).  Here
>>>>>>>> > >> are
>>>>>>>> > >> > >>> new
>>>>>>>> > >> > >>> > numbers including with the flags enabled.  It looks like using
>>>>>>>> > >> longs
>>>>>>>> > >> > >>> might
>>>>>>>> > >> > >>> > be a little bit slower, I'll see what I can do to mitigate this.
>>>>>>>> > >> > >>> >
>>>>>>>> > >> > >>> > Ravindra also volunteered to try to benchmark the changes with
>>>>>>>> > >> > Dremio's
>>>>>>>> > >> > >>> > code on today's sync call.
>>>>>>>> > >> > >>> >
>>>>>>>> > >> > >>> > New
>>>>>>>> > >> > >>> >
>>>>>>>> > >> > >>> > Benchmark                                        Mode  Cnt   Score
>>>>>>>> > >> > >>>  Error
>>>>>>>> > >> > >>> > Units
>>>>>>>> > >> > >>> >
>>>>>>>> > >> > >>> > BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5
>>>>>>>> > >>  4.176 ±
>>>>>>>> > >> > >>> 1.292
>>>>>>>> > >> > >>> > ns/op
>>>>>>>> > >> > >>> >
>>>>>>>> > >> > >>> > BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5
>>>>>>>> > >> 26.102 ±
>>>>>>>> > >> > >>> 0.700
>>>>>>>> > >> > >>> > ns/op
>>>>>>>> > >> > >>> >
>>>>>>>> > >> > >>> > Float8Benchmarks.copyFromBenchmark   avgt    5  7.398 ± 0.084
>>>>>>>> > >> us/op
>>>>>>>> > >> > >>> >
>>>>>>>> > >> > >>> > Float8Benchmarks.readWriteBenchmark  avgt    5  2.711 ± 0.057
>>>>>>>> > >> us/op
>>>>>>>> > >> > >>> >
>>>>>>>> > >> > >>> >
>>>>>>>> > >> > >>> >
>>>>>>>> > >> > >>> > old
>>>>>>>> > >> > >>> >
>>>>>>>> > >> > >>> > BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5
>>>>>>>> > >>  3.828 ±
>>>>>>>> > >> > >>> 0.030
>>>>>>>> > >> > >>> > ns/op
>>>>>>>> > >> > >>> >
>>>>>>>> > >> > >>> > BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5
>>>>>>>> > >> 20.611 ±
>>>>>>>> > >> > >>> 0.188
>>>>>>>> > >> > >>> > ns/op
>>>>>>>> > >> > >>> >
>>>>>>>> > >> > >>> > Float8Benchmarks.copyFromBenchmark   avgt    5  6.597 ± 0.462
>>>>>>>> > >> us/op
>>>>>>>> > >> > >>> >
>>>>>>>> > >> > >>> > Float8Benchmarks.readWriteBenchmark  avgt    5  2.615 ± 0.027
>>>>>>>> > >> us/op
>>>>>>>> > >> > >>> >
>>>>>>>> > >> > >>> > On Wed, Aug 7, 2019 at 7:13 PM Fan Liya <li...@gmail.com>
>>>>>>>> > >> > wrote:
>>>>>>>> > >> > >>> >
>>>>>>>> > >> > >>> >> Hi Gonzalo,
>>>>>>>> > >> > >>> >>
>>>>>>>> > >> > >>> >> Thanks for sharing the performance results.
>>>>>>>> > >> > >>> >> I am wondering if you have turned off the flag
>>>>>>>> > >> > >>> >> BoundsChecking#BOUNDS_CHECKING_ENABLED.
>>>>>>>> > >> > >>> >> If not, the lower throughput should be expected.
>>>>>>>> > >> > >>> >>
>>>>>>>> > >> > >>> >> Best,
>>>>>>>> > >> > >>> >> Liya Fan
>>>>>>>> > >> > >>> >>
>>>>>>>> > >> > >>> >> On Wed, Aug 7, 2019 at 10:23 PM Micah Kornfield <
>>>>>>>> > >> > >>> emkornfield@gmail.com>
>>>>>>>> > >> > >>> >> wrote:
>>>>>>>> > >> > >>> >>
>>>>>>>> > >> > >>> >>> Hi Gonzalo,
>>>>>>>> > >> > >>> >>> Thank you for the feedback.  I wasn't aware of the JIT
>>>>>>>> > >> > >>> implications.   At
>>>>>>>> > >> > >>> >>> least on the benchmark run they don't seem to have an impact.
>>>>>>>> > >> > >>> >>>
>>>>>>>> > >> > >>> >>> If there are other benchmarks that people have that can
>>>>>>>> > >> validate if
>>>>>>>> > >> > >>> this
>>>>>>>> > >> > >>> >>> change will be problematic I would appreciate trying to run them
>>>>>>>> > >> > >>> with the
>>>>>>>> > >> > >>> >>> PR.  I will try to run the ones for zeroing/popcnt tonight to
>>>>>>>> > >> see
>>>>>>>> > >> > if
>>>>>>>> > >> > >>> >>> there
>>>>>>>> > >> > >>> >>> is a change in those.
>>>>>>>> > >> > >>> >>>
>>>>>>>> > >> > >>> >>> -Micah
>>>>>>>> > >> > >>> >>>
>>>>>>>> > >> > >>> >>>
>>>>>>>> > >> > >>> >>>
>>>>>>>> > >> > >>> >>> On Wednesday, August 7, 2019, Gonzalo Ortiz Jaureguizar <
>>>>>>>> > >> > >>> >>> golthiryus@gmail.com> wrote:
>>>>>>>> > >> > >>> >>>
>>>>>>>> > >> > >>> >>> > I would recommend to take care with this kind of changes.
>>>>>>>> > >> > >>> >>> >
>>>>>>>> > >> > >>> >>> > I didn't try Arrow in more than one year, but by then the
>>>>>>>> > >> > >>> performance
>>>>>>>> > >> > >>> >>> was
>>>>>>>> > >> > >>> >>> > quite bad in comparison with plain byte buffer access
>>>>>>>> > >> > >>> >>> > (see http://git.net/apache-arrow-development/msg02353.html *)
>>>>>>>> > >> > and
>>>>>>>> > >> > >>> >>> > there are several optimizations that the JVM (specifically,
>>>>>>>> > >> C2)
>>>>>>>> > >> > >>> does
>>>>>>>> > >> > >>> >>> not
>>>>>>>> > >> > >>> >>> > apply when dealing with int instead of longs. One of the
>>>>>>>> > >> > >>> >>> > most commons is the loop unrolling and vectorization.
>>>>>>>> > >> > >>> >>> >
>>>>>>>> > >> > >>> >>> > * It doesn't seem the best way to reference an old email on
>>>>>>>> > >> the
>>>>>>>> > >> > >>> list,
>>>>>>>> > >> > >>> >>> but
>>>>>>>> > >> > >>> >>> > it is the only result shown by Google
>>>>>>>> > >> > >>> >>> >
>>>>>>>> > >> > >>> >>> > El mié., 7 ago. 2019 a las 11:42, Fan Liya (<
>>>>>>>> > >> > liya.fan03@gmail.com
>>>>>>>> > >> > >>> >)
>>>>>>>> > >> > >>> >>> > escribió:
>>>>>>>> > >> > >>> >>> >
>>>>>>>> > >> > >>> >>> >> Hi Micah,
>>>>>>>> > >> > >>> >>> >>
>>>>>>>> > >> > >>> >>> >> Thanks for your effort. The performance result looks good.
>>>>>>>> > >> > >>> >>> >>
>>>>>>>> > >> > >>> >>> >> As you indicated, ArrowBuf will take additional 12 bytes (4
>>>>>>>> > >> > bytes
>>>>>>>> > >> > >>> for
>>>>>>>> > >> > >>> >>> each
>>>>>>>> > >> > >>> >>> >> of length, write index, and read index).
>>>>>>>> > >> > >>> >>> >> Similar overheads also exist for vectors like
>>>>>>>> > >> > >>> BaseFixedWidthVector,
>>>>>>>> > >> > >>> >>> >> BaseVariableWidthVector, etc.
>>>>>>>> > >> > >>> >>> >>
>>>>>>>> > >> > >>> >>> >> IMO, such overheads are small enough to justify the change.
>>>>>>>> > >> > >>> >>> >> Let's check if there are other overheads.
>>>>>>>> > >> > >>> >>> >>
>>>>>>>> > >> > >>> >>> >> Best,
>>>>>>>> > >> > >>> >>> >> Liya Fan
>>>>>>>> > >> > >>> >>> >>
>>>>>>>> > >> > >>> >>> >> On Wed, Aug 7, 2019 at 3:30 PM Micah Kornfield <
>>>>>>>> > >> > >>> emkornfield@gmail.com
>>>>>>>> > >> > >>> >>> >
>>>>>>>> > >> > >>> >>> >> wrote:
>>>>>>>> > >> > >>> >>> >>
>>>>>>>> > >> > >>> >>> >> > Hi Liya Fan,
>>>>>>>> > >> > >>> >>> >> > Based on the Float8Benchmark there does not seem to be any
>>>>>>>> > >> > >>> >>> meaningful
>>>>>>>> > >> > >>> >>> >> > performance difference on my machine.  At least for me, the
>>>>>>>> > >> > >>> >>> benchmarks
>>>>>>>> > >> > >>> >>> >> are
>>>>>>>> > >> > >>> >>> >> > not stable enough to say one is faster than the other (I've
>>>>>>>> > >> > >>> pasted
>>>>>>>> > >> > >>> >>> >> results
>>>>>>>> > >> > >>> >>> >> > below).  That being said my machine isn't necessarily the
>>>>>>>> > >> most
>>>>>>>> > >> > >>> >>> reliable
>>>>>>>> > >> > >>> >>> >> for
>>>>>>>> > >> > >>> >>> >> > benchmarking.
>>>>>>>> > >> > >>> >>> >> >
>>>>>>>> > >> > >>> >>> >> > On an intuitive level, this makes sense to me,  for the
>>>>>>>> > >> most
>>>>>>>> > >> > >>> part it
>>>>>>>> > >> > >>> >>> >> seems
>>>>>>>> > >> > >>> >>> >> > like the change just moves casting from "int" to "long"
>>>>>>>> > >> > further
>>>>>>>> > >> > >>> up
>>>>>>>> > >> > >>> >>> the
>>>>>>>> > >> > >>> >>> >> > stack  for  "PlatformDepdendent" operations.  If there are
>>>>>>>> > >> > other
>>>>>>>> > >> > >>> >>> >> benchmarks
>>>>>>>> > >> > >>> >>> >> > that you think are worth running let me know.
>>>>>>>> > >> > >>> >>> >> >
>>>>>>>> > >> > >>> >>> >> > One downside performance wise I think for his change is it
>>>>>>>> > >> > >>> >>> increases the
>>>>>>>> > >> > >>> >>> >> > size of ArrowBuf objects, which I suppose could influence
>>>>>>>> > >> > cache
>>>>>>>> > >> > >>> >>> misses
>>>>>>>> > >> > >>> >>> >> at
>>>>>>>> > >> > >>> >>> >> > some level or increase the size of call-stacks, but this
>>>>>>>> > >> > doesn't
>>>>>>>> > >> > >>> >>> seem to
>>>>>>>> > >> > >>> >>> >> > show up in the benchmark..
>>>>>>>> > >> > >>> >>> >> >
>>>>>>>> > >> > >>> >>> >> > Thanks,
>>>>>>>> > >> > >>> >>> >> > Micah
>>>>>>>> > >> > >>> >>> >> >
>>>>>>>> > >> > >>> >>> >> > Sample benchmark numbers:
>>>>>>>> > >> > >>> >>> >> >
>>>>>>>> > >> > >>> >>> >> > [New Code]
>>>>>>>> > >> > >>> >>> >> > Benchmark                            Mode  Cnt   Score
>>>>>>>> > >>  Error
>>>>>>>> > >> > >>> >>> Units
>>>>>>>> > >> > >>> >>> >> > Float8Benchmarks.copyFromBenchmark   avgt    5  15.441 ±
>>>>>>>> > >> 0.469
>>>>>>>> > >> > >>> >>> us/op
>>>>>>>> > >> > >>> >>> >> > Float8Benchmarks.readWriteBenchmark  avgt    5  14.057 ±
>>>>>>>> > >> 0.115
>>>>>>>> > >> > >>> >>> us/op
>>>>>>>> > >> > >>> >>> >> >
>>>>>>>> > >> > >>> >>> >> >
>>>>>>>> > >> > >>> >>> >> > [Old code]
>>>>>>>> > >> > >>> >>> >> > Benchmark                            Mode  Cnt   Score
>>>>>>>> > >>  Error
>>>>>>>> > >> > >>> >>> Units
>>>>>>>> > >> > >>> >>> >> > Float8Benchmarks.copyFromBenchmark   avgt    5  16.248 ±
>>>>>>>> > >> 1.409
>>>>>>>> > >> > >>> >>> us/op
>>>>>>>> > >> > >>> >>> >> > Float8Benchmarks.readWriteBenchmark  avgt    5  14.150 ±
>>>>>>>> > >> 0.084
>>>>>>>> > >> > >>> >>> us/op
>>>>>>>> > >> > >>> >>> >> >
>>>>>>>> > >> > >>> >>> >> > On Tue, Aug 6, 2019 at 1:18 AM Fan Liya <
>>>>>>>> > >> liya.fan03@gmail.com
>>>>>>>> > >> > >
>>>>>>>> > >> > >>> >>> wrote:
>>>>>>>> > >> > >>> >>> >> >
>>>>>>>> > >> > >>> >>> >> >> Hi Micah,
>>>>>>>> > >> > >>> >>> >> >>
>>>>>>>> > >> > >>> >>> >> >> Thanks a lot for doing this.
>>>>>>>> > >> > >>> >>> >> >>
>>>>>>>> > >> > >>> >>> >> >> I am a little concerned about if there is any negative
>>>>>>>> > >> > >>> performance
>>>>>>>> > >> > >>> >>> >> impact
>>>>>>>> > >> > >>> >>> >> >> on the current 32-bit-length based applications.
>>>>>>>> > >> > >>> >>> >> >> Can we do some performance comparison on our existing
>>>>>>>> > >> > >>> benchmarks?
>>>>>>>> > >> > >>> >>> >> >>
>>>>>>>> > >> > >>> >>> >> >> Best,
>>>>>>>> > >> > >>> >>> >> >> Liya Fan
>>>>>>>> > >> > >>> >>> >> >>
>>>>>>>> > >> > >>> >>> >> >>
>>>>>>>> > >> > >>> >>> >> >> On Tue, Aug 6, 2019 at 3:35 PM Micah Kornfield <
>>>>>>>> > >> > >>> >>> emkornfield@gmail.com>
>>>>>>>> > >> > >>> >>> >> >> wrote:
>>>>>>>> > >> > >>> >>> >> >>
>>>>>>>> > >> > >>> >>> >> >>> There have been some previous discussions on the mailing
>>>>>>>> > >> > about
>>>>>>>> > >> > >>> >>> >> supporting
>>>>>>>> > >> > >>> >>> >> >>> 64-bit lengths for  Java ValueVectors (this is what the
>>>>>>>> > >> IPC
>>>>>>>> > >> > >>> >>> >> specification
>>>>>>>> > >> > >>> >>> >> >>> and C++ support).  I created a PR [1] that changes all
>>>>>>>> > >> APIs
>>>>>>>> > >> > >>> that I
>>>>>>>> > >> > >>> >>> >> could
>>>>>>>> > >> > >>> >>> >> >>> find that take an index to take an "long" instead of an
>>>>>>>> > >> > "int"
>>>>>>>> > >> > >>> (and
>>>>>>>> > >> > >>> >>> >> >>> similarly change "size/rowcount" APIs).
>>>>>>>> > >> > >>> >>> >> >>>
>>>>>>>> > >> > >>> >>> >> >>> It is a big change, so I think it is worth discussing if
>>>>>>>> > >> it
>>>>>>>> > >> > is
>>>>>>>> > >> > >>> >>> >> something
>>>>>>>> > >> > >>> >>> >> >>> we
>>>>>>>> > >> > >>> >>> >> >>> still want to move forward with.  It would be nice to
>>>>>>>> > >> come
>>>>>>>> > >> > to
>>>>>>>> > >> > >>> a
>>>>>>>> > >> > >>> >>> >> >>> conclusion
>>>>>>>> > >> > >>> >>> >> >>> quickly, ideally in the next few days, to avoid a lot of
>>>>>>>> > >> > merge
>>>>>>>> > >> > >>> >>> >> conflicts.
>>>>>>>> > >> > >>> >>> >> >>>
>>>>>>>> > >> > >>> >>> >> >>> The reason I did this work now is the C++ implementation
>>>>>>>> > >> has
>>>>>>>> > >> > >>> added
>>>>>>>> > >> > >>> >>> >> >>> support
>>>>>>>> > >> > >>> >>> >> >>> for LargeList, LargeBinary and LargeString arrays and
>>>>>>>> > >> based
>>>>>>>> > >> > on
>>>>>>>> > >> > >>> >>> prior
>>>>>>>> > >> > >>> >>> >> >>> discussions we need to have similar support in Java
>>>>>>>> > >> before
>>>>>>>> > >> > our
>>>>>>>> > >> > >>> >>> next
>>>>>>>> > >> > >>> >>> >> >>> release. Support 64-bit indexes means we can have full
>>>>>>>> > >> > >>> >>> compatibility
>>>>>>>> > >> > >>> >>> >> and
>>>>>>>> > >> > >>> >>> >> >>> make the most use of the types in Java.
>>>>>>>> > >> > >>> >>> >> >>>
>>>>>>>> > >> > >>> >>> >> >>> Look forward to hearing feedback.
>>>>>>>> > >> > >>> >>> >> >>>
>>>>>>>> > >> > >>> >>> >> >>> Thanks,
>>>>>>>> > >> > >>> >>> >> >>> Micah
>>>>>>>> > >> > >>> >>> >> >>>
>>>>>>>> > >> > >>> >>> >> >>> [1] https://github.com/apache/arrow/pull/5020
>>>>>>>> > >> > >>> >>> >> >>>
>>>>>>>> > >> > >>> >>> >> >>
>>>>>>>> > >> > >>> >>> >>
>>>>>>>> > >> > >>> >>> >
>>>>>>>> > >> > >>> >>>
>>>>>>>> > >> > >>> >>
>>>>>>>> > >> > >>>
>>>>>>>> > >> > >>
>>>>>>>> > >> >
>>>>>>>> > >>
>>>>>>>> > >

Re: [Discuss][Java] 64-bit lengths for ValueVectors

Posted by Micah Kornfield <em...@gmail.com>.

Apologies for the long delay, I chose to do the minimal work of limiting
this change [1] to allowing ArrowBuf to 64-bit lengths.  This would unblock
work on LargeString and LargeBinary.  If this change looks OK, I think
there is some follow-up work to add more thorough unit/integration tests.

As an aside, it does seem like the 2GB limit is affecting some users in
Spark [2][3], so hopefully LargeString would help with this.

Allowing more than MAX_INT elements is Vectors/array still a blocker for
making LargeList useful.

Thanks,
Micah

[1]  https://github.com/apache/arrow/pull/5020
[2]
https://stackoverflow.com/questions/58739888/spark-is-it-possible-to-increase-pyarrow-buffer#comment103812119_58739888
[3] https://issues.apache.org/jira/browse/ARROW-4890

On Fri, Aug 23, 2019 at 8:33 AM Jacques Nadeau <ja...@apache.org> wrote:

>
>
> On Fri, Aug 23, 2019, 8:55 PM Micah Kornfield <em...@gmail.com>
> wrote:
>
>> The vector indexes being limited to 32 bits doesn't limit the addressing
>>> to 32 bit chunks of memory. For example, you're prime example before was
>>> image data. Having 2 billion images of 1mb images would still be supported
>>> without changing the index addressing.
>>
>> This might be pre-coffee math, but I think we are limited to
>> approximately 2000 images because an ArrowBuf only holds up-to 2 billion
>> bytes [1].  While we have plenty of room for the offsets, we quickly run
>> out of contiguous data storage space. For LargeString and LargeBinary this
>> could be fixed by changing ArrowBuf.
>>
>> LargeArray faces the same problem only it applies to its child vectors.
>> Supporting LargeArray properly is really what drove the large-scale
>> interface change.
>>
>
> My expressed concern about these changes was specifically about the use of
> long for get/set in the vector interfaces. I'm not saying that we constrain
> memory/ArrowBufs to 32bits.
>
>
>> Rebase would help if possible.
>>
>> I'll try to get to this in the next few days.
>>
>> [1]
>> https://github.com/apache/arrow/blob/95175fe7cb8439eebe6d2f6e0495f551d6864380/java/memory/src/main/java/io/netty/buffer/ArrowBuf.java#L164
>>
>> On Fri, Aug 23, 2019 at 4:55 AM Jacques Nadeau <ja...@apache.org>
>> wrote:
>>
>>>
>>>
>>> On Fri, Aug 23, 2019, 11:49 AM Micah Kornfield <em...@gmail.com>
>>> wrote:
>>>
>>>> I don't think we should couple this discussion with the implementation
>>>>> of large list, etc since I think those two concepts are independent.
>>>>
>>>> I'm still trying to balance in my mind which is a worse experience for
>>>> consumers of the libraries for these types.  Claiming that Java supports
>>>> these types but throwing an exception when the Vectors exceed 32-bits or
>>>> just say they aren't supported until we have 64-bit support in Java.
>>>>
>>>
>>> The vector indexes being limited to 32 bits doesn't limit the addressing
>>> to 32 bit chunks of memory. For example, you're prime example before was
>>> image data. Having 2 billion images of 1mb images would still be supported
>>> without changing the index addressing.
>>>
>>>
>>>
>>>>
>>>>> I've asked some others on my team their opinions on the risk here. I
>>>>> think we should probably review some our more complex vector interactions
>>>>> and see how the jvm's assembly changes with this kind of change. Using
>>>>> microbenchmarking is good but I think we also need to see whether we're
>>>>> constantly inserting additional instructions or if in most cases, this
>>>>> actually doesn't impact instruction count.
>>>>
>>>>
>>>> Is this something that your team will take on?
>>>>
>>>
>>>
>>> Yeah, we need to look at this I think.
>>>
>>> Do you need a rebased version of the PR or is the existing one
>>>> sufficient?
>>>>
>>>
>>> Rebase would help if possible.
>>>
>>>
>>>
>>>> Thanks,
>>>> Micah
>>>>
>>>> On Thu, Aug 22, 2019 at 8:56 PM Jacques Nadeau <ja...@apache.org>
>>>> wrote:
>>>>
>>>>> I don't think we should couple this discussion with the implementation
>>>>> of large list, etc since I think those two concepts are independent.
>>>>>
>>>>> I've asked some others on my team their opinions on the risk here. I
>>>>> think we should probably review some our more complex vector interactions
>>>>> and see how the jvm's assembly changes with this kind of change. Using
>>>>> microbenchmarking is good but I think we also need to see whether we're
>>>>> constantly inserting additional instructions or if in most cases, this
>>>>> actually doesn't impact instruction count.
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Aug 21, 2019 at 12:18 PM Micah Kornfield <
>>>>> emkornfield@gmail.com> wrote:
>>>>>
>>>>>>
>>>>>>> With regards to the reference implementation point. It is a good
>>>>>>> point. I'm on vacation this week. Unless you're pushing hard on this, can
>>>>>>> we pick this up and discuss more next week?
>>>>>>
>>>>>>
>>>>>> Hi Jacques, I hope you had a good rest.  Any more thoughts on the
>>>>>> reference implementation aspect of this?
>>>>>>
>>>>>>
>>>>>>> To copy the sentiments from the 0.15.0 release thread, I think it
>>>>>>> would be best to decouple this discussion from the release timeline
>>>>>>> given how many people we have relying on regular releases coming out.
>>>>>>> We can keep continue making major 0.x releases until we're ready to
>>>>>>> release 1.0.0.
>>>>>>
>>>>>>
>>>>>> I'm OK with it as long as other stakeholders are. Timed releases are
>>>>>> the way to go.  As stated on the release thread [1] we need a better
>>>>>> mechanism to avoid this type of issue arising again.  The release thread
>>>>>> also had some more discussion on compatibility.
>>>>>>
>>>>>> Thanks,
>>>>>> Micah
>>>>>>
>>>>>> [1]
>>>>>> https://lists.apache.org/thread.html/d70feeceaf2570906ade117030b29887af7c77ca5c4a976e6d555920@%3Cdev.arrow.apache.org%3E
>>>>>>
>>>>>>
>>>>>> On Wed, Aug 14, 2019 at 3:23 PM Wes McKinney <we...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> On Sun, Aug 11, 2019 at 9:40 PM Micah Kornfield <
>>>>>>> emkornfield@gmail.com> wrote:
>>>>>>> >
>>>>>>> > Hi Wes and Jacques,
>>>>>>> > See responses below.
>>>>>>> >
>>>>>>> > With regards to the reference implementation point. It is a good
>>>>>>> point. I'm
>>>>>>> > > on vacation this week. Unless you're pushing hard on this, can
>>>>>>> we pick this
>>>>>>> > > up and discuss more next week?
>>>>>>> >
>>>>>>> >
>>>>>>> > Sure thing, enjoy your vacation.  I think the only practical
>>>>>>> implications
>>>>>>> > are it delays choices around implementing LargeList, LargeBinary,
>>>>>>> > LargeString in Java, which in turn might push out the 0.15.0
>>>>>>> release.
>>>>>>> >
>>>>>>>
>>>>>>> To copy the sentiments from the 0.15.0 release thread, I think it
>>>>>>> would be best to decouple this discussion from the release timeline
>>>>>>> given how many people we have relying on regular releases coming out.
>>>>>>> We can keep continue making major 0.x releases until we're ready to
>>>>>>> release 1.0.0.
>>>>>>>
>>>>>>> > My stance on this is that I don't know how important it is for
>>>>>>> Java to
>>>>>>> > > support vectors over INT32_MAX elements. The use cases enabled by
>>>>>>> > > having very large arrays seem to be concentrated in the native
>>>>>>> code
>>>>>>> > > world (e.g. C/C++/Rust) -- that could just be
>>>>>>> implementation-centrism
>>>>>>> > > on my part, though.
>>>>>>> >
>>>>>>> >
>>>>>>> > A data point against this view is Spark has done work to eliminate
>>>>>>> 2GB
>>>>>>> > memory limits on its block sizes [1].  I don't claim to understand
>>>>>>> the
>>>>>>> > implications of this. Bryan might you have any thoughts here?  I'm
>>>>>>> OK with
>>>>>>> > INT32_MAX, as well, I think we should think about what this means
>>>>>>> for
>>>>>>> > adding Large types to Java and implications for reference
>>>>>>> implementations.
>>>>>>> >
>>>>>>> > Thanks,
>>>>>>> > Micah
>>>>>>> >
>>>>>>> > [1] https://issues.apache.org/jira/browse/SPARK-6235
>>>>>>> >
>>>>>>> > On Sun, Aug 11, 2019 at 6:31 PM Jacques Nadeau <ja...@apache.org>
>>>>>>> wrote:
>>>>>>> >
>>>>>>> > > Hey Micah,
>>>>>>> > >
>>>>>>> > > Appreciate the offer on the compiling. The reality is I'm more
>>>>>>> concerned
>>>>>>> > > about the unknowns than the compiling issue itself. Any time
>>>>>>> you've been
>>>>>>> > > tuning for a while, changing something like this could be
>>>>>>> totally fine or
>>>>>>> > > cause a couple of major issues. For example, we've done a very
>>>>>>> large amount
>>>>>>> > > of work reducing heap memory footprint of the vectors. Are
>>>>>>> target is to
>>>>>>> > > actually get it down to 24 bytes per ArrowBuf and 24 bytes heap
>>>>>>> per vector
>>>>>>> > > (not including arrow bufs).
>>>>>>> > >
>>>>>>> > > With regards to the reference implementation point. It is a good
>>>>>>> point.
>>>>>>> > > I'm on vacation this week. Unless you're pushing hard on this,
>>>>>>> can we pick
>>>>>>> > > this up and discuss more next week?
>>>>>>> > >
>>>>>>> > > thanks,
>>>>>>> > > Jacques
>>>>>>> > >
>>>>>>> > > On Sat, Aug 10, 2019 at 7:39 PM Micah Kornfield <
>>>>>>> emkornfield@gmail.com>
>>>>>>> > > wrote:
>>>>>>> > >
>>>>>>> > >> Hi Jacques,
>>>>>>> > >> I definitely understand these concerns and this change is risky
>>>>>>> because it
>>>>>>> > >> is so large.  Perhaps, creating a new hierarchy, might be the
>>>>>>> cleanest way
>>>>>>> > >> of dealing with this.  This could have other benefits like
>>>>>>> cleaning up
>>>>>>> > >> some
>>>>>>> > >> cruft around dictionary encode and "orphaned" method.   Per
>>>>>>> past e-mail
>>>>>>> > >> threads I agree it is beneficial to have 2 separate reference
>>>>>>> > >> implementations that can communicate fully, and my intent here
>>>>>>> was to
>>>>>>> > >> close
>>>>>>> > >> that gap.
>>>>>>> > >>
>>>>>>> > >> Trying to
>>>>>>> > >> > determine the ramifications of these changes would be
>>>>>>> challenging and
>>>>>>> > >> time
>>>>>>> > >> > consuming against all the different ways we interact with the
>>>>>>> Arrow Java
>>>>>>> > >> > library.
>>>>>>> > >>
>>>>>>> > >>
>>>>>>> > >> Understood.  I took a quick look at Dremio-OSS it seems like it
>>>>>>> has a
>>>>>>> > >> simple java build system?  If it is helpful, I can try to get a
>>>>>>> fork
>>>>>>> > >> running that at least compiles against this PR.  My plan would
>>>>>>> be to cast
>>>>>>> > >> any place that was changed to return a long back to an int, so
>>>>>>> in essence
>>>>>>> > >> the Dremio algorithms would reman 32-bit implementations.
>>>>>>> > >>
>>>>>>> > >> I don't  have the infrastructure to test this change properly
>>>>>>> from a
>>>>>>> > >> distributed systems perspective, so it would still take some
>>>>>>> time from
>>>>>>> > >> Dremio to validate for regressions.
>>>>>>> > >>
>>>>>>> > >> I'm not saying I'm against this but want to make sure we've
>>>>>>> > >> > explored all less disruptive options before considering
>>>>>>> changing
>>>>>>> > >> something
>>>>>>> > >> > this fundamental (especially when I generally hold the view
>>>>>>> that large
>>>>>>> > >> cell
>>>>>>> > >> > counts against massive contiguous memory is an anti pattern
>>>>>>> to scalable
>>>>>>> > >> > analytical processing--purely subjective of course).
>>>>>>> > >>
>>>>>>> > >>
>>>>>>> > >> I'm open to other ideas here, as well. I don't think it is out
>>>>>>> of the
>>>>>>> > >> question to leave the Java implementation as 32-bit, but if we
>>>>>>> do, then I
>>>>>>> > >> think we should consider a different strategy for reference
>>>>>>> > >> implementations.
>>>>>>> > >>
>>>>>>> > >> Thanks,
>>>>>>> > >> Micah
>>>>>>> > >>
>>>>>>> > >> On Sat, Aug 10, 2019 at 5:09 PM Jacques Nadeau <
>>>>>>> jacques@apache.org>
>>>>>>> > >> wrote:
>>>>>>> > >>
>>>>>>> > >> > Hey Micah, I didn't have a particular path in mind. Was
>>>>>>> thinking more
>>>>>>> > >> along
>>>>>>> > >> > the lines of extra methods as opposed to separate classes.
>>>>>>> > >> >
>>>>>>> > >> > Arrow hasn't historically been a place where we're writing
>>>>>>> algorithms in
>>>>>>> > >> > Java so the fact that they aren't there doesn't mean they
>>>>>>> don't exist.
>>>>>>> > >> We
>>>>>>> > >> > have a large amount of code that depends on the current
>>>>>>> behavior that is
>>>>>>> > >> > deployed in hundreds of customer clusters (you can peruse our
>>>>>>> dremio
>>>>>>> > >> repo
>>>>>>> > >> > to see how extensively we leverage Arrow if interested).
>>>>>>> Trying to
>>>>>>> > >> > determine the ramifications of these changes would be
>>>>>>> challenging and
>>>>>>> > >> time
>>>>>>> > >> > consuming against all the different ways we interact with the
>>>>>>> Arrow Java
>>>>>>> > >> > library. I'm not saying I'm against this but want to make
>>>>>>> sure we've
>>>>>>> > >> > explored all less disruptive options before considering
>>>>>>> changing
>>>>>>> > >> something
>>>>>>> > >> > this fundamental (especially when I generally hold the view
>>>>>>> that large
>>>>>>> > >> cell
>>>>>>> > >> > counts against massive contiguous memory is an anti pattern
>>>>>>> to scalable
>>>>>>> > >> > analytical processing--purely subjective of course).
>>>>>>> > >> >
>>>>>>> > >> > On Sat, Aug 10, 2019, 4:17 PM Micah Kornfield <
>>>>>>> emkornfield@gmail.com>
>>>>>>> > >> > wrote:
>>>>>>> > >> >
>>>>>>> > >> > > Hi Jacques,
>>>>>>> > >> > > What avenue were you thinking for supporting both paths?
>>>>>>>  I didn't
>>>>>>> > >> want
>>>>>>> > >> > > to pursue a different class hierarchy, because I felt like
>>>>>>> that would
>>>>>>> > >> > > effectively fork the code base, but that is potentially an
>>>>>>> option that
>>>>>>> > >> > > would allow us to have a complete reference implementation
>>>>>>> in Java
>>>>>>> > >> that
>>>>>>> > >> > can
>>>>>>> > >> > > fully interact with C++, without major changes to this code.
>>>>>>> > >> > >
>>>>>>> > >> > > For supporting both APIs on the same classes/interfaces, I
>>>>>>> think they
>>>>>>> > >> > > roughly fall into three categories, changes to input
>>>>>>> parameters,
>>>>>>> > >> changes
>>>>>>> > >> > to
>>>>>>> > >> > > output parameters and algorithm changes.
>>>>>>> > >> > >
>>>>>>> > >> > > For inputs, changing from int to long is essentially a
>>>>>>> no-op from the
>>>>>>> > >> > > compiler perspective.  From the limited micro-benchmarking
>>>>>>> this also
>>>>>>> > >> > > doesn't seem to have a performance impact.  So we could
>>>>>>> keep two
>>>>>>> > >> versions
>>>>>>> > >> > > of the methods that only differ on inputs, but it is not
>>>>>>> clear what
>>>>>>> > >> the
>>>>>>> > >> > > value of that would be.
>>>>>>> > >> > >
>>>>>>> > >> > > For outputs, we can't support methods "long getLength()"
>>>>>>> and "int
>>>>>>> > >> > > getLength()" in the same class, so we would be forced into
>>>>>>> something
>>>>>>> > >> like
>>>>>>> > >> > > "long getLength(boolean dummy)" which I think is a less
>>>>>>> desirable.
>>>>>>> > >> > >
>>>>>>> > >> > > For algorithm changes, there did not appear to be too many
>>>>>>> places
>>>>>>> > >> where
>>>>>>> > >> > we
>>>>>>> > >> > > actually loop over all elements (it is quite possible I
>>>>>>> missed
>>>>>>> > >> something
>>>>>>> > >> > > here), the ones that I did find I was able to mitigate
>>>>>>> performance
>>>>>>> > >> > > penalties as noted above.  Some of the current
>>>>>>> implementation will
>>>>>>> > >> get a
>>>>>>> > >> > > lot slower for "large arrays", but we can likely fix those
>>>>>>> later or in
>>>>>>> > >> > this
>>>>>>> > >> > > PR with a nested while loop instead of 2 for loops.
>>>>>>> > >> > >
>>>>>>> > >> > > Thanks,
>>>>>>> > >> > > Micah
>>>>>>> > >> > >
>>>>>>> > >> > >
>>>>>>> > >> > > On Saturday, August 10, 2019, Jacques Nadeau <
>>>>>>> jacques@apache.org>
>>>>>>> > >> wrote:
>>>>>>> > >> > >
>>>>>>> > >> > >> This is a pretty massive change to the apis. I wonder how
>>>>>>> nasty it
>>>>>>> > >> would
>>>>>>> > >> > >> be to just support both paths. Have you evaluated how
>>>>>>> complex that
>>>>>>> > >> > would be?
>>>>>>> > >> > >>
>>>>>>> > >> > >> On Wed, Aug 7, 2019 at 11:08 PM Micah Kornfield <
>>>>>>> > >> emkornfield@gmail.com>
>>>>>>> > >> > >> wrote:
>>>>>>> > >> > >>
>>>>>>> > >> > >>> After more investigation, it looks like Float8Benchmarks
>>>>>>> at least
>>>>>>> > >> on my
>>>>>>> > >> > >>> machine are within the range of noise.
>>>>>>> > >> > >>>
>>>>>>> > >> > >>> For BitVectorHelper I pushed a new commit [1], seems to
>>>>>>> bring the
>>>>>>> > >> > >>> BitVectorHelper benchmarks back inline (and even with some
>>>>>>> > >> improvement
>>>>>>> > >> > >>> for
>>>>>>> > >> > >>> getNullCountBenchmark).
>>>>>>> > >> > >>>
>>>>>>> > >> > >>> Benchmark                                        Mode
>>>>>>> Cnt   Score
>>>>>>> > >> > >>>  Error
>>>>>>> > >> > >>>  Units
>>>>>>> > >> > >>> BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt
>>>>>>> 5   3.821 ±
>>>>>>> > >> > >>> 0.031
>>>>>>> > >> > >>>  ns/op
>>>>>>> > >> > >>> BitVectorHelperBenchmarks.getNullCountBenchmark  avgt
>>>>>>> 5  14.884 ±
>>>>>>> > >> > >>> 0.141
>>>>>>> > >> > >>>  ns/op
>>>>>>> > >> > >>>
>>>>>>> > >> > >>> I applied the same pattern to other loops that I could
>>>>>>> find, and for
>>>>>>> > >> > any
>>>>>>> > >> > >>> "for (long" loop on the critical path, I broke it up into
>>>>>>> two loops.
>>>>>>> > >> > the
>>>>>>> > >> > >>> first loop does iteration by integer, the second finishes
>>>>>>> off for
>>>>>>> > >> any
>>>>>>> > >> > >>> long
>>>>>>> > >> > >>> values.  As a side note it seems like optimization for
>>>>>>> loops using
>>>>>>> > >> long
>>>>>>> > >> > >>> counters at least have a semi-recent open bug for the JVM
>>>>>>> [2]
>>>>>>> > >> > >>>
>>>>>>> > >> > >>> Thanks,
>>>>>>> > >> > >>> Micah
>>>>>>> > >> > >>>
>>>>>>> > >> > >>> [1]
>>>>>>> > >> > >>>
>>>>>>> > >> > >>>
>>>>>>> > >> >
>>>>>>> > >>
>>>>>>> https://github.com/apache/arrow/pull/5020/commits/2ea2c1ae83e3baa7b9a99a6d06276d968df41797
>>>>>>> > >> > >>> [2] https://bugs.openjdk.java.net/browse/JDK-8223051
>>>>>>> > >> > >>>
>>>>>>> > >> > >>> On Wed, Aug 7, 2019 at 8:11 PM Micah Kornfield <
>>>>>>> > >> emkornfield@gmail.com>
>>>>>>> > >> > >>> wrote:
>>>>>>> > >> > >>>
>>>>>>> > >> > >>> > Indeed, the BoundChecking and CheckNullForGet variables
>>>>>>> can make a
>>>>>>> > >> > big
>>>>>>> > >> > >>> > difference.  I didn't initially run the benchmarks with
>>>>>>> these
>>>>>>> > >> turned
>>>>>>> > >> > on
>>>>>>> > >> > >>> > (you can see the result from above with
>>>>>>> Float8Benchmarks).  Here
>>>>>>> > >> are
>>>>>>> > >> > >>> new
>>>>>>> > >> > >>> > numbers including with the flags enabled.  It looks
>>>>>>> like using
>>>>>>> > >> longs
>>>>>>> > >> > >>> might
>>>>>>> > >> > >>> > be a little bit slower, I'll see what I can do to
>>>>>>> mitigate this.
>>>>>>> > >> > >>> >
>>>>>>> > >> > >>> > Ravindra also volunteered to try to benchmark the
>>>>>>> changes with
>>>>>>> > >> > Dremio's
>>>>>>> > >> > >>> > code on today's sync call.
>>>>>>> > >> > >>> >
>>>>>>> > >> > >>> > New
>>>>>>> > >> > >>> >
>>>>>>> > >> > >>> > Benchmark                                        Mode
>>>>>>> Cnt   Score
>>>>>>> > >> > >>>  Error
>>>>>>> > >> > >>> > Units
>>>>>>> > >> > >>> >
>>>>>>> > >> > >>> > BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt
>>>>>>>   5
>>>>>>> > >>  4.176 ±
>>>>>>> > >> > >>> 1.292
>>>>>>> > >> > >>> > ns/op
>>>>>>> > >> > >>> >
>>>>>>> > >> > >>> > BitVectorHelperBenchmarks.getNullCountBenchmark  avgt
>>>>>>>   5
>>>>>>> > >> 26.102 ±
>>>>>>> > >> > >>> 0.700
>>>>>>> > >> > >>> > ns/op
>>>>>>> > >> > >>> >
>>>>>>> > >> > >>> > Float8Benchmarks.copyFromBenchmark   avgt    5  7.398 ±
>>>>>>> 0.084
>>>>>>> > >> us/op
>>>>>>> > >> > >>> >
>>>>>>> > >> > >>> > Float8Benchmarks.readWriteBenchmark  avgt    5  2.711 ±
>>>>>>> 0.057
>>>>>>> > >> us/op
>>>>>>> > >> > >>> >
>>>>>>> > >> > >>> >
>>>>>>> > >> > >>> >
>>>>>>> > >> > >>> > old
>>>>>>> > >> > >>> >
>>>>>>> > >> > >>> > BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt
>>>>>>>   5
>>>>>>> > >>  3.828 ±
>>>>>>> > >> > >>> 0.030
>>>>>>> > >> > >>> > ns/op
>>>>>>> > >> > >>> >
>>>>>>> > >> > >>> > BitVectorHelperBenchmarks.getNullCountBenchmark  avgt
>>>>>>>   5
>>>>>>> > >> 20.611 ±
>>>>>>> > >> > >>> 0.188
>>>>>>> > >> > >>> > ns/op
>>>>>>> > >> > >>> >
>>>>>>> > >> > >>> > Float8Benchmarks.copyFromBenchmark   avgt    5  6.597 ±
>>>>>>> 0.462
>>>>>>> > >> us/op
>>>>>>> > >> > >>> >
>>>>>>> > >> > >>> > Float8Benchmarks.readWriteBenchmark  avgt    5  2.615 ±
>>>>>>> 0.027
>>>>>>> > >> us/op
>>>>>>> > >> > >>> >
>>>>>>> > >> > >>> > On Wed, Aug 7, 2019 at 7:13 PM Fan Liya <
>>>>>>> liya.fan03@gmail.com>
>>>>>>> > >> > wrote:
>>>>>>> > >> > >>> >
>>>>>>> > >> > >>> >> Hi Gonzalo,
>>>>>>> > >> > >>> >>
>>>>>>> > >> > >>> >> Thanks for sharing the performance results.
>>>>>>> > >> > >>> >> I am wondering if you have turned off the flag
>>>>>>> > >> > >>> >> BoundsChecking#BOUNDS_CHECKING_ENABLED.
>>>>>>> > >> > >>> >> If not, the lower throughput should be expected.
>>>>>>> > >> > >>> >>
>>>>>>> > >> > >>> >> Best,
>>>>>>> > >> > >>> >> Liya Fan
>>>>>>> > >> > >>> >>
>>>>>>> > >> > >>> >> On Wed, Aug 7, 2019 at 10:23 PM Micah Kornfield <
>>>>>>> > >> > >>> emkornfield@gmail.com>
>>>>>>> > >> > >>> >> wrote:
>>>>>>> > >> > >>> >>
>>>>>>> > >> > >>> >>> Hi Gonzalo,
>>>>>>> > >> > >>> >>> Thank you for the feedback.  I wasn't aware of the JIT
>>>>>>> > >> > >>> implications.   At
>>>>>>> > >> > >>> >>> least on the benchmark run they don't seem to have an
>>>>>>> impact.
>>>>>>> > >> > >>> >>>
>>>>>>> > >> > >>> >>> If there are other benchmarks that people have that
>>>>>>> can
>>>>>>> > >> validate if
>>>>>>> > >> > >>> this
>>>>>>> > >> > >>> >>> change will be problematic I would appreciate trying
>>>>>>> to run them
>>>>>>> > >> > >>> with the
>>>>>>> > >> > >>> >>> PR.  I will try to run the ones for zeroing/popcnt
>>>>>>> tonight to
>>>>>>> > >> see
>>>>>>> > >> > if
>>>>>>> > >> > >>> >>> there
>>>>>>> > >> > >>> >>> is a change in those.
>>>>>>> > >> > >>> >>>
>>>>>>> > >> > >>> >>> -Micah
>>>>>>> > >> > >>> >>>
>>>>>>> > >> > >>> >>>
>>>>>>> > >> > >>> >>>
>>>>>>> > >> > >>> >>> On Wednesday, August 7, 2019, Gonzalo Ortiz
>>>>>>> Jaureguizar <
>>>>>>> > >> > >>> >>> golthiryus@gmail.com> wrote:
>>>>>>> > >> > >>> >>>
>>>>>>> > >> > >>> >>> > I would recommend to take care with this kind of
>>>>>>> changes.
>>>>>>> > >> > >>> >>> >
>>>>>>> > >> > >>> >>> > I didn't try Arrow in more than one year, but by
>>>>>>> then the
>>>>>>> > >> > >>> performance
>>>>>>> > >> > >>> >>> was
>>>>>>> > >> > >>> >>> > quite bad in comparison with plain byte buffer
>>>>>>> access
>>>>>>> > >> > >>> >>> > (see
>>>>>>> http://git.net/apache-arrow-development/msg02353.html *)
>>>>>>> > >> > and
>>>>>>> > >> > >>> >>> > there are several optimizations that the JVM
>>>>>>> (specifically,
>>>>>>> > >> C2)
>>>>>>> > >> > >>> does
>>>>>>> > >> > >>> >>> not
>>>>>>> > >> > >>> >>> > apply when dealing with int instead of longs. One
>>>>>>> of the
>>>>>>> > >> > >>> >>> > most commons is the loop unrolling and
>>>>>>> vectorization.
>>>>>>> > >> > >>> >>> >
>>>>>>> > >> > >>> >>> > * It doesn't seem the best way to reference an old
>>>>>>> email on
>>>>>>> > >> the
>>>>>>> > >> > >>> list,
>>>>>>> > >> > >>> >>> but
>>>>>>> > >> > >>> >>> > it is the only result shown by Google
>>>>>>> > >> > >>> >>> >
>>>>>>> > >> > >>> >>> > El mié., 7 ago. 2019 a las 11:42, Fan Liya (<
>>>>>>> > >> > liya.fan03@gmail.com
>>>>>>> > >> > >>> >)
>>>>>>> > >> > >>> >>> > escribió:
>>>>>>> > >> > >>> >>> >
>>>>>>> > >> > >>> >>> >> Hi Micah,
>>>>>>> > >> > >>> >>> >>
>>>>>>> > >> > >>> >>> >> Thanks for your effort. The performance result
>>>>>>> looks good.
>>>>>>> > >> > >>> >>> >>
>>>>>>> > >> > >>> >>> >> As you indicated, ArrowBuf will take additional 12
>>>>>>> bytes (4
>>>>>>> > >> > bytes
>>>>>>> > >> > >>> for
>>>>>>> > >> > >>> >>> each
>>>>>>> > >> > >>> >>> >> of length, write index, and read index).
>>>>>>> > >> > >>> >>> >> Similar overheads also exist for vectors like
>>>>>>> > >> > >>> BaseFixedWidthVector,
>>>>>>> > >> > >>> >>> >> BaseVariableWidthVector, etc.
>>>>>>> > >> > >>> >>> >>
>>>>>>> > >> > >>> >>> >> IMO, such overheads are small enough to justify
>>>>>>> the change.
>>>>>>> > >> > >>> >>> >> Let's check if there are other overheads.
>>>>>>> > >> > >>> >>> >>
>>>>>>> > >> > >>> >>> >> Best,
>>>>>>> > >> > >>> >>> >> Liya Fan
>>>>>>> > >> > >>> >>> >>
>>>>>>> > >> > >>> >>> >> On Wed, Aug 7, 2019 at 3:30 PM Micah Kornfield <
>>>>>>> > >> > >>> emkornfield@gmail.com
>>>>>>> > >> > >>> >>> >
>>>>>>> > >> > >>> >>> >> wrote:
>>>>>>> > >> > >>> >>> >>
>>>>>>> > >> > >>> >>> >> > Hi Liya Fan,
>>>>>>> > >> > >>> >>> >> > Based on the Float8Benchmark there does not seem
>>>>>>> to be any
>>>>>>> > >> > >>> >>> meaningful
>>>>>>> > >> > >>> >>> >> > performance difference on my machine.  At least
>>>>>>> for me, the
>>>>>>> > >> > >>> >>> benchmarks
>>>>>>> > >> > >>> >>> >> are
>>>>>>> > >> > >>> >>> >> > not stable enough to say one is faster than the
>>>>>>> other (I've
>>>>>>> > >> > >>> pasted
>>>>>>> > >> > >>> >>> >> results
>>>>>>> > >> > >>> >>> >> > below).  That being said my machine isn't
>>>>>>> necessarily the
>>>>>>> > >> most
>>>>>>> > >> > >>> >>> reliable
>>>>>>> > >> > >>> >>> >> for
>>>>>>> > >> > >>> >>> >> > benchmarking.
>>>>>>> > >> > >>> >>> >> >
>>>>>>> > >> > >>> >>> >> > On an intuitive level, this makes sense to me,
>>>>>>> for the
>>>>>>> > >> most
>>>>>>> > >> > >>> part it
>>>>>>> > >> > >>> >>> >> seems
>>>>>>> > >> > >>> >>> >> > like the change just moves casting from "int" to
>>>>>>> "long"
>>>>>>> > >> > further
>>>>>>> > >> > >>> up
>>>>>>> > >> > >>> >>> the
>>>>>>> > >> > >>> >>> >> > stack  for  "PlatformDepdendent" operations.  If
>>>>>>> there are
>>>>>>> > >> > other
>>>>>>> > >> > >>> >>> >> benchmarks
>>>>>>> > >> > >>> >>> >> > that you think are worth running let me know.
>>>>>>> > >> > >>> >>> >> >
>>>>>>> > >> > >>> >>> >> > One downside performance wise I think for his
>>>>>>> change is it
>>>>>>> > >> > >>> >>> increases the
>>>>>>> > >> > >>> >>> >> > size of ArrowBuf objects, which I suppose could
>>>>>>> influence
>>>>>>> > >> > cache
>>>>>>> > >> > >>> >>> misses
>>>>>>> > >> > >>> >>> >> at
>>>>>>> > >> > >>> >>> >> > some level or increase the size of call-stacks,
>>>>>>> but this
>>>>>>> > >> > doesn't
>>>>>>> > >> > >>> >>> seem to
>>>>>>> > >> > >>> >>> >> > show up in the benchmark..
>>>>>>> > >> > >>> >>> >> >
>>>>>>> > >> > >>> >>> >> > Thanks,
>>>>>>> > >> > >>> >>> >> > Micah
>>>>>>> > >> > >>> >>> >> >
>>>>>>> > >> > >>> >>> >> > Sample benchmark numbers:
>>>>>>> > >> > >>> >>> >> >
>>>>>>> > >> > >>> >>> >> > [New Code]
>>>>>>> > >> > >>> >>> >> > Benchmark                            Mode  Cnt
>>>>>>>  Score
>>>>>>> > >>  Error
>>>>>>> > >> > >>> >>> Units
>>>>>>> > >> > >>> >>> >> > Float8Benchmarks.copyFromBenchmark   avgt    5
>>>>>>> 15.441 ±
>>>>>>> > >> 0.469
>>>>>>> > >> > >>> >>> us/op
>>>>>>> > >> > >>> >>> >> > Float8Benchmarks.readWriteBenchmark  avgt    5
>>>>>>> 14.057 ±
>>>>>>> > >> 0.115
>>>>>>> > >> > >>> >>> us/op
>>>>>>> > >> > >>> >>> >> >
>>>>>>> > >> > >>> >>> >> >
>>>>>>> > >> > >>> >>> >> > [Old code]
>>>>>>> > >> > >>> >>> >> > Benchmark                            Mode  Cnt
>>>>>>>  Score
>>>>>>> > >>  Error
>>>>>>> > >> > >>> >>> Units
>>>>>>> > >> > >>> >>> >> > Float8Benchmarks.copyFromBenchmark   avgt    5
>>>>>>> 16.248 ±
>>>>>>> > >> 1.409
>>>>>>> > >> > >>> >>> us/op
>>>>>>> > >> > >>> >>> >> > Float8Benchmarks.readWriteBenchmark  avgt    5
>>>>>>> 14.150 ±
>>>>>>> > >> 0.084
>>>>>>> > >> > >>> >>> us/op
>>>>>>> > >> > >>> >>> >> >
>>>>>>> > >> > >>> >>> >> > On Tue, Aug 6, 2019 at 1:18 AM Fan Liya <
>>>>>>> > >> liya.fan03@gmail.com
>>>>>>> > >> > >
>>>>>>> > >> > >>> >>> wrote:
>>>>>>> > >> > >>> >>> >> >
>>>>>>> > >> > >>> >>> >> >> Hi Micah,
>>>>>>> > >> > >>> >>> >> >>
>>>>>>> > >> > >>> >>> >> >> Thanks a lot for doing this.
>>>>>>> > >> > >>> >>> >> >>
>>>>>>> > >> > >>> >>> >> >> I am a little concerned about if there is any
>>>>>>> negative
>>>>>>> > >> > >>> performance
>>>>>>> > >> > >>> >>> >> impact
>>>>>>> > >> > >>> >>> >> >> on the current 32-bit-length based applications.
>>>>>>> > >> > >>> >>> >> >> Can we do some performance comparison on our
>>>>>>> existing
>>>>>>> > >> > >>> benchmarks?
>>>>>>> > >> > >>> >>> >> >>
>>>>>>> > >> > >>> >>> >> >> Best,
>>>>>>> > >> > >>> >>> >> >> Liya Fan
>>>>>>> > >> > >>> >>> >> >>
>>>>>>> > >> > >>> >>> >> >>
>>>>>>> > >> > >>> >>> >> >> On Tue, Aug 6, 2019 at 3:35 PM Micah Kornfield <
>>>>>>> > >> > >>> >>> emkornfield@gmail.com>
>>>>>>> > >> > >>> >>> >> >> wrote:
>>>>>>> > >> > >>> >>> >> >>
>>>>>>> > >> > >>> >>> >> >>> There have been some previous discussions on
>>>>>>> the mailing
>>>>>>> > >> > about
>>>>>>> > >> > >>> >>> >> supporting
>>>>>>> > >> > >>> >>> >> >>> 64-bit lengths for  Java ValueVectors (this is
>>>>>>> what the
>>>>>>> > >> IPC
>>>>>>> > >> > >>> >>> >> specification
>>>>>>> > >> > >>> >>> >> >>> and C++ support).  I created a PR [1] that
>>>>>>> changes all
>>>>>>> > >> APIs
>>>>>>> > >> > >>> that I
>>>>>>> > >> > >>> >>> >> could
>>>>>>> > >> > >>> >>> >> >>> find that take an index to take an "long"
>>>>>>> instead of an
>>>>>>> > >> > "int"
>>>>>>> > >> > >>> (and
>>>>>>> > >> > >>> >>> >> >>> similarly change "size/rowcount" APIs).
>>>>>>> > >> > >>> >>> >> >>>
>>>>>>> > >> > >>> >>> >> >>> It is a big change, so I think it is worth
>>>>>>> discussing if
>>>>>>> > >> it
>>>>>>> > >> > is
>>>>>>> > >> > >>> >>> >> something
>>>>>>> > >> > >>> >>> >> >>> we
>>>>>>> > >> > >>> >>> >> >>> still want to move forward with.  It would be
>>>>>>> nice to
>>>>>>> > >> come
>>>>>>> > >> > to
>>>>>>> > >> > >>> a
>>>>>>> > >> > >>> >>> >> >>> conclusion
>>>>>>> > >> > >>> >>> >> >>> quickly, ideally in the next few days, to
>>>>>>> avoid a lot of
>>>>>>> > >> > merge
>>>>>>> > >> > >>> >>> >> conflicts.
>>>>>>> > >> > >>> >>> >> >>>
>>>>>>> > >> > >>> >>> >> >>> The reason I did this work now is the C++
>>>>>>> implementation
>>>>>>> > >> has
>>>>>>> > >> > >>> added
>>>>>>> > >> > >>> >>> >> >>> support
>>>>>>> > >> > >>> >>> >> >>> for LargeList, LargeBinary and LargeString
>>>>>>> arrays and
>>>>>>> > >> based
>>>>>>> > >> > on
>>>>>>> > >> > >>> >>> prior
>>>>>>> > >> > >>> >>> >> >>> discussions we need to have similar support in
>>>>>>> Java
>>>>>>> > >> before
>>>>>>> > >> > our
>>>>>>> > >> > >>> >>> next
>>>>>>> > >> > >>> >>> >> >>> release. Support 64-bit indexes means we can
>>>>>>> have full
>>>>>>> > >> > >>> >>> compatibility
>>>>>>> > >> > >>> >>> >> and
>>>>>>> > >> > >>> >>> >> >>> make the most use of the types in Java.
>>>>>>> > >> > >>> >>> >> >>>
>>>>>>> > >> > >>> >>> >> >>> Look forward to hearing feedback.
>>>>>>> > >> > >>> >>> >> >>>
>>>>>>> > >> > >>> >>> >> >>> Thanks,
>>>>>>> > >> > >>> >>> >> >>> Micah
>>>>>>> > >> > >>> >>> >> >>>
>>>>>>> > >> > >>> >>> >> >>> [1] https://github.com/apache/arrow/pull/5020
>>>>>>> > >> > >>> >>> >> >>>
>>>>>>> > >> > >>> >>> >> >>
>>>>>>> > >> > >>> >>> >>
>>>>>>> > >> > >>> >>> >
>>>>>>> > >> > >>> >>>
>>>>>>> > >> > >>> >>
>>>>>>> > >> > >>>
>>>>>>> > >> > >>
>>>>>>> > >> >
>>>>>>> > >>
>>>>>>> > >
>>>>>>>
>>>>>>

Re: [Discuss][Java] 64-bit lengths for ValueVectors

Posted by Jacques Nadeau <ja...@apache.org>.

On Fri, Aug 23, 2019, 8:55 PM Micah Kornfield <em...@gmail.com> wrote:

> The vector indexes being limited to 32 bits doesn't limit the addressing
>> to 32 bit chunks of memory. For example, you're prime example before was
>> image data. Having 2 billion images of 1mb images would still be supported
>> without changing the index addressing.
>
> This might be pre-coffee math, but I think we are limited to approximately
> 2000 images because an ArrowBuf only holds up-to 2 billion bytes [1].
> While we have plenty of room for the offsets, we quickly run out of
> contiguous data storage space. For LargeString and LargeBinary this could
> be fixed by changing ArrowBuf.
>
> LargeArray faces the same problem only it applies to its child vectors.
> Supporting LargeArray properly is really what drove the large-scale
> interface change.
>

My expressed concern about these changes was specifically about the use of
long for get/set in the vector interfaces. I'm not saying that we constrain
memory/ArrowBufs to 32bits.


> Rebase would help if possible.
>
> I'll try to get to this in the next few days.
>
> [1]
> https://github.com/apache/arrow/blob/95175fe7cb8439eebe6d2f6e0495f551d6864380/java/memory/src/main/java/io/netty/buffer/ArrowBuf.java#L164
>
> On Fri, Aug 23, 2019 at 4:55 AM Jacques Nadeau <ja...@apache.org> wrote:
>
>>
>>
>> On Fri, Aug 23, 2019, 11:49 AM Micah Kornfield <em...@gmail.com>
>> wrote:
>>
>>> I don't think we should couple this discussion with the implementation
>>>> of large list, etc since I think those two concepts are independent.
>>>
>>> I'm still trying to balance in my mind which is a worse experience for
>>> consumers of the libraries for these types.  Claiming that Java supports
>>> these types but throwing an exception when the Vectors exceed 32-bits or
>>> just say they aren't supported until we have 64-bit support in Java.
>>>
>>
>> The vector indexes being limited to 32 bits doesn't limit the addressing
>> to 32 bit chunks of memory. For example, you're prime example before was
>> image data. Having 2 billion images of 1mb images would still be supported
>> without changing the index addressing.
>>
>>
>>
>>>
>>>> I've asked some others on my team their opinions on the risk here. I
>>>> think we should probably review some our more complex vector interactions
>>>> and see how the jvm's assembly changes with this kind of change. Using
>>>> microbenchmarking is good but I think we also need to see whether we're
>>>> constantly inserting additional instructions or if in most cases, this
>>>> actually doesn't impact instruction count.
>>>
>>>
>>> Is this something that your team will take on?
>>>
>>
>>
>> Yeah, we need to look at this I think.
>>
>> Do you need a rebased version of the PR or is the existing one sufficient?
>>>
>>
>> Rebase would help if possible.
>>
>>
>>
>>> Thanks,
>>> Micah
>>>
>>> On Thu, Aug 22, 2019 at 8:56 PM Jacques Nadeau <ja...@apache.org>
>>> wrote:
>>>
>>>> I don't think we should couple this discussion with the implementation
>>>> of large list, etc since I think those two concepts are independent.
>>>>
>>>> I've asked some others on my team their opinions on the risk here. I
>>>> think we should probably review some our more complex vector interactions
>>>> and see how the jvm's assembly changes with this kind of change. Using
>>>> microbenchmarking is good but I think we also need to see whether we're
>>>> constantly inserting additional instructions or if in most cases, this
>>>> actually doesn't impact instruction count.
>>>>
>>>>
>>>>
>>>> On Wed, Aug 21, 2019 at 12:18 PM Micah Kornfield <em...@gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>>> With regards to the reference implementation point. It is a good
>>>>>> point. I'm on vacation this week. Unless you're pushing hard on this, can
>>>>>> we pick this up and discuss more next week?
>>>>>
>>>>>
>>>>> Hi Jacques, I hope you had a good rest.  Any more thoughts on the
>>>>> reference implementation aspect of this?
>>>>>
>>>>>
>>>>>> To copy the sentiments from the 0.15.0 release thread, I think it
>>>>>> would be best to decouple this discussion from the release timeline
>>>>>> given how many people we have relying on regular releases coming out.
>>>>>> We can keep continue making major 0.x releases until we're ready to
>>>>>> release 1.0.0.
>>>>>
>>>>>
>>>>> I'm OK with it as long as other stakeholders are. Timed releases are
>>>>> the way to go.  As stated on the release thread [1] we need a better
>>>>> mechanism to avoid this type of issue arising again.  The release thread
>>>>> also had some more discussion on compatibility.
>>>>>
>>>>> Thanks,
>>>>> Micah
>>>>>
>>>>> [1]
>>>>> https://lists.apache.org/thread.html/d70feeceaf2570906ade117030b29887af7c77ca5c4a976e6d555920@%3Cdev.arrow.apache.org%3E
>>>>>
>>>>>
>>>>> On Wed, Aug 14, 2019 at 3:23 PM Wes McKinney <we...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> On Sun, Aug 11, 2019 at 9:40 PM Micah Kornfield <
>>>>>> emkornfield@gmail.com> wrote:
>>>>>> >
>>>>>> > Hi Wes and Jacques,
>>>>>> > See responses below.
>>>>>> >
>>>>>> > With regards to the reference implementation point. It is a good
>>>>>> point. I'm
>>>>>> > > on vacation this week. Unless you're pushing hard on this, can we
>>>>>> pick this
>>>>>> > > up and discuss more next week?
>>>>>> >
>>>>>> >
>>>>>> > Sure thing, enjoy your vacation.  I think the only practical
>>>>>> implications
>>>>>> > are it delays choices around implementing LargeList, LargeBinary,
>>>>>> > LargeString in Java, which in turn might push out the 0.15.0
>>>>>> release.
>>>>>> >
>>>>>>
>>>>>> To copy the sentiments from the 0.15.0 release thread, I think it
>>>>>> would be best to decouple this discussion from the release timeline
>>>>>> given how many people we have relying on regular releases coming out.
>>>>>> We can keep continue making major 0.x releases until we're ready to
>>>>>> release 1.0.0.
>>>>>>
>>>>>> > My stance on this is that I don't know how important it is for Java
>>>>>> to
>>>>>> > > support vectors over INT32_MAX elements. The use cases enabled by
>>>>>> > > having very large arrays seem to be concentrated in the native
>>>>>> code
>>>>>> > > world (e.g. C/C++/Rust) -- that could just be
>>>>>> implementation-centrism
>>>>>> > > on my part, though.
>>>>>> >
>>>>>> >
>>>>>> > A data point against this view is Spark has done work to eliminate
>>>>>> 2GB
>>>>>> > memory limits on its block sizes [1].  I don't claim to understand
>>>>>> the
>>>>>> > implications of this. Bryan might you have any thoughts here?  I'm
>>>>>> OK with
>>>>>> > INT32_MAX, as well, I think we should think about what this means
>>>>>> for
>>>>>> > adding Large types to Java and implications for reference
>>>>>> implementations.
>>>>>> >
>>>>>> > Thanks,
>>>>>> > Micah
>>>>>> >
>>>>>> > [1] https://issues.apache.org/jira/browse/SPARK-6235
>>>>>> >
>>>>>> > On Sun, Aug 11, 2019 at 6:31 PM Jacques Nadeau <ja...@apache.org>
>>>>>> wrote:
>>>>>> >
>>>>>> > > Hey Micah,
>>>>>> > >
>>>>>> > > Appreciate the offer on the compiling. The reality is I'm more
>>>>>> concerned
>>>>>> > > about the unknowns than the compiling issue itself. Any time
>>>>>> you've been
>>>>>> > > tuning for a while, changing something like this could be totally
>>>>>> fine or
>>>>>> > > cause a couple of major issues. For example, we've done a very
>>>>>> large amount
>>>>>> > > of work reducing heap memory footprint of the vectors. Are target
>>>>>> is to
>>>>>> > > actually get it down to 24 bytes per ArrowBuf and 24 bytes heap
>>>>>> per vector
>>>>>> > > (not including arrow bufs).
>>>>>> > >
>>>>>> > > With regards to the reference implementation point. It is a good
>>>>>> point.
>>>>>> > > I'm on vacation this week. Unless you're pushing hard on this,
>>>>>> can we pick
>>>>>> > > this up and discuss more next week?
>>>>>> > >
>>>>>> > > thanks,
>>>>>> > > Jacques
>>>>>> > >
>>>>>> > > On Sat, Aug 10, 2019 at 7:39 PM Micah Kornfield <
>>>>>> emkornfield@gmail.com>
>>>>>> > > wrote:
>>>>>> > >
>>>>>> > >> Hi Jacques,
>>>>>> > >> I definitely understand these concerns and this change is risky
>>>>>> because it
>>>>>> > >> is so large.  Perhaps, creating a new hierarchy, might be the
>>>>>> cleanest way
>>>>>> > >> of dealing with this.  This could have other benefits like
>>>>>> cleaning up
>>>>>> > >> some
>>>>>> > >> cruft around dictionary encode and "orphaned" method.   Per past
>>>>>> e-mail
>>>>>> > >> threads I agree it is beneficial to have 2 separate reference
>>>>>> > >> implementations that can communicate fully, and my intent here
>>>>>> was to
>>>>>> > >> close
>>>>>> > >> that gap.
>>>>>> > >>
>>>>>> > >> Trying to
>>>>>> > >> > determine the ramifications of these changes would be
>>>>>> challenging and
>>>>>> > >> time
>>>>>> > >> > consuming against all the different ways we interact with the
>>>>>> Arrow Java
>>>>>> > >> > library.
>>>>>> > >>
>>>>>> > >>
>>>>>> > >> Understood.  I took a quick look at Dremio-OSS it seems like it
>>>>>> has a
>>>>>> > >> simple java build system?  If it is helpful, I can try to get a
>>>>>> fork
>>>>>> > >> running that at least compiles against this PR.  My plan would
>>>>>> be to cast
>>>>>> > >> any place that was changed to return a long back to an int, so
>>>>>> in essence
>>>>>> > >> the Dremio algorithms would reman 32-bit implementations.
>>>>>> > >>
>>>>>> > >> I don't  have the infrastructure to test this change properly
>>>>>> from a
>>>>>> > >> distributed systems perspective, so it would still take some
>>>>>> time from
>>>>>> > >> Dremio to validate for regressions.
>>>>>> > >>
>>>>>> > >> I'm not saying I'm against this but want to make sure we've
>>>>>> > >> > explored all less disruptive options before considering
>>>>>> changing
>>>>>> > >> something
>>>>>> > >> > this fundamental (especially when I generally hold the view
>>>>>> that large
>>>>>> > >> cell
>>>>>> > >> > counts against massive contiguous memory is an anti pattern to
>>>>>> scalable
>>>>>> > >> > analytical processing--purely subjective of course).
>>>>>> > >>
>>>>>> > >>
>>>>>> > >> I'm open to other ideas here, as well. I don't think it is out
>>>>>> of the
>>>>>> > >> question to leave the Java implementation as 32-bit, but if we
>>>>>> do, then I
>>>>>> > >> think we should consider a different strategy for reference
>>>>>> > >> implementations.
>>>>>> > >>
>>>>>> > >> Thanks,
>>>>>> > >> Micah
>>>>>> > >>
>>>>>> > >> On Sat, Aug 10, 2019 at 5:09 PM Jacques Nadeau <
>>>>>> jacques@apache.org>
>>>>>> > >> wrote:
>>>>>> > >>
>>>>>> > >> > Hey Micah, I didn't have a particular path in mind. Was
>>>>>> thinking more
>>>>>> > >> along
>>>>>> > >> > the lines of extra methods as opposed to separate classes.
>>>>>> > >> >
>>>>>> > >> > Arrow hasn't historically been a place where we're writing
>>>>>> algorithms in
>>>>>> > >> > Java so the fact that they aren't there doesn't mean they
>>>>>> don't exist.
>>>>>> > >> We
>>>>>> > >> > have a large amount of code that depends on the current
>>>>>> behavior that is
>>>>>> > >> > deployed in hundreds of customer clusters (you can peruse our
>>>>>> dremio
>>>>>> > >> repo
>>>>>> > >> > to see how extensively we leverage Arrow if interested).
>>>>>> Trying to
>>>>>> > >> > determine the ramifications of these changes would be
>>>>>> challenging and
>>>>>> > >> time
>>>>>> > >> > consuming against all the different ways we interact with the
>>>>>> Arrow Java
>>>>>> > >> > library. I'm not saying I'm against this but want to make sure
>>>>>> we've
>>>>>> > >> > explored all less disruptive options before considering
>>>>>> changing
>>>>>> > >> something
>>>>>> > >> > this fundamental (especially when I generally hold the view
>>>>>> that large
>>>>>> > >> cell
>>>>>> > >> > counts against massive contiguous memory is an anti pattern to
>>>>>> scalable
>>>>>> > >> > analytical processing--purely subjective of course).
>>>>>> > >> >
>>>>>> > >> > On Sat, Aug 10, 2019, 4:17 PM Micah Kornfield <
>>>>>> emkornfield@gmail.com>
>>>>>> > >> > wrote:
>>>>>> > >> >
>>>>>> > >> > > Hi Jacques,
>>>>>> > >> > > What avenue were you thinking for supporting both paths?   I
>>>>>> didn't
>>>>>> > >> want
>>>>>> > >> > > to pursue a different class hierarchy, because I felt like
>>>>>> that would
>>>>>> > >> > > effectively fork the code base, but that is potentially an
>>>>>> option that
>>>>>> > >> > > would allow us to have a complete reference implementation
>>>>>> in Java
>>>>>> > >> that
>>>>>> > >> > can
>>>>>> > >> > > fully interact with C++, without major changes to this code.
>>>>>> > >> > >
>>>>>> > >> > > For supporting both APIs on the same classes/interfaces, I
>>>>>> think they
>>>>>> > >> > > roughly fall into three categories, changes to input
>>>>>> parameters,
>>>>>> > >> changes
>>>>>> > >> > to
>>>>>> > >> > > output parameters and algorithm changes.
>>>>>> > >> > >
>>>>>> > >> > > For inputs, changing from int to long is essentially a no-op
>>>>>> from the
>>>>>> > >> > > compiler perspective.  From the limited micro-benchmarking
>>>>>> this also
>>>>>> > >> > > doesn't seem to have a performance impact.  So we could keep
>>>>>> two
>>>>>> > >> versions
>>>>>> > >> > > of the methods that only differ on inputs, but it is not
>>>>>> clear what
>>>>>> > >> the
>>>>>> > >> > > value of that would be.
>>>>>> > >> > >
>>>>>> > >> > > For outputs, we can't support methods "long getLength()" and
>>>>>> "int
>>>>>> > >> > > getLength()" in the same class, so we would be forced into
>>>>>> something
>>>>>> > >> like
>>>>>> > >> > > "long getLength(boolean dummy)" which I think is a less
>>>>>> desirable.
>>>>>> > >> > >
>>>>>> > >> > > For algorithm changes, there did not appear to be too many
>>>>>> places
>>>>>> > >> where
>>>>>> > >> > we
>>>>>> > >> > > actually loop over all elements (it is quite possible I
>>>>>> missed
>>>>>> > >> something
>>>>>> > >> > > here), the ones that I did find I was able to mitigate
>>>>>> performance
>>>>>> > >> > > penalties as noted above.  Some of the current
>>>>>> implementation will
>>>>>> > >> get a
>>>>>> > >> > > lot slower for "large arrays", but we can likely fix those
>>>>>> later or in
>>>>>> > >> > this
>>>>>> > >> > > PR with a nested while loop instead of 2 for loops.
>>>>>> > >> > >
>>>>>> > >> > > Thanks,
>>>>>> > >> > > Micah
>>>>>> > >> > >
>>>>>> > >> > >
>>>>>> > >> > > On Saturday, August 10, 2019, Jacques Nadeau <
>>>>>> jacques@apache.org>
>>>>>> > >> wrote:
>>>>>> > >> > >
>>>>>> > >> > >> This is a pretty massive change to the apis. I wonder how
>>>>>> nasty it
>>>>>> > >> would
>>>>>> > >> > >> be to just support both paths. Have you evaluated how
>>>>>> complex that
>>>>>> > >> > would be?
>>>>>> > >> > >>
>>>>>> > >> > >> On Wed, Aug 7, 2019 at 11:08 PM Micah Kornfield <
>>>>>> > >> emkornfield@gmail.com>
>>>>>> > >> > >> wrote:
>>>>>> > >> > >>
>>>>>> > >> > >>> After more investigation, it looks like Float8Benchmarks
>>>>>> at least
>>>>>> > >> on my
>>>>>> > >> > >>> machine are within the range of noise.
>>>>>> > >> > >>>
>>>>>> > >> > >>> For BitVectorHelper I pushed a new commit [1], seems to
>>>>>> bring the
>>>>>> > >> > >>> BitVectorHelper benchmarks back inline (and even with some
>>>>>> > >> improvement
>>>>>> > >> > >>> for
>>>>>> > >> > >>> getNullCountBenchmark).
>>>>>> > >> > >>>
>>>>>> > >> > >>> Benchmark                                        Mode
>>>>>> Cnt   Score
>>>>>> > >> > >>>  Error
>>>>>> > >> > >>>  Units
>>>>>> > >> > >>> BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt
>>>>>> 5   3.821 ±
>>>>>> > >> > >>> 0.031
>>>>>> > >> > >>>  ns/op
>>>>>> > >> > >>> BitVectorHelperBenchmarks.getNullCountBenchmark  avgt
>>>>>> 5  14.884 ±
>>>>>> > >> > >>> 0.141
>>>>>> > >> > >>>  ns/op
>>>>>> > >> > >>>
>>>>>> > >> > >>> I applied the same pattern to other loops that I could
>>>>>> find, and for
>>>>>> > >> > any
>>>>>> > >> > >>> "for (long" loop on the critical path, I broke it up into
>>>>>> two loops.
>>>>>> > >> > the
>>>>>> > >> > >>> first loop does iteration by integer, the second finishes
>>>>>> off for
>>>>>> > >> any
>>>>>> > >> > >>> long
>>>>>> > >> > >>> values.  As a side note it seems like optimization for
>>>>>> loops using
>>>>>> > >> long
>>>>>> > >> > >>> counters at least have a semi-recent open bug for the JVM
>>>>>> [2]
>>>>>> > >> > >>>
>>>>>> > >> > >>> Thanks,
>>>>>> > >> > >>> Micah
>>>>>> > >> > >>>
>>>>>> > >> > >>> [1]
>>>>>> > >> > >>>
>>>>>> > >> > >>>
>>>>>> > >> >
>>>>>> > >>
>>>>>> https://github.com/apache/arrow/pull/5020/commits/2ea2c1ae83e3baa7b9a99a6d06276d968df41797
>>>>>> > >> > >>> [2] https://bugs.openjdk.java.net/browse/JDK-8223051
>>>>>> > >> > >>>
>>>>>> > >> > >>> On Wed, Aug 7, 2019 at 8:11 PM Micah Kornfield <
>>>>>> > >> emkornfield@gmail.com>
>>>>>> > >> > >>> wrote:
>>>>>> > >> > >>>
>>>>>> > >> > >>> > Indeed, the BoundChecking and CheckNullForGet variables
>>>>>> can make a
>>>>>> > >> > big
>>>>>> > >> > >>> > difference.  I didn't initially run the benchmarks with
>>>>>> these
>>>>>> > >> turned
>>>>>> > >> > on
>>>>>> > >> > >>> > (you can see the result from above with
>>>>>> Float8Benchmarks).  Here
>>>>>> > >> are
>>>>>> > >> > >>> new
>>>>>> > >> > >>> > numbers including with the flags enabled.  It looks like
>>>>>> using
>>>>>> > >> longs
>>>>>> > >> > >>> might
>>>>>> > >> > >>> > be a little bit slower, I'll see what I can do to
>>>>>> mitigate this.
>>>>>> > >> > >>> >
>>>>>> > >> > >>> > Ravindra also volunteered to try to benchmark the
>>>>>> changes with
>>>>>> > >> > Dremio's
>>>>>> > >> > >>> > code on today's sync call.
>>>>>> > >> > >>> >
>>>>>> > >> > >>> > New
>>>>>> > >> > >>> >
>>>>>> > >> > >>> > Benchmark                                        Mode
>>>>>> Cnt   Score
>>>>>> > >> > >>>  Error
>>>>>> > >> > >>> > Units
>>>>>> > >> > >>> >
>>>>>> > >> > >>> > BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt
>>>>>> 5
>>>>>> > >>  4.176 ±
>>>>>> > >> > >>> 1.292
>>>>>> > >> > >>> > ns/op
>>>>>> > >> > >>> >
>>>>>> > >> > >>> > BitVectorHelperBenchmarks.getNullCountBenchmark  avgt
>>>>>> 5
>>>>>> > >> 26.102 ±
>>>>>> > >> > >>> 0.700
>>>>>> > >> > >>> > ns/op
>>>>>> > >> > >>> >
>>>>>> > >> > >>> > Float8Benchmarks.copyFromBenchmark   avgt    5  7.398 ±
>>>>>> 0.084
>>>>>> > >> us/op
>>>>>> > >> > >>> >
>>>>>> > >> > >>> > Float8Benchmarks.readWriteBenchmark  avgt    5  2.711 ±
>>>>>> 0.057
>>>>>> > >> us/op
>>>>>> > >> > >>> >
>>>>>> > >> > >>> >
>>>>>> > >> > >>> >
>>>>>> > >> > >>> > old
>>>>>> > >> > >>> >
>>>>>> > >> > >>> > BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt
>>>>>> 5
>>>>>> > >>  3.828 ±
>>>>>> > >> > >>> 0.030
>>>>>> > >> > >>> > ns/op
>>>>>> > >> > >>> >
>>>>>> > >> > >>> > BitVectorHelperBenchmarks.getNullCountBenchmark  avgt
>>>>>> 5
>>>>>> > >> 20.611 ±
>>>>>> > >> > >>> 0.188
>>>>>> > >> > >>> > ns/op
>>>>>> > >> > >>> >
>>>>>> > >> > >>> > Float8Benchmarks.copyFromBenchmark   avgt    5  6.597 ±
>>>>>> 0.462
>>>>>> > >> us/op
>>>>>> > >> > >>> >
>>>>>> > >> > >>> > Float8Benchmarks.readWriteBenchmark  avgt    5  2.615 ±
>>>>>> 0.027
>>>>>> > >> us/op
>>>>>> > >> > >>> >
>>>>>> > >> > >>> > On Wed, Aug 7, 2019 at 7:13 PM Fan Liya <
>>>>>> liya.fan03@gmail.com>
>>>>>> > >> > wrote:
>>>>>> > >> > >>> >
>>>>>> > >> > >>> >> Hi Gonzalo,
>>>>>> > >> > >>> >>
>>>>>> > >> > >>> >> Thanks for sharing the performance results.
>>>>>> > >> > >>> >> I am wondering if you have turned off the flag
>>>>>> > >> > >>> >> BoundsChecking#BOUNDS_CHECKING_ENABLED.
>>>>>> > >> > >>> >> If not, the lower throughput should be expected.
>>>>>> > >> > >>> >>
>>>>>> > >> > >>> >> Best,
>>>>>> > >> > >>> >> Liya Fan
>>>>>> > >> > >>> >>
>>>>>> > >> > >>> >> On Wed, Aug 7, 2019 at 10:23 PM Micah Kornfield <
>>>>>> > >> > >>> emkornfield@gmail.com>
>>>>>> > >> > >>> >> wrote:
>>>>>> > >> > >>> >>
>>>>>> > >> > >>> >>> Hi Gonzalo,
>>>>>> > >> > >>> >>> Thank you for the feedback.  I wasn't aware of the JIT
>>>>>> > >> > >>> implications.   At
>>>>>> > >> > >>> >>> least on the benchmark run they don't seem to have an
>>>>>> impact.
>>>>>> > >> > >>> >>>
>>>>>> > >> > >>> >>> If there are other benchmarks that people have that can
>>>>>> > >> validate if
>>>>>> > >> > >>> this
>>>>>> > >> > >>> >>> change will be problematic I would appreciate trying
>>>>>> to run them
>>>>>> > >> > >>> with the
>>>>>> > >> > >>> >>> PR.  I will try to run the ones for zeroing/popcnt
>>>>>> tonight to
>>>>>> > >> see
>>>>>> > >> > if
>>>>>> > >> > >>> >>> there
>>>>>> > >> > >>> >>> is a change in those.
>>>>>> > >> > >>> >>>
>>>>>> > >> > >>> >>> -Micah
>>>>>> > >> > >>> >>>
>>>>>> > >> > >>> >>>
>>>>>> > >> > >>> >>>
>>>>>> > >> > >>> >>> On Wednesday, August 7, 2019, Gonzalo Ortiz
>>>>>> Jaureguizar <
>>>>>> > >> > >>> >>> golthiryus@gmail.com> wrote:
>>>>>> > >> > >>> >>>
>>>>>> > >> > >>> >>> > I would recommend to take care with this kind of
>>>>>> changes.
>>>>>> > >> > >>> >>> >
>>>>>> > >> > >>> >>> > I didn't try Arrow in more than one year, but by
>>>>>> then the
>>>>>> > >> > >>> performance
>>>>>> > >> > >>> >>> was
>>>>>> > >> > >>> >>> > quite bad in comparison with plain byte buffer access
>>>>>> > >> > >>> >>> > (see
>>>>>> http://git.net/apache-arrow-development/msg02353.html *)
>>>>>> > >> > and
>>>>>> > >> > >>> >>> > there are several optimizations that the JVM
>>>>>> (specifically,
>>>>>> > >> C2)
>>>>>> > >> > >>> does
>>>>>> > >> > >>> >>> not
>>>>>> > >> > >>> >>> > apply when dealing with int instead of longs. One of
>>>>>> the
>>>>>> > >> > >>> >>> > most commons is the loop unrolling and vectorization.
>>>>>> > >> > >>> >>> >
>>>>>> > >> > >>> >>> > * It doesn't seem the best way to reference an old
>>>>>> email on
>>>>>> > >> the
>>>>>> > >> > >>> list,
>>>>>> > >> > >>> >>> but
>>>>>> > >> > >>> >>> > it is the only result shown by Google
>>>>>> > >> > >>> >>> >
>>>>>> > >> > >>> >>> > El mié., 7 ago. 2019 a las 11:42, Fan Liya (<
>>>>>> > >> > liya.fan03@gmail.com
>>>>>> > >> > >>> >)
>>>>>> > >> > >>> >>> > escribió:
>>>>>> > >> > >>> >>> >
>>>>>> > >> > >>> >>> >> Hi Micah,
>>>>>> > >> > >>> >>> >>
>>>>>> > >> > >>> >>> >> Thanks for your effort. The performance result
>>>>>> looks good.
>>>>>> > >> > >>> >>> >>
>>>>>> > >> > >>> >>> >> As you indicated, ArrowBuf will take additional 12
>>>>>> bytes (4
>>>>>> > >> > bytes
>>>>>> > >> > >>> for
>>>>>> > >> > >>> >>> each
>>>>>> > >> > >>> >>> >> of length, write index, and read index).
>>>>>> > >> > >>> >>> >> Similar overheads also exist for vectors like
>>>>>> > >> > >>> BaseFixedWidthVector,
>>>>>> > >> > >>> >>> >> BaseVariableWidthVector, etc.
>>>>>> > >> > >>> >>> >>
>>>>>> > >> > >>> >>> >> IMO, such overheads are small enough to justify the
>>>>>> change.
>>>>>> > >> > >>> >>> >> Let's check if there are other overheads.
>>>>>> > >> > >>> >>> >>
>>>>>> > >> > >>> >>> >> Best,
>>>>>> > >> > >>> >>> >> Liya Fan
>>>>>> > >> > >>> >>> >>
>>>>>> > >> > >>> >>> >> On Wed, Aug 7, 2019 at 3:30 PM Micah Kornfield <
>>>>>> > >> > >>> emkornfield@gmail.com
>>>>>> > >> > >>> >>> >
>>>>>> > >> > >>> >>> >> wrote:
>>>>>> > >> > >>> >>> >>
>>>>>> > >> > >>> >>> >> > Hi Liya Fan,
>>>>>> > >> > >>> >>> >> > Based on the Float8Benchmark there does not seem
>>>>>> to be any
>>>>>> > >> > >>> >>> meaningful
>>>>>> > >> > >>> >>> >> > performance difference on my machine.  At least
>>>>>> for me, the
>>>>>> > >> > >>> >>> benchmarks
>>>>>> > >> > >>> >>> >> are
>>>>>> > >> > >>> >>> >> > not stable enough to say one is faster than the
>>>>>> other (I've
>>>>>> > >> > >>> pasted
>>>>>> > >> > >>> >>> >> results
>>>>>> > >> > >>> >>> >> > below).  That being said my machine isn't
>>>>>> necessarily the
>>>>>> > >> most
>>>>>> > >> > >>> >>> reliable
>>>>>> > >> > >>> >>> >> for
>>>>>> > >> > >>> >>> >> > benchmarking.
>>>>>> > >> > >>> >>> >> >
>>>>>> > >> > >>> >>> >> > On an intuitive level, this makes sense to me,
>>>>>> for the
>>>>>> > >> most
>>>>>> > >> > >>> part it
>>>>>> > >> > >>> >>> >> seems
>>>>>> > >> > >>> >>> >> > like the change just moves casting from "int" to
>>>>>> "long"
>>>>>> > >> > further
>>>>>> > >> > >>> up
>>>>>> > >> > >>> >>> the
>>>>>> > >> > >>> >>> >> > stack  for  "PlatformDepdendent" operations.  If
>>>>>> there are
>>>>>> > >> > other
>>>>>> > >> > >>> >>> >> benchmarks
>>>>>> > >> > >>> >>> >> > that you think are worth running let me know.
>>>>>> > >> > >>> >>> >> >
>>>>>> > >> > >>> >>> >> > One downside performance wise I think for his
>>>>>> change is it
>>>>>> > >> > >>> >>> increases the
>>>>>> > >> > >>> >>> >> > size of ArrowBuf objects, which I suppose could
>>>>>> influence
>>>>>> > >> > cache
>>>>>> > >> > >>> >>> misses
>>>>>> > >> > >>> >>> >> at
>>>>>> > >> > >>> >>> >> > some level or increase the size of call-stacks,
>>>>>> but this
>>>>>> > >> > doesn't
>>>>>> > >> > >>> >>> seem to
>>>>>> > >> > >>> >>> >> > show up in the benchmark..
>>>>>> > >> > >>> >>> >> >
>>>>>> > >> > >>> >>> >> > Thanks,
>>>>>> > >> > >>> >>> >> > Micah
>>>>>> > >> > >>> >>> >> >
>>>>>> > >> > >>> >>> >> > Sample benchmark numbers:
>>>>>> > >> > >>> >>> >> >
>>>>>> > >> > >>> >>> >> > [New Code]
>>>>>> > >> > >>> >>> >> > Benchmark                            Mode  Cnt
>>>>>>  Score
>>>>>> > >>  Error
>>>>>> > >> > >>> >>> Units
>>>>>> > >> > >>> >>> >> > Float8Benchmarks.copyFromBenchmark   avgt    5
>>>>>> 15.441 ±
>>>>>> > >> 0.469
>>>>>> > >> > >>> >>> us/op
>>>>>> > >> > >>> >>> >> > Float8Benchmarks.readWriteBenchmark  avgt    5
>>>>>> 14.057 ±
>>>>>> > >> 0.115
>>>>>> > >> > >>> >>> us/op
>>>>>> > >> > >>> >>> >> >
>>>>>> > >> > >>> >>> >> >
>>>>>> > >> > >>> >>> >> > [Old code]
>>>>>> > >> > >>> >>> >> > Benchmark                            Mode  Cnt
>>>>>>  Score
>>>>>> > >>  Error
>>>>>> > >> > >>> >>> Units
>>>>>> > >> > >>> >>> >> > Float8Benchmarks.copyFromBenchmark   avgt    5
>>>>>> 16.248 ±
>>>>>> > >> 1.409
>>>>>> > >> > >>> >>> us/op
>>>>>> > >> > >>> >>> >> > Float8Benchmarks.readWriteBenchmark  avgt    5
>>>>>> 14.150 ±
>>>>>> > >> 0.084
>>>>>> > >> > >>> >>> us/op
>>>>>> > >> > >>> >>> >> >
>>>>>> > >> > >>> >>> >> > On Tue, Aug 6, 2019 at 1:18 AM Fan Liya <
>>>>>> > >> liya.fan03@gmail.com
>>>>>> > >> > >
>>>>>> > >> > >>> >>> wrote:
>>>>>> > >> > >>> >>> >> >
>>>>>> > >> > >>> >>> >> >> Hi Micah,
>>>>>> > >> > >>> >>> >> >>
>>>>>> > >> > >>> >>> >> >> Thanks a lot for doing this.
>>>>>> > >> > >>> >>> >> >>
>>>>>> > >> > >>> >>> >> >> I am a little concerned about if there is any
>>>>>> negative
>>>>>> > >> > >>> performance
>>>>>> > >> > >>> >>> >> impact
>>>>>> > >> > >>> >>> >> >> on the current 32-bit-length based applications.
>>>>>> > >> > >>> >>> >> >> Can we do some performance comparison on our
>>>>>> existing
>>>>>> > >> > >>> benchmarks?
>>>>>> > >> > >>> >>> >> >>
>>>>>> > >> > >>> >>> >> >> Best,
>>>>>> > >> > >>> >>> >> >> Liya Fan
>>>>>> > >> > >>> >>> >> >>
>>>>>> > >> > >>> >>> >> >>
>>>>>> > >> > >>> >>> >> >> On Tue, Aug 6, 2019 at 3:35 PM Micah Kornfield <
>>>>>> > >> > >>> >>> emkornfield@gmail.com>
>>>>>> > >> > >>> >>> >> >> wrote:
>>>>>> > >> > >>> >>> >> >>
>>>>>> > >> > >>> >>> >> >>> There have been some previous discussions on
>>>>>> the mailing
>>>>>> > >> > about
>>>>>> > >> > >>> >>> >> supporting
>>>>>> > >> > >>> >>> >> >>> 64-bit lengths for  Java ValueVectors (this is
>>>>>> what the
>>>>>> > >> IPC
>>>>>> > >> > >>> >>> >> specification
>>>>>> > >> > >>> >>> >> >>> and C++ support).  I created a PR [1] that
>>>>>> changes all
>>>>>> > >> APIs
>>>>>> > >> > >>> that I
>>>>>> > >> > >>> >>> >> could
>>>>>> > >> > >>> >>> >> >>> find that take an index to take an "long"
>>>>>> instead of an
>>>>>> > >> > "int"
>>>>>> > >> > >>> (and
>>>>>> > >> > >>> >>> >> >>> similarly change "size/rowcount" APIs).
>>>>>> > >> > >>> >>> >> >>>
>>>>>> > >> > >>> >>> >> >>> It is a big change, so I think it is worth
>>>>>> discussing if
>>>>>> > >> it
>>>>>> > >> > is
>>>>>> > >> > >>> >>> >> something
>>>>>> > >> > >>> >>> >> >>> we
>>>>>> > >> > >>> >>> >> >>> still want to move forward with.  It would be
>>>>>> nice to
>>>>>> > >> come
>>>>>> > >> > to
>>>>>> > >> > >>> a
>>>>>> > >> > >>> >>> >> >>> conclusion
>>>>>> > >> > >>> >>> >> >>> quickly, ideally in the next few days, to avoid
>>>>>> a lot of
>>>>>> > >> > merge
>>>>>> > >> > >>> >>> >> conflicts.
>>>>>> > >> > >>> >>> >> >>>
>>>>>> > >> > >>> >>> >> >>> The reason I did this work now is the C++
>>>>>> implementation
>>>>>> > >> has
>>>>>> > >> > >>> added
>>>>>> > >> > >>> >>> >> >>> support
>>>>>> > >> > >>> >>> >> >>> for LargeList, LargeBinary and LargeString
>>>>>> arrays and
>>>>>> > >> based
>>>>>> > >> > on
>>>>>> > >> > >>> >>> prior
>>>>>> > >> > >>> >>> >> >>> discussions we need to have similar support in
>>>>>> Java
>>>>>> > >> before
>>>>>> > >> > our
>>>>>> > >> > >>> >>> next
>>>>>> > >> > >>> >>> >> >>> release. Support 64-bit indexes means we can
>>>>>> have full
>>>>>> > >> > >>> >>> compatibility
>>>>>> > >> > >>> >>> >> and
>>>>>> > >> > >>> >>> >> >>> make the most use of the types in Java.
>>>>>> > >> > >>> >>> >> >>>
>>>>>> > >> > >>> >>> >> >>> Look forward to hearing feedback.
>>>>>> > >> > >>> >>> >> >>>
>>>>>> > >> > >>> >>> >> >>> Thanks,
>>>>>> > >> > >>> >>> >> >>> Micah
>>>>>> > >> > >>> >>> >> >>>
>>>>>> > >> > >>> >>> >> >>> [1] https://github.com/apache/arrow/pull/5020
>>>>>> > >> > >>> >>> >> >>>
>>>>>> > >> > >>> >>> >> >>
>>>>>> > >> > >>> >>> >>
>>>>>> > >> > >>> >>> >
>>>>>> > >> > >>> >>>
>>>>>> > >> > >>> >>
>>>>>> > >> > >>>
>>>>>> > >> > >>
>>>>>> > >> >
>>>>>> > >>
>>>>>> > >
>>>>>>
>>>>>

Re: [Discuss][Java] 64-bit lengths for ValueVectors

Posted by Micah Kornfield <em...@gmail.com>.

>
> The vector indexes being limited to 32 bits doesn't limit the addressing
> to 32 bit chunks of memory. For example, you're prime example before was
> image data. Having 2 billion images of 1mb images would still be supported
> without changing the index addressing.

This might be pre-coffee math, but I think we are limited to approximately
2000 images because an ArrowBuf only holds up-to 2 billion bytes [1].
While we have plenty of room for the offsets, we quickly run out of
contiguous data storage space. For LargeString and LargeBinary this could
be fixed by changing ArrowBuf.

LargeArray faces the same problem only it applies to its child vectors.
Supporting LargeArray properly is really what drove the large-scale
interface change.

Rebase would help if possible.

I'll try to get to this in the next few days.

[1]
https://github.com/apache/arrow/blob/95175fe7cb8439eebe6d2f6e0495f551d6864380/java/memory/src/main/java/io/netty/buffer/ArrowBuf.java#L164

On Fri, Aug 23, 2019 at 4:55 AM Jacques Nadeau <ja...@apache.org> wrote:

>
>
> On Fri, Aug 23, 2019, 11:49 AM Micah Kornfield <em...@gmail.com>
> wrote:
>
>> I don't think we should couple this discussion with the implementation of
>>> large list, etc since I think those two concepts are independent.
>>
>> I'm still trying to balance in my mind which is a worse experience for
>> consumers of the libraries for these types.  Claiming that Java supports
>> these types but throwing an exception when the Vectors exceed 32-bits or
>> just say they aren't supported until we have 64-bit support in Java.
>>
>
> The vector indexes being limited to 32 bits doesn't limit the addressing
> to 32 bit chunks of memory. For example, you're prime example before was
> image data. Having 2 billion images of 1mb images would still be supported
> without changing the index addressing.
>
>
>
>>
>>> I've asked some others on my team their opinions on the risk here. I
>>> think we should probably review some our more complex vector interactions
>>> and see how the jvm's assembly changes with this kind of change. Using
>>> microbenchmarking is good but I think we also need to see whether we're
>>> constantly inserting additional instructions or if in most cases, this
>>> actually doesn't impact instruction count.
>>
>>
>> Is this something that your team will take on?
>>
>
>
> Yeah, we need to look at this I think.
>
> Do you need a rebased version of the PR or is the existing one sufficient?
>>
>
> Rebase would help if possible.
>
>
>
>> Thanks,
>> Micah
>>
>> On Thu, Aug 22, 2019 at 8:56 PM Jacques Nadeau <ja...@apache.org>
>> wrote:
>>
>>> I don't think we should couple this discussion with the implementation
>>> of large list, etc since I think those two concepts are independent.
>>>
>>> I've asked some others on my team their opinions on the risk here. I
>>> think we should probably review some our more complex vector interactions
>>> and see how the jvm's assembly changes with this kind of change. Using
>>> microbenchmarking is good but I think we also need to see whether we're
>>> constantly inserting additional instructions or if in most cases, this
>>> actually doesn't impact instruction count.
>>>
>>>
>>>
>>> On Wed, Aug 21, 2019 at 12:18 PM Micah Kornfield <em...@gmail.com>
>>> wrote:
>>>
>>>>
>>>>> With regards to the reference implementation point. It is a good
>>>>> point. I'm on vacation this week. Unless you're pushing hard on this, can
>>>>> we pick this up and discuss more next week?
>>>>
>>>>
>>>> Hi Jacques, I hope you had a good rest.  Any more thoughts on the
>>>> reference implementation aspect of this?
>>>>
>>>>
>>>>> To copy the sentiments from the 0.15.0 release thread, I think it
>>>>> would be best to decouple this discussion from the release timeline
>>>>> given how many people we have relying on regular releases coming out.
>>>>> We can keep continue making major 0.x releases until we're ready to
>>>>> release 1.0.0.
>>>>
>>>>
>>>> I'm OK with it as long as other stakeholders are. Timed releases are
>>>> the way to go.  As stated on the release thread [1] we need a better
>>>> mechanism to avoid this type of issue arising again.  The release thread
>>>> also had some more discussion on compatibility.
>>>>
>>>> Thanks,
>>>> Micah
>>>>
>>>> [1]
>>>> https://lists.apache.org/thread.html/d70feeceaf2570906ade117030b29887af7c77ca5c4a976e6d555920@%3Cdev.arrow.apache.org%3E
>>>>
>>>>
>>>> On Wed, Aug 14, 2019 at 3:23 PM Wes McKinney <we...@gmail.com>
>>>> wrote:
>>>>
>>>>> On Sun, Aug 11, 2019 at 9:40 PM Micah Kornfield <em...@gmail.com>
>>>>> wrote:
>>>>> >
>>>>> > Hi Wes and Jacques,
>>>>> > See responses below.
>>>>> >
>>>>> > With regards to the reference implementation point. It is a good
>>>>> point. I'm
>>>>> > > on vacation this week. Unless you're pushing hard on this, can we
>>>>> pick this
>>>>> > > up and discuss more next week?
>>>>> >
>>>>> >
>>>>> > Sure thing, enjoy your vacation.  I think the only practical
>>>>> implications
>>>>> > are it delays choices around implementing LargeList, LargeBinary,
>>>>> > LargeString in Java, which in turn might push out the 0.15.0 release.
>>>>> >
>>>>>
>>>>> To copy the sentiments from the 0.15.0 release thread, I think it
>>>>> would be best to decouple this discussion from the release timeline
>>>>> given how many people we have relying on regular releases coming out.
>>>>> We can keep continue making major 0.x releases until we're ready to
>>>>> release 1.0.0.
>>>>>
>>>>> > My stance on this is that I don't know how important it is for Java
>>>>> to
>>>>> > > support vectors over INT32_MAX elements. The use cases enabled by
>>>>> > > having very large arrays seem to be concentrated in the native code
>>>>> > > world (e.g. C/C++/Rust) -- that could just be
>>>>> implementation-centrism
>>>>> > > on my part, though.
>>>>> >
>>>>> >
>>>>> > A data point against this view is Spark has done work to eliminate
>>>>> 2GB
>>>>> > memory limits on its block sizes [1].  I don't claim to understand
>>>>> the
>>>>> > implications of this. Bryan might you have any thoughts here?  I'm
>>>>> OK with
>>>>> > INT32_MAX, as well, I think we should think about what this means for
>>>>> > adding Large types to Java and implications for reference
>>>>> implementations.
>>>>> >
>>>>> > Thanks,
>>>>> > Micah
>>>>> >
>>>>> > [1] https://issues.apache.org/jira/browse/SPARK-6235
>>>>> >
>>>>> > On Sun, Aug 11, 2019 at 6:31 PM Jacques Nadeau <ja...@apache.org>
>>>>> wrote:
>>>>> >
>>>>> > > Hey Micah,
>>>>> > >
>>>>> > > Appreciate the offer on the compiling. The reality is I'm more
>>>>> concerned
>>>>> > > about the unknowns than the compiling issue itself. Any time
>>>>> you've been
>>>>> > > tuning for a while, changing something like this could be totally
>>>>> fine or
>>>>> > > cause a couple of major issues. For example, we've done a very
>>>>> large amount
>>>>> > > of work reducing heap memory footprint of the vectors. Are target
>>>>> is to
>>>>> > > actually get it down to 24 bytes per ArrowBuf and 24 bytes heap
>>>>> per vector
>>>>> > > (not including arrow bufs).
>>>>> > >
>>>>> > > With regards to the reference implementation point. It is a good
>>>>> point.
>>>>> > > I'm on vacation this week. Unless you're pushing hard on this, can
>>>>> we pick
>>>>> > > this up and discuss more next week?
>>>>> > >
>>>>> > > thanks,
>>>>> > > Jacques
>>>>> > >
>>>>> > > On Sat, Aug 10, 2019 at 7:39 PM Micah Kornfield <
>>>>> emkornfield@gmail.com>
>>>>> > > wrote:
>>>>> > >
>>>>> > >> Hi Jacques,
>>>>> > >> I definitely understand these concerns and this change is risky
>>>>> because it
>>>>> > >> is so large.  Perhaps, creating a new hierarchy, might be the
>>>>> cleanest way
>>>>> > >> of dealing with this.  This could have other benefits like
>>>>> cleaning up
>>>>> > >> some
>>>>> > >> cruft around dictionary encode and "orphaned" method.   Per past
>>>>> e-mail
>>>>> > >> threads I agree it is beneficial to have 2 separate reference
>>>>> > >> implementations that can communicate fully, and my intent here
>>>>> was to
>>>>> > >> close
>>>>> > >> that gap.
>>>>> > >>
>>>>> > >> Trying to
>>>>> > >> > determine the ramifications of these changes would be
>>>>> challenging and
>>>>> > >> time
>>>>> > >> > consuming against all the different ways we interact with the
>>>>> Arrow Java
>>>>> > >> > library.
>>>>> > >>
>>>>> > >>
>>>>> > >> Understood.  I took a quick look at Dremio-OSS it seems like it
>>>>> has a
>>>>> > >> simple java build system?  If it is helpful, I can try to get a
>>>>> fork
>>>>> > >> running that at least compiles against this PR.  My plan would be
>>>>> to cast
>>>>> > >> any place that was changed to return a long back to an int, so in
>>>>> essence
>>>>> > >> the Dremio algorithms would reman 32-bit implementations.
>>>>> > >>
>>>>> > >> I don't  have the infrastructure to test this change properly
>>>>> from a
>>>>> > >> distributed systems perspective, so it would still take some time
>>>>> from
>>>>> > >> Dremio to validate for regressions.
>>>>> > >>
>>>>> > >> I'm not saying I'm against this but want to make sure we've
>>>>> > >> > explored all less disruptive options before considering changing
>>>>> > >> something
>>>>> > >> > this fundamental (especially when I generally hold the view
>>>>> that large
>>>>> > >> cell
>>>>> > >> > counts against massive contiguous memory is an anti pattern to
>>>>> scalable
>>>>> > >> > analytical processing--purely subjective of course).
>>>>> > >>
>>>>> > >>
>>>>> > >> I'm open to other ideas here, as well. I don't think it is out of
>>>>> the
>>>>> > >> question to leave the Java implementation as 32-bit, but if we
>>>>> do, then I
>>>>> > >> think we should consider a different strategy for reference
>>>>> > >> implementations.
>>>>> > >>
>>>>> > >> Thanks,
>>>>> > >> Micah
>>>>> > >>
>>>>> > >> On Sat, Aug 10, 2019 at 5:09 PM Jacques Nadeau <
>>>>> jacques@apache.org>
>>>>> > >> wrote:
>>>>> > >>
>>>>> > >> > Hey Micah, I didn't have a particular path in mind. Was
>>>>> thinking more
>>>>> > >> along
>>>>> > >> > the lines of extra methods as opposed to separate classes.
>>>>> > >> >
>>>>> > >> > Arrow hasn't historically been a place where we're writing
>>>>> algorithms in
>>>>> > >> > Java so the fact that they aren't there doesn't mean they don't
>>>>> exist.
>>>>> > >> We
>>>>> > >> > have a large amount of code that depends on the current
>>>>> behavior that is
>>>>> > >> > deployed in hundreds of customer clusters (you can peruse our
>>>>> dremio
>>>>> > >> repo
>>>>> > >> > to see how extensively we leverage Arrow if interested). Trying
>>>>> to
>>>>> > >> > determine the ramifications of these changes would be
>>>>> challenging and
>>>>> > >> time
>>>>> > >> > consuming against all the different ways we interact with the
>>>>> Arrow Java
>>>>> > >> > library. I'm not saying I'm against this but want to make sure
>>>>> we've
>>>>> > >> > explored all less disruptive options before considering changing
>>>>> > >> something
>>>>> > >> > this fundamental (especially when I generally hold the view
>>>>> that large
>>>>> > >> cell
>>>>> > >> > counts against massive contiguous memory is an anti pattern to
>>>>> scalable
>>>>> > >> > analytical processing--purely subjective of course).
>>>>> > >> >
>>>>> > >> > On Sat, Aug 10, 2019, 4:17 PM Micah Kornfield <
>>>>> emkornfield@gmail.com>
>>>>> > >> > wrote:
>>>>> > >> >
>>>>> > >> > > Hi Jacques,
>>>>> > >> > > What avenue were you thinking for supporting both paths?   I
>>>>> didn't
>>>>> > >> want
>>>>> > >> > > to pursue a different class hierarchy, because I felt like
>>>>> that would
>>>>> > >> > > effectively fork the code base, but that is potentially an
>>>>> option that
>>>>> > >> > > would allow us to have a complete reference implementation in
>>>>> Java
>>>>> > >> that
>>>>> > >> > can
>>>>> > >> > > fully interact with C++, without major changes to this code.
>>>>> > >> > >
>>>>> > >> > > For supporting both APIs on the same classes/interfaces, I
>>>>> think they
>>>>> > >> > > roughly fall into three categories, changes to input
>>>>> parameters,
>>>>> > >> changes
>>>>> > >> > to
>>>>> > >> > > output parameters and algorithm changes.
>>>>> > >> > >
>>>>> > >> > > For inputs, changing from int to long is essentially a no-op
>>>>> from the
>>>>> > >> > > compiler perspective.  From the limited micro-benchmarking
>>>>> this also
>>>>> > >> > > doesn't seem to have a performance impact.  So we could keep
>>>>> two
>>>>> > >> versions
>>>>> > >> > > of the methods that only differ on inputs, but it is not
>>>>> clear what
>>>>> > >> the
>>>>> > >> > > value of that would be.
>>>>> > >> > >
>>>>> > >> > > For outputs, we can't support methods "long getLength()" and
>>>>> "int
>>>>> > >> > > getLength()" in the same class, so we would be forced into
>>>>> something
>>>>> > >> like
>>>>> > >> > > "long getLength(boolean dummy)" which I think is a less
>>>>> desirable.
>>>>> > >> > >
>>>>> > >> > > For algorithm changes, there did not appear to be too many
>>>>> places
>>>>> > >> where
>>>>> > >> > we
>>>>> > >> > > actually loop over all elements (it is quite possible I missed
>>>>> > >> something
>>>>> > >> > > here), the ones that I did find I was able to mitigate
>>>>> performance
>>>>> > >> > > penalties as noted above.  Some of the current implementation
>>>>> will
>>>>> > >> get a
>>>>> > >> > > lot slower for "large arrays", but we can likely fix those
>>>>> later or in
>>>>> > >> > this
>>>>> > >> > > PR with a nested while loop instead of 2 for loops.
>>>>> > >> > >
>>>>> > >> > > Thanks,
>>>>> > >> > > Micah
>>>>> > >> > >
>>>>> > >> > >
>>>>> > >> > > On Saturday, August 10, 2019, Jacques Nadeau <
>>>>> jacques@apache.org>
>>>>> > >> wrote:
>>>>> > >> > >
>>>>> > >> > >> This is a pretty massive change to the apis. I wonder how
>>>>> nasty it
>>>>> > >> would
>>>>> > >> > >> be to just support both paths. Have you evaluated how
>>>>> complex that
>>>>> > >> > would be?
>>>>> > >> > >>
>>>>> > >> > >> On Wed, Aug 7, 2019 at 11:08 PM Micah Kornfield <
>>>>> > >> emkornfield@gmail.com>
>>>>> > >> > >> wrote:
>>>>> > >> > >>
>>>>> > >> > >>> After more investigation, it looks like Float8Benchmarks at
>>>>> least
>>>>> > >> on my
>>>>> > >> > >>> machine are within the range of noise.
>>>>> > >> > >>>
>>>>> > >> > >>> For BitVectorHelper I pushed a new commit [1], seems to
>>>>> bring the
>>>>> > >> > >>> BitVectorHelper benchmarks back inline (and even with some
>>>>> > >> improvement
>>>>> > >> > >>> for
>>>>> > >> > >>> getNullCountBenchmark).
>>>>> > >> > >>>
>>>>> > >> > >>> Benchmark                                        Mode  Cnt
>>>>>  Score
>>>>> > >> > >>>  Error
>>>>> > >> > >>>  Units
>>>>> > >> > >>> BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5
>>>>>  3.821 ±
>>>>> > >> > >>> 0.031
>>>>> > >> > >>>  ns/op
>>>>> > >> > >>> BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5
>>>>> 14.884 ±
>>>>> > >> > >>> 0.141
>>>>> > >> > >>>  ns/op
>>>>> > >> > >>>
>>>>> > >> > >>> I applied the same pattern to other loops that I could
>>>>> find, and for
>>>>> > >> > any
>>>>> > >> > >>> "for (long" loop on the critical path, I broke it up into
>>>>> two loops.
>>>>> > >> > the
>>>>> > >> > >>> first loop does iteration by integer, the second finishes
>>>>> off for
>>>>> > >> any
>>>>> > >> > >>> long
>>>>> > >> > >>> values.  As a side note it seems like optimization for
>>>>> loops using
>>>>> > >> long
>>>>> > >> > >>> counters at least have a semi-recent open bug for the JVM
>>>>> [2]
>>>>> > >> > >>>
>>>>> > >> > >>> Thanks,
>>>>> > >> > >>> Micah
>>>>> > >> > >>>
>>>>> > >> > >>> [1]
>>>>> > >> > >>>
>>>>> > >> > >>>
>>>>> > >> >
>>>>> > >>
>>>>> https://github.com/apache/arrow/pull/5020/commits/2ea2c1ae83e3baa7b9a99a6d06276d968df41797
>>>>> > >> > >>> [2] https://bugs.openjdk.java.net/browse/JDK-8223051
>>>>> > >> > >>>
>>>>> > >> > >>> On Wed, Aug 7, 2019 at 8:11 PM Micah Kornfield <
>>>>> > >> emkornfield@gmail.com>
>>>>> > >> > >>> wrote:
>>>>> > >> > >>>
>>>>> > >> > >>> > Indeed, the BoundChecking and CheckNullForGet variables
>>>>> can make a
>>>>> > >> > big
>>>>> > >> > >>> > difference.  I didn't initially run the benchmarks with
>>>>> these
>>>>> > >> turned
>>>>> > >> > on
>>>>> > >> > >>> > (you can see the result from above with
>>>>> Float8Benchmarks).  Here
>>>>> > >> are
>>>>> > >> > >>> new
>>>>> > >> > >>> > numbers including with the flags enabled.  It looks like
>>>>> using
>>>>> > >> longs
>>>>> > >> > >>> might
>>>>> > >> > >>> > be a little bit slower, I'll see what I can do to
>>>>> mitigate this.
>>>>> > >> > >>> >
>>>>> > >> > >>> > Ravindra also volunteered to try to benchmark the changes
>>>>> with
>>>>> > >> > Dremio's
>>>>> > >> > >>> > code on today's sync call.
>>>>> > >> > >>> >
>>>>> > >> > >>> > New
>>>>> > >> > >>> >
>>>>> > >> > >>> > Benchmark                                        Mode
>>>>> Cnt   Score
>>>>> > >> > >>>  Error
>>>>> > >> > >>> > Units
>>>>> > >> > >>> >
>>>>> > >> > >>> > BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5
>>>>> > >>  4.176 ±
>>>>> > >> > >>> 1.292
>>>>> > >> > >>> > ns/op
>>>>> > >> > >>> >
>>>>> > >> > >>> > BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5
>>>>> > >> 26.102 ±
>>>>> > >> > >>> 0.700
>>>>> > >> > >>> > ns/op
>>>>> > >> > >>> >
>>>>> > >> > >>> > Float8Benchmarks.copyFromBenchmark   avgt    5  7.398 ±
>>>>> 0.084
>>>>> > >> us/op
>>>>> > >> > >>> >
>>>>> > >> > >>> > Float8Benchmarks.readWriteBenchmark  avgt    5  2.711 ±
>>>>> 0.057
>>>>> > >> us/op
>>>>> > >> > >>> >
>>>>> > >> > >>> >
>>>>> > >> > >>> >
>>>>> > >> > >>> > old
>>>>> > >> > >>> >
>>>>> > >> > >>> > BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5
>>>>> > >>  3.828 ±
>>>>> > >> > >>> 0.030
>>>>> > >> > >>> > ns/op
>>>>> > >> > >>> >
>>>>> > >> > >>> > BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5
>>>>> > >> 20.611 ±
>>>>> > >> > >>> 0.188
>>>>> > >> > >>> > ns/op
>>>>> > >> > >>> >
>>>>> > >> > >>> > Float8Benchmarks.copyFromBenchmark   avgt    5  6.597 ±
>>>>> 0.462
>>>>> > >> us/op
>>>>> > >> > >>> >
>>>>> > >> > >>> > Float8Benchmarks.readWriteBenchmark  avgt    5  2.615 ±
>>>>> 0.027
>>>>> > >> us/op
>>>>> > >> > >>> >
>>>>> > >> > >>> > On Wed, Aug 7, 2019 at 7:13 PM Fan Liya <
>>>>> liya.fan03@gmail.com>
>>>>> > >> > wrote:
>>>>> > >> > >>> >
>>>>> > >> > >>> >> Hi Gonzalo,
>>>>> > >> > >>> >>
>>>>> > >> > >>> >> Thanks for sharing the performance results.
>>>>> > >> > >>> >> I am wondering if you have turned off the flag
>>>>> > >> > >>> >> BoundsChecking#BOUNDS_CHECKING_ENABLED.
>>>>> > >> > >>> >> If not, the lower throughput should be expected.
>>>>> > >> > >>> >>
>>>>> > >> > >>> >> Best,
>>>>> > >> > >>> >> Liya Fan
>>>>> > >> > >>> >>
>>>>> > >> > >>> >> On Wed, Aug 7, 2019 at 10:23 PM Micah Kornfield <
>>>>> > >> > >>> emkornfield@gmail.com>
>>>>> > >> > >>> >> wrote:
>>>>> > >> > >>> >>
>>>>> > >> > >>> >>> Hi Gonzalo,
>>>>> > >> > >>> >>> Thank you for the feedback.  I wasn't aware of the JIT
>>>>> > >> > >>> implications.   At
>>>>> > >> > >>> >>> least on the benchmark run they don't seem to have an
>>>>> impact.
>>>>> > >> > >>> >>>
>>>>> > >> > >>> >>> If there are other benchmarks that people have that can
>>>>> > >> validate if
>>>>> > >> > >>> this
>>>>> > >> > >>> >>> change will be problematic I would appreciate trying to
>>>>> run them
>>>>> > >> > >>> with the
>>>>> > >> > >>> >>> PR.  I will try to run the ones for zeroing/popcnt
>>>>> tonight to
>>>>> > >> see
>>>>> > >> > if
>>>>> > >> > >>> >>> there
>>>>> > >> > >>> >>> is a change in those.
>>>>> > >> > >>> >>>
>>>>> > >> > >>> >>> -Micah
>>>>> > >> > >>> >>>
>>>>> > >> > >>> >>>
>>>>> > >> > >>> >>>
>>>>> > >> > >>> >>> On Wednesday, August 7, 2019, Gonzalo Ortiz Jaureguizar
>>>>> <
>>>>> > >> > >>> >>> golthiryus@gmail.com> wrote:
>>>>> > >> > >>> >>>
>>>>> > >> > >>> >>> > I would recommend to take care with this kind of
>>>>> changes.
>>>>> > >> > >>> >>> >
>>>>> > >> > >>> >>> > I didn't try Arrow in more than one year, but by then
>>>>> the
>>>>> > >> > >>> performance
>>>>> > >> > >>> >>> was
>>>>> > >> > >>> >>> > quite bad in comparison with plain byte buffer access
>>>>> > >> > >>> >>> > (see
>>>>> http://git.net/apache-arrow-development/msg02353.html *)
>>>>> > >> > and
>>>>> > >> > >>> >>> > there are several optimizations that the JVM
>>>>> (specifically,
>>>>> > >> C2)
>>>>> > >> > >>> does
>>>>> > >> > >>> >>> not
>>>>> > >> > >>> >>> > apply when dealing with int instead of longs. One of
>>>>> the
>>>>> > >> > >>> >>> > most commons is the loop unrolling and vectorization.
>>>>> > >> > >>> >>> >
>>>>> > >> > >>> >>> > * It doesn't seem the best way to reference an old
>>>>> email on
>>>>> > >> the
>>>>> > >> > >>> list,
>>>>> > >> > >>> >>> but
>>>>> > >> > >>> >>> > it is the only result shown by Google
>>>>> > >> > >>> >>> >
>>>>> > >> > >>> >>> > El mié., 7 ago. 2019 a las 11:42, Fan Liya (<
>>>>> > >> > liya.fan03@gmail.com
>>>>> > >> > >>> >)
>>>>> > >> > >>> >>> > escribió:
>>>>> > >> > >>> >>> >
>>>>> > >> > >>> >>> >> Hi Micah,
>>>>> > >> > >>> >>> >>
>>>>> > >> > >>> >>> >> Thanks for your effort. The performance result looks
>>>>> good.
>>>>> > >> > >>> >>> >>
>>>>> > >> > >>> >>> >> As you indicated, ArrowBuf will take additional 12
>>>>> bytes (4
>>>>> > >> > bytes
>>>>> > >> > >>> for
>>>>> > >> > >>> >>> each
>>>>> > >> > >>> >>> >> of length, write index, and read index).
>>>>> > >> > >>> >>> >> Similar overheads also exist for vectors like
>>>>> > >> > >>> BaseFixedWidthVector,
>>>>> > >> > >>> >>> >> BaseVariableWidthVector, etc.
>>>>> > >> > >>> >>> >>
>>>>> > >> > >>> >>> >> IMO, such overheads are small enough to justify the
>>>>> change.
>>>>> > >> > >>> >>> >> Let's check if there are other overheads.
>>>>> > >> > >>> >>> >>
>>>>> > >> > >>> >>> >> Best,
>>>>> > >> > >>> >>> >> Liya Fan
>>>>> > >> > >>> >>> >>
>>>>> > >> > >>> >>> >> On Wed, Aug 7, 2019 at 3:30 PM Micah Kornfield <
>>>>> > >> > >>> emkornfield@gmail.com
>>>>> > >> > >>> >>> >
>>>>> > >> > >>> >>> >> wrote:
>>>>> > >> > >>> >>> >>
>>>>> > >> > >>> >>> >> > Hi Liya Fan,
>>>>> > >> > >>> >>> >> > Based on the Float8Benchmark there does not seem
>>>>> to be any
>>>>> > >> > >>> >>> meaningful
>>>>> > >> > >>> >>> >> > performance difference on my machine.  At least
>>>>> for me, the
>>>>> > >> > >>> >>> benchmarks
>>>>> > >> > >>> >>> >> are
>>>>> > >> > >>> >>> >> > not stable enough to say one is faster than the
>>>>> other (I've
>>>>> > >> > >>> pasted
>>>>> > >> > >>> >>> >> results
>>>>> > >> > >>> >>> >> > below).  That being said my machine isn't
>>>>> necessarily the
>>>>> > >> most
>>>>> > >> > >>> >>> reliable
>>>>> > >> > >>> >>> >> for
>>>>> > >> > >>> >>> >> > benchmarking.
>>>>> > >> > >>> >>> >> >
>>>>> > >> > >>> >>> >> > On an intuitive level, this makes sense to me,
>>>>> for the
>>>>> > >> most
>>>>> > >> > >>> part it
>>>>> > >> > >>> >>> >> seems
>>>>> > >> > >>> >>> >> > like the change just moves casting from "int" to
>>>>> "long"
>>>>> > >> > further
>>>>> > >> > >>> up
>>>>> > >> > >>> >>> the
>>>>> > >> > >>> >>> >> > stack  for  "PlatformDepdendent" operations.  If
>>>>> there are
>>>>> > >> > other
>>>>> > >> > >>> >>> >> benchmarks
>>>>> > >> > >>> >>> >> > that you think are worth running let me know.
>>>>> > >> > >>> >>> >> >
>>>>> > >> > >>> >>> >> > One downside performance wise I think for his
>>>>> change is it
>>>>> > >> > >>> >>> increases the
>>>>> > >> > >>> >>> >> > size of ArrowBuf objects, which I suppose could
>>>>> influence
>>>>> > >> > cache
>>>>> > >> > >>> >>> misses
>>>>> > >> > >>> >>> >> at
>>>>> > >> > >>> >>> >> > some level or increase the size of call-stacks,
>>>>> but this
>>>>> > >> > doesn't
>>>>> > >> > >>> >>> seem to
>>>>> > >> > >>> >>> >> > show up in the benchmark..
>>>>> > >> > >>> >>> >> >
>>>>> > >> > >>> >>> >> > Thanks,
>>>>> > >> > >>> >>> >> > Micah
>>>>> > >> > >>> >>> >> >
>>>>> > >> > >>> >>> >> > Sample benchmark numbers:
>>>>> > >> > >>> >>> >> >
>>>>> > >> > >>> >>> >> > [New Code]
>>>>> > >> > >>> >>> >> > Benchmark                            Mode  Cnt
>>>>>  Score
>>>>> > >>  Error
>>>>> > >> > >>> >>> Units
>>>>> > >> > >>> >>> >> > Float8Benchmarks.copyFromBenchmark   avgt    5
>>>>> 15.441 ±
>>>>> > >> 0.469
>>>>> > >> > >>> >>> us/op
>>>>> > >> > >>> >>> >> > Float8Benchmarks.readWriteBenchmark  avgt    5
>>>>> 14.057 ±
>>>>> > >> 0.115
>>>>> > >> > >>> >>> us/op
>>>>> > >> > >>> >>> >> >
>>>>> > >> > >>> >>> >> >
>>>>> > >> > >>> >>> >> > [Old code]
>>>>> > >> > >>> >>> >> > Benchmark                            Mode  Cnt
>>>>>  Score
>>>>> > >>  Error
>>>>> > >> > >>> >>> Units
>>>>> > >> > >>> >>> >> > Float8Benchmarks.copyFromBenchmark   avgt    5
>>>>> 16.248 ±
>>>>> > >> 1.409
>>>>> > >> > >>> >>> us/op
>>>>> > >> > >>> >>> >> > Float8Benchmarks.readWriteBenchmark  avgt    5
>>>>> 14.150 ±
>>>>> > >> 0.084
>>>>> > >> > >>> >>> us/op
>>>>> > >> > >>> >>> >> >
>>>>> > >> > >>> >>> >> > On Tue, Aug 6, 2019 at 1:18 AM Fan Liya <
>>>>> > >> liya.fan03@gmail.com
>>>>> > >> > >
>>>>> > >> > >>> >>> wrote:
>>>>> > >> > >>> >>> >> >
>>>>> > >> > >>> >>> >> >> Hi Micah,
>>>>> > >> > >>> >>> >> >>
>>>>> > >> > >>> >>> >> >> Thanks a lot for doing this.
>>>>> > >> > >>> >>> >> >>
>>>>> > >> > >>> >>> >> >> I am a little concerned about if there is any
>>>>> negative
>>>>> > >> > >>> performance
>>>>> > >> > >>> >>> >> impact
>>>>> > >> > >>> >>> >> >> on the current 32-bit-length based applications.
>>>>> > >> > >>> >>> >> >> Can we do some performance comparison on our
>>>>> existing
>>>>> > >> > >>> benchmarks?
>>>>> > >> > >>> >>> >> >>
>>>>> > >> > >>> >>> >> >> Best,
>>>>> > >> > >>> >>> >> >> Liya Fan
>>>>> > >> > >>> >>> >> >>
>>>>> > >> > >>> >>> >> >>
>>>>> > >> > >>> >>> >> >> On Tue, Aug 6, 2019 at 3:35 PM Micah Kornfield <
>>>>> > >> > >>> >>> emkornfield@gmail.com>
>>>>> > >> > >>> >>> >> >> wrote:
>>>>> > >> > >>> >>> >> >>
>>>>> > >> > >>> >>> >> >>> There have been some previous discussions on the
>>>>> mailing
>>>>> > >> > about
>>>>> > >> > >>> >>> >> supporting
>>>>> > >> > >>> >>> >> >>> 64-bit lengths for  Java ValueVectors (this is
>>>>> what the
>>>>> > >> IPC
>>>>> > >> > >>> >>> >> specification
>>>>> > >> > >>> >>> >> >>> and C++ support).  I created a PR [1] that
>>>>> changes all
>>>>> > >> APIs
>>>>> > >> > >>> that I
>>>>> > >> > >>> >>> >> could
>>>>> > >> > >>> >>> >> >>> find that take an index to take an "long"
>>>>> instead of an
>>>>> > >> > "int"
>>>>> > >> > >>> (and
>>>>> > >> > >>> >>> >> >>> similarly change "size/rowcount" APIs).
>>>>> > >> > >>> >>> >> >>>
>>>>> > >> > >>> >>> >> >>> It is a big change, so I think it is worth
>>>>> discussing if
>>>>> > >> it
>>>>> > >> > is
>>>>> > >> > >>> >>> >> something
>>>>> > >> > >>> >>> >> >>> we
>>>>> > >> > >>> >>> >> >>> still want to move forward with.  It would be
>>>>> nice to
>>>>> > >> come
>>>>> > >> > to
>>>>> > >> > >>> a
>>>>> > >> > >>> >>> >> >>> conclusion
>>>>> > >> > >>> >>> >> >>> quickly, ideally in the next few days, to avoid
>>>>> a lot of
>>>>> > >> > merge
>>>>> > >> > >>> >>> >> conflicts.
>>>>> > >> > >>> >>> >> >>>
>>>>> > >> > >>> >>> >> >>> The reason I did this work now is the C++
>>>>> implementation
>>>>> > >> has
>>>>> > >> > >>> added
>>>>> > >> > >>> >>> >> >>> support
>>>>> > >> > >>> >>> >> >>> for LargeList, LargeBinary and LargeString
>>>>> arrays and
>>>>> > >> based
>>>>> > >> > on
>>>>> > >> > >>> >>> prior
>>>>> > >> > >>> >>> >> >>> discussions we need to have similar support in
>>>>> Java
>>>>> > >> before
>>>>> > >> > our
>>>>> > >> > >>> >>> next
>>>>> > >> > >>> >>> >> >>> release. Support 64-bit indexes means we can
>>>>> have full
>>>>> > >> > >>> >>> compatibility
>>>>> > >> > >>> >>> >> and
>>>>> > >> > >>> >>> >> >>> make the most use of the types in Java.
>>>>> > >> > >>> >>> >> >>>
>>>>> > >> > >>> >>> >> >>> Look forward to hearing feedback.
>>>>> > >> > >>> >>> >> >>>
>>>>> > >> > >>> >>> >> >>> Thanks,
>>>>> > >> > >>> >>> >> >>> Micah
>>>>> > >> > >>> >>> >> >>>
>>>>> > >> > >>> >>> >> >>> [1] https://github.com/apache/arrow/pull/5020
>>>>> > >> > >>> >>> >> >>>
>>>>> > >> > >>> >>> >> >>
>>>>> > >> > >>> >>> >>
>>>>> > >> > >>> >>> >
>>>>> > >> > >>> >>>
>>>>> > >> > >>> >>
>>>>> > >> > >>>
>>>>> > >> > >>
>>>>> > >> >
>>>>> > >>
>>>>> > >
>>>>>
>>>>

Re: [Discuss][Java] 64-bit lengths for ValueVectors

Posted by Jacques Nadeau <ja...@apache.org>.

On Fri, Aug 23, 2019, 11:49 AM Micah Kornfield <em...@gmail.com>
wrote:

> I don't think we should couple this discussion with the implementation of
>> large list, etc since I think those two concepts are independent.
>
> I'm still trying to balance in my mind which is a worse experience for
> consumers of the libraries for these types.  Claiming that Java supports
> these types but throwing an exception when the Vectors exceed 32-bits or
> just say they aren't supported until we have 64-bit support in Java.
>

The vector indexes being limited to 32 bits doesn't limit the addressing to
32 bit chunks of memory. For example, you're prime example before was image
data. Having 2 billion images of 1mb images would still be supported
without changing the index addressing.



>
>> I've asked some others on my team their opinions on the risk here. I
>> think we should probably review some our more complex vector interactions
>> and see how the jvm's assembly changes with this kind of change. Using
>> microbenchmarking is good but I think we also need to see whether we're
>> constantly inserting additional instructions or if in most cases, this
>> actually doesn't impact instruction count.
>
>
> Is this something that your team will take on?
>


Yeah, we need to look at this I think.

Do you need a rebased version of the PR or is the existing one sufficient?
>

Rebase would help if possible.



> Thanks,
> Micah
>
> On Thu, Aug 22, 2019 at 8:56 PM Jacques Nadeau <ja...@apache.org> wrote:
>
>> I don't think we should couple this discussion with the implementation of
>> large list, etc since I think those two concepts are independent.
>>
>> I've asked some others on my team their opinions on the risk here. I
>> think we should probably review some our more complex vector interactions
>> and see how the jvm's assembly changes with this kind of change. Using
>> microbenchmarking is good but I think we also need to see whether we're
>> constantly inserting additional instructions or if in most cases, this
>> actually doesn't impact instruction count.
>>
>>
>>
>> On Wed, Aug 21, 2019 at 12:18 PM Micah Kornfield <em...@gmail.com>
>> wrote:
>>
>>>
>>>> With regards to the reference implementation point. It is a good point.
>>>> I'm on vacation this week. Unless you're pushing hard on this, can we pick
>>>> this up and discuss more next week?
>>>
>>>
>>> Hi Jacques, I hope you had a good rest.  Any more thoughts on the
>>> reference implementation aspect of this?
>>>
>>>
>>>> To copy the sentiments from the 0.15.0 release thread, I think it
>>>> would be best to decouple this discussion from the release timeline
>>>> given how many people we have relying on regular releases coming out.
>>>> We can keep continue making major 0.x releases until we're ready to
>>>> release 1.0.0.
>>>
>>>
>>> I'm OK with it as long as other stakeholders are. Timed releases are the
>>> way to go.  As stated on the release thread [1] we need a better mechanism
>>> to avoid this type of issue arising again.  The release thread also had
>>> some more discussion on compatibility.
>>>
>>> Thanks,
>>> Micah
>>>
>>> [1]
>>> https://lists.apache.org/thread.html/d70feeceaf2570906ade117030b29887af7c77ca5c4a976e6d555920@%3Cdev.arrow.apache.org%3E
>>>
>>>
>>> On Wed, Aug 14, 2019 at 3:23 PM Wes McKinney <we...@gmail.com>
>>> wrote:
>>>
>>>> On Sun, Aug 11, 2019 at 9:40 PM Micah Kornfield <em...@gmail.com>
>>>> wrote:
>>>> >
>>>> > Hi Wes and Jacques,
>>>> > See responses below.
>>>> >
>>>> > With regards to the reference implementation point. It is a good
>>>> point. I'm
>>>> > > on vacation this week. Unless you're pushing hard on this, can we
>>>> pick this
>>>> > > up and discuss more next week?
>>>> >
>>>> >
>>>> > Sure thing, enjoy your vacation.  I think the only practical
>>>> implications
>>>> > are it delays choices around implementing LargeList, LargeBinary,
>>>> > LargeString in Java, which in turn might push out the 0.15.0 release.
>>>> >
>>>>
>>>> To copy the sentiments from the 0.15.0 release thread, I think it
>>>> would be best to decouple this discussion from the release timeline
>>>> given how many people we have relying on regular releases coming out.
>>>> We can keep continue making major 0.x releases until we're ready to
>>>> release 1.0.0.
>>>>
>>>> > My stance on this is that I don't know how important it is for Java to
>>>> > > support vectors over INT32_MAX elements. The use cases enabled by
>>>> > > having very large arrays seem to be concentrated in the native code
>>>> > > world (e.g. C/C++/Rust) -- that could just be
>>>> implementation-centrism
>>>> > > on my part, though.
>>>> >
>>>> >
>>>> > A data point against this view is Spark has done work to eliminate 2GB
>>>> > memory limits on its block sizes [1].  I don't claim to understand the
>>>> > implications of this. Bryan might you have any thoughts here?  I'm OK
>>>> with
>>>> > INT32_MAX, as well, I think we should think about what this means for
>>>> > adding Large types to Java and implications for reference
>>>> implementations.
>>>> >
>>>> > Thanks,
>>>> > Micah
>>>> >
>>>> > [1] https://issues.apache.org/jira/browse/SPARK-6235
>>>> >
>>>> > On Sun, Aug 11, 2019 at 6:31 PM Jacques Nadeau <ja...@apache.org>
>>>> wrote:
>>>> >
>>>> > > Hey Micah,
>>>> > >
>>>> > > Appreciate the offer on the compiling. The reality is I'm more
>>>> concerned
>>>> > > about the unknowns than the compiling issue itself. Any time you've
>>>> been
>>>> > > tuning for a while, changing something like this could be totally
>>>> fine or
>>>> > > cause a couple of major issues. For example, we've done a very
>>>> large amount
>>>> > > of work reducing heap memory footprint of the vectors. Are target
>>>> is to
>>>> > > actually get it down to 24 bytes per ArrowBuf and 24 bytes heap per
>>>> vector
>>>> > > (not including arrow bufs).
>>>> > >
>>>> > > With regards to the reference implementation point. It is a good
>>>> point.
>>>> > > I'm on vacation this week. Unless you're pushing hard on this, can
>>>> we pick
>>>> > > this up and discuss more next week?
>>>> > >
>>>> > > thanks,
>>>> > > Jacques
>>>> > >
>>>> > > On Sat, Aug 10, 2019 at 7:39 PM Micah Kornfield <
>>>> emkornfield@gmail.com>
>>>> > > wrote:
>>>> > >
>>>> > >> Hi Jacques,
>>>> > >> I definitely understand these concerns and this change is risky
>>>> because it
>>>> > >> is so large.  Perhaps, creating a new hierarchy, might be the
>>>> cleanest way
>>>> > >> of dealing with this.  This could have other benefits like
>>>> cleaning up
>>>> > >> some
>>>> > >> cruft around dictionary encode and "orphaned" method.   Per past
>>>> e-mail
>>>> > >> threads I agree it is beneficial to have 2 separate reference
>>>> > >> implementations that can communicate fully, and my intent here was
>>>> to
>>>> > >> close
>>>> > >> that gap.
>>>> > >>
>>>> > >> Trying to
>>>> > >> > determine the ramifications of these changes would be
>>>> challenging and
>>>> > >> time
>>>> > >> > consuming against all the different ways we interact with the
>>>> Arrow Java
>>>> > >> > library.
>>>> > >>
>>>> > >>
>>>> > >> Understood.  I took a quick look at Dremio-OSS it seems like it
>>>> has a
>>>> > >> simple java build system?  If it is helpful, I can try to get a
>>>> fork
>>>> > >> running that at least compiles against this PR.  My plan would be
>>>> to cast
>>>> > >> any place that was changed to return a long back to an int, so in
>>>> essence
>>>> > >> the Dremio algorithms would reman 32-bit implementations.
>>>> > >>
>>>> > >> I don't  have the infrastructure to test this change properly from
>>>> a
>>>> > >> distributed systems perspective, so it would still take some time
>>>> from
>>>> > >> Dremio to validate for regressions.
>>>> > >>
>>>> > >> I'm not saying I'm against this but want to make sure we've
>>>> > >> > explored all less disruptive options before considering changing
>>>> > >> something
>>>> > >> > this fundamental (especially when I generally hold the view that
>>>> large
>>>> > >> cell
>>>> > >> > counts against massive contiguous memory is an anti pattern to
>>>> scalable
>>>> > >> > analytical processing--purely subjective of course).
>>>> > >>
>>>> > >>
>>>> > >> I'm open to other ideas here, as well. I don't think it is out of
>>>> the
>>>> > >> question to leave the Java implementation as 32-bit, but if we do,
>>>> then I
>>>> > >> think we should consider a different strategy for reference
>>>> > >> implementations.
>>>> > >>
>>>> > >> Thanks,
>>>> > >> Micah
>>>> > >>
>>>> > >> On Sat, Aug 10, 2019 at 5:09 PM Jacques Nadeau <jacques@apache.org
>>>> >
>>>> > >> wrote:
>>>> > >>
>>>> > >> > Hey Micah, I didn't have a particular path in mind. Was thinking
>>>> more
>>>> > >> along
>>>> > >> > the lines of extra methods as opposed to separate classes.
>>>> > >> >
>>>> > >> > Arrow hasn't historically been a place where we're writing
>>>> algorithms in
>>>> > >> > Java so the fact that they aren't there doesn't mean they don't
>>>> exist.
>>>> > >> We
>>>> > >> > have a large amount of code that depends on the current behavior
>>>> that is
>>>> > >> > deployed in hundreds of customer clusters (you can peruse our
>>>> dremio
>>>> > >> repo
>>>> > >> > to see how extensively we leverage Arrow if interested). Trying
>>>> to
>>>> > >> > determine the ramifications of these changes would be
>>>> challenging and
>>>> > >> time
>>>> > >> > consuming against all the different ways we interact with the
>>>> Arrow Java
>>>> > >> > library. I'm not saying I'm against this but want to make sure
>>>> we've
>>>> > >> > explored all less disruptive options before considering changing
>>>> > >> something
>>>> > >> > this fundamental (especially when I generally hold the view that
>>>> large
>>>> > >> cell
>>>> > >> > counts against massive contiguous memory is an anti pattern to
>>>> scalable
>>>> > >> > analytical processing--purely subjective of course).
>>>> > >> >
>>>> > >> > On Sat, Aug 10, 2019, 4:17 PM Micah Kornfield <
>>>> emkornfield@gmail.com>
>>>> > >> > wrote:
>>>> > >> >
>>>> > >> > > Hi Jacques,
>>>> > >> > > What avenue were you thinking for supporting both paths?   I
>>>> didn't
>>>> > >> want
>>>> > >> > > to pursue a different class hierarchy, because I felt like
>>>> that would
>>>> > >> > > effectively fork the code base, but that is potentially an
>>>> option that
>>>> > >> > > would allow us to have a complete reference implementation in
>>>> Java
>>>> > >> that
>>>> > >> > can
>>>> > >> > > fully interact with C++, without major changes to this code.
>>>> > >> > >
>>>> > >> > > For supporting both APIs on the same classes/interfaces, I
>>>> think they
>>>> > >> > > roughly fall into three categories, changes to input
>>>> parameters,
>>>> > >> changes
>>>> > >> > to
>>>> > >> > > output parameters and algorithm changes.
>>>> > >> > >
>>>> > >> > > For inputs, changing from int to long is essentially a no-op
>>>> from the
>>>> > >> > > compiler perspective.  From the limited micro-benchmarking
>>>> this also
>>>> > >> > > doesn't seem to have a performance impact.  So we could keep
>>>> two
>>>> > >> versions
>>>> > >> > > of the methods that only differ on inputs, but it is not clear
>>>> what
>>>> > >> the
>>>> > >> > > value of that would be.
>>>> > >> > >
>>>> > >> > > For outputs, we can't support methods "long getLength()" and
>>>> "int
>>>> > >> > > getLength()" in the same class, so we would be forced into
>>>> something
>>>> > >> like
>>>> > >> > > "long getLength(boolean dummy)" which I think is a less
>>>> desirable.
>>>> > >> > >
>>>> > >> > > For algorithm changes, there did not appear to be too many
>>>> places
>>>> > >> where
>>>> > >> > we
>>>> > >> > > actually loop over all elements (it is quite possible I missed
>>>> > >> something
>>>> > >> > > here), the ones that I did find I was able to mitigate
>>>> performance
>>>> > >> > > penalties as noted above.  Some of the current implementation
>>>> will
>>>> > >> get a
>>>> > >> > > lot slower for "large arrays", but we can likely fix those
>>>> later or in
>>>> > >> > this
>>>> > >> > > PR with a nested while loop instead of 2 for loops.
>>>> > >> > >
>>>> > >> > > Thanks,
>>>> > >> > > Micah
>>>> > >> > >
>>>> > >> > >
>>>> > >> > > On Saturday, August 10, 2019, Jacques Nadeau <
>>>> jacques@apache.org>
>>>> > >> wrote:
>>>> > >> > >
>>>> > >> > >> This is a pretty massive change to the apis. I wonder how
>>>> nasty it
>>>> > >> would
>>>> > >> > >> be to just support both paths. Have you evaluated how complex
>>>> that
>>>> > >> > would be?
>>>> > >> > >>
>>>> > >> > >> On Wed, Aug 7, 2019 at 11:08 PM Micah Kornfield <
>>>> > >> emkornfield@gmail.com>
>>>> > >> > >> wrote:
>>>> > >> > >>
>>>> > >> > >>> After more investigation, it looks like Float8Benchmarks at
>>>> least
>>>> > >> on my
>>>> > >> > >>> machine are within the range of noise.
>>>> > >> > >>>
>>>> > >> > >>> For BitVectorHelper I pushed a new commit [1], seems to
>>>> bring the
>>>> > >> > >>> BitVectorHelper benchmarks back inline (and even with some
>>>> > >> improvement
>>>> > >> > >>> for
>>>> > >> > >>> getNullCountBenchmark).
>>>> > >> > >>>
>>>> > >> > >>> Benchmark                                        Mode  Cnt
>>>>  Score
>>>> > >> > >>>  Error
>>>> > >> > >>>  Units
>>>> > >> > >>> BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5
>>>>  3.821 ±
>>>> > >> > >>> 0.031
>>>> > >> > >>>  ns/op
>>>> > >> > >>> BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5
>>>> 14.884 ±
>>>> > >> > >>> 0.141
>>>> > >> > >>>  ns/op
>>>> > >> > >>>
>>>> > >> > >>> I applied the same pattern to other loops that I could find,
>>>> and for
>>>> > >> > any
>>>> > >> > >>> "for (long" loop on the critical path, I broke it up into
>>>> two loops.
>>>> > >> > the
>>>> > >> > >>> first loop does iteration by integer, the second finishes
>>>> off for
>>>> > >> any
>>>> > >> > >>> long
>>>> > >> > >>> values.  As a side note it seems like optimization for loops
>>>> using
>>>> > >> long
>>>> > >> > >>> counters at least have a semi-recent open bug for the JVM [2]
>>>> > >> > >>>
>>>> > >> > >>> Thanks,
>>>> > >> > >>> Micah
>>>> > >> > >>>
>>>> > >> > >>> [1]
>>>> > >> > >>>
>>>> > >> > >>>
>>>> > >> >
>>>> > >>
>>>> https://github.com/apache/arrow/pull/5020/commits/2ea2c1ae83e3baa7b9a99a6d06276d968df41797
>>>> > >> > >>> [2] https://bugs.openjdk.java.net/browse/JDK-8223051
>>>> > >> > >>>
>>>> > >> > >>> On Wed, Aug 7, 2019 at 8:11 PM Micah Kornfield <
>>>> > >> emkornfield@gmail.com>
>>>> > >> > >>> wrote:
>>>> > >> > >>>
>>>> > >> > >>> > Indeed, the BoundChecking and CheckNullForGet variables
>>>> can make a
>>>> > >> > big
>>>> > >> > >>> > difference.  I didn't initially run the benchmarks with
>>>> these
>>>> > >> turned
>>>> > >> > on
>>>> > >> > >>> > (you can see the result from above with
>>>> Float8Benchmarks).  Here
>>>> > >> are
>>>> > >> > >>> new
>>>> > >> > >>> > numbers including with the flags enabled.  It looks like
>>>> using
>>>> > >> longs
>>>> > >> > >>> might
>>>> > >> > >>> > be a little bit slower, I'll see what I can do to mitigate
>>>> this.
>>>> > >> > >>> >
>>>> > >> > >>> > Ravindra also volunteered to try to benchmark the changes
>>>> with
>>>> > >> > Dremio's
>>>> > >> > >>> > code on today's sync call.
>>>> > >> > >>> >
>>>> > >> > >>> > New
>>>> > >> > >>> >
>>>> > >> > >>> > Benchmark                                        Mode
>>>> Cnt   Score
>>>> > >> > >>>  Error
>>>> > >> > >>> > Units
>>>> > >> > >>> >
>>>> > >> > >>> > BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5
>>>> > >>  4.176 ±
>>>> > >> > >>> 1.292
>>>> > >> > >>> > ns/op
>>>> > >> > >>> >
>>>> > >> > >>> > BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5
>>>> > >> 26.102 ±
>>>> > >> > >>> 0.700
>>>> > >> > >>> > ns/op
>>>> > >> > >>> >
>>>> > >> > >>> > Float8Benchmarks.copyFromBenchmark   avgt    5  7.398 ±
>>>> 0.084
>>>> > >> us/op
>>>> > >> > >>> >
>>>> > >> > >>> > Float8Benchmarks.readWriteBenchmark  avgt    5  2.711 ±
>>>> 0.057
>>>> > >> us/op
>>>> > >> > >>> >
>>>> > >> > >>> >
>>>> > >> > >>> >
>>>> > >> > >>> > old
>>>> > >> > >>> >
>>>> > >> > >>> > BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5
>>>> > >>  3.828 ±
>>>> > >> > >>> 0.030
>>>> > >> > >>> > ns/op
>>>> > >> > >>> >
>>>> > >> > >>> > BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5
>>>> > >> 20.611 ±
>>>> > >> > >>> 0.188
>>>> > >> > >>> > ns/op
>>>> > >> > >>> >
>>>> > >> > >>> > Float8Benchmarks.copyFromBenchmark   avgt    5  6.597 ±
>>>> 0.462
>>>> > >> us/op
>>>> > >> > >>> >
>>>> > >> > >>> > Float8Benchmarks.readWriteBenchmark  avgt    5  2.615 ±
>>>> 0.027
>>>> > >> us/op
>>>> > >> > >>> >
>>>> > >> > >>> > On Wed, Aug 7, 2019 at 7:13 PM Fan Liya <
>>>> liya.fan03@gmail.com>
>>>> > >> > wrote:
>>>> > >> > >>> >
>>>> > >> > >>> >> Hi Gonzalo,
>>>> > >> > >>> >>
>>>> > >> > >>> >> Thanks for sharing the performance results.
>>>> > >> > >>> >> I am wondering if you have turned off the flag
>>>> > >> > >>> >> BoundsChecking#BOUNDS_CHECKING_ENABLED.
>>>> > >> > >>> >> If not, the lower throughput should be expected.
>>>> > >> > >>> >>
>>>> > >> > >>> >> Best,
>>>> > >> > >>> >> Liya Fan
>>>> > >> > >>> >>
>>>> > >> > >>> >> On Wed, Aug 7, 2019 at 10:23 PM Micah Kornfield <
>>>> > >> > >>> emkornfield@gmail.com>
>>>> > >> > >>> >> wrote:
>>>> > >> > >>> >>
>>>> > >> > >>> >>> Hi Gonzalo,
>>>> > >> > >>> >>> Thank you for the feedback.  I wasn't aware of the JIT
>>>> > >> > >>> implications.   At
>>>> > >> > >>> >>> least on the benchmark run they don't seem to have an
>>>> impact.
>>>> > >> > >>> >>>
>>>> > >> > >>> >>> If there are other benchmarks that people have that can
>>>> > >> validate if
>>>> > >> > >>> this
>>>> > >> > >>> >>> change will be problematic I would appreciate trying to
>>>> run them
>>>> > >> > >>> with the
>>>> > >> > >>> >>> PR.  I will try to run the ones for zeroing/popcnt
>>>> tonight to
>>>> > >> see
>>>> > >> > if
>>>> > >> > >>> >>> there
>>>> > >> > >>> >>> is a change in those.
>>>> > >> > >>> >>>
>>>> > >> > >>> >>> -Micah
>>>> > >> > >>> >>>
>>>> > >> > >>> >>>
>>>> > >> > >>> >>>
>>>> > >> > >>> >>> On Wednesday, August 7, 2019, Gonzalo Ortiz Jaureguizar <
>>>> > >> > >>> >>> golthiryus@gmail.com> wrote:
>>>> > >> > >>> >>>
>>>> > >> > >>> >>> > I would recommend to take care with this kind of
>>>> changes.
>>>> > >> > >>> >>> >
>>>> > >> > >>> >>> > I didn't try Arrow in more than one year, but by then
>>>> the
>>>> > >> > >>> performance
>>>> > >> > >>> >>> was
>>>> > >> > >>> >>> > quite bad in comparison with plain byte buffer access
>>>> > >> > >>> >>> > (see
>>>> http://git.net/apache-arrow-development/msg02353.html *)
>>>> > >> > and
>>>> > >> > >>> >>> > there are several optimizations that the JVM
>>>> (specifically,
>>>> > >> C2)
>>>> > >> > >>> does
>>>> > >> > >>> >>> not
>>>> > >> > >>> >>> > apply when dealing with int instead of longs. One of
>>>> the
>>>> > >> > >>> >>> > most commons is the loop unrolling and vectorization.
>>>> > >> > >>> >>> >
>>>> > >> > >>> >>> > * It doesn't seem the best way to reference an old
>>>> email on
>>>> > >> the
>>>> > >> > >>> list,
>>>> > >> > >>> >>> but
>>>> > >> > >>> >>> > it is the only result shown by Google
>>>> > >> > >>> >>> >
>>>> > >> > >>> >>> > El mié., 7 ago. 2019 a las 11:42, Fan Liya (<
>>>> > >> > liya.fan03@gmail.com
>>>> > >> > >>> >)
>>>> > >> > >>> >>> > escribió:
>>>> > >> > >>> >>> >
>>>> > >> > >>> >>> >> Hi Micah,
>>>> > >> > >>> >>> >>
>>>> > >> > >>> >>> >> Thanks for your effort. The performance result looks
>>>> good.
>>>> > >> > >>> >>> >>
>>>> > >> > >>> >>> >> As you indicated, ArrowBuf will take additional 12
>>>> bytes (4
>>>> > >> > bytes
>>>> > >> > >>> for
>>>> > >> > >>> >>> each
>>>> > >> > >>> >>> >> of length, write index, and read index).
>>>> > >> > >>> >>> >> Similar overheads also exist for vectors like
>>>> > >> > >>> BaseFixedWidthVector,
>>>> > >> > >>> >>> >> BaseVariableWidthVector, etc.
>>>> > >> > >>> >>> >>
>>>> > >> > >>> >>> >> IMO, such overheads are small enough to justify the
>>>> change.
>>>> > >> > >>> >>> >> Let's check if there are other overheads.
>>>> > >> > >>> >>> >>
>>>> > >> > >>> >>> >> Best,
>>>> > >> > >>> >>> >> Liya Fan
>>>> > >> > >>> >>> >>
>>>> > >> > >>> >>> >> On Wed, Aug 7, 2019 at 3:30 PM Micah Kornfield <
>>>> > >> > >>> emkornfield@gmail.com
>>>> > >> > >>> >>> >
>>>> > >> > >>> >>> >> wrote:
>>>> > >> > >>> >>> >>
>>>> > >> > >>> >>> >> > Hi Liya Fan,
>>>> > >> > >>> >>> >> > Based on the Float8Benchmark there does not seem to
>>>> be any
>>>> > >> > >>> >>> meaningful
>>>> > >> > >>> >>> >> > performance difference on my machine.  At least for
>>>> me, the
>>>> > >> > >>> >>> benchmarks
>>>> > >> > >>> >>> >> are
>>>> > >> > >>> >>> >> > not stable enough to say one is faster than the
>>>> other (I've
>>>> > >> > >>> pasted
>>>> > >> > >>> >>> >> results
>>>> > >> > >>> >>> >> > below).  That being said my machine isn't
>>>> necessarily the
>>>> > >> most
>>>> > >> > >>> >>> reliable
>>>> > >> > >>> >>> >> for
>>>> > >> > >>> >>> >> > benchmarking.
>>>> > >> > >>> >>> >> >
>>>> > >> > >>> >>> >> > On an intuitive level, this makes sense to me,  for
>>>> the
>>>> > >> most
>>>> > >> > >>> part it
>>>> > >> > >>> >>> >> seems
>>>> > >> > >>> >>> >> > like the change just moves casting from "int" to
>>>> "long"
>>>> > >> > further
>>>> > >> > >>> up
>>>> > >> > >>> >>> the
>>>> > >> > >>> >>> >> > stack  for  "PlatformDepdendent" operations.  If
>>>> there are
>>>> > >> > other
>>>> > >> > >>> >>> >> benchmarks
>>>> > >> > >>> >>> >> > that you think are worth running let me know.
>>>> > >> > >>> >>> >> >
>>>> > >> > >>> >>> >> > One downside performance wise I think for his
>>>> change is it
>>>> > >> > >>> >>> increases the
>>>> > >> > >>> >>> >> > size of ArrowBuf objects, which I suppose could
>>>> influence
>>>> > >> > cache
>>>> > >> > >>> >>> misses
>>>> > >> > >>> >>> >> at
>>>> > >> > >>> >>> >> > some level or increase the size of call-stacks, but
>>>> this
>>>> > >> > doesn't
>>>> > >> > >>> >>> seem to
>>>> > >> > >>> >>> >> > show up in the benchmark..
>>>> > >> > >>> >>> >> >
>>>> > >> > >>> >>> >> > Thanks,
>>>> > >> > >>> >>> >> > Micah
>>>> > >> > >>> >>> >> >
>>>> > >> > >>> >>> >> > Sample benchmark numbers:
>>>> > >> > >>> >>> >> >
>>>> > >> > >>> >>> >> > [New Code]
>>>> > >> > >>> >>> >> > Benchmark                            Mode  Cnt
>>>>  Score
>>>> > >>  Error
>>>> > >> > >>> >>> Units
>>>> > >> > >>> >>> >> > Float8Benchmarks.copyFromBenchmark   avgt    5
>>>> 15.441 ±
>>>> > >> 0.469
>>>> > >> > >>> >>> us/op
>>>> > >> > >>> >>> >> > Float8Benchmarks.readWriteBenchmark  avgt    5
>>>> 14.057 ±
>>>> > >> 0.115
>>>> > >> > >>> >>> us/op
>>>> > >> > >>> >>> >> >
>>>> > >> > >>> >>> >> >
>>>> > >> > >>> >>> >> > [Old code]
>>>> > >> > >>> >>> >> > Benchmark                            Mode  Cnt
>>>>  Score
>>>> > >>  Error
>>>> > >> > >>> >>> Units
>>>> > >> > >>> >>> >> > Float8Benchmarks.copyFromBenchmark   avgt    5
>>>> 16.248 ±
>>>> > >> 1.409
>>>> > >> > >>> >>> us/op
>>>> > >> > >>> >>> >> > Float8Benchmarks.readWriteBenchmark  avgt    5
>>>> 14.150 ±
>>>> > >> 0.084
>>>> > >> > >>> >>> us/op
>>>> > >> > >>> >>> >> >
>>>> > >> > >>> >>> >> > On Tue, Aug 6, 2019 at 1:18 AM Fan Liya <
>>>> > >> liya.fan03@gmail.com
>>>> > >> > >
>>>> > >> > >>> >>> wrote:
>>>> > >> > >>> >>> >> >
>>>> > >> > >>> >>> >> >> Hi Micah,
>>>> > >> > >>> >>> >> >>
>>>> > >> > >>> >>> >> >> Thanks a lot for doing this.
>>>> > >> > >>> >>> >> >>
>>>> > >> > >>> >>> >> >> I am a little concerned about if there is any
>>>> negative
>>>> > >> > >>> performance
>>>> > >> > >>> >>> >> impact
>>>> > >> > >>> >>> >> >> on the current 32-bit-length based applications.
>>>> > >> > >>> >>> >> >> Can we do some performance comparison on our
>>>> existing
>>>> > >> > >>> benchmarks?
>>>> > >> > >>> >>> >> >>
>>>> > >> > >>> >>> >> >> Best,
>>>> > >> > >>> >>> >> >> Liya Fan
>>>> > >> > >>> >>> >> >>
>>>> > >> > >>> >>> >> >>
>>>> > >> > >>> >>> >> >> On Tue, Aug 6, 2019 at 3:35 PM Micah Kornfield <
>>>> > >> > >>> >>> emkornfield@gmail.com>
>>>> > >> > >>> >>> >> >> wrote:
>>>> > >> > >>> >>> >> >>
>>>> > >> > >>> >>> >> >>> There have been some previous discussions on the
>>>> mailing
>>>> > >> > about
>>>> > >> > >>> >>> >> supporting
>>>> > >> > >>> >>> >> >>> 64-bit lengths for  Java ValueVectors (this is
>>>> what the
>>>> > >> IPC
>>>> > >> > >>> >>> >> specification
>>>> > >> > >>> >>> >> >>> and C++ support).  I created a PR [1] that
>>>> changes all
>>>> > >> APIs
>>>> > >> > >>> that I
>>>> > >> > >>> >>> >> could
>>>> > >> > >>> >>> >> >>> find that take an index to take an "long" instead
>>>> of an
>>>> > >> > "int"
>>>> > >> > >>> (and
>>>> > >> > >>> >>> >> >>> similarly change "size/rowcount" APIs).
>>>> > >> > >>> >>> >> >>>
>>>> > >> > >>> >>> >> >>> It is a big change, so I think it is worth
>>>> discussing if
>>>> > >> it
>>>> > >> > is
>>>> > >> > >>> >>> >> something
>>>> > >> > >>> >>> >> >>> we
>>>> > >> > >>> >>> >> >>> still want to move forward with.  It would be
>>>> nice to
>>>> > >> come
>>>> > >> > to
>>>> > >> > >>> a
>>>> > >> > >>> >>> >> >>> conclusion
>>>> > >> > >>> >>> >> >>> quickly, ideally in the next few days, to avoid a
>>>> lot of
>>>> > >> > merge
>>>> > >> > >>> >>> >> conflicts.
>>>> > >> > >>> >>> >> >>>
>>>> > >> > >>> >>> >> >>> The reason I did this work now is the C++
>>>> implementation
>>>> > >> has
>>>> > >> > >>> added
>>>> > >> > >>> >>> >> >>> support
>>>> > >> > >>> >>> >> >>> for LargeList, LargeBinary and LargeString arrays
>>>> and
>>>> > >> based
>>>> > >> > on
>>>> > >> > >>> >>> prior
>>>> > >> > >>> >>> >> >>> discussions we need to have similar support in
>>>> Java
>>>> > >> before
>>>> > >> > our
>>>> > >> > >>> >>> next
>>>> > >> > >>> >>> >> >>> release. Support 64-bit indexes means we can have
>>>> full
>>>> > >> > >>> >>> compatibility
>>>> > >> > >>> >>> >> and
>>>> > >> > >>> >>> >> >>> make the most use of the types in Java.
>>>> > >> > >>> >>> >> >>>
>>>> > >> > >>> >>> >> >>> Look forward to hearing feedback.
>>>> > >> > >>> >>> >> >>>
>>>> > >> > >>> >>> >> >>> Thanks,
>>>> > >> > >>> >>> >> >>> Micah
>>>> > >> > >>> >>> >> >>>
>>>> > >> > >>> >>> >> >>> [1] https://github.com/apache/arrow/pull/5020
>>>> > >> > >>> >>> >> >>>
>>>> > >> > >>> >>> >> >>
>>>> > >> > >>> >>> >>
>>>> > >> > >>> >>> >
>>>> > >> > >>> >>>
>>>> > >> > >>> >>
>>>> > >> > >>>
>>>> > >> > >>
>>>> > >> >
>>>> > >>
>>>> > >
>>>>
>>>

Re: [Discuss][Java] 64-bit lengths for ValueVectors

Posted by Micah Kornfield <em...@gmail.com>.

>
> I don't think we should couple this discussion with the implementation of
> large list, etc since I think those two concepts are independent.

I'm still trying to balance in my mind which is a worse experience for
consumers of the libraries for these types.  Claiming that Java supports
these types but throwing an exception when the Vectors exceed 32-bits or
just say they aren't supported until we have 64-bit support in Java.


> I've asked some others on my team their opinions on the risk here. I think
> we should probably review some our more complex vector interactions and see
> how the jvm's assembly changes with this kind of change. Using
> microbenchmarking is good but I think we also need to see whether we're
> constantly inserting additional instructions or if in most cases, this
> actually doesn't impact instruction count.


Is this something that your team will take on?  Do you need a rebased
version of the PR or is the existing one sufficient?

Thanks,
Micah

On Thu, Aug 22, 2019 at 8:56 PM Jacques Nadeau <ja...@apache.org> wrote:

> I don't think we should couple this discussion with the implementation of
> large list, etc since I think those two concepts are independent.
>
> I've asked some others on my team their opinions on the risk here. I think
> we should probably review some our more complex vector interactions and see
> how the jvm's assembly changes with this kind of change. Using
> microbenchmarking is good but I think we also need to see whether we're
> constantly inserting additional instructions or if in most cases, this
> actually doesn't impact instruction count.
>
>
>
> On Wed, Aug 21, 2019 at 12:18 PM Micah Kornfield <em...@gmail.com>
> wrote:
>
>>
>>> With regards to the reference implementation point. It is a good point.
>>> I'm on vacation this week. Unless you're pushing hard on this, can we pick
>>> this up and discuss more next week?
>>
>>
>> Hi Jacques, I hope you had a good rest.  Any more thoughts on the
>> reference implementation aspect of this?
>>
>>
>>> To copy the sentiments from the 0.15.0 release thread, I think it
>>> would be best to decouple this discussion from the release timeline
>>> given how many people we have relying on regular releases coming out.
>>> We can keep continue making major 0.x releases until we're ready to
>>> release 1.0.0.
>>
>>
>> I'm OK with it as long as other stakeholders are. Timed releases are the
>> way to go.  As stated on the release thread [1] we need a better mechanism
>> to avoid this type of issue arising again.  The release thread also had
>> some more discussion on compatibility.
>>
>> Thanks,
>> Micah
>>
>> [1]
>> https://lists.apache.org/thread.html/d70feeceaf2570906ade117030b29887af7c77ca5c4a976e6d555920@%3Cdev.arrow.apache.org%3E
>>
>>
>> On Wed, Aug 14, 2019 at 3:23 PM Wes McKinney <we...@gmail.com> wrote:
>>
>>> On Sun, Aug 11, 2019 at 9:40 PM Micah Kornfield <em...@gmail.com>
>>> wrote:
>>> >
>>> > Hi Wes and Jacques,
>>> > See responses below.
>>> >
>>> > With regards to the reference implementation point. It is a good
>>> point. I'm
>>> > > on vacation this week. Unless you're pushing hard on this, can we
>>> pick this
>>> > > up and discuss more next week?
>>> >
>>> >
>>> > Sure thing, enjoy your vacation.  I think the only practical
>>> implications
>>> > are it delays choices around implementing LargeList, LargeBinary,
>>> > LargeString in Java, which in turn might push out the 0.15.0 release.
>>> >
>>>
>>> To copy the sentiments from the 0.15.0 release thread, I think it
>>> would be best to decouple this discussion from the release timeline
>>> given how many people we have relying on regular releases coming out.
>>> We can keep continue making major 0.x releases until we're ready to
>>> release 1.0.0.
>>>
>>> > My stance on this is that I don't know how important it is for Java to
>>> > > support vectors over INT32_MAX elements. The use cases enabled by
>>> > > having very large arrays seem to be concentrated in the native code
>>> > > world (e.g. C/C++/Rust) -- that could just be implementation-centrism
>>> > > on my part, though.
>>> >
>>> >
>>> > A data point against this view is Spark has done work to eliminate 2GB
>>> > memory limits on its block sizes [1].  I don't claim to understand the
>>> > implications of this. Bryan might you have any thoughts here?  I'm OK
>>> with
>>> > INT32_MAX, as well, I think we should think about what this means for
>>> > adding Large types to Java and implications for reference
>>> implementations.
>>> >
>>> > Thanks,
>>> > Micah
>>> >
>>> > [1] https://issues.apache.org/jira/browse/SPARK-6235
>>> >
>>> > On Sun, Aug 11, 2019 at 6:31 PM Jacques Nadeau <ja...@apache.org>
>>> wrote:
>>> >
>>> > > Hey Micah,
>>> > >
>>> > > Appreciate the offer on the compiling. The reality is I'm more
>>> concerned
>>> > > about the unknowns than the compiling issue itself. Any time you've
>>> been
>>> > > tuning for a while, changing something like this could be totally
>>> fine or
>>> > > cause a couple of major issues. For example, we've done a very large
>>> amount
>>> > > of work reducing heap memory footprint of the vectors. Are target is
>>> to
>>> > > actually get it down to 24 bytes per ArrowBuf and 24 bytes heap per
>>> vector
>>> > > (not including arrow bufs).
>>> > >
>>> > > With regards to the reference implementation point. It is a good
>>> point.
>>> > > I'm on vacation this week. Unless you're pushing hard on this, can
>>> we pick
>>> > > this up and discuss more next week?
>>> > >
>>> > > thanks,
>>> > > Jacques
>>> > >
>>> > > On Sat, Aug 10, 2019 at 7:39 PM Micah Kornfield <
>>> emkornfield@gmail.com>
>>> > > wrote:
>>> > >
>>> > >> Hi Jacques,
>>> > >> I definitely understand these concerns and this change is risky
>>> because it
>>> > >> is so large.  Perhaps, creating a new hierarchy, might be the
>>> cleanest way
>>> > >> of dealing with this.  This could have other benefits like cleaning
>>> up
>>> > >> some
>>> > >> cruft around dictionary encode and "orphaned" method.   Per past
>>> e-mail
>>> > >> threads I agree it is beneficial to have 2 separate reference
>>> > >> implementations that can communicate fully, and my intent here was
>>> to
>>> > >> close
>>> > >> that gap.
>>> > >>
>>> > >> Trying to
>>> > >> > determine the ramifications of these changes would be challenging
>>> and
>>> > >> time
>>> > >> > consuming against all the different ways we interact with the
>>> Arrow Java
>>> > >> > library.
>>> > >>
>>> > >>
>>> > >> Understood.  I took a quick look at Dremio-OSS it seems like it has
>>> a
>>> > >> simple java build system?  If it is helpful, I can try to get a fork
>>> > >> running that at least compiles against this PR.  My plan would be
>>> to cast
>>> > >> any place that was changed to return a long back to an int, so in
>>> essence
>>> > >> the Dremio algorithms would reman 32-bit implementations.
>>> > >>
>>> > >> I don't  have the infrastructure to test this change properly from a
>>> > >> distributed systems perspective, so it would still take some time
>>> from
>>> > >> Dremio to validate for regressions.
>>> > >>
>>> > >> I'm not saying I'm against this but want to make sure we've
>>> > >> > explored all less disruptive options before considering changing
>>> > >> something
>>> > >> > this fundamental (especially when I generally hold the view that
>>> large
>>> > >> cell
>>> > >> > counts against massive contiguous memory is an anti pattern to
>>> scalable
>>> > >> > analytical processing--purely subjective of course).
>>> > >>
>>> > >>
>>> > >> I'm open to other ideas here, as well. I don't think it is out of
>>> the
>>> > >> question to leave the Java implementation as 32-bit, but if we do,
>>> then I
>>> > >> think we should consider a different strategy for reference
>>> > >> implementations.
>>> > >>
>>> > >> Thanks,
>>> > >> Micah
>>> > >>
>>> > >> On Sat, Aug 10, 2019 at 5:09 PM Jacques Nadeau <ja...@apache.org>
>>> > >> wrote:
>>> > >>
>>> > >> > Hey Micah, I didn't have a particular path in mind. Was thinking
>>> more
>>> > >> along
>>> > >> > the lines of extra methods as opposed to separate classes.
>>> > >> >
>>> > >> > Arrow hasn't historically been a place where we're writing
>>> algorithms in
>>> > >> > Java so the fact that they aren't there doesn't mean they don't
>>> exist.
>>> > >> We
>>> > >> > have a large amount of code that depends on the current behavior
>>> that is
>>> > >> > deployed in hundreds of customer clusters (you can peruse our
>>> dremio
>>> > >> repo
>>> > >> > to see how extensively we leverage Arrow if interested). Trying to
>>> > >> > determine the ramifications of these changes would be challenging
>>> and
>>> > >> time
>>> > >> > consuming against all the different ways we interact with the
>>> Arrow Java
>>> > >> > library. I'm not saying I'm against this but want to make sure
>>> we've
>>> > >> > explored all less disruptive options before considering changing
>>> > >> something
>>> > >> > this fundamental (especially when I generally hold the view that
>>> large
>>> > >> cell
>>> > >> > counts against massive contiguous memory is an anti pattern to
>>> scalable
>>> > >> > analytical processing--purely subjective of course).
>>> > >> >
>>> > >> > On Sat, Aug 10, 2019, 4:17 PM Micah Kornfield <
>>> emkornfield@gmail.com>
>>> > >> > wrote:
>>> > >> >
>>> > >> > > Hi Jacques,
>>> > >> > > What avenue were you thinking for supporting both paths?   I
>>> didn't
>>> > >> want
>>> > >> > > to pursue a different class hierarchy, because I felt like that
>>> would
>>> > >> > > effectively fork the code base, but that is potentially an
>>> option that
>>> > >> > > would allow us to have a complete reference implementation in
>>> Java
>>> > >> that
>>> > >> > can
>>> > >> > > fully interact with C++, without major changes to this code.
>>> > >> > >
>>> > >> > > For supporting both APIs on the same classes/interfaces, I
>>> think they
>>> > >> > > roughly fall into three categories, changes to input parameters,
>>> > >> changes
>>> > >> > to
>>> > >> > > output parameters and algorithm changes.
>>> > >> > >
>>> > >> > > For inputs, changing from int to long is essentially a no-op
>>> from the
>>> > >> > > compiler perspective.  From the limited micro-benchmarking this
>>> also
>>> > >> > > doesn't seem to have a performance impact.  So we could keep two
>>> > >> versions
>>> > >> > > of the methods that only differ on inputs, but it is not clear
>>> what
>>> > >> the
>>> > >> > > value of that would be.
>>> > >> > >
>>> > >> > > For outputs, we can't support methods "long getLength()" and
>>> "int
>>> > >> > > getLength()" in the same class, so we would be forced into
>>> something
>>> > >> like
>>> > >> > > "long getLength(boolean dummy)" which I think is a less
>>> desirable.
>>> > >> > >
>>> > >> > > For algorithm changes, there did not appear to be too many
>>> places
>>> > >> where
>>> > >> > we
>>> > >> > > actually loop over all elements (it is quite possible I missed
>>> > >> something
>>> > >> > > here), the ones that I did find I was able to mitigate
>>> performance
>>> > >> > > penalties as noted above.  Some of the current implementation
>>> will
>>> > >> get a
>>> > >> > > lot slower for "large arrays", but we can likely fix those
>>> later or in
>>> > >> > this
>>> > >> > > PR with a nested while loop instead of 2 for loops.
>>> > >> > >
>>> > >> > > Thanks,
>>> > >> > > Micah
>>> > >> > >
>>> > >> > >
>>> > >> > > On Saturday, August 10, 2019, Jacques Nadeau <
>>> jacques@apache.org>
>>> > >> wrote:
>>> > >> > >
>>> > >> > >> This is a pretty massive change to the apis. I wonder how
>>> nasty it
>>> > >> would
>>> > >> > >> be to just support both paths. Have you evaluated how complex
>>> that
>>> > >> > would be?
>>> > >> > >>
>>> > >> > >> On Wed, Aug 7, 2019 at 11:08 PM Micah Kornfield <
>>> > >> emkornfield@gmail.com>
>>> > >> > >> wrote:
>>> > >> > >>
>>> > >> > >>> After more investigation, it looks like Float8Benchmarks at
>>> least
>>> > >> on my
>>> > >> > >>> machine are within the range of noise.
>>> > >> > >>>
>>> > >> > >>> For BitVectorHelper I pushed a new commit [1], seems to bring
>>> the
>>> > >> > >>> BitVectorHelper benchmarks back inline (and even with some
>>> > >> improvement
>>> > >> > >>> for
>>> > >> > >>> getNullCountBenchmark).
>>> > >> > >>>
>>> > >> > >>> Benchmark                                        Mode  Cnt
>>>  Score
>>> > >> > >>>  Error
>>> > >> > >>>  Units
>>> > >> > >>> BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5
>>>  3.821 ±
>>> > >> > >>> 0.031
>>> > >> > >>>  ns/op
>>> > >> > >>> BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5
>>> 14.884 ±
>>> > >> > >>> 0.141
>>> > >> > >>>  ns/op
>>> > >> > >>>
>>> > >> > >>> I applied the same pattern to other loops that I could find,
>>> and for
>>> > >> > any
>>> > >> > >>> "for (long" loop on the critical path, I broke it up into two
>>> loops.
>>> > >> > the
>>> > >> > >>> first loop does iteration by integer, the second finishes off
>>> for
>>> > >> any
>>> > >> > >>> long
>>> > >> > >>> values.  As a side note it seems like optimization for loops
>>> using
>>> > >> long
>>> > >> > >>> counters at least have a semi-recent open bug for the JVM [2]
>>> > >> > >>>
>>> > >> > >>> Thanks,
>>> > >> > >>> Micah
>>> > >> > >>>
>>> > >> > >>> [1]
>>> > >> > >>>
>>> > >> > >>>
>>> > >> >
>>> > >>
>>> https://github.com/apache/arrow/pull/5020/commits/2ea2c1ae83e3baa7b9a99a6d06276d968df41797
>>> > >> > >>> [2] https://bugs.openjdk.java.net/browse/JDK-8223051
>>> > >> > >>>
>>> > >> > >>> On Wed, Aug 7, 2019 at 8:11 PM Micah Kornfield <
>>> > >> emkornfield@gmail.com>
>>> > >> > >>> wrote:
>>> > >> > >>>
>>> > >> > >>> > Indeed, the BoundChecking and CheckNullForGet variables can
>>> make a
>>> > >> > big
>>> > >> > >>> > difference.  I didn't initially run the benchmarks with
>>> these
>>> > >> turned
>>> > >> > on
>>> > >> > >>> > (you can see the result from above with Float8Benchmarks).
>>> Here
>>> > >> are
>>> > >> > >>> new
>>> > >> > >>> > numbers including with the flags enabled.  It looks like
>>> using
>>> > >> longs
>>> > >> > >>> might
>>> > >> > >>> > be a little bit slower, I'll see what I can do to mitigate
>>> this.
>>> > >> > >>> >
>>> > >> > >>> > Ravindra also volunteered to try to benchmark the changes
>>> with
>>> > >> > Dremio's
>>> > >> > >>> > code on today's sync call.
>>> > >> > >>> >
>>> > >> > >>> > New
>>> > >> > >>> >
>>> > >> > >>> > Benchmark                                        Mode  Cnt
>>>  Score
>>> > >> > >>>  Error
>>> > >> > >>> > Units
>>> > >> > >>> >
>>> > >> > >>> > BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5
>>> > >>  4.176 ±
>>> > >> > >>> 1.292
>>> > >> > >>> > ns/op
>>> > >> > >>> >
>>> > >> > >>> > BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5
>>> > >> 26.102 ±
>>> > >> > >>> 0.700
>>> > >> > >>> > ns/op
>>> > >> > >>> >
>>> > >> > >>> > Float8Benchmarks.copyFromBenchmark   avgt    5  7.398 ±
>>> 0.084
>>> > >> us/op
>>> > >> > >>> >
>>> > >> > >>> > Float8Benchmarks.readWriteBenchmark  avgt    5  2.711 ±
>>> 0.057
>>> > >> us/op
>>> > >> > >>> >
>>> > >> > >>> >
>>> > >> > >>> >
>>> > >> > >>> > old
>>> > >> > >>> >
>>> > >> > >>> > BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5
>>> > >>  3.828 ±
>>> > >> > >>> 0.030
>>> > >> > >>> > ns/op
>>> > >> > >>> >
>>> > >> > >>> > BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5
>>> > >> 20.611 ±
>>> > >> > >>> 0.188
>>> > >> > >>> > ns/op
>>> > >> > >>> >
>>> > >> > >>> > Float8Benchmarks.copyFromBenchmark   avgt    5  6.597 ±
>>> 0.462
>>> > >> us/op
>>> > >> > >>> >
>>> > >> > >>> > Float8Benchmarks.readWriteBenchmark  avgt    5  2.615 ±
>>> 0.027
>>> > >> us/op
>>> > >> > >>> >
>>> > >> > >>> > On Wed, Aug 7, 2019 at 7:13 PM Fan Liya <
>>> liya.fan03@gmail.com>
>>> > >> > wrote:
>>> > >> > >>> >
>>> > >> > >>> >> Hi Gonzalo,
>>> > >> > >>> >>
>>> > >> > >>> >> Thanks for sharing the performance results.
>>> > >> > >>> >> I am wondering if you have turned off the flag
>>> > >> > >>> >> BoundsChecking#BOUNDS_CHECKING_ENABLED.
>>> > >> > >>> >> If not, the lower throughput should be expected.
>>> > >> > >>> >>
>>> > >> > >>> >> Best,
>>> > >> > >>> >> Liya Fan
>>> > >> > >>> >>
>>> > >> > >>> >> On Wed, Aug 7, 2019 at 10:23 PM Micah Kornfield <
>>> > >> > >>> emkornfield@gmail.com>
>>> > >> > >>> >> wrote:
>>> > >> > >>> >>
>>> > >> > >>> >>> Hi Gonzalo,
>>> > >> > >>> >>> Thank you for the feedback.  I wasn't aware of the JIT
>>> > >> > >>> implications.   At
>>> > >> > >>> >>> least on the benchmark run they don't seem to have an
>>> impact.
>>> > >> > >>> >>>
>>> > >> > >>> >>> If there are other benchmarks that people have that can
>>> > >> validate if
>>> > >> > >>> this
>>> > >> > >>> >>> change will be problematic I would appreciate trying to
>>> run them
>>> > >> > >>> with the
>>> > >> > >>> >>> PR.  I will try to run the ones for zeroing/popcnt
>>> tonight to
>>> > >> see
>>> > >> > if
>>> > >> > >>> >>> there
>>> > >> > >>> >>> is a change in those.
>>> > >> > >>> >>>
>>> > >> > >>> >>> -Micah
>>> > >> > >>> >>>
>>> > >> > >>> >>>
>>> > >> > >>> >>>
>>> > >> > >>> >>> On Wednesday, August 7, 2019, Gonzalo Ortiz Jaureguizar <
>>> > >> > >>> >>> golthiryus@gmail.com> wrote:
>>> > >> > >>> >>>
>>> > >> > >>> >>> > I would recommend to take care with this kind of
>>> changes.
>>> > >> > >>> >>> >
>>> > >> > >>> >>> > I didn't try Arrow in more than one year, but by then
>>> the
>>> > >> > >>> performance
>>> > >> > >>> >>> was
>>> > >> > >>> >>> > quite bad in comparison with plain byte buffer access
>>> > >> > >>> >>> > (see
>>> http://git.net/apache-arrow-development/msg02353.html *)
>>> > >> > and
>>> > >> > >>> >>> > there are several optimizations that the JVM
>>> (specifically,
>>> > >> C2)
>>> > >> > >>> does
>>> > >> > >>> >>> not
>>> > >> > >>> >>> > apply when dealing with int instead of longs. One of the
>>> > >> > >>> >>> > most commons is the loop unrolling and vectorization.
>>> > >> > >>> >>> >
>>> > >> > >>> >>> > * It doesn't seem the best way to reference an old
>>> email on
>>> > >> the
>>> > >> > >>> list,
>>> > >> > >>> >>> but
>>> > >> > >>> >>> > it is the only result shown by Google
>>> > >> > >>> >>> >
>>> > >> > >>> >>> > El mié., 7 ago. 2019 a las 11:42, Fan Liya (<
>>> > >> > liya.fan03@gmail.com
>>> > >> > >>> >)
>>> > >> > >>> >>> > escribió:
>>> > >> > >>> >>> >
>>> > >> > >>> >>> >> Hi Micah,
>>> > >> > >>> >>> >>
>>> > >> > >>> >>> >> Thanks for your effort. The performance result looks
>>> good.
>>> > >> > >>> >>> >>
>>> > >> > >>> >>> >> As you indicated, ArrowBuf will take additional 12
>>> bytes (4
>>> > >> > bytes
>>> > >> > >>> for
>>> > >> > >>> >>> each
>>> > >> > >>> >>> >> of length, write index, and read index).
>>> > >> > >>> >>> >> Similar overheads also exist for vectors like
>>> > >> > >>> BaseFixedWidthVector,
>>> > >> > >>> >>> >> BaseVariableWidthVector, etc.
>>> > >> > >>> >>> >>
>>> > >> > >>> >>> >> IMO, such overheads are small enough to justify the
>>> change.
>>> > >> > >>> >>> >> Let's check if there are other overheads.
>>> > >> > >>> >>> >>
>>> > >> > >>> >>> >> Best,
>>> > >> > >>> >>> >> Liya Fan
>>> > >> > >>> >>> >>
>>> > >> > >>> >>> >> On Wed, Aug 7, 2019 at 3:30 PM Micah Kornfield <
>>> > >> > >>> emkornfield@gmail.com
>>> > >> > >>> >>> >
>>> > >> > >>> >>> >> wrote:
>>> > >> > >>> >>> >>
>>> > >> > >>> >>> >> > Hi Liya Fan,
>>> > >> > >>> >>> >> > Based on the Float8Benchmark there does not seem to
>>> be any
>>> > >> > >>> >>> meaningful
>>> > >> > >>> >>> >> > performance difference on my machine.  At least for
>>> me, the
>>> > >> > >>> >>> benchmarks
>>> > >> > >>> >>> >> are
>>> > >> > >>> >>> >> > not stable enough to say one is faster than the
>>> other (I've
>>> > >> > >>> pasted
>>> > >> > >>> >>> >> results
>>> > >> > >>> >>> >> > below).  That being said my machine isn't
>>> necessarily the
>>> > >> most
>>> > >> > >>> >>> reliable
>>> > >> > >>> >>> >> for
>>> > >> > >>> >>> >> > benchmarking.
>>> > >> > >>> >>> >> >
>>> > >> > >>> >>> >> > On an intuitive level, this makes sense to me,  for
>>> the
>>> > >> most
>>> > >> > >>> part it
>>> > >> > >>> >>> >> seems
>>> > >> > >>> >>> >> > like the change just moves casting from "int" to
>>> "long"
>>> > >> > further
>>> > >> > >>> up
>>> > >> > >>> >>> the
>>> > >> > >>> >>> >> > stack  for  "PlatformDepdendent" operations.  If
>>> there are
>>> > >> > other
>>> > >> > >>> >>> >> benchmarks
>>> > >> > >>> >>> >> > that you think are worth running let me know.
>>> > >> > >>> >>> >> >
>>> > >> > >>> >>> >> > One downside performance wise I think for his change
>>> is it
>>> > >> > >>> >>> increases the
>>> > >> > >>> >>> >> > size of ArrowBuf objects, which I suppose could
>>> influence
>>> > >> > cache
>>> > >> > >>> >>> misses
>>> > >> > >>> >>> >> at
>>> > >> > >>> >>> >> > some level or increase the size of call-stacks, but
>>> this
>>> > >> > doesn't
>>> > >> > >>> >>> seem to
>>> > >> > >>> >>> >> > show up in the benchmark..
>>> > >> > >>> >>> >> >
>>> > >> > >>> >>> >> > Thanks,
>>> > >> > >>> >>> >> > Micah
>>> > >> > >>> >>> >> >
>>> > >> > >>> >>> >> > Sample benchmark numbers:
>>> > >> > >>> >>> >> >
>>> > >> > >>> >>> >> > [New Code]
>>> > >> > >>> >>> >> > Benchmark                            Mode  Cnt
>>>  Score
>>> > >>  Error
>>> > >> > >>> >>> Units
>>> > >> > >>> >>> >> > Float8Benchmarks.copyFromBenchmark   avgt    5
>>> 15.441 ±
>>> > >> 0.469
>>> > >> > >>> >>> us/op
>>> > >> > >>> >>> >> > Float8Benchmarks.readWriteBenchmark  avgt    5
>>> 14.057 ±
>>> > >> 0.115
>>> > >> > >>> >>> us/op
>>> > >> > >>> >>> >> >
>>> > >> > >>> >>> >> >
>>> > >> > >>> >>> >> > [Old code]
>>> > >> > >>> >>> >> > Benchmark                            Mode  Cnt
>>>  Score
>>> > >>  Error
>>> > >> > >>> >>> Units
>>> > >> > >>> >>> >> > Float8Benchmarks.copyFromBenchmark   avgt    5
>>> 16.248 ±
>>> > >> 1.409
>>> > >> > >>> >>> us/op
>>> > >> > >>> >>> >> > Float8Benchmarks.readWriteBenchmark  avgt    5
>>> 14.150 ±
>>> > >> 0.084
>>> > >> > >>> >>> us/op
>>> > >> > >>> >>> >> >
>>> > >> > >>> >>> >> > On Tue, Aug 6, 2019 at 1:18 AM Fan Liya <
>>> > >> liya.fan03@gmail.com
>>> > >> > >
>>> > >> > >>> >>> wrote:
>>> > >> > >>> >>> >> >
>>> > >> > >>> >>> >> >> Hi Micah,
>>> > >> > >>> >>> >> >>
>>> > >> > >>> >>> >> >> Thanks a lot for doing this.
>>> > >> > >>> >>> >> >>
>>> > >> > >>> >>> >> >> I am a little concerned about if there is any
>>> negative
>>> > >> > >>> performance
>>> > >> > >>> >>> >> impact
>>> > >> > >>> >>> >> >> on the current 32-bit-length based applications.
>>> > >> > >>> >>> >> >> Can we do some performance comparison on our
>>> existing
>>> > >> > >>> benchmarks?
>>> > >> > >>> >>> >> >>
>>> > >> > >>> >>> >> >> Best,
>>> > >> > >>> >>> >> >> Liya Fan
>>> > >> > >>> >>> >> >>
>>> > >> > >>> >>> >> >>
>>> > >> > >>> >>> >> >> On Tue, Aug 6, 2019 at 3:35 PM Micah Kornfield <
>>> > >> > >>> >>> emkornfield@gmail.com>
>>> > >> > >>> >>> >> >> wrote:
>>> > >> > >>> >>> >> >>
>>> > >> > >>> >>> >> >>> There have been some previous discussions on the
>>> mailing
>>> > >> > about
>>> > >> > >>> >>> >> supporting
>>> > >> > >>> >>> >> >>> 64-bit lengths for  Java ValueVectors (this is
>>> what the
>>> > >> IPC
>>> > >> > >>> >>> >> specification
>>> > >> > >>> >>> >> >>> and C++ support).  I created a PR [1] that changes
>>> all
>>> > >> APIs
>>> > >> > >>> that I
>>> > >> > >>> >>> >> could
>>> > >> > >>> >>> >> >>> find that take an index to take an "long" instead
>>> of an
>>> > >> > "int"
>>> > >> > >>> (and
>>> > >> > >>> >>> >> >>> similarly change "size/rowcount" APIs).
>>> > >> > >>> >>> >> >>>
>>> > >> > >>> >>> >> >>> It is a big change, so I think it is worth
>>> discussing if
>>> > >> it
>>> > >> > is
>>> > >> > >>> >>> >> something
>>> > >> > >>> >>> >> >>> we
>>> > >> > >>> >>> >> >>> still want to move forward with.  It would be nice
>>> to
>>> > >> come
>>> > >> > to
>>> > >> > >>> a
>>> > >> > >>> >>> >> >>> conclusion
>>> > >> > >>> >>> >> >>> quickly, ideally in the next few days, to avoid a
>>> lot of
>>> > >> > merge
>>> > >> > >>> >>> >> conflicts.
>>> > >> > >>> >>> >> >>>
>>> > >> > >>> >>> >> >>> The reason I did this work now is the C++
>>> implementation
>>> > >> has
>>> > >> > >>> added
>>> > >> > >>> >>> >> >>> support
>>> > >> > >>> >>> >> >>> for LargeList, LargeBinary and LargeString arrays
>>> and
>>> > >> based
>>> > >> > on
>>> > >> > >>> >>> prior
>>> > >> > >>> >>> >> >>> discussions we need to have similar support in Java
>>> > >> before
>>> > >> > our
>>> > >> > >>> >>> next
>>> > >> > >>> >>> >> >>> release. Support 64-bit indexes means we can have
>>> full
>>> > >> > >>> >>> compatibility
>>> > >> > >>> >>> >> and
>>> > >> > >>> >>> >> >>> make the most use of the types in Java.
>>> > >> > >>> >>> >> >>>
>>> > >> > >>> >>> >> >>> Look forward to hearing feedback.
>>> > >> > >>> >>> >> >>>
>>> > >> > >>> >>> >> >>> Thanks,
>>> > >> > >>> >>> >> >>> Micah
>>> > >> > >>> >>> >> >>>
>>> > >> > >>> >>> >> >>> [1] https://github.com/apache/arrow/pull/5020
>>> > >> > >>> >>> >> >>>
>>> > >> > >>> >>> >> >>
>>> > >> > >>> >>> >>
>>> > >> > >>> >>> >
>>> > >> > >>> >>>
>>> > >> > >>> >>
>>> > >> > >>>
>>> > >> > >>
>>> > >> >
>>> > >>
>>> > >
>>>
>>

Re: [Discuss][Java] 64-bit lengths for ValueVectors

Posted by Jacques Nadeau <ja...@apache.org>.

I don't think we should couple this discussion with the implementation of
large list, etc since I think those two concepts are independent.

I've asked some others on my team their opinions on the risk here. I think
we should probably review some our more complex vector interactions and see
how the jvm's assembly changes with this kind of change. Using
microbenchmarking is good but I think we also need to see whether we're
constantly inserting additional instructions or if in most cases, this
actually doesn't impact instruction count.



On Wed, Aug 21, 2019 at 12:18 PM Micah Kornfield <em...@gmail.com>
wrote:

>
>> With regards to the reference implementation point. It is a good point.
>> I'm on vacation this week. Unless you're pushing hard on this, can we pick
>> this up and discuss more next week?
>
>
> Hi Jacques, I hope you had a good rest.  Any more thoughts on the
> reference implementation aspect of this?
>
>
>> To copy the sentiments from the 0.15.0 release thread, I think it
>> would be best to decouple this discussion from the release timeline
>> given how many people we have relying on regular releases coming out.
>> We can keep continue making major 0.x releases until we're ready to
>> release 1.0.0.
>
>
> I'm OK with it as long as other stakeholders are. Timed releases are the
> way to go.  As stated on the release thread [1] we need a better mechanism
> to avoid this type of issue arising again.  The release thread also had
> some more discussion on compatibility.
>
> Thanks,
> Micah
>
> [1]
> https://lists.apache.org/thread.html/d70feeceaf2570906ade117030b29887af7c77ca5c4a976e6d555920@%3Cdev.arrow.apache.org%3E
>
>
> On Wed, Aug 14, 2019 at 3:23 PM Wes McKinney <we...@gmail.com> wrote:
>
>> On Sun, Aug 11, 2019 at 9:40 PM Micah Kornfield <em...@gmail.com>
>> wrote:
>> >
>> > Hi Wes and Jacques,
>> > See responses below.
>> >
>> > With regards to the reference implementation point. It is a good point.
>> I'm
>> > > on vacation this week. Unless you're pushing hard on this, can we
>> pick this
>> > > up and discuss more next week?
>> >
>> >
>> > Sure thing, enjoy your vacation.  I think the only practical
>> implications
>> > are it delays choices around implementing LargeList, LargeBinary,
>> > LargeString in Java, which in turn might push out the 0.15.0 release.
>> >
>>
>> To copy the sentiments from the 0.15.0 release thread, I think it
>> would be best to decouple this discussion from the release timeline
>> given how many people we have relying on regular releases coming out.
>> We can keep continue making major 0.x releases until we're ready to
>> release 1.0.0.
>>
>> > My stance on this is that I don't know how important it is for Java to
>> > > support vectors over INT32_MAX elements. The use cases enabled by
>> > > having very large arrays seem to be concentrated in the native code
>> > > world (e.g. C/C++/Rust) -- that could just be implementation-centrism
>> > > on my part, though.
>> >
>> >
>> > A data point against this view is Spark has done work to eliminate 2GB
>> > memory limits on its block sizes [1].  I don't claim to understand the
>> > implications of this. Bryan might you have any thoughts here?  I'm OK
>> with
>> > INT32_MAX, as well, I think we should think about what this means for
>> > adding Large types to Java and implications for reference
>> implementations.
>> >
>> > Thanks,
>> > Micah
>> >
>> > [1] https://issues.apache.org/jira/browse/SPARK-6235
>> >
>> > On Sun, Aug 11, 2019 at 6:31 PM Jacques Nadeau <ja...@apache.org>
>> wrote:
>> >
>> > > Hey Micah,
>> > >
>> > > Appreciate the offer on the compiling. The reality is I'm more
>> concerned
>> > > about the unknowns than the compiling issue itself. Any time you've
>> been
>> > > tuning for a while, changing something like this could be totally
>> fine or
>> > > cause a couple of major issues. For example, we've done a very large
>> amount
>> > > of work reducing heap memory footprint of the vectors. Are target is
>> to
>> > > actually get it down to 24 bytes per ArrowBuf and 24 bytes heap per
>> vector
>> > > (not including arrow bufs).
>> > >
>> > > With regards to the reference implementation point. It is a good
>> point.
>> > > I'm on vacation this week. Unless you're pushing hard on this, can we
>> pick
>> > > this up and discuss more next week?
>> > >
>> > > thanks,
>> > > Jacques
>> > >
>> > > On Sat, Aug 10, 2019 at 7:39 PM Micah Kornfield <
>> emkornfield@gmail.com>
>> > > wrote:
>> > >
>> > >> Hi Jacques,
>> > >> I definitely understand these concerns and this change is risky
>> because it
>> > >> is so large.  Perhaps, creating a new hierarchy, might be the
>> cleanest way
>> > >> of dealing with this.  This could have other benefits like cleaning
>> up
>> > >> some
>> > >> cruft around dictionary encode and "orphaned" method.   Per past
>> e-mail
>> > >> threads I agree it is beneficial to have 2 separate reference
>> > >> implementations that can communicate fully, and my intent here was to
>> > >> close
>> > >> that gap.
>> > >>
>> > >> Trying to
>> > >> > determine the ramifications of these changes would be challenging
>> and
>> > >> time
>> > >> > consuming against all the different ways we interact with the
>> Arrow Java
>> > >> > library.
>> > >>
>> > >>
>> > >> Understood.  I took a quick look at Dremio-OSS it seems like it has a
>> > >> simple java build system?  If it is helpful, I can try to get a fork
>> > >> running that at least compiles against this PR.  My plan would be to
>> cast
>> > >> any place that was changed to return a long back to an int, so in
>> essence
>> > >> the Dremio algorithms would reman 32-bit implementations.
>> > >>
>> > >> I don't  have the infrastructure to test this change properly from a
>> > >> distributed systems perspective, so it would still take some time
>> from
>> > >> Dremio to validate for regressions.
>> > >>
>> > >> I'm not saying I'm against this but want to make sure we've
>> > >> > explored all less disruptive options before considering changing
>> > >> something
>> > >> > this fundamental (especially when I generally hold the view that
>> large
>> > >> cell
>> > >> > counts against massive contiguous memory is an anti pattern to
>> scalable
>> > >> > analytical processing--purely subjective of course).
>> > >>
>> > >>
>> > >> I'm open to other ideas here, as well. I don't think it is out of the
>> > >> question to leave the Java implementation as 32-bit, but if we do,
>> then I
>> > >> think we should consider a different strategy for reference
>> > >> implementations.
>> > >>
>> > >> Thanks,
>> > >> Micah
>> > >>
>> > >> On Sat, Aug 10, 2019 at 5:09 PM Jacques Nadeau <ja...@apache.org>
>> > >> wrote:
>> > >>
>> > >> > Hey Micah, I didn't have a particular path in mind. Was thinking
>> more
>> > >> along
>> > >> > the lines of extra methods as opposed to separate classes.
>> > >> >
>> > >> > Arrow hasn't historically been a place where we're writing
>> algorithms in
>> > >> > Java so the fact that they aren't there doesn't mean they don't
>> exist.
>> > >> We
>> > >> > have a large amount of code that depends on the current behavior
>> that is
>> > >> > deployed in hundreds of customer clusters (you can peruse our
>> dremio
>> > >> repo
>> > >> > to see how extensively we leverage Arrow if interested). Trying to
>> > >> > determine the ramifications of these changes would be challenging
>> and
>> > >> time
>> > >> > consuming against all the different ways we interact with the
>> Arrow Java
>> > >> > library. I'm not saying I'm against this but want to make sure
>> we've
>> > >> > explored all less disruptive options before considering changing
>> > >> something
>> > >> > this fundamental (especially when I generally hold the view that
>> large
>> > >> cell
>> > >> > counts against massive contiguous memory is an anti pattern to
>> scalable
>> > >> > analytical processing--purely subjective of course).
>> > >> >
>> > >> > On Sat, Aug 10, 2019, 4:17 PM Micah Kornfield <
>> emkornfield@gmail.com>
>> > >> > wrote:
>> > >> >
>> > >> > > Hi Jacques,
>> > >> > > What avenue were you thinking for supporting both paths?   I
>> didn't
>> > >> want
>> > >> > > to pursue a different class hierarchy, because I felt like that
>> would
>> > >> > > effectively fork the code base, but that is potentially an
>> option that
>> > >> > > would allow us to have a complete reference implementation in
>> Java
>> > >> that
>> > >> > can
>> > >> > > fully interact with C++, without major changes to this code.
>> > >> > >
>> > >> > > For supporting both APIs on the same classes/interfaces, I think
>> they
>> > >> > > roughly fall into three categories, changes to input parameters,
>> > >> changes
>> > >> > to
>> > >> > > output parameters and algorithm changes.
>> > >> > >
>> > >> > > For inputs, changing from int to long is essentially a no-op
>> from the
>> > >> > > compiler perspective.  From the limited micro-benchmarking this
>> also
>> > >> > > doesn't seem to have a performance impact.  So we could keep two
>> > >> versions
>> > >> > > of the methods that only differ on inputs, but it is not clear
>> what
>> > >> the
>> > >> > > value of that would be.
>> > >> > >
>> > >> > > For outputs, we can't support methods "long getLength()" and "int
>> > >> > > getLength()" in the same class, so we would be forced into
>> something
>> > >> like
>> > >> > > "long getLength(boolean dummy)" which I think is a less
>> desirable.
>> > >> > >
>> > >> > > For algorithm changes, there did not appear to be too many places
>> > >> where
>> > >> > we
>> > >> > > actually loop over all elements (it is quite possible I missed
>> > >> something
>> > >> > > here), the ones that I did find I was able to mitigate
>> performance
>> > >> > > penalties as noted above.  Some of the current implementation
>> will
>> > >> get a
>> > >> > > lot slower for "large arrays", but we can likely fix those later
>> or in
>> > >> > this
>> > >> > > PR with a nested while loop instead of 2 for loops.
>> > >> > >
>> > >> > > Thanks,
>> > >> > > Micah
>> > >> > >
>> > >> > >
>> > >> > > On Saturday, August 10, 2019, Jacques Nadeau <jacques@apache.org
>> >
>> > >> wrote:
>> > >> > >
>> > >> > >> This is a pretty massive change to the apis. I wonder how nasty
>> it
>> > >> would
>> > >> > >> be to just support both paths. Have you evaluated how complex
>> that
>> > >> > would be?
>> > >> > >>
>> > >> > >> On Wed, Aug 7, 2019 at 11:08 PM Micah Kornfield <
>> > >> emkornfield@gmail.com>
>> > >> > >> wrote:
>> > >> > >>
>> > >> > >>> After more investigation, it looks like Float8Benchmarks at
>> least
>> > >> on my
>> > >> > >>> machine are within the range of noise.
>> > >> > >>>
>> > >> > >>> For BitVectorHelper I pushed a new commit [1], seems to bring
>> the
>> > >> > >>> BitVectorHelper benchmarks back inline (and even with some
>> > >> improvement
>> > >> > >>> for
>> > >> > >>> getNullCountBenchmark).
>> > >> > >>>
>> > >> > >>> Benchmark                                        Mode  Cnt
>>  Score
>> > >> > >>>  Error
>> > >> > >>>  Units
>> > >> > >>> BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5
>>  3.821 ±
>> > >> > >>> 0.031
>> > >> > >>>  ns/op
>> > >> > >>> BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5
>> 14.884 ±
>> > >> > >>> 0.141
>> > >> > >>>  ns/op
>> > >> > >>>
>> > >> > >>> I applied the same pattern to other loops that I could find,
>> and for
>> > >> > any
>> > >> > >>> "for (long" loop on the critical path, I broke it up into two
>> loops.
>> > >> > the
>> > >> > >>> first loop does iteration by integer, the second finishes off
>> for
>> > >> any
>> > >> > >>> long
>> > >> > >>> values.  As a side note it seems like optimization for loops
>> using
>> > >> long
>> > >> > >>> counters at least have a semi-recent open bug for the JVM [2]
>> > >> > >>>
>> > >> > >>> Thanks,
>> > >> > >>> Micah
>> > >> > >>>
>> > >> > >>> [1]
>> > >> > >>>
>> > >> > >>>
>> > >> >
>> > >>
>> https://github.com/apache/arrow/pull/5020/commits/2ea2c1ae83e3baa7b9a99a6d06276d968df41797
>> > >> > >>> [2] https://bugs.openjdk.java.net/browse/JDK-8223051
>> > >> > >>>
>> > >> > >>> On Wed, Aug 7, 2019 at 8:11 PM Micah Kornfield <
>> > >> emkornfield@gmail.com>
>> > >> > >>> wrote:
>> > >> > >>>
>> > >> > >>> > Indeed, the BoundChecking and CheckNullForGet variables can
>> make a
>> > >> > big
>> > >> > >>> > difference.  I didn't initially run the benchmarks with these
>> > >> turned
>> > >> > on
>> > >> > >>> > (you can see the result from above with Float8Benchmarks).
>> Here
>> > >> are
>> > >> > >>> new
>> > >> > >>> > numbers including with the flags enabled.  It looks like
>> using
>> > >> longs
>> > >> > >>> might
>> > >> > >>> > be a little bit slower, I'll see what I can do to mitigate
>> this.
>> > >> > >>> >
>> > >> > >>> > Ravindra also volunteered to try to benchmark the changes
>> with
>> > >> > Dremio's
>> > >> > >>> > code on today's sync call.
>> > >> > >>> >
>> > >> > >>> > New
>> > >> > >>> >
>> > >> > >>> > Benchmark                                        Mode  Cnt
>>  Score
>> > >> > >>>  Error
>> > >> > >>> > Units
>> > >> > >>> >
>> > >> > >>> > BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5
>> > >>  4.176 ±
>> > >> > >>> 1.292
>> > >> > >>> > ns/op
>> > >> > >>> >
>> > >> > >>> > BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5
>> > >> 26.102 ±
>> > >> > >>> 0.700
>> > >> > >>> > ns/op
>> > >> > >>> >
>> > >> > >>> > Float8Benchmarks.copyFromBenchmark   avgt    5  7.398 ± 0.084
>> > >> us/op
>> > >> > >>> >
>> > >> > >>> > Float8Benchmarks.readWriteBenchmark  avgt    5  2.711 ± 0.057
>> > >> us/op
>> > >> > >>> >
>> > >> > >>> >
>> > >> > >>> >
>> > >> > >>> > old
>> > >> > >>> >
>> > >> > >>> > BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5
>> > >>  3.828 ±
>> > >> > >>> 0.030
>> > >> > >>> > ns/op
>> > >> > >>> >
>> > >> > >>> > BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5
>> > >> 20.611 ±
>> > >> > >>> 0.188
>> > >> > >>> > ns/op
>> > >> > >>> >
>> > >> > >>> > Float8Benchmarks.copyFromBenchmark   avgt    5  6.597 ± 0.462
>> > >> us/op
>> > >> > >>> >
>> > >> > >>> > Float8Benchmarks.readWriteBenchmark  avgt    5  2.615 ± 0.027
>> > >> us/op
>> > >> > >>> >
>> > >> > >>> > On Wed, Aug 7, 2019 at 7:13 PM Fan Liya <
>> liya.fan03@gmail.com>
>> > >> > wrote:
>> > >> > >>> >
>> > >> > >>> >> Hi Gonzalo,
>> > >> > >>> >>
>> > >> > >>> >> Thanks for sharing the performance results.
>> > >> > >>> >> I am wondering if you have turned off the flag
>> > >> > >>> >> BoundsChecking#BOUNDS_CHECKING_ENABLED.
>> > >> > >>> >> If not, the lower throughput should be expected.
>> > >> > >>> >>
>> > >> > >>> >> Best,
>> > >> > >>> >> Liya Fan
>> > >> > >>> >>
>> > >> > >>> >> On Wed, Aug 7, 2019 at 10:23 PM Micah Kornfield <
>> > >> > >>> emkornfield@gmail.com>
>> > >> > >>> >> wrote:
>> > >> > >>> >>
>> > >> > >>> >>> Hi Gonzalo,
>> > >> > >>> >>> Thank you for the feedback.  I wasn't aware of the JIT
>> > >> > >>> implications.   At
>> > >> > >>> >>> least on the benchmark run they don't seem to have an
>> impact.
>> > >> > >>> >>>
>> > >> > >>> >>> If there are other benchmarks that people have that can
>> > >> validate if
>> > >> > >>> this
>> > >> > >>> >>> change will be problematic I would appreciate trying to
>> run them
>> > >> > >>> with the
>> > >> > >>> >>> PR.  I will try to run the ones for zeroing/popcnt tonight
>> to
>> > >> see
>> > >> > if
>> > >> > >>> >>> there
>> > >> > >>> >>> is a change in those.
>> > >> > >>> >>>
>> > >> > >>> >>> -Micah
>> > >> > >>> >>>
>> > >> > >>> >>>
>> > >> > >>> >>>
>> > >> > >>> >>> On Wednesday, August 7, 2019, Gonzalo Ortiz Jaureguizar <
>> > >> > >>> >>> golthiryus@gmail.com> wrote:
>> > >> > >>> >>>
>> > >> > >>> >>> > I would recommend to take care with this kind of changes.
>> > >> > >>> >>> >
>> > >> > >>> >>> > I didn't try Arrow in more than one year, but by then the
>> > >> > >>> performance
>> > >> > >>> >>> was
>> > >> > >>> >>> > quite bad in comparison with plain byte buffer access
>> > >> > >>> >>> > (see
>> http://git.net/apache-arrow-development/msg02353.html *)
>> > >> > and
>> > >> > >>> >>> > there are several optimizations that the JVM
>> (specifically,
>> > >> C2)
>> > >> > >>> does
>> > >> > >>> >>> not
>> > >> > >>> >>> > apply when dealing with int instead of longs. One of the
>> > >> > >>> >>> > most commons is the loop unrolling and vectorization.
>> > >> > >>> >>> >
>> > >> > >>> >>> > * It doesn't seem the best way to reference an old email
>> on
>> > >> the
>> > >> > >>> list,
>> > >> > >>> >>> but
>> > >> > >>> >>> > it is the only result shown by Google
>> > >> > >>> >>> >
>> > >> > >>> >>> > El mié., 7 ago. 2019 a las 11:42, Fan Liya (<
>> > >> > liya.fan03@gmail.com
>> > >> > >>> >)
>> > >> > >>> >>> > escribió:
>> > >> > >>> >>> >
>> > >> > >>> >>> >> Hi Micah,
>> > >> > >>> >>> >>
>> > >> > >>> >>> >> Thanks for your effort. The performance result looks
>> good.
>> > >> > >>> >>> >>
>> > >> > >>> >>> >> As you indicated, ArrowBuf will take additional 12
>> bytes (4
>> > >> > bytes
>> > >> > >>> for
>> > >> > >>> >>> each
>> > >> > >>> >>> >> of length, write index, and read index).
>> > >> > >>> >>> >> Similar overheads also exist for vectors like
>> > >> > >>> BaseFixedWidthVector,
>> > >> > >>> >>> >> BaseVariableWidthVector, etc.
>> > >> > >>> >>> >>
>> > >> > >>> >>> >> IMO, such overheads are small enough to justify the
>> change.
>> > >> > >>> >>> >> Let's check if there are other overheads.
>> > >> > >>> >>> >>
>> > >> > >>> >>> >> Best,
>> > >> > >>> >>> >> Liya Fan
>> > >> > >>> >>> >>
>> > >> > >>> >>> >> On Wed, Aug 7, 2019 at 3:30 PM Micah Kornfield <
>> > >> > >>> emkornfield@gmail.com
>> > >> > >>> >>> >
>> > >> > >>> >>> >> wrote:
>> > >> > >>> >>> >>
>> > >> > >>> >>> >> > Hi Liya Fan,
>> > >> > >>> >>> >> > Based on the Float8Benchmark there does not seem to
>> be any
>> > >> > >>> >>> meaningful
>> > >> > >>> >>> >> > performance difference on my machine.  At least for
>> me, the
>> > >> > >>> >>> benchmarks
>> > >> > >>> >>> >> are
>> > >> > >>> >>> >> > not stable enough to say one is faster than the other
>> (I've
>> > >> > >>> pasted
>> > >> > >>> >>> >> results
>> > >> > >>> >>> >> > below).  That being said my machine isn't necessarily
>> the
>> > >> most
>> > >> > >>> >>> reliable
>> > >> > >>> >>> >> for
>> > >> > >>> >>> >> > benchmarking.
>> > >> > >>> >>> >> >
>> > >> > >>> >>> >> > On an intuitive level, this makes sense to me,  for
>> the
>> > >> most
>> > >> > >>> part it
>> > >> > >>> >>> >> seems
>> > >> > >>> >>> >> > like the change just moves casting from "int" to
>> "long"
>> > >> > further
>> > >> > >>> up
>> > >> > >>> >>> the
>> > >> > >>> >>> >> > stack  for  "PlatformDepdendent" operations.  If
>> there are
>> > >> > other
>> > >> > >>> >>> >> benchmarks
>> > >> > >>> >>> >> > that you think are worth running let me know.
>> > >> > >>> >>> >> >
>> > >> > >>> >>> >> > One downside performance wise I think for his change
>> is it
>> > >> > >>> >>> increases the
>> > >> > >>> >>> >> > size of ArrowBuf objects, which I suppose could
>> influence
>> > >> > cache
>> > >> > >>> >>> misses
>> > >> > >>> >>> >> at
>> > >> > >>> >>> >> > some level or increase the size of call-stacks, but
>> this
>> > >> > doesn't
>> > >> > >>> >>> seem to
>> > >> > >>> >>> >> > show up in the benchmark..
>> > >> > >>> >>> >> >
>> > >> > >>> >>> >> > Thanks,
>> > >> > >>> >>> >> > Micah
>> > >> > >>> >>> >> >
>> > >> > >>> >>> >> > Sample benchmark numbers:
>> > >> > >>> >>> >> >
>> > >> > >>> >>> >> > [New Code]
>> > >> > >>> >>> >> > Benchmark                            Mode  Cnt   Score
>> > >>  Error
>> > >> > >>> >>> Units
>> > >> > >>> >>> >> > Float8Benchmarks.copyFromBenchmark   avgt    5
>> 15.441 ±
>> > >> 0.469
>> > >> > >>> >>> us/op
>> > >> > >>> >>> >> > Float8Benchmarks.readWriteBenchmark  avgt    5
>> 14.057 ±
>> > >> 0.115
>> > >> > >>> >>> us/op
>> > >> > >>> >>> >> >
>> > >> > >>> >>> >> >
>> > >> > >>> >>> >> > [Old code]
>> > >> > >>> >>> >> > Benchmark                            Mode  Cnt   Score
>> > >>  Error
>> > >> > >>> >>> Units
>> > >> > >>> >>> >> > Float8Benchmarks.copyFromBenchmark   avgt    5
>> 16.248 ±
>> > >> 1.409
>> > >> > >>> >>> us/op
>> > >> > >>> >>> >> > Float8Benchmarks.readWriteBenchmark  avgt    5
>> 14.150 ±
>> > >> 0.084
>> > >> > >>> >>> us/op
>> > >> > >>> >>> >> >
>> > >> > >>> >>> >> > On Tue, Aug 6, 2019 at 1:18 AM Fan Liya <
>> > >> liya.fan03@gmail.com
>> > >> > >
>> > >> > >>> >>> wrote:
>> > >> > >>> >>> >> >
>> > >> > >>> >>> >> >> Hi Micah,
>> > >> > >>> >>> >> >>
>> > >> > >>> >>> >> >> Thanks a lot for doing this.
>> > >> > >>> >>> >> >>
>> > >> > >>> >>> >> >> I am a little concerned about if there is any
>> negative
>> > >> > >>> performance
>> > >> > >>> >>> >> impact
>> > >> > >>> >>> >> >> on the current 32-bit-length based applications.
>> > >> > >>> >>> >> >> Can we do some performance comparison on our existing
>> > >> > >>> benchmarks?
>> > >> > >>> >>> >> >>
>> > >> > >>> >>> >> >> Best,
>> > >> > >>> >>> >> >> Liya Fan
>> > >> > >>> >>> >> >>
>> > >> > >>> >>> >> >>
>> > >> > >>> >>> >> >> On Tue, Aug 6, 2019 at 3:35 PM Micah Kornfield <
>> > >> > >>> >>> emkornfield@gmail.com>
>> > >> > >>> >>> >> >> wrote:
>> > >> > >>> >>> >> >>
>> > >> > >>> >>> >> >>> There have been some previous discussions on the
>> mailing
>> > >> > about
>> > >> > >>> >>> >> supporting
>> > >> > >>> >>> >> >>> 64-bit lengths for  Java ValueVectors (this is what
>> the
>> > >> IPC
>> > >> > >>> >>> >> specification
>> > >> > >>> >>> >> >>> and C++ support).  I created a PR [1] that changes
>> all
>> > >> APIs
>> > >> > >>> that I
>> > >> > >>> >>> >> could
>> > >> > >>> >>> >> >>> find that take an index to take an "long" instead
>> of an
>> > >> > "int"
>> > >> > >>> (and
>> > >> > >>> >>> >> >>> similarly change "size/rowcount" APIs).
>> > >> > >>> >>> >> >>>
>> > >> > >>> >>> >> >>> It is a big change, so I think it is worth
>> discussing if
>> > >> it
>> > >> > is
>> > >> > >>> >>> >> something
>> > >> > >>> >>> >> >>> we
>> > >> > >>> >>> >> >>> still want to move forward with.  It would be nice
>> to
>> > >> come
>> > >> > to
>> > >> > >>> a
>> > >> > >>> >>> >> >>> conclusion
>> > >> > >>> >>> >> >>> quickly, ideally in the next few days, to avoid a
>> lot of
>> > >> > merge
>> > >> > >>> >>> >> conflicts.
>> > >> > >>> >>> >> >>>
>> > >> > >>> >>> >> >>> The reason I did this work now is the C++
>> implementation
>> > >> has
>> > >> > >>> added
>> > >> > >>> >>> >> >>> support
>> > >> > >>> >>> >> >>> for LargeList, LargeBinary and LargeString arrays
>> and
>> > >> based
>> > >> > on
>> > >> > >>> >>> prior
>> > >> > >>> >>> >> >>> discussions we need to have similar support in Java
>> > >> before
>> > >> > our
>> > >> > >>> >>> next
>> > >> > >>> >>> >> >>> release. Support 64-bit indexes means we can have
>> full
>> > >> > >>> >>> compatibility
>> > >> > >>> >>> >> and
>> > >> > >>> >>> >> >>> make the most use of the types in Java.
>> > >> > >>> >>> >> >>>
>> > >> > >>> >>> >> >>> Look forward to hearing feedback.
>> > >> > >>> >>> >> >>>
>> > >> > >>> >>> >> >>> Thanks,
>> > >> > >>> >>> >> >>> Micah
>> > >> > >>> >>> >> >>>
>> > >> > >>> >>> >> >>> [1] https://github.com/apache/arrow/pull/5020
>> > >> > >>> >>> >> >>>
>> > >> > >>> >>> >> >>
>> > >> > >>> >>> >>
>> > >> > >>> >>> >
>> > >> > >>> >>>
>> > >> > >>> >>
>> > >> > >>>
>> > >> > >>
>> > >> >
>> > >>
>> > >
>>
>

Re: [Discuss][Java] 64-bit lengths for ValueVectors

Posted by Micah Kornfield <em...@gmail.com>.

>
>
> With regards to the reference implementation point. It is a good point.
> I'm on vacation this week. Unless you're pushing hard on this, can we pick
> this up and discuss more next week?


Hi Jacques, I hope you had a good rest.  Any more thoughts on the reference
implementation aspect of this?


> To copy the sentiments from the 0.15.0 release thread, I think it
> would be best to decouple this discussion from the release timeline
> given how many people we have relying on regular releases coming out.
> We can keep continue making major 0.x releases until we're ready to
> release 1.0.0.


I'm OK with it as long as other stakeholders are. Timed releases are the
way to go.  As stated on the release thread [1] we need a better mechanism
to avoid this type of issue arising again.  The release thread also had
some more discussion on compatibility.

Thanks,
Micah

[1]
https://lists.apache.org/thread.html/d70feeceaf2570906ade117030b29887af7c77ca5c4a976e6d555920@%3Cdev.arrow.apache.org%3E


On Wed, Aug 14, 2019 at 3:23 PM Wes McKinney <we...@gmail.com> wrote:

> On Sun, Aug 11, 2019 at 9:40 PM Micah Kornfield <em...@gmail.com>
> wrote:
> >
> > Hi Wes and Jacques,
> > See responses below.
> >
> > With regards to the reference implementation point. It is a good point.
> I'm
> > > on vacation this week. Unless you're pushing hard on this, can we pick
> this
> > > up and discuss more next week?
> >
> >
> > Sure thing, enjoy your vacation.  I think the only practical implications
> > are it delays choices around implementing LargeList, LargeBinary,
> > LargeString in Java, which in turn might push out the 0.15.0 release.
> >
>
> To copy the sentiments from the 0.15.0 release thread, I think it
> would be best to decouple this discussion from the release timeline
> given how many people we have relying on regular releases coming out.
> We can keep continue making major 0.x releases until we're ready to
> release 1.0.0.
>
> > My stance on this is that I don't know how important it is for Java to
> > > support vectors over INT32_MAX elements. The use cases enabled by
> > > having very large arrays seem to be concentrated in the native code
> > > world (e.g. C/C++/Rust) -- that could just be implementation-centrism
> > > on my part, though.
> >
> >
> > A data point against this view is Spark has done work to eliminate 2GB
> > memory limits on its block sizes [1].  I don't claim to understand the
> > implications of this. Bryan might you have any thoughts here?  I'm OK
> with
> > INT32_MAX, as well, I think we should think about what this means for
> > adding Large types to Java and implications for reference
> implementations.
> >
> > Thanks,
> > Micah
> >
> > [1] https://issues.apache.org/jira/browse/SPARK-6235
> >
> > On Sun, Aug 11, 2019 at 6:31 PM Jacques Nadeau <ja...@apache.org>
> wrote:
> >
> > > Hey Micah,
> > >
> > > Appreciate the offer on the compiling. The reality is I'm more
> concerned
> > > about the unknowns than the compiling issue itself. Any time you've
> been
> > > tuning for a while, changing something like this could be totally fine
> or
> > > cause a couple of major issues. For example, we've done a very large
> amount
> > > of work reducing heap memory footprint of the vectors. Are target is to
> > > actually get it down to 24 bytes per ArrowBuf and 24 bytes heap per
> vector
> > > (not including arrow bufs).
> > >
> > > With regards to the reference implementation point. It is a good point.
> > > I'm on vacation this week. Unless you're pushing hard on this, can we
> pick
> > > this up and discuss more next week?
> > >
> > > thanks,
> > > Jacques
> > >
> > > On Sat, Aug 10, 2019 at 7:39 PM Micah Kornfield <emkornfield@gmail.com
> >
> > > wrote:
> > >
> > >> Hi Jacques,
> > >> I definitely understand these concerns and this change is risky
> because it
> > >> is so large.  Perhaps, creating a new hierarchy, might be the
> cleanest way
> > >> of dealing with this.  This could have other benefits like cleaning up
> > >> some
> > >> cruft around dictionary encode and "orphaned" method.   Per past
> e-mail
> > >> threads I agree it is beneficial to have 2 separate reference
> > >> implementations that can communicate fully, and my intent here was to
> > >> close
> > >> that gap.
> > >>
> > >> Trying to
> > >> > determine the ramifications of these changes would be challenging
> and
> > >> time
> > >> > consuming against all the different ways we interact with the Arrow
> Java
> > >> > library.
> > >>
> > >>
> > >> Understood.  I took a quick look at Dremio-OSS it seems like it has a
> > >> simple java build system?  If it is helpful, I can try to get a fork
> > >> running that at least compiles against this PR.  My plan would be to
> cast
> > >> any place that was changed to return a long back to an int, so in
> essence
> > >> the Dremio algorithms would reman 32-bit implementations.
> > >>
> > >> I don't  have the infrastructure to test this change properly from a
> > >> distributed systems perspective, so it would still take some time from
> > >> Dremio to validate for regressions.
> > >>
> > >> I'm not saying I'm against this but want to make sure we've
> > >> > explored all less disruptive options before considering changing
> > >> something
> > >> > this fundamental (especially when I generally hold the view that
> large
> > >> cell
> > >> > counts against massive contiguous memory is an anti pattern to
> scalable
> > >> > analytical processing--purely subjective of course).
> > >>
> > >>
> > >> I'm open to other ideas here, as well. I don't think it is out of the
> > >> question to leave the Java implementation as 32-bit, but if we do,
> then I
> > >> think we should consider a different strategy for reference
> > >> implementations.
> > >>
> > >> Thanks,
> > >> Micah
> > >>
> > >> On Sat, Aug 10, 2019 at 5:09 PM Jacques Nadeau <ja...@apache.org>
> > >> wrote:
> > >>
> > >> > Hey Micah, I didn't have a particular path in mind. Was thinking
> more
> > >> along
> > >> > the lines of extra methods as opposed to separate classes.
> > >> >
> > >> > Arrow hasn't historically been a place where we're writing
> algorithms in
> > >> > Java so the fact that they aren't there doesn't mean they don't
> exist.
> > >> We
> > >> > have a large amount of code that depends on the current behavior
> that is
> > >> > deployed in hundreds of customer clusters (you can peruse our dremio
> > >> repo
> > >> > to see how extensively we leverage Arrow if interested). Trying to
> > >> > determine the ramifications of these changes would be challenging
> and
> > >> time
> > >> > consuming against all the different ways we interact with the Arrow
> Java
> > >> > library. I'm not saying I'm against this but want to make sure we've
> > >> > explored all less disruptive options before considering changing
> > >> something
> > >> > this fundamental (especially when I generally hold the view that
> large
> > >> cell
> > >> > counts against massive contiguous memory is an anti pattern to
> scalable
> > >> > analytical processing--purely subjective of course).
> > >> >
> > >> > On Sat, Aug 10, 2019, 4:17 PM Micah Kornfield <
> emkornfield@gmail.com>
> > >> > wrote:
> > >> >
> > >> > > Hi Jacques,
> > >> > > What avenue were you thinking for supporting both paths?   I
> didn't
> > >> want
> > >> > > to pursue a different class hierarchy, because I felt like that
> would
> > >> > > effectively fork the code base, but that is potentially an option
> that
> > >> > > would allow us to have a complete reference implementation in Java
> > >> that
> > >> > can
> > >> > > fully interact with C++, without major changes to this code.
> > >> > >
> > >> > > For supporting both APIs on the same classes/interfaces, I think
> they
> > >> > > roughly fall into three categories, changes to input parameters,
> > >> changes
> > >> > to
> > >> > > output parameters and algorithm changes.
> > >> > >
> > >> > > For inputs, changing from int to long is essentially a no-op from
> the
> > >> > > compiler perspective.  From the limited micro-benchmarking this
> also
> > >> > > doesn't seem to have a performance impact.  So we could keep two
> > >> versions
> > >> > > of the methods that only differ on inputs, but it is not clear
> what
> > >> the
> > >> > > value of that would be.
> > >> > >
> > >> > > For outputs, we can't support methods "long getLength()" and "int
> > >> > > getLength()" in the same class, so we would be forced into
> something
> > >> like
> > >> > > "long getLength(boolean dummy)" which I think is a less desirable.
> > >> > >
> > >> > > For algorithm changes, there did not appear to be too many places
> > >> where
> > >> > we
> > >> > > actually loop over all elements (it is quite possible I missed
> > >> something
> > >> > > here), the ones that I did find I was able to mitigate performance
> > >> > > penalties as noted above.  Some of the current implementation will
> > >> get a
> > >> > > lot slower for "large arrays", but we can likely fix those later
> or in
> > >> > this
> > >> > > PR with a nested while loop instead of 2 for loops.
> > >> > >
> > >> > > Thanks,
> > >> > > Micah
> > >> > >
> > >> > >
> > >> > > On Saturday, August 10, 2019, Jacques Nadeau <ja...@apache.org>
> > >> wrote:
> > >> > >
> > >> > >> This is a pretty massive change to the apis. I wonder how nasty
> it
> > >> would
> > >> > >> be to just support both paths. Have you evaluated how complex
> that
> > >> > would be?
> > >> > >>
> > >> > >> On Wed, Aug 7, 2019 at 11:08 PM Micah Kornfield <
> > >> emkornfield@gmail.com>
> > >> > >> wrote:
> > >> > >>
> > >> > >>> After more investigation, it looks like Float8Benchmarks at
> least
> > >> on my
> > >> > >>> machine are within the range of noise.
> > >> > >>>
> > >> > >>> For BitVectorHelper I pushed a new commit [1], seems to bring
> the
> > >> > >>> BitVectorHelper benchmarks back inline (and even with some
> > >> improvement
> > >> > >>> for
> > >> > >>> getNullCountBenchmark).
> > >> > >>>
> > >> > >>> Benchmark                                        Mode  Cnt
>  Score
> > >> > >>>  Error
> > >> > >>>  Units
> > >> > >>> BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5
>  3.821 ±
> > >> > >>> 0.031
> > >> > >>>  ns/op
> > >> > >>> BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5
> 14.884 ±
> > >> > >>> 0.141
> > >> > >>>  ns/op
> > >> > >>>
> > >> > >>> I applied the same pattern to other loops that I could find,
> and for
> > >> > any
> > >> > >>> "for (long" loop on the critical path, I broke it up into two
> loops.
> > >> > the
> > >> > >>> first loop does iteration by integer, the second finishes off
> for
> > >> any
> > >> > >>> long
> > >> > >>> values.  As a side note it seems like optimization for loops
> using
> > >> long
> > >> > >>> counters at least have a semi-recent open bug for the JVM [2]
> > >> > >>>
> > >> > >>> Thanks,
> > >> > >>> Micah
> > >> > >>>
> > >> > >>> [1]
> > >> > >>>
> > >> > >>>
> > >> >
> > >>
> https://github.com/apache/arrow/pull/5020/commits/2ea2c1ae83e3baa7b9a99a6d06276d968df41797
> > >> > >>> [2] https://bugs.openjdk.java.net/browse/JDK-8223051
> > >> > >>>
> > >> > >>> On Wed, Aug 7, 2019 at 8:11 PM Micah Kornfield <
> > >> emkornfield@gmail.com>
> > >> > >>> wrote:
> > >> > >>>
> > >> > >>> > Indeed, the BoundChecking and CheckNullForGet variables can
> make a
> > >> > big
> > >> > >>> > difference.  I didn't initially run the benchmarks with these
> > >> turned
> > >> > on
> > >> > >>> > (you can see the result from above with Float8Benchmarks).
> Here
> > >> are
> > >> > >>> new
> > >> > >>> > numbers including with the flags enabled.  It looks like using
> > >> longs
> > >> > >>> might
> > >> > >>> > be a little bit slower, I'll see what I can do to mitigate
> this.
> > >> > >>> >
> > >> > >>> > Ravindra also volunteered to try to benchmark the changes with
> > >> > Dremio's
> > >> > >>> > code on today's sync call.
> > >> > >>> >
> > >> > >>> > New
> > >> > >>> >
> > >> > >>> > Benchmark                                        Mode  Cnt
>  Score
> > >> > >>>  Error
> > >> > >>> > Units
> > >> > >>> >
> > >> > >>> > BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5
> > >>  4.176 ±
> > >> > >>> 1.292
> > >> > >>> > ns/op
> > >> > >>> >
> > >> > >>> > BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5
> > >> 26.102 ±
> > >> > >>> 0.700
> > >> > >>> > ns/op
> > >> > >>> >
> > >> > >>> > Float8Benchmarks.copyFromBenchmark   avgt    5  7.398 ± 0.084
> > >> us/op
> > >> > >>> >
> > >> > >>> > Float8Benchmarks.readWriteBenchmark  avgt    5  2.711 ± 0.057
> > >> us/op
> > >> > >>> >
> > >> > >>> >
> > >> > >>> >
> > >> > >>> > old
> > >> > >>> >
> > >> > >>> > BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5
> > >>  3.828 ±
> > >> > >>> 0.030
> > >> > >>> > ns/op
> > >> > >>> >
> > >> > >>> > BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5
> > >> 20.611 ±
> > >> > >>> 0.188
> > >> > >>> > ns/op
> > >> > >>> >
> > >> > >>> > Float8Benchmarks.copyFromBenchmark   avgt    5  6.597 ± 0.462
> > >> us/op
> > >> > >>> >
> > >> > >>> > Float8Benchmarks.readWriteBenchmark  avgt    5  2.615 ± 0.027
> > >> us/op
> > >> > >>> >
> > >> > >>> > On Wed, Aug 7, 2019 at 7:13 PM Fan Liya <liya.fan03@gmail.com
> >
> > >> > wrote:
> > >> > >>> >
> > >> > >>> >> Hi Gonzalo,
> > >> > >>> >>
> > >> > >>> >> Thanks for sharing the performance results.
> > >> > >>> >> I am wondering if you have turned off the flag
> > >> > >>> >> BoundsChecking#BOUNDS_CHECKING_ENABLED.
> > >> > >>> >> If not, the lower throughput should be expected.
> > >> > >>> >>
> > >> > >>> >> Best,
> > >> > >>> >> Liya Fan
> > >> > >>> >>
> > >> > >>> >> On Wed, Aug 7, 2019 at 10:23 PM Micah Kornfield <
> > >> > >>> emkornfield@gmail.com>
> > >> > >>> >> wrote:
> > >> > >>> >>
> > >> > >>> >>> Hi Gonzalo,
> > >> > >>> >>> Thank you for the feedback.  I wasn't aware of the JIT
> > >> > >>> implications.   At
> > >> > >>> >>> least on the benchmark run they don't seem to have an
> impact.
> > >> > >>> >>>
> > >> > >>> >>> If there are other benchmarks that people have that can
> > >> validate if
> > >> > >>> this
> > >> > >>> >>> change will be problematic I would appreciate trying to run
> them
> > >> > >>> with the
> > >> > >>> >>> PR.  I will try to run the ones for zeroing/popcnt tonight
> to
> > >> see
> > >> > if
> > >> > >>> >>> there
> > >> > >>> >>> is a change in those.
> > >> > >>> >>>
> > >> > >>> >>> -Micah
> > >> > >>> >>>
> > >> > >>> >>>
> > >> > >>> >>>
> > >> > >>> >>> On Wednesday, August 7, 2019, Gonzalo Ortiz Jaureguizar <
> > >> > >>> >>> golthiryus@gmail.com> wrote:
> > >> > >>> >>>
> > >> > >>> >>> > I would recommend to take care with this kind of changes.
> > >> > >>> >>> >
> > >> > >>> >>> > I didn't try Arrow in more than one year, but by then the
> > >> > >>> performance
> > >> > >>> >>> was
> > >> > >>> >>> > quite bad in comparison with plain byte buffer access
> > >> > >>> >>> > (see
> http://git.net/apache-arrow-development/msg02353.html *)
> > >> > and
> > >> > >>> >>> > there are several optimizations that the JVM
> (specifically,
> > >> C2)
> > >> > >>> does
> > >> > >>> >>> not
> > >> > >>> >>> > apply when dealing with int instead of longs. One of the
> > >> > >>> >>> > most commons is the loop unrolling and vectorization.
> > >> > >>> >>> >
> > >> > >>> >>> > * It doesn't seem the best way to reference an old email
> on
> > >> the
> > >> > >>> list,
> > >> > >>> >>> but
> > >> > >>> >>> > it is the only result shown by Google
> > >> > >>> >>> >
> > >> > >>> >>> > El mié., 7 ago. 2019 a las 11:42, Fan Liya (<
> > >> > liya.fan03@gmail.com
> > >> > >>> >)
> > >> > >>> >>> > escribió:
> > >> > >>> >>> >
> > >> > >>> >>> >> Hi Micah,
> > >> > >>> >>> >>
> > >> > >>> >>> >> Thanks for your effort. The performance result looks
> good.
> > >> > >>> >>> >>
> > >> > >>> >>> >> As you indicated, ArrowBuf will take additional 12 bytes
> (4
> > >> > bytes
> > >> > >>> for
> > >> > >>> >>> each
> > >> > >>> >>> >> of length, write index, and read index).
> > >> > >>> >>> >> Similar overheads also exist for vectors like
> > >> > >>> BaseFixedWidthVector,
> > >> > >>> >>> >> BaseVariableWidthVector, etc.
> > >> > >>> >>> >>
> > >> > >>> >>> >> IMO, such overheads are small enough to justify the
> change.
> > >> > >>> >>> >> Let's check if there are other overheads.
> > >> > >>> >>> >>
> > >> > >>> >>> >> Best,
> > >> > >>> >>> >> Liya Fan
> > >> > >>> >>> >>
> > >> > >>> >>> >> On Wed, Aug 7, 2019 at 3:30 PM Micah Kornfield <
> > >> > >>> emkornfield@gmail.com
> > >> > >>> >>> >
> > >> > >>> >>> >> wrote:
> > >> > >>> >>> >>
> > >> > >>> >>> >> > Hi Liya Fan,
> > >> > >>> >>> >> > Based on the Float8Benchmark there does not seem to be
> any
> > >> > >>> >>> meaningful
> > >> > >>> >>> >> > performance difference on my machine.  At least for
> me, the
> > >> > >>> >>> benchmarks
> > >> > >>> >>> >> are
> > >> > >>> >>> >> > not stable enough to say one is faster than the other
> (I've
> > >> > >>> pasted
> > >> > >>> >>> >> results
> > >> > >>> >>> >> > below).  That being said my machine isn't necessarily
> the
> > >> most
> > >> > >>> >>> reliable
> > >> > >>> >>> >> for
> > >> > >>> >>> >> > benchmarking.
> > >> > >>> >>> >> >
> > >> > >>> >>> >> > On an intuitive level, this makes sense to me,  for the
> > >> most
> > >> > >>> part it
> > >> > >>> >>> >> seems
> > >> > >>> >>> >> > like the change just moves casting from "int" to "long"
> > >> > further
> > >> > >>> up
> > >> > >>> >>> the
> > >> > >>> >>> >> > stack  for  "PlatformDepdendent" operations.  If there
> are
> > >> > other
> > >> > >>> >>> >> benchmarks
> > >> > >>> >>> >> > that you think are worth running let me know.
> > >> > >>> >>> >> >
> > >> > >>> >>> >> > One downside performance wise I think for his change
> is it
> > >> > >>> >>> increases the
> > >> > >>> >>> >> > size of ArrowBuf objects, which I suppose could
> influence
> > >> > cache
> > >> > >>> >>> misses
> > >> > >>> >>> >> at
> > >> > >>> >>> >> > some level or increase the size of call-stacks, but
> this
> > >> > doesn't
> > >> > >>> >>> seem to
> > >> > >>> >>> >> > show up in the benchmark..
> > >> > >>> >>> >> >
> > >> > >>> >>> >> > Thanks,
> > >> > >>> >>> >> > Micah
> > >> > >>> >>> >> >
> > >> > >>> >>> >> > Sample benchmark numbers:
> > >> > >>> >>> >> >
> > >> > >>> >>> >> > [New Code]
> > >> > >>> >>> >> > Benchmark                            Mode  Cnt   Score
> > >>  Error
> > >> > >>> >>> Units
> > >> > >>> >>> >> > Float8Benchmarks.copyFromBenchmark   avgt    5  15.441
> ±
> > >> 0.469
> > >> > >>> >>> us/op
> > >> > >>> >>> >> > Float8Benchmarks.readWriteBenchmark  avgt    5  14.057
> ±
> > >> 0.115
> > >> > >>> >>> us/op
> > >> > >>> >>> >> >
> > >> > >>> >>> >> >
> > >> > >>> >>> >> > [Old code]
> > >> > >>> >>> >> > Benchmark                            Mode  Cnt   Score
> > >>  Error
> > >> > >>> >>> Units
> > >> > >>> >>> >> > Float8Benchmarks.copyFromBenchmark   avgt    5  16.248
> ±
> > >> 1.409
> > >> > >>> >>> us/op
> > >> > >>> >>> >> > Float8Benchmarks.readWriteBenchmark  avgt    5  14.150
> ±
> > >> 0.084
> > >> > >>> >>> us/op
> > >> > >>> >>> >> >
> > >> > >>> >>> >> > On Tue, Aug 6, 2019 at 1:18 AM Fan Liya <
> > >> liya.fan03@gmail.com
> > >> > >
> > >> > >>> >>> wrote:
> > >> > >>> >>> >> >
> > >> > >>> >>> >> >> Hi Micah,
> > >> > >>> >>> >> >>
> > >> > >>> >>> >> >> Thanks a lot for doing this.
> > >> > >>> >>> >> >>
> > >> > >>> >>> >> >> I am a little concerned about if there is any negative
> > >> > >>> performance
> > >> > >>> >>> >> impact
> > >> > >>> >>> >> >> on the current 32-bit-length based applications.
> > >> > >>> >>> >> >> Can we do some performance comparison on our existing
> > >> > >>> benchmarks?
> > >> > >>> >>> >> >>
> > >> > >>> >>> >> >> Best,
> > >> > >>> >>> >> >> Liya Fan
> > >> > >>> >>> >> >>
> > >> > >>> >>> >> >>
> > >> > >>> >>> >> >> On Tue, Aug 6, 2019 at 3:35 PM Micah Kornfield <
> > >> > >>> >>> emkornfield@gmail.com>
> > >> > >>> >>> >> >> wrote:
> > >> > >>> >>> >> >>
> > >> > >>> >>> >> >>> There have been some previous discussions on the
> mailing
> > >> > about
> > >> > >>> >>> >> supporting
> > >> > >>> >>> >> >>> 64-bit lengths for  Java ValueVectors (this is what
> the
> > >> IPC
> > >> > >>> >>> >> specification
> > >> > >>> >>> >> >>> and C++ support).  I created a PR [1] that changes
> all
> > >> APIs
> > >> > >>> that I
> > >> > >>> >>> >> could
> > >> > >>> >>> >> >>> find that take an index to take an "long" instead of
> an
> > >> > "int"
> > >> > >>> (and
> > >> > >>> >>> >> >>> similarly change "size/rowcount" APIs).
> > >> > >>> >>> >> >>>
> > >> > >>> >>> >> >>> It is a big change, so I think it is worth
> discussing if
> > >> it
> > >> > is
> > >> > >>> >>> >> something
> > >> > >>> >>> >> >>> we
> > >> > >>> >>> >> >>> still want to move forward with.  It would be nice to
> > >> come
> > >> > to
> > >> > >>> a
> > >> > >>> >>> >> >>> conclusion
> > >> > >>> >>> >> >>> quickly, ideally in the next few days, to avoid a
> lot of
> > >> > merge
> > >> > >>> >>> >> conflicts.
> > >> > >>> >>> >> >>>
> > >> > >>> >>> >> >>> The reason I did this work now is the C++
> implementation
> > >> has
> > >> > >>> added
> > >> > >>> >>> >> >>> support
> > >> > >>> >>> >> >>> for LargeList, LargeBinary and LargeString arrays and
> > >> based
> > >> > on
> > >> > >>> >>> prior
> > >> > >>> >>> >> >>> discussions we need to have similar support in Java
> > >> before
> > >> > our
> > >> > >>> >>> next
> > >> > >>> >>> >> >>> release. Support 64-bit indexes means we can have
> full
> > >> > >>> >>> compatibility
> > >> > >>> >>> >> and
> > >> > >>> >>> >> >>> make the most use of the types in Java.
> > >> > >>> >>> >> >>>
> > >> > >>> >>> >> >>> Look forward to hearing feedback.
> > >> > >>> >>> >> >>>
> > >> > >>> >>> >> >>> Thanks,
> > >> > >>> >>> >> >>> Micah
> > >> > >>> >>> >> >>>
> > >> > >>> >>> >> >>> [1] https://github.com/apache/arrow/pull/5020
> > >> > >>> >>> >> >>>
> > >> > >>> >>> >> >>
> > >> > >>> >>> >>
> > >> > >>> >>> >
> > >> > >>> >>>
> > >> > >>> >>
> > >> > >>>
> > >> > >>
> > >> >
> > >>
> > >
>

Re: [Discuss][Java] 64-bit lengths for ValueVectors

Posted by Wes McKinney <we...@gmail.com>.

On Sun, Aug 11, 2019 at 9:40 PM Micah Kornfield <em...@gmail.com> wrote:
>
> Hi Wes and Jacques,
> See responses below.
>
> With regards to the reference implementation point. It is a good point. I'm
> > on vacation this week. Unless you're pushing hard on this, can we pick this
> > up and discuss more next week?
>
>
> Sure thing, enjoy your vacation.  I think the only practical implications
> are it delays choices around implementing LargeList, LargeBinary,
> LargeString in Java, which in turn might push out the 0.15.0 release.
>

To copy the sentiments from the 0.15.0 release thread, I think it
would be best to decouple this discussion from the release timeline
given how many people we have relying on regular releases coming out.
We can keep continue making major 0.x releases until we're ready to
release 1.0.0.

> My stance on this is that I don't know how important it is for Java to
> > support vectors over INT32_MAX elements. The use cases enabled by
> > having very large arrays seem to be concentrated in the native code
> > world (e.g. C/C++/Rust) -- that could just be implementation-centrism
> > on my part, though.
>
>
> A data point against this view is Spark has done work to eliminate 2GB
> memory limits on its block sizes [1].  I don't claim to understand the
> implications of this. Bryan might you have any thoughts here?  I'm OK with
> INT32_MAX, as well, I think we should think about what this means for
> adding Large types to Java and implications for reference implementations.
>
> Thanks,
> Micah
>
> [1] https://issues.apache.org/jira/browse/SPARK-6235
>
> On Sun, Aug 11, 2019 at 6:31 PM Jacques Nadeau <ja...@apache.org> wrote:
>
> > Hey Micah,
> >
> > Appreciate the offer on the compiling. The reality is I'm more concerned
> > about the unknowns than the compiling issue itself. Any time you've been
> > tuning for a while, changing something like this could be totally fine or
> > cause a couple of major issues. For example, we've done a very large amount
> > of work reducing heap memory footprint of the vectors. Are target is to
> > actually get it down to 24 bytes per ArrowBuf and 24 bytes heap per vector
> > (not including arrow bufs).
> >
> > With regards to the reference implementation point. It is a good point.
> > I'm on vacation this week. Unless you're pushing hard on this, can we pick
> > this up and discuss more next week?
> >
> > thanks,
> > Jacques
> >
> > On Sat, Aug 10, 2019 at 7:39 PM Micah Kornfield <em...@gmail.com>
> > wrote:
> >
> >> Hi Jacques,
> >> I definitely understand these concerns and this change is risky because it
> >> is so large.  Perhaps, creating a new hierarchy, might be the cleanest way
> >> of dealing with this.  This could have other benefits like cleaning up
> >> some
> >> cruft around dictionary encode and "orphaned" method.   Per past e-mail
> >> threads I agree it is beneficial to have 2 separate reference
> >> implementations that can communicate fully, and my intent here was to
> >> close
> >> that gap.
> >>
> >> Trying to
> >> > determine the ramifications of these changes would be challenging and
> >> time
> >> > consuming against all the different ways we interact with the Arrow Java
> >> > library.
> >>
> >>
> >> Understood.  I took a quick look at Dremio-OSS it seems like it has a
> >> simple java build system?  If it is helpful, I can try to get a fork
> >> running that at least compiles against this PR.  My plan would be to cast
> >> any place that was changed to return a long back to an int, so in essence
> >> the Dremio algorithms would reman 32-bit implementations.
> >>
> >> I don't  have the infrastructure to test this change properly from a
> >> distributed systems perspective, so it would still take some time from
> >> Dremio to validate for regressions.
> >>
> >> I'm not saying I'm against this but want to make sure we've
> >> > explored all less disruptive options before considering changing
> >> something
> >> > this fundamental (especially when I generally hold the view that large
> >> cell
> >> > counts against massive contiguous memory is an anti pattern to scalable
> >> > analytical processing--purely subjective of course).
> >>
> >>
> >> I'm open to other ideas here, as well. I don't think it is out of the
> >> question to leave the Java implementation as 32-bit, but if we do, then I
> >> think we should consider a different strategy for reference
> >> implementations.
> >>
> >> Thanks,
> >> Micah
> >>
> >> On Sat, Aug 10, 2019 at 5:09 PM Jacques Nadeau <ja...@apache.org>
> >> wrote:
> >>
> >> > Hey Micah, I didn't have a particular path in mind. Was thinking more
> >> along
> >> > the lines of extra methods as opposed to separate classes.
> >> >
> >> > Arrow hasn't historically been a place where we're writing algorithms in
> >> > Java so the fact that they aren't there doesn't mean they don't exist.
> >> We
> >> > have a large amount of code that depends on the current behavior that is
> >> > deployed in hundreds of customer clusters (you can peruse our dremio
> >> repo
> >> > to see how extensively we leverage Arrow if interested). Trying to
> >> > determine the ramifications of these changes would be challenging and
> >> time
> >> > consuming against all the different ways we interact with the Arrow Java
> >> > library. I'm not saying I'm against this but want to make sure we've
> >> > explored all less disruptive options before considering changing
> >> something
> >> > this fundamental (especially when I generally hold the view that large
> >> cell
> >> > counts against massive contiguous memory is an anti pattern to scalable
> >> > analytical processing--purely subjective of course).
> >> >
> >> > On Sat, Aug 10, 2019, 4:17 PM Micah Kornfield <em...@gmail.com>
> >> > wrote:
> >> >
> >> > > Hi Jacques,
> >> > > What avenue were you thinking for supporting both paths?   I didn't
> >> want
> >> > > to pursue a different class hierarchy, because I felt like that would
> >> > > effectively fork the code base, but that is potentially an option that
> >> > > would allow us to have a complete reference implementation in Java
> >> that
> >> > can
> >> > > fully interact with C++, without major changes to this code.
> >> > >
> >> > > For supporting both APIs on the same classes/interfaces, I think they
> >> > > roughly fall into three categories, changes to input parameters,
> >> changes
> >> > to
> >> > > output parameters and algorithm changes.
> >> > >
> >> > > For inputs, changing from int to long is essentially a no-op from the
> >> > > compiler perspective.  From the limited micro-benchmarking this also
> >> > > doesn't seem to have a performance impact.  So we could keep two
> >> versions
> >> > > of the methods that only differ on inputs, but it is not clear what
> >> the
> >> > > value of that would be.
> >> > >
> >> > > For outputs, we can't support methods "long getLength()" and "int
> >> > > getLength()" in the same class, so we would be forced into something
> >> like
> >> > > "long getLength(boolean dummy)" which I think is a less desirable.
> >> > >
> >> > > For algorithm changes, there did not appear to be too many places
> >> where
> >> > we
> >> > > actually loop over all elements (it is quite possible I missed
> >> something
> >> > > here), the ones that I did find I was able to mitigate performance
> >> > > penalties as noted above.  Some of the current implementation will
> >> get a
> >> > > lot slower for "large arrays", but we can likely fix those later or in
> >> > this
> >> > > PR with a nested while loop instead of 2 for loops.
> >> > >
> >> > > Thanks,
> >> > > Micah
> >> > >
> >> > >
> >> > > On Saturday, August 10, 2019, Jacques Nadeau <ja...@apache.org>
> >> wrote:
> >> > >
> >> > >> This is a pretty massive change to the apis. I wonder how nasty it
> >> would
> >> > >> be to just support both paths. Have you evaluated how complex that
> >> > would be?
> >> > >>
> >> > >> On Wed, Aug 7, 2019 at 11:08 PM Micah Kornfield <
> >> emkornfield@gmail.com>
> >> > >> wrote:
> >> > >>
> >> > >>> After more investigation, it looks like Float8Benchmarks at least
> >> on my
> >> > >>> machine are within the range of noise.
> >> > >>>
> >> > >>> For BitVectorHelper I pushed a new commit [1], seems to bring the
> >> > >>> BitVectorHelper benchmarks back inline (and even with some
> >> improvement
> >> > >>> for
> >> > >>> getNullCountBenchmark).
> >> > >>>
> >> > >>> Benchmark                                        Mode  Cnt   Score
> >> > >>>  Error
> >> > >>>  Units
> >> > >>> BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5   3.821 ±
> >> > >>> 0.031
> >> > >>>  ns/op
> >> > >>> BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5  14.884 ±
> >> > >>> 0.141
> >> > >>>  ns/op
> >> > >>>
> >> > >>> I applied the same pattern to other loops that I could find, and for
> >> > any
> >> > >>> "for (long" loop on the critical path, I broke it up into two loops.
> >> > the
> >> > >>> first loop does iteration by integer, the second finishes off for
> >> any
> >> > >>> long
> >> > >>> values.  As a side note it seems like optimization for loops using
> >> long
> >> > >>> counters at least have a semi-recent open bug for the JVM [2]
> >> > >>>
> >> > >>> Thanks,
> >> > >>> Micah
> >> > >>>
> >> > >>> [1]
> >> > >>>
> >> > >>>
> >> >
> >> https://github.com/apache/arrow/pull/5020/commits/2ea2c1ae83e3baa7b9a99a6d06276d968df41797
> >> > >>> [2] https://bugs.openjdk.java.net/browse/JDK-8223051
> >> > >>>
> >> > >>> On Wed, Aug 7, 2019 at 8:11 PM Micah Kornfield <
> >> emkornfield@gmail.com>
> >> > >>> wrote:
> >> > >>>
> >> > >>> > Indeed, the BoundChecking and CheckNullForGet variables can make a
> >> > big
> >> > >>> > difference.  I didn't initially run the benchmarks with these
> >> turned
> >> > on
> >> > >>> > (you can see the result from above with Float8Benchmarks).  Here
> >> are
> >> > >>> new
> >> > >>> > numbers including with the flags enabled.  It looks like using
> >> longs
> >> > >>> might
> >> > >>> > be a little bit slower, I'll see what I can do to mitigate this.
> >> > >>> >
> >> > >>> > Ravindra also volunteered to try to benchmark the changes with
> >> > Dremio's
> >> > >>> > code on today's sync call.
> >> > >>> >
> >> > >>> > New
> >> > >>> >
> >> > >>> > Benchmark                                        Mode  Cnt   Score
> >> > >>>  Error
> >> > >>> > Units
> >> > >>> >
> >> > >>> > BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5
> >>  4.176 ±
> >> > >>> 1.292
> >> > >>> > ns/op
> >> > >>> >
> >> > >>> > BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5
> >> 26.102 ±
> >> > >>> 0.700
> >> > >>> > ns/op
> >> > >>> >
> >> > >>> > Float8Benchmarks.copyFromBenchmark   avgt    5  7.398 ± 0.084
> >> us/op
> >> > >>> >
> >> > >>> > Float8Benchmarks.readWriteBenchmark  avgt    5  2.711 ± 0.057
> >> us/op
> >> > >>> >
> >> > >>> >
> >> > >>> >
> >> > >>> > old
> >> > >>> >
> >> > >>> > BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5
> >>  3.828 ±
> >> > >>> 0.030
> >> > >>> > ns/op
> >> > >>> >
> >> > >>> > BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5
> >> 20.611 ±
> >> > >>> 0.188
> >> > >>> > ns/op
> >> > >>> >
> >> > >>> > Float8Benchmarks.copyFromBenchmark   avgt    5  6.597 ± 0.462
> >> us/op
> >> > >>> >
> >> > >>> > Float8Benchmarks.readWriteBenchmark  avgt    5  2.615 ± 0.027
> >> us/op
> >> > >>> >
> >> > >>> > On Wed, Aug 7, 2019 at 7:13 PM Fan Liya <li...@gmail.com>
> >> > wrote:
> >> > >>> >
> >> > >>> >> Hi Gonzalo,
> >> > >>> >>
> >> > >>> >> Thanks for sharing the performance results.
> >> > >>> >> I am wondering if you have turned off the flag
> >> > >>> >> BoundsChecking#BOUNDS_CHECKING_ENABLED.
> >> > >>> >> If not, the lower throughput should be expected.
> >> > >>> >>
> >> > >>> >> Best,
> >> > >>> >> Liya Fan
> >> > >>> >>
> >> > >>> >> On Wed, Aug 7, 2019 at 10:23 PM Micah Kornfield <
> >> > >>> emkornfield@gmail.com>
> >> > >>> >> wrote:
> >> > >>> >>
> >> > >>> >>> Hi Gonzalo,
> >> > >>> >>> Thank you for the feedback.  I wasn't aware of the JIT
> >> > >>> implications.   At
> >> > >>> >>> least on the benchmark run they don't seem to have an impact.
> >> > >>> >>>
> >> > >>> >>> If there are other benchmarks that people have that can
> >> validate if
> >> > >>> this
> >> > >>> >>> change will be problematic I would appreciate trying to run them
> >> > >>> with the
> >> > >>> >>> PR.  I will try to run the ones for zeroing/popcnt tonight to
> >> see
> >> > if
> >> > >>> >>> there
> >> > >>> >>> is a change in those.
> >> > >>> >>>
> >> > >>> >>> -Micah
> >> > >>> >>>
> >> > >>> >>>
> >> > >>> >>>
> >> > >>> >>> On Wednesday, August 7, 2019, Gonzalo Ortiz Jaureguizar <
> >> > >>> >>> golthiryus@gmail.com> wrote:
> >> > >>> >>>
> >> > >>> >>> > I would recommend to take care with this kind of changes.
> >> > >>> >>> >
> >> > >>> >>> > I didn't try Arrow in more than one year, but by then the
> >> > >>> performance
> >> > >>> >>> was
> >> > >>> >>> > quite bad in comparison with plain byte buffer access
> >> > >>> >>> > (see http://git.net/apache-arrow-development/msg02353.html *)
> >> > and
> >> > >>> >>> > there are several optimizations that the JVM (specifically,
> >> C2)
> >> > >>> does
> >> > >>> >>> not
> >> > >>> >>> > apply when dealing with int instead of longs. One of the
> >> > >>> >>> > most commons is the loop unrolling and vectorization.
> >> > >>> >>> >
> >> > >>> >>> > * It doesn't seem the best way to reference an old email on
> >> the
> >> > >>> list,
> >> > >>> >>> but
> >> > >>> >>> > it is the only result shown by Google
> >> > >>> >>> >
> >> > >>> >>> > El mié., 7 ago. 2019 a las 11:42, Fan Liya (<
> >> > liya.fan03@gmail.com
> >> > >>> >)
> >> > >>> >>> > escribió:
> >> > >>> >>> >
> >> > >>> >>> >> Hi Micah,
> >> > >>> >>> >>
> >> > >>> >>> >> Thanks for your effort. The performance result looks good.
> >> > >>> >>> >>
> >> > >>> >>> >> As you indicated, ArrowBuf will take additional 12 bytes (4
> >> > bytes
> >> > >>> for
> >> > >>> >>> each
> >> > >>> >>> >> of length, write index, and read index).
> >> > >>> >>> >> Similar overheads also exist for vectors like
> >> > >>> BaseFixedWidthVector,
> >> > >>> >>> >> BaseVariableWidthVector, etc.
> >> > >>> >>> >>
> >> > >>> >>> >> IMO, such overheads are small enough to justify the change.
> >> > >>> >>> >> Let's check if there are other overheads.
> >> > >>> >>> >>
> >> > >>> >>> >> Best,
> >> > >>> >>> >> Liya Fan
> >> > >>> >>> >>
> >> > >>> >>> >> On Wed, Aug 7, 2019 at 3:30 PM Micah Kornfield <
> >> > >>> emkornfield@gmail.com
> >> > >>> >>> >
> >> > >>> >>> >> wrote:
> >> > >>> >>> >>
> >> > >>> >>> >> > Hi Liya Fan,
> >> > >>> >>> >> > Based on the Float8Benchmark there does not seem to be any
> >> > >>> >>> meaningful
> >> > >>> >>> >> > performance difference on my machine.  At least for me, the
> >> > >>> >>> benchmarks
> >> > >>> >>> >> are
> >> > >>> >>> >> > not stable enough to say one is faster than the other (I've
> >> > >>> pasted
> >> > >>> >>> >> results
> >> > >>> >>> >> > below).  That being said my machine isn't necessarily the
> >> most
> >> > >>> >>> reliable
> >> > >>> >>> >> for
> >> > >>> >>> >> > benchmarking.
> >> > >>> >>> >> >
> >> > >>> >>> >> > On an intuitive level, this makes sense to me,  for the
> >> most
> >> > >>> part it
> >> > >>> >>> >> seems
> >> > >>> >>> >> > like the change just moves casting from "int" to "long"
> >> > further
> >> > >>> up
> >> > >>> >>> the
> >> > >>> >>> >> > stack  for  "PlatformDepdendent" operations.  If there are
> >> > other
> >> > >>> >>> >> benchmarks
> >> > >>> >>> >> > that you think are worth running let me know.
> >> > >>> >>> >> >
> >> > >>> >>> >> > One downside performance wise I think for his change is it
> >> > >>> >>> increases the
> >> > >>> >>> >> > size of ArrowBuf objects, which I suppose could influence
> >> > cache
> >> > >>> >>> misses
> >> > >>> >>> >> at
> >> > >>> >>> >> > some level or increase the size of call-stacks, but this
> >> > doesn't
> >> > >>> >>> seem to
> >> > >>> >>> >> > show up in the benchmark..
> >> > >>> >>> >> >
> >> > >>> >>> >> > Thanks,
> >> > >>> >>> >> > Micah
> >> > >>> >>> >> >
> >> > >>> >>> >> > Sample benchmark numbers:
> >> > >>> >>> >> >
> >> > >>> >>> >> > [New Code]
> >> > >>> >>> >> > Benchmark                            Mode  Cnt   Score
> >>  Error
> >> > >>> >>> Units
> >> > >>> >>> >> > Float8Benchmarks.copyFromBenchmark   avgt    5  15.441 ±
> >> 0.469
> >> > >>> >>> us/op
> >> > >>> >>> >> > Float8Benchmarks.readWriteBenchmark  avgt    5  14.057 ±
> >> 0.115
> >> > >>> >>> us/op
> >> > >>> >>> >> >
> >> > >>> >>> >> >
> >> > >>> >>> >> > [Old code]
> >> > >>> >>> >> > Benchmark                            Mode  Cnt   Score
> >>  Error
> >> > >>> >>> Units
> >> > >>> >>> >> > Float8Benchmarks.copyFromBenchmark   avgt    5  16.248 ±
> >> 1.409
> >> > >>> >>> us/op
> >> > >>> >>> >> > Float8Benchmarks.readWriteBenchmark  avgt    5  14.150 ±
> >> 0.084
> >> > >>> >>> us/op
> >> > >>> >>> >> >
> >> > >>> >>> >> > On Tue, Aug 6, 2019 at 1:18 AM Fan Liya <
> >> liya.fan03@gmail.com
> >> > >
> >> > >>> >>> wrote:
> >> > >>> >>> >> >
> >> > >>> >>> >> >> Hi Micah,
> >> > >>> >>> >> >>
> >> > >>> >>> >> >> Thanks a lot for doing this.
> >> > >>> >>> >> >>
> >> > >>> >>> >> >> I am a little concerned about if there is any negative
> >> > >>> performance
> >> > >>> >>> >> impact
> >> > >>> >>> >> >> on the current 32-bit-length based applications.
> >> > >>> >>> >> >> Can we do some performance comparison on our existing
> >> > >>> benchmarks?
> >> > >>> >>> >> >>
> >> > >>> >>> >> >> Best,
> >> > >>> >>> >> >> Liya Fan
> >> > >>> >>> >> >>
> >> > >>> >>> >> >>
> >> > >>> >>> >> >> On Tue, Aug 6, 2019 at 3:35 PM Micah Kornfield <
> >> > >>> >>> emkornfield@gmail.com>
> >> > >>> >>> >> >> wrote:
> >> > >>> >>> >> >>
> >> > >>> >>> >> >>> There have been some previous discussions on the mailing
> >> > about
> >> > >>> >>> >> supporting
> >> > >>> >>> >> >>> 64-bit lengths for  Java ValueVectors (this is what the
> >> IPC
> >> > >>> >>> >> specification
> >> > >>> >>> >> >>> and C++ support).  I created a PR [1] that changes all
> >> APIs
> >> > >>> that I
> >> > >>> >>> >> could
> >> > >>> >>> >> >>> find that take an index to take an "long" instead of an
> >> > "int"
> >> > >>> (and
> >> > >>> >>> >> >>> similarly change "size/rowcount" APIs).
> >> > >>> >>> >> >>>
> >> > >>> >>> >> >>> It is a big change, so I think it is worth discussing if
> >> it
> >> > is
> >> > >>> >>> >> something
> >> > >>> >>> >> >>> we
> >> > >>> >>> >> >>> still want to move forward with.  It would be nice to
> >> come
> >> > to
> >> > >>> a
> >> > >>> >>> >> >>> conclusion
> >> > >>> >>> >> >>> quickly, ideally in the next few days, to avoid a lot of
> >> > merge
> >> > >>> >>> >> conflicts.
> >> > >>> >>> >> >>>
> >> > >>> >>> >> >>> The reason I did this work now is the C++ implementation
> >> has
> >> > >>> added
> >> > >>> >>> >> >>> support
> >> > >>> >>> >> >>> for LargeList, LargeBinary and LargeString arrays and
> >> based
> >> > on
> >> > >>> >>> prior
> >> > >>> >>> >> >>> discussions we need to have similar support in Java
> >> before
> >> > our
> >> > >>> >>> next
> >> > >>> >>> >> >>> release. Support 64-bit indexes means we can have full
> >> > >>> >>> compatibility
> >> > >>> >>> >> and
> >> > >>> >>> >> >>> make the most use of the types in Java.
> >> > >>> >>> >> >>>
> >> > >>> >>> >> >>> Look forward to hearing feedback.
> >> > >>> >>> >> >>>
> >> > >>> >>> >> >>> Thanks,
> >> > >>> >>> >> >>> Micah
> >> > >>> >>> >> >>>
> >> > >>> >>> >> >>> [1] https://github.com/apache/arrow/pull/5020
> >> > >>> >>> >> >>>
> >> > >>> >>> >> >>
> >> > >>> >>> >>
> >> > >>> >>> >
> >> > >>> >>>
> >> > >>> >>
> >> > >>>
> >> > >>
> >> >
> >>
> >

Re: [Discuss][Java] 64-bit lengths for ValueVectors

Posted by Micah Kornfield <em...@gmail.com>.

Hi Wes and Jacques,
See responses below.

With regards to the reference implementation point. It is a good point. I'm
> on vacation this week. Unless you're pushing hard on this, can we pick this
> up and discuss more next week?


Sure thing, enjoy your vacation.  I think the only practical implications
are it delays choices around implementing LargeList, LargeBinary,
LargeString in Java, which in turn might push out the 0.15.0 release.

My stance on this is that I don't know how important it is for Java to
> support vectors over INT32_MAX elements. The use cases enabled by
> having very large arrays seem to be concentrated in the native code
> world (e.g. C/C++/Rust) -- that could just be implementation-centrism
> on my part, though.


A data point against this view is Spark has done work to eliminate 2GB
memory limits on its block sizes [1].  I don't claim to understand the
implications of this. Bryan might you have any thoughts here?  I'm OK with
INT32_MAX, as well, I think we should think about what this means for
adding Large types to Java and implications for reference implementations.

Thanks,
Micah

[1] https://issues.apache.org/jira/browse/SPARK-6235

On Sun, Aug 11, 2019 at 6:31 PM Jacques Nadeau <ja...@apache.org> wrote:

> Hey Micah,
>
> Appreciate the offer on the compiling. The reality is I'm more concerned
> about the unknowns than the compiling issue itself. Any time you've been
> tuning for a while, changing something like this could be totally fine or
> cause a couple of major issues. For example, we've done a very large amount
> of work reducing heap memory footprint of the vectors. Are target is to
> actually get it down to 24 bytes per ArrowBuf and 24 bytes heap per vector
> (not including arrow bufs).
>
> With regards to the reference implementation point. It is a good point.
> I'm on vacation this week. Unless you're pushing hard on this, can we pick
> this up and discuss more next week?
>
> thanks,
> Jacques
>
> On Sat, Aug 10, 2019 at 7:39 PM Micah Kornfield <em...@gmail.com>
> wrote:
>
>> Hi Jacques,
>> I definitely understand these concerns and this change is risky because it
>> is so large.  Perhaps, creating a new hierarchy, might be the cleanest way
>> of dealing with this.  This could have other benefits like cleaning up
>> some
>> cruft around dictionary encode and "orphaned" method.   Per past e-mail
>> threads I agree it is beneficial to have 2 separate reference
>> implementations that can communicate fully, and my intent here was to
>> close
>> that gap.
>>
>> Trying to
>> > determine the ramifications of these changes would be challenging and
>> time
>> > consuming against all the different ways we interact with the Arrow Java
>> > library.
>>
>>
>> Understood.  I took a quick look at Dremio-OSS it seems like it has a
>> simple java build system?  If it is helpful, I can try to get a fork
>> running that at least compiles against this PR.  My plan would be to cast
>> any place that was changed to return a long back to an int, so in essence
>> the Dremio algorithms would reman 32-bit implementations.
>>
>> I don't  have the infrastructure to test this change properly from a
>> distributed systems perspective, so it would still take some time from
>> Dremio to validate for regressions.
>>
>> I'm not saying I'm against this but want to make sure we've
>> > explored all less disruptive options before considering changing
>> something
>> > this fundamental (especially when I generally hold the view that large
>> cell
>> > counts against massive contiguous memory is an anti pattern to scalable
>> > analytical processing--purely subjective of course).
>>
>>
>> I'm open to other ideas here, as well. I don't think it is out of the
>> question to leave the Java implementation as 32-bit, but if we do, then I
>> think we should consider a different strategy for reference
>> implementations.
>>
>> Thanks,
>> Micah
>>
>> On Sat, Aug 10, 2019 at 5:09 PM Jacques Nadeau <ja...@apache.org>
>> wrote:
>>
>> > Hey Micah, I didn't have a particular path in mind. Was thinking more
>> along
>> > the lines of extra methods as opposed to separate classes.
>> >
>> > Arrow hasn't historically been a place where we're writing algorithms in
>> > Java so the fact that they aren't there doesn't mean they don't exist.
>> We
>> > have a large amount of code that depends on the current behavior that is
>> > deployed in hundreds of customer clusters (you can peruse our dremio
>> repo
>> > to see how extensively we leverage Arrow if interested). Trying to
>> > determine the ramifications of these changes would be challenging and
>> time
>> > consuming against all the different ways we interact with the Arrow Java
>> > library. I'm not saying I'm against this but want to make sure we've
>> > explored all less disruptive options before considering changing
>> something
>> > this fundamental (especially when I generally hold the view that large
>> cell
>> > counts against massive contiguous memory is an anti pattern to scalable
>> > analytical processing--purely subjective of course).
>> >
>> > On Sat, Aug 10, 2019, 4:17 PM Micah Kornfield <em...@gmail.com>
>> > wrote:
>> >
>> > > Hi Jacques,
>> > > What avenue were you thinking for supporting both paths?   I didn't
>> want
>> > > to pursue a different class hierarchy, because I felt like that would
>> > > effectively fork the code base, but that is potentially an option that
>> > > would allow us to have a complete reference implementation in Java
>> that
>> > can
>> > > fully interact with C++, without major changes to this code.
>> > >
>> > > For supporting both APIs on the same classes/interfaces, I think they
>> > > roughly fall into three categories, changes to input parameters,
>> changes
>> > to
>> > > output parameters and algorithm changes.
>> > >
>> > > For inputs, changing from int to long is essentially a no-op from the
>> > > compiler perspective.  From the limited micro-benchmarking this also
>> > > doesn't seem to have a performance impact.  So we could keep two
>> versions
>> > > of the methods that only differ on inputs, but it is not clear what
>> the
>> > > value of that would be.
>> > >
>> > > For outputs, we can't support methods "long getLength()" and "int
>> > > getLength()" in the same class, so we would be forced into something
>> like
>> > > "long getLength(boolean dummy)" which I think is a less desirable.
>> > >
>> > > For algorithm changes, there did not appear to be too many places
>> where
>> > we
>> > > actually loop over all elements (it is quite possible I missed
>> something
>> > > here), the ones that I did find I was able to mitigate performance
>> > > penalties as noted above.  Some of the current implementation will
>> get a
>> > > lot slower for "large arrays", but we can likely fix those later or in
>> > this
>> > > PR with a nested while loop instead of 2 for loops.
>> > >
>> > > Thanks,
>> > > Micah
>> > >
>> > >
>> > > On Saturday, August 10, 2019, Jacques Nadeau <ja...@apache.org>
>> wrote:
>> > >
>> > >> This is a pretty massive change to the apis. I wonder how nasty it
>> would
>> > >> be to just support both paths. Have you evaluated how complex that
>> > would be?
>> > >>
>> > >> On Wed, Aug 7, 2019 at 11:08 PM Micah Kornfield <
>> emkornfield@gmail.com>
>> > >> wrote:
>> > >>
>> > >>> After more investigation, it looks like Float8Benchmarks at least
>> on my
>> > >>> machine are within the range of noise.
>> > >>>
>> > >>> For BitVectorHelper I pushed a new commit [1], seems to bring the
>> > >>> BitVectorHelper benchmarks back inline (and even with some
>> improvement
>> > >>> for
>> > >>> getNullCountBenchmark).
>> > >>>
>> > >>> Benchmark                                        Mode  Cnt   Score
>> > >>>  Error
>> > >>>  Units
>> > >>> BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5   3.821 ±
>> > >>> 0.031
>> > >>>  ns/op
>> > >>> BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5  14.884 ±
>> > >>> 0.141
>> > >>>  ns/op
>> > >>>
>> > >>> I applied the same pattern to other loops that I could find, and for
>> > any
>> > >>> "for (long" loop on the critical path, I broke it up into two loops.
>> > the
>> > >>> first loop does iteration by integer, the second finishes off for
>> any
>> > >>> long
>> > >>> values.  As a side note it seems like optimization for loops using
>> long
>> > >>> counters at least have a semi-recent open bug for the JVM [2]
>> > >>>
>> > >>> Thanks,
>> > >>> Micah
>> > >>>
>> > >>> [1]
>> > >>>
>> > >>>
>> >
>> https://github.com/apache/arrow/pull/5020/commits/2ea2c1ae83e3baa7b9a99a6d06276d968df41797
>> > >>> [2] https://bugs.openjdk.java.net/browse/JDK-8223051
>> > >>>
>> > >>> On Wed, Aug 7, 2019 at 8:11 PM Micah Kornfield <
>> emkornfield@gmail.com>
>> > >>> wrote:
>> > >>>
>> > >>> > Indeed, the BoundChecking and CheckNullForGet variables can make a
>> > big
>> > >>> > difference.  I didn't initially run the benchmarks with these
>> turned
>> > on
>> > >>> > (you can see the result from above with Float8Benchmarks).  Here
>> are
>> > >>> new
>> > >>> > numbers including with the flags enabled.  It looks like using
>> longs
>> > >>> might
>> > >>> > be a little bit slower, I'll see what I can do to mitigate this.
>> > >>> >
>> > >>> > Ravindra also volunteered to try to benchmark the changes with
>> > Dremio's
>> > >>> > code on today's sync call.
>> > >>> >
>> > >>> > New
>> > >>> >
>> > >>> > Benchmark                                        Mode  Cnt   Score
>> > >>>  Error
>> > >>> > Units
>> > >>> >
>> > >>> > BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5
>>  4.176 ±
>> > >>> 1.292
>> > >>> > ns/op
>> > >>> >
>> > >>> > BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5
>> 26.102 ±
>> > >>> 0.700
>> > >>> > ns/op
>> > >>> >
>> > >>> > Float8Benchmarks.copyFromBenchmark   avgt    5  7.398 ± 0.084
>> us/op
>> > >>> >
>> > >>> > Float8Benchmarks.readWriteBenchmark  avgt    5  2.711 ± 0.057
>> us/op
>> > >>> >
>> > >>> >
>> > >>> >
>> > >>> > old
>> > >>> >
>> > >>> > BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5
>>  3.828 ±
>> > >>> 0.030
>> > >>> > ns/op
>> > >>> >
>> > >>> > BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5
>> 20.611 ±
>> > >>> 0.188
>> > >>> > ns/op
>> > >>> >
>> > >>> > Float8Benchmarks.copyFromBenchmark   avgt    5  6.597 ± 0.462
>> us/op
>> > >>> >
>> > >>> > Float8Benchmarks.readWriteBenchmark  avgt    5  2.615 ± 0.027
>> us/op
>> > >>> >
>> > >>> > On Wed, Aug 7, 2019 at 7:13 PM Fan Liya <li...@gmail.com>
>> > wrote:
>> > >>> >
>> > >>> >> Hi Gonzalo,
>> > >>> >>
>> > >>> >> Thanks for sharing the performance results.
>> > >>> >> I am wondering if you have turned off the flag
>> > >>> >> BoundsChecking#BOUNDS_CHECKING_ENABLED.
>> > >>> >> If not, the lower throughput should be expected.
>> > >>> >>
>> > >>> >> Best,
>> > >>> >> Liya Fan
>> > >>> >>
>> > >>> >> On Wed, Aug 7, 2019 at 10:23 PM Micah Kornfield <
>> > >>> emkornfield@gmail.com>
>> > >>> >> wrote:
>> > >>> >>
>> > >>> >>> Hi Gonzalo,
>> > >>> >>> Thank you for the feedback.  I wasn't aware of the JIT
>> > >>> implications.   At
>> > >>> >>> least on the benchmark run they don't seem to have an impact.
>> > >>> >>>
>> > >>> >>> If there are other benchmarks that people have that can
>> validate if
>> > >>> this
>> > >>> >>> change will be problematic I would appreciate trying to run them
>> > >>> with the
>> > >>> >>> PR.  I will try to run the ones for zeroing/popcnt tonight to
>> see
>> > if
>> > >>> >>> there
>> > >>> >>> is a change in those.
>> > >>> >>>
>> > >>> >>> -Micah
>> > >>> >>>
>> > >>> >>>
>> > >>> >>>
>> > >>> >>> On Wednesday, August 7, 2019, Gonzalo Ortiz Jaureguizar <
>> > >>> >>> golthiryus@gmail.com> wrote:
>> > >>> >>>
>> > >>> >>> > I would recommend to take care with this kind of changes.
>> > >>> >>> >
>> > >>> >>> > I didn't try Arrow in more than one year, but by then the
>> > >>> performance
>> > >>> >>> was
>> > >>> >>> > quite bad in comparison with plain byte buffer access
>> > >>> >>> > (see http://git.net/apache-arrow-development/msg02353.html *)
>> > and
>> > >>> >>> > there are several optimizations that the JVM (specifically,
>> C2)
>> > >>> does
>> > >>> >>> not
>> > >>> >>> > apply when dealing with int instead of longs. One of the
>> > >>> >>> > most commons is the loop unrolling and vectorization.
>> > >>> >>> >
>> > >>> >>> > * It doesn't seem the best way to reference an old email on
>> the
>> > >>> list,
>> > >>> >>> but
>> > >>> >>> > it is the only result shown by Google
>> > >>> >>> >
>> > >>> >>> > El mié., 7 ago. 2019 a las 11:42, Fan Liya (<
>> > liya.fan03@gmail.com
>> > >>> >)
>> > >>> >>> > escribió:
>> > >>> >>> >
>> > >>> >>> >> Hi Micah,
>> > >>> >>> >>
>> > >>> >>> >> Thanks for your effort. The performance result looks good.
>> > >>> >>> >>
>> > >>> >>> >> As you indicated, ArrowBuf will take additional 12 bytes (4
>> > bytes
>> > >>> for
>> > >>> >>> each
>> > >>> >>> >> of length, write index, and read index).
>> > >>> >>> >> Similar overheads also exist for vectors like
>> > >>> BaseFixedWidthVector,
>> > >>> >>> >> BaseVariableWidthVector, etc.
>> > >>> >>> >>
>> > >>> >>> >> IMO, such overheads are small enough to justify the change.
>> > >>> >>> >> Let's check if there are other overheads.
>> > >>> >>> >>
>> > >>> >>> >> Best,
>> > >>> >>> >> Liya Fan
>> > >>> >>> >>
>> > >>> >>> >> On Wed, Aug 7, 2019 at 3:30 PM Micah Kornfield <
>> > >>> emkornfield@gmail.com
>> > >>> >>> >
>> > >>> >>> >> wrote:
>> > >>> >>> >>
>> > >>> >>> >> > Hi Liya Fan,
>> > >>> >>> >> > Based on the Float8Benchmark there does not seem to be any
>> > >>> >>> meaningful
>> > >>> >>> >> > performance difference on my machine.  At least for me, the
>> > >>> >>> benchmarks
>> > >>> >>> >> are
>> > >>> >>> >> > not stable enough to say one is faster than the other (I've
>> > >>> pasted
>> > >>> >>> >> results
>> > >>> >>> >> > below).  That being said my machine isn't necessarily the
>> most
>> > >>> >>> reliable
>> > >>> >>> >> for
>> > >>> >>> >> > benchmarking.
>> > >>> >>> >> >
>> > >>> >>> >> > On an intuitive level, this makes sense to me,  for the
>> most
>> > >>> part it
>> > >>> >>> >> seems
>> > >>> >>> >> > like the change just moves casting from "int" to "long"
>> > further
>> > >>> up
>> > >>> >>> the
>> > >>> >>> >> > stack  for  "PlatformDepdendent" operations.  If there are
>> > other
>> > >>> >>> >> benchmarks
>> > >>> >>> >> > that you think are worth running let me know.
>> > >>> >>> >> >
>> > >>> >>> >> > One downside performance wise I think for his change is it
>> > >>> >>> increases the
>> > >>> >>> >> > size of ArrowBuf objects, which I suppose could influence
>> > cache
>> > >>> >>> misses
>> > >>> >>> >> at
>> > >>> >>> >> > some level or increase the size of call-stacks, but this
>> > doesn't
>> > >>> >>> seem to
>> > >>> >>> >> > show up in the benchmark..
>> > >>> >>> >> >
>> > >>> >>> >> > Thanks,
>> > >>> >>> >> > Micah
>> > >>> >>> >> >
>> > >>> >>> >> > Sample benchmark numbers:
>> > >>> >>> >> >
>> > >>> >>> >> > [New Code]
>> > >>> >>> >> > Benchmark                            Mode  Cnt   Score
>>  Error
>> > >>> >>> Units
>> > >>> >>> >> > Float8Benchmarks.copyFromBenchmark   avgt    5  15.441 ±
>> 0.469
>> > >>> >>> us/op
>> > >>> >>> >> > Float8Benchmarks.readWriteBenchmark  avgt    5  14.057 ±
>> 0.115
>> > >>> >>> us/op
>> > >>> >>> >> >
>> > >>> >>> >> >
>> > >>> >>> >> > [Old code]
>> > >>> >>> >> > Benchmark                            Mode  Cnt   Score
>>  Error
>> > >>> >>> Units
>> > >>> >>> >> > Float8Benchmarks.copyFromBenchmark   avgt    5  16.248 ±
>> 1.409
>> > >>> >>> us/op
>> > >>> >>> >> > Float8Benchmarks.readWriteBenchmark  avgt    5  14.150 ±
>> 0.084
>> > >>> >>> us/op
>> > >>> >>> >> >
>> > >>> >>> >> > On Tue, Aug 6, 2019 at 1:18 AM Fan Liya <
>> liya.fan03@gmail.com
>> > >
>> > >>> >>> wrote:
>> > >>> >>> >> >
>> > >>> >>> >> >> Hi Micah,
>> > >>> >>> >> >>
>> > >>> >>> >> >> Thanks a lot for doing this.
>> > >>> >>> >> >>
>> > >>> >>> >> >> I am a little concerned about if there is any negative
>> > >>> performance
>> > >>> >>> >> impact
>> > >>> >>> >> >> on the current 32-bit-length based applications.
>> > >>> >>> >> >> Can we do some performance comparison on our existing
>> > >>> benchmarks?
>> > >>> >>> >> >>
>> > >>> >>> >> >> Best,
>> > >>> >>> >> >> Liya Fan
>> > >>> >>> >> >>
>> > >>> >>> >> >>
>> > >>> >>> >> >> On Tue, Aug 6, 2019 at 3:35 PM Micah Kornfield <
>> > >>> >>> emkornfield@gmail.com>
>> > >>> >>> >> >> wrote:
>> > >>> >>> >> >>
>> > >>> >>> >> >>> There have been some previous discussions on the mailing
>> > about
>> > >>> >>> >> supporting
>> > >>> >>> >> >>> 64-bit lengths for  Java ValueVectors (this is what the
>> IPC
>> > >>> >>> >> specification
>> > >>> >>> >> >>> and C++ support).  I created a PR [1] that changes all
>> APIs
>> > >>> that I
>> > >>> >>> >> could
>> > >>> >>> >> >>> find that take an index to take an "long" instead of an
>> > "int"
>> > >>> (and
>> > >>> >>> >> >>> similarly change "size/rowcount" APIs).
>> > >>> >>> >> >>>
>> > >>> >>> >> >>> It is a big change, so I think it is worth discussing if
>> it
>> > is
>> > >>> >>> >> something
>> > >>> >>> >> >>> we
>> > >>> >>> >> >>> still want to move forward with.  It would be nice to
>> come
>> > to
>> > >>> a
>> > >>> >>> >> >>> conclusion
>> > >>> >>> >> >>> quickly, ideally in the next few days, to avoid a lot of
>> > merge
>> > >>> >>> >> conflicts.
>> > >>> >>> >> >>>
>> > >>> >>> >> >>> The reason I did this work now is the C++ implementation
>> has
>> > >>> added
>> > >>> >>> >> >>> support
>> > >>> >>> >> >>> for LargeList, LargeBinary and LargeString arrays and
>> based
>> > on
>> > >>> >>> prior
>> > >>> >>> >> >>> discussions we need to have similar support in Java
>> before
>> > our
>> > >>> >>> next
>> > >>> >>> >> >>> release. Support 64-bit indexes means we can have full
>> > >>> >>> compatibility
>> > >>> >>> >> and
>> > >>> >>> >> >>> make the most use of the types in Java.
>> > >>> >>> >> >>>
>> > >>> >>> >> >>> Look forward to hearing feedback.
>> > >>> >>> >> >>>
>> > >>> >>> >> >>> Thanks,
>> > >>> >>> >> >>> Micah
>> > >>> >>> >> >>>
>> > >>> >>> >> >>> [1] https://github.com/apache/arrow/pull/5020
>> > >>> >>> >> >>>
>> > >>> >>> >> >>
>> > >>> >>> >>
>> > >>> >>> >
>> > >>> >>>
>> > >>> >>
>> > >>>
>> > >>
>> >
>>
>

Re: [Discuss][Java] 64-bit lengths for ValueVectors

Posted by Jacques Nadeau <ja...@apache.org>.

Hey Micah,

Appreciate the offer on the compiling. The reality is I'm more concerned
about the unknowns than the compiling issue itself. Any time you've been
tuning for a while, changing something like this could be totally fine or
cause a couple of major issues. For example, we've done a very large amount
of work reducing heap memory footprint of the vectors. Are target is to
actually get it down to 24 bytes per ArrowBuf and 24 bytes heap per vector
(not including arrow bufs).

With regards to the reference implementation point. It is a good point. I'm
on vacation this week. Unless you're pushing hard on this, can we pick this
up and discuss more next week?

thanks,
Jacques

On Sat, Aug 10, 2019 at 7:39 PM Micah Kornfield <em...@gmail.com>
wrote:

> Hi Jacques,
> I definitely understand these concerns and this change is risky because it
> is so large.  Perhaps, creating a new hierarchy, might be the cleanest way
> of dealing with this.  This could have other benefits like cleaning up some
> cruft around dictionary encode and "orphaned" method.   Per past e-mail
> threads I agree it is beneficial to have 2 separate reference
> implementations that can communicate fully, and my intent here was to close
> that gap.
>
> Trying to
> > determine the ramifications of these changes would be challenging and
> time
> > consuming against all the different ways we interact with the Arrow Java
> > library.
>
>
> Understood.  I took a quick look at Dremio-OSS it seems like it has a
> simple java build system?  If it is helpful, I can try to get a fork
> running that at least compiles against this PR.  My plan would be to cast
> any place that was changed to return a long back to an int, so in essence
> the Dremio algorithms would reman 32-bit implementations.
>
> I don't  have the infrastructure to test this change properly from a
> distributed systems perspective, so it would still take some time from
> Dremio to validate for regressions.
>
> I'm not saying I'm against this but want to make sure we've
> > explored all less disruptive options before considering changing
> something
> > this fundamental (especially when I generally hold the view that large
> cell
> > counts against massive contiguous memory is an anti pattern to scalable
> > analytical processing--purely subjective of course).
>
>
> I'm open to other ideas here, as well. I don't think it is out of the
> question to leave the Java implementation as 32-bit, but if we do, then I
> think we should consider a different strategy for reference
> implementations.
>
> Thanks,
> Micah
>
> On Sat, Aug 10, 2019 at 5:09 PM Jacques Nadeau <ja...@apache.org> wrote:
>
> > Hey Micah, I didn't have a particular path in mind. Was thinking more
> along
> > the lines of extra methods as opposed to separate classes.
> >
> > Arrow hasn't historically been a place where we're writing algorithms in
> > Java so the fact that they aren't there doesn't mean they don't exist. We
> > have a large amount of code that depends on the current behavior that is
> > deployed in hundreds of customer clusters (you can peruse our dremio repo
> > to see how extensively we leverage Arrow if interested). Trying to
> > determine the ramifications of these changes would be challenging and
> time
> > consuming against all the different ways we interact with the Arrow Java
> > library. I'm not saying I'm against this but want to make sure we've
> > explored all less disruptive options before considering changing
> something
> > this fundamental (especially when I generally hold the view that large
> cell
> > counts against massive contiguous memory is an anti pattern to scalable
> > analytical processing--purely subjective of course).
> >
> > On Sat, Aug 10, 2019, 4:17 PM Micah Kornfield <em...@gmail.com>
> > wrote:
> >
> > > Hi Jacques,
> > > What avenue were you thinking for supporting both paths?   I didn't
> want
> > > to pursue a different class hierarchy, because I felt like that would
> > > effectively fork the code base, but that is potentially an option that
> > > would allow us to have a complete reference implementation in Java that
> > can
> > > fully interact with C++, without major changes to this code.
> > >
> > > For supporting both APIs on the same classes/interfaces, I think they
> > > roughly fall into three categories, changes to input parameters,
> changes
> > to
> > > output parameters and algorithm changes.
> > >
> > > For inputs, changing from int to long is essentially a no-op from the
> > > compiler perspective.  From the limited micro-benchmarking this also
> > > doesn't seem to have a performance impact.  So we could keep two
> versions
> > > of the methods that only differ on inputs, but it is not clear what the
> > > value of that would be.
> > >
> > > For outputs, we can't support methods "long getLength()" and "int
> > > getLength()" in the same class, so we would be forced into something
> like
> > > "long getLength(boolean dummy)" which I think is a less desirable.
> > >
> > > For algorithm changes, there did not appear to be too many places where
> > we
> > > actually loop over all elements (it is quite possible I missed
> something
> > > here), the ones that I did find I was able to mitigate performance
> > > penalties as noted above.  Some of the current implementation will get
> a
> > > lot slower for "large arrays", but we can likely fix those later or in
> > this
> > > PR with a nested while loop instead of 2 for loops.
> > >
> > > Thanks,
> > > Micah
> > >
> > >
> > > On Saturday, August 10, 2019, Jacques Nadeau <ja...@apache.org>
> wrote:
> > >
> > >> This is a pretty massive change to the apis. I wonder how nasty it
> would
> > >> be to just support both paths. Have you evaluated how complex that
> > would be?
> > >>
> > >> On Wed, Aug 7, 2019 at 11:08 PM Micah Kornfield <
> emkornfield@gmail.com>
> > >> wrote:
> > >>
> > >>> After more investigation, it looks like Float8Benchmarks at least on
> my
> > >>> machine are within the range of noise.
> > >>>
> > >>> For BitVectorHelper I pushed a new commit [1], seems to bring the
> > >>> BitVectorHelper benchmarks back inline (and even with some
> improvement
> > >>> for
> > >>> getNullCountBenchmark).
> > >>>
> > >>> Benchmark                                        Mode  Cnt   Score
> > >>>  Error
> > >>>  Units
> > >>> BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5   3.821 ±
> > >>> 0.031
> > >>>  ns/op
> > >>> BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5  14.884 ±
> > >>> 0.141
> > >>>  ns/op
> > >>>
> > >>> I applied the same pattern to other loops that I could find, and for
> > any
> > >>> "for (long" loop on the critical path, I broke it up into two loops.
> > the
> > >>> first loop does iteration by integer, the second finishes off for any
> > >>> long
> > >>> values.  As a side note it seems like optimization for loops using
> long
> > >>> counters at least have a semi-recent open bug for the JVM [2]
> > >>>
> > >>> Thanks,
> > >>> Micah
> > >>>
> > >>> [1]
> > >>>
> > >>>
> >
> https://github.com/apache/arrow/pull/5020/commits/2ea2c1ae83e3baa7b9a99a6d06276d968df41797
> > >>> [2] https://bugs.openjdk.java.net/browse/JDK-8223051
> > >>>
> > >>> On Wed, Aug 7, 2019 at 8:11 PM Micah Kornfield <
> emkornfield@gmail.com>
> > >>> wrote:
> > >>>
> > >>> > Indeed, the BoundChecking and CheckNullForGet variables can make a
> > big
> > >>> > difference.  I didn't initially run the benchmarks with these
> turned
> > on
> > >>> > (you can see the result from above with Float8Benchmarks).  Here
> are
> > >>> new
> > >>> > numbers including with the flags enabled.  It looks like using
> longs
> > >>> might
> > >>> > be a little bit slower, I'll see what I can do to mitigate this.
> > >>> >
> > >>> > Ravindra also volunteered to try to benchmark the changes with
> > Dremio's
> > >>> > code on today's sync call.
> > >>> >
> > >>> > New
> > >>> >
> > >>> > Benchmark                                        Mode  Cnt   Score
> > >>>  Error
> > >>> > Units
> > >>> >
> > >>> > BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5   4.176
> ±
> > >>> 1.292
> > >>> > ns/op
> > >>> >
> > >>> > BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5  26.102
> ±
> > >>> 0.700
> > >>> > ns/op
> > >>> >
> > >>> > Float8Benchmarks.copyFromBenchmark   avgt    5  7.398 ± 0.084
> us/op
> > >>> >
> > >>> > Float8Benchmarks.readWriteBenchmark  avgt    5  2.711 ± 0.057
> us/op
> > >>> >
> > >>> >
> > >>> >
> > >>> > old
> > >>> >
> > >>> > BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5   3.828
> ±
> > >>> 0.030
> > >>> > ns/op
> > >>> >
> > >>> > BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5  20.611
> ±
> > >>> 0.188
> > >>> > ns/op
> > >>> >
> > >>> > Float8Benchmarks.copyFromBenchmark   avgt    5  6.597 ± 0.462
> us/op
> > >>> >
> > >>> > Float8Benchmarks.readWriteBenchmark  avgt    5  2.615 ± 0.027
> us/op
> > >>> >
> > >>> > On Wed, Aug 7, 2019 at 7:13 PM Fan Liya <li...@gmail.com>
> > wrote:
> > >>> >
> > >>> >> Hi Gonzalo,
> > >>> >>
> > >>> >> Thanks for sharing the performance results.
> > >>> >> I am wondering if you have turned off the flag
> > >>> >> BoundsChecking#BOUNDS_CHECKING_ENABLED.
> > >>> >> If not, the lower throughput should be expected.
> > >>> >>
> > >>> >> Best,
> > >>> >> Liya Fan
> > >>> >>
> > >>> >> On Wed, Aug 7, 2019 at 10:23 PM Micah Kornfield <
> > >>> emkornfield@gmail.com>
> > >>> >> wrote:
> > >>> >>
> > >>> >>> Hi Gonzalo,
> > >>> >>> Thank you for the feedback.  I wasn't aware of the JIT
> > >>> implications.   At
> > >>> >>> least on the benchmark run they don't seem to have an impact.
> > >>> >>>
> > >>> >>> If there are other benchmarks that people have that can validate
> if
> > >>> this
> > >>> >>> change will be problematic I would appreciate trying to run them
> > >>> with the
> > >>> >>> PR.  I will try to run the ones for zeroing/popcnt tonight to see
> > if
> > >>> >>> there
> > >>> >>> is a change in those.
> > >>> >>>
> > >>> >>> -Micah
> > >>> >>>
> > >>> >>>
> > >>> >>>
> > >>> >>> On Wednesday, August 7, 2019, Gonzalo Ortiz Jaureguizar <
> > >>> >>> golthiryus@gmail.com> wrote:
> > >>> >>>
> > >>> >>> > I would recommend to take care with this kind of changes.
> > >>> >>> >
> > >>> >>> > I didn't try Arrow in more than one year, but by then the
> > >>> performance
> > >>> >>> was
> > >>> >>> > quite bad in comparison with plain byte buffer access
> > >>> >>> > (see http://git.net/apache-arrow-development/msg02353.html *)
> > and
> > >>> >>> > there are several optimizations that the JVM (specifically, C2)
> > >>> does
> > >>> >>> not
> > >>> >>> > apply when dealing with int instead of longs. One of the
> > >>> >>> > most commons is the loop unrolling and vectorization.
> > >>> >>> >
> > >>> >>> > * It doesn't seem the best way to reference an old email on the
> > >>> list,
> > >>> >>> but
> > >>> >>> > it is the only result shown by Google
> > >>> >>> >
> > >>> >>> > El mié., 7 ago. 2019 a las 11:42, Fan Liya (<
> > liya.fan03@gmail.com
> > >>> >)
> > >>> >>> > escribió:
> > >>> >>> >
> > >>> >>> >> Hi Micah,
> > >>> >>> >>
> > >>> >>> >> Thanks for your effort. The performance result looks good.
> > >>> >>> >>
> > >>> >>> >> As you indicated, ArrowBuf will take additional 12 bytes (4
> > bytes
> > >>> for
> > >>> >>> each
> > >>> >>> >> of length, write index, and read index).
> > >>> >>> >> Similar overheads also exist for vectors like
> > >>> BaseFixedWidthVector,
> > >>> >>> >> BaseVariableWidthVector, etc.
> > >>> >>> >>
> > >>> >>> >> IMO, such overheads are small enough to justify the change.
> > >>> >>> >> Let's check if there are other overheads.
> > >>> >>> >>
> > >>> >>> >> Best,
> > >>> >>> >> Liya Fan
> > >>> >>> >>
> > >>> >>> >> On Wed, Aug 7, 2019 at 3:30 PM Micah Kornfield <
> > >>> emkornfield@gmail.com
> > >>> >>> >
> > >>> >>> >> wrote:
> > >>> >>> >>
> > >>> >>> >> > Hi Liya Fan,
> > >>> >>> >> > Based on the Float8Benchmark there does not seem to be any
> > >>> >>> meaningful
> > >>> >>> >> > performance difference on my machine.  At least for me, the
> > >>> >>> benchmarks
> > >>> >>> >> are
> > >>> >>> >> > not stable enough to say one is faster than the other (I've
> > >>> pasted
> > >>> >>> >> results
> > >>> >>> >> > below).  That being said my machine isn't necessarily the
> most
> > >>> >>> reliable
> > >>> >>> >> for
> > >>> >>> >> > benchmarking.
> > >>> >>> >> >
> > >>> >>> >> > On an intuitive level, this makes sense to me,  for the most
> > >>> part it
> > >>> >>> >> seems
> > >>> >>> >> > like the change just moves casting from "int" to "long"
> > further
> > >>> up
> > >>> >>> the
> > >>> >>> >> > stack  for  "PlatformDepdendent" operations.  If there are
> > other
> > >>> >>> >> benchmarks
> > >>> >>> >> > that you think are worth running let me know.
> > >>> >>> >> >
> > >>> >>> >> > One downside performance wise I think for his change is it
> > >>> >>> increases the
> > >>> >>> >> > size of ArrowBuf objects, which I suppose could influence
> > cache
> > >>> >>> misses
> > >>> >>> >> at
> > >>> >>> >> > some level or increase the size of call-stacks, but this
> > doesn't
> > >>> >>> seem to
> > >>> >>> >> > show up in the benchmark..
> > >>> >>> >> >
> > >>> >>> >> > Thanks,
> > >>> >>> >> > Micah
> > >>> >>> >> >
> > >>> >>> >> > Sample benchmark numbers:
> > >>> >>> >> >
> > >>> >>> >> > [New Code]
> > >>> >>> >> > Benchmark                            Mode  Cnt   Score
>  Error
> > >>> >>> Units
> > >>> >>> >> > Float8Benchmarks.copyFromBenchmark   avgt    5  15.441 ±
> 0.469
> > >>> >>> us/op
> > >>> >>> >> > Float8Benchmarks.readWriteBenchmark  avgt    5  14.057 ±
> 0.115
> > >>> >>> us/op
> > >>> >>> >> >
> > >>> >>> >> >
> > >>> >>> >> > [Old code]
> > >>> >>> >> > Benchmark                            Mode  Cnt   Score
>  Error
> > >>> >>> Units
> > >>> >>> >> > Float8Benchmarks.copyFromBenchmark   avgt    5  16.248 ±
> 1.409
> > >>> >>> us/op
> > >>> >>> >> > Float8Benchmarks.readWriteBenchmark  avgt    5  14.150 ±
> 0.084
> > >>> >>> us/op
> > >>> >>> >> >
> > >>> >>> >> > On Tue, Aug 6, 2019 at 1:18 AM Fan Liya <
> liya.fan03@gmail.com
> > >
> > >>> >>> wrote:
> > >>> >>> >> >
> > >>> >>> >> >> Hi Micah,
> > >>> >>> >> >>
> > >>> >>> >> >> Thanks a lot for doing this.
> > >>> >>> >> >>
> > >>> >>> >> >> I am a little concerned about if there is any negative
> > >>> performance
> > >>> >>> >> impact
> > >>> >>> >> >> on the current 32-bit-length based applications.
> > >>> >>> >> >> Can we do some performance comparison on our existing
> > >>> benchmarks?
> > >>> >>> >> >>
> > >>> >>> >> >> Best,
> > >>> >>> >> >> Liya Fan
> > >>> >>> >> >>
> > >>> >>> >> >>
> > >>> >>> >> >> On Tue, Aug 6, 2019 at 3:35 PM Micah Kornfield <
> > >>> >>> emkornfield@gmail.com>
> > >>> >>> >> >> wrote:
> > >>> >>> >> >>
> > >>> >>> >> >>> There have been some previous discussions on the mailing
> > about
> > >>> >>> >> supporting
> > >>> >>> >> >>> 64-bit lengths for  Java ValueVectors (this is what the
> IPC
> > >>> >>> >> specification
> > >>> >>> >> >>> and C++ support).  I created a PR [1] that changes all
> APIs
> > >>> that I
> > >>> >>> >> could
> > >>> >>> >> >>> find that take an index to take an "long" instead of an
> > "int"
> > >>> (and
> > >>> >>> >> >>> similarly change "size/rowcount" APIs).
> > >>> >>> >> >>>
> > >>> >>> >> >>> It is a big change, so I think it is worth discussing if
> it
> > is
> > >>> >>> >> something
> > >>> >>> >> >>> we
> > >>> >>> >> >>> still want to move forward with.  It would be nice to come
> > to
> > >>> a
> > >>> >>> >> >>> conclusion
> > >>> >>> >> >>> quickly, ideally in the next few days, to avoid a lot of
> > merge
> > >>> >>> >> conflicts.
> > >>> >>> >> >>>
> > >>> >>> >> >>> The reason I did this work now is the C++ implementation
> has
> > >>> added
> > >>> >>> >> >>> support
> > >>> >>> >> >>> for LargeList, LargeBinary and LargeString arrays and
> based
> > on
> > >>> >>> prior
> > >>> >>> >> >>> discussions we need to have similar support in Java before
> > our
> > >>> >>> next
> > >>> >>> >> >>> release. Support 64-bit indexes means we can have full
> > >>> >>> compatibility
> > >>> >>> >> and
> > >>> >>> >> >>> make the most use of the types in Java.
> > >>> >>> >> >>>
> > >>> >>> >> >>> Look forward to hearing feedback.
> > >>> >>> >> >>>
> > >>> >>> >> >>> Thanks,
> > >>> >>> >> >>> Micah
> > >>> >>> >> >>>
> > >>> >>> >> >>> [1] https://github.com/apache/arrow/pull/5020
> > >>> >>> >> >>>
> > >>> >>> >> >>
> > >>> >>> >>
> > >>> >>> >
> > >>> >>>
> > >>> >>
> > >>>
> > >>
> >
>

Re: [Discuss][Java] 64-bit lengths for ValueVectors

Posted by Micah Kornfield <em...@gmail.com>.

Hi Jacques,
I definitely understand these concerns and this change is risky because it
is so large.  Perhaps, creating a new hierarchy, might be the cleanest way
of dealing with this.  This could have other benefits like cleaning up some
cruft around dictionary encode and "orphaned" method.   Per past e-mail
threads I agree it is beneficial to have 2 separate reference
implementations that can communicate fully, and my intent here was to close
that gap.

Trying to
> determine the ramifications of these changes would be challenging and time
> consuming against all the different ways we interact with the Arrow Java
> library.


Understood.  I took a quick look at Dremio-OSS it seems like it has a
simple java build system?  If it is helpful, I can try to get a fork
running that at least compiles against this PR.  My plan would be to cast
any place that was changed to return a long back to an int, so in essence
the Dremio algorithms would reman 32-bit implementations.

I don't  have the infrastructure to test this change properly from a
distributed systems perspective, so it would still take some time from
Dremio to validate for regressions.

I'm not saying I'm against this but want to make sure we've
> explored all less disruptive options before considering changing something
> this fundamental (especially when I generally hold the view that large cell
> counts against massive contiguous memory is an anti pattern to scalable
> analytical processing--purely subjective of course).


I'm open to other ideas here, as well. I don't think it is out of the
question to leave the Java implementation as 32-bit, but if we do, then I
think we should consider a different strategy for reference implementations.

Thanks,
Micah

On Sat, Aug 10, 2019 at 5:09 PM Jacques Nadeau <ja...@apache.org> wrote:

> Hey Micah, I didn't have a particular path in mind. Was thinking more along
> the lines of extra methods as opposed to separate classes.
>
> Arrow hasn't historically been a place where we're writing algorithms in
> Java so the fact that they aren't there doesn't mean they don't exist. We
> have a large amount of code that depends on the current behavior that is
> deployed in hundreds of customer clusters (you can peruse our dremio repo
> to see how extensively we leverage Arrow if interested). Trying to
> determine the ramifications of these changes would be challenging and time
> consuming against all the different ways we interact with the Arrow Java
> library. I'm not saying I'm against this but want to make sure we've
> explored all less disruptive options before considering changing something
> this fundamental (especially when I generally hold the view that large cell
> counts against massive contiguous memory is an anti pattern to scalable
> analytical processing--purely subjective of course).
>
> On Sat, Aug 10, 2019, 4:17 PM Micah Kornfield <em...@gmail.com>
> wrote:
>
> > Hi Jacques,
> > What avenue were you thinking for supporting both paths?   I didn't want
> > to pursue a different class hierarchy, because I felt like that would
> > effectively fork the code base, but that is potentially an option that
> > would allow us to have a complete reference implementation in Java that
> can
> > fully interact with C++, without major changes to this code.
> >
> > For supporting both APIs on the same classes/interfaces, I think they
> > roughly fall into three categories, changes to input parameters, changes
> to
> > output parameters and algorithm changes.
> >
> > For inputs, changing from int to long is essentially a no-op from the
> > compiler perspective.  From the limited micro-benchmarking this also
> > doesn't seem to have a performance impact.  So we could keep two versions
> > of the methods that only differ on inputs, but it is not clear what the
> > value of that would be.
> >
> > For outputs, we can't support methods "long getLength()" and "int
> > getLength()" in the same class, so we would be forced into something like
> > "long getLength(boolean dummy)" which I think is a less desirable.
> >
> > For algorithm changes, there did not appear to be too many places where
> we
> > actually loop over all elements (it is quite possible I missed something
> > here), the ones that I did find I was able to mitigate performance
> > penalties as noted above.  Some of the current implementation will get a
> > lot slower for "large arrays", but we can likely fix those later or in
> this
> > PR with a nested while loop instead of 2 for loops.
> >
> > Thanks,
> > Micah
> >
> >
> > On Saturday, August 10, 2019, Jacques Nadeau <ja...@apache.org> wrote:
> >
> >> This is a pretty massive change to the apis. I wonder how nasty it would
> >> be to just support both paths. Have you evaluated how complex that
> would be?
> >>
> >> On Wed, Aug 7, 2019 at 11:08 PM Micah Kornfield <em...@gmail.com>
> >> wrote:
> >>
> >>> After more investigation, it looks like Float8Benchmarks at least on my
> >>> machine are within the range of noise.
> >>>
> >>> For BitVectorHelper I pushed a new commit [1], seems to bring the
> >>> BitVectorHelper benchmarks back inline (and even with some improvement
> >>> for
> >>> getNullCountBenchmark).
> >>>
> >>> Benchmark                                        Mode  Cnt   Score
> >>>  Error
> >>>  Units
> >>> BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5   3.821 ±
> >>> 0.031
> >>>  ns/op
> >>> BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5  14.884 ±
> >>> 0.141
> >>>  ns/op
> >>>
> >>> I applied the same pattern to other loops that I could find, and for
> any
> >>> "for (long" loop on the critical path, I broke it up into two loops.
> the
> >>> first loop does iteration by integer, the second finishes off for any
> >>> long
> >>> values.  As a side note it seems like optimization for loops using long
> >>> counters at least have a semi-recent open bug for the JVM [2]
> >>>
> >>> Thanks,
> >>> Micah
> >>>
> >>> [1]
> >>>
> >>>
> https://github.com/apache/arrow/pull/5020/commits/2ea2c1ae83e3baa7b9a99a6d06276d968df41797
> >>> [2] https://bugs.openjdk.java.net/browse/JDK-8223051
> >>>
> >>> On Wed, Aug 7, 2019 at 8:11 PM Micah Kornfield <em...@gmail.com>
> >>> wrote:
> >>>
> >>> > Indeed, the BoundChecking and CheckNullForGet variables can make a
> big
> >>> > difference.  I didn't initially run the benchmarks with these turned
> on
> >>> > (you can see the result from above with Float8Benchmarks).  Here are
> >>> new
> >>> > numbers including with the flags enabled.  It looks like using longs
> >>> might
> >>> > be a little bit slower, I'll see what I can do to mitigate this.
> >>> >
> >>> > Ravindra also volunteered to try to benchmark the changes with
> Dremio's
> >>> > code on today's sync call.
> >>> >
> >>> > New
> >>> >
> >>> > Benchmark                                        Mode  Cnt   Score
> >>>  Error
> >>> > Units
> >>> >
> >>> > BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5   4.176 ±
> >>> 1.292
> >>> > ns/op
> >>> >
> >>> > BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5  26.102 ±
> >>> 0.700
> >>> > ns/op
> >>> >
> >>> > Float8Benchmarks.copyFromBenchmark   avgt    5  7.398 ± 0.084  us/op
> >>> >
> >>> > Float8Benchmarks.readWriteBenchmark  avgt    5  2.711 ± 0.057  us/op
> >>> >
> >>> >
> >>> >
> >>> > old
> >>> >
> >>> > BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5   3.828 ±
> >>> 0.030
> >>> > ns/op
> >>> >
> >>> > BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5  20.611 ±
> >>> 0.188
> >>> > ns/op
> >>> >
> >>> > Float8Benchmarks.copyFromBenchmark   avgt    5  6.597 ± 0.462  us/op
> >>> >
> >>> > Float8Benchmarks.readWriteBenchmark  avgt    5  2.615 ± 0.027  us/op
> >>> >
> >>> > On Wed, Aug 7, 2019 at 7:13 PM Fan Liya <li...@gmail.com>
> wrote:
> >>> >
> >>> >> Hi Gonzalo,
> >>> >>
> >>> >> Thanks for sharing the performance results.
> >>> >> I am wondering if you have turned off the flag
> >>> >> BoundsChecking#BOUNDS_CHECKING_ENABLED.
> >>> >> If not, the lower throughput should be expected.
> >>> >>
> >>> >> Best,
> >>> >> Liya Fan
> >>> >>
> >>> >> On Wed, Aug 7, 2019 at 10:23 PM Micah Kornfield <
> >>> emkornfield@gmail.com>
> >>> >> wrote:
> >>> >>
> >>> >>> Hi Gonzalo,
> >>> >>> Thank you for the feedback.  I wasn't aware of the JIT
> >>> implications.   At
> >>> >>> least on the benchmark run they don't seem to have an impact.
> >>> >>>
> >>> >>> If there are other benchmarks that people have that can validate if
> >>> this
> >>> >>> change will be problematic I would appreciate trying to run them
> >>> with the
> >>> >>> PR.  I will try to run the ones for zeroing/popcnt tonight to see
> if
> >>> >>> there
> >>> >>> is a change in those.
> >>> >>>
> >>> >>> -Micah
> >>> >>>
> >>> >>>
> >>> >>>
> >>> >>> On Wednesday, August 7, 2019, Gonzalo Ortiz Jaureguizar <
> >>> >>> golthiryus@gmail.com> wrote:
> >>> >>>
> >>> >>> > I would recommend to take care with this kind of changes.
> >>> >>> >
> >>> >>> > I didn't try Arrow in more than one year, but by then the
> >>> performance
> >>> >>> was
> >>> >>> > quite bad in comparison with plain byte buffer access
> >>> >>> > (see http://git.net/apache-arrow-development/msg02353.html *)
> and
> >>> >>> > there are several optimizations that the JVM (specifically, C2)
> >>> does
> >>> >>> not
> >>> >>> > apply when dealing with int instead of longs. One of the
> >>> >>> > most commons is the loop unrolling and vectorization.
> >>> >>> >
> >>> >>> > * It doesn't seem the best way to reference an old email on the
> >>> list,
> >>> >>> but
> >>> >>> > it is the only result shown by Google
> >>> >>> >
> >>> >>> > El mié., 7 ago. 2019 a las 11:42, Fan Liya (<
> liya.fan03@gmail.com
> >>> >)
> >>> >>> > escribió:
> >>> >>> >
> >>> >>> >> Hi Micah,
> >>> >>> >>
> >>> >>> >> Thanks for your effort. The performance result looks good.
> >>> >>> >>
> >>> >>> >> As you indicated, ArrowBuf will take additional 12 bytes (4
> bytes
> >>> for
> >>> >>> each
> >>> >>> >> of length, write index, and read index).
> >>> >>> >> Similar overheads also exist for vectors like
> >>> BaseFixedWidthVector,
> >>> >>> >> BaseVariableWidthVector, etc.
> >>> >>> >>
> >>> >>> >> IMO, such overheads are small enough to justify the change.
> >>> >>> >> Let's check if there are other overheads.
> >>> >>> >>
> >>> >>> >> Best,
> >>> >>> >> Liya Fan
> >>> >>> >>
> >>> >>> >> On Wed, Aug 7, 2019 at 3:30 PM Micah Kornfield <
> >>> emkornfield@gmail.com
> >>> >>> >
> >>> >>> >> wrote:
> >>> >>> >>
> >>> >>> >> > Hi Liya Fan,
> >>> >>> >> > Based on the Float8Benchmark there does not seem to be any
> >>> >>> meaningful
> >>> >>> >> > performance difference on my machine.  At least for me, the
> >>> >>> benchmarks
> >>> >>> >> are
> >>> >>> >> > not stable enough to say one is faster than the other (I've
> >>> pasted
> >>> >>> >> results
> >>> >>> >> > below).  That being said my machine isn't necessarily the most
> >>> >>> reliable
> >>> >>> >> for
> >>> >>> >> > benchmarking.
> >>> >>> >> >
> >>> >>> >> > On an intuitive level, this makes sense to me,  for the most
> >>> part it
> >>> >>> >> seems
> >>> >>> >> > like the change just moves casting from "int" to "long"
> further
> >>> up
> >>> >>> the
> >>> >>> >> > stack  for  "PlatformDepdendent" operations.  If there are
> other
> >>> >>> >> benchmarks
> >>> >>> >> > that you think are worth running let me know.
> >>> >>> >> >
> >>> >>> >> > One downside performance wise I think for his change is it
> >>> >>> increases the
> >>> >>> >> > size of ArrowBuf objects, which I suppose could influence
> cache
> >>> >>> misses
> >>> >>> >> at
> >>> >>> >> > some level or increase the size of call-stacks, but this
> doesn't
> >>> >>> seem to
> >>> >>> >> > show up in the benchmark..
> >>> >>> >> >
> >>> >>> >> > Thanks,
> >>> >>> >> > Micah
> >>> >>> >> >
> >>> >>> >> > Sample benchmark numbers:
> >>> >>> >> >
> >>> >>> >> > [New Code]
> >>> >>> >> > Benchmark                            Mode  Cnt   Score   Error
> >>> >>> Units
> >>> >>> >> > Float8Benchmarks.copyFromBenchmark   avgt    5  15.441 ± 0.469
> >>> >>> us/op
> >>> >>> >> > Float8Benchmarks.readWriteBenchmark  avgt    5  14.057 ± 0.115
> >>> >>> us/op
> >>> >>> >> >
> >>> >>> >> >
> >>> >>> >> > [Old code]
> >>> >>> >> > Benchmark                            Mode  Cnt   Score   Error
> >>> >>> Units
> >>> >>> >> > Float8Benchmarks.copyFromBenchmark   avgt    5  16.248 ± 1.409
> >>> >>> us/op
> >>> >>> >> > Float8Benchmarks.readWriteBenchmark  avgt    5  14.150 ± 0.084
> >>> >>> us/op
> >>> >>> >> >
> >>> >>> >> > On Tue, Aug 6, 2019 at 1:18 AM Fan Liya <liya.fan03@gmail.com
> >
> >>> >>> wrote:
> >>> >>> >> >
> >>> >>> >> >> Hi Micah,
> >>> >>> >> >>
> >>> >>> >> >> Thanks a lot for doing this.
> >>> >>> >> >>
> >>> >>> >> >> I am a little concerned about if there is any negative
> >>> performance
> >>> >>> >> impact
> >>> >>> >> >> on the current 32-bit-length based applications.
> >>> >>> >> >> Can we do some performance comparison on our existing
> >>> benchmarks?
> >>> >>> >> >>
> >>> >>> >> >> Best,
> >>> >>> >> >> Liya Fan
> >>> >>> >> >>
> >>> >>> >> >>
> >>> >>> >> >> On Tue, Aug 6, 2019 at 3:35 PM Micah Kornfield <
> >>> >>> emkornfield@gmail.com>
> >>> >>> >> >> wrote:
> >>> >>> >> >>
> >>> >>> >> >>> There have been some previous discussions on the mailing
> about
> >>> >>> >> supporting
> >>> >>> >> >>> 64-bit lengths for  Java ValueVectors (this is what the IPC
> >>> >>> >> specification
> >>> >>> >> >>> and C++ support).  I created a PR [1] that changes all APIs
> >>> that I
> >>> >>> >> could
> >>> >>> >> >>> find that take an index to take an "long" instead of an
> "int"
> >>> (and
> >>> >>> >> >>> similarly change "size/rowcount" APIs).
> >>> >>> >> >>>
> >>> >>> >> >>> It is a big change, so I think it is worth discussing if it
> is
> >>> >>> >> something
> >>> >>> >> >>> we
> >>> >>> >> >>> still want to move forward with.  It would be nice to come
> to
> >>> a
> >>> >>> >> >>> conclusion
> >>> >>> >> >>> quickly, ideally in the next few days, to avoid a lot of
> merge
> >>> >>> >> conflicts.
> >>> >>> >> >>>
> >>> >>> >> >>> The reason I did this work now is the C++ implementation has
> >>> added
> >>> >>> >> >>> support
> >>> >>> >> >>> for LargeList, LargeBinary and LargeString arrays and based
> on
> >>> >>> prior
> >>> >>> >> >>> discussions we need to have similar support in Java before
> our
> >>> >>> next
> >>> >>> >> >>> release. Support 64-bit indexes means we can have full
> >>> >>> compatibility
> >>> >>> >> and
> >>> >>> >> >>> make the most use of the types in Java.
> >>> >>> >> >>>
> >>> >>> >> >>> Look forward to hearing feedback.
> >>> >>> >> >>>
> >>> >>> >> >>> Thanks,
> >>> >>> >> >>> Micah
> >>> >>> >> >>>
> >>> >>> >> >>> [1] https://github.com/apache/arrow/pull/5020
> >>> >>> >> >>>
> >>> >>> >> >>
> >>> >>> >>
> >>> >>> >
> >>> >>>
> >>> >>
> >>>
> >>
>

Re: [Discuss][Java] 64-bit lengths for ValueVectors

Posted by Jacques Nadeau <ja...@apache.org>.

Hey Micah, I didn't have a particular path in mind. Was thinking more along
the lines of extra methods as opposed to separate classes.

Arrow hasn't historically been a place where we're writing algorithms in
Java so the fact that they aren't there doesn't mean they don't exist. We
have a large amount of code that depends on the current behavior that is
deployed in hundreds of customer clusters (you can peruse our dremio repo
to see how extensively we leverage Arrow if interested). Trying to
determine the ramifications of these changes would be challenging and time
consuming against all the different ways we interact with the Arrow Java
library. I'm not saying I'm against this but want to make sure we've
explored all less disruptive options before considering changing something
this fundamental (especially when I generally hold the view that large cell
counts against massive contiguous memory is an anti pattern to scalable
analytical processing--purely subjective of course).

On Sat, Aug 10, 2019, 4:17 PM Micah Kornfield <em...@gmail.com> wrote:

> Hi Jacques,
> What avenue were you thinking for supporting both paths?   I didn't want
> to pursue a different class hierarchy, because I felt like that would
> effectively fork the code base, but that is potentially an option that
> would allow us to have a complete reference implementation in Java that can
> fully interact with C++, without major changes to this code.
>
> For supporting both APIs on the same classes/interfaces, I think they
> roughly fall into three categories, changes to input parameters, changes to
> output parameters and algorithm changes.
>
> For inputs, changing from int to long is essentially a no-op from the
> compiler perspective.  From the limited micro-benchmarking this also
> doesn't seem to have a performance impact.  So we could keep two versions
> of the methods that only differ on inputs, but it is not clear what the
> value of that would be.
>
> For outputs, we can't support methods "long getLength()" and "int
> getLength()" in the same class, so we would be forced into something like
> "long getLength(boolean dummy)" which I think is a less desirable.
>
> For algorithm changes, there did not appear to be too many places where we
> actually loop over all elements (it is quite possible I missed something
> here), the ones that I did find I was able to mitigate performance
> penalties as noted above.  Some of the current implementation will get a
> lot slower for "large arrays", but we can likely fix those later or in this
> PR with a nested while loop instead of 2 for loops.
>
> Thanks,
> Micah
>
>
> On Saturday, August 10, 2019, Jacques Nadeau <ja...@apache.org> wrote:
>
>> This is a pretty massive change to the apis. I wonder how nasty it would
>> be to just support both paths. Have you evaluated how complex that would be?
>>
>> On Wed, Aug 7, 2019 at 11:08 PM Micah Kornfield <em...@gmail.com>
>> wrote:
>>
>>> After more investigation, it looks like Float8Benchmarks at least on my
>>> machine are within the range of noise.
>>>
>>> For BitVectorHelper I pushed a new commit [1], seems to bring the
>>> BitVectorHelper benchmarks back inline (and even with some improvement
>>> for
>>> getNullCountBenchmark).
>>>
>>> Benchmark                                        Mode  Cnt   Score
>>>  Error
>>>  Units
>>> BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5   3.821 ±
>>> 0.031
>>>  ns/op
>>> BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5  14.884 ±
>>> 0.141
>>>  ns/op
>>>
>>> I applied the same pattern to other loops that I could find, and for any
>>> "for (long" loop on the critical path, I broke it up into two loops.  the
>>> first loop does iteration by integer, the second finishes off for any
>>> long
>>> values.  As a side note it seems like optimization for loops using long
>>> counters at least have a semi-recent open bug for the JVM [2]
>>>
>>> Thanks,
>>> Micah
>>>
>>> [1]
>>>
>>> https://github.com/apache/arrow/pull/5020/commits/2ea2c1ae83e3baa7b9a99a6d06276d968df41797
>>> [2] https://bugs.openjdk.java.net/browse/JDK-8223051
>>>
>>> On Wed, Aug 7, 2019 at 8:11 PM Micah Kornfield <em...@gmail.com>
>>> wrote:
>>>
>>> > Indeed, the BoundChecking and CheckNullForGet variables can make a big
>>> > difference.  I didn't initially run the benchmarks with these turned on
>>> > (you can see the result from above with Float8Benchmarks).  Here are
>>> new
>>> > numbers including with the flags enabled.  It looks like using longs
>>> might
>>> > be a little bit slower, I'll see what I can do to mitigate this.
>>> >
>>> > Ravindra also volunteered to try to benchmark the changes with Dremio's
>>> > code on today's sync call.
>>> >
>>> > New
>>> >
>>> > Benchmark                                        Mode  Cnt   Score
>>>  Error
>>> > Units
>>> >
>>> > BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5   4.176 ±
>>> 1.292
>>> > ns/op
>>> >
>>> > BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5  26.102 ±
>>> 0.700
>>> > ns/op
>>> >
>>> > Float8Benchmarks.copyFromBenchmark   avgt    5  7.398 ± 0.084  us/op
>>> >
>>> > Float8Benchmarks.readWriteBenchmark  avgt    5  2.711 ± 0.057  us/op
>>> >
>>> >
>>> >
>>> > old
>>> >
>>> > BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5   3.828 ±
>>> 0.030
>>> > ns/op
>>> >
>>> > BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5  20.611 ±
>>> 0.188
>>> > ns/op
>>> >
>>> > Float8Benchmarks.copyFromBenchmark   avgt    5  6.597 ± 0.462  us/op
>>> >
>>> > Float8Benchmarks.readWriteBenchmark  avgt    5  2.615 ± 0.027  us/op
>>> >
>>> > On Wed, Aug 7, 2019 at 7:13 PM Fan Liya <li...@gmail.com> wrote:
>>> >
>>> >> Hi Gonzalo,
>>> >>
>>> >> Thanks for sharing the performance results.
>>> >> I am wondering if you have turned off the flag
>>> >> BoundsChecking#BOUNDS_CHECKING_ENABLED.
>>> >> If not, the lower throughput should be expected.
>>> >>
>>> >> Best,
>>> >> Liya Fan
>>> >>
>>> >> On Wed, Aug 7, 2019 at 10:23 PM Micah Kornfield <
>>> emkornfield@gmail.com>
>>> >> wrote:
>>> >>
>>> >>> Hi Gonzalo,
>>> >>> Thank you for the feedback.  I wasn't aware of the JIT
>>> implications.   At
>>> >>> least on the benchmark run they don't seem to have an impact.
>>> >>>
>>> >>> If there are other benchmarks that people have that can validate if
>>> this
>>> >>> change will be problematic I would appreciate trying to run them
>>> with the
>>> >>> PR.  I will try to run the ones for zeroing/popcnt tonight to see if
>>> >>> there
>>> >>> is a change in those.
>>> >>>
>>> >>> -Micah
>>> >>>
>>> >>>
>>> >>>
>>> >>> On Wednesday, August 7, 2019, Gonzalo Ortiz Jaureguizar <
>>> >>> golthiryus@gmail.com> wrote:
>>> >>>
>>> >>> > I would recommend to take care with this kind of changes.
>>> >>> >
>>> >>> > I didn't try Arrow in more than one year, but by then the
>>> performance
>>> >>> was
>>> >>> > quite bad in comparison with plain byte buffer access
>>> >>> > (see http://git.net/apache-arrow-development/msg02353.html *) and
>>> >>> > there are several optimizations that the JVM (specifically, C2)
>>> does
>>> >>> not
>>> >>> > apply when dealing with int instead of longs. One of the
>>> >>> > most commons is the loop unrolling and vectorization.
>>> >>> >
>>> >>> > * It doesn't seem the best way to reference an old email on the
>>> list,
>>> >>> but
>>> >>> > it is the only result shown by Google
>>> >>> >
>>> >>> > El mié., 7 ago. 2019 a las 11:42, Fan Liya (<liya.fan03@gmail.com
>>> >)
>>> >>> > escribió:
>>> >>> >
>>> >>> >> Hi Micah,
>>> >>> >>
>>> >>> >> Thanks for your effort. The performance result looks good.
>>> >>> >>
>>> >>> >> As you indicated, ArrowBuf will take additional 12 bytes (4 bytes
>>> for
>>> >>> each
>>> >>> >> of length, write index, and read index).
>>> >>> >> Similar overheads also exist for vectors like
>>> BaseFixedWidthVector,
>>> >>> >> BaseVariableWidthVector, etc.
>>> >>> >>
>>> >>> >> IMO, such overheads are small enough to justify the change.
>>> >>> >> Let's check if there are other overheads.
>>> >>> >>
>>> >>> >> Best,
>>> >>> >> Liya Fan
>>> >>> >>
>>> >>> >> On Wed, Aug 7, 2019 at 3:30 PM Micah Kornfield <
>>> emkornfield@gmail.com
>>> >>> >
>>> >>> >> wrote:
>>> >>> >>
>>> >>> >> > Hi Liya Fan,
>>> >>> >> > Based on the Float8Benchmark there does not seem to be any
>>> >>> meaningful
>>> >>> >> > performance difference on my machine.  At least for me, the
>>> >>> benchmarks
>>> >>> >> are
>>> >>> >> > not stable enough to say one is faster than the other (I've
>>> pasted
>>> >>> >> results
>>> >>> >> > below).  That being said my machine isn't necessarily the most
>>> >>> reliable
>>> >>> >> for
>>> >>> >> > benchmarking.
>>> >>> >> >
>>> >>> >> > On an intuitive level, this makes sense to me,  for the most
>>> part it
>>> >>> >> seems
>>> >>> >> > like the change just moves casting from "int" to "long" further
>>> up
>>> >>> the
>>> >>> >> > stack  for  "PlatformDepdendent" operations.  If there are other
>>> >>> >> benchmarks
>>> >>> >> > that you think are worth running let me know.
>>> >>> >> >
>>> >>> >> > One downside performance wise I think for his change is it
>>> >>> increases the
>>> >>> >> > size of ArrowBuf objects, which I suppose could influence cache
>>> >>> misses
>>> >>> >> at
>>> >>> >> > some level or increase the size of call-stacks, but this doesn't
>>> >>> seem to
>>> >>> >> > show up in the benchmark..
>>> >>> >> >
>>> >>> >> > Thanks,
>>> >>> >> > Micah
>>> >>> >> >
>>> >>> >> > Sample benchmark numbers:
>>> >>> >> >
>>> >>> >> > [New Code]
>>> >>> >> > Benchmark                            Mode  Cnt   Score   Error
>>> >>> Units
>>> >>> >> > Float8Benchmarks.copyFromBenchmark   avgt    5  15.441 ± 0.469
>>> >>> us/op
>>> >>> >> > Float8Benchmarks.readWriteBenchmark  avgt    5  14.057 ± 0.115
>>> >>> us/op
>>> >>> >> >
>>> >>> >> >
>>> >>> >> > [Old code]
>>> >>> >> > Benchmark                            Mode  Cnt   Score   Error
>>> >>> Units
>>> >>> >> > Float8Benchmarks.copyFromBenchmark   avgt    5  16.248 ± 1.409
>>> >>> us/op
>>> >>> >> > Float8Benchmarks.readWriteBenchmark  avgt    5  14.150 ± 0.084
>>> >>> us/op
>>> >>> >> >
>>> >>> >> > On Tue, Aug 6, 2019 at 1:18 AM Fan Liya <li...@gmail.com>
>>> >>> wrote:
>>> >>> >> >
>>> >>> >> >> Hi Micah,
>>> >>> >> >>
>>> >>> >> >> Thanks a lot for doing this.
>>> >>> >> >>
>>> >>> >> >> I am a little concerned about if there is any negative
>>> performance
>>> >>> >> impact
>>> >>> >> >> on the current 32-bit-length based applications.
>>> >>> >> >> Can we do some performance comparison on our existing
>>> benchmarks?
>>> >>> >> >>
>>> >>> >> >> Best,
>>> >>> >> >> Liya Fan
>>> >>> >> >>
>>> >>> >> >>
>>> >>> >> >> On Tue, Aug 6, 2019 at 3:35 PM Micah Kornfield <
>>> >>> emkornfield@gmail.com>
>>> >>> >> >> wrote:
>>> >>> >> >>
>>> >>> >> >>> There have been some previous discussions on the mailing about
>>> >>> >> supporting
>>> >>> >> >>> 64-bit lengths for  Java ValueVectors (this is what the IPC
>>> >>> >> specification
>>> >>> >> >>> and C++ support).  I created a PR [1] that changes all APIs
>>> that I
>>> >>> >> could
>>> >>> >> >>> find that take an index to take an "long" instead of an "int"
>>> (and
>>> >>> >> >>> similarly change "size/rowcount" APIs).
>>> >>> >> >>>
>>> >>> >> >>> It is a big change, so I think it is worth discussing if it is
>>> >>> >> something
>>> >>> >> >>> we
>>> >>> >> >>> still want to move forward with.  It would be nice to come to
>>> a
>>> >>> >> >>> conclusion
>>> >>> >> >>> quickly, ideally in the next few days, to avoid a lot of merge
>>> >>> >> conflicts.
>>> >>> >> >>>
>>> >>> >> >>> The reason I did this work now is the C++ implementation has
>>> added
>>> >>> >> >>> support
>>> >>> >> >>> for LargeList, LargeBinary and LargeString arrays and based on
>>> >>> prior
>>> >>> >> >>> discussions we need to have similar support in Java before our
>>> >>> next
>>> >>> >> >>> release. Support 64-bit indexes means we can have full
>>> >>> compatibility
>>> >>> >> and
>>> >>> >> >>> make the most use of the types in Java.
>>> >>> >> >>>
>>> >>> >> >>> Look forward to hearing feedback.
>>> >>> >> >>>
>>> >>> >> >>> Thanks,
>>> >>> >> >>> Micah
>>> >>> >> >>>
>>> >>> >> >>> [1] https://github.com/apache/arrow/pull/5020
>>> >>> >> >>>
>>> >>> >> >>
>>> >>> >>
>>> >>> >
>>> >>>
>>> >>
>>>
>>

Re: [Discuss][Java] 64-bit lengths for ValueVectors

Posted by Micah Kornfield <em...@gmail.com>.

Hi Jacques,
What avenue were you thinking for supporting both paths?   I didn't want to
pursue a different class hierarchy, because I felt like that would
effectively fork the code base, but that is potentially an option that
would allow us to have a complete reference implementation in Java that can
fully interact with C++, without major changes to this code.

For supporting both APIs on the same classes/interfaces, I think they
roughly fall into three categories, changes to input parameters, changes to
output parameters and algorithm changes.

For inputs, changing from int to long is essentially a no-op from the
compiler perspective.  From the limited micro-benchmarking this also
doesn't seem to have a performance impact.  So we could keep two versions
of the methods that only differ on inputs, but it is not clear what the
value of that would be.

For outputs, we can't support methods "long getLength()" and "int
getLength()" in the same class, so we would be forced into something like
"long getLength(boolean dummy)" which I think is a less desirable.

For algorithm changes, there did not appear to be too many places where we
actually loop over all elements (it is quite possible I missed something
here), the ones that I did find I was able to mitigate performance
penalties as noted above.  Some of the current implementation will get a
lot slower for "large arrays", but we can likely fix those later or in this
PR with a nested while loop instead of 2 for loops.

Thanks,
Micah


On Saturday, August 10, 2019, Jacques Nadeau <ja...@apache.org> wrote:

> This is a pretty massive change to the apis. I wonder how nasty it would
> be to just support both paths. Have you evaluated how complex that would be?
>
> On Wed, Aug 7, 2019 at 11:08 PM Micah Kornfield <em...@gmail.com>
> wrote:
>
>> After more investigation, it looks like Float8Benchmarks at least on my
>> machine are within the range of noise.
>>
>> For BitVectorHelper I pushed a new commit [1], seems to bring the
>> BitVectorHelper benchmarks back inline (and even with some improvement for
>> getNullCountBenchmark).
>>
>> Benchmark                                        Mode  Cnt   Score   Error
>>  Units
>> BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5   3.821 ± 0.031
>>  ns/op
>> BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5  14.884 ± 0.141
>>  ns/op
>>
>> I applied the same pattern to other loops that I could find, and for any
>> "for (long" loop on the critical path, I broke it up into two loops.  the
>> first loop does iteration by integer, the second finishes off for any long
>> values.  As a side note it seems like optimization for loops using long
>> counters at least have a semi-recent open bug for the JVM [2]
>>
>> Thanks,
>> Micah
>>
>> [1]
>>
>> https://github.com/apache/arrow/pull/5020/commits/2ea2c1ae83e3baa7b9a99a6d06276d968df41797
>> [2] https://bugs.openjdk.java.net/browse/JDK-8223051
>>
>> On Wed, Aug 7, 2019 at 8:11 PM Micah Kornfield <em...@gmail.com>
>> wrote:
>>
>> > Indeed, the BoundChecking and CheckNullForGet variables can make a big
>> > difference.  I didn't initially run the benchmarks with these turned on
>> > (you can see the result from above with Float8Benchmarks).  Here are new
>> > numbers including with the flags enabled.  It looks like using longs
>> might
>> > be a little bit slower, I'll see what I can do to mitigate this.
>> >
>> > Ravindra also volunteered to try to benchmark the changes with Dremio's
>> > code on today's sync call.
>> >
>> > New
>> >
>> > Benchmark                                        Mode  Cnt   Score
>>  Error
>> > Units
>> >
>> > BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5   4.176 ±
>> 1.292
>> > ns/op
>> >
>> > BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5  26.102 ±
>> 0.700
>> > ns/op
>> >
>> > Float8Benchmarks.copyFromBenchmark   avgt    5  7.398 ± 0.084  us/op
>> >
>> > Float8Benchmarks.readWriteBenchmark  avgt    5  2.711 ± 0.057  us/op
>> >
>> >
>> >
>> > old
>> >
>> > BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5   3.828 ±
>> 0.030
>> > ns/op
>> >
>> > BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5  20.611 ±
>> 0.188
>> > ns/op
>> >
>> > Float8Benchmarks.copyFromBenchmark   avgt    5  6.597 ± 0.462  us/op
>> >
>> > Float8Benchmarks.readWriteBenchmark  avgt    5  2.615 ± 0.027  us/op
>> >
>> > On Wed, Aug 7, 2019 at 7:13 PM Fan Liya <li...@gmail.com> wrote:
>> >
>> >> Hi Gonzalo,
>> >>
>> >> Thanks for sharing the performance results.
>> >> I am wondering if you have turned off the flag
>> >> BoundsChecking#BOUNDS_CHECKING_ENABLED.
>> >> If not, the lower throughput should be expected.
>> >>
>> >> Best,
>> >> Liya Fan
>> >>
>> >> On Wed, Aug 7, 2019 at 10:23 PM Micah Kornfield <emkornfield@gmail.com
>> >
>> >> wrote:
>> >>
>> >>> Hi Gonzalo,
>> >>> Thank you for the feedback.  I wasn't aware of the JIT implications.
>>  At
>> >>> least on the benchmark run they don't seem to have an impact.
>> >>>
>> >>> If there are other benchmarks that people have that can validate if
>> this
>> >>> change will be problematic I would appreciate trying to run them with
>> the
>> >>> PR.  I will try to run the ones for zeroing/popcnt tonight to see if
>> >>> there
>> >>> is a change in those.
>> >>>
>> >>> -Micah
>> >>>
>> >>>
>> >>>
>> >>> On Wednesday, August 7, 2019, Gonzalo Ortiz Jaureguizar <
>> >>> golthiryus@gmail.com> wrote:
>> >>>
>> >>> > I would recommend to take care with this kind of changes.
>> >>> >
>> >>> > I didn't try Arrow in more than one year, but by then the
>> performance
>> >>> was
>> >>> > quite bad in comparison with plain byte buffer access
>> >>> > (see http://git.net/apache-arrow-development/msg02353.html *) and
>> >>> > there are several optimizations that the JVM (specifically, C2) does
>> >>> not
>> >>> > apply when dealing with int instead of longs. One of the
>> >>> > most commons is the loop unrolling and vectorization.
>> >>> >
>> >>> > * It doesn't seem the best way to reference an old email on the
>> list,
>> >>> but
>> >>> > it is the only result shown by Google
>> >>> >
>> >>> > El mié., 7 ago. 2019 a las 11:42, Fan Liya (<li...@gmail.com>)
>> >>> > escribió:
>> >>> >
>> >>> >> Hi Micah,
>> >>> >>
>> >>> >> Thanks for your effort. The performance result looks good.
>> >>> >>
>> >>> >> As you indicated, ArrowBuf will take additional 12 bytes (4 bytes
>> for
>> >>> each
>> >>> >> of length, write index, and read index).
>> >>> >> Similar overheads also exist for vectors like BaseFixedWidthVector,
>> >>> >> BaseVariableWidthVector, etc.
>> >>> >>
>> >>> >> IMO, such overheads are small enough to justify the change.
>> >>> >> Let's check if there are other overheads.
>> >>> >>
>> >>> >> Best,
>> >>> >> Liya Fan
>> >>> >>
>> >>> >> On Wed, Aug 7, 2019 at 3:30 PM Micah Kornfield <
>> emkornfield@gmail.com
>> >>> >
>> >>> >> wrote:
>> >>> >>
>> >>> >> > Hi Liya Fan,
>> >>> >> > Based on the Float8Benchmark there does not seem to be any
>> >>> meaningful
>> >>> >> > performance difference on my machine.  At least for me, the
>> >>> benchmarks
>> >>> >> are
>> >>> >> > not stable enough to say one is faster than the other (I've
>> pasted
>> >>> >> results
>> >>> >> > below).  That being said my machine isn't necessarily the most
>> >>> reliable
>> >>> >> for
>> >>> >> > benchmarking.
>> >>> >> >
>> >>> >> > On an intuitive level, this makes sense to me,  for the most
>> part it
>> >>> >> seems
>> >>> >> > like the change just moves casting from "int" to "long" further
>> up
>> >>> the
>> >>> >> > stack  for  "PlatformDepdendent" operations.  If there are other
>> >>> >> benchmarks
>> >>> >> > that you think are worth running let me know.
>> >>> >> >
>> >>> >> > One downside performance wise I think for his change is it
>> >>> increases the
>> >>> >> > size of ArrowBuf objects, which I suppose could influence cache
>> >>> misses
>> >>> >> at
>> >>> >> > some level or increase the size of call-stacks, but this doesn't
>> >>> seem to
>> >>> >> > show up in the benchmark..
>> >>> >> >
>> >>> >> > Thanks,
>> >>> >> > Micah
>> >>> >> >
>> >>> >> > Sample benchmark numbers:
>> >>> >> >
>> >>> >> > [New Code]
>> >>> >> > Benchmark                            Mode  Cnt   Score   Error
>> >>> Units
>> >>> >> > Float8Benchmarks.copyFromBenchmark   avgt    5  15.441 ± 0.469
>> >>> us/op
>> >>> >> > Float8Benchmarks.readWriteBenchmark  avgt    5  14.057 ± 0.115
>> >>> us/op
>> >>> >> >
>> >>> >> >
>> >>> >> > [Old code]
>> >>> >> > Benchmark                            Mode  Cnt   Score   Error
>> >>> Units
>> >>> >> > Float8Benchmarks.copyFromBenchmark   avgt    5  16.248 ± 1.409
>> >>> us/op
>> >>> >> > Float8Benchmarks.readWriteBenchmark  avgt    5  14.150 ± 0.084
>> >>> us/op
>> >>> >> >
>> >>> >> > On Tue, Aug 6, 2019 at 1:18 AM Fan Liya <li...@gmail.com>
>> >>> wrote:
>> >>> >> >
>> >>> >> >> Hi Micah,
>> >>> >> >>
>> >>> >> >> Thanks a lot for doing this.
>> >>> >> >>
>> >>> >> >> I am a little concerned about if there is any negative
>> performance
>> >>> >> impact
>> >>> >> >> on the current 32-bit-length based applications.
>> >>> >> >> Can we do some performance comparison on our existing
>> benchmarks?
>> >>> >> >>
>> >>> >> >> Best,
>> >>> >> >> Liya Fan
>> >>> >> >>
>> >>> >> >>
>> >>> >> >> On Tue, Aug 6, 2019 at 3:35 PM Micah Kornfield <
>> >>> emkornfield@gmail.com>
>> >>> >> >> wrote:
>> >>> >> >>
>> >>> >> >>> There have been some previous discussions on the mailing about
>> >>> >> supporting
>> >>> >> >>> 64-bit lengths for  Java ValueVectors (this is what the IPC
>> >>> >> specification
>> >>> >> >>> and C++ support).  I created a PR [1] that changes all APIs
>> that I
>> >>> >> could
>> >>> >> >>> find that take an index to take an "long" instead of an "int"
>> (and
>> >>> >> >>> similarly change "size/rowcount" APIs).
>> >>> >> >>>
>> >>> >> >>> It is a big change, so I think it is worth discussing if it is
>> >>> >> something
>> >>> >> >>> we
>> >>> >> >>> still want to move forward with.  It would be nice to come to a
>> >>> >> >>> conclusion
>> >>> >> >>> quickly, ideally in the next few days, to avoid a lot of merge
>> >>> >> conflicts.
>> >>> >> >>>
>> >>> >> >>> The reason I did this work now is the C++ implementation has
>> added
>> >>> >> >>> support
>> >>> >> >>> for LargeList, LargeBinary and LargeString arrays and based on
>> >>> prior
>> >>> >> >>> discussions we need to have similar support in Java before our
>> >>> next
>> >>> >> >>> release. Support 64-bit indexes means we can have full
>> >>> compatibility
>> >>> >> and
>> >>> >> >>> make the most use of the types in Java.
>> >>> >> >>>
>> >>> >> >>> Look forward to hearing feedback.
>> >>> >> >>>
>> >>> >> >>> Thanks,
>> >>> >> >>> Micah
>> >>> >> >>>
>> >>> >> >>> [1] https://github.com/apache/arrow/pull/5020
>> >>> >> >>>
>> >>> >> >>
>> >>> >>
>> >>> >
>> >>>
>> >>
>>
>

Re: [Discuss][Java] 64-bit lengths for ValueVectors

Posted by Jacques Nadeau <ja...@apache.org>.

This is a pretty massive change to the apis. I wonder how nasty it would be
to just support both paths. Have you evaluated how complex that would be?

On Wed, Aug 7, 2019 at 11:08 PM Micah Kornfield <em...@gmail.com>
wrote:

> After more investigation, it looks like Float8Benchmarks at least on my
> machine are within the range of noise.
>
> For BitVectorHelper I pushed a new commit [1], seems to bring the
> BitVectorHelper benchmarks back inline (and even with some improvement for
> getNullCountBenchmark).
>
> Benchmark                                        Mode  Cnt   Score   Error
>  Units
> BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5   3.821 ± 0.031
>  ns/op
> BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5  14.884 ± 0.141
>  ns/op
>
> I applied the same pattern to other loops that I could find, and for any
> "for (long" loop on the critical path, I broke it up into two loops.  the
> first loop does iteration by integer, the second finishes off for any long
> values.  As a side note it seems like optimization for loops using long
> counters at least have a semi-recent open bug for the JVM [2]
>
> Thanks,
> Micah
>
> [1]
>
> https://github.com/apache/arrow/pull/5020/commits/2ea2c1ae83e3baa7b9a99a6d06276d968df41797
> [2] https://bugs.openjdk.java.net/browse/JDK-8223051
>
> On Wed, Aug 7, 2019 at 8:11 PM Micah Kornfield <em...@gmail.com>
> wrote:
>
> > Indeed, the BoundChecking and CheckNullForGet variables can make a big
> > difference.  I didn't initially run the benchmarks with these turned on
> > (you can see the result from above with Float8Benchmarks).  Here are new
> > numbers including with the flags enabled.  It looks like using longs
> might
> > be a little bit slower, I'll see what I can do to mitigate this.
> >
> > Ravindra also volunteered to try to benchmark the changes with Dremio's
> > code on today's sync call.
> >
> > New
> >
> > Benchmark                                        Mode  Cnt   Score
>  Error
> > Units
> >
> > BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5   4.176 ±
> 1.292
> > ns/op
> >
> > BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5  26.102 ±
> 0.700
> > ns/op
> >
> > Float8Benchmarks.copyFromBenchmark   avgt    5  7.398 ± 0.084  us/op
> >
> > Float8Benchmarks.readWriteBenchmark  avgt    5  2.711 ± 0.057  us/op
> >
> >
> >
> > old
> >
> > BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5   3.828 ±
> 0.030
> > ns/op
> >
> > BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5  20.611 ±
> 0.188
> > ns/op
> >
> > Float8Benchmarks.copyFromBenchmark   avgt    5  6.597 ± 0.462  us/op
> >
> > Float8Benchmarks.readWriteBenchmark  avgt    5  2.615 ± 0.027  us/op
> >
> > On Wed, Aug 7, 2019 at 7:13 PM Fan Liya <li...@gmail.com> wrote:
> >
> >> Hi Gonzalo,
> >>
> >> Thanks for sharing the performance results.
> >> I am wondering if you have turned off the flag
> >> BoundsChecking#BOUNDS_CHECKING_ENABLED.
> >> If not, the lower throughput should be expected.
> >>
> >> Best,
> >> Liya Fan
> >>
> >> On Wed, Aug 7, 2019 at 10:23 PM Micah Kornfield <em...@gmail.com>
> >> wrote:
> >>
> >>> Hi Gonzalo,
> >>> Thank you for the feedback.  I wasn't aware of the JIT implications.
>  At
> >>> least on the benchmark run they don't seem to have an impact.
> >>>
> >>> If there are other benchmarks that people have that can validate if
> this
> >>> change will be problematic I would appreciate trying to run them with
> the
> >>> PR.  I will try to run the ones for zeroing/popcnt tonight to see if
> >>> there
> >>> is a change in those.
> >>>
> >>> -Micah
> >>>
> >>>
> >>>
> >>> On Wednesday, August 7, 2019, Gonzalo Ortiz Jaureguizar <
> >>> golthiryus@gmail.com> wrote:
> >>>
> >>> > I would recommend to take care with this kind of changes.
> >>> >
> >>> > I didn't try Arrow in more than one year, but by then the performance
> >>> was
> >>> > quite bad in comparison with plain byte buffer access
> >>> > (see http://git.net/apache-arrow-development/msg02353.html *) and
> >>> > there are several optimizations that the JVM (specifically, C2) does
> >>> not
> >>> > apply when dealing with int instead of longs. One of the
> >>> > most commons is the loop unrolling and vectorization.
> >>> >
> >>> > * It doesn't seem the best way to reference an old email on the list,
> >>> but
> >>> > it is the only result shown by Google
> >>> >
> >>> > El mié., 7 ago. 2019 a las 11:42, Fan Liya (<li...@gmail.com>)
> >>> > escribió:
> >>> >
> >>> >> Hi Micah,
> >>> >>
> >>> >> Thanks for your effort. The performance result looks good.
> >>> >>
> >>> >> As you indicated, ArrowBuf will take additional 12 bytes (4 bytes
> for
> >>> each
> >>> >> of length, write index, and read index).
> >>> >> Similar overheads also exist for vectors like BaseFixedWidthVector,
> >>> >> BaseVariableWidthVector, etc.
> >>> >>
> >>> >> IMO, such overheads are small enough to justify the change.
> >>> >> Let's check if there are other overheads.
> >>> >>
> >>> >> Best,
> >>> >> Liya Fan
> >>> >>
> >>> >> On Wed, Aug 7, 2019 at 3:30 PM Micah Kornfield <
> emkornfield@gmail.com
> >>> >
> >>> >> wrote:
> >>> >>
> >>> >> > Hi Liya Fan,
> >>> >> > Based on the Float8Benchmark there does not seem to be any
> >>> meaningful
> >>> >> > performance difference on my machine.  At least for me, the
> >>> benchmarks
> >>> >> are
> >>> >> > not stable enough to say one is faster than the other (I've pasted
> >>> >> results
> >>> >> > below).  That being said my machine isn't necessarily the most
> >>> reliable
> >>> >> for
> >>> >> > benchmarking.
> >>> >> >
> >>> >> > On an intuitive level, this makes sense to me,  for the most part
> it
> >>> >> seems
> >>> >> > like the change just moves casting from "int" to "long" further up
> >>> the
> >>> >> > stack  for  "PlatformDepdendent" operations.  If there are other
> >>> >> benchmarks
> >>> >> > that you think are worth running let me know.
> >>> >> >
> >>> >> > One downside performance wise I think for his change is it
> >>> increases the
> >>> >> > size of ArrowBuf objects, which I suppose could influence cache
> >>> misses
> >>> >> at
> >>> >> > some level or increase the size of call-stacks, but this doesn't
> >>> seem to
> >>> >> > show up in the benchmark..
> >>> >> >
> >>> >> > Thanks,
> >>> >> > Micah
> >>> >> >
> >>> >> > Sample benchmark numbers:
> >>> >> >
> >>> >> > [New Code]
> >>> >> > Benchmark                            Mode  Cnt   Score   Error
> >>> Units
> >>> >> > Float8Benchmarks.copyFromBenchmark   avgt    5  15.441 ± 0.469
> >>> us/op
> >>> >> > Float8Benchmarks.readWriteBenchmark  avgt    5  14.057 ± 0.115
> >>> us/op
> >>> >> >
> >>> >> >
> >>> >> > [Old code]
> >>> >> > Benchmark                            Mode  Cnt   Score   Error
> >>> Units
> >>> >> > Float8Benchmarks.copyFromBenchmark   avgt    5  16.248 ± 1.409
> >>> us/op
> >>> >> > Float8Benchmarks.readWriteBenchmark  avgt    5  14.150 ± 0.084
> >>> us/op
> >>> >> >
> >>> >> > On Tue, Aug 6, 2019 at 1:18 AM Fan Liya <li...@gmail.com>
> >>> wrote:
> >>> >> >
> >>> >> >> Hi Micah,
> >>> >> >>
> >>> >> >> Thanks a lot for doing this.
> >>> >> >>
> >>> >> >> I am a little concerned about if there is any negative
> performance
> >>> >> impact
> >>> >> >> on the current 32-bit-length based applications.
> >>> >> >> Can we do some performance comparison on our existing benchmarks?
> >>> >> >>
> >>> >> >> Best,
> >>> >> >> Liya Fan
> >>> >> >>
> >>> >> >>
> >>> >> >> On Tue, Aug 6, 2019 at 3:35 PM Micah Kornfield <
> >>> emkornfield@gmail.com>
> >>> >> >> wrote:
> >>> >> >>
> >>> >> >>> There have been some previous discussions on the mailing about
> >>> >> supporting
> >>> >> >>> 64-bit lengths for  Java ValueVectors (this is what the IPC
> >>> >> specification
> >>> >> >>> and C++ support).  I created a PR [1] that changes all APIs
> that I
> >>> >> could
> >>> >> >>> find that take an index to take an "long" instead of an "int"
> (and
> >>> >> >>> similarly change "size/rowcount" APIs).
> >>> >> >>>
> >>> >> >>> It is a big change, so I think it is worth discussing if it is
> >>> >> something
> >>> >> >>> we
> >>> >> >>> still want to move forward with.  It would be nice to come to a
> >>> >> >>> conclusion
> >>> >> >>> quickly, ideally in the next few days, to avoid a lot of merge
> >>> >> conflicts.
> >>> >> >>>
> >>> >> >>> The reason I did this work now is the C++ implementation has
> added
> >>> >> >>> support
> >>> >> >>> for LargeList, LargeBinary and LargeString arrays and based on
> >>> prior
> >>> >> >>> discussions we need to have similar support in Java before our
> >>> next
> >>> >> >>> release. Support 64-bit indexes means we can have full
> >>> compatibility
> >>> >> and
> >>> >> >>> make the most use of the types in Java.
> >>> >> >>>
> >>> >> >>> Look forward to hearing feedback.
> >>> >> >>>
> >>> >> >>> Thanks,
> >>> >> >>> Micah
> >>> >> >>>
> >>> >> >>> [1] https://github.com/apache/arrow/pull/5020
> >>> >> >>>
> >>> >> >>
> >>> >>
> >>> >
> >>>
> >>
>

Re: [Discuss][Java] 64-bit lengths for ValueVectors

Posted by Micah Kornfield <em...@gmail.com>.

After more investigation, it looks like Float8Benchmarks at least on my
machine are within the range of noise.

For BitVectorHelper I pushed a new commit [1], seems to bring the
BitVectorHelper benchmarks back inline (and even with some improvement for
getNullCountBenchmark).

Benchmark                                        Mode  Cnt   Score   Error
 Units
BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5   3.821 ± 0.031
 ns/op
BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5  14.884 ± 0.141
 ns/op

I applied the same pattern to other loops that I could find, and for any
"for (long" loop on the critical path, I broke it up into two loops.  the
first loop does iteration by integer, the second finishes off for any long
values.  As a side note it seems like optimization for loops using long
counters at least have a semi-recent open bug for the JVM [2]

Thanks,
Micah

[1]
https://github.com/apache/arrow/pull/5020/commits/2ea2c1ae83e3baa7b9a99a6d06276d968df41797
[2] https://bugs.openjdk.java.net/browse/JDK-8223051

On Wed, Aug 7, 2019 at 8:11 PM Micah Kornfield <em...@gmail.com>
wrote:

> Indeed, the BoundChecking and CheckNullForGet variables can make a big
> difference.  I didn't initially run the benchmarks with these turned on
> (you can see the result from above with Float8Benchmarks).  Here are new
> numbers including with the flags enabled.  It looks like using longs might
> be a little bit slower, I'll see what I can do to mitigate this.
>
> Ravindra also volunteered to try to benchmark the changes with Dremio's
> code on today's sync call.
>
> New
>
> Benchmark                                        Mode  Cnt   Score   Error
> Units
>
> BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5   4.176 ± 1.292
> ns/op
>
> BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5  26.102 ± 0.700
> ns/op
>
> Float8Benchmarks.copyFromBenchmark   avgt    5  7.398 ± 0.084  us/op
>
> Float8Benchmarks.readWriteBenchmark  avgt    5  2.711 ± 0.057  us/op
>
>
>
> old
>
> BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5   3.828 ± 0.030
> ns/op
>
> BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5  20.611 ± 0.188
> ns/op
>
> Float8Benchmarks.copyFromBenchmark   avgt    5  6.597 ± 0.462  us/op
>
> Float8Benchmarks.readWriteBenchmark  avgt    5  2.615 ± 0.027  us/op
>
> On Wed, Aug 7, 2019 at 7:13 PM Fan Liya <li...@gmail.com> wrote:
>
>> Hi Gonzalo,
>>
>> Thanks for sharing the performance results.
>> I am wondering if you have turned off the flag
>> BoundsChecking#BOUNDS_CHECKING_ENABLED.
>> If not, the lower throughput should be expected.
>>
>> Best,
>> Liya Fan
>>
>> On Wed, Aug 7, 2019 at 10:23 PM Micah Kornfield <em...@gmail.com>
>> wrote:
>>
>>> Hi Gonzalo,
>>> Thank you for the feedback.  I wasn't aware of the JIT implications.   At
>>> least on the benchmark run they don't seem to have an impact.
>>>
>>> If there are other benchmarks that people have that can validate if this
>>> change will be problematic I would appreciate trying to run them with the
>>> PR.  I will try to run the ones for zeroing/popcnt tonight to see if
>>> there
>>> is a change in those.
>>>
>>> -Micah
>>>
>>>
>>>
>>> On Wednesday, August 7, 2019, Gonzalo Ortiz Jaureguizar <
>>> golthiryus@gmail.com> wrote:
>>>
>>> > I would recommend to take care with this kind of changes.
>>> >
>>> > I didn't try Arrow in more than one year, but by then the performance
>>> was
>>> > quite bad in comparison with plain byte buffer access
>>> > (see http://git.net/apache-arrow-development/msg02353.html *) and
>>> > there are several optimizations that the JVM (specifically, C2) does
>>> not
>>> > apply when dealing with int instead of longs. One of the
>>> > most commons is the loop unrolling and vectorization.
>>> >
>>> > * It doesn't seem the best way to reference an old email on the list,
>>> but
>>> > it is the only result shown by Google
>>> >
>>> > El mié., 7 ago. 2019 a las 11:42, Fan Liya (<li...@gmail.com>)
>>> > escribió:
>>> >
>>> >> Hi Micah,
>>> >>
>>> >> Thanks for your effort. The performance result looks good.
>>> >>
>>> >> As you indicated, ArrowBuf will take additional 12 bytes (4 bytes for
>>> each
>>> >> of length, write index, and read index).
>>> >> Similar overheads also exist for vectors like BaseFixedWidthVector,
>>> >> BaseVariableWidthVector, etc.
>>> >>
>>> >> IMO, such overheads are small enough to justify the change.
>>> >> Let's check if there are other overheads.
>>> >>
>>> >> Best,
>>> >> Liya Fan
>>> >>
>>> >> On Wed, Aug 7, 2019 at 3:30 PM Micah Kornfield <emkornfield@gmail.com
>>> >
>>> >> wrote:
>>> >>
>>> >> > Hi Liya Fan,
>>> >> > Based on the Float8Benchmark there does not seem to be any
>>> meaningful
>>> >> > performance difference on my machine.  At least for me, the
>>> benchmarks
>>> >> are
>>> >> > not stable enough to say one is faster than the other (I've pasted
>>> >> results
>>> >> > below).  That being said my machine isn't necessarily the most
>>> reliable
>>> >> for
>>> >> > benchmarking.
>>> >> >
>>> >> > On an intuitive level, this makes sense to me,  for the most part it
>>> >> seems
>>> >> > like the change just moves casting from "int" to "long" further up
>>> the
>>> >> > stack  for  "PlatformDepdendent" operations.  If there are other
>>> >> benchmarks
>>> >> > that you think are worth running let me know.
>>> >> >
>>> >> > One downside performance wise I think for his change is it
>>> increases the
>>> >> > size of ArrowBuf objects, which I suppose could influence cache
>>> misses
>>> >> at
>>> >> > some level or increase the size of call-stacks, but this doesn't
>>> seem to
>>> >> > show up in the benchmark..
>>> >> >
>>> >> > Thanks,
>>> >> > Micah
>>> >> >
>>> >> > Sample benchmark numbers:
>>> >> >
>>> >> > [New Code]
>>> >> > Benchmark                            Mode  Cnt   Score   Error
>>> Units
>>> >> > Float8Benchmarks.copyFromBenchmark   avgt    5  15.441 ± 0.469
>>> us/op
>>> >> > Float8Benchmarks.readWriteBenchmark  avgt    5  14.057 ± 0.115
>>> us/op
>>> >> >
>>> >> >
>>> >> > [Old code]
>>> >> > Benchmark                            Mode  Cnt   Score   Error
>>> Units
>>> >> > Float8Benchmarks.copyFromBenchmark   avgt    5  16.248 ± 1.409
>>> us/op
>>> >> > Float8Benchmarks.readWriteBenchmark  avgt    5  14.150 ± 0.084
>>> us/op
>>> >> >
>>> >> > On Tue, Aug 6, 2019 at 1:18 AM Fan Liya <li...@gmail.com>
>>> wrote:
>>> >> >
>>> >> >> Hi Micah,
>>> >> >>
>>> >> >> Thanks a lot for doing this.
>>> >> >>
>>> >> >> I am a little concerned about if there is any negative performance
>>> >> impact
>>> >> >> on the current 32-bit-length based applications.
>>> >> >> Can we do some performance comparison on our existing benchmarks?
>>> >> >>
>>> >> >> Best,
>>> >> >> Liya Fan
>>> >> >>
>>> >> >>
>>> >> >> On Tue, Aug 6, 2019 at 3:35 PM Micah Kornfield <
>>> emkornfield@gmail.com>
>>> >> >> wrote:
>>> >> >>
>>> >> >>> There have been some previous discussions on the mailing about
>>> >> supporting
>>> >> >>> 64-bit lengths for  Java ValueVectors (this is what the IPC
>>> >> specification
>>> >> >>> and C++ support).  I created a PR [1] that changes all APIs that I
>>> >> could
>>> >> >>> find that take an index to take an "long" instead of an "int" (and
>>> >> >>> similarly change "size/rowcount" APIs).
>>> >> >>>
>>> >> >>> It is a big change, so I think it is worth discussing if it is
>>> >> something
>>> >> >>> we
>>> >> >>> still want to move forward with.  It would be nice to come to a
>>> >> >>> conclusion
>>> >> >>> quickly, ideally in the next few days, to avoid a lot of merge
>>> >> conflicts.
>>> >> >>>
>>> >> >>> The reason I did this work now is the C++ implementation has added
>>> >> >>> support
>>> >> >>> for LargeList, LargeBinary and LargeString arrays and based on
>>> prior
>>> >> >>> discussions we need to have similar support in Java before our
>>> next
>>> >> >>> release. Support 64-bit indexes means we can have full
>>> compatibility
>>> >> and
>>> >> >>> make the most use of the types in Java.
>>> >> >>>
>>> >> >>> Look forward to hearing feedback.
>>> >> >>>
>>> >> >>> Thanks,
>>> >> >>> Micah
>>> >> >>>
>>> >> >>> [1] https://github.com/apache/arrow/pull/5020
>>> >> >>>
>>> >> >>
>>> >>
>>> >
>>>
>>

Re: [Discuss][Java] 64-bit lengths for ValueVectors

Posted by Micah Kornfield <em...@gmail.com>.

Indeed, the BoundChecking and CheckNullForGet variables can make a big
difference.  I didn't initially run the benchmarks with these turned on
(you can see the result from above with Float8Benchmarks).  Here are new
numbers including with the flags enabled.  It looks like using longs might
be a little bit slower, I'll see what I can do to mitigate this.

Ravindra also volunteered to try to benchmark the changes with Dremio's
code on today's sync call.

New

Benchmark                                        Mode  Cnt   Score   Error
Units

BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5   4.176 ± 1.292
ns/op

BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5  26.102 ± 0.700
ns/op

Float8Benchmarks.copyFromBenchmark   avgt    5  7.398 ± 0.084  us/op

Float8Benchmarks.readWriteBenchmark  avgt    5  2.711 ± 0.057  us/op



old

BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5   3.828 ± 0.030
ns/op

BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5  20.611 ± 0.188
ns/op

Float8Benchmarks.copyFromBenchmark   avgt    5  6.597 ± 0.462  us/op

Float8Benchmarks.readWriteBenchmark  avgt    5  2.615 ± 0.027  us/op

On Wed, Aug 7, 2019 at 7:13 PM Fan Liya <li...@gmail.com> wrote:

> Hi Gonzalo,
>
> Thanks for sharing the performance results.
> I am wondering if you have turned off the flag
> BoundsChecking#BOUNDS_CHECKING_ENABLED.
> If not, the lower throughput should be expected.
>
> Best,
> Liya Fan
>
> On Wed, Aug 7, 2019 at 10:23 PM Micah Kornfield <em...@gmail.com>
> wrote:
>
>> Hi Gonzalo,
>> Thank you for the feedback.  I wasn't aware of the JIT implications.   At
>> least on the benchmark run they don't seem to have an impact.
>>
>> If there are other benchmarks that people have that can validate if this
>> change will be problematic I would appreciate trying to run them with the
>> PR.  I will try to run the ones for zeroing/popcnt tonight to see if there
>> is a change in those.
>>
>> -Micah
>>
>>
>>
>> On Wednesday, August 7, 2019, Gonzalo Ortiz Jaureguizar <
>> golthiryus@gmail.com> wrote:
>>
>> > I would recommend to take care with this kind of changes.
>> >
>> > I didn't try Arrow in more than one year, but by then the performance
>> was
>> > quite bad in comparison with plain byte buffer access
>> > (see http://git.net/apache-arrow-development/msg02353.html *) and
>> > there are several optimizations that the JVM (specifically, C2) does not
>> > apply when dealing with int instead of longs. One of the
>> > most commons is the loop unrolling and vectorization.
>> >
>> > * It doesn't seem the best way to reference an old email on the list,
>> but
>> > it is the only result shown by Google
>> >
>> > El mié., 7 ago. 2019 a las 11:42, Fan Liya (<li...@gmail.com>)
>> > escribió:
>> >
>> >> Hi Micah,
>> >>
>> >> Thanks for your effort. The performance result looks good.
>> >>
>> >> As you indicated, ArrowBuf will take additional 12 bytes (4 bytes for
>> each
>> >> of length, write index, and read index).
>> >> Similar overheads also exist for vectors like BaseFixedWidthVector,
>> >> BaseVariableWidthVector, etc.
>> >>
>> >> IMO, such overheads are small enough to justify the change.
>> >> Let's check if there are other overheads.
>> >>
>> >> Best,
>> >> Liya Fan
>> >>
>> >> On Wed, Aug 7, 2019 at 3:30 PM Micah Kornfield <em...@gmail.com>
>> >> wrote:
>> >>
>> >> > Hi Liya Fan,
>> >> > Based on the Float8Benchmark there does not seem to be any meaningful
>> >> > performance difference on my machine.  At least for me, the
>> benchmarks
>> >> are
>> >> > not stable enough to say one is faster than the other (I've pasted
>> >> results
>> >> > below).  That being said my machine isn't necessarily the most
>> reliable
>> >> for
>> >> > benchmarking.
>> >> >
>> >> > On an intuitive level, this makes sense to me,  for the most part it
>> >> seems
>> >> > like the change just moves casting from "int" to "long" further up
>> the
>> >> > stack  for  "PlatformDepdendent" operations.  If there are other
>> >> benchmarks
>> >> > that you think are worth running let me know.
>> >> >
>> >> > One downside performance wise I think for his change is it increases
>> the
>> >> > size of ArrowBuf objects, which I suppose could influence cache
>> misses
>> >> at
>> >> > some level or increase the size of call-stacks, but this doesn't
>> seem to
>> >> > show up in the benchmark..
>> >> >
>> >> > Thanks,
>> >> > Micah
>> >> >
>> >> > Sample benchmark numbers:
>> >> >
>> >> > [New Code]
>> >> > Benchmark                            Mode  Cnt   Score   Error  Units
>> >> > Float8Benchmarks.copyFromBenchmark   avgt    5  15.441 ± 0.469  us/op
>> >> > Float8Benchmarks.readWriteBenchmark  avgt    5  14.057 ± 0.115  us/op
>> >> >
>> >> >
>> >> > [Old code]
>> >> > Benchmark                            Mode  Cnt   Score   Error  Units
>> >> > Float8Benchmarks.copyFromBenchmark   avgt    5  16.248 ± 1.409  us/op
>> >> > Float8Benchmarks.readWriteBenchmark  avgt    5  14.150 ± 0.084  us/op
>> >> >
>> >> > On Tue, Aug 6, 2019 at 1:18 AM Fan Liya <li...@gmail.com>
>> wrote:
>> >> >
>> >> >> Hi Micah,
>> >> >>
>> >> >> Thanks a lot for doing this.
>> >> >>
>> >> >> I am a little concerned about if there is any negative performance
>> >> impact
>> >> >> on the current 32-bit-length based applications.
>> >> >> Can we do some performance comparison on our existing benchmarks?
>> >> >>
>> >> >> Best,
>> >> >> Liya Fan
>> >> >>
>> >> >>
>> >> >> On Tue, Aug 6, 2019 at 3:35 PM Micah Kornfield <
>> emkornfield@gmail.com>
>> >> >> wrote:
>> >> >>
>> >> >>> There have been some previous discussions on the mailing about
>> >> supporting
>> >> >>> 64-bit lengths for  Java ValueVectors (this is what the IPC
>> >> specification
>> >> >>> and C++ support).  I created a PR [1] that changes all APIs that I
>> >> could
>> >> >>> find that take an index to take an "long" instead of an "int" (and
>> >> >>> similarly change "size/rowcount" APIs).
>> >> >>>
>> >> >>> It is a big change, so I think it is worth discussing if it is
>> >> something
>> >> >>> we
>> >> >>> still want to move forward with.  It would be nice to come to a
>> >> >>> conclusion
>> >> >>> quickly, ideally in the next few days, to avoid a lot of merge
>> >> conflicts.
>> >> >>>
>> >> >>> The reason I did this work now is the C++ implementation has added
>> >> >>> support
>> >> >>> for LargeList, LargeBinary and LargeString arrays and based on
>> prior
>> >> >>> discussions we need to have similar support in Java before our next
>> >> >>> release. Support 64-bit indexes means we can have full
>> compatibility
>> >> and
>> >> >>> make the most use of the types in Java.
>> >> >>>
>> >> >>> Look forward to hearing feedback.
>> >> >>>
>> >> >>> Thanks,
>> >> >>> Micah
>> >> >>>
>> >> >>> [1] https://github.com/apache/arrow/pull/5020
>> >> >>>
>> >> >>
>> >>
>> >
>>
>

Re: [Discuss][Java] 64-bit lengths for ValueVectors

Posted by Fan Liya <li...@gmail.com>.

Hi Gonzalo,

Thanks for sharing the performance results.
I am wondering if you have turned off the flag
BoundsChecking#BOUNDS_CHECKING_ENABLED.
If not, the lower throughput should be expected.

Best,
Liya Fan

On Wed, Aug 7, 2019 at 10:23 PM Micah Kornfield <em...@gmail.com>
wrote:

> Hi Gonzalo,
> Thank you for the feedback.  I wasn't aware of the JIT implications.   At
> least on the benchmark run they don't seem to have an impact.
>
> If there are other benchmarks that people have that can validate if this
> change will be problematic I would appreciate trying to run them with the
> PR.  I will try to run the ones for zeroing/popcnt tonight to see if there
> is a change in those.
>
> -Micah
>
>
>
> On Wednesday, August 7, 2019, Gonzalo Ortiz Jaureguizar <
> golthiryus@gmail.com> wrote:
>
> > I would recommend to take care with this kind of changes.
> >
> > I didn't try Arrow in more than one year, but by then the performance was
> > quite bad in comparison with plain byte buffer access
> > (see http://git.net/apache-arrow-development/msg02353.html *) and
> > there are several optimizations that the JVM (specifically, C2) does not
> > apply when dealing with int instead of longs. One of the
> > most commons is the loop unrolling and vectorization.
> >
> > * It doesn't seem the best way to reference an old email on the list, but
> > it is the only result shown by Google
> >
> > El mié., 7 ago. 2019 a las 11:42, Fan Liya (<li...@gmail.com>)
> > escribió:
> >
> >> Hi Micah,
> >>
> >> Thanks for your effort. The performance result looks good.
> >>
> >> As you indicated, ArrowBuf will take additional 12 bytes (4 bytes for
> each
> >> of length, write index, and read index).
> >> Similar overheads also exist for vectors like BaseFixedWidthVector,
> >> BaseVariableWidthVector, etc.
> >>
> >> IMO, such overheads are small enough to justify the change.
> >> Let's check if there are other overheads.
> >>
> >> Best,
> >> Liya Fan
> >>
> >> On Wed, Aug 7, 2019 at 3:30 PM Micah Kornfield <em...@gmail.com>
> >> wrote:
> >>
> >> > Hi Liya Fan,
> >> > Based on the Float8Benchmark there does not seem to be any meaningful
> >> > performance difference on my machine.  At least for me, the benchmarks
> >> are
> >> > not stable enough to say one is faster than the other (I've pasted
> >> results
> >> > below).  That being said my machine isn't necessarily the most
> reliable
> >> for
> >> > benchmarking.
> >> >
> >> > On an intuitive level, this makes sense to me,  for the most part it
> >> seems
> >> > like the change just moves casting from "int" to "long" further up the
> >> > stack  for  "PlatformDepdendent" operations.  If there are other
> >> benchmarks
> >> > that you think are worth running let me know.
> >> >
> >> > One downside performance wise I think for his change is it increases
> the
> >> > size of ArrowBuf objects, which I suppose could influence cache misses
> >> at
> >> > some level or increase the size of call-stacks, but this doesn't seem
> to
> >> > show up in the benchmark..
> >> >
> >> > Thanks,
> >> > Micah
> >> >
> >> > Sample benchmark numbers:
> >> >
> >> > [New Code]
> >> > Benchmark                            Mode  Cnt   Score   Error  Units
> >> > Float8Benchmarks.copyFromBenchmark   avgt    5  15.441 ± 0.469  us/op
> >> > Float8Benchmarks.readWriteBenchmark  avgt    5  14.057 ± 0.115  us/op
> >> >
> >> >
> >> > [Old code]
> >> > Benchmark                            Mode  Cnt   Score   Error  Units
> >> > Float8Benchmarks.copyFromBenchmark   avgt    5  16.248 ± 1.409  us/op
> >> > Float8Benchmarks.readWriteBenchmark  avgt    5  14.150 ± 0.084  us/op
> >> >
> >> > On Tue, Aug 6, 2019 at 1:18 AM Fan Liya <li...@gmail.com> wrote:
> >> >
> >> >> Hi Micah,
> >> >>
> >> >> Thanks a lot for doing this.
> >> >>
> >> >> I am a little concerned about if there is any negative performance
> >> impact
> >> >> on the current 32-bit-length based applications.
> >> >> Can we do some performance comparison on our existing benchmarks?
> >> >>
> >> >> Best,
> >> >> Liya Fan
> >> >>
> >> >>
> >> >> On Tue, Aug 6, 2019 at 3:35 PM Micah Kornfield <
> emkornfield@gmail.com>
> >> >> wrote:
> >> >>
> >> >>> There have been some previous discussions on the mailing about
> >> supporting
> >> >>> 64-bit lengths for  Java ValueVectors (this is what the IPC
> >> specification
> >> >>> and C++ support).  I created a PR [1] that changes all APIs that I
> >> could
> >> >>> find that take an index to take an "long" instead of an "int" (and
> >> >>> similarly change "size/rowcount" APIs).
> >> >>>
> >> >>> It is a big change, so I think it is worth discussing if it is
> >> something
> >> >>> we
> >> >>> still want to move forward with.  It would be nice to come to a
> >> >>> conclusion
> >> >>> quickly, ideally in the next few days, to avoid a lot of merge
> >> conflicts.
> >> >>>
> >> >>> The reason I did this work now is the C++ implementation has added
> >> >>> support
> >> >>> for LargeList, LargeBinary and LargeString arrays and based on prior
> >> >>> discussions we need to have similar support in Java before our next
> >> >>> release. Support 64-bit indexes means we can have full compatibility
> >> and
> >> >>> make the most use of the types in Java.
> >> >>>
> >> >>> Look forward to hearing feedback.
> >> >>>
> >> >>> Thanks,
> >> >>> Micah
> >> >>>
> >> >>> [1] https://github.com/apache/arrow/pull/5020
> >> >>>
> >> >>
> >>
> >
>