Posted to dev@crail.apache.org by Animesh Trivedi <an...@gmail.com> on 2018/10/03 13:44:55 UTC

Re: [JAVA] Arrow performance measurement

Hi all - quick update on the performance investigation:

- I spent some time looking at the performance profile of a binary blob
column (1024 bytes of byte[]) and found a few favorable settings for
delivering up to 168 Gbps from the in-memory reading benchmark on 16
cores. These settings (NUMA, JVM settings, Arrow holder API, batch size,
etc.) are documented here:
https://github.com/animeshtrivedi/blog/blob/master/post/2018-10-03-arrow-binary.md
- these settings also help to improve the last number that I reported
(though not by much) for the in-memory TPC-DS store_sales table, from
~39 Gbps up to ~45-47 Gbps (note: this is an in-memory-only benchmark,
i.e., without any networking or storage links)

A few follow-up questions that I have:
1. Arrow reads a batch's worth of data in one go. Are there any
recommended batch sizes? In my investigation, small batch sizes help with
a better cache profile but increase the number of instructions required
(more looping); larger ones do the opposite. Somehow ~10 MB/thread seems
to be the best-performing configuration, which is also a bit
counter-intuitive, as with 16 threads this leads to a 160 MB memory
footprint. Maybe this is also tied to the memory management logic, which
is my next question. (A small sketch of the batch-at-a-time read pattern
follows these questions.)
2. Arrow uses Netty's memory manager. (i) What are sensible Netty memory
management settings for the "io.netty.allocator.*" parameters? I cannot
find any good write-up on them. (ii) Is there a provision for an ArrowBuf
to be re-used once a batch is consumed? As far as I can tell, each read
currently allocates a new buffer for the whole batch.
3. Are there examples of Arrow C++ read/write code that I can have a look
at?
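
For reference, here is a minimal sketch of the batch-at-a-time read
pattern I am benchmarking, written against the stock Arrow/Java file
reader (class and method names follow the current Arrow/Java API; the
file name and single-int-column layout are made up for illustration):

import java.io.FileInputStream;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowFileReader;

public class BatchReadSketch {
    public static void main(String[] args) throws Exception {
        // "ints.arrow" is a placeholder Arrow file with one int column.
        try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             FileInputStream in = new FileInputStream("ints.arrow");
             ArrowFileReader reader =
                 new ArrowFileReader(in.getChannel(), allocator)) {
            VectorSchemaRoot root = reader.getVectorSchemaRoot();
            long checksum = 0;
            // Each loadNextBatch() call materializes one full record batch,
            // so the batch size chosen at write time dictates the per-read
            // memory footprint discussed in question 1.
            while (reader.loadNextBatch()) {
                IntVector vec = (IntVector) root.getFieldVectors().get(0);
                for (int i = 0; i < root.getRowCount(); i++) {
                    if (!vec.isNull(i)) {
                        checksum += vec.get(i);
                    }
                }
            }
            System.out.println("checksum = " + checksum);
        }
    }
}

In my benchmarks the consumption loop is essentially the same; only the
channel implementation behind the reader (HDFS, Crail, or the in-memory
channel) changes between setups.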

Cheers,
--
Animesh


On Wed, Sep 19, 2018 at 8:49 PM Wes McKinney <we...@gmail.com> wrote:

> On Wed, Sep 19, 2018 at 2:13 PM Animesh Trivedi
> <an...@gmail.com> wrote:
> >
> > Hi Johan, Wes, and Jacques - many thanks for your comments:
> >
> > @Johan -
> > 1. I also do not suspect that there is any inherent drawback in Java or
> C++
> > due to the Arrow format. I mentioned C++ because Wes pointed out that
> Java
> > routines are not the most optimized ones (yet!). And naturally one would
> > expect better performance in a native language with all
> pointer/memory/SIMD
> > instruction optimizations that you mentioned. As far as I know, the
> > off-heap buffers are managed in ArrowBuf which implements an abstract
> netty
> > class. But there is nothing unusual, i.e., netty specific, about these
> > unsafe routines, they are used by many projects. Though there is cost
> > associated with materializing on-heap Java values from off-heap memory
> > regions. I need to benchmark that more carefully.
> >
> > 2. When you say "I've so far always been able to get similar performance
> > numbers" - do you mean the same performance of my case 3 where 16 cores
> > drive close to 40 Gbps, or the same performance between your C++ and Java
> > benchmarks. Do you have some write-up? I would be interested to read up
> :)
> >
> > 3. "Can you get to 100 Gbps starting from primitive arrays in Java" ->
> that
> > is a good idea. Let me try and report back.
> >
> > @Wes -
> >
> > Is there some benchmark template for C++ routines I can have a look? I
> > would be happy to get some input from Java-Arrow experts on how to write
> > these benchmarks more efficiently. I will have a closer look at the JIRA
> > tickets that you mentioned.
> >
> > So, for now I am focusing on the case 3, which is about establishing
> > performance when reading from a local in-memory I/O stream that I
> > implemented (
> >
> https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/MemoryIOChannel.java
> ).
> > In this case I first read data from parquet files, convert them into
> Arrow,
> > and write-out to a MemoryIOChannel, and then read back from it. So, the
> > performance has nothing to do with Crail or HDFS in the case 3. Once, I
> > establish the base performance in this setup (which is around ~40 Gbps
> with
> > 16 cores) I will add Crail to the mix. Perhaps Crail I/O streams can take
> > ArrowBuf as src/dst buffers. That should be doable.
>
> Right, in any case what you are testing is the performance of using a
> particular Java accessor layer to JVM off-heap Arrow memory to sum the
> non-null values of each column. I'm not sure that a single bandwidth
> number produced by this benchmark is very informative for people
> contemplating what memory format to use in their system due to the
> current state of the implementation (Java) and workload measured
> (summing the non-null values with a naive algorithm). I would guess
> that a C++ version with raw pointers and a loop-unrolled, branch-free
> vectorized sum is going to be a lot faster.
>
> >
> > @Jacques -
> >
> > That is a good point that "Arrow's implementation is more focused on
> > interacting with the structure than transporting it". However, in any
> > distributed system one needs to move data/structure around - as far as I
> > understand that is another goal of the project. My investigation started
> > within the context of Spark/SQL data processing. Spark converts incoming
> > data into its own in-memory UnsafeRow representation for processing. So
> > naturally the performance of this data ingestion pipeline cannot
> outperform
> > the read performance of the used file format. I benchmarked Parquet, ORC,
> > Avro, JSON (for the specific TPC-DS store_sales table). And then
> curiously
> > benchmarked Arrow as well because its design choices are a better fit for
> > modern high-performance RDMA/NVMe/100+Gbps devices I am investigating.
> From
> > this point of view, I am trying to find out can Arrow be the file format
> > for the next generation of storage/networking devices (see Apache Crail
> > project) delivering close to the hardware speed reading/writing rates. As
> > Wes pointed out that a C++ library implementation should be memory-IO
> > bound, so what would it take to deliver the same performance in Java ;)
> > (and then, from across the network).
> >
> > I hope this makes sense.
> >
> > Cheers,
> > --
> > Animesh
> >
> > On Wed, Sep 19, 2018 at 6:28 PM Jacques Nadeau <ja...@apache.org>
> wrote:
> >
> > > My big question is what is the use case and how/what are you trying to
> > > compare? Arrow's implementation is more focused on interacting with the
> > > structure than transporting it. Generally speaking, when we're working
> with
> > > Arrow data we frequently are just interacting with memory locations and
> > > doing direct operations. If you have a layer that supports that type of
> > > semantic, create a movement technique that depends on that. Arrow
> doesn't
> > > force a particular API since the data itself is defined by its
> in-memory
> > > layout so if you have a custom use or pattern, just work with the
> in-memory
> > > structures.
> > >
> > >
> > >
> > > On Wed, Sep 19, 2018 at 7:49 AM Wes McKinney <we...@gmail.com>
> wrote:
> > >
> > > > hi Animesh,
> > > >
> > > > Per Johan's comments, the C++ library is essentially going to be
> > > > IO/memory bandwidth bound since you're interacting with raw pointers.
> > > >
> > > > I'm looking at your code
> > > >
> > > > private void consumeFloat4(FieldVector fv) {
> > > >     Float4Vector accessor = (Float4Vector) fv;
> > > >     int valCount = accessor.getValueCount();
> > > >     for(int i = 0; i < valCount; i++){
> > > >         if(!accessor.isNull(i)){
> > > >             float4Count+=1;
> > > >             checksum+=accessor.get(i);
> > > >         }
> > > >     }
> > > > }
> > > >
> > > > You'll want to get a Java-Arrow expert from Dremio to advise you the
> > > > fastest way to iterate over this data -- my understanding is that
> much
> > > > code in Dremio interacts with the wrapped Netty ArrowBuf objects
> > > > rather than going through the higher level APIs. You're also dropping
> > > > performance because memory mapping is not yet implemented in Java,
> see
> > > > https://issues.apache.org/jira/browse/ARROW-3191.
> > > >
> > > > Furthermore, the IPC reader class you are using could be made more
> > > > efficient. I described the problem in
> > > > https://issues.apache.org/jira/browse/ARROW-3192 -- this will be
> > > > required as soon as we have the ability to do memory mapping in Java
> > > >
> > > > Could Crail use the Arrow data structures in its runtime rather than
> > > > copying? If not, how are Crail's runtime data structures different?
> > > >
> > > > - Wes
> > > > On Wed, Sep 19, 2018 at 9:19 AM Johan Peltenburg - EWI
> > > > <J....@tudelft.nl> wrote:
> > > > >
> > > > > Hello Animesh,
> > > > >
> > > > >
> > > > > I browsed a bit in your sources, thanks for sharing. We have
> performed
> > > > some similar measurements to your third case in the past for C/C++ on
> > > > collections of various basic types such as primitives and strings.
> > > > >
> > > > >
> > > > > I can say that in terms of consuming data from the Arrow format
> versus
> > > > language native collections in C++, I've so far always been able to
> get
> > > > similar performance numbers (e.g. no drawbacks due to the Arrow
> format
> > > > itself). Especially when accessing the data through Arrow's raw data
> > > > pointers (and using for example std::string_view-like constructs).
> > > > >
> > > > > In C/C++ the fast data structures are engineered in such a way
> that as
> > > > little pointer traversals are required and they take up an as small
> as
> > > > possible memory footprint. Thus each memory access is relatively
> > > efficient
> > > > (in terms of obtaining the data of interest). The same can
> absolutely be
> > > > said for Arrow, if not even more efficient in some cases where object
> > > > fields are of variable length.
> > > > >
> > > > >
> > > > > In the JVM case, the Arrow data is stored off-heap. This requires
> the
> > > > JVM to interface to it through some calls to Unsafe hidden under the
> > > Netty
> > > > layer (but please correct me if I'm wrong, I'm not an expert on
> this).
> > > > Those calls are the only reason I can think of that would degrade the
> > > > performance a bit compared to a pure JAva case. I don't know if the
> > > Unsafe
> > > > calls are inlined during JIT compilation. If they aren't they will
> > > increase
> > > > access latency to any data a little bit.
> > > > >
> > > > >
> > > > > I don't have a similar machine so it's not easy to relate my
> numbers to
> > > > yours, but if you can get that data consumed with 100 Gbps in a pure
> Java
> > > > case, I don't see any reason (resulting from Arrow format / off-heap
> > > > storage) why you wouldn't be able to get at least really close. Can
> you
> > > get
> > > > to 100 Gbps starting from primitive arrays in Java with your
> consumption
> > > > functions in the first place?
> > > > >
> > > > >
> > > > > I'm interested to see your progress on this.
> > > > >
> > > > >
> > > > > Kind regards,
> > > > >
> > > > >
> > > > > Johan Peltenburg
> > > > >
> > > > > ________________________________
> > > > > From: Animesh Trivedi <an...@gmail.com>
> > > > > Sent: Wednesday, September 19, 2018 2:08:50 PM
> > > > > To: dev@arrow.apache.org; dev@crail.apache.org
> > > > > Subject: [JAVA] Arrow performance measurement
> > > > >
> > > > > Hi all,
> > > > >
> > > > > A week ago, Wes and I had a discussion about the performance of the
> > > > > Arrow/Java implementation on the Apache Crail (Incubating) mailing
> > > list (
> > > > >
> http://mail-archives.apache.org/mod_mbox/crail-dev/201809.mbox/browser
> > > ).
> > > > In
> > > > > a nutshell: I am investigating the performance of various file
> formats
> > > > > (including Arrow) on high-performance NVMe and RDMA/100Gbps/RoCE
> > > setups.
> > > > I
> > > > > benchmarked how long does it take to materialize values (ints,
> longs,
> > > > > doubles) of the store_sales table, the largest table in the TPC-DS
> > > > dataset
> > > > > stored on different file formats. Here is a write-up on this -
> > > > > https://crail.incubator.apache.org/blog/2018/08/sql-p1.html. I
> found
> > > > that
> > > > > between a pair of machine connected over a 100 Gbps link, Arrow
> (using
> > > > as a
> > > > > file format on HDFS) delivered close to ~30 Gbps of bandwidth with
> all
> > > 16
> > > > > cores engaged. Wes pointed out that (i) Arrow is in-memory IPC
> format,
> > > > and
> > > > > has not been optimized for storage interfaces/APIs like HDFS; (ii)
> the
> > > > > performance I am measuring is for the java implementation.
> > > > >
> > > > > Wes, I hope I summarized our discussion correctly.
> > > > >
> > > > > That brings us to this email where I promised to follow up on the
> Arrow
> > > > > mailing list to understand and optimize the performance of
> Arrow/Java
> > > > > implementation on high-performance devices. I wrote a small
> stand-alone
> > > > > benchmark (https://github.com/animeshtrivedi/benchmarking-arrow)
> with
> > > > three
> > > > > implementations of WritableByteChannel, SeekableByteChannel
> interfaces:
> > > > >
> > > > > 1. Arrow data is stored in HDFS/tmpfs - this gives me ~30 Gbps
> > > > performance
> > > > > 2. Arrow data is stored in Crail/DRAM - this gives me ~35-36 Gbps
> > > > > performance
> > > > > 3. Arrow data is stored in on-heap byte[] - this gives me ~39 Gbps
> > > > > performance
> > > > >
> > > > > I think the order makes sense. To better understand the
> performance of
> > > > > Arrow/Java we can focus on the option 3.
> > > > >
> > > > > The key question I am trying to answer is "what would it take for
> > > > > Arrow/Java to deliver 100+ Gbps of performance"? Is it possible? If
> > > yes,
> > > > > then what is missing/or mis-interpreted by me? If not, then where
> is
> > > the
> > > > > performance lost? Does anyone have any performance measurements
> for C++
> > > > > implementation? if they have seen/expect better numbers.
> > > > >
> > > > > As a next step, I will profile the read path of Arrow/Java for the
> > > option
> > > > > 3. I will report my findings.
> > > > >
> > > > > Any thoughts and feedback on this investigation are very welcome.
> > > > >
> > > > > Cheers,
> > > > > --
> > > > > Animesh
> > > > >
> > > > > PS~ Cross-posting on the dev@crail.apache.org list as well.
> > > >
> > >
>

Re: [JAVA] Arrow performance measurement

Posted by Animesh Trivedi <an...@gmail.com>.
Hi Wes, sure. I opened a ticket and will do a pull request.

Cheers,
--
Animesh

On Fri, Nov 30, 2018 at 5:28 PM Wes McKinney <we...@gmail.com> wrote:

> hi Animesh -- can you link to JIRA issues about the C++ improvements
> you're describing? Want to make sure this doesn't fall through the
> cracks
>
> Thanks
> Wes
> On Mon, Nov 26, 2018 at 7:54 AM Antoine Pitrou <an...@python.org> wrote:
> >
> >
> > Hi Animesh,
> >
> > Le 26/11/2018 à 14:23, Animesh Trivedi a écrit :
> > >
> > > * C++ bitmap code can be optimized further by using the unsigned
> integers
> > > than "int64_t" for bitmap checks, and eliminating the kBitmap. See here
> > > https://godbolt.org/z/deq0_q - compare the size of the assembly code.
> And
> > > the performance measurements in the blog show up to 50% performance
> gains.
> > > Alternatively if signed to unsigned upgrade is not possible (perhaps in
> > > every language), then in the C++ code, we should use the bitmap
> operations
> > > directory ( `<<3` for division by 8, and ` & 0x7` for modulo by 8
> > > operation), instead of `/` and `%`.
> >
> > Thank you for noticing this.  Switching to unsigned (and do a
> > static_cast to unsigned) sounds good to me.
> >
> > Regards
> >
> > Antoine.
>

Re: [JAVA] Arrow performance measurement

Posted by Wes McKinney <we...@gmail.com>.
hi Animesh -- can you link to JIRA issues about the C++ improvements
you're describing? Want to make sure this doesn't fall through the
cracks

Thanks
Wes
On Mon, Nov 26, 2018 at 7:54 AM Antoine Pitrou <an...@python.org> wrote:
>
>
> Hi Animesh,
>
> Le 26/11/2018 à 14:23, Animesh Trivedi a écrit :
> >
> > * C++ bitmap code can be optimized further by using the unsigned integers
> > than "int64_t" for bitmap checks, and eliminating the kBitmap. See here
> > https://godbolt.org/z/deq0_q - compare the size of the assembly code. And
> > the performance measurements in the blog show up to 50% performance gains.
> > Alternatively if signed to unsigned upgrade is not possible (perhaps in
> > every language), then in the C++ code, we should use the bitmap operations
> > directory ( `<<3` for division by 8, and ` & 0x7` for modulo by 8
> > operation), instead of `/` and `%`.
>
> Thank you for noticing this.  Switching to unsigned (and do a
> static_cast to unsigned) sounds good to me.
>
> Regards
>
> Antoine.

Re: [JAVA] Arrow performance measurement

Posted by Antoine Pitrou <an...@python.org>.
Hi Animesh,

Le 26/11/2018 à 14:23, Animesh Trivedi a écrit :
> 
> * C++ bitmap code can be optimized further by using the unsigned integers
> than "int64_t" for bitmap checks, and eliminating the kBitmap. See here
> https://godbolt.org/z/deq0_q - compare the size of the assembly code. And
> the performance measurements in the blog show up to 50% performance gains.
> Alternatively if signed to unsigned upgrade is not possible (perhaps in
> every language), then in the C++ code, we should use the bitmap operations
> directory ( `<<3` for division by 8, and ` & 0x7` for modulo by 8
> operation), instead of `/` and `%`.

Thank you for noticing this.  Switching to unsigned (and doing a
static_cast to unsigned) sounds good to me.

Regards

Antoine.

Re: [JAVA] Arrow performance measurement

Posted by Animesh Trivedi <an...@gmail.com>.
Hi all,

Apologies for the silence, I have been occupied. Here is part 3 of the
investigation. This time, I compared the performance of the Java and C++
readers. The benchmark reads and checksums 10 billion integers (~40 GB of
data, with some null values) on a single core. You can read the full
investigation here:
https://github.com/animeshtrivedi/blog/blob/master/post/2018-11-22-arrow-cpp.md.
The best numbers that I have are (see the table at the end of the blog for
more details):

C++ (file)    : 21.5 Gbps
C++ (mmap)    : 30.4 Gbps
Java (unsafe) : 12.48 Gbps

The key insights are:

* C++ code (with file I/O) is generally 2x faster than the Java code. I
have narrowed the profiling down to JITing issues, memory management,
etc. Most of these issues are independent of Arrow (as far as I can
tell). The standalone benchmark to debug these issues is here:
https://github.com/animeshtrivedi/java-cpp-fun. The pattern of the
workload is: (1) you have a large integer array and a bitmap array; (2)
you go over the array, check for NULL, and sum the values (a small Java
sketch of this loop follows below). I broadly understand why the Java
code is slow, but what I want to know is how to make it faster than, or
at least comparable to, the C++ code. Any input on this is appreciated.
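
To make the pattern concrete, this is roughly what the inner loop of the
standalone (non-Arrow) benchmark looks like in Java; the method name and
the byte[] validity bitmap are illustrative, not the exact benchmark code:

// Sum all non-null entries of a primitive int[] guarded by a validity
// bitmap (1 bit per value, LSB-first within each byte, 1 = value set).
static long checksum(int[] values, byte[] validity) {
    long sum = 0;
    for (int i = 0; i < values.length; i++) {
        boolean isSet = ((validity[i / 8] >> (i % 8)) & 1) != 0;
        if (isSet) {
            sum += values[i];
        }
    }
    return sum;
}

The straightforward `/` and `%` form of the bit test is used here on
purpose; the last bullet below discusses replacing it with shifts and
masks.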

* The C++ code has a no-null-values optimization branch around the
IsValid() check. The Java code could benefit from it too, but does not
have this implemented yet. I will open a pull request for this (a sketch
of the idea follows below).
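
As a sketch of what that fast path could look like on the Java side
(using the standard IntVector accessors; this is just the idea, not the
eventual patch):

// Skip the per-value validity check entirely when the vector has no
// nulls, mirroring the C++ no-nulls branch around IsValid().
static long checksum(IntVector vec) {
    long sum = 0;
    int n = vec.getValueCount();
    if (vec.getNullCount() == 0) {
        for (int i = 0; i < n; i++) {
            sum += vec.get(i);      // no isNull() branch in the hot loop
        }
    } else {
        for (int i = 0; i < n; i++) {
            if (!vec.isNull(i)) {
                sum += vec.get(i);
            }
        }
    }
    return sum;
}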

* The C++ bitmap code can be optimized further by using unsigned integers
rather than "int64_t" for the bitmap checks, and by eliminating kBitmap.
See https://godbolt.org/z/deq0_q and compare the size of the generated
assembly code; the performance measurements in the blog show up to 50%
performance gains. Alternatively, if a signed-to-unsigned upgrade is not
possible (perhaps in every language), then the C++ code should use the
bit operations directly (`>> 3` for division by 8 and `& 0x7` for modulo
by 8) instead of `/` and `%`.
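
For completeness, the two forms of the bit test differ in Java as well
(illustrative helpers; with a signed index the compiler has to emit extra
correction code for `/` and `%`, because they round differently from `>>`
and `&` when the operand is negative):

// Division/modulo form: correct, but signed division by 8 is not a
// plain shift for the JIT.
static boolean isSetDivMod(byte[] bitmap, int i) {
    return ((bitmap[i / 8] >> (i % 8)) & 1) != 0;
}

// Shift/mask form: i >> 3 is the byte offset, i & 0x7 the bit offset.
static boolean isSetShiftMask(byte[] bitmap, int i) {
    return ((bitmap[i >> 3] >> (i & 0x7)) & 1) != 0;
}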

Thanks,
--
Animesh


On Thu, Oct 11, 2018 at 6:55 PM Animesh Trivedi <an...@gmail.com>
wrote:

> Hi Li - thanks for setting up the jiras. I'll prepare the pull requests
> soon.
>
> Hi Wes, I think that is a good idea to write a blog about performance,
> with pointing to the improvements in the code base (fixes and
> microbenchmarks). I can prepare a first draft as we go along. This will
> definitely serve well to the large java community.
>
> Cheers,
> --
> Animesh
>
>
> On Thu, 11 Oct 2018, 17:55 Li Jin, <ic...@gmail.com> wrote:
>
>> I have created these as the first step. Animesh, feel free to submit PR
>> for
>> these. I will look into your micro benchmarks soon.
>>
>>
>>
>>    1. ARROW-3497 [Java] Add user documentation for achieving better
>>    performance <https://jira.apache.org/jira/browse/ARROW-3497>
>>    2. ARROW-3496 [Java] Add microbenchmark code to Java
>>    <https://jira.apache.org/jira/browse/ARROW-3496>
>>    3. ARROW-3495 [Java] Optimize bit operations performance
>>    <https://jira.apache.org/jira/browse/ARROW-3495>
>>    4. ARROW-3493 [Java] Document BOUNDS_CHECKING_ENABLED
>>    <https://jira.apache.org/jira/browse/ARROW-3493>
>>
>>
>> On Thu, Oct 11, 2018 at 10:00 AM Li Jin <ic...@gmail.com> wrote:
>>
>> > Hi Wes and Animesh,
>> >
>> > Thanks for the analysis and discussion. I am happy to looking into
>> this. I
>> > will create some Jiras soon.
>> >
>> > Li
>> >
>> > On Thu, Oct 11, 2018 at 5:49 AM Wes McKinney <we...@gmail.com>
>> wrote:
>> >
>> >> hey Animesh,
>> >>
>> >> Thank you for doing this analysis. If you'd like to share some of the
>> >> analysis more broadly e.g. on the Apache Arrow blog or social media,
>> >> let us know.
>> >>
>> >> Seems like there might be a few follow ups here for the Arrow Java
>> >> community:
>> >>
>> >> * Documentation about achieving better performance
>> >> * Writing some microperformance benchmarks
>> >> * Making some improvements to the code to facilitate better performance
>> >>
>> >> Feel free to create some JIRA issues. Are any Java developers
>> >> interested in digging a little more into these issues?
>> >>
>> >> Thanks,
>> >> Wes
>> >> On Tue, Oct 9, 2018 at 7:18 AM Animesh Trivedi
>> >> <an...@gmail.com> wrote:
>> >> >
>> >> > Hi Wes and all,
>> >> >
>> >> > Here is another round of updates:
>> >> >
>> >> > Quick recap - previously we established that for 1kB binary blobs,
>> Arrow
>> >> > can deliver > 160 Gbps performance from in-memory buffers.
>> >> >
>> >> > In this round I looked at the performance of materializing
>> "integers".
>> >> In
>> >> > my benchmarks, I found that with careful
>> optimizations/code-rewriting we
>> >> > can push the performance of integer reading from 5.42 Gbps/core to
>> 13.61
>> >> > Gbps/core (~2.5x). The peak performance with 16 cores, scale up to
>> 110+
>> >> > Gbps. Key things to do is:
>> >> >
>> >> > 1) Disable memory access checks in Arrow and Netty buffers. This gave
>> >> > significant performance boost. However, for such an important
>> >> performance
>> >> > flag, it is very poorly documented
>> >> > ("drill.enable_unsafe_memory_access=true").
>> >> >
>> >> > 2) Materialize values from Validity and Value direct buffers instead
>> of
>> >> > calling getInt() function on the IntVector. This is implemented as a
>> new
>> >> > Unsafe reader type (
>> >> >
>> >>
>> https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/ArrowReaderUnsafe.java#L31
>> >> > )
>> >> >
>> >> > 3) Optimize bitmap operation to check if a bit is set or not (
>> >> >
>> >>
>> https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/ArrowReaderUnsafe.java#L23
>> >> > )
>> >> >
>> >> > A detailed write up of these steps is available here:
>> >> >
>> >>
>> https://github.com/animeshtrivedi/blog/blob/master/post/2018-10-09-arrow-int.md
>> >> >
>> >> > I have 2 follow-up questions:
>> >> >
>> >> > 1) Regarding the `isSet` function, why does it has to calculate
>> number
>> >> of
>> >> > bits set? (
>> >> >
>> >>
>> https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java#L797
>> >> ).
>> >> > Wouldn't just checking if the result of the AND operation is zero or
>> >> not be
>> >> > sufficient? Like what I did :
>> >> >
>> >>
>> https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/ArrowReaderUnsafe.java#L28
>> >> >
>> >> >
>> >> > 2) What is the reason behind this bitmap generation optimization here
>> >> >
>> >>
>> https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/BitVectorHelper.java#L179
>> >> > ? At this point when this function is called, the bitmap vector is
>> >> already
>> >> > read from the storage, and contains the right values (either all
>> null,
>> >> all
>> >> > set, or whatever). Generating this mask here for the special cases
>> when
>> >> the
>> >> > values are all NULL or all set (this was the case in my benchmark),
>> can
>> >> be
>> >> > slower than just returning what one has read from the storage.
>> >> >
>> >> > Collectively optimizing these two bitmap operations give more than 1
>> >> Gbps
>> >> > gains in my bench-marking code.
>> >> >
>> >> > Cheers,
>> >> > --
>> >> > Animesh
>> >> >
>> >> >
>> >> > On Thu, Oct 4, 2018 at 12:52 PM Wes McKinney <we...@gmail.com>
>> >> wrote:
>> >> >
>> >> > > See e.g.
>> >> > >
>> >> > >
>> >> > >
>> >>
>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/ipc-read-write-test.cc#L222
>> >> > >
>> >> > >
>> >> > > On Thu, Oct 4, 2018 at 6:48 AM Animesh Trivedi
>> >> > > <an...@gmail.com> wrote:
>> >> > > >
>> >> > > > Primarily write the same microbenchmark as I have in Java in C++
>> for
>> >> > > table
>> >> > > > reading and value materialization. So just an example of
>> equivalent
>> >> > > > ArrowFileReader example code in C++. Unit tests are a good
>> starting
>> >> > > point,
>> >> > > > thanks for the tip :)
>> >> > > >
>> >> > > > On Thu, Oct 4, 2018 at 12:39 PM Wes McKinney <
>> wesmckinn@gmail.com>
>> >> > > wrote:
>> >> > > >
>> >> > > > > > 3. Are there examples of Arrow in C++ read/write code that I
>> can
>> >> > > have a
>> >> > > > > look?
>> >> > > > >
>> >> > > > > What kind of code are you looking for? I would direct you to
>> >> relevant
>> >> > > > > unit tests that exhibit certain functionality, but it depends
>> on
>> >> what
>> >> > > > > you are trying to do
>> >> > > > > On Wed, Oct 3, 2018 at 9:45 AM Animesh Trivedi
>> >> > > > > <an...@gmail.com> wrote:
>> >> > > > > >
>> >> > > > > > Hi all - quick update on the performance investigation:
>> >> > > > > >
>> >> > > > > > - I spent some time looking at performance profile for a
>> binary
>> >> blob
>> >> > > > > column
>> >> > > > > > (1024 bytes of byte[]) and found a few favorable settings for
>> >> > > delivering
>> >> > > > > up
>> >> > > > > > to 168 Gbps from in-memory reading benchmark on 16 cores.
>> These
>> >> > > settings
>> >> > > > > > (NUMA, JVM settings, Arrow holder API, and batch size, etc.)
>> are
>> >> > > > > documented
>> >> > > > > > here:
>> >> > > > > >
>> >> > > > >
>> >> > >
>> >>
>> https://github.com/animeshtrivedi/blog/blob/master/post/2018-10-03-arrow-binary.md
>> >> > > > > > - these setting also help to improved the last number that
>> >> reported
>> >> > > (but
>> >> > > > > > not by much) for the in-memory TPC-DS store_sales table from
>> ~39
>> >> > > Gbps up
>> >> > > > > to
>> >> > > > > > ~45-47 Gbps (note: this number is just in-memory benchmark,
>> >> i.e.,
>> >> > > w/o any
>> >> > > > > > networking or storage links)
>> >> > > > > >
>> >> > > > > > A few follow up questions that I have:
>> >> > > > > > 1. Arrow reads a batch size worth of data in one go. Are
>> there
>> >> any
>> >> > > > > > recommended batch sizes? In my investigation, small batch
>> size
>> >> help
>> >> > > with
>> >> > > > > a
>> >> > > > > > better cache profile but increase number of instructions
>> >> required
>> >> > > (more
>> >> > > > > > looping). Larger one do otherwise. Somehow ~10MB/thread seem
>> to
>> >> be
>> >> > > the
>> >> > > > > best
>> >> > > > > > performing configuration, which is also a bit counter
>> intuitive
>> >> as
>> >> > > for 16
>> >> > > > > > threads this will lead to 160 MB of memory footprint. May be
>> >> this is
>> >> > > also
>> >> > > > > > tired to the memory management logic which is my next
>> question.
>> >> > > > > > 2. Arrow use's netty's memory manager. (i) what are decent
>> netty
>> >> > > memory
>> >> > > > > > management settings for "io.netty.allocator.*" parameters? I
>> >> don't
>> >> > > find
>> >> > > > > any
>> >> > > > > > decent write-up on them; (ii) Is there a provision for
>> ArrowBuf
>> >> being
>> >> > > > > > re-used once a batch is consumed? As it looks for now, read
>> read
>> >> > > > > allocates
>> >> > > > > > a new buffer to read the whole batch size.
>> >> > > > > > 3. Are there examples of Arrow in C++ read/write code that I
>> can
>> >> > > have a
>> >> > > > > > look?
>> >> > > > > >
>> >> > > > > > Cheers,
>> >> > > > > > --
>> >> > > > > > Animesh
>> >> > > > > >
>> >> > > > > >
>> >> > > > > > On Wed, Sep 19, 2018 at 8:49 PM Wes McKinney <
>> >> wesmckinn@gmail.com>
>> >> > > > > wrote:
>> >> > > > > >
>> >> > > > > > > On Wed, Sep 19, 2018 at 2:13 PM Animesh Trivedi
>> >> > > > > > > <an...@gmail.com> wrote:
>> >> > > > > > > >
>> >> > > > > > > > Hi Johan, Wes, and Jacques - many thanks for your
>> comments:
>> >> > > > > > > >
>> >> > > > > > > > @Johan -
>> >> > > > > > > > 1. I also do not suspect that there is any inherent
>> >> drawback in
>> >> > > Java
>> >> > > > > or
>> >> > > > > > > C++
>> >> > > > > > > > due to the Arrow format. I mentioned C++ because Wes
>> >> pointed out
>> >> > > that
>> >> > > > > > > Java
>> >> > > > > > > > routines are not the most optimized ones (yet!). And
>> >> naturally
>> >> > > one
>> >> > > > > would
>> >> > > > > > > > expect better performance in a native language with all
>> >> > > > > > > pointer/memory/SIMD
>> >> > > > > > > > instruction optimizations that you mentioned. As far as I
>> >> know,
>> >> > > the
>> >> > > > > > > > off-heap buffers are managed in ArrowBuf which
>> implements an
>> >> > > abstract
>> >> > > > > > > netty
>> >> > > > > > > > class. But there is nothing unusual, i.e., netty
>> specific,
>> >> about
>> >> > > > > these
>> >> > > > > > > > unsafe routines, they are used by many projects. Though
>> >> there is
>> >> > > cost
>> >> > > > > > > > associated with materializing on-heap Java values from
>> >> off-heap
>> >> > > > > memory
>> >> > > > > > > > regions. I need to benchmark that more carefully.
>> >> > > > > > > >
>> >> > > > > > > > 2. When you say "I've so far always been able to get
>> similar
>> >> > > > > performance
>> >> > > > > > > > numbers" - do you mean the same performance of my case 3
>> >> where 16
>> >> > > > > cores
>> >> > > > > > > > drive close to 40 Gbps, or the same performance between
>> >> your C++
>> >> > > and
>> >> > > > > Java
>> >> > > > > > > > benchmarks. Do you have some write-up? I would be
>> >> interested to
>> >> > > read
>> >> > > > > up
>> >> > > > > > > :)
>> >> > > > > > > >
>> >> > > > > > > > 3. "Can you get to 100 Gbps starting from primitive
>> arrays
>> >> in
>> >> > > Java"
>> >> > > > > ->
>> >> > > > > > > that
>> >> > > > > > > > is a good idea. Let me try and report back.
>> >> > > > > > > >
>> >> > > > > > > > @Wes -
>> >> > > > > > > >
>> >> > > > > > > > Is there some benchmark template for C++ routines I can
>> >> have a
>> >> > > look?
>> >> > > > > I
>> >> > > > > > > > would be happy to get some input from Java-Arrow experts
>> on
>> >> how
>> >> > > to
>> >> > > > > write
>> >> > > > > > > > these benchmarks more efficiently. I will have a closer
>> >> look at
>> >> > > the
>> >> > > > > JIRA
>> >> > > > > > > > tickets that you mentioned.
>> >> > > > > > > >
>> >> > > > > > > > So, for now I am focusing on the case 3, which is about
>> >> > > establishing
>> >> > > > > > > > performance when reading from a local in-memory I/O
>> stream
>> >> that I
>> >> > > > > > > > implemented (
>> >> > > > > > > >
>> >> > > > > > >
>> >> > > > >
>> >> > >
>> >>
>> https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/MemoryIOChannel.java
>> >> > > > > > > ).
>> >> > > > > > > > In this case I first read data from parquet files,
>> convert
>> >> them
>> >> > > into
>> >> > > > > > > Arrow,
>> >> > > > > > > > and write-out to a MemoryIOChannel, and then read back
>> from
>> >> it.
>> >> > > So,
>> >> > > > > the
>> >> > > > > > > > performance has nothing to do with Crail or HDFS in the
>> >> case 3.
>> >> > > > > Once, I
>> >> > > > > > > > establish the base performance in this setup (which is
>> >> around ~40
>> >> > > > > Gbps
>> >> > > > > > > with
>> >> > > > > > > > 16 cores) I will add Crail to the mix. Perhaps Crail I/O
>> >> streams
>> >> > > can
>> >> > > > > take
>> >> > > > > > > > ArrowBuf as src/dst buffers. That should be doable.
>> >> > > > > > >
>> >> > > > > > > Right, in any case what you are testing is the performance
>> of
>> >> > > using a
>> >> > > > > > > particular Java accessor layer to JVM off-heap Arrow memory
>> >> to sum
>> >> > > the
>> >> > > > > > > non-null values of each column. I'm not sure that a single
>> >> > > bandwidth
>> >> > > > > > > number produced by this benchmark is very informative for
>> >> people
>> >> > > > > > > contemplating what memory format to use in their system due
>> >> to the
>> >> > > > > > > current state of the implementation (Java) and workload
>> >> measured
>> >> > > > > > > (summing the non-null values with a naive algorithm). I
>> would
>> >> guess
>> >> > > > > > > that a C++ version with raw pointers and a loop-unrolled,
>> >> > > branch-free
>> >> > > > > > > vectorized sum is going to be a lot faster.
>> >> > > > > > >
>> >> > > > > > > >
>> >> > > > > > > > @Jacques -
>> >> > > > > > > >
>> >> > > > > > > > That is a good point that "Arrow's implementation is more
>> >> > > focused on
>> >> > > > > > > > interacting with the structure than transporting it".
>> >> However,
>> >> > > in any
>> >> > > > > > > > distributed system one needs to move data/structure
>> around
>> >> - as
>> >> > > far
>> >> > > > > as I
>> >> > > > > > > > understand that is another goal of the project. My
>> >> investigation
>> >> > > > > started
>> >> > > > > > > > within the context of Spark/SQL data processing. Spark
>> >> converts
>> >> > > > > incoming
>> >> > > > > > > > data into its own in-memory UnsafeRow representation for
>> >> > > processing.
>> >> > > > > So
>> >> > > > > > > > naturally the performance of this data ingestion pipeline
>> >> cannot
>> >> > > > > > > outperform
>> >> > > > > > > > the read performance of the used file format. I
>> benchmarked
>> >> > > Parquet,
>> >> > > > > ORC,
>> >> > > > > > > > Avro, JSON (for the specific TPC-DS store_sales table).
>> And
>> >> then
>> >> > > > > > > curiously
>> >> > > > > > > > benchmarked Arrow as well because its design choices are
>> a
>> >> better
>> >> > > > > fit for
>> >> > > > > > > > modern high-performance RDMA/NVMe/100+Gbps devices I am
>> >> > > > > investigating.
>> >> > > > > > > From
>> >> > > > > > > > this point of view, I am trying to find out can Arrow be
>> >> the file
>> >> > > > > format
>> >> > > > > > > > for the next generation of storage/networking devices
>> (see
>> >> Apache
>> >> > > > > Crail
>> >> > > > > > > > project) delivering close to the hardware speed
>> >> reading/writing
>> >> > > > > rates. As
>> >> > > > > > > > Wes pointed out that a C++ library implementation should
>> be
>> >> > > memory-IO
>> >> > > > > > > > bound, so what would it take to deliver the same
>> >> performance in
>> >> > > Java
>> >> > > > > ;)
>> >> > > > > > > > (and then, from across the network).
>> >> > > > > > > >
>> >> > > > > > > > I hope this makes sense.
>> >> > > > > > > >
>> >> > > > > > > > Cheers,
>> >> > > > > > > > --
>> >> > > > > > > > Animesh
>> >> > > > > > > >
>> >> > > > > > > > On Wed, Sep 19, 2018 at 6:28 PM Jacques Nadeau <
>> >> > > jacques@apache.org>
>> >> > > > > > > wrote:
>> >> > > > > > > >
>> >> > > > > > > > > My big question is what is the use case and how/what
>> are
>> >> you
>> >> > > > > trying to
>> >> > > > > > > > > compare? Arrow's implementation is more focused on
>> >> interacting
>> >> > > > > with the
>> >> > > > > > > > > structure than transporting it. Generally speaking,
>> when
>> >> we're
>> >> > > > > working
>> >> > > > > > > with
>> >> > > > > > > > > Arrow data we frequently are just interacting with
>> memory
>> >> > > > > locations and
>> >> > > > > > > > > doing direct operations. If you have a layer that
>> >> supports that
>> >> > > > > type of
>> >> > > > > > > > > semantic, create a movement technique that depends on
>> >> that.
>> >> > > Arrow
>> >> > > > > > > doesn't
>> >> > > > > > > > > force a particular API since the data itself is defined
>> >> by its
>> >> > > > > > > in-memory
>> >> > > > > > > > > layout so if you have a custom use or pattern, just
>> work
>> >> with
>> >> > > the
>> >> > > > > > > in-memory
>> >> > > > > > > > > structures.
>> >> > > > > > > > >
>> >> > > > > > > > >
>> >> > > > > > > > >
>> >> > > > > > > > > On Wed, Sep 19, 2018 at 7:49 AM Wes McKinney <
>> >> > > wesmckinn@gmail.com>
>> >> > > > > > > wrote:
>> >> > > > > > > > >
>> >> > > > > > > > > > hi Animesh,
>> >> > > > > > > > > >
>> >> > > > > > > > > > Per Johan's comments, the C++ library is essentially
>> >> going
>> >> > > to be
>> >> > > > > > > > > > IO/memory bandwidth bound since you're interacting
>> with
>> >> raw
>> >> > > > > pointers.
>> >> > > > > > > > > >
>> >> > > > > > > > > > I'm looking at your code
>> >> > > > > > > > > >
>> >> > > > > > > > > > private void consumeFloat4(FieldVector fv) {
>> >> > > > > > > > > >     Float4Vector accessor = (Float4Vector) fv;
>> >> > > > > > > > > >     int valCount = accessor.getValueCount();
>> >> > > > > > > > > >     for(int i = 0; i < valCount; i++){
>> >> > > > > > > > > >         if(!accessor.isNull(i)){
>> >> > > > > > > > > >             float4Count+=1;
>> >> > > > > > > > > >             checksum+=accessor.get(i);
>> >> > > > > > > > > >         }
>> >> > > > > > > > > >     }
>> >> > > > > > > > > > }
>> >> > > > > > > > > >
>> >> > > > > > > > > > You'll want to get a Java-Arrow expert from Dremio to
>> >> advise
>> >> > > you
>> >> > > > > the
>> >> > > > > > > > > > fastest way to iterate over this data -- my
>> >> understanding is
>> >> > > that
>> >> > > > > > > much
>> >> > > > > > > > > > code in Dremio interacts with the wrapped Netty
>> ArrowBuf
>> >> > > objects
>> >> > > > > > > > > > rather than going through the higher level APIs.
>> You're
>> >> also
>> >> > > > > dropping
>> >> > > > > > > > > > performance because memory mapping is not yet
>> >> implemented in
>> >> > > > > Java,
>> >> > > > > > > see
>> >> > > > > > > > > > https://issues.apache.org/jira/browse/ARROW-3191.
>> >> > > > > > > > > >
>> >> > > > > > > > > > Furthermore, the IPC reader class you are using could
>> >> be made
>> >> > > > > more
>> >> > > > > > > > > > efficient. I described the problem in
>> >> > > > > > > > > > https://issues.apache.org/jira/browse/ARROW-3192 --
>> >> this
>> >> > > will be
>> >> > > > > > > > > > required as soon as we have the ability to do memory
>> >> mapping
>> >> > > in
>> >> > > > > Java
>> >> > > > > > > > > >
>> >> > > > > > > > > > Could Crail use the Arrow data structures in its
>> runtime
>> >> > > rather
>> >> > > > > than
>> >> > > > > > > > > > copying? If not, how are Crail's runtime data
>> structures
>> >> > > > > different?
>> >> > > > > > > > > >
>> >> > > > > > > > > > - Wes
>> >> > > > > > > > > > On Wed, Sep 19, 2018 at 9:19 AM Johan Peltenburg -
>> EWI
>> >> > > > > > > > > > <J....@tudelft.nl> wrote:
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > Hello Animesh,
>> >> > > > > > > > > > >
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > I browsed a bit in your sources, thanks for
>> sharing.
>> >> We
>> >> > > have
>> >> > > > > > > performed
>> >> > > > > > > > > > some similar measurements to your third case in the
>> >> past for
>> >> > > > > C/C++ on
>> >> > > > > > > > > > collections of various basic types such as primitives
>> >> and
>> >> > > > > strings.
>> >> > > > > > > > > > >
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > I can say that in terms of consuming data from the
>> >> Arrow
>> >> > > format
>> >> > > > > > > versus
>> >> > > > > > > > > > language native collections in C++, I've so far
>> always
>> >> been
>> >> > > able
>> >> > > > > to
>> >> > > > > > > get
>> >> > > > > > > > > > similar performance numbers (e.g. no drawbacks due to
>> >> the
>> >> > > Arrow
>> >> > > > > > > format
>> >> > > > > > > > > > itself). Especially when accessing the data through
>> >> Arrow's
>> >> > > raw
>> >> > > > > data
>> >> > > > > > > > > > pointers (and using for example std::string_view-like
>> >> > > > > constructs).
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > In C/C++ the fast data structures are engineered in
>> >> such a
>> >> > > way
>> >> > > > > > > that as
>> >> > > > > > > > > > little pointer traversals are required and they take
>> up
>> >> an as
>> >> > > > > small
>> >> > > > > > > as
>> >> > > > > > > > > > possible memory footprint. Thus each memory access is
>> >> > > relatively
>> >> > > > > > > > > efficient
>> >> > > > > > > > > > (in terms of obtaining the data of interest). The
>> same
>> >> can
>> >> > > > > > > absolutely be
>> >> > > > > > > > > > said for Arrow, if not even more efficient in some
>> cases
>> >> > > where
>> >> > > > > object
>> >> > > > > > > > > > fields are of variable length.
>> >> > > > > > > > > > >
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > In the JVM case, the Arrow data is stored off-heap.
>> >> This
>> >> > > > > requires
>> >> > > > > > > the
>> >> > > > > > > > > > JVM to interface to it through some calls to Unsafe
>> >> hidden
>> >> > > under
>> >> > > > > the
>> >> > > > > > > > > Netty
>> >> > > > > > > > > > layer (but please correct me if I'm wrong, I'm not an
>> >> expert
>> >> > > on
>> >> > > > > > > this).
>> >> > > > > > > > > > Those calls are the only reason I can think of that
>> >> would
>> >> > > > > degrade the
>> >> > > > > > > > > > performance a bit compared to a pure JAva case. I
>> don't
>> >> know
>> >> > > if
>> >> > > > > the
>> >> > > > > > > > > Unsafe
>> >> > > > > > > > > > calls are inlined during JIT compilation. If they
>> >> aren't they
>> >> > > > > will
>> >> > > > > > > > > increase
>> >> > > > > > > > > > access latency to any data a little bit.
>> >> > > > > > > > > > >
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > I don't have a similar machine so it's not easy to
>> >> relate
>> >> > > my
>> >> > > > > > > numbers to
>> >> > > > > > > > > > yours, but if you can get that data consumed with 100
>> >> Gbps
>> >> > > in a
>> >> > > > > pure
>> >> > > > > > > Java
>> >> > > > > > > > > > case, I don't see any reason (resulting from Arrow
>> >> format /
>> >> > > > > off-heap
>> >> > > > > > > > > > storage) why you wouldn't be able to get at least
>> really
>> >> > > close.
>> >> > > > > Can
>> >> > > > > > > you
>> >> > > > > > > > > get
>> >> > > > > > > > > > to 100 Gbps starting from primitive arrays in Java
>> with
>> >> your
>> >> > > > > > > consumption
>> >> > > > > > > > > > functions in the first place?
>> >> > > > > > > > > > >
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > I'm interested to see your progress on this.
>> >> > > > > > > > > > >
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > Kind regards,
>> >> > > > > > > > > > >
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > Johan Peltenburg
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > ________________________________
>> >> > > > > > > > > > > From: Animesh Trivedi <an...@gmail.com>
>> >> > > > > > > > > > > Sent: Wednesday, September 19, 2018 2:08:50 PM
>> >> > > > > > > > > > > To: dev@arrow.apache.org; dev@crail.apache.org
>> >> > > > > > > > > > > Subject: [JAVA] Arrow performance measurement
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > Hi all,
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > A week ago, Wes and I had a discussion about the
>> >> > > performance
>> >> > > > > of the
>> >> > > > > > > > > > > Arrow/Java implementation on the Apache Crail
>> >> (Incubating)
>> >> > > > > mailing
>> >> > > > > > > > > list (
>> >> > > > > > > > > > >
>> >> > > > > > >
>> >> > >
>> >> http://mail-archives.apache.org/mod_mbox/crail-dev/201809.mbox/browser
>> >> > > > > > > > > ).
>> >> > > > > > > > > > In
>> >> > > > > > > > > > > a nutshell: I am investigating the performance of
>> >> various
>> >> > > file
>> >> > > > > > > formats
>> >> > > > > > > > > > > (including Arrow) on high-performance NVMe and
>> >> > > > > RDMA/100Gbps/RoCE
>> >> > > > > > > > > setups.
>> >> > > > > > > > > > I
>> >> > > > > > > > > > > benchmarked how long does it take to materialize
>> >> values
>> >> > > (ints,
>> >> > > > > > > longs,
>> >> > > > > > > > > > > doubles) of the store_sales table, the largest
>> table
>> >> in the
>> >> > > > > TPC-DS
>> >> > > > > > > > > > dataset
>> >> > > > > > > > > > > stored on different file formats. Here is a
>> write-up
>> >> on
>> >> > > this -
>> >> > > > > > > > > > >
>> >> > > https://crail.incubator.apache.org/blog/2018/08/sql-p1.html. I
>> >> > > > > > > found
>> >> > > > > > > > > > that
>> >> > > > > > > > > > > between a pair of machine connected over a 100 Gbps
>> >> link,
>> >> > > Arrow
>> >> > > > > > > (using
>> >> > > > > > > > > > as a
>> >> > > > > > > > > > > file format on HDFS) delivered close to ~30 Gbps of
>> >> > > bandwidth
>> >> > > > > with
>> >> > > > > > > all
>> >> > > > > > > > > 16
>> >> > > > > > > > > > > cores engaged. Wes pointed out that (i) Arrow is
>> >> in-memory
>> >> > > IPC
>> >> > > > > > > format,
>> >> > > > > > > > > > and
>> >> > > > > > > > > > > has not been optimized for storage interfaces/APIs
>> >> like
>> >> > > HDFS;
>> >> > > > > (ii)
>> >> > > > > > > the
>> >> > > > > > > > > > > performance I am measuring is for the java
>> >> implementation.
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > Wes, I hope I summarized our discussion correctly.
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > That brings us to this email where I promised to
>> >> follow up
>> >> > > on
>> >> > > > > the
>> >> > > > > > > Arrow
>> >> > > > > > > > > > > mailing list to understand and optimize the
>> >> performance of
>> >> > > > > > > Arrow/Java
>> >> > > > > > > > > > > implementation on high-performance devices. I
>> wrote a
>> >> small
>> >> > > > > > > stand-alone
>> >> > > > > > > > > > > benchmark (
>> >> > > > > https://github.com/animeshtrivedi/benchmarking-arrow)
>> >> > > > > > > with
>> >> > > > > > > > > > three
>> >> > > > > > > > > > > implementations of WritableByteChannel,
>> >> SeekableByteChannel
>> >> > > > > > > interfaces:
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > 1. Arrow data is stored in HDFS/tmpfs - this gives
>> me
>> >> ~30
>> >> > > Gbps
>> >> > > > > > > > > > performance
>> >> > > > > > > > > > > 2. Arrow data is stored in Crail/DRAM - this gives
>> me
>> >> > > ~35-36
>> >> > > > > Gbps
>> >> > > > > > > > > > > performance
>> >> > > > > > > > > > > 3. Arrow data is stored in on-heap byte[] - this
>> >> gives me
>> >> > > ~39
>> >> > > > > Gbps
>> >> > > > > > > > > > > performance
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > I think the order makes sense. To better understand
>> >> the
>> >> > > > > > > performance of
>> >> > > > > > > > > > > Arrow/Java we can focus on the option 3.
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > The key question I am trying to answer is "what
>> would
>> >> it
>> >> > > take
>> >> > > > > for
>> >> > > > > > > > > > > Arrow/Java to deliver 100+ Gbps of performance"?
>> Is it
>> >> > > > > possible? If
>> >> > > > > > > > > yes,
>> >> > > > > > > > > > > then what is missing/or mis-interpreted by me? If
>> >> not, then
>> >> > > > > where
>> >> > > > > > > is
>> >> > > > > > > > > the
>> >> > > > > > > > > > > performance lost? Does anyone have any performance
>> >> > > measurements
>> >> > > > > > > for C++
>> >> > > > > > > > > > > implementation? if they have seen/expect better
>> >> numbers.
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > As a next step, I will profile the read path of
>> >> Arrow/Java
>> >> > > for
>> >> > > > > the
>> >> > > > > > > > > option
>> >> > > > > > > > > > > 3. I will report my findings.
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > Any thoughts and feedback on this investigation are
>> >> very
>> >> > > > > welcome.
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > Cheers,
>> >> > > > > > > > > > > --
>> >> > > > > > > > > > > Animesh
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > PS~ Cross-posting on the dev@crail.apache.org
>> list as
>> >> > > well.
>> >> > > > > > > > > >
>> >> > > > > > > > >
>> >> > > > > > >
>> >> > > > >
>> >> > >
>> >>
>> >
>>
>>

Re: [JAVA] Arrow performance measurement

Posted by Animesh Trivedi <an...@gmail.com>.
Hi all,

Apologies for the silence, I have been occupied. Here is the part 3 of the
investigation. This time, I compared the performance of Java and C++
readers. The benchmark is read and checksum 10 billion integers (~40GB of
data, with some null values) on a single core. You can read the full
investigation here:
https://github.com/animeshtrivedi/blog/blob/master/post/2018-11-22-arrow-cpp.md.
The best numbers that I have are (see the table at the end of the blog for
more details):

C++ (file)    : 21.5 Gbps
C++ (mmap)    : 30.4 Gbps
Java (unsafe) : 12.48 Gbps

The key insights are:

* C++ code (with file I/O) is generally 2x faster than the Java code. I
have narrowed down the profiling to JITing issues, memory management, etc.
Most of these issues are independent of Arrow (as far as I can tell). The
standalone benchmark to debug these issues is here :
https://github.com/animeshtrivedi/java-cpp-fun. The pattern of the
workloads is - (1) you have a large integer and bitmap array; (2) go over
the array, check for NULL, and sum it. In general, C++ code is 2x faster
than the Java code. Generally I understand why Java code is slow, but how
to make it faster of comparable to C++ code is what I want to know. Any
input on it is appreciated.

* C++ code has a no-null values optimization branch on the IsValid() check.
Java code can benefit from it too, but does not have this implemented. I
will open a pull request for this.

* C++ bitmap code can be optimized further by using the unsigned integers
than "int64_t" for bitmap checks, and eliminating the kBitmap. See here
https://godbolt.org/z/deq0_q - compare the size of the assembly code. And
the performance measurements in the blog show up to 50% performance gains.
Alternatively if signed to unsigned upgrade is not possible (perhaps in
every language), then in the C++ code, we should use the bitmap operations
directory ( `<<3` for division by 8, and ` & 0x7` for modulo by 8
operation), instead of `/` and `%`.

Thanks,
--
Animesh


On Thu, Oct 11, 2018 at 6:55 PM Animesh Trivedi <an...@gmail.com>
wrote:

> Hi Li - thanks for setting up the jiras. I'll prepare the pull requests
> soon.
>
> Hi Wes, I think that is a good idea to write a blog about performance,
> with pointing to the improvements in the code base (fixes and
> microbenchmarks). I can prepare a first draft as we go along. This will
> definitely serve well to the large java community.
>
> Cheers,
> --
> Animesh
>
>
> On Thu, 11 Oct 2018, 17:55 Li Jin, <ic...@gmail.com> wrote:
>
>> I have created these as the first step. Animesh, feel free to submit PR
>> for
>> these. I will look into your micro benchmarks soon.
>>
>>
>>
>>    1. [image: Improvement] ARROW-3497[Java] Add user documentation for
>>    achieving better performance
>>    <https://jira.apache.org/jira/browse/ARROW-3497>
>>    2. [image: Improvement] ARROW-3496[Java] Add microbenchmark code to
>> Java
>>    <https://jira.apache.org/jira/browse/ARROW-3496>
>>    3. [image: Improvement] ARROW-3495[Java] Optimize bit operations
>>    performance <https://jira.apache.org/jira/browse/ARROW-3495>
>>    4. [image: Improvement] ARROW-3493[Java] Document
>> BOUNDS_CHECKING_ENABLED
>>    <https://jira.apache.org/jira/browse/ARROW-3493>
>>
>>
>> On Thu, Oct 11, 2018 at 10:00 AM Li Jin <ic...@gmail.com> wrote:
>>
>> > Hi Wes and Animesh,
>> >
>> > Thanks for the analysis and discussion. I am happy to looking into
>> this. I
>> > will create some Jiras soon.
>> >
>> > Li
>> >
>> > On Thu, Oct 11, 2018 at 5:49 AM Wes McKinney <we...@gmail.com>
>> wrote:
>> >
>> >> hey Animesh,
>> >>
>> >> Thank you for doing this analysis. If you'd like to share some of the
>> >> analysis more broadly e.g. on the Apache Arrow blog or social media,
>> >> let us know.
>> >>
>> >> Seems like there might be a few follow ups here for the Arrow Java
>> >> community:
>> >>
>> >> * Documentation about achieving better performance
>> >> * Writing some microperformance benchmarks
>> >> * Making some improvements to the code to facilitate better performance
>> >>
>> >> Feel free to create some JIRA issues. Are any Java developers
>> >> interested in digging a little more into these issues?
>> >>
>> >> Thanks,
>> >> Wes
>> >> On Tue, Oct 9, 2018 at 7:18 AM Animesh Trivedi
>> >> <an...@gmail.com> wrote:
>> >> >
>> >> > Hi Wes and all,
>> >> >
>> >> > Here is another round of updates:
>> >> >
>> >> > Quick recap - previously we established that for 1kB binary blobs,
>> Arrow
>> >> > can deliver > 160 Gbps performance from in-memory buffers.
>> >> >
>> >> > In this round I looked at the performance of materializing
>> "integers".
>> >> In
>> >> > my benchmarks, I found that with careful
>> optimizations/code-rewriting we
>> >> > can push the performance of integer reading from 5.42 Gbps/core to
>> 13.61
>> >> > Gbps/core (~2.5x). The peak performance with 16 cores, scale up to
>> 110+
>> >> > Gbps. Key things to do is:
>> >> >
>> >> > 1) Disable memory access checks in Arrow and Netty buffers. This gave
>> >> > significant performance boost. However, for such an important
>> >> performance
>> >> > flag, it is very poorly documented
>> >> > ("drill.enable_unsafe_memory_access=true").
>> >> >
>> >> > 2) Materialize values from Validity and Value direct buffers instead
>> of
>> >> > calling getInt() function on the IntVector. This is implemented as a
>> new
>> >> > Unsafe reader type (
>> >> >
>> >>
>> https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/ArrowReaderUnsafe.java#L31
>> >> > )
>> >> >
>> >> > 3) Optimize bitmap operation to check if a bit is set or not (
>> >> >
>> >>
>> https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/ArrowReaderUnsafe.java#L23
>> >> > )
>> >> >
>> >> > A detailed write up of these steps is available here:
>> >> >
>> >>
>> https://github.com/animeshtrivedi/blog/blob/master/post/2018-10-09-arrow-int.md
>> >> >
>> >> > I have 2 follow-up questions:
>> >> >
>> >> > 1) Regarding the `isSet` function, why does it has to calculate
>> number
>> >> of
>> >> > bits set? (
>> >> >
>> >>
>> https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java#L797
>> >> ).
>> >> > Wouldn't just checking if the result of the AND operation is zero or
>> >> not be
>> >> > sufficient? Like what I did :
>> >> >
>> >>
>> https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/ArrowReaderUnsafe.java#L28
>> >> >
>> >> >
>> >> > 2) What is the reason behind this bitmap generation optimization here
>> >> >
>> >>
>> https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/BitVectorHelper.java#L179
>> >> > ? At this point when this function is called, the bitmap vector is
>> >> already
>> >> > read from the storage, and contains the right values (either all
>> null,
>> >> all
>> >> > set, or whatever). Generating this mask here for the special cases
>> when
>> >> the
>> >> > values are all NULL or all set (this was the case in my benchmark),
>> can
>> >> be
>> >> > slower than just returning what one has read from the storage.
>> >> >
>> >> > Collectively optimizing these two bitmap operations give more than 1
>> >> Gbps
>> >> > gains in my bench-marking code.
>> >> >
>> >> > Cheers,
>> >> > --
>> >> > Animesh
>> >> >
>> >> >
>> >> > On Thu, Oct 4, 2018 at 12:52 PM Wes McKinney <we...@gmail.com>
>> >> wrote:
>> >> >
>> >> > > See e.g.
>> >> > >
>> >> > >
>> >> > >
>> >>
>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/ipc-read-write-test.cc#L222
>> >> > >
>> >> > >
>> >> > > On Thu, Oct 4, 2018 at 6:48 AM Animesh Trivedi
>> >> > > <an...@gmail.com> wrote:
>> >> > > >
>> >> > > > Primarily write the same microbenchmark as I have in Java in C++
>> for
>> >> > > table
>> >> > > > reading and value materialization. So just an example of
>> equivalent
>> >> > > > ArrowFileReader example code in C++. Unit tests are a good
>> starting
>> >> > > point,
>> >> > > > thanks for the tip :)
>> >> > > >
>> >> > > > On Thu, Oct 4, 2018 at 12:39 PM Wes McKinney <
>> wesmckinn@gmail.com>
>> >> > > wrote:
>> >> > > >
>> >> > > > > > 3. Are there examples of Arrow in C++ read/write code that I
>> can
>> >> > > have a
>> >> > > > > look?
>> >> > > > >
>> >> > > > > What kind of code are you looking for? I would direct you to
>> >> relevant
>> >> > > > > unit tests that exhibit certain functionality, but it depends
>> on
>> >> what
>> >> > > > > you are trying to do
>> >> > > > > On Wed, Oct 3, 2018 at 9:45 AM Animesh Trivedi
>> >> > > > > <an...@gmail.com> wrote:
>> >> > > > > >
>> >> > > > > > Hi all - quick update on the performance investigation:
>> >> > > > > >
>> >> > > > > > - I spent some time looking at performance profile for a
>> binary
>> >> blob
>> >> > > > > column
>> >> > > > > > (1024 bytes of byte[]) and found a few favorable settings for
>> >> > > delivering
>> >> > > > > up
>> >> > > > > > to 168 Gbps from in-memory reading benchmark on 16 cores.
>> These
>> >> > > settings
>> >> > > > > > (NUMA, JVM settings, Arrow holder API, and batch size, etc.)
>> are
>> >> > > > > documented
>> >> > > > > > here:
>> >> > > > > >
>> >> > > > >
>> >> > >
>> >>
>> https://github.com/animeshtrivedi/blog/blob/master/post/2018-10-03-arrow-binary.md
>> >> > > > > > - these settings also help to improve the last number that I
>> >> > > > > > reported (but
>> >> > > > > > not by much) for the in-memory TPC-DS store_sales table from
>> ~39
>> >> > > Gbps up
>> >> > > > > to
>> >> > > > > > ~45-47 Gbps (note: this number is just in-memory benchmark,
>> >> i.e.,
>> >> > > w/o any
>> >> > > > > > networking or storage links)
>> >> > > > > >
>> >> > > > > > A few follow up questions that I have:
>> >> > > > > > 1. Arrow reads a batch size worth of data in one go. Are there
>> >> > > > > > any recommended batch sizes? In my investigation, small batch
>> >> > > > > > sizes help with a better cache profile but increase the number
>> >> > > > > > of instructions required (more looping). Larger ones do the
>> >> > > > > > opposite. Somehow ~10MB/thread seems to be the best performing
>> >> > > > > > configuration, which is also a bit counter-intuitive, as for 16
>> >> > > > > > threads this will lead to a 160 MB memory footprint. Maybe this
>> >> > > > > > is also tied to the memory management logic, which is my next
>> >> > > > > > question.
>> >> > > > > > 2. Arrow uses Netty's memory manager. (i) What are decent Netty
>> >> > > > > > memory management settings for the "io.netty.allocator.*"
>> >> > > > > > parameters? I don't find any decent write-up on them; (ii) is
>> >> > > > > > there a provision for an ArrowBuf being re-used once a batch is
>> >> > > > > > consumed? As it looks for now, each read allocates a new buffer
>> >> > > > > > for the whole batch.
>> >> > > > > > 3. Are there examples of Arrow in C++ read/write code that I
>> can
>> >> > > have a
>> >> > > > > > look?
>> >> > > > > >
>> >> > > > > > Cheers,
>> >> > > > > > --
>> >> > > > > > Animesh
>> >> > > > > >
>> >> > > > > >
>> >> > > > > > On Wed, Sep 19, 2018 at 8:49 PM Wes McKinney <
>> >> wesmckinn@gmail.com>
>> >> > > > > wrote:
>> >> > > > > >
>> >> > > > > > > On Wed, Sep 19, 2018 at 2:13 PM Animesh Trivedi
>> >> > > > > > > <an...@gmail.com> wrote:
>> >> > > > > > > >
>> >> > > > > > > > Hi Johan, Wes, and Jacques - many thanks for your
>> comments:
>> >> > > > > > > >
>> >> > > > > > > > @Johan -
>> >> > > > > > > > 1. I also do not suspect that there is any inherent
>> >> drawback in
>> >> > > Java
>> >> > > > > or
>> >> > > > > > > C++
>> >> > > > > > > > due to the Arrow format. I mentioned C++ because Wes
>> >> pointed out
>> >> > > that
>> >> > > > > > > Java
>> >> > > > > > > > routines are not the most optimized ones (yet!). And
>> >> naturally
>> >> > > one
>> >> > > > > would
>> >> > > > > > > > expect better performance in a native language with all
>> >> > > > > > > pointer/memory/SIMD
>> >> > > > > > > > instruction optimizations that you mentioned. As far as I
>> >> know,
>> >> > > the
>> >> > > > > > > > off-heap buffers are managed in ArrowBuf which
>> implements an
>> >> > > abstract
>> >> > > > > > > netty
>> >> > > > > > > > class. But there is nothing unusual, i.e., netty
>> specific,
>> >> about
>> >> > > > > these
>> >> > > > > > > > unsafe routines, they are used by many projects. Though
>> >> there is
>> >> > > cost
>> >> > > > > > > > associated with materializing on-heap Java values from
>> >> off-heap
>> >> > > > > memory
>> >> > > > > > > > regions. I need to benchmark that more carefully.
>> >> > > > > > > >
>> >> > > > > > > > 2. When you say "I've so far always been able to get
>> similar
>> >> > > > > performance
>> >> > > > > > > > numbers" - do you mean the same performance of my case 3
>> >> where 16
>> >> > > > > cores
>> >> > > > > > > > drive close to 40 Gbps, or the same performance between
>> >> your C++
>> >> > > and
>> >> > > > > Java
>> >> > > > > > > > benchmarks. Do you have some write-up? I would be
>> >> interested to
>> >> > > read
>> >> > > > > up
>> >> > > > > > > :)
>> >> > > > > > > >
>> >> > > > > > > > 3. "Can you get to 100 Gbps starting from primitive
>> arrays
>> >> in
>> >> > > Java"
>> >> > > > > ->
>> >> > > > > > > that
>> >> > > > > > > > is a good idea. Let me try and report back.
>> >> > > > > > > >
>> >> > > > > > > > @Wes -
>> >> > > > > > > >
>> >> > > > > > > > Is there some benchmark template for C++ routines I can
>> >> have a
>> >> > > look?
>> >> > > > > I
>> >> > > > > > > > would be happy to get some input from Java-Arrow experts
>> on
>> >> how
>> >> > > to
>> >> > > > > write
>> >> > > > > > > > these benchmarks more efficiently. I will have a closer
>> >> look at
>> >> > > the
>> >> > > > > JIRA
>> >> > > > > > > > tickets that you mentioned.
>> >> > > > > > > >
>> >> > > > > > > > So, for now I am focusing on the case 3, which is about
>> >> > > establishing
>> >> > > > > > > > performance when reading from a local in-memory I/O
>> stream
>> >> that I
>> >> > > > > > > > implemented (
>> >> > > > > > > >
>> >> > > > > > >
>> >> > > > >
>> >> > >
>> >>
>> https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/MemoryIOChannel.java
>> >> > > > > > > ).
>> >> > > > > > > > In this case I first read data from parquet files,
>> convert
>> >> them
>> >> > > into
>> >> > > > > > > Arrow,
>> >> > > > > > > > and write-out to a MemoryIOChannel, and then read back
>> from
>> >> it.
>> >> > > So,
>> >> > > > > the
>> >> > > > > > > > performance has nothing to do with Crail or HDFS in the
>> >> case 3.
>> >> > > > > Once, I
>> >> > > > > > > > establish the base performance in this setup (which is
>> >> around ~40
>> >> > > > > Gbps
>> >> > > > > > > with
>> >> > > > > > > > 16 cores) I will add Crail to the mix. Perhaps Crail I/O
>> >> streams
>> >> > > can
>> >> > > > > take
>> >> > > > > > > > ArrowBuf as src/dst buffers. That should be doable.
>> >> > > > > > >
>> >> > > > > > > Right, in any case what you are testing is the performance
>> of
>> >> > > using a
>> >> > > > > > > particular Java accessor layer to JVM off-heap Arrow memory
>> >> to sum
>> >> > > the
>> >> > > > > > > non-null values of each column. I'm not sure that a single
>> >> > > bandwidth
>> >> > > > > > > number produced by this benchmark is very informative for
>> >> people
>> >> > > > > > > contemplating what memory format to use in their system due
>> >> to the
>> >> > > > > > > current state of the implementation (Java) and workload
>> >> measured
>> >> > > > > > > (summing the non-null values with a naive algorithm). I
>> would
>> >> guess
>> >> > > > > > > that a C++ version with raw pointers and a loop-unrolled,
>> >> > > branch-free
>> >> > > > > > > vectorized sum is going to be a lot faster.
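The branch-free idea can be sketched even in plain Java terms, although the
gains Wes describes would come from doing it in C++ with raw pointers and
SIMD (hypothetical helper, with on-heap arrays standing in for the Arrow
buffers):

// multiply each value by its validity bit (0 or 1) instead of branching
static long branchFreeSum(int[] values, byte[] validityBitmap) {
    long sum = 0;
    for (int i = 0; i < values.length; i++) {
        long bit = (validityBitmap[i >> 3] >> (i & 7)) & 1L;
        sum += bit * values[i];
    }
    return sum;
}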
>> >> > > > > > >
>> >> > > > > > > >
>> >> > > > > > > > @Jacques -
>> >> > > > > > > >
>> >> > > > > > > > That is a good point that "Arrow's implementation is more
>> >> > > focused on
>> >> > > > > > > > interacting with the structure than transporting it".
>> >> However,
>> >> > > in any
>> >> > > > > > > > distributed system one needs to move data/structure
>> around
>> >> - as
>> >> > > far
>> >> > > > > as I
>> >> > > > > > > > understand that is another goal of the project. My
>> >> investigation
>> >> > > > > started
>> >> > > > > > > > within the context of Spark/SQL data processing. Spark
>> >> converts
>> >> > > > > incoming
>> >> > > > > > > > data into its own in-memory UnsafeRow representation for
>> >> > > processing.
>> >> > > > > So
>> >> > > > > > > > naturally the performance of this data ingestion pipeline
>> >> cannot
>> >> > > > > > > outperform
>> >> > > > > > > > the read performance of the used file format. I
>> benchmarked
>> >> > > Parquet,
>> >> > > > > ORC,
>> >> > > > > > > > Avro, JSON (for the specific TPC-DS store_sales table).
>> And
>> >> then
>> >> > > > > > > curiously
>> >> > > > > > > > benchmarked Arrow as well because its design choices are
>> a
>> >> better
>> >> > > > > fit for
>> >> > > > > > > > modern high-performance RDMA/NVMe/100+Gbps devices I am
>> >> > > > > investigating.
>> >> > > > > > > From
>> >> > > > > > > > this point of view, I am trying to find out can Arrow be
>> >> the file
>> >> > > > > format
>> >> > > > > > > > for the next generation of storage/networking devices
>> (see
>> >> Apache
>> >> > > > > Crail
>> >> > > > > > > > project) delivering close to the hardware speed
>> >> reading/writing
>> >> > > > > rates. As
>> >> > > > > > > > Wes pointed out that a C++ library implementation should
>> be
>> >> > > memory-IO
>> >> > > > > > > > bound, so what would it take to deliver the same
>> >> performance in
>> >> > > Java
>> >> > > > > ;)
>> >> > > > > > > > (and then, from across the network).
>> >> > > > > > > >
>> >> > > > > > > > I hope this makes sense.
>> >> > > > > > > >
>> >> > > > > > > > Cheers,
>> >> > > > > > > > --
>> >> > > > > > > > Animesh
>> >> > > > > > > >
>> >> > > > > > > > On Wed, Sep 19, 2018 at 6:28 PM Jacques Nadeau <
>> >> > > jacques@apache.org>
>> >> > > > > > > wrote:
>> >> > > > > > > >
>> >> > > > > > > > > My big question is what is the use case and how/what
>> are
>> >> you
>> >> > > > > trying to
>> >> > > > > > > > > compare? Arrow's implementation is more focused on
>> >> interacting
>> >> > > > > with the
>> >> > > > > > > > > structure than transporting it. Generally speaking,
>> when
>> >> we're
>> >> > > > > working
>> >> > > > > > > with
>> >> > > > > > > > > Arrow data we frequently are just interacting with
>> memory
>> >> > > > > locations and
>> >> > > > > > > > > doing direct operations. If you have a layer that
>> >> supports that
>> >> > > > > type of
>> >> > > > > > > > > semantic, create a movement technique that depends on
>> >> that.
>> >> > > Arrow
>> >> > > > > > > doesn't
>> >> > > > > > > > > force a particular API since the data itself is defined
>> >> by its
>> >> > > > > > > in-memory
>> >> > > > > > > > > layout so if you have a custom use or pattern, just
>> work
>> >> with
>> >> > > the
>> >> > > > > > > in-memory
>> >> > > > > > > > > structures.
>> >> > > > > > > > >
>> >> > > > > > > > >
>> >> > > > > > > > >
>> >> > > > > > > > > On Wed, Sep 19, 2018 at 7:49 AM Wes McKinney <
>> >> > > wesmckinn@gmail.com>
>> >> > > > > > > wrote:
>> >> > > > > > > > >
>> >> > > > > > > > > > hi Animesh,
>> >> > > > > > > > > >
>> >> > > > > > > > > > Per Johan's comments, the C++ library is essentially
>> >> going
>> >> > > to be
>> >> > > > > > > > > > IO/memory bandwidth bound since you're interacting
>> with
>> >> raw
>> >> > > > > pointers.
>> >> > > > > > > > > >
>> >> > > > > > > > > > I'm looking at your code
>> >> > > > > > > > > >
>> >> > > > > > > > > > private void consumeFloat4(FieldVector fv) {
>> >> > > > > > > > > >     Float4Vector accessor = (Float4Vector) fv;
>> >> > > > > > > > > >     int valCount = accessor.getValueCount();
>> >> > > > > > > > > >     for(int i = 0; i < valCount; i++){
>> >> > > > > > > > > >         if(!accessor.isNull(i)){
>> >> > > > > > > > > >             float4Count+=1;
>> >> > > > > > > > > >             checksum+=accessor.get(i);
>> >> > > > > > > > > >         }
>> >> > > > > > > > > >     }
>> >> > > > > > > > > > }
>> >> > > > > > > > > >
>> >> > > > > > > > > > You'll want to get a Java-Arrow expert from Dremio to
>> >> advise
>> >> > > you
>> >> > > > > the
>> >> > > > > > > > > > fastest way to iterate over this data -- my
>> >> understanding is
>> >> > > that
>> >> > > > > > > much
>> >> > > > > > > > > > code in Dremio interacts with the wrapped Netty
>> ArrowBuf
>> >> > > objects
>> >> > > > > > > > > > rather than going through the higher level APIs.
>> You're
>> >> also
>> >> > > > > dropping
>> >> > > > > > > > > > performance because memory mapping is not yet
>> >> implemented in
>> >> > > > > Java,
>> >> > > > > > > see
>> >> > > > > > > > > > https://issues.apache.org/jira/browse/ARROW-3191.
>> >> > > > > > > > > >
>> >> > > > > > > > > > Furthermore, the IPC reader class you are using could
>> >> be made
>> >> > > > > more
>> >> > > > > > > > > > efficient. I described the problem in
>> >> > > > > > > > > > https://issues.apache.org/jira/browse/ARROW-3192 --
>> >> this
>> >> > > will be
>> >> > > > > > > > > > required as soon as we have the ability to do memory
>> >> mapping
>> >> > > in
>> >> > > > > Java
>> >> > > > > > > > > >
>> >> > > > > > > > > > Could Crail use the Arrow data structures in its
>> runtime
>> >> > > rather
>> >> > > > > than
>> >> > > > > > > > > > copying? If not, how are Crail's runtime data
>> structures
>> >> > > > > different?
>> >> > > > > > > > > >
>> >> > > > > > > > > > - Wes
>> >> > > > > > > > > > On Wed, Sep 19, 2018 at 9:19 AM Johan Peltenburg -
>> EWI
>> >> > > > > > > > > > <J....@tudelft.nl> wrote:
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > Hello Animesh,
>> >> > > > > > > > > > >
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > I browsed a bit in your sources, thanks for
>> sharing.
>> >> We
>> >> > > have
>> >> > > > > > > performed
>> >> > > > > > > > > > some similar measurements to your third case in the
>> >> past for
>> >> > > > > C/C++ on
>> >> > > > > > > > > > collections of various basic types such as primitives
>> >> and
>> >> > > > > strings.
>> >> > > > > > > > > > >
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > I can say that in terms of consuming data from the
>> >> Arrow
>> >> > > format
>> >> > > > > > > versus
>> >> > > > > > > > > > language native collections in C++, I've so far
>> always
>> >> been
>> >> > > able
>> >> > > > > to
>> >> > > > > > > get
>> >> > > > > > > > > > similar performance numbers (e.g. no drawbacks due to
>> >> the
>> >> > > Arrow
>> >> > > > > > > format
>> >> > > > > > > > > > itself). Especially when accessing the data through
>> >> Arrow's
>> >> > > raw
>> >> > > > > data
>> >> > > > > > > > > > pointers (and using for example std::string_view-like
>> >> > > > > constructs).
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > In C/C++ the fast data structures are engineered in
>> >> such a
>> >> > > way
>> >> > > > > > > that as
>> >> > > > > > > > > > little pointer traversals are required and they take
>> up
>> >> an as
>> >> > > > > small
>> >> > > > > > > as
>> >> > > > > > > > > > possible memory footprint. Thus each memory access is
>> >> > > relatively
>> >> > > > > > > > > efficient
>> >> > > > > > > > > > (in terms of obtaining the data of interest). The
>> same
>> >> can
>> >> > > > > > > absolutely be
>> >> > > > > > > > > > said for Arrow, if not even more efficient in some
>> cases
>> >> > > where
>> >> > > > > object
>> >> > > > > > > > > > fields are of variable length.
>> >> > > > > > > > > > >
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > In the JVM case, the Arrow data is stored off-heap.
>> >> This
>> >> > > > > requires
>> >> > > > > > > the
>> >> > > > > > > > > > JVM to interface to it through some calls to Unsafe
>> >> hidden
>> >> > > under
>> >> > > > > the
>> >> > > > > > > > > Netty
>> >> > > > > > > > > > layer (but please correct me if I'm wrong, I'm not an
>> >> expert
>> >> > > on
>> >> > > > > > > this).
>> >> > > > > > > > > > Those calls are the only reason I can think of that
>> >> would
>> >> > > > > degrade the
>> >> > > > > > > > > > performance a bit compared to a pure Java case. I
>> don't
>> >> know
>> >> > > if
>> >> > > > > the
>> >> > > > > > > > > Unsafe
>> >> > > > > > > > > > calls are inlined during JIT compilation. If they
>> >> aren't they
>> >> > > > > will
>> >> > > > > > > > > increase
>> >> > > > > > > > > > access latency to any data a little bit.
>> >> > > > > > > > > > >
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > I don't have a similar machine so it's not easy to
>> >> relate
>> >> > > my
>> >> > > > > > > numbers to
>> >> > > > > > > > > > yours, but if you can get that data consumed with 100
>> >> Gbps
>> >> > > in a
>> >> > > > > pure
>> >> > > > > > > Java
>> >> > > > > > > > > > case, I don't see any reason (resulting from Arrow
>> >> format /
>> >> > > > > off-heap
>> >> > > > > > > > > > storage) why you wouldn't be able to get at least
>> really
>> >> > > close.
>> >> > > > > Can
>> >> > > > > > > you
>> >> > > > > > > > > get
>> >> > > > > > > > > > to 100 Gbps starting from primitive arrays in Java
>> with
>> >> your
>> >> > > > > > > consumption
>> >> > > > > > > > > > functions in the first place?
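A minimal sketch of the pure-Java baseline being asked about (plain on-heap
arrays standing in for the Arrow buffers; names are illustrative):

static long sumPrimitive(int[] values, boolean[] isNull) {
    long sum = 0;
    for (int i = 0; i < values.length; i++) {
        if (!isNull[i]) {
            sum += values[i];
        }
    }
    return sum;
}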
>> >> > > > > > > > > > >
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > I'm interested to see your progress on this.
>> >> > > > > > > > > > >
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > Kind regards,
>> >> > > > > > > > > > >
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > Johan Peltenburg
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > ________________________________
>> >> > > > > > > > > > > From: Animesh Trivedi <an...@gmail.com>
>> >> > > > > > > > > > > Sent: Wednesday, September 19, 2018 2:08:50 PM
>> >> > > > > > > > > > > To: dev@arrow.apache.org; dev@crail.apache.org
>> >> > > > > > > > > > > Subject: [JAVA] Arrow performance measurement
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > Hi all,
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > A week ago, Wes and I had a discussion about the
>> >> > > performance
>> >> > > > > of the
>> >> > > > > > > > > > > Arrow/Java implementation on the Apache Crail
>> >> (Incubating)
>> >> > > > > mailing
>> >> > > > > > > > > list (
>> >> > > > > > > > > > >
>> >> > > > > > >
>> >> > >
>> >> http://mail-archives.apache.org/mod_mbox/crail-dev/201809.mbox/browser
>> >> > > > > > > > > ).
>> >> > > > > > > > > > In
>> >> > > > > > > > > > > a nutshell: I am investigating the performance of
>> >> various
>> >> > > file
>> >> > > > > > > formats
>> >> > > > > > > > > > > (including Arrow) on high-performance NVMe and
>> >> > > > > RDMA/100Gbps/RoCE
>> >> > > > > > > > > setups.
>> >> > > > > > > > > > I
>> >> > > > > > > > > > > benchmarked how long does it take to materialize
>> >> values
>> >> > > (ints,
>> >> > > > > > > longs,
>> >> > > > > > > > > > > doubles) of the store_sales table, the largest
>> table
>> >> in the
>> >> > > > > TPC-DS
>> >> > > > > > > > > > dataset
>> >> > > > > > > > > > > stored on different file formats. Here is a
>> write-up
>> >> on
>> >> > > this -
>> >> > > > > > > > > > >
>> >> > > https://crail.incubator.apache.org/blog/2018/08/sql-p1.html. I
>> >> > > > > > > found
>> >> > > > > > > > > > that
>> >> > > > > > > > > > > between a pair of machine connected over a 100 Gbps
>> >> link,
>> >> > > Arrow
>> >> > > > > > > (using
>> >> > > > > > > > > > as a
>> >> > > > > > > > > > > file format on HDFS) delivered close to ~30 Gbps of
>> >> > > bandwidth
>> >> > > > > with
>> >> > > > > > > all
>> >> > > > > > > > > 16
>> >> > > > > > > > > > > cores engaged. Wes pointed out that (i) Arrow is
>> >> in-memory
>> >> > > IPC
>> >> > > > > > > format,
>> >> > > > > > > > > > and
>> >> > > > > > > > > > > has not been optimized for storage interfaces/APIs
>> >> like
>> >> > > HDFS;
>> >> > > > > (ii)
>> >> > > > > > > the
>> >> > > > > > > > > > > performance I am measuring is for the java
>> >> implementation.
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > Wes, I hope I summarized our discussion correctly.
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > That brings us to this email where I promised to
>> >> follow up
>> >> > > on
>> >> > > > > the
>> >> > > > > > > Arrow
>> >> > > > > > > > > > > mailing list to understand and optimize the
>> >> performance of
>> >> > > > > > > Arrow/Java
>> >> > > > > > > > > > > implementation on high-performance devices. I
>> wrote a
>> >> small
>> >> > > > > > > stand-alone
>> >> > > > > > > > > > > benchmark (
>> >> > > > > https://github.com/animeshtrivedi/benchmarking-arrow)
>> >> > > > > > > with
>> >> > > > > > > > > > three
>> >> > > > > > > > > > > implementations of WritableByteChannel,
>> >> SeekableByteChannel
>> >> > > > > > > interfaces:
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > 1. Arrow data is stored in HDFS/tmpfs - this gives
>> me
>> >> ~30
>> >> > > Gbps
>> >> > > > > > > > > > performance
>> >> > > > > > > > > > > 2. Arrow data is stored in Crail/DRAM - this gives
>> me
>> >> > > ~35-36
>> >> > > > > Gbps
>> >> > > > > > > > > > > performance
>> >> > > > > > > > > > > 3. Arrow data is stored in on-heap byte[] - this
>> >> gives me
>> >> > > ~39
>> >> > > > > Gbps
>> >> > > > > > > > > > > performance
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > I think the order makes sense. To better understand
>> >> the
>> >> > > > > > > performance of
>> >> > > > > > > > > > > Arrow/Java we can focus on the option 3.
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > The key question I am trying to answer is "what
>> would
>> >> it
>> >> > > take
>> >> > > > > for
>> >> > > > > > > > > > > Arrow/Java to deliver 100+ Gbps of performance"?
>> Is it
>> >> > > > > possible? If
>> >> > > > > > > > > yes,
>> >> > > > > > > > > > > then what is missing/or mis-interpreted by me? If
>> >> not, then
>> >> > > > > where
>> >> > > > > > > is
>> >> > > > > > > > > the
>> >> > > > > > > > > > > performance lost? Does anyone have any performance
>> >> > > measurements
>> >> > > > > > > for C++
>> >> > > > > > > > > > > implementation? if they have seen/expect better
>> >> numbers.
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > As a next step, I will profile the read path of
>> >> Arrow/Java
>> >> > > for
>> >> > > > > the
>> >> > > > > > > > > option
>> >> > > > > > > > > > > 3. I will report my findings.
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > Any thoughts and feedback on this investigation are
>> >> very
>> >> > > > > welcome.
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > Cheers,
>> >> > > > > > > > > > > --
>> >> > > > > > > > > > > Animesh
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > PS~ Cross-posting on the dev@crail.apache.org
>> list as
>> >> > > well.
>> >> > > > > > > > > >
>> >> > > > > > > > >
>> >> > > > > > >
>> >> > > > >
>> >> > >
>> >>
>> >
>>
>>

Re: [JAVA] Arrow performance measurement

Posted by Animesh Trivedi <an...@gmail.com>.
Hi Li - thanks for setting up the JIRAs. I'll prepare the pull requests
soon.

Hi Wes, I think that is a good idea to write a blog post about performance,
pointing to the improvements in the code base (fixes and microbenchmarks).
I can prepare a first draft as we go along. This will definitely serve the
large Java community well.

Cheers,
--
Animesh


On Thu, 11 Oct 2018, 17:55 Li Jin, <ic...@gmail.com> wrote:

> I have created these as the first step. Animesh, feel free to submit PR for
> these. I will look into your micro benchmarks soon.
>
>
>
>    1. ARROW-3497 [Java] Add user documentation for achieving better
>       performance <https://jira.apache.org/jira/browse/ARROW-3497>
>    2. ARROW-3496 [Java] Add microbenchmark code to Java
>       <https://jira.apache.org/jira/browse/ARROW-3496>
>    3. ARROW-3495 [Java] Optimize bit operations performance
>       <https://jira.apache.org/jira/browse/ARROW-3495>
>    4. ARROW-3493 [Java] Document BOUNDS_CHECKING_ENABLED
>       <https://jira.apache.org/jira/browse/ARROW-3493>
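For ARROW-3496, a skeleton along these lines could be a starting point (a
hedged sketch assuming JMH as the harness; class and method names are
illustrative, not taken from the Arrow tree):

import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.TearDown;

@State(Scope.Thread)
public class IntVectorReadBench {
    private static final int N = 1 << 20;
    private RootAllocator allocator;
    private IntVector vector;

    @Setup
    public void setup() {
        allocator = new RootAllocator(Long.MAX_VALUE);
        vector = new IntVector("ints", allocator);
        vector.allocateNew(N);
        for (int i = 0; i < N; i++) {
            vector.setSafe(i, i);
        }
        vector.setValueCount(N);
    }

    @TearDown
    public void tearDown() {
        vector.close();
        allocator.close();
    }

    // Baseline: materialize through the vector API, skipping nulls.
    @Benchmark
    public long sumWithGet() {
        long sum = 0;
        for (int i = 0; i < N; i++) {
            if (!vector.isNull(i)) {
                sum += vector.get(i);
            }
        }
        return sum;
    }
}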
>
>
> On Thu, Oct 11, 2018 at 10:00 AM Li Jin <ic...@gmail.com> wrote:
>
> > Hi Wes and Animesh,
> >
> > Thanks for the analysis and discussion. I am happy to look into this. I
> > will create some JIRAs soon.
> >
> > Li
> >
> > On Thu, Oct 11, 2018 at 5:49 AM Wes McKinney <we...@gmail.com>
> wrote:
> >
> >> hey Animesh,
> >>
> >> Thank you for doing this analysis. If you'd like to share some of the
> >> analysis more broadly e.g. on the Apache Arrow blog or social media,
> >> let us know.
> >>
> >> Seems like there might be a few follow ups here for the Arrow Java
> >> community:
> >>
> >> * Documentation about achieving better performance
> >> * Writing some microperformance benchmarks
> >> * Making some improvements to the code to facilitate better performance
> >>
> >> Feel free to create some JIRA issues. Are any Java developers
> >> interested in digging a little more into these issues?
> >>
> >> Thanks,
> >> Wes
> >> On Tue, Oct 9, 2018 at 7:18 AM Animesh Trivedi
> >> <an...@gmail.com> wrote:
> >> >
> >> > Hi Wes and all,
> >> >
> >> > Here is another round of updates:
> >> >
> >> > Quick recap - previously we established that for 1kB binary blobs,
> Arrow
> >> > can deliver > 160 Gbps performance from in-memory buffers.
> >> >
> >> > In this round I looked at the performance of materializing "integers".
> >> In
> >> > my benchmarks, I found that with careful optimizations/code-rewriting
> we
> >> > can push the performance of integer reading from 5.42 Gbps/core to
> 13.61
> >> > Gbps/core (~2.5x). The peak performance with 16 cores scales up to
> >> > 110+ Gbps. The key things to do are:
> >> >
> >> > 1) Disable memory access checks in Arrow and Netty buffers. This gave
> >> > a significant performance boost. However, for such an important
> >> > performance flag, it is very poorly documented
> >> > ("drill.enable_unsafe_memory_access=true").
> >> >
> >> > 2) Materialize values from Validity and Value direct buffers instead
> of
> >> > calling getInt() function on the IntVector. This is implemented as a
> new
> >> > Unsafe reader type (
> >> >
> >>
> https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/ArrowReaderUnsafe.java#L31
> >> > )
> >> >
> >> > 3) Optimize bitmap operation to check if a bit is set or not (
> >> >
> >>
> https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/ArrowReaderUnsafe.java#L23
> >> > )
> >> >
> >> > A detailed write up of these steps is available here:
> >> >
> >>
> https://github.com/animeshtrivedi/blog/blob/master/post/2018-10-09-arrow-int.md
> >> >
> >> > I have 2 follow-up questions:
> >> >
> >> > 1) Regarding the `isSet` function, why does it have to calculate the
> >> > number of bits set? (
> >> >
> >>
> https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java#L797
> >> ).
> >> > Wouldn't just checking if the result of the AND operation is zero or
> >> not be
> >> > sufficient? Like what I did :
> >> >
> >>
> https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/ArrowReaderUnsafe.java#L28
> >> >
> >> >
> >> > 2) What is the reason behind this bitmap generation optimization here
> >> >
> >>
> https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/BitVectorHelper.java#L179
> >> > ? At this point when this function is called, the bitmap vector is
> >> already
> >> > read from the storage, and contains the right values (either all null,
> >> all
> >> > set, or whatever). Generating this mask here for the special cases
> when
> >> the
> >> > values are all NULL or all set (this was the case in my benchmark)
> >> > can be slower than just returning what one has read from the storage.
> >> >
> >> > Collectively, optimizing these two bitmap operations gives more than
> >> > 1 Gbps of gain in my benchmarking code.
> >> >
> >> > Cheers,
> >> > --
> >> > Animesh
> >> >
> >> >
> >> > On Thu, Oct 4, 2018 at 12:52 PM Wes McKinney <we...@gmail.com>
> >> wrote:
> >> >
> >> > > See e.g.
> >> > >
> >> > >
> >> > >
> >>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/ipc-read-write-test.cc#L222
> >> > >
> >> > >
> >> > > On Thu, Oct 4, 2018 at 6:48 AM Animesh Trivedi
> >> > > <an...@gmail.com> wrote:
> >> > > >
> >> > > > Primarily write the same microbenchmark as I have in Java in C++
> for
> >> > > table
> >> > > > reading and value materialization. So just an example of
> equivalent
> >> > > > ArrowFileReader example code in C++. Unit tests are a good
> starting
> >> > > point,
> >> > > > thanks for the tip :)
> >> > > >
> >> > > > On Thu, Oct 4, 2018 at 12:39 PM Wes McKinney <wesmckinn@gmail.com
> >
> >> > > wrote:
> >> > > >
> >> > > > > > 3. Are there examples of Arrow in C++ read/write code that I
> can
> >> > > have a
> >> > > > > look?
> >> > > > >
> >> > > > > What kind of code are you looking for? I would direct you to
> >> relevant
> >> > > > > unit tests that exhibit certain functionality, but it depends on
> >> what
> >> > > > > you are trying to do
> >> > > > > On Wed, Oct 3, 2018 at 9:45 AM Animesh Trivedi
> >> > > > > <an...@gmail.com> wrote:
> >> > > > > >
> >> > > > > > Hi all - quick update on the performance investigation:
> >> > > > > >
> >> > > > > > - I spent some time looking at performance profile for a
> binary
> >> blob
> >> > > > > column
> >> > > > > > (1024 bytes of byte[]) and found a few favorable settings for
> >> > > delivering
> >> > > > > up
> >> > > > > > to 168 Gbps from in-memory reading benchmark on 16 cores.
> These
> >> > > settings
> >> > > > > > (NUMA, JVM settings, Arrow holder API, and batch size, etc.)
> are
> >> > > > > documented
> >> > > > > > here:
> >> > > > > >
> >> > > > >
> >> > >
> >>
> https://github.com/animeshtrivedi/blog/blob/master/post/2018-10-03-arrow-binary.md
> >> > > > > > - these settings also help to improve the last number that I
> >> > > > > > reported (but
> >> > > > > > not by much) for the in-memory TPC-DS store_sales table from
> ~39
> >> > > Gbps up
> >> > > > > to
> >> > > > > > ~45-47 Gbps (note: this number is just in-memory benchmark,
> >> i.e.,
> >> > > w/o any
> >> > > > > > networking or storage links)
> >> > > > > >
> >> > > > > > A few follow up questions that I have:
> >> > > > > > 1. Arrow reads a batch size worth of data in one go. Are there
> >> > > > > > any recommended batch sizes? In my investigation, small batch
> >> > > > > > sizes help with a better cache profile but increase the number
> >> > > > > > of instructions required (more looping). Larger ones do the
> >> > > > > > opposite. Somehow ~10MB/thread seems to be the best performing
> >> > > > > > configuration, which is also a bit counter-intuitive, as for 16
> >> > > > > > threads this will lead to a 160 MB memory footprint. Maybe this
> >> > > > > > is also tied to the memory management logic, which is my next
> >> > > > > > question.
> >> > > > > > 2. Arrow uses Netty's memory manager. (i) What are decent Netty
> >> > > > > > memory management settings for the "io.netty.allocator.*"
> >> > > > > > parameters? I don't find any decent write-up on them; (ii) is
> >> > > > > > there a provision for an ArrowBuf being re-used once a batch is
> >> > > > > > consumed? As it looks for now, each read allocates a new buffer
> >> > > > > > for the whole batch.
> >> > > > > > 3. Are there examples of Arrow in C++ read/write code that I
> can
> >> > > have a
> >> > > > > > look?
> >> > > > > >
> >> > > > > > Cheers,
> >> > > > > > --
> >> > > > > > Animesh
> >> > > > > >
> >> > > > > >
> >> > > > > > On Wed, Sep 19, 2018 at 8:49 PM Wes McKinney <
> >> wesmckinn@gmail.com>
> >> > > > > wrote:
> >> > > > > >
> >> > > > > > > On Wed, Sep 19, 2018 at 2:13 PM Animesh Trivedi
> >> > > > > > > <an...@gmail.com> wrote:
> >> > > > > > > >
> >> > > > > > > > Hi Johan, Wes, and Jacques - many thanks for your
> comments:
> >> > > > > > > >
> >> > > > > > > > @Johan -
> >> > > > > > > > 1. I also do not suspect that there is any inherent
> >> drawback in
> >> > > Java
> >> > > > > or
> >> > > > > > > C++
> >> > > > > > > > due to the Arrow format. I mentioned C++ because Wes
> >> pointed out
> >> > > that
> >> > > > > > > Java
> >> > > > > > > > routines are not the most optimized ones (yet!). And
> >> naturally
> >> > > one
> >> > > > > would
> >> > > > > > > > expect better performance in a native language with all
> >> > > > > > > pointer/memory/SIMD
> >> > > > > > > > instruction optimizations that you mentioned. As far as I
> >> know,
> >> > > the
> >> > > > > > > > off-heap buffers are managed in ArrowBuf which implements
> an
> >> > > abstract
> >> > > > > > > netty
> >> > > > > > > > class. But there is nothing unusual, i.e., netty specific,
> >> about
> >> > > > > these
> >> > > > > > > > unsafe routines, they are used by many projects. Though
> >> there is
> >> > > cost
> >> > > > > > > > associated with materializing on-heap Java values from
> >> off-heap
> >> > > > > memory
> >> > > > > > > > regions. I need to benchmark that more carefully.
> >> > > > > > > >
> >> > > > > > > > 2. When you say "I've so far always been able to get
> similar
> >> > > > > performance
> >> > > > > > > > numbers" - do you mean the same performance of my case 3
> >> where 16
> >> > > > > cores
> >> > > > > > > > drive close to 40 Gbps, or the same performance between
> >> your C++
> >> > > and
> >> > > > > Java
> >> > > > > > > > benchmarks. Do you have some write-up? I would be
> >> interested to
> >> > > read
> >> > > > > up
> >> > > > > > > :)
> >> > > > > > > >
> >> > > > > > > > 3. "Can you get to 100 Gbps starting from primitive arrays
> >> in
> >> > > Java"
> >> > > > > ->
> >> > > > > > > that
> >> > > > > > > > is a good idea. Let me try and report back.
> >> > > > > > > >
> >> > > > > > > > @Wes -
> >> > > > > > > >
> >> > > > > > > > Is there some benchmark template for C++ routines I can
> >> have a
> >> > > look?
> >> > > > > I
> >> > > > > > > > would be happy to get some input from Java-Arrow experts
> on
> >> how
> >> > > to
> >> > > > > write
> >> > > > > > > > these benchmarks more efficiently. I will have a closer
> >> look at
> >> > > the
> >> > > > > JIRA
> >> > > > > > > > tickets that you mentioned.
> >> > > > > > > >
> >> > > > > > > > So, for now I am focusing on the case 3, which is about
> >> > > establishing
> >> > > > > > > > performance when reading from a local in-memory I/O stream
> >> that I
> >> > > > > > > > implemented (
> >> > > > > > > >
> >> > > > > > >
> >> > > > >
> >> > >
> >>
> https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/MemoryIOChannel.java
> >> > > > > > > ).
> >> > > > > > > > In this case I first read data from parquet files, convert
> >> them
> >> > > into
> >> > > > > > > Arrow,
> >> > > > > > > > and write-out to a MemoryIOChannel, and then read back
> from
> >> it.
> >> > > So,
> >> > > > > the
> >> > > > > > > > performance has nothing to do with Crail or HDFS in the
> >> case 3.
> >> > > > > Once, I
> >> > > > > > > > establish the base performance in this setup (which is
> >> around ~40
> >> > > > > Gbps
> >> > > > > > > with
> >> > > > > > > > 16 cores) I will add Crail to the mix. Perhaps Crail I/O
> >> streams
> >> > > can
> >> > > > > take
> >> > > > > > > > ArrowBuf as src/dst buffers. That should be doable.
> >> > > > > > >
> >> > > > > > > Right, in any case what you are testing is the performance
> of
> >> > > using a
> >> > > > > > > particular Java accessor layer to JVM off-heap Arrow memory
> >> to sum
> >> > > the
> >> > > > > > > non-null values of each column. I'm not sure that a single
> >> > > bandwidth
> >> > > > > > > number produced by this benchmark is very informative for
> >> people
> >> > > > > > > contemplating what memory format to use in their system due
> >> to the
> >> > > > > > > current state of the implementation (Java) and workload
> >> measured
> >> > > > > > > (summing the non-null values with a naive algorithm). I
> would
> >> guess
> >> > > > > > > that a C++ version with raw pointers and a loop-unrolled,
> >> > > branch-free
> >> > > > > > > vectorized sum is going to be a lot faster.
> >> > > > > > >
> >> > > > > > > >
> >> > > > > > > > @Jacques -
> >> > > > > > > >
> >> > > > > > > > That is a good point that "Arrow's implementation is more
> >> > > focused on
> >> > > > > > > > interacting with the structure than transporting it".
> >> However,
> >> > > in any
> >> > > > > > > > distributed system one needs to move data/structure around
> >> - as
> >> > > far
> >> > > > > as I
> >> > > > > > > > understand that is another goal of the project. My
> >> investigation
> >> > > > > started
> >> > > > > > > > within the context of Spark/SQL data processing. Spark
> >> converts
> >> > > > > incoming
> >> > > > > > > > data into its own in-memory UnsafeRow representation for
> >> > > processing.
> >> > > > > So
> >> > > > > > > > naturally the performance of this data ingestion pipeline
> >> cannot
> >> > > > > > > outperform
> >> > > > > > > > the read performance of the used file format. I
> benchmarked
> >> > > Parquet,
> >> > > > > ORC,
> >> > > > > > > > Avro, JSON (for the specific TPC-DS store_sales table).
> And
> >> then
> >> > > > > > > curiously
> >> > > > > > > > benchmarked Arrow as well because its design choices are a
> >> better
> >> > > > > fit for
> >> > > > > > > > modern high-performance RDMA/NVMe/100+Gbps devices I am
> >> > > > > investigating.
> >> > > > > > > From
> >> > > > > > > > this point of view, I am trying to find out can Arrow be
> >> the file
> >> > > > > format
> >> > > > > > > > for the next generation of storage/networking devices (see
> >> Apache
> >> > > > > Crail
> >> > > > > > > > project) delivering close to the hardware speed
> >> reading/writing
> >> > > > > rates. As
> >> > > > > > > > Wes pointed out that a C++ library implementation should
> be
> >> > > memory-IO
> >> > > > > > > > bound, so what would it take to deliver the same
> >> performance in
> >> > > Java
> >> > > > > ;)
> >> > > > > > > > (and then, from across the network).
> >> > > > > > > >
> >> > > > > > > > I hope this makes sense.
> >> > > > > > > >
> >> > > > > > > > Cheers,
> >> > > > > > > > --
> >> > > > > > > > Animesh
> >> > > > > > > >
> >> > > > > > > > On Wed, Sep 19, 2018 at 6:28 PM Jacques Nadeau <
> >> > > jacques@apache.org>
> >> > > > > > > wrote:
> >> > > > > > > >
> >> > > > > > > > > My big question is what is the use case and how/what are
> >> you
> >> > > > > trying to
> >> > > > > > > > > compare? Arrow's implementation is more focused on
> >> interacting
> >> > > > > with the
> >> > > > > > > > > structure than transporting it. Generally speaking, when
> >> we're
> >> > > > > working
> >> > > > > > > with
> >> > > > > > > > > Arrow data we frequently are just interacting with
> memory
> >> > > > > locations and
> >> > > > > > > > > doing direct operations. If you have a layer that
> >> supports that
> >> > > > > type of
> >> > > > > > > > > semantic, create a movement technique that depends on
> >> that.
> >> > > Arrow
> >> > > > > > > doesn't
> >> > > > > > > > > force a particular API since the data itself is defined
> >> by its
> >> > > > > > > in-memory
> >> > > > > > > > > layout so if you have a custom use or pattern, just work
> >> with
> >> > > the
> >> > > > > > > in-memory
> >> > > > > > > > > structures.
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > On Wed, Sep 19, 2018 at 7:49 AM Wes McKinney <
> >> > > wesmckinn@gmail.com>
> >> > > > > > > wrote:
> >> > > > > > > > >
> >> > > > > > > > > > hi Animesh,
> >> > > > > > > > > >
> >> > > > > > > > > > Per Johan's comments, the C++ library is essentially
> >> going
> >> > > to be
> >> > > > > > > > > > IO/memory bandwidth bound since you're interacting
> with
> >> raw
> >> > > > > pointers.
> >> > > > > > > > > >
> >> > > > > > > > > > I'm looking at your code
> >> > > > > > > > > >
> >> > > > > > > > > > private void consumeFloat4(FieldVector fv) {
> >> > > > > > > > > >     Float4Vector accessor = (Float4Vector) fv;
> >> > > > > > > > > >     int valCount = accessor.getValueCount();
> >> > > > > > > > > >     for(int i = 0; i < valCount; i++){
> >> > > > > > > > > >         if(!accessor.isNull(i)){
> >> > > > > > > > > >             float4Count+=1;
> >> > > > > > > > > >             checksum+=accessor.get(i);
> >> > > > > > > > > >         }
> >> > > > > > > > > >     }
> >> > > > > > > > > > }
> >> > > > > > > > > >
> >> > > > > > > > > > You'll want to get a Java-Arrow expert from Dremio to
> >> advise
> >> > > you
> >> > > > > the
> >> > > > > > > > > > fastest way to iterate over this data -- my
> >> understanding is
> >> > > that
> >> > > > > > > much
> >> > > > > > > > > > code in Dremio interacts with the wrapped Netty
> ArrowBuf
> >> > > objects
> >> > > > > > > > > > rather than going through the higher level APIs.
> You're
> >> also
> >> > > > > dropping
> >> > > > > > > > > > performance because memory mapping is not yet
> >> implemented in
> >> > > > > Java,
> >> > > > > > > see
> >> > > > > > > > > > https://issues.apache.org/jira/browse/ARROW-3191.
> >> > > > > > > > > >
> >> > > > > > > > > > Furthermore, the IPC reader class you are using could
> >> be made
> >> > > > > more
> >> > > > > > > > > > efficient. I described the problem in
> >> > > > > > > > > > https://issues.apache.org/jira/browse/ARROW-3192 --
> >> this
> >> > > will be
> >> > > > > > > > > > required as soon as we have the ability to do memory
> >> mapping
> >> > > in
> >> > > > > Java
> >> > > > > > > > > >
> >> > > > > > > > > > Could Crail use the Arrow data structures in its
> runtime
> >> > > rather
> >> > > > > than
> >> > > > > > > > > > copying? If not, how are Crail's runtime data
> structures
> >> > > > > different?
> >> > > > > > > > > >
> >> > > > > > > > > > - Wes
> >> > > > > > > > > > On Wed, Sep 19, 2018 at 9:19 AM Johan Peltenburg - EWI
> >> > > > > > > > > > <J....@tudelft.nl> wrote:
> >> > > > > > > > > > >
> >> > > > > > > > > > > Hello Animesh,
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > > > I browsed a bit in your sources, thanks for sharing.
> >> We
> >> > > have
> >> > > > > > > performed
> >> > > > > > > > > > some similar measurements to your third case in the
> >> past for
> >> > > > > C/C++ on
> >> > > > > > > > > > collections of various basic types such as primitives
> >> and
> >> > > > > strings.
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > > > I can say that in terms of consuming data from the
> >> Arrow
> >> > > format
> >> > > > > > > versus
> >> > > > > > > > > > language native collections in C++, I've so far always
> >> been
> >> > > able
> >> > > > > to
> >> > > > > > > get
> >> > > > > > > > > > similar performance numbers (e.g. no drawbacks due to
> >> the
> >> > > Arrow
> >> > > > > > > format
> >> > > > > > > > > > itself). Especially when accessing the data through
> >> Arrow's
> >> > > raw
> >> > > > > data
> >> > > > > > > > > > pointers (and using for example std::string_view-like
> >> > > > > constructs).
> >> > > > > > > > > > >
> >> > > > > > > > > > > In C/C++ the fast data structures are engineered in
> >> such a
> >> > > way
> >> > > > > > > that as
> >> > > > > > > > > > little pointer traversals are required and they take
> up
> >> an as
> >> > > > > small
> >> > > > > > > as
> >> > > > > > > > > > possible memory footprint. Thus each memory access is
> >> > > relatively
> >> > > > > > > > > efficient
> >> > > > > > > > > > (in terms of obtaining the data of interest). The same
> >> can
> >> > > > > > > absolutely be
> >> > > > > > > > > > said for Arrow, if not even more efficient in some
> cases
> >> > > where
> >> > > > > object
> >> > > > > > > > > > fields are of variable length.
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > > > In the JVM case, the Arrow data is stored off-heap.
> >> This
> >> > > > > requires
> >> > > > > > > the
> >> > > > > > > > > > JVM to interface to it through some calls to Unsafe
> >> hidden
> >> > > under
> >> > > > > the
> >> > > > > > > > > Netty
> >> > > > > > > > > > layer (but please correct me if I'm wrong, I'm not an
> >> expert
> >> > > on
> >> > > > > > > this).
> >> > > > > > > > > > Those calls are the only reason I can think of that
> >> would
> >> > > > > degrade the
> >> > > > > > > > > > performance a bit compared to a pure Java case. I
> don't
> >> know
> >> > > if
> >> > > > > the
> >> > > > > > > > > Unsafe
> >> > > > > > > > > > calls are inlined during JIT compilation. If they
> >> aren't they
> >> > > > > will
> >> > > > > > > > > increase
> >> > > > > > > > > > access latency to any data a little bit.
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > > > I don't have a similar machine so it's not easy to
> >> relate
> >> > > my
> >> > > > > > > numbers to
> >> > > > > > > > > > yours, but if you can get that data consumed with 100
> >> Gbps
> >> > > in a
> >> > > > > pure
> >> > > > > > > Java
> >> > > > > > > > > > case, I don't see any reason (resulting from Arrow
> >> format /
> >> > > > > off-heap
> >> > > > > > > > > > storage) why you wouldn't be able to get at least
> really
> >> > > close.
> >> > > > > Can
> >> > > > > > > you
> >> > > > > > > > > get
> >> > > > > > > > > > to 100 Gbps starting from primitive arrays in Java
> with
> >> your
> >> > > > > > > consumption
> >> > > > > > > > > > functions in the first place?
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > > > I'm interested to see your progress on this.
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > > > Kind regards,
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > > > Johan Peltenburg
> >> > > > > > > > > > >
> >> > > > > > > > > > > ________________________________
> >> > > > > > > > > > > From: Animesh Trivedi <an...@gmail.com>
> >> > > > > > > > > > > Sent: Wednesday, September 19, 2018 2:08:50 PM
> >> > > > > > > > > > > To: dev@arrow.apache.org; dev@crail.apache.org
> >> > > > > > > > > > > Subject: [JAVA] Arrow performance measurement
> >> > > > > > > > > > >
> >> > > > > > > > > > > Hi all,
> >> > > > > > > > > > >
> >> > > > > > > > > > > A week ago, Wes and I had a discussion about the
> >> > > performance
> >> > > > > of the
> >> > > > > > > > > > > Arrow/Java implementation on the Apache Crail
> >> (Incubating)
> >> > > > > mailing
> >> > > > > > > > > list (
> >> > > > > > > > > > >
> >> > > > > > >
> >> > >
> >> http://mail-archives.apache.org/mod_mbox/crail-dev/201809.mbox/browser
> >> > > > > > > > > ).
> >> > > > > > > > > > In
> >> > > > > > > > > > > a nutshell: I am investigating the performance of
> >> various
> >> > > file
> >> > > > > > > formats
> >> > > > > > > > > > > (including Arrow) on high-performance NVMe and
> >> > > > > RDMA/100Gbps/RoCE
> >> > > > > > > > > setups.
> >> > > > > > > > > > I
> >> > > > > > > > > > > benchmarked how long does it take to materialize
> >> values
> >> > > (ints,
> >> > > > > > > longs,
> >> > > > > > > > > > > doubles) of the store_sales table, the largest table
> >> in the
> >> > > > > TPC-DS
> >> > > > > > > > > > dataset
> >> > > > > > > > > > > stored on different file formats. Here is a write-up
> >> on
> >> > > this -
> >> > > > > > > > > > >
> >> > > https://crail.incubator.apache.org/blog/2018/08/sql-p1.html. I
> >> > > > > > > found
> >> > > > > > > > > > that
> >> > > > > > > > > > > between a pair of machine connected over a 100 Gbps
> >> link,
> >> > > Arrow
> >> > > > > > > (using
> >> > > > > > > > > > as a
> >> > > > > > > > > > > file format on HDFS) delivered close to ~30 Gbps of
> >> > > bandwidth
> >> > > > > with
> >> > > > > > > all
> >> > > > > > > > > 16
> >> > > > > > > > > > > cores engaged. Wes pointed out that (i) Arrow is
> >> in-memory
> >> > > IPC
> >> > > > > > > format,
> >> > > > > > > > > > and
> >> > > > > > > > > > > has not been optimized for storage interfaces/APIs
> >> like
> >> > > HDFS;
> >> > > > > (ii)
> >> > > > > > > the
> >> > > > > > > > > > > performance I am measuring is for the java
> >> implementation.
> >> > > > > > > > > > >
> >> > > > > > > > > > > Wes, I hope I summarized our discussion correctly.
> >> > > > > > > > > > >
> >> > > > > > > > > > > That brings us to this email where I promised to
> >> follow up
> >> > > on
> >> > > > > the
> >> > > > > > > Arrow
> >> > > > > > > > > > > mailing list to understand and optimize the
> >> performance of
> >> > > > > > > Arrow/Java
> >> > > > > > > > > > > implementation on high-performance devices. I wrote
> a
> >> small
> >> > > > > > > stand-alone
> >> > > > > > > > > > > benchmark (
> >> > > > > https://github.com/animeshtrivedi/benchmarking-arrow)
> >> > > > > > > with
> >> > > > > > > > > > three
> >> > > > > > > > > > > implementations of WritableByteChannel,
> >> SeekableByteChannel
> >> > > > > > > interfaces:
> >> > > > > > > > > > >
> >> > > > > > > > > > > 1. Arrow data is stored in HDFS/tmpfs - this gives
> me
> >> ~30
> >> > > Gbps
> >> > > > > > > > > > performance
> >> > > > > > > > > > > 2. Arrow data is stored in Crail/DRAM - this gives
> me
> >> > > ~35-36
> >> > > > > Gbps
> >> > > > > > > > > > > performance
> >> > > > > > > > > > > 3. Arrow data is stored in on-heap byte[] - this
> >> gives me
> >> > > ~39
> >> > > > > Gbps
> >> > > > > > > > > > > performance
> >> > > > > > > > > > >
> >> > > > > > > > > > > I think the order makes sense. To better understand
> >> the
> >> > > > > > > performance of
> >> > > > > > > > > > > Arrow/Java we can focus on the option 3.
> >> > > > > > > > > > >
> >> > > > > > > > > > > The key question I am trying to answer is "what
> would
> >> it
> >> > > take
> >> > > > > for
> >> > > > > > > > > > > Arrow/Java to deliver 100+ Gbps of performance"? Is
> it
> >> > > > > possible? If
> >> > > > > > > > > yes,
> >> > > > > > > > > > > then what is missing/or mis-interpreted by me? If
> >> not, then
> >> > > > > where
> >> > > > > > > is
> >> > > > > > > > > the
> >> > > > > > > > > > > performance lost? Does anyone have any performance
> >> > > measurements
> >> > > > > > > for C++
> >> > > > > > > > > > > implementation? if they have seen/expect better
> >> numbers.
> >> > > > > > > > > > >
> >> > > > > > > > > > > As a next step, I will profile the read path of
> >> Arrow/Java
> >> > > for
> >> > > > > the
> >> > > > > > > > > option
> >> > > > > > > > > > > 3. I will report my findings.
> >> > > > > > > > > > >
> >> > > > > > > > > > > Any thoughts and feedback on this investigation are
> >> very
> >> > > > > welcome.
> >> > > > > > > > > > >
> >> > > > > > > > > > > Cheers,
> >> > > > > > > > > > > --
> >> > > > > > > > > > > Animesh
> >> > > > > > > > > > >
> >> > > > > > > > > > > PS~ Cross-posting on the dev@crail.apache.org list
> as
> >> > > well.
> >> > > > > > > > > >
> >> > > > > > > > >
> >> > > > > > >
> >> > > > >
> >> > >
> >>
> >
>
>

Re: [JAVA] Arrow performance measurement

Posted by Animesh Trivedi <an...@gmail.com>.
Hi Li - thanks for setting up the jiras. I'll prepare the pull requests
soon.

Hi Wes, I think that is a good idea to write a blog about performance, with
pointing to the improvements in the code base (fixes and microbenchmarks).
I can prepare a first draft as we go along. This will definitely serve well
to the large java community.

Cheers,
--
Animesh


On Thu, 11 Oct 2018, 17:55 Li Jin, <ic...@gmail.com> wrote:

> I have created these as the first step. Animesh, feel free to submit PR for
> these. I will look into your micro benchmarks soon.
>
>
>
>    1. [image: Improvement] ARROW-3497[Java] Add user documentation for
>    achieving better performance
>    <https://jira.apache.org/jira/browse/ARROW-3497>
>    2. [image: Improvement] ARROW-3496[Java] Add microbenchmark code to Java
>    <https://jira.apache.org/jira/browse/ARROW-3496>
>    3. [image: Improvement] ARROW-3495[Java] Optimize bit operations
>    performance <https://jira.apache.org/jira/browse/ARROW-3495>
>    4. [image: Improvement] ARROW-3493[Java] Document
> BOUNDS_CHECKING_ENABLED
>    <https://jira.apache.org/jira/browse/ARROW-3493>
>
>
> On Thu, Oct 11, 2018 at 10:00 AM Li Jin <ic...@gmail.com> wrote:
>
> > Hi Wes and Animesh,
> >
> > Thanks for the analysis and discussion. I am happy to looking into this.
> I
> > will create some Jiras soon.
> >
> > Li
> >
> > On Thu, Oct 11, 2018 at 5:49 AM Wes McKinney <we...@gmail.com>
> wrote:
> >
> >> hey Animesh,
> >>
> >> Thank you for doing this analysis. If you'd like to share some of the
> >> analysis more broadly e.g. on the Apache Arrow blog or social media,
> >> let us know.
> >>
> >> Seems like there might be a few follow ups here for the Arrow Java
> >> community:
> >>
> >> * Documentation about achieving better performance
> >> * Writing some microperformance benchmarks
> >> * Making some improvements to the code to facilitate better performance
> >>
> >> Feel free to create some JIRA issues. Are any Java developers
> >> interested in digging a little more into these issues?
> >>
> >> Thanks,
> >> Wes
> >> On Tue, Oct 9, 2018 at 7:18 AM Animesh Trivedi
> >> <an...@gmail.com> wrote:
> >> >
> >> > Hi Wes and all,
> >> >
> >> > Here is another round of updates:
> >> >
> >> > Quick recap - previously we established that for 1kB binary blobs,
> Arrow
> >> > can deliver > 160 Gbps performance from in-memory buffers.
> >> >
> >> > In this round I looked at the performance of materializing "integers".
> >> In
> >> > my benchmarks, I found that with careful optimizations/code-rewriting
> we
> >> > can push the performance of integer reading from 5.42 Gbps/core to
> 13.61
> >> > Gbps/core (~2.5x). The peak performance with 16 cores, scale up to
> 110+
> >> > Gbps. Key things to do is:
> >> >
> >> > 1) Disable memory access checks in Arrow and Netty buffers. This gave
> >> > significant performance boost. However, for such an important
> >> performance
> >> > flag, it is very poorly documented
> >> > ("drill.enable_unsafe_memory_access=true").
> >> >
> >> > 2) Materialize values from Validity and Value direct buffers instead
> of
> >> > calling getInt() function on the IntVector. This is implemented as a
> new
> >> > Unsafe reader type (
> >> >
> >>
> https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/ArrowReaderUnsafe.java#L31
> >> > )
> >> >
> >> > 3) Optimize bitmap operation to check if a bit is set or not (
> >> >
> >>
> https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/ArrowReaderUnsafe.java#L23
> >> > )
> >> >
> >> > A detailed write up of these steps is available here:
> >> >
> >>
> https://github.com/animeshtrivedi/blog/blob/master/post/2018-10-09-arrow-int.md
> >> >
> >> > I have 2 follow-up questions:
> >> >
> >> > 1) Regarding the `isSet` function, why does it has to calculate number
> >> of
> >> > bits set? (
> >> >
> >>
> https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java#L797
> >> ).
> >> > Wouldn't just checking if the result of the AND operation is zero or
> >> not be
> >> > sufficient? Like what I did :
> >> >
> >>
> https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/ArrowReaderUnsafe.java#L28
> >> >
> >> >
> >> > 2) What is the reason behind this bitmap generation optimization here
> >> >
> >>
> https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/BitVectorHelper.java#L179
> >> > ? At this point when this function is called, the bitmap vector is
> >> already
> >> > read from the storage, and contains the right values (either all null,
> >> all
> >> > set, or whatever). Generating this mask here for the special cases
> when
> >> the
> >> > values are all NULL or all set (this was the case in my benchmark),
> can
> >> be
> >> > slower than just returning what one has read from the storage.
> >> >
> >> > Collectively optimizing these two bitmap operations gives more than 1
> >> > Gbps of gains in my benchmarking code.
> >> >
> >> > Cheers,
> >> > --
> >> > Animesh
> >> >
> >> >
> >> > On Thu, Oct 4, 2018 at 12:52 PM Wes McKinney <we...@gmail.com>
> >> wrote:
> >> >
> >> > > See e.g.
> >> > >
> >> > >
> >> > >
> >>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/ipc-read-write-test.cc#L222
> >> > >
> >> > >
> >> > > On Thu, Oct 4, 2018 at 6:48 AM Animesh Trivedi
> >> > > <an...@gmail.com> wrote:
> >> > > >
> >> > > > Primarily write the same microbenchmark as I have in Java in C++
> for
> >> > > table
> >> > > > reading and value materialization. So just an example of
> equivalent
> >> > > > ArrowFileReader example code in C++. Unit tests are a good
> starting
> >> > > point,
> >> > > > thanks for the tip :)
> >> > > >
> >> > > > On Thu, Oct 4, 2018 at 12:39 PM Wes McKinney <wesmckinn@gmail.com
> >
> >> > > wrote:
> >> > > >
> >> > > > > > 3. Are there examples of Arrow in C++ read/write code that I
> can
> >> > > have a
> >> > > > > look?
> >> > > > >
> >> > > > > What kind of code are you looking for? I would direct you to
> >> relevant
> >> > > > > unit tests that exhibit certain functionality, but it depends on
> >> what
> >> > > > > you are trying to do
> >> > > > > On Wed, Oct 3, 2018 at 9:45 AM Animesh Trivedi
> >> > > > > <an...@gmail.com> wrote:
> >> > > > > >
> >> > > > > > Hi all - quick update on the performance investigation:
> >> > > > > >
> >> > > > > > - I spent some time looking at performance profile for a
> binary
> >> blob
> >> > > > > column
> >> > > > > > (1024 bytes of byte[]) and found a few favorable settings for
> >> > > delivering
> >> > > > > up
> >> > > > > > to 168 Gbps from in-memory reading benchmark on 16 cores.
> These
> >> > > settings
> >> > > > > > (NUMA, JVM settings, Arrow holder API, and batch size, etc.)
> are
> >> > > > > documented
> >> > > > > > here:
> >> > > > > >
> >> > > > >
> >> > >
> >>
> https://github.com/animeshtrivedi/blog/blob/master/post/2018-10-03-arrow-binary.md
> >> > > > > > - these setting also help to improved the last number that
> >> reported
> >> > > (but
> >> > > > > > not by much) for the in-memory TPC-DS store_sales table from
> ~39
> >> > > Gbps up
> >> > > > > to
> >> > > > > > ~45-47 Gbps (note: this number is just in-memory benchmark,
> >> i.e.,
> >> > > w/o any
> >> > > > > > networking or storage links)
> >> > > > > >
> >> > > > > > A few follow up questions that I have:
> >> > > > > > 1. Arrow reads a batch size worth of data in one go. Are there
> >> any
> >> > > > > > recommended batch sizes? In my investigation, small batch size
> >> help
> >> > > with
> >> > > > > a
> >> > > > > > better cache profile but increase number of instructions
> >> required
> >> > > (more
> >> > > > > > looping). Larger one do otherwise. Somehow ~10MB/thread seem
> to
> >> be
> >> > > the
> >> > > > > best
> >> > > > > > performing configuration, which is also a bit counter
> intuitive
> >> as
> >> > > for 16
> >> > > > > > threads this will lead to 160 MB of memory footprint. May be
> >> this is
> >> > > also
> >> > > > > > tired to the memory management logic which is my next
> question.
> >> > > > > > 2. Arrow use's netty's memory manager. (i) what are decent
> netty
> >> > > memory
> >> > > > > > management settings for "io.netty.allocator.*" parameters? I
> >> don't
> >> > > find
> >> > > > > any
> >> > > > > > decent write-up on them; (ii) Is there a provision for
> ArrowBuf
> >> being
> >> > > > > > re-used once a batch is consumed? As it looks for now, read
> read
> >> > > > > allocates
> >> > > > > > a new buffer to read the whole batch size.
> >> > > > > > 3. Are there examples of Arrow in C++ read/write code that I
> can
> >> > > have a
> >> > > > > > look?
> >> > > > > >
> >> > > > > > Cheers,
> >> > > > > > --
> >> > > > > > Animesh
> >> > > > > >
> >> > > > > >
> >> > > > > > On Wed, Sep 19, 2018 at 8:49 PM Wes McKinney <
> >> wesmckinn@gmail.com>
> >> > > > > wrote:
> >> > > > > >
> >> > > > > > > On Wed, Sep 19, 2018 at 2:13 PM Animesh Trivedi
> >> > > > > > > <an...@gmail.com> wrote:
> >> > > > > > > >
> >> > > > > > > > Hi Johan, Wes, and Jacques - many thanks for your
> comments:
> >> > > > > > > >
> >> > > > > > > > @Johan -
> >> > > > > > > > 1. I also do not suspect that there is any inherent
> >> drawback in
> >> > > Java
> >> > > > > or
> >> > > > > > > C++
> >> > > > > > > > due to the Arrow format. I mentioned C++ because Wes
> >> pointed out
> >> > > that
> >> > > > > > > Java
> >> > > > > > > > routines are not the most optimized ones (yet!). And
> >> naturally
> >> > > one
> >> > > > > would
> >> > > > > > > > expect better performance in a native language with all
> >> > > > > > > pointer/memory/SIMD
> >> > > > > > > > instruction optimizations that you mentioned. As far as I
> >> know,
> >> > > the
> >> > > > > > > > off-heap buffers are managed in ArrowBuf which implements
> an
> >> > > abstract
> >> > > > > > > netty
> >> > > > > > > > class. But there is nothing unusual, i.e., netty specific,
> >> about
> >> > > > > these
> >> > > > > > > > unsafe routines, they are used by many projects. Though
> >> there is
> >> > > cost
> >> > > > > > > > associated with materializing on-heap Java values from
> >> off-heap
> >> > > > > memory
> >> > > > > > > > regions. I need to benchmark that more carefully.
> >> > > > > > > >
> >> > > > > > > > 2. When you say "I've so far always been able to get
> similar
> >> > > > > performance
> >> > > > > > > > numbers" - do you mean the same performance of my case 3
> >> where 16
> >> > > > > cores
> >> > > > > > > > drive close to 40 Gbps, or the same performance between
> >> your C++
> >> > > and
> >> > > > > Java
> >> > > > > > > > benchmarks. Do you have some write-up? I would be
> >> interested to
> >> > > read
> >> > > > > up
> >> > > > > > > :)
> >> > > > > > > >
> >> > > > > > > > 3. "Can you get to 100 Gbps starting from primitive arrays
> >> in
> >> > > Java"
> >> > > > > ->
> >> > > > > > > that
> >> > > > > > > > is a good idea. Let me try and report back.
> >> > > > > > > >
> >> > > > > > > > @Wes -
> >> > > > > > > >
> >> > > > > > > > Is there some benchmark template for C++ routines I can
> >> have a
> >> > > look?
> >> > > > > I
> >> > > > > > > > would be happy to get some input from Java-Arrow experts
> on
> >> how
> >> > > to
> >> > > > > write
> >> > > > > > > > these benchmarks more efficiently. I will have a closer
> >> look at
> >> > > the
> >> > > > > JIRA
> >> > > > > > > > tickets that you mentioned.
> >> > > > > > > >
> >> > > > > > > > So, for now I am focusing on the case 3, which is about
> >> > > establishing
> >> > > > > > > > performance when reading from a local in-memory I/O stream
> >> that I
> >> > > > > > > > implemented (
> >> > > > > > > >
> >> > > > > > >
> >> > > > >
> >> > >
> >>
> https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/MemoryIOChannel.java
> >> > > > > > > ).
> >> > > > > > > > In this case I first read data from parquet files, convert
> >> them
> >> > > into
> >> > > > > > > Arrow,
> >> > > > > > > > and write-out to a MemoryIOChannel, and then read back
> from
> >> it.
> >> > > So,
> >> > > > > the
> >> > > > > > > > performance has nothing to do with Crail or HDFS in the
> >> case 3.
> >> > > > > Once, I
> >> > > > > > > > establish the base performance in this setup (which is
> >> around ~40
> >> > > > > Gbps
> >> > > > > > > with
> >> > > > > > > > 16 cores) I will add Crail to the mix. Perhaps Crail I/O
> >> streams
> >> > > can
> >> > > > > take
> >> > > > > > > > ArrowBuf as src/dst buffers. That should be doable.
> >> > > > > > >
> >> > > > > > > Right, in any case what you are testing is the performance
> of
> >> > > using a
> >> > > > > > > particular Java accessor layer to JVM off-heap Arrow memory
> >> to sum
> >> > > the
> >> > > > > > > non-null values of each column. I'm not sure that a single
> >> > > bandwidth
> >> > > > > > > number produced by this benchmark is very informative for
> >> people
> >> > > > > > > contemplating what memory format to use in their system due
> >> to the
> >> > > > > > > current state of the implementation (Java) and workload
> >> measured
> >> > > > > > > (summing the non-null values with a naive algorithm). I
> would
> >> guess
> >> > > > > > > that a C++ version with raw pointers and a loop-unrolled,
> >> > > branch-free
> >> > > > > > > vectorized sum is going to be a lot faster.
> >> > > > > > >
> >> > > > > > > >
> >> > > > > > > > @Jacques -
> >> > > > > > > >
> >> > > > > > > > That is a good point that "Arrow's implementation is more
> >> > > focused on
> >> > > > > > > > interacting with the structure than transporting it".
> >> However,
> >> > > in any
> >> > > > > > > > distributed system one needs to move data/structure around
> >> - as
> >> > > far
> >> > > > > as I
> >> > > > > > > > understand that is another goal of the project. My
> >> investigation
> >> > > > > started
> >> > > > > > > > within the context of Spark/SQL data processing. Spark
> >> converts
> >> > > > > incoming
> >> > > > > > > > data into its own in-memory UnsafeRow representation for
> >> > > processing.
> >> > > > > So
> >> > > > > > > > naturally the performance of this data ingestion pipeline
> >> cannot
> >> > > > > > > outperform
> >> > > > > > > > the read performance of the used file format. I
> benchmarked
> >> > > Parquet,
> >> > > > > ORC,
> >> > > > > > > > Avro, JSON (for the specific TPC-DS store_sales table).
> And
> >> then
> >> > > > > > > curiously
> >> > > > > > > > benchmarked Arrow as well because its design choices are a
> >> better
> >> > > > > fit for
> >> > > > > > > > modern high-performance RDMA/NVMe/100+Gbps devices I am
> >> > > > > investigating.
> >> > > > > > > From
> >> > > > > > > > this point of view, I am trying to find out can Arrow be
> >> the file
> >> > > > > format
> >> > > > > > > > for the next generation of storage/networking devices (see
> >> Apache
> >> > > > > Crail
> >> > > > > > > > project) delivering close to the hardware speed
> >> reading/writing
> >> > > > > rates. As
> >> > > > > > > > Wes pointed out that a C++ library implementation should
> be
> >> > > memory-IO
> >> > > > > > > > bound, so what would it take to deliver the same
> >> performance in
> >> > > Java
> >> > > > > ;)
> >> > > > > > > > (and then, from across the network).
> >> > > > > > > >
> >> > > > > > > > I hope this makes sense.
> >> > > > > > > >
> >> > > > > > > > Cheers,
> >> > > > > > > > --
> >> > > > > > > > Animesh
> >> > > > > > > >
> >> > > > > > > > On Wed, Sep 19, 2018 at 6:28 PM Jacques Nadeau <
> >> > > jacques@apache.org>
> >> > > > > > > wrote:
> >> > > > > > > >
> >> > > > > > > > > My big question is what is the use case and how/what are
> >> you
> >> > > > > trying to
> >> > > > > > > > > compare? Arrow's implementation is more focused on
> >> interacting
> >> > > > > with the
> >> > > > > > > > > structure than transporting it. Generally speaking, when
> >> we're
> >> > > > > working
> >> > > > > > > with
> >> > > > > > > > > Arrow data we frequently are just interacting with
> memory
> >> > > > > locations and
> >> > > > > > > > > doing direct operations. If you have a layer that
> >> supports that
> >> > > > > type of
> >> > > > > > > > > semantic, create a movement technique that depends on
> >> that.
> >> > > Arrow
> >> > > > > > > doesn't
> >> > > > > > > > > force a particular API since the data itself is defined
> >> by its
> >> > > > > > > in-memory
> >> > > > > > > > > layout so if you have a custom use or pattern, just work
> >> with
> >> > > the
> >> > > > > > > in-memory
> >> > > > > > > > > structures.
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > On Wed, Sep 19, 2018 at 7:49 AM Wes McKinney <
> >> > > wesmckinn@gmail.com>
> >> > > > > > > wrote:
> >> > > > > > > > >
> >> > > > > > > > > > hi Animesh,
> >> > > > > > > > > >
> >> > > > > > > > > > Per Johan's comments, the C++ library is essentially
> >> going
> >> > > to be
> >> > > > > > > > > > IO/memory bandwidth bound since you're interacting
> with
> >> raw
> >> > > > > pointers.
> >> > > > > > > > > >
> >> > > > > > > > > > I'm looking at your code
> >> > > > > > > > > >
> >> > > > > > > > > > private void consumeFloat4(FieldVector fv) {
> >> > > > > > > > > >     Float4Vector accessor = (Float4Vector) fv;
> >> > > > > > > > > >     int valCount = accessor.getValueCount();
> >> > > > > > > > > >     for(int i = 0; i < valCount; i++){
> >> > > > > > > > > >         if(!accessor.isNull(i)){
> >> > > > > > > > > >             float4Count+=1;
> >> > > > > > > > > >             checksum+=accessor.get(i);
> >> > > > > > > > > >         }
> >> > > > > > > > > >     }
> >> > > > > > > > > > }
> >> > > > > > > > > >
> >> > > > > > > > > > You'll want to get a Java-Arrow expert from Dremio to
> >> advise
> >> > > you
> >> > > > > the
> >> > > > > > > > > > fastest way to iterate over this data -- my
> >> understanding is
> >> > > that
> >> > > > > > > much
> >> > > > > > > > > > code in Dremio interacts with the wrapped Netty
> ArrowBuf
> >> > > objects
> >> > > > > > > > > > rather than going through the higher level APIs.
> You're
> >> also
> >> > > > > dropping
> >> > > > > > > > > > performance because memory mapping is not yet
> >> implemented in
> >> > > > > Java,
> >> > > > > > > see
> >> > > > > > > > > > https://issues.apache.org/jira/browse/ARROW-3191.
> >> > > > > > > > > >
> >> > > > > > > > > > Furthermore, the IPC reader class you are using could
> >> be made
> >> > > > > more
> >> > > > > > > > > > efficient. I described the problem in
> >> > > > > > > > > > https://issues.apache.org/jira/browse/ARROW-3192 --
> >> this
> >> > > will be
> >> > > > > > > > > > required as soon as we have the ability to do memory
> >> mapping
> >> > > in
> >> > > > > Java
> >> > > > > > > > > >
> >> > > > > > > > > > Could Crail use the Arrow data structures in its
> runtime
> >> > > rather
> >> > > > > than
> >> > > > > > > > > > copying? If not, how are Crail's runtime data
> structures
> >> > > > > different?
> >> > > > > > > > > >
> >> > > > > > > > > > - Wes
> >> > > > > > > > > > On Wed, Sep 19, 2018 at 9:19 AM Johan Peltenburg - EWI
> >> > > > > > > > > > <J....@tudelft.nl> wrote:
> >> > > > > > > > > > >
> >> > > > > > > > > > > Hello Animesh,
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > > > I browsed a bit in your sources, thanks for sharing.
> >> We
> >> > > have
> >> > > > > > > performed
> >> > > > > > > > > > some similar measurements to your third case in the
> >> past for
> >> > > > > C/C++ on
> >> > > > > > > > > > collections of various basic types such as primitives
> >> and
> >> > > > > strings.
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > > > I can say that in terms of consuming data from the
> >> Arrow
> >> > > format
> >> > > > > > > versus
> >> > > > > > > > > > language native collections in C++, I've so far always
> >> been
> >> > > able
> >> > > > > to
> >> > > > > > > get
> >> > > > > > > > > > similar performance numbers (e.g. no drawbacks due to
> >> the
> >> > > Arrow
> >> > > > > > > format
> >> > > > > > > > > > itself). Especially when accessing the data through
> >> Arrow's
> >> > > raw
> >> > > > > data
> >> > > > > > > > > > pointers (and using for example std::string_view-like
> >> > > > > constructs).
> >> > > > > > > > > > >
> >> > > > > > > > > > > In C/C++ the fast data structures are engineered in
> >> such a
> >> > > way
> >> > > > > > > that as
> >> > > > > > > > > > little pointer traversals are required and they take
> up
> >> an as
> >> > > > > small
> >> > > > > > > as
> >> > > > > > > > > > possible memory footprint. Thus each memory access is
> >> > > relatively
> >> > > > > > > > > efficient
> >> > > > > > > > > > (in terms of obtaining the data of interest). The same
> >> can
> >> > > > > > > absolutely be
> >> > > > > > > > > > said for Arrow, if not even more efficient in some
> cases
> >> > > where
> >> > > > > object
> >> > > > > > > > > > fields are of variable length.
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > > > In the JVM case, the Arrow data is stored off-heap.
> >> This
> >> > > > > requires
> >> > > > > > > the
> >> > > > > > > > > > JVM to interface to it through some calls to Unsafe
> >> hidden
> >> > > under
> >> > > > > the
> >> > > > > > > > > Netty
> >> > > > > > > > > > layer (but please correct me if I'm wrong, I'm not an
> >> expert
> >> > > on
> >> > > > > > > this).
> >> > > > > > > > > > Those calls are the only reason I can think of that
> >> would
> >> > > > > degrade the
> >> > > > > > > > > > performance a bit compared to a pure JAva case. I
> don't
> >> know
> >> > > if
> >> > > > > the
> >> > > > > > > > > Unsafe
> >> > > > > > > > > > calls are inlined during JIT compilation. If they
> >> aren't they
> >> > > > > will
> >> > > > > > > > > increase
> >> > > > > > > > > > access latency to any data a little bit.
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > > > I don't have a similar machine so it's not easy to
> >> relate
> >> > > my
> >> > > > > > > numbers to
> >> > > > > > > > > > yours, but if you can get that data consumed with 100
> >> Gbps
> >> > > in a
> >> > > > > pure
> >> > > > > > > Java
> >> > > > > > > > > > case, I don't see any reason (resulting from Arrow
> >> format /
> >> > > > > off-heap
> >> > > > > > > > > > storage) why you wouldn't be able to get at least
> really
> >> > > close.
> >> > > > > Can
> >> > > > > > > you
> >> > > > > > > > > get
> >> > > > > > > > > > to 100 Gbps starting from primitive arrays in Java
> with
> >> your
> >> > > > > > > consumption
> >> > > > > > > > > > functions in the first place?
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > > > I'm interested to see your progress on this.
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > > > Kind regards,
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > > > Johan Peltenburg
> >> > > > > > > > > > >
> >> > > > > > > > > > > ________________________________
> >> > > > > > > > > > > From: Animesh Trivedi <an...@gmail.com>
> >> > > > > > > > > > > Sent: Wednesday, September 19, 2018 2:08:50 PM
> >> > > > > > > > > > > To: dev@arrow.apache.org; dev@crail.apache.org
> >> > > > > > > > > > > Subject: [JAVA] Arrow performance measurement
> >> > > > > > > > > > >
> >> > > > > > > > > > > Hi all,
> >> > > > > > > > > > >
> >> > > > > > > > > > > A week ago, Wes and I had a discussion about the
> >> > > performance
> >> > > > > of the
> >> > > > > > > > > > > Arrow/Java implementation on the Apache Crail
> >> (Incubating)
> >> > > > > mailing
> >> > > > > > > > > list (
> >> > > > > > > > > > >
> >> > > > > > >
> >> > >
> >> http://mail-archives.apache.org/mod_mbox/crail-dev/201809.mbox/browser
> >> > > > > > > > > ).
> >> > > > > > > > > > In
> >> > > > > > > > > > > a nutshell: I am investigating the performance of
> >> various
> >> > > file
> >> > > > > > > formats
> >> > > > > > > > > > > (including Arrow) on high-performance NVMe and
> >> > > > > RDMA/100Gbps/RoCE
> >> > > > > > > > > setups.
> >> > > > > > > > > > I
> >> > > > > > > > > > > benchmarked how long does it take to materialize
> >> values
> >> > > (ints,
> >> > > > > > > longs,
> >> > > > > > > > > > > doubles) of the store_sales table, the largest table
> >> in the
> >> > > > > TPC-DS
> >> > > > > > > > > > dataset
> >> > > > > > > > > > > stored on different file formats. Here is a write-up
> >> on
> >> > > this -
> >> > > > > > > > > > >
> >> > > https://crail.incubator.apache.org/blog/2018/08/sql-p1.html. I
> >> > > > > > > found
> >> > > > > > > > > > that
> >> > > > > > > > > > > between a pair of machine connected over a 100 Gbps
> >> link,
> >> > > Arrow
> >> > > > > > > (using
> >> > > > > > > > > > as a
> >> > > > > > > > > > > file format on HDFS) delivered close to ~30 Gbps of
> >> > > bandwidth
> >> > > > > with
> >> > > > > > > all
> >> > > > > > > > > 16
> >> > > > > > > > > > > cores engaged. Wes pointed out that (i) Arrow is
> >> in-memory
> >> > > IPC
> >> > > > > > > format,
> >> > > > > > > > > > and
> >> > > > > > > > > > > has not been optimized for storage interfaces/APIs
> >> like
> >> > > HDFS;
> >> > > > > (ii)
> >> > > > > > > the
> >> > > > > > > > > > > performance I am measuring is for the java
> >> implementation.
> >> > > > > > > > > > >
> >> > > > > > > > > > > Wes, I hope I summarized our discussion correctly.
> >> > > > > > > > > > >
> >> > > > > > > > > > > That brings us to this email where I promised to
> >> follow up
> >> > > on
> >> > > > > the
> >> > > > > > > Arrow
> >> > > > > > > > > > > mailing list to understand and optimize the
> >> performance of
> >> > > > > > > Arrow/Java
> >> > > > > > > > > > > implementation on high-performance devices. I wrote
> a
> >> small
> >> > > > > > > stand-alone
> >> > > > > > > > > > > benchmark (
> >> > > > > https://github.com/animeshtrivedi/benchmarking-arrow)
> >> > > > > > > with
> >> > > > > > > > > > three
> >> > > > > > > > > > > implementations of WritableByteChannel,
> >> SeekableByteChannel
> >> > > > > > > interfaces:
> >> > > > > > > > > > >
> >> > > > > > > > > > > 1. Arrow data is stored in HDFS/tmpfs - this gives
> me
> >> ~30
> >> > > Gbps
> >> > > > > > > > > > performance
> >> > > > > > > > > > > 2. Arrow data is stored in Crail/DRAM - this gives
> me
> >> > > ~35-36
> >> > > > > Gbps
> >> > > > > > > > > > > performance
> >> > > > > > > > > > > 3. Arrow data is stored in on-heap byte[] - this
> >> gives me
> >> > > ~39
> >> > > > > Gbps
> >> > > > > > > > > > > performance
> >> > > > > > > > > > >
> >> > > > > > > > > > > I think the order makes sense. To better understand
> >> the
> >> > > > > > > performance of
> >> > > > > > > > > > > Arrow/Java we can focus on the option 3.
> >> > > > > > > > > > >
> >> > > > > > > > > > > The key question I am trying to answer is "what
> would
> >> it
> >> > > take
> >> > > > > for
> >> > > > > > > > > > > Arrow/Java to deliver 100+ Gbps of performance"? Is
> it
> >> > > > > possible? If
> >> > > > > > > > > yes,
> >> > > > > > > > > > > then what is missing/or mis-interpreted by me? If
> >> not, then
> >> > > > > where
> >> > > > > > > is
> >> > > > > > > > > the
> >> > > > > > > > > > > performance lost? Does anyone have any performance
> >> > > measurements
> >> > > > > > > for C++
> >> > > > > > > > > > > implementation? if they have seen/expect better
> >> numbers.
> >> > > > > > > > > > >
> >> > > > > > > > > > > As a next step, I will profile the read path of
> >> Arrow/Java
> >> > > for
> >> > > > > the
> >> > > > > > > > > option
> >> > > > > > > > > > > 3. I will report my findings.
> >> > > > > > > > > > >
> >> > > > > > > > > > > Any thoughts and feedback on this investigation are
> >> very
> >> > > > > welcome.
> >> > > > > > > > > > >
> >> > > > > > > > > > > Cheers,
> >> > > > > > > > > > > --
> >> > > > > > > > > > > Animesh
> >> > > > > > > > > > >
> >> > > > > > > > > > > PS~ Cross-posting on the dev@crail.apache.org list
> as
> >> > > well.
> >> > > > > > > > > >
> >> > > > > > > > >
> >> > > > > > >
> >> > > > >
> >> > >
> >>
> >
>
>

Re: [JAVA] Arrow performance measurement

Posted by Li Jin <ic...@gmail.com>.
I have created these as the first step. Animesh, feel free to submit PRs for
these. I will look into your microbenchmarks soon.



   1. ARROW-3497 [Java] Add user documentation for achieving better
   performance <https://jira.apache.org/jira/browse/ARROW-3497>
   2. ARROW-3496 [Java] Add microbenchmark code to Java
   <https://jira.apache.org/jira/browse/ARROW-3496>
   3. ARROW-3495 [Java] Optimize bit operations performance
   <https://jira.apache.org/jira/browse/ARROW-3495>
   4. ARROW-3493 [Java] Document BOUNDS_CHECKING_ENABLED
   <https://jira.apache.org/jira/browse/ARROW-3493>
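
Regarding item 2 (ARROW-3496), a minimal JMH-style skeleton of such a
microbenchmark might look like the following. This is an editor's sketch that
assumes a JMH dependency and simplifies the vector setup; it is not code
proposed on the JIRA.

import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class IntVectorReadBench {
    private static final int N = 1 << 20;
    private IntVector vector;

    @Setup
    public void setup() {
        // Simplified setup: one allocator and one fully populated vector.
        vector = new IntVector("ints", new RootAllocator(Long.MAX_VALUE));
        vector.allocateNew(N);
        for (int i = 0; i < N; i++) {
            vector.setSafe(i, i);
        }
        vector.setValueCount(N);
    }

    @Benchmark
    public long sumViaAccessor() {
        // Baseline: the accessor-based loop the thread starts from.
        long sum = 0;
        for (int i = 0; i < vector.getValueCount(); i++) {
            if (!vector.isNull(i)) {
                sum += vector.get(i);
            }
        }
        return sum; // returning the result keeps JMH from eliminating the loop
    }
}

Run with the standard JMH runner; the same harness could then grow variants
that read through the ArrowBuf objects directly, as discussed upthread.
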


On Thu, Oct 11, 2018 at 10:00 AM Li Jin <ic...@gmail.com> wrote:

> Hi Wes and Animesh,
>
> Thanks for the analysis and discussion. I am happy to look into this. I
> will create some Jiras soon.
>
> Li
>
> On Thu, Oct 11, 2018 at 5:49 AM Wes McKinney <we...@gmail.com> wrote:
>
>> hey Animesh,
>>
>> Thank you for doing this analysis. If you'd like to share some of the
>> analysis more broadly e.g. on the Apache Arrow blog or social media,
>> let us know.
>>
>> Seems like there might be a few follow ups here for the Arrow Java
>> community:
>>
>> * Documentation about achieving better performance
>> * Writing some microperformance benchmarks
>> * Making some improvements to the code to facilitate better performance
>>
>> Feel free to create some JIRA issues. Are any Java developers
>> interested in digging a little more into these issues?
>>
>> Thanks,
>> Wes
>> On Tue, Oct 9, 2018 at 7:18 AM Animesh Trivedi
>> <an...@gmail.com> wrote:
>> >
>> > Hi Wes and all,
>> >
>> > Here is another round of updates:
>> >
>> > Quick recap - previously we established that for 1kB binary blobs, Arrow
>> > can deliver > 160 Gbps performance from in-memory buffers.
>> >
>> > In this round I looked at the performance of materializing "integers".
>> In
>> > my benchmarks, I found that with careful optimizations/code-rewriting we
>> > can push the performance of integer reading from 5.42 Gbps/core to 13.61
>> > Gbps/core (~2.5x). The peak performance with 16 cores scales up to 110+
>> > Gbps. The key things to do are:
>> >
>> > 1) Disable memory access checks in Arrow and Netty buffers. This gave a
>> > significant performance boost. However, for such an important
>> > performance flag, it is very poorly documented
>> > ("drill.enable_unsafe_memory_access=true").
>> >
>> > 2) Materialize values from Validity and Value direct buffers instead of
>> > calling getInt() function on the IntVector. This is implemented as a new
>> > Unsafe reader type (
>> >
>> https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/ArrowReaderUnsafe.java#L31
>> > )
>> >
>> > 3) Optimize bitmap operation to check if a bit is set or not (
>> >
>> https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/ArrowReaderUnsafe.java#L23
>> > )
>> >
>> > A detailed write up of these steps is available here:
>> >
>> https://github.com/animeshtrivedi/blog/blob/master/post/2018-10-09-arrow-int.md
>> >
>> > I have 2 follow-up questions:
>> >
>> > 1) Regarding the `isSet` function, why does it have to calculate the
>> > number of bits set? (
>> >
>> https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java#L797
>> ).
>> > Wouldn't just checking if the result of the AND operation is zero or
>> not be
>> > sufficient? Like what I did :
>> >
>> https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/ArrowReaderUnsafe.java#L28
>> >
>> >
>> > 2) What is the reason behind this bitmap generation optimization here
>> >
>> https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/BitVectorHelper.java#L179
>> > ? At this point when this function is called, the bitmap vector is
>> already
>> > read from the storage, and contains the right values (either all null,
>> all
>> > set, or whatever). Generating this mask here for the special cases when
>> the
>> > values are all NULL or all set (this was the case in my benchmark), can
>> be
>> > slower than just returning what one has read from the storage.
>> >
>> > Collectively optimizing these two bitmap operations gives more than 1
>> > Gbps of gains in my benchmarking code.
>> >
>> > Cheers,
>> > --
>> > Animesh
>> >
>> >
>> > On Thu, Oct 4, 2018 at 12:52 PM Wes McKinney <we...@gmail.com>
>> wrote:
>> >
>> > > See e.g.
>> > >
>> > >
>> > >
>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/ipc-read-write-test.cc#L222
>> > >
>> > >
>> > > On Thu, Oct 4, 2018 at 6:48 AM Animesh Trivedi
>> > > <an...@gmail.com> wrote:
>> > > >
>> > > > Primarily write the same microbenchmark as I have in Java in C++ for
>> > > table
>> > > > reading and value materialization. So just an example of equivalent
>> > > > ArrowFileReader example code in C++. Unit tests are a good starting
>> > > point,
>> > > > thanks for the tip :)
>> > > >
>> > > > On Thu, Oct 4, 2018 at 12:39 PM Wes McKinney <we...@gmail.com>
>> > > wrote:
>> > > >
>> > > > > > 3. Are there examples of Arrow in C++ read/write code that I can
>> > > have a
>> > > > > look?
>> > > > >
>> > > > > What kind of code are you looking for? I would direct you to
>> relevant
>> > > > > unit tests that exhibit certain functionality, but it depends on
>> what
>> > > > > you are trying to do
>> > > > > On Wed, Oct 3, 2018 at 9:45 AM Animesh Trivedi
>> > > > > <an...@gmail.com> wrote:
>> > > > > >
>> > > > > > Hi all - quick update on the performance investigation:
>> > > > > >
>> > > > > > - I spent some time looking at performance profile for a binary
>> blob
>> > > > > column
>> > > > > > (1024 bytes of byte[]) and found a few favorable settings for
>> > > delivering
>> > > > > up
>> > > > > > to 168 Gbps from in-memory reading benchmark on 16 cores. These
>> > > settings
>> > > > > > (NUMA, JVM settings, Arrow holder API, and batch size, etc.) are
>> > > > > documented
>> > > > > > here:
>> > > > > >
>> > > > >
>> > >
>> https://github.com/animeshtrivedi/blog/blob/master/post/2018-10-03-arrow-binary.md
>> > > > > > - these setting also help to improved the last number that
>> reported
>> > > (but
>> > > > > > not by much) for the in-memory TPC-DS store_sales table from ~39
>> > > Gbps up
>> > > > > to
>> > > > > > ~45-47 Gbps (note: this number is just in-memory benchmark,
>> i.e.,
>> > > w/o any
>> > > > > > networking or storage links)
>> > > > > >
>> > > > > > A few follow up questions that I have:
>> > > > > > 1. Arrow reads a batch size worth of data in one go. Are there
>> any
>> > > > > > recommended batch sizes? In my investigation, small batch size
>> help
>> > > with
>> > > > > a
>> > > > > > better cache profile but increase number of instructions
>> required
>> > > (more
>> > > > > > looping). Larger one do otherwise. Somehow ~10MB/thread seem to
>> be
>> > > the
>> > > > > best
>> > > > > > performing configuration, which is also a bit counter intuitive
>> as
>> > > for 16
>> > > > > > threads this will lead to 160 MB of memory footprint. May be
>> this is
>> > > also
>> > > > > > tired to the memory management logic which is my next question.
>> > > > > > 2. Arrow use's netty's memory manager. (i) what are decent netty
>> > > memory
>> > > > > > management settings for "io.netty.allocator.*" parameters? I
>> don't
>> > > find
>> > > > > any
>> > > > > > decent write-up on them; (ii) Is there a provision for ArrowBuf
>> being
>> > > > > > re-used once a batch is consumed? As it looks for now, read read
>> > > > > allocates
>> > > > > > a new buffer to read the whole batch size.
>> > > > > > 3. Are there examples of Arrow in C++ read/write code that I can
>> > > have a
>> > > > > > look?
>> > > > > >
>> > > > > > Cheers,
>> > > > > > --
>> > > > > > Animesh
>> > > > > >
>> > > > > >
>> > > > > > On Wed, Sep 19, 2018 at 8:49 PM Wes McKinney <
>> wesmckinn@gmail.com>
>> > > > > wrote:
>> > > > > >
>> > > > > > > On Wed, Sep 19, 2018 at 2:13 PM Animesh Trivedi
>> > > > > > > <an...@gmail.com> wrote:
>> > > > > > > >
>> > > > > > > > Hi Johan, Wes, and Jacques - many thanks for your comments:
>> > > > > > > >
>> > > > > > > > @Johan -
>> > > > > > > > 1. I also do not suspect that there is any inherent
>> drawback in
>> > > Java
>> > > > > or
>> > > > > > > C++
>> > > > > > > > due to the Arrow format. I mentioned C++ because Wes
>> pointed out
>> > > that
>> > > > > > > Java
>> > > > > > > > routines are not the most optimized ones (yet!). And
>> naturally
>> > > one
>> > > > > would
>> > > > > > > > expect better performance in a native language with all
>> > > > > > > pointer/memory/SIMD
>> > > > > > > > instruction optimizations that you mentioned. As far as I
>> know,
>> > > the
>> > > > > > > > off-heap buffers are managed in ArrowBuf which implements an
>> > > abstract
>> > > > > > > netty
>> > > > > > > > class. But there is nothing unusual, i.e., netty specific,
>> about
>> > > > > these
>> > > > > > > > unsafe routines, they are used by many projects. Though
>> there is
>> > > cost
>> > > > > > > > associated with materializing on-heap Java values from
>> off-heap
>> > > > > memory
>> > > > > > > > regions. I need to benchmark that more carefully.
>> > > > > > > >
>> > > > > > > > 2. When you say "I've so far always been able to get similar
>> > > > > performance
>> > > > > > > > numbers" - do you mean the same performance of my case 3
>> where 16
>> > > > > cores
>> > > > > > > > drive close to 40 Gbps, or the same performance between
>> your C++
>> > > and
>> > > > > Java
>> > > > > > > > benchmarks. Do you have some write-up? I would be
>> interested to
>> > > read
>> > > > > up
>> > > > > > > :)
>> > > > > > > >
>> > > > > > > > 3. "Can you get to 100 Gbps starting from primitive arrays
>> in
>> > > Java"
>> > > > > ->
>> > > > > > > that
>> > > > > > > > is a good idea. Let me try and report back.
>> > > > > > > >
>> > > > > > > > @Wes -
>> > > > > > > >
>> > > > > > > > Is there some benchmark template for C++ routines I can
>> have a
>> > > look?
>> > > > > I
>> > > > > > > > would be happy to get some input from Java-Arrow experts on
>> how
>> > > to
>> > > > > write
>> > > > > > > > these benchmarks more efficiently. I will have a closer
>> look at
>> > > the
>> > > > > JIRA
>> > > > > > > > tickets that you mentioned.
>> > > > > > > >
>> > > > > > > > So, for now I am focusing on the case 3, which is about
>> > > establishing
>> > > > > > > > performance when reading from a local in-memory I/O stream
>> that I
>> > > > > > > > implemented (
>> > > > > > > >
>> > > > > > >
>> > > > >
>> > >
>> https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/MemoryIOChannel.java
>> > > > > > > ).
>> > > > > > > > In this case I first read data from parquet files, convert
>> them
>> > > into
>> > > > > > > Arrow,
>> > > > > > > > and write-out to a MemoryIOChannel, and then read back from
>> it.
>> > > So,
>> > > > > the
>> > > > > > > > performance has nothing to do with Crail or HDFS in the
>> case 3.
>> > > > > Once, I
>> > > > > > > > establish the base performance in this setup (which is
>> around ~40
>> > > > > Gbps
>> > > > > > > with
>> > > > > > > > 16 cores) I will add Crail to the mix. Perhaps Crail I/O
>> streams
>> > > can
>> > > > > take
>> > > > > > > > ArrowBuf as src/dst buffers. That should be doable.
>> > > > > > >
>> > > > > > > Right, in any case what you are testing is the performance of
>> > > using a
>> > > > > > > particular Java accessor layer to JVM off-heap Arrow memory
>> to sum
>> > > the
>> > > > > > > non-null values of each column. I'm not sure that a single
>> > > bandwidth
>> > > > > > > number produced by this benchmark is very informative for
>> people
>> > > > > > > contemplating what memory format to use in their system due
>> to the
>> > > > > > > current state of the implementation (Java) and workload
>> measured
>> > > > > > > (summing the non-null values with a naive algorithm). I would
>> guess
>> > > > > > > that a C++ version with raw pointers and a loop-unrolled,
>> > > branch-free
>> > > > > > > vectorized sum is going to be a lot faster.
>> > > > > > >
>> > > > > > > >
>> > > > > > > > @Jacques -
>> > > > > > > >
>> > > > > > > > That is a good point that "Arrow's implementation is more
>> > > focused on
>> > > > > > > > interacting with the structure than transporting it".
>> However,
>> > > in any
>> > > > > > > > distributed system one needs to move data/structure around
>> - as
>> > > far
>> > > > > as I
>> > > > > > > > understand that is another goal of the project. My
>> investigation
>> > > > > started
>> > > > > > > > within the context of Spark/SQL data processing. Spark
>> converts
>> > > > > incoming
>> > > > > > > > data into its own in-memory UnsafeRow representation for
>> > > processing.
>> > > > > So
>> > > > > > > > naturally the performance of this data ingestion pipeline
>> cannot
>> > > > > > > outperform
>> > > > > > > > the read performance of the used file format. I benchmarked
>> > > Parquet,
>> > > > > ORC,
>> > > > > > > > Avro, JSON (for the specific TPC-DS store_sales table). And
>> then
>> > > > > > > curiously
>> > > > > > > > benchmarked Arrow as well because its design choices are a
>> better
>> > > > > fit for
>> > > > > > > > modern high-performance RDMA/NVMe/100+Gbps devices I am
>> > > > > investigating.
>> > > > > > > From
>> > > > > > > > this point of view, I am trying to find out can Arrow be
>> the file
>> > > > > format
>> > > > > > > > for the next generation of storage/networking devices (see
>> Apache
>> > > > > Crail
>> > > > > > > > project) delivering close to the hardware speed
>> reading/writing
>> > > > > rates. As
>> > > > > > > > Wes pointed out that a C++ library implementation should be
>> > > memory-IO
>> > > > > > > > bound, so what would it take to deliver the same
>> performance in
>> > > Java
>> > > > > ;)
>> > > > > > > > (and then, from across the network).
>> > > > > > > >
>> > > > > > > > I hope this makes sense.
>> > > > > > > >
>> > > > > > > > Cheers,
>> > > > > > > > --
>> > > > > > > > Animesh
>> > > > > > > >
>> > > > > > > > On Wed, Sep 19, 2018 at 6:28 PM Jacques Nadeau <
>> > > jacques@apache.org>
>> > > > > > > wrote:
>> > > > > > > >
>> > > > > > > > > My big question is what is the use case and how/what are
>> you
>> > > > > trying to
>> > > > > > > > > compare? Arrow's implementation is more focused on
>> interacting
>> > > > > with the
>> > > > > > > > > structure than transporting it. Generally speaking, when
>> we're
>> > > > > working
>> > > > > > > with
>> > > > > > > > > Arrow data we frequently are just interacting with memory
>> > > > > locations and
>> > > > > > > > > doing direct operations. If you have a layer that
>> supports that
>> > > > > type of
>> > > > > > > > > semantic, create a movement technique that depends on
>> that.
>> > > Arrow
>> > > > > > > doesn't
>> > > > > > > > > force a particular API since the data itself is defined
>> by its
>> > > > > > > in-memory
>> > > > > > > > > layout so if you have a custom use or pattern, just work
>> with
>> > > the
>> > > > > > > in-memory
>> > > > > > > > > structures.
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > On Wed, Sep 19, 2018 at 7:49 AM Wes McKinney <
>> > > wesmckinn@gmail.com>
>> > > > > > > wrote:
>> > > > > > > > >
>> > > > > > > > > > hi Animesh,
>> > > > > > > > > >
>> > > > > > > > > > Per Johan's comments, the C++ library is essentially
>> going
>> > > to be
>> > > > > > > > > > IO/memory bandwidth bound since you're interacting with
>> raw
>> > > > > pointers.
>> > > > > > > > > >
>> > > > > > > > > > I'm looking at your code
>> > > > > > > > > >
>> > > > > > > > > > private void consumeFloat4(FieldVector fv) {
>> > > > > > > > > >     Float4Vector accessor = (Float4Vector) fv;
>> > > > > > > > > >     int valCount = accessor.getValueCount();
>> > > > > > > > > >     for(int i = 0; i < valCount; i++){
>> > > > > > > > > >         if(!accessor.isNull(i)){
>> > > > > > > > > >             float4Count+=1;
>> > > > > > > > > >             checksum+=accessor.get(i);
>> > > > > > > > > >         }
>> > > > > > > > > >     }
>> > > > > > > > > > }
>> > > > > > > > > >
>> > > > > > > > > > You'll want to get a Java-Arrow expert from Dremio to
>> advise
>> > > you
>> > > > > the
>> > > > > > > > > > fastest way to iterate over this data -- my
>> understanding is
>> > > that
>> > > > > > > much
>> > > > > > > > > > code in Dremio interacts with the wrapped Netty ArrowBuf
>> > > objects
>> > > > > > > > > > rather than going through the higher level APIs. You're
>> also
>> > > > > dropping
>> > > > > > > > > > performance because memory mapping is not yet
>> implemented in
>> > > > > Java,
>> > > > > > > see
>> > > > > > > > > > https://issues.apache.org/jira/browse/ARROW-3191.
>> > > > > > > > > >
>> > > > > > > > > > Furthermore, the IPC reader class you are using could
>> be made
>> > > > > more
>> > > > > > > > > > efficient. I described the problem in
>> > > > > > > > > > https://issues.apache.org/jira/browse/ARROW-3192 --
>> this
>> > > will be
>> > > > > > > > > > required as soon as we have the ability to do memory
>> mapping
>> > > in
>> > > > > Java
>> > > > > > > > > >
>> > > > > > > > > > Could Crail use the Arrow data structures in its runtime
>> > > rather
>> > > > > than
>> > > > > > > > > > copying? If not, how are Crail's runtime data structures
>> > > > > different?
>> > > > > > > > > >
>> > > > > > > > > > - Wes
>> > > > > > > > > > On Wed, Sep 19, 2018 at 9:19 AM Johan Peltenburg - EWI
>> > > > > > > > > > <J....@tudelft.nl> wrote:
>> > > > > > > > > > >
>> > > > > > > > > > > Hello Animesh,
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > > I browsed a bit in your sources, thanks for sharing.
>> We
>> > > have
>> > > > > > > performed
>> > > > > > > > > > some similar measurements to your third case in the
>> past for
>> > > > > C/C++ on
>> > > > > > > > > > collections of various basic types such as primitives
>> and
>> > > > > strings.
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > > I can say that in terms of consuming data from the
>> Arrow
>> > > format
>> > > > > > > versus
>> > > > > > > > > > language native collections in C++, I've so far always
>> been
>> > > able
>> > > > > to
>> > > > > > > get
>> > > > > > > > > > similar performance numbers (e.g. no drawbacks due to
>> the
>> > > Arrow
>> > > > > > > format
>> > > > > > > > > > itself). Especially when accessing the data through
>> Arrow's
>> > > raw
>> > > > > data
>> > > > > > > > > > pointers (and using for example std::string_view-like
>> > > > > constructs).
>> > > > > > > > > > >
>> > > > > > > > > > > In C/C++ the fast data structures are engineered in
>> such a
>> > > way
>> > > > > > > that as
>> > > > > > > > > > little pointer traversals are required and they take up
>> an as
>> > > > > small
>> > > > > > > as
>> > > > > > > > > > possible memory footprint. Thus each memory access is
>> > > relatively
>> > > > > > > > > efficient
>> > > > > > > > > > (in terms of obtaining the data of interest). The same
>> can
>> > > > > > > absolutely be
>> > > > > > > > > > said for Arrow, if not even more efficient in some cases
>> > > where
>> > > > > object
>> > > > > > > > > > fields are of variable length.
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > > In the JVM case, the Arrow data is stored off-heap.
>> This
>> > > > > requires
>> > > > > > > the
>> > > > > > > > > > JVM to interface to it through some calls to Unsafe
>> hidden
>> > > under
>> > > > > the
>> > > > > > > > > Netty
>> > > > > > > > > > layer (but please correct me if I'm wrong, I'm not an
>> expert
>> > > on
>> > > > > > > this).
>> > > > > > > > > > Those calls are the only reason I can think of that
>> would
>> > > > > degrade the
>> > > > > > > > > > performance a bit compared to a pure JAva case. I don't
>> know
>> > > if
>> > > > > the
>> > > > > > > > > Unsafe
>> > > > > > > > > > calls are inlined during JIT compilation. If they
>> aren't they
>> > > > > will
>> > > > > > > > > increase
>> > > > > > > > > > access latency to any data a little bit.
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > > I don't have a similar machine so it's not easy to
>> relate
>> > > my
>> > > > > > > numbers to
>> > > > > > > > > > yours, but if you can get that data consumed with 100
>> Gbps
>> > > in a
>> > > > > pure
>> > > > > > > Java
>> > > > > > > > > > case, I don't see any reason (resulting from Arrow
>> format /
>> > > > > off-heap
>> > > > > > > > > > storage) why you wouldn't be able to get at least really
>> > > close.
>> > > > > Can
>> > > > > > > you
>> > > > > > > > > get
>> > > > > > > > > > to 100 Gbps starting from primitive arrays in Java with
>> your
>> > > > > > > consumption
>> > > > > > > > > > functions in the first place?
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > > I'm interested to see your progress on this.
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > > Kind regards,
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > > Johan Peltenburg
>> > > > > > > > > > >
>> > > > > > > > > > > ________________________________
>> > > > > > > > > > > From: Animesh Trivedi <an...@gmail.com>
>> > > > > > > > > > > Sent: Wednesday, September 19, 2018 2:08:50 PM
>> > > > > > > > > > > To: dev@arrow.apache.org; dev@crail.apache.org
>> > > > > > > > > > > Subject: [JAVA] Arrow performance measurement
>> > > > > > > > > > >
>> > > > > > > > > > > Hi all,
>> > > > > > > > > > >
>> > > > > > > > > > > A week ago, Wes and I had a discussion about the
>> > > performance
>> > > > > of the
>> > > > > > > > > > > Arrow/Java implementation on the Apache Crail
>> (Incubating)
>> > > > > mailing
>> > > > > > > > > list (
>> > > > > > > > > > >
>> > > > > > >
>> > >
>> http://mail-archives.apache.org/mod_mbox/crail-dev/201809.mbox/browser
>> > > > > > > > > ).
>> > > > > > > > > > In
>> > > > > > > > > > > a nutshell: I am investigating the performance of
>> various
>> > > file
>> > > > > > > formats
>> > > > > > > > > > > (including Arrow) on high-performance NVMe and
>> > > > > RDMA/100Gbps/RoCE
>> > > > > > > > > setups.
>> > > > > > > > > > I
>> > > > > > > > > > > benchmarked how long does it take to materialize
>> values
>> > > (ints,
>> > > > > > > longs,
>> > > > > > > > > > > doubles) of the store_sales table, the largest table
>> in the
>> > > > > TPC-DS
>> > > > > > > > > > dataset
>> > > > > > > > > > > stored on different file formats. Here is a write-up
>> on
>> > > this -
>> > > > > > > > > > >
>> > > https://crail.incubator.apache.org/blog/2018/08/sql-p1.html. I
>> > > > > > > found
>> > > > > > > > > > that
>> > > > > > > > > > > between a pair of machine connected over a 100 Gbps
>> link,
>> > > Arrow
>> > > > > > > (using
>> > > > > > > > > > as a
>> > > > > > > > > > > file format on HDFS) delivered close to ~30 Gbps of
>> > > bandwidth
>> > > > > with
>> > > > > > > all
>> > > > > > > > > 16
>> > > > > > > > > > > cores engaged. Wes pointed out that (i) Arrow is
>> in-memory
>> > > IPC
>> > > > > > > format,
>> > > > > > > > > > and
>> > > > > > > > > > > has not been optimized for storage interfaces/APIs
>> like
>> > > HDFS;
>> > > > > (ii)
>> > > > > > > the
>> > > > > > > > > > > performance I am measuring is for the java
>> implementation.
>> > > > > > > > > > >
>> > > > > > > > > > > Wes, I hope I summarized our discussion correctly.
>> > > > > > > > > > >
>> > > > > > > > > > > That brings us to this email where I promised to
>> follow up
>> > > on
>> > > > > the
>> > > > > > > Arrow
>> > > > > > > > > > > mailing list to understand and optimize the
>> performance of
>> > > > > > > Arrow/Java
>> > > > > > > > > > > implementation on high-performance devices. I wrote a
>> small
>> > > > > > > stand-alone
>> > > > > > > > > > > benchmark (
>> > > > > https://github.com/animeshtrivedi/benchmarking-arrow)
>> > > > > > > with
>> > > > > > > > > > three
>> > > > > > > > > > > implementations of WritableByteChannel,
>> SeekableByteChannel
>> > > > > > > interfaces:
>> > > > > > > > > > >
>> > > > > > > > > > > 1. Arrow data is stored in HDFS/tmpfs - this gives me
>> ~30
>> > > Gbps
>> > > > > > > > > > performance
>> > > > > > > > > > > 2. Arrow data is stored in Crail/DRAM - this gives me
>> > > ~35-36
>> > > > > Gbps
>> > > > > > > > > > > performance
>> > > > > > > > > > > 3. Arrow data is stored in on-heap byte[] - this
>> gives me
>> > > ~39
>> > > > > Gbps
>> > > > > > > > > > > performance
>> > > > > > > > > > >
>> > > > > > > > > > > I think the order makes sense. To better understand
>> the
>> > > > > > > performance of
>> > > > > > > > > > > Arrow/Java we can focus on the option 3.
>> > > > > > > > > > >
>> > > > > > > > > > > The key question I am trying to answer is "what would
>> it
>> > > take
>> > > > > for
>> > > > > > > > > > > Arrow/Java to deliver 100+ Gbps of performance"? Is it
>> > > > > possible? If
>> > > > > > > > > yes,
>> > > > > > > > > > > then what is missing/or mis-interpreted by me? If
>> not, then
>> > > > > where
>> > > > > > > is
>> > > > > > > > > the
>> > > > > > > > > > > performance lost? Does anyone have any performance
>> > > measurements
>> > > > > > > for C++
>> > > > > > > > > > > implementation? if they have seen/expect better
>> numbers.
>> > > > > > > > > > >
>> > > > > > > > > > > As a next step, I will profile the read path of
>> Arrow/Java
>> > > for
>> > > > > the
>> > > > > > > > > option
>> > > > > > > > > > > 3. I will report my findings.
>> > > > > > > > > > >
>> > > > > > > > > > > Any thoughts and feedback on this investigation are
>> very
>> > > > > welcome.
>> > > > > > > > > > >
>> > > > > > > > > > > Cheers,
>> > > > > > > > > > > --
>> > > > > > > > > > > Animesh
>> > > > > > > > > > >
>> > > > > > > > > > > PS~ Cross-posting on the dev@crail.apache.org list as
>> > > well.
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > >
>> > > > >
>> > >
>>
>

>> how
>> > > to
>> > > > > write
>> > > > > > > > these benchmarks more efficiently. I will have a closer
>> look at
>> > > the
>> > > > > JIRA
>> > > > > > > > tickets that you mentioned.
>> > > > > > > >
>> > > > > > > > So, for now I am focusing on the case 3, which is about
>> > > establishing
>> > > > > > > > performance when reading from a local in-memory I/O stream
>> that I
>> > > > > > > > implemented (
>> > > > > > > >
>> > > > > > >
>> > > > >
>> > >
>> https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/MemoryIOChannel.java
>> > > > > > > ).
>> > > > > > > > In this case I first read data from parquet files, convert
>> them
>> > > into
>> > > > > > > Arrow,
>> > > > > > > > and write-out to a MemoryIOChannel, and then read back from
>> it.
>> > > So,
>> > > > > the
>> > > > > > > > performance has nothing to do with Crail or HDFS in the
>> case 3.
>> > > > > Once, I
>> > > > > > > > establish the base performance in this setup (which is
>> around ~40
>> > > > > Gbps
>> > > > > > > with
>> > > > > > > > 16 cores) I will add Crail to the mix. Perhaps Crail I/O
>> streams
>> > > can
>> > > > > take
>> > > > > > > > ArrowBuf as src/dst buffers. That should be doable.
>> > > > > > >
>> > > > > > > Right, in any case what you are testing is the performance of
>> > > using a
>> > > > > > > particular Java accessor layer to JVM off-heap Arrow memory
>> to sum
>> > > the
>> > > > > > > non-null values of each column. I'm not sure that a single
>> > > bandwidth
>> > > > > > > number produced by this benchmark is very informative for
>> people
>> > > > > > > contemplating what memory format to use in their system due
>> to the
>> > > > > > > current state of the implementation (Java) and workload
>> measured
>> > > > > > > (summing the non-null values with a naive algorithm). I would
>> guess
>> > > > > > > that a C++ version with raw pointers and a loop-unrolled,
>> > > branch-free
>> > > > > > > vectorized sum is going to be a lot faster.
>> > > > > > >
>> > > > > > > >
>> > > > > > > > @Jacques -
>> > > > > > > >
>> > > > > > > > That is a good point that "Arrow's implementation is more
>> > > focused on
>> > > > > > > > interacting with the structure than transporting it".
>> However,
>> > > in any
>> > > > > > > > distributed system one needs to move data/structure around
>> - as
>> > > far
>> > > > > as I
>> > > > > > > > understand that is another goal of the project. My
>> investigation
>> > > > > started
>> > > > > > > > within the context of Spark/SQL data processing. Spark
>> converts
>> > > > > incoming
>> > > > > > > > data into its own in-memory UnsafeRow representation for
>> > > processing.
>> > > > > So
>> > > > > > > > naturally the performance of this data ingestion pipeline
>> cannot
>> > > > > > > outperform
>> > > > > > > > the read performance of the used file format. I benchmarked
>> > > Parquet,
>> > > > > ORC,
>> > > > > > > > Avro, JSON (for the specific TPC-DS store_sales table). And
>> then
>> > > > > > > curiously
>> > > > > > > > benchmarked Arrow as well because its design choices are a
>> better
>> > > > > fit for
>> > > > > > > > modern high-performance RDMA/NVMe/100+Gbps devices I am
>> > > > > investigating.
>> > > > > > > From
>> > > > > > > > this point of view, I am trying to find out can Arrow be
>> the file
>> > > > > format
>> > > > > > > > for the next generation of storage/networking devices (see
>> Apache
>> > > > > Crail
>> > > > > > > > project) delivering close to the hardware speed
>> reading/writing
>> > > > > rates. As
>> > > > > > > > Wes pointed out that a C++ library implementation should be
>> > > memory-IO
>> > > > > > > > bound, so what would it take to deliver the same
>> performance in
>> > > Java
>> > > > > ;)
>> > > > > > > > (and then, from across the network).
>> > > > > > > >
>> > > > > > > > I hope this makes sense.
>> > > > > > > >
>> > > > > > > > Cheers,
>> > > > > > > > --
>> > > > > > > > Animesh
>> > > > > > > >
>> > > > > > > > On Wed, Sep 19, 2018 at 6:28 PM Jacques Nadeau <
>> > > jacques@apache.org>
>> > > > > > > wrote:
>> > > > > > > >
>> > > > > > > > > My big question is what is the use case and how/what are
>> you
>> > > > > trying to
>> > > > > > > > > compare? Arrow's implementation is more focused on
>> interacting
>> > > > > with the
>> > > > > > > > > structure than transporting it. Generally speaking, when
>> we're
>> > > > > working
>> > > > > > > with
>> > > > > > > > > Arrow data we frequently are just interacting with memory
>> > > > > locations and
>> > > > > > > > > doing direct operations. If you have a layer that
>> supports that
>> > > > > type of
>> > > > > > > > > semantic, create a movement technique that depends on
>> that.
>> > > Arrow
>> > > > > > > doesn't
>> > > > > > > > > force a particular API since the data itself is defined
>> by its
>> > > > > > > in-memory
>> > > > > > > > > layout so if you have a custom use or pattern, just work
>> with
>> > > the
>> > > > > > > in-memory
>> > > > > > > > > structures.
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > On Wed, Sep 19, 2018 at 7:49 AM Wes McKinney <
>> > > wesmckinn@gmail.com>
>> > > > > > > wrote:
>> > > > > > > > >
>> > > > > > > > > > hi Animesh,
>> > > > > > > > > >
>> > > > > > > > > > Per Johan's comments, the C++ library is essentially
>> going
>> > > to be
>> > > > > > > > > > IO/memory bandwidth bound since you're interacting with
>> raw
>> > > > > pointers.
>> > > > > > > > > >
>> > > > > > > > > > I'm looking at your code
>> > > > > > > > > >
>> > > > > > > > > > private void consumeFloat4(FieldVector fv) {
>> > > > > > > > > >     Float4Vector accessor = (Float4Vector) fv;
>> > > > > > > > > >     int valCount = accessor.getValueCount();
>> > > > > > > > > >     for(int i = 0; i < valCount; i++){
>> > > > > > > > > >         if(!accessor.isNull(i)){
>> > > > > > > > > >             float4Count+=1;
>> > > > > > > > > >             checksum+=accessor.get(i);
>> > > > > > > > > >         }
>> > > > > > > > > >     }
>> > > > > > > > > > }
>> > > > > > > > > >
>> > > > > > > > > > You'll want to get a Java-Arrow expert from Dremio to
>> advise
>> > > you
>> > > > > the
>> > > > > > > > > > fastest way to iterate over this data -- my
>> understanding is
>> > > that
>> > > > > > > much
>> > > > > > > > > > code in Dremio interacts with the wrapped Netty ArrowBuf
>> > > objects
>> > > > > > > > > > rather than going through the higher level APIs. You're
>> also
>> > > > > dropping
>> > > > > > > > > > performance because memory mapping is not yet
>> implemented in
>> > > > > Java,
>> > > > > > > see
>> > > > > > > > > > https://issues.apache.org/jira/browse/ARROW-3191.
>> > > > > > > > > >
>> > > > > > > > > > Furthermore, the IPC reader class you are using could
>> be made
>> > > > > more
>> > > > > > > > > > efficient. I described the problem in
>> > > > > > > > > > https://issues.apache.org/jira/browse/ARROW-3192 --
>> this
>> > > will be
>> > > > > > > > > > required as soon as we have the ability to do memory
>> mapping
>> > > in
>> > > > > Java
>> > > > > > > > > >
>> > > > > > > > > > Could Crail use the Arrow data structures in its runtime
>> > > rather
>> > > > > than
>> > > > > > > > > > copying? If not, how are Crail's runtime data structures
>> > > > > different?
>> > > > > > > > > >
>> > > > > > > > > > - Wes
>> > > > > > > > > > On Wed, Sep 19, 2018 at 9:19 AM Johan Peltenburg - EWI
>> > > > > > > > > > <J....@tudelft.nl> wrote:
>> > > > > > > > > > >
>> > > > > > > > > > > Hello Animesh,
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > > I browsed a bit in your sources, thanks for sharing.
>> We
>> > > have
>> > > > > > > performed
>> > > > > > > > > > some similar measurements to your third case in the
>> past for
>> > > > > C/C++ on
>> > > > > > > > > > collections of various basic types such as primitives
>> and
>> > > > > strings.
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > > I can say that in terms of consuming data from the
>> Arrow
>> > > format
>> > > > > > > versus
>> > > > > > > > > > language native collections in C++, I've so far always
>> been
>> > > able
>> > > > > to
>> > > > > > > get
>> > > > > > > > > > similar performance numbers (e.g. no drawbacks due to
>> the
>> > > Arrow
>> > > > > > > format
>> > > > > > > > > > itself). Especially when accessing the data through
>> Arrow's
>> > > raw
>> > > > > data
>> > > > > > > > > > pointers (and using for example std::string_view-like
>> > > > > constructs).
>> > > > > > > > > > >
>> > > > > > > > > > > In C/C++ the fast data structures are engineered in
>> such a
>> > > way
>> > > > > > > that as
>> > > > > > > > > > little pointer traversals are required and they take up
>> an as
>> > > > > small
>> > > > > > > as
>> > > > > > > > > > possible memory footprint. Thus each memory access is
>> > > relatively
>> > > > > > > > > efficient
>> > > > > > > > > > (in terms of obtaining the data of interest). The same
>> can
>> > > > > > > absolutely be
>> > > > > > > > > > said for Arrow, if not even more efficient in some cases
>> > > where
>> > > > > object
>> > > > > > > > > > fields are of variable length.
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > > In the JVM case, the Arrow data is stored off-heap.
>> This
>> > > > > requires
>> > > > > > > the
>> > > > > > > > > > JVM to interface to it through some calls to Unsafe
>> hidden
>> > > under
>> > > > > the
>> > > > > > > > > Netty
>> > > > > > > > > > layer (but please correct me if I'm wrong, I'm not an
>> expert
>> > > on
>> > > > > > > this).
>> > > > > > > > > > Those calls are the only reason I can think of that
>> would
>> > > > > degrade the
>> > > > > > > > > > performance a bit compared to a pure JAva case. I don't
>> know
>> > > if
>> > > > > the
>> > > > > > > > > Unsafe
>> > > > > > > > > > calls are inlined during JIT compilation. If they
>> aren't they
>> > > > > will
>> > > > > > > > > increase
>> > > > > > > > > > access latency to any data a little bit.
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > > I don't have a similar machine so it's not easy to
>> relate
>> > > my
>> > > > > > > numbers to
>> > > > > > > > > > yours, but if you can get that data consumed with 100
>> Gbps
>> > > in a
>> > > > > pure
>> > > > > > > Java
>> > > > > > > > > > case, I don't see any reason (resulting from Arrow
>> format /
>> > > > > off-heap
>> > > > > > > > > > storage) why you wouldn't be able to get at least really
>> > > close.
>> > > > > Can
>> > > > > > > you
>> > > > > > > > > get
>> > > > > > > > > > to 100 Gbps starting from primitive arrays in Java with
>> your
>> > > > > > > consumption
>> > > > > > > > > > functions in the first place?
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > > I'm interested to see your progress on this.
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > > Kind regards,
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > > Johan Peltenburg
>> > > > > > > > > > >
>> > > > > > > > > > > ________________________________
>> > > > > > > > > > > From: Animesh Trivedi <an...@gmail.com>
>> > > > > > > > > > > Sent: Wednesday, September 19, 2018 2:08:50 PM
>> > > > > > > > > > > To: dev@arrow.apache.org; dev@crail.apache.org
>> > > > > > > > > > > Subject: [JAVA] Arrow performance measurement
>> > > > > > > > > > >
>> > > > > > > > > > > Hi all,
>> > > > > > > > > > >
>> > > > > > > > > > > A week ago, Wes and I had a discussion about the
>> > > performance
>> > > > > of the
>> > > > > > > > > > > Arrow/Java implementation on the Apache Crail
>> (Incubating)
>> > > > > mailing
>> > > > > > > > > list (
>> > > > > > > > > > >
>> > > > > > >
>> > >
>> http://mail-archives.apache.org/mod_mbox/crail-dev/201809.mbox/browser
>> > > > > > > > > ).
>> > > > > > > > > > In
>> > > > > > > > > > > a nutshell: I am investigating the performance of
>> various
>> > > file
>> > > > > > > formats
>> > > > > > > > > > > (including Arrow) on high-performance NVMe and
>> > > > > RDMA/100Gbps/RoCE
>> > > > > > > > > setups.
>> > > > > > > > > > I
>> > > > > > > > > > > benchmarked how long does it take to materialize
>> values
>> > > (ints,
>> > > > > > > longs,
>> > > > > > > > > > > doubles) of the store_sales table, the largest table
>> in the
>> > > > > TPC-DS
>> > > > > > > > > > dataset
>> > > > > > > > > > > stored on different file formats. Here is a write-up
>> on
>> > > this -
>> > > > > > > > > > >
>> > > https://crail.incubator.apache.org/blog/2018/08/sql-p1.html. I
>> > > > > > > found
>> > > > > > > > > > that
>> > > > > > > > > > > between a pair of machine connected over a 100 Gbps
>> link,
>> > > Arrow
>> > > > > > > (using
>> > > > > > > > > > as a
>> > > > > > > > > > > file format on HDFS) delivered close to ~30 Gbps of
>> > > bandwidth
>> > > > > with
>> > > > > > > all
>> > > > > > > > > 16
>> > > > > > > > > > > cores engaged. Wes pointed out that (i) Arrow is
>> in-memory
>> > > IPC
>> > > > > > > format,
>> > > > > > > > > > and
>> > > > > > > > > > > has not been optimized for storage interfaces/APIs
>> like
>> > > HDFS;
>> > > > > (ii)
>> > > > > > > the
>> > > > > > > > > > > performance I am measuring is for the java
>> implementation.
>> > > > > > > > > > >
>> > > > > > > > > > > Wes, I hope I summarized our discussion correctly.
>> > > > > > > > > > >
>> > > > > > > > > > > That brings us to this email where I promised to
>> follow up
>> > > on
>> > > > > the
>> > > > > > > Arrow
>> > > > > > > > > > > mailing list to understand and optimize the
>> performance of
>> > > > > > > Arrow/Java
>> > > > > > > > > > > implementation on high-performance devices. I wrote a
>> small
>> > > > > > > stand-alone
>> > > > > > > > > > > benchmark (
>> > > > > https://github.com/animeshtrivedi/benchmarking-arrow)
>> > > > > > > with
>> > > > > > > > > > three
>> > > > > > > > > > > implementations of WritableByteChannel,
>> SeekableByteChannel
>> > > > > > > interfaces:
>> > > > > > > > > > >
>> > > > > > > > > > > 1. Arrow data is stored in HDFS/tmpfs - this gives me
>> ~30
>> > > Gbps
>> > > > > > > > > > performance
>> > > > > > > > > > > 2. Arrow data is stored in Crail/DRAM - this gives me
>> > > ~35-36
>> > > > > Gbps
>> > > > > > > > > > > performance
>> > > > > > > > > > > 3. Arrow data is stored in on-heap byte[] - this
>> gives me
>> > > ~39
>> > > > > Gbps
>> > > > > > > > > > > performance
>> > > > > > > > > > >
>> > > > > > > > > > > I think the order makes sense. To better understand
>> the
>> > > > > > > performance of
>> > > > > > > > > > > Arrow/Java we can focus on the option 3.
>> > > > > > > > > > >
>> > > > > > > > > > > The key question I am trying to answer is "what would
>> it
>> > > take
>> > > > > for
>> > > > > > > > > > > Arrow/Java to deliver 100+ Gbps of performance"? Is it
>> > > > > possible? If
>> > > > > > > > > yes,
>> > > > > > > > > > > then what is missing/or mis-interpreted by me? If
>> not, then
>> > > > > where
>> > > > > > > is
>> > > > > > > > > the
>> > > > > > > > > > > performance lost? Does anyone have any performance
>> > > measurements
>> > > > > > > for C++
>> > > > > > > > > > > implementation? if they have seen/expect better
>> numbers.
>> > > > > > > > > > >
>> > > > > > > > > > > As a next step, I will profile the read path of
>> Arrow/Java
>> > > for
>> > > > > the
>> > > > > > > > > option
>> > > > > > > > > > > 3. I will report my findings.
>> > > > > > > > > > >
>> > > > > > > > > > > Any thoughts and feedback on this investigation are
>> very
>> > > > > welcome.
>> > > > > > > > > > >
>> > > > > > > > > > > Cheers,
>> > > > > > > > > > > --
>> > > > > > > > > > > Animesh
>> > > > > > > > > > >
>> > > > > > > > > > > PS~ Cross-posting on the dev@crail.apache.org list as
>> > > well.
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > >
>> > > > >
>> > >
>>
>

Re: [JAVA] Arrow performance measurement

Posted by Li Jin <ic...@gmail.com>.
Hi Wes and Animesh,

Thanks for the analysis and discussion. I am happy to look into this. I
will create some JIRAs soon.

Li

On Thu, Oct 11, 2018 at 5:49 AM Wes McKinney <we...@gmail.com> wrote:

> hey Animesh,
>
> Thank you for doing this analysis. If you'd like to share some of the
> analysis more broadly e.g. on the Apache Arrow blog or social media,
> let us know.
>
> Seems like there might be a few follow ups here for the Arrow Java
> community:
>
> * Documentation about achieving better performance
> * Writing some microperformance benchmarks
> * Making some improvements to the code to facilitate better performance
>
> Feel free to create some JIRA issues. Are any Java developers
> interested in digging a little more into these issues?
>
> Thanks,
> Wes
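
(As one possible shape for the microbenchmarks mentioned in the second bullet above,
a minimal JMH harness timing IntVector reads could look roughly like the sketch
below. This assumes JMH is on the classpath and is only an illustration, not an
official Arrow benchmark.)

    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.IntVector;
    import org.openjdk.jmh.annotations.*;

    // Hypothetical JMH microbenchmark: sum the non-null values of an IntVector.
    @State(Scope.Thread)
    public class IntVectorReadBench {
        private static final int N = 1 << 20;
        private BufferAllocator allocator;
        private IntVector vector;

        @Setup
        public void setup() {
            allocator = new RootAllocator(Long.MAX_VALUE);
            vector = new IntVector("ints", allocator);
            vector.allocateNew(N);
            for (int i = 0; i < N; i++) {
                vector.setSafe(i, i);            // populate with non-null ints
            }
            vector.setValueCount(N);
        }

        @Benchmark
        public long sumViaAccessor() {
            long sum = 0;
            for (int i = 0; i < vector.getValueCount(); i++) {
                if (!vector.isNull(i)) {         // per-element validity check
                    sum += vector.get(i);
                }
            }
            return sum;
        }

        @TearDown
        public void tearDown() {
            vector.close();
            allocator.close();
        }
    }
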
> On Tue, Oct 9, 2018 at 7:18 AM Animesh Trivedi
> <an...@gmail.com> wrote:
> >
> > Hi Wes and all,
> >
> > Here is another round of updates:
> >
> > Quick recap - previously we established that for 1kB binary blobs, Arrow
> > can deliver > 160 Gbps performance from in-memory buffers.
> >
> > In this round I looked at the performance of materializing "integers". In
> > my benchmarks, I found that with careful optimizations/code-rewriting we
> > can push the performance of integer reading from 5.42 Gbps/core to 13.61
> > Gbps/core (~2.5x). The peak performance with 16 cores scales up to 110+
> > Gbps. The key things to do are:
> >
> > 1) Disable memory access checks in Arrow and Netty buffers. This gave a
> > significant performance boost. However, for such an important performance
> > flag, it is very poorly documented
> > ("drill.enable_unsafe_memory_access=true").
> >
> > 2) Materialize values from Validity and Value direct buffers instead of
> > calling getInt() function on the IntVector. This is implemented as a new
> > Unsafe reader type (
> >
> https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/ArrowReaderUnsafe.java#L31
> > )
> >
> > 3) Optimize bitmap operation to check if a bit is set or not (
> >
> https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/ArrowReaderUnsafe.java#L23
> > )
> >
> > A detailed write up of these steps is available here:
> >
> https://github.com/animeshtrivedi/blog/blob/master/post/2018-10-09-arrow-int.md
> >
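
(To make optimization 2 above concrete: a minimal sketch of reading integers straight
from the validity and value buffers, instead of calling getInt() per element, might
look like the following. The buffer accessors are an assumption for the sketch, not
the benchmark's actual code; the flag from optimization 1 is a JVM system property,
typically passed as -Ddrill.enable_unsafe_memory_access=true.)

    // Sketch: sum non-null ints by reading the off-heap validity and value
    // buffers directly. 'vector' is assumed to be a populated IntVector.
    static long sumDirect(IntVector vector) {
        final ArrowBuf validity = vector.getValidityBuffer();
        final ArrowBuf values = vector.getDataBuffer();
        final int count = vector.getValueCount();
        long sum = 0;
        for (int i = 0; i < count; i++) {
            final int mask = 1 << (i & 7);
            if ((validity.getByte(i >> 3) & mask) != 0) {  // bit set => non-null
                sum += values.getInt(i * 4);               // 4 bytes per int
            }
        }
        return sum;
    }
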
> > I have 2 follow-up questions:
> >
> > 1) Regarding the `isSet` function, why does it have to calculate the number
> > of bits set? (
> >
> https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java#L797
> ).
> > Wouldn't just checking if the result of the AND operation is zero or not
> > be sufficient? Like what I did:
> >
> https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/ArrowReaderUnsafe.java#L28
> >
> >
> > 2) What is the reason behind this bitmap generation optimization here
> >
> https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/BitVectorHelper.java#L179
> > ? At this point, when this function is called, the bitmap vector has
> > already been read from storage and contains the right values (either all
> > null, all set, or whatever). Generating this mask here for the special
> > cases when the values are all NULL or all set (this was the case in my
> > benchmark) can be slower than just returning what was read from storage.
> >
> > Collectively, optimizing these two bitmap operations gives more than 1 Gbps
> > of gain in my benchmarking code.
> >
> > Cheers,
> > --
> > Animesh
> >
> >
> > On Thu, Oct 4, 2018 at 12:52 PM Wes McKinney <we...@gmail.com>
> wrote:
> >
> > > See e.g.
> > >
> > >
> > >
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/ipc-read-write-test.cc#L222
> > >
> > >
> > > On Thu, Oct 4, 2018 at 6:48 AM Animesh Trivedi
> > > <an...@gmail.com> wrote:
> > > >
> > > > Primarily write the same microbenchmark as I have in Java in C++ for
> > > table
> > > > reading and value materialization. So just an example of equivalent
> > > > ArrowFileReader example code in C++. Unit tests are a good starting
> > > point,
> > > > thanks for the tip :)
> > > >
> > > > On Thu, Oct 4, 2018 at 12:39 PM Wes McKinney <we...@gmail.com>
> > > wrote:
> > > >
> > > > > > 3. Are there examples of Arrow in C++ read/write code that I can
> > > have a
> > > > > look?
> > > > >
> > > > > What kind of code are you looking for? I would direct you to
> relevant
> > > > > unit tests that exhibit certain functionality, but it depends on
> what
> > > > > you are trying to do
> > > > > On Wed, Oct 3, 2018 at 9:45 AM Animesh Trivedi
> > > > > <an...@gmail.com> wrote:
> > > > > >
> > > > > > Hi all - quick update on the performance investigation:
> > > > > >
> > > > > > - I spent some time looking at performance profile for a binary
> blob
> > > > > column
> > > > > > (1024 bytes of byte[]) and found a few favorable settings for
> > > delivering
> > > > > up
> > > > > > to 168 Gbps from in-memory reading benchmark on 16 cores. These
> > > settings
> > > > > > (NUMA, JVM settings, Arrow holder API, and batch size, etc.) are
> > > > > documented
> > > > > > here:
> > > > > >
> > > > >
> > >
> https://github.com/animeshtrivedi/blog/blob/master/post/2018-10-03-arrow-binary.md
> > > > > > - these setting also help to improved the last number that
> reported
> > > (but
> > > > > > not by much) for the in-memory TPC-DS store_sales table from ~39
> > > Gbps up
> > > > > to
> > > > > > ~45-47 Gbps (note: this number is just in-memory benchmark, i.e.,
> > > w/o any
> > > > > > networking or storage links)
> > > > > >
> > > > > > A few follow up questions that I have:
> > > > > > 1. Arrow reads a batch size worth of data in one go. Are there
> any
> > > > > > recommended batch sizes? In my investigation, small batch size
> help
> > > with
> > > > > a
> > > > > > better cache profile but increase number of instructions required
> > > (more
> > > > > > looping). Larger one do otherwise. Somehow ~10MB/thread seem to
> be
> > > the
> > > > > best
> > > > > > performing configuration, which is also a bit counter intuitive
> as
> > > for 16
> > > > > > threads this will lead to 160 MB of memory footprint. May be
> this is
> > > also
> > > > > > tired to the memory management logic which is my next question.
> > > > > > 2. Arrow use's netty's memory manager. (i) what are decent netty
> > > memory
> > > > > > management settings for "io.netty.allocator.*" parameters? I
> don't
> > > find
> > > > > any
> > > > > > decent write-up on them; (ii) Is there a provision for ArrowBuf
> being
> > > > > > re-used once a batch is consumed? As it looks for now, read read
> > > > > allocates
> > > > > > a new buffer to read the whole batch size.
> > > > > > 3. Are there examples of Arrow in C++ read/write code that I can
> > > have a
> > > > > > look?
> > > > > >
> > > > > > Cheers,
> > > > > > --
> > > > > > Animesh
> > > > > >
> > > > > >
> > > > > > On Wed, Sep 19, 2018 at 8:49 PM Wes McKinney <
> wesmckinn@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > On Wed, Sep 19, 2018 at 2:13 PM Animesh Trivedi
> > > > > > > <an...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > Hi Johan, Wes, and Jacques - many thanks for your comments:
> > > > > > > >
> > > > > > > > @Johan -
> > > > > > > > 1. I also do not suspect that there is any inherent drawback
> in
> > > Java
> > > > > or
> > > > > > > C++
> > > > > > > > due to the Arrow format. I mentioned C++ because Wes pointed
> out
> > > that
> > > > > > > Java
> > > > > > > > routines are not the most optimized ones (yet!). And
> naturally
> > > one
> > > > > would
> > > > > > > > expect better performance in a native language with all
> > > > > > > pointer/memory/SIMD
> > > > > > > > instruction optimizations that you mentioned. As far as I
> know,
> > > the
> > > > > > > > off-heap buffers are managed in ArrowBuf which implements an
> > > abstract
> > > > > > > netty
> > > > > > > > class. But there is nothing unusual, i.e., netty specific,
> about
> > > > > these
> > > > > > > > unsafe routines, they are used by many projects. Though
> there is
> > > cost
> > > > > > > > associated with materializing on-heap Java values from
> off-heap
> > > > > memory
> > > > > > > > regions. I need to benchmark that more carefully.
> > > > > > > >
> > > > > > > > 2. When you say "I've so far always been able to get similar
> > > > > performance
> > > > > > > > numbers" - do you mean the same performance of my case 3
> where 16
> > > > > cores
> > > > > > > > drive close to 40 Gbps, or the same performance between your
> C++
> > > and
> > > > > Java
> > > > > > > > benchmarks. Do you have some write-up? I would be interested
> to
> > > read
> > > > > up
> > > > > > > :)
> > > > > > > >
> > > > > > > > 3. "Can you get to 100 Gbps starting from primitive arrays in
> > > Java"
> > > > > ->
> > > > > > > that
> > > > > > > > is a good idea. Let me try and report back.
> > > > > > > >
> > > > > > > > @Wes -
> > > > > > > >
> > > > > > > > Is there some benchmark template for C++ routines I can have
> a
> > > look?
> > > > > I
> > > > > > > > would be happy to get some input from Java-Arrow experts on
> how
> > > to
> > > > > write
> > > > > > > > these benchmarks more efficiently. I will have a closer look
> at
> > > the
> > > > > JIRA
> > > > > > > > tickets that you mentioned.
> > > > > > > >
> > > > > > > > So, for now I am focusing on the case 3, which is about
> > > establishing
> > > > > > > > performance when reading from a local in-memory I/O stream
> that I
> > > > > > > > implemented (
> > > > > > > >
> > > > > > >
> > > > >
> > >
> https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/MemoryIOChannel.java
> > > > > > > ).
> > > > > > > > In this case I first read data from parquet files, convert
> them
> > > into
> > > > > > > Arrow,
> > > > > > > > and write-out to a MemoryIOChannel, and then read back from
> it.
> > > So,
> > > > > the
> > > > > > > > performance has nothing to do with Crail or HDFS in the case
> 3.
> > > > > Once, I
> > > > > > > > establish the base performance in this setup (which is
> around ~40
> > > > > Gbps
> > > > > > > with
> > > > > > > > 16 cores) I will add Crail to the mix. Perhaps Crail I/O
> streams
> > > can
> > > > > take
> > > > > > > > ArrowBuf as src/dst buffers. That should be doable.
> > > > > > >
> > > > > > > Right, in any case what you are testing is the performance of
> > > using a
> > > > > > > particular Java accessor layer to JVM off-heap Arrow memory to
> sum
> > > the
> > > > > > > non-null values of each column. I'm not sure that a single
> > > bandwidth
> > > > > > > number produced by this benchmark is very informative for
> people
> > > > > > > contemplating what memory format to use in their system due to
> the
> > > > > > > current state of the implementation (Java) and workload
> measured
> > > > > > > (summing the non-null values with a naive algorithm). I would
> guess
> > > > > > > that a C++ version with raw pointers and a loop-unrolled,
> > > branch-free
> > > > > > > vectorized sum is going to be a lot faster.
> > > > > > >
> > > > > > > >
> > > > > > > > @Jacques -
> > > > > > > >
> > > > > > > > That is a good point that "Arrow's implementation is more
> > > focused on
> > > > > > > > interacting with the structure than transporting it".
> However,
> > > in any
> > > > > > > > distributed system one needs to move data/structure around -
> as
> > > far
> > > > > as I
> > > > > > > > understand that is another goal of the project. My
> investigation
> > > > > started
> > > > > > > > within the context of Spark/SQL data processing. Spark
> converts
> > > > > incoming
> > > > > > > > data into its own in-memory UnsafeRow representation for
> > > processing.
> > > > > So
> > > > > > > > naturally the performance of this data ingestion pipeline
> cannot
> > > > > > > outperform
> > > > > > > > the read performance of the used file format. I benchmarked
> > > Parquet,
> > > > > ORC,
> > > > > > > > Avro, JSON (for the specific TPC-DS store_sales table). And
> then
> > > > > > > curiously
> > > > > > > > benchmarked Arrow as well because its design choices are a
> better
> > > > > fit for
> > > > > > > > modern high-performance RDMA/NVMe/100+Gbps devices I am
> > > > > investigating.
> > > > > > > From
> > > > > > > > this point of view, I am trying to find out can Arrow be the
> file
> > > > > format
> > > > > > > > for the next generation of storage/networking devices (see
> Apache
> > > > > Crail
> > > > > > > > project) delivering close to the hardware speed
> reading/writing
> > > > > rates. As
> > > > > > > > Wes pointed out that a C++ library implementation should be
> > > memory-IO
> > > > > > > > bound, so what would it take to deliver the same performance
> in
> > > Java
> > > > > ;)
> > > > > > > > (and then, from across the network).
> > > > > > > >
> > > > > > > > I hope this makes sense.
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > --
> > > > > > > > Animesh
> > > > > > > >
> > > > > > > > On Wed, Sep 19, 2018 at 6:28 PM Jacques Nadeau <
> > > jacques@apache.org>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > > My big question is what is the use case and how/what are
> you
> > > > > trying to
> > > > > > > > > compare? Arrow's implementation is more focused on
> interacting
> > > > > with the
> > > > > > > > > structure than transporting it. Generally speaking, when
> we're
> > > > > working
> > > > > > > with
> > > > > > > > > Arrow data we frequently are just interacting with memory
> > > > > locations and
> > > > > > > > > doing direct operations. If you have a layer that supports
> that
> > > > > type of
> > > > > > > > > semantic, create a movement technique that depends on that.
> > > Arrow
> > > > > > > doesn't
> > > > > > > > > force a particular API since the data itself is defined by
> its
> > > > > > > in-memory
> > > > > > > > > layout so if you have a custom use or pattern, just work
> with
> > > the
> > > > > > > in-memory
> > > > > > > > > structures.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Wed, Sep 19, 2018 at 7:49 AM Wes McKinney <
> > > wesmckinn@gmail.com>
> > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > hi Animesh,
> > > > > > > > > >
> > > > > > > > > > Per Johan's comments, the C++ library is essentially
> going
> > > to be
> > > > > > > > > > IO/memory bandwidth bound since you're interacting with
> raw
> > > > > pointers.
> > > > > > > > > >
> > > > > > > > > > I'm looking at your code
> > > > > > > > > >
> > > > > > > > > > private void consumeFloat4(FieldVector fv) {
> > > > > > > > > >     Float4Vector accessor = (Float4Vector) fv;
> > > > > > > > > >     int valCount = accessor.getValueCount();
> > > > > > > > > >     for(int i = 0; i < valCount; i++){
> > > > > > > > > >         if(!accessor.isNull(i)){
> > > > > > > > > >             float4Count+=1;
> > > > > > > > > >             checksum+=accessor.get(i);
> > > > > > > > > >         }
> > > > > > > > > >     }
> > > > > > > > > > }
> > > > > > > > > >
> > > > > > > > > > You'll want to get a Java-Arrow expert from Dremio to
> advise
> > > you
> > > > > the
> > > > > > > > > > fastest way to iterate over this data -- my
> understanding is
> > > that
> > > > > > > much
> > > > > > > > > > code in Dremio interacts with the wrapped Netty ArrowBuf
> > > objects
> > > > > > > > > > rather than going through the higher level APIs. You're
> also
> > > > > dropping
> > > > > > > > > > performance because memory mapping is not yet
> implemented in
> > > > > Java,
> > > > > > > see
> > > > > > > > > > https://issues.apache.org/jira/browse/ARROW-3191.
> > > > > > > > > >
> > > > > > > > > > Furthermore, the IPC reader class you are using could be
> made
> > > > > more
> > > > > > > > > > efficient. I described the problem in
> > > > > > > > > > https://issues.apache.org/jira/browse/ARROW-3192 -- this
> > > will be
> > > > > > > > > > required as soon as we have the ability to do memory
> mapping
> > > in
> > > > > Java
> > > > > > > > > >
> > > > > > > > > > Could Crail use the Arrow data structures in its runtime
> > > rather
> > > > > than
> > > > > > > > > > copying? If not, how are Crail's runtime data structures
> > > > > different?
> > > > > > > > > >
> > > > > > > > > > - Wes
> > > > > > > > > > On Wed, Sep 19, 2018 at 9:19 AM Johan Peltenburg - EWI
> > > > > > > > > > <J....@tudelft.nl> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Hello Animesh,
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > I browsed a bit in your sources, thanks for sharing. We
> > > have
> > > > > > > performed
> > > > > > > > > > some similar measurements to your third case in the past
> for
> > > > > C/C++ on
> > > > > > > > > > collections of various basic types such as primitives and
> > > > > strings.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > I can say that in terms of consuming data from the
> Arrow
> > > format
> > > > > > > versus
> > > > > > > > > > language native collections in C++, I've so far always
> been
> > > able
> > > > > to
> > > > > > > get
> > > > > > > > > > similar performance numbers (e.g. no drawbacks due to the
> > > Arrow
> > > > > > > format
> > > > > > > > > > itself). Especially when accessing the data through
> Arrow's
> > > raw
> > > > > data
> > > > > > > > > > pointers (and using for example std::string_view-like
> > > > > constructs).
> > > > > > > > > > >
> > > > > > > > > > > In C/C++ the fast data structures are engineered in
> such a
> > > way
> > > > > > > that as
> > > > > > > > > > little pointer traversals are required and they take up
> an as
> > > > > small
> > > > > > > as
> > > > > > > > > > possible memory footprint. Thus each memory access is
> > > relatively
> > > > > > > > > efficient
> > > > > > > > > > (in terms of obtaining the data of interest). The same
> can
> > > > > > > absolutely be
> > > > > > > > > > said for Arrow, if not even more efficient in some cases
> > > where
> > > > > object
> > > > > > > > > > fields are of variable length.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > In the JVM case, the Arrow data is stored off-heap.
> This
> > > > > requires
> > > > > > > the
> > > > > > > > > > JVM to interface to it through some calls to Unsafe
> hidden
> > > under
> > > > > the
> > > > > > > > > Netty
> > > > > > > > > > layer (but please correct me if I'm wrong, I'm not an
> expert
> > > on
> > > > > > > this).
> > > > > > > > > > Those calls are the only reason I can think of that would
> > > > > degrade the
> > > > > > > > > > performance a bit compared to a pure JAva case. I don't
> know
> > > if
> > > > > the
> > > > > > > > > Unsafe
> > > > > > > > > > calls are inlined during JIT compilation. If they aren't
> they
> > > > > will
> > > > > > > > > increase
> > > > > > > > > > access latency to any data a little bit.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > I don't have a similar machine so it's not easy to
> relate
> > > my
> > > > > > > numbers to
> > > > > > > > > > yours, but if you can get that data consumed with 100
> Gbps
> > > in a
> > > > > pure
> > > > > > > Java
> > > > > > > > > > case, I don't see any reason (resulting from Arrow
> format /
> > > > > off-heap
> > > > > > > > > > storage) why you wouldn't be able to get at least really
> > > close.
> > > > > Can
> > > > > > > you
> > > > > > > > > get
> > > > > > > > > > to 100 Gbps starting from primitive arrays in Java with
> your
> > > > > > > consumption
> > > > > > > > > > functions in the first place?
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > I'm interested to see your progress on this.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Kind regards,
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Johan Peltenburg
> > > > > > > > > > >
> > > > > > > > > > > ________________________________
> > > > > > > > > > > From: Animesh Trivedi <an...@gmail.com>
> > > > > > > > > > > Sent: Wednesday, September 19, 2018 2:08:50 PM
> > > > > > > > > > > To: dev@arrow.apache.org; dev@crail.apache.org
> > > > > > > > > > > Subject: [JAVA] Arrow performance measurement
> > > > > > > > > > >
> > > > > > > > > > > Hi all,
> > > > > > > > > > >
> > > > > > > > > > > A week ago, Wes and I had a discussion about the
> > > performance
> > > > > of the
> > > > > > > > > > > Arrow/Java implementation on the Apache Crail
> (Incubating)
> > > > > mailing
> > > > > > > > > list (
> > > > > > > > > > >
> > > > > > >
> > > http://mail-archives.apache.org/mod_mbox/crail-dev/201809.mbox/browser
> > > > > > > > > ).
> > > > > > > > > > In
> > > > > > > > > > > a nutshell: I am investigating the performance of
> various
> > > file
> > > > > > > formats
> > > > > > > > > > > (including Arrow) on high-performance NVMe and
> > > > > RDMA/100Gbps/RoCE
> > > > > > > > > setups.
> > > > > > > > > > I
> > > > > > > > > > > benchmarked how long does it take to materialize values
> > > (ints,
> > > > > > > longs,
> > > > > > > > > > > doubles) of the store_sales table, the largest table
> in the
> > > > > TPC-DS
> > > > > > > > > > dataset
> > > > > > > > > > > stored on different file formats. Here is a write-up on
> > > this -
> > > > > > > > > > >
> > > https://crail.incubator.apache.org/blog/2018/08/sql-p1.html. I
> > > > > > > found
> > > > > > > > > > that
> > > > > > > > > > > between a pair of machine connected over a 100 Gbps
> link,
> > > Arrow
> > > > > > > (using
> > > > > > > > > > as a
> > > > > > > > > > > file format on HDFS) delivered close to ~30 Gbps of
> > > bandwidth
> > > > > with
> > > > > > > all
> > > > > > > > > 16
> > > > > > > > > > > cores engaged. Wes pointed out that (i) Arrow is
> in-memory
> > > IPC
> > > > > > > format,
> > > > > > > > > > and
> > > > > > > > > > > has not been optimized for storage interfaces/APIs like
> > > HDFS;
> > > > > (ii)
> > > > > > > the
> > > > > > > > > > > performance I am measuring is for the java
> implementation.
> > > > > > > > > > >
> > > > > > > > > > > Wes, I hope I summarized our discussion correctly.
> > > > > > > > > > >
> > > > > > > > > > > That brings us to this email where I promised to
> follow up
> > > on
> > > > > the
> > > > > > > Arrow
> > > > > > > > > > > mailing list to understand and optimize the
> performance of
> > > > > > > Arrow/Java
> > > > > > > > > > > implementation on high-performance devices. I wrote a
> small
> > > > > > > stand-alone
> > > > > > > > > > > benchmark (
> > > > > https://github.com/animeshtrivedi/benchmarking-arrow)
> > > > > > > with
> > > > > > > > > > three
> > > > > > > > > > > implementations of WritableByteChannel,
> SeekableByteChannel
> > > > > > > interfaces:
> > > > > > > > > > >
> > > > > > > > > > > 1. Arrow data is stored in HDFS/tmpfs - this gives me
> ~30
> > > Gbps
> > > > > > > > > > performance
> > > > > > > > > > > 2. Arrow data is stored in Crail/DRAM - this gives me
> > > ~35-36
> > > > > Gbps
> > > > > > > > > > > performance
> > > > > > > > > > > 3. Arrow data is stored in on-heap byte[] - this gives
> me
> > > ~39
> > > > > Gbps
> > > > > > > > > > > performance
> > > > > > > > > > >
> > > > > > > > > > > I think the order makes sense. To better understand the
> > > > > > > performance of
> > > > > > > > > > > Arrow/Java we can focus on the option 3.
> > > > > > > > > > >
> > > > > > > > > > > The key question I am trying to answer is "what would
> it
> > > take
> > > > > for
> > > > > > > > > > > Arrow/Java to deliver 100+ Gbps of performance"? Is it
> > > > > possible? If
> > > > > > > > > yes,
> > > > > > > > > > > then what is missing/or mis-interpreted by me? If not,
> then
> > > > > where
> > > > > > > is
> > > > > > > > > the
> > > > > > > > > > > performance lost? Does anyone have any performance
> > > measurements
> > > > > > > for C++
> > > > > > > > > > > implementation? if they have seen/expect better
> numbers.
> > > > > > > > > > >
> > > > > > > > > > > As a next step, I will profile the read path of
> Arrow/Java
> > > for
> > > > > the
> > > > > > > > > option
> > > > > > > > > > > 3. I will report my findings.
> > > > > > > > > > >
> > > > > > > > > > > Any thoughts and feedback on this investigation are
> very
> > > > > welcome.
> > > > > > > > > > >
> > > > > > > > > > > Cheers,
> > > > > > > > > > > --
> > > > > > > > > > > Animesh
> > > > > > > > > > >
> > > > > > > > > > > PS~ Cross-posting on the dev@crail.apache.org list as
> > > well.
> > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > >
> > >
>

> > > > > > > > > > https://issues.apache.org/jira/browse/ARROW-3192 -- this
> > > will be
> > > > > > > > > > required as soon as we have the ability to do memory
> mapping
> > > in
> > > > > Java
> > > > > > > > > >
> > > > > > > > > > Could Crail use the Arrow data structures in its runtime
> > > rather
> > > > > than
> > > > > > > > > > copying? If not, how are Crail's runtime data structures
> > > > > different?
> > > > > > > > > >
> > > > > > > > > > - Wes
> > > > > > > > > > On Wed, Sep 19, 2018 at 9:19 AM Johan Peltenburg - EWI
> > > > > > > > > > <J....@tudelft.nl> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Hello Animesh,
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > I browsed a bit in your sources, thanks for sharing. We
> > > have
> > > > > > > performed
> > > > > > > > > > some similar measurements to your third case in the past
> for
> > > > > C/C++ on
> > > > > > > > > > collections of various basic types such as primitives and
> > > > > strings.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > I can say that in terms of consuming data from the
> Arrow
> > > format
> > > > > > > versus
> > > > > > > > > > language native collections in C++, I've so far always
> been
> > > able
> > > > > to
> > > > > > > get
> > > > > > > > > > similar performance numbers (e.g. no drawbacks due to the
> > > Arrow
> > > > > > > format
> > > > > > > > > > itself). Especially when accessing the data through
> Arrow's
> > > raw
> > > > > data
> > > > > > > > > > pointers (and using for example std::string_view-like
> > > > > constructs).
> > > > > > > > > > >
> > > > > > > > > > > In C/C++ the fast data structures are engineered in
> such a
> > > way
> > > > > > > that as
> > > > > > > > > > little pointer traversals are required and they take up
> an as
> > > > > small
> > > > > > > as
> > > > > > > > > > possible memory footprint. Thus each memory access is
> > > relatively
> > > > > > > > > efficient
> > > > > > > > > > (in terms of obtaining the data of interest). The same
> can
> > > > > > > absolutely be
> > > > > > > > > > said for Arrow, if not even more efficient in some cases
> > > where
> > > > > object
> > > > > > > > > > fields are of variable length.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > In the JVM case, the Arrow data is stored off-heap.
> This
> > > > > requires
> > > > > > > the
> > > > > > > > > > JVM to interface to it through some calls to Unsafe
> hidden
> > > under
> > > > > the
> > > > > > > > > Netty
> > > > > > > > > > layer (but please correct me if I'm wrong, I'm not an
> expert
> > > on
> > > > > > > this).
> > > > > > > > > > Those calls are the only reason I can think of that would
> > > > > degrade the
> > > > > > > > > > performance a bit compared to a pure JAva case. I don't
> know
> > > if
> > > > > the
> > > > > > > > > Unsafe
> > > > > > > > > > calls are inlined during JIT compilation. If they aren't
> they
> > > > > will
> > > > > > > > > increase
> > > > > > > > > > access latency to any data a little bit.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > I don't have a similar machine so it's not easy to
> relate
> > > my
> > > > > > > numbers to
> > > > > > > > > > yours, but if you can get that data consumed with 100
> Gbps
> > > in a
> > > > > pure
> > > > > > > Java
> > > > > > > > > > case, I don't see any reason (resulting from Arrow
> format /
> > > > > off-heap
> > > > > > > > > > storage) why you wouldn't be able to get at least really
> > > close.
> > > > > Can
> > > > > > > you
> > > > > > > > > get
> > > > > > > > > > to 100 Gbps starting from primitive arrays in Java with
> your
> > > > > > > consumption
> > > > > > > > > > functions in the first place?
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > I'm interested to see your progress on this.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Kind regards,
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Johan Peltenburg
> > > > > > > > > > >
> > > > > > > > > > > ________________________________
> > > > > > > > > > > From: Animesh Trivedi <an...@gmail.com>
> > > > > > > > > > > Sent: Wednesday, September 19, 2018 2:08:50 PM
> > > > > > > > > > > To: dev@arrow.apache.org; dev@crail.apache.org
> > > > > > > > > > > Subject: [JAVA] Arrow performance measurement
> > > > > > > > > > >
> > > > > > > > > > > Hi all,
> > > > > > > > > > >
> > > > > > > > > > > A week ago, Wes and I had a discussion about the
> > > performance
> > > > > of the
> > > > > > > > > > > Arrow/Java implementation on the Apache Crail
> (Incubating)
> > > > > mailing
> > > > > > > > > list (
> > > > > > > > > > >
> > > > > > >
> > > http://mail-archives.apache.org/mod_mbox/crail-dev/201809.mbox/browser
> > > > > > > > > ).
> > > > > > > > > > In
> > > > > > > > > > > a nutshell: I am investigating the performance of
> various
> > > file
> > > > > > > formats
> > > > > > > > > > > (including Arrow) on high-performance NVMe and
> > > > > RDMA/100Gbps/RoCE
> > > > > > > > > setups.
> > > > > > > > > > I
> > > > > > > > > > > benchmarked how long does it take to materialize values
> > > (ints,
> > > > > > > longs,
> > > > > > > > > > > doubles) of the store_sales table, the largest table
> in the
> > > > > TPC-DS
> > > > > > > > > > dataset
> > > > > > > > > > > stored on different file formats. Here is a write-up on
> > > this -
> > > > > > > > > > >
> > > https://crail.incubator.apache.org/blog/2018/08/sql-p1.html. I
> > > > > > > found
> > > > > > > > > > that
> > > > > > > > > > > between a pair of machine connected over a 100 Gbps
> link,
> > > Arrow
> > > > > > > (using
> > > > > > > > > > as a
> > > > > > > > > > > file format on HDFS) delivered close to ~30 Gbps of
> > > bandwidth
> > > > > with
> > > > > > > all
> > > > > > > > > 16
> > > > > > > > > > > cores engaged. Wes pointed out that (i) Arrow is
> in-memory
> > > IPC
> > > > > > > format,
> > > > > > > > > > and
> > > > > > > > > > > has not been optimized for storage interfaces/APIs like
> > > HDFS;
> > > > > (ii)
> > > > > > > the
> > > > > > > > > > > performance I am measuring is for the java
> implementation.
> > > > > > > > > > >
> > > > > > > > > > > Wes, I hope I summarized our discussion correctly.
> > > > > > > > > > >
> > > > > > > > > > > That brings us to this email where I promised to
> follow up
> > > on
> > > > > the
> > > > > > > Arrow
> > > > > > > > > > > mailing list to understand and optimize the
> performance of
> > > > > > > Arrow/Java
> > > > > > > > > > > implementation on high-performance devices. I wrote a
> small
> > > > > > > stand-alone
> > > > > > > > > > > benchmark (
> > > > > https://github.com/animeshtrivedi/benchmarking-arrow)
> > > > > > > with
> > > > > > > > > > three
> > > > > > > > > > > implementations of WritableByteChannel,
> SeekableByteChannel
> > > > > > > interfaces:
> > > > > > > > > > >
> > > > > > > > > > > 1. Arrow data is stored in HDFS/tmpfs - this gives me
> ~30
> > > Gbps
> > > > > > > > > > performance
> > > > > > > > > > > 2. Arrow data is stored in Crail/DRAM - this gives me
> > > ~35-36
> > > > > Gbps
> > > > > > > > > > > performance
> > > > > > > > > > > 3. Arrow data is stored in on-heap byte[] - this gives
> me
> > > ~39
> > > > > Gbps
> > > > > > > > > > > performance
> > > > > > > > > > >
> > > > > > > > > > > I think the order makes sense. To better understand the
> > > > > > > performance of
> > > > > > > > > > > Arrow/Java we can focus on the option 3.
> > > > > > > > > > >
> > > > > > > > > > > The key question I am trying to answer is "what would
> it
> > > take
> > > > > for
> > > > > > > > > > > Arrow/Java to deliver 100+ Gbps of performance"? Is it
> > > > > possible? If
> > > > > > > > > yes,
> > > > > > > > > > > then what is missing/or mis-interpreted by me? If not,
> then
> > > > > where
> > > > > > > is
> > > > > > > > > the
> > > > > > > > > > > performance lost? Does anyone have any performance
> > > measurements
> > > > > > > for C++
> > > > > > > > > > > implementation? if they have seen/expect better
> numbers.
> > > > > > > > > > >
> > > > > > > > > > > As a next step, I will profile the read path of
> Arrow/Java
> > > for
> > > > > the
> > > > > > > > > option
> > > > > > > > > > > 3. I will report my findings.
> > > > > > > > > > >
> > > > > > > > > > > Any thoughts and feedback on this investigation are
> very
> > > > > welcome.
> > > > > > > > > > >
> > > > > > > > > > > Cheers,
> > > > > > > > > > > --
> > > > > > > > > > > Animesh
> > > > > > > > > > >
> > > > > > > > > > > PS~ Cross-posting on the dev@crail.apache.org list as
> > > well.
> > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > >
> > >
>

Re: [JAVA] Arrow performance measurement

Posted by Wes McKinney <we...@gmail.com>.
hey Animesh,

Thank you for doing this analysis. If you'd like to share some of the
analysis more broadly e.g. on the Apache Arrow blog or social media,
let us know.

Seems like there might be a few follow-ups here for the Arrow Java community:

* Documentation about achieving better performance
* Writing some microperformance benchmarks
* Making some improvements to the code to facilitate better performance

Feel free to create some JIRA issues. Are any Java developers
interested in digging a little more into these issues?
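
For the microbenchmark item, a small JMH harness could be a starting point.
The sketch below is only a placeholder (the class name, vector size, and the
summing kernel are made up; it is not existing Arrow benchmark code), but it
shows the overall shape such a benchmark could take:

import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.TearDown;

@State(Scope.Thread)
public class IntVectorReadBench {
    private static final int N = 1 << 20; // placeholder: 1M values per vector
    private RootAllocator allocator;
    private IntVector vector;

    @Setup
    public void setup() {
        allocator = new RootAllocator(Long.MAX_VALUE);
        vector = new IntVector("ints", allocator);
        vector.allocateNew(N);
        for (int i = 0; i < N; i++) {
            vector.setSafe(i, i); // fill with dummy data
        }
        vector.setValueCount(N);
    }

    @Benchmark
    public long sumViaAccessor() {
        // baseline: go through the high-level accessor API
        long sum = 0;
        for (int i = 0; i < vector.getValueCount(); i++) {
            if (!vector.isNull(i)) {
                sum += vector.get(i);
            }
        }
        return sum;
    }

    @TearDown
    public void tearDown() {
        vector.close();
        allocator.close();
    }
}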

Thanks,
Wes
On Tue, Oct 9, 2018 at 7:18 AM Animesh Trivedi
<an...@gmail.com> wrote:
>
> Hi Wes and all,
>
> Here is another round of updates:
>
> Quick recap - previously we established that for 1kB binary blobs, Arrow
> can deliver > 160 Gbps performance from in-memory buffers.
>
> In this round I looked at the performance of materializing "integers". In
> my benchmarks, I found that with careful optimizations/code-rewriting we
> can push the performance of integer reading from 5.42 Gbps/core to 13.61
> Gbps/core (~2.5x). The peak performance with 16 cores, scale up to 110+
> Gbps. Key things to do is:
>
> 1) Disable memory access checks in Arrow and Netty buffers. This gave
> significant performance boost. However, for such an important performance
> flag, it is very poorly documented
> ("drill.enable_unsafe_memory_access=true").
>
> 2) Materialize values from Validity and Value direct buffers instead of
> calling getInt() function on the IntVector. This is implemented as a new
> Unsafe reader type (
> https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/ArrowReaderUnsafe.java#L31
> )
>
> 3) Optimize bitmap operation to check if a bit is set or not (
> https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/ArrowReaderUnsafe.java#L23
> )
>
> A detailed write up of these steps is available here:
> https://github.com/animeshtrivedi/blog/blob/master/post/2018-10-09-arrow-int.md
>
> I have 2 follow-up questions:
>
> 1) Regarding the `isSet` function, why does it has to calculate number of
> bits set? (
> https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java#L797).
> Wouldn't just checking if the result of the AND operation is zero or not be
> sufficient? Like what I did :
> https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/ArrowReaderUnsafe.java#L28
>
>
> 2) What is the reason behind this bitmap generation optimization here
> https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/BitVectorHelper.java#L179
> ? At this point when this function is called, the bitmap vector is already
> read from the storage, and contains the right values (either all null, all
> set, or whatever). Generating this mask here for the special cases when the
> values are all NULL or all set (this was the case in my benchmark), can be
> slower than just returning what one has read from the storage.
>
> Collectively optimizing these two bitmap operations give more than 1 Gbps
> gains in my bench-marking code.
>
> Cheers,
> --
> Animesh
>
>
> On Thu, Oct 4, 2018 at 12:52 PM Wes McKinney <we...@gmail.com> wrote:
>
> > See e.g.
> >
> >
> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/ipc-read-write-test.cc#L222
> >
> >
> > On Thu, Oct 4, 2018 at 6:48 AM Animesh Trivedi
> > <an...@gmail.com> wrote:
> > >
> > > Primarily write the same microbenchmark as I have in Java in C++ for
> > table
> > > reading and value materialization. So just an example of equivalent
> > > ArrowFileReader example code in C++. Unit tests are a good starting
> > point,
> > > thanks for the tip :)
> > >
> > > On Thu, Oct 4, 2018 at 12:39 PM Wes McKinney <we...@gmail.com>
> > wrote:
> > >
> > > > > 3. Are there examples of Arrow in C++ read/write code that I can
> > have a
> > > > look?
> > > >
> > > > What kind of code are you looking for? I would direct you to relevant
> > > > unit tests that exhibit certain functionality, but it depends on what
> > > > you are trying to do
> > > > On Wed, Oct 3, 2018 at 9:45 AM Animesh Trivedi
> > > > <an...@gmail.com> wrote:
> > > > >
> > > > > Hi all - quick update on the performance investigation:
> > > > >
> > > > > - I spent some time looking at performance profile for a binary blob
> > > > column
> > > > > (1024 bytes of byte[]) and found a few favorable settings for
> > delivering
> > > > up
> > > > > to 168 Gbps from in-memory reading benchmark on 16 cores. These
> > settings
> > > > > (NUMA, JVM settings, Arrow holder API, and batch size, etc.) are
> > > > documented
> > > > > here:
> > > > >
> > > >
> > https://github.com/animeshtrivedi/blog/blob/master/post/2018-10-03-arrow-binary.md
> > > > > - these setting also help to improved the last number that reported
> > (but
> > > > > not by much) for the in-memory TPC-DS store_sales table from ~39
> > Gbps up
> > > > to
> > > > > ~45-47 Gbps (note: this number is just in-memory benchmark, i.e.,
> > w/o any
> > > > > networking or storage links)
> > > > >
> > > > > A few follow up questions that I have:
> > > > > 1. Arrow reads a batch size worth of data in one go. Are there any
> > > > > recommended batch sizes? In my investigation, small batch size help
> > with
> > > > a
> > > > > better cache profile but increase number of instructions required
> > (more
> > > > > looping). Larger one do otherwise. Somehow ~10MB/thread seem to be
> > the
> > > > best
> > > > > performing configuration, which is also a bit counter intuitive as
> > for 16
> > > > > threads this will lead to 160 MB of memory footprint. May be this is
> > also
> > > > > tired to the memory management logic which is my next question.
> > > > > 2. Arrow use's netty's memory manager. (i) what are decent netty
> > memory
> > > > > management settings for "io.netty.allocator.*" parameters? I don't
> > find
> > > > any
> > > > > decent write-up on them; (ii) Is there a provision for ArrowBuf being
> > > > > re-used once a batch is consumed? As it looks for now, read read
> > > > allocates
> > > > > a new buffer to read the whole batch size.
> > > > > 3. Are there examples of Arrow in C++ read/write code that I can
> > have a
> > > > > look?
> > > > >
> > > > > Cheers,
> > > > > --
> > > > > Animesh
> > > > >
> > > > >
> > > > > On Wed, Sep 19, 2018 at 8:49 PM Wes McKinney <we...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > On Wed, Sep 19, 2018 at 2:13 PM Animesh Trivedi
> > > > > > <an...@gmail.com> wrote:
> > > > > > >
> > > > > > > Hi Johan, Wes, and Jacques - many thanks for your comments:
> > > > > > >
> > > > > > > @Johan -
> > > > > > > 1. I also do not suspect that there is any inherent drawback in
> > Java
> > > > or
> > > > > > C++
> > > > > > > due to the Arrow format. I mentioned C++ because Wes pointed out
> > that
> > > > > > Java
> > > > > > > routines are not the most optimized ones (yet!). And naturally
> > one
> > > > would
> > > > > > > expect better performance in a native language with all
> > > > > > pointer/memory/SIMD
> > > > > > > instruction optimizations that you mentioned. As far as I know,
> > the
> > > > > > > off-heap buffers are managed in ArrowBuf which implements an
> > abstract
> > > > > > netty
> > > > > > > class. But there is nothing unusual, i.e., netty specific, about
> > > > these
> > > > > > > unsafe routines, they are used by many projects. Though there is
> > cost
> > > > > > > associated with materializing on-heap Java values from off-heap
> > > > memory
> > > > > > > regions. I need to benchmark that more carefully.
> > > > > > >
> > > > > > > 2. When you say "I've so far always been able to get similar
> > > > performance
> > > > > > > numbers" - do you mean the same performance of my case 3 where 16
> > > > cores
> > > > > > > drive close to 40 Gbps, or the same performance between your C++
> > and
> > > > Java
> > > > > > > benchmarks. Do you have some write-up? I would be interested to
> > read
> > > > up
> > > > > > :)
> > > > > > >
> > > > > > > 3. "Can you get to 100 Gbps starting from primitive arrays in
> > Java"
> > > > ->
> > > > > > that
> > > > > > > is a good idea. Let me try and report back.
> > > > > > >
> > > > > > > @Wes -
> > > > > > >
> > > > > > > Is there some benchmark template for C++ routines I can have a
> > look?
> > > > I
> > > > > > > would be happy to get some input from Java-Arrow experts on how
> > to
> > > > write
> > > > > > > these benchmarks more efficiently. I will have a closer look at
> > the
> > > > JIRA
> > > > > > > tickets that you mentioned.
> > > > > > >
> > > > > > > So, for now I am focusing on the case 3, which is about
> > establishing
> > > > > > > performance when reading from a local in-memory I/O stream that I
> > > > > > > implemented (
> > > > > > >
> > > > > >
> > > >
> > https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/MemoryIOChannel.java
> > > > > > ).
> > > > > > > In this case I first read data from parquet files, convert them
> > into
> > > > > > Arrow,
> > > > > > > and write-out to a MemoryIOChannel, and then read back from it.
> > So,
> > > > the
> > > > > > > performance has nothing to do with Crail or HDFS in the case 3.
> > > > Once, I
> > > > > > > establish the base performance in this setup (which is around ~40
> > > > Gbps
> > > > > > with
> > > > > > > 16 cores) I will add Crail to the mix. Perhaps Crail I/O streams
> > can
> > > > take
> > > > > > > ArrowBuf as src/dst buffers. That should be doable.
> > > > > >
> > > > > > Right, in any case what you are testing is the performance of
> > using a
> > > > > > particular Java accessor layer to JVM off-heap Arrow memory to sum
> > the
> > > > > > non-null values of each column. I'm not sure that a single
> > bandwidth
> > > > > > number produced by this benchmark is very informative for people
> > > > > > contemplating what memory format to use in their system due to the
> > > > > > current state of the implementation (Java) and workload measured
> > > > > > (summing the non-null values with a naive algorithm). I would guess
> > > > > > that a C++ version with raw pointers and a loop-unrolled,
> > branch-free
> > > > > > vectorized sum is going to be a lot faster.
> > > > > >
> > > > > > >
> > > > > > > @Jacques -
> > > > > > >
> > > > > > > That is a good point that "Arrow's implementation is more
> > focused on
> > > > > > > interacting with the structure than transporting it". However,
> > in any
> > > > > > > distributed system one needs to move data/structure around - as
> > far
> > > > as I
> > > > > > > understand that is another goal of the project. My investigation
> > > > started
> > > > > > > within the context of Spark/SQL data processing. Spark converts
> > > > incoming
> > > > > > > data into its own in-memory UnsafeRow representation for
> > processing.
> > > > So
> > > > > > > naturally the performance of this data ingestion pipeline cannot
> > > > > > outperform
> > > > > > > the read performance of the used file format. I benchmarked
> > Parquet,
> > > > ORC,
> > > > > > > Avro, JSON (for the specific TPC-DS store_sales table). And then
> > > > > > curiously
> > > > > > > benchmarked Arrow as well because its design choices are a better
> > > > fit for
> > > > > > > modern high-performance RDMA/NVMe/100+Gbps devices I am
> > > > investigating.
> > > > > > From
> > > > > > > this point of view, I am trying to find out can Arrow be the file
> > > > format
> > > > > > > for the next generation of storage/networking devices (see Apache
> > > > Crail
> > > > > > > project) delivering close to the hardware speed reading/writing
> > > > rates. As
> > > > > > > Wes pointed out that a C++ library implementation should be
> > memory-IO
> > > > > > > bound, so what would it take to deliver the same performance in
> > Java
> > > > ;)
> > > > > > > (and then, from across the network).
> > > > > > >
> > > > > > > I hope this makes sense.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > --
> > > > > > > Animesh
> > > > > > >
> > > > > > > On Wed, Sep 19, 2018 at 6:28 PM Jacques Nadeau <
> > jacques@apache.org>
> > > > > > wrote:
> > > > > > >
> > > > > > > > My big question is what is the use case and how/what are you
> > > > trying to
> > > > > > > > compare? Arrow's implementation is more focused on interacting
> > > > with the
> > > > > > > > structure than transporting it. Generally speaking, when we're
> > > > working
> > > > > > with
> > > > > > > > Arrow data we frequently are just interacting with memory
> > > > locations and
> > > > > > > > doing direct operations. If you have a layer that supports that
> > > > type of
> > > > > > > > semantic, create a movement technique that depends on that.
> > Arrow
> > > > > > doesn't
> > > > > > > > force a particular API since the data itself is defined by its
> > > > > > in-memory
> > > > > > > > layout so if you have a custom use or pattern, just work with
> > the
> > > > > > in-memory
> > > > > > > > structures.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Wed, Sep 19, 2018 at 7:49 AM Wes McKinney <
> > wesmckinn@gmail.com>
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > hi Animesh,
> > > > > > > > >
> > > > > > > > > Per Johan's comments, the C++ library is essentially going
> > to be
> > > > > > > > > IO/memory bandwidth bound since you're interacting with raw
> > > > pointers.
> > > > > > > > >
> > > > > > > > > I'm looking at your code
> > > > > > > > >
> > > > > > > > > private void consumeFloat4(FieldVector fv) {
> > > > > > > > >     Float4Vector accessor = (Float4Vector) fv;
> > > > > > > > >     int valCount = accessor.getValueCount();
> > > > > > > > >     for(int i = 0; i < valCount; i++){
> > > > > > > > >         if(!accessor.isNull(i)){
> > > > > > > > >             float4Count+=1;
> > > > > > > > >             checksum+=accessor.get(i);
> > > > > > > > >         }
> > > > > > > > >     }
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > You'll want to get a Java-Arrow expert from Dremio to advise
> > you
> > > > the
> > > > > > > > > fastest way to iterate over this data -- my understanding is
> > that
> > > > > > much
> > > > > > > > > code in Dremio interacts with the wrapped Netty ArrowBuf
> > objects
> > > > > > > > > rather than going through the higher level APIs. You're also
> > > > dropping
> > > > > > > > > performance because memory mapping is not yet implemented in
> > > > Java,
> > > > > > see
> > > > > > > > > https://issues.apache.org/jira/browse/ARROW-3191.
> > > > > > > > >
> > > > > > > > > Furthermore, the IPC reader class you are using could be made
> > > > more
> > > > > > > > > efficient. I described the problem in
> > > > > > > > > https://issues.apache.org/jira/browse/ARROW-3192 -- this
> > will be
> > > > > > > > > required as soon as we have the ability to do memory mapping
> > in
> > > > Java
> > > > > > > > >
> > > > > > > > > Could Crail use the Arrow data structures in its runtime
> > rather
> > > > than
> > > > > > > > > copying? If not, how are Crail's runtime data structures
> > > > different?
> > > > > > > > >
> > > > > > > > > - Wes
> > > > > > > > > On Wed, Sep 19, 2018 at 9:19 AM Johan Peltenburg - EWI
> > > > > > > > > <J....@tudelft.nl> wrote:
> > > > > > > > > >
> > > > > > > > > > Hello Animesh,
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > I browsed a bit in your sources, thanks for sharing. We
> > have
> > > > > > performed
> > > > > > > > > some similar measurements to your third case in the past for
> > > > C/C++ on
> > > > > > > > > collections of various basic types such as primitives and
> > > > strings.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > I can say that in terms of consuming data from the Arrow
> > format
> > > > > > versus
> > > > > > > > > language native collections in C++, I've so far always been
> > able
> > > > to
> > > > > > get
> > > > > > > > > similar performance numbers (e.g. no drawbacks due to the
> > Arrow
> > > > > > format
> > > > > > > > > itself). Especially when accessing the data through Arrow's
> > raw
> > > > data
> > > > > > > > > pointers (and using for example std::string_view-like
> > > > constructs).
> > > > > > > > > >
> > > > > > > > > > In C/C++ the fast data structures are engineered in such a
> > way
> > > > > > that as
> > > > > > > > > little pointer traversals are required and they take up an as
> > > > small
> > > > > > as
> > > > > > > > > possible memory footprint. Thus each memory access is
> > relatively
> > > > > > > > efficient
> > > > > > > > > (in terms of obtaining the data of interest). The same can
> > > > > > absolutely be
> > > > > > > > > said for Arrow, if not even more efficient in some cases
> > where
> > > > object
> > > > > > > > > fields are of variable length.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > In the JVM case, the Arrow data is stored off-heap. This
> > > > requires
> > > > > > the
> > > > > > > > > JVM to interface to it through some calls to Unsafe hidden
> > under
> > > > the
> > > > > > > > Netty
> > > > > > > > > layer (but please correct me if I'm wrong, I'm not an expert
> > on
> > > > > > this).
> > > > > > > > > Those calls are the only reason I can think of that would
> > > > degrade the
> > > > > > > > > performance a bit compared to a pure JAva case. I don't know
> > if
> > > > the
> > > > > > > > Unsafe
> > > > > > > > > calls are inlined during JIT compilation. If they aren't they
> > > > will
> > > > > > > > increase
> > > > > > > > > access latency to any data a little bit.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > I don't have a similar machine so it's not easy to relate
> > my
> > > > > > numbers to
> > > > > > > > > yours, but if you can get that data consumed with 100 Gbps
> > in a
> > > > pure
> > > > > > Java
> > > > > > > > > case, I don't see any reason (resulting from Arrow format /
> > > > off-heap
> > > > > > > > > storage) why you wouldn't be able to get at least really
> > close.
> > > > Can
> > > > > > you
> > > > > > > > get
> > > > > > > > > to 100 Gbps starting from primitive arrays in Java with your
> > > > > > consumption
> > > > > > > > > functions in the first place?
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > I'm interested to see your progress on this.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Kind regards,
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Johan Peltenburg
> > > > > > > > > >
> > > > > > > > > > ________________________________
> > > > > > > > > > From: Animesh Trivedi <an...@gmail.com>
> > > > > > > > > > Sent: Wednesday, September 19, 2018 2:08:50 PM
> > > > > > > > > > To: dev@arrow.apache.org; dev@crail.apache.org
> > > > > > > > > > Subject: [JAVA] Arrow performance measurement
> > > > > > > > > >
> > > > > > > > > > Hi all,
> > > > > > > > > >
> > > > > > > > > > A week ago, Wes and I had a discussion about the
> > performance
> > > > of the
> > > > > > > > > > Arrow/Java implementation on the Apache Crail (Incubating)
> > > > mailing
> > > > > > > > list (
> > > > > > > > > >
> > > > > >
> > http://mail-archives.apache.org/mod_mbox/crail-dev/201809.mbox/browser
> > > > > > > > ).
> > > > > > > > > In
> > > > > > > > > > a nutshell: I am investigating the performance of various
> > file
> > > > > > formats
> > > > > > > > > > (including Arrow) on high-performance NVMe and
> > > > RDMA/100Gbps/RoCE
> > > > > > > > setups.
> > > > > > > > > I
> > > > > > > > > > benchmarked how long does it take to materialize values
> > (ints,
> > > > > > longs,
> > > > > > > > > > doubles) of the store_sales table, the largest table in the
> > > > TPC-DS
> > > > > > > > > dataset
> > > > > > > > > > stored on different file formats. Here is a write-up on
> > this -
> > > > > > > > > >
> > https://crail.incubator.apache.org/blog/2018/08/sql-p1.html. I
> > > > > > found
> > > > > > > > > that
> > > > > > > > > > between a pair of machine connected over a 100 Gbps link,
> > Arrow
> > > > > > (using
> > > > > > > > > as a
> > > > > > > > > > file format on HDFS) delivered close to ~30 Gbps of
> > bandwidth
> > > > with
> > > > > > all
> > > > > > > > 16
> > > > > > > > > > cores engaged. Wes pointed out that (i) Arrow is in-memory
> > IPC
> > > > > > format,
> > > > > > > > > and
> > > > > > > > > > has not been optimized for storage interfaces/APIs like
> > HDFS;
> > > > (ii)
> > > > > > the
> > > > > > > > > > performance I am measuring is for the java implementation.
> > > > > > > > > >
> > > > > > > > > > Wes, I hope I summarized our discussion correctly.
> > > > > > > > > >
> > > > > > > > > > That brings us to this email where I promised to follow up
> > on
> > > > the
> > > > > > Arrow
> > > > > > > > > > mailing list to understand and optimize the performance of
> > > > > > Arrow/Java
> > > > > > > > > > implementation on high-performance devices. I wrote a small
> > > > > > stand-alone
> > > > > > > > > > benchmark (
> > > > https://github.com/animeshtrivedi/benchmarking-arrow)
> > > > > > with
> > > > > > > > > three
> > > > > > > > > > implementations of WritableByteChannel, SeekableByteChannel
> > > > > > interfaces:
> > > > > > > > > >
> > > > > > > > > > 1. Arrow data is stored in HDFS/tmpfs - this gives me ~30
> > Gbps
> > > > > > > > > performance
> > > > > > > > > > 2. Arrow data is stored in Crail/DRAM - this gives me
> > ~35-36
> > > > Gbps
> > > > > > > > > > performance
> > > > > > > > > > 3. Arrow data is stored in on-heap byte[] - this gives me
> > ~39
> > > > Gbps
> > > > > > > > > > performance
> > > > > > > > > >
> > > > > > > > > > I think the order makes sense. To better understand the
> > > > > > performance of
> > > > > > > > > > Arrow/Java we can focus on the option 3.
> > > > > > > > > >
> > > > > > > > > > The key question I am trying to answer is "what would it
> > take
> > > > for
> > > > > > > > > > Arrow/Java to deliver 100+ Gbps of performance"? Is it
> > > > possible? If
> > > > > > > > yes,
> > > > > > > > > > then what is missing/or mis-interpreted by me? If not, then
> > > > where
> > > > > > is
> > > > > > > > the
> > > > > > > > > > performance lost? Does anyone have any performance
> > measurements
> > > > > > for C++
> > > > > > > > > > implementation? if they have seen/expect better numbers.
> > > > > > > > > >
> > > > > > > > > > As a next step, I will profile the read path of Arrow/Java
> > for
> > > > the
> > > > > > > > option
> > > > > > > > > > 3. I will report my findings.
> > > > > > > > > >
> > > > > > > > > > Any thoughts and feedback on this investigation are very
> > > > welcome.
> > > > > > > > > >
> > > > > > > > > > Cheers,
> > > > > > > > > > --
> > > > > > > > > > Animesh
> > > > > > > > > >
> > > > > > > > > > PS~ Cross-posting on the dev@crail.apache.org list as
> > well.
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> >

Re: [JAVA] Arrow performance measurement

Posted by Animesh Trivedi <an...@gmail.com>.
Hi Wes and all,

Here is another round of updates:

Quick recap - previously we established that for 1kB binary blobs, Arrow
can deliver > 160 Gbps performance from in-memory buffers.

In this round I looked at the performance of materializing "integers". In
my benchmarks, I found that with careful optimization and code rewriting we
can push the performance of integer reading from 5.42 Gbps/core to 13.61
Gbps/core (~2.5x). The peak performance with 16 cores scales to 110+ Gbps.
The key things to do are:

1) Disable memory access checks in Arrow and Netty buffers
("drill.enable_unsafe_memory_access=true"). This gave a significant
performance boost. However, for such an important performance flag, it is
very poorly documented. A combined sketch of steps 1-3 follows this list.

2) Materialize values from the validity and value direct buffers instead of
calling the getInt() function on the IntVector. This is implemented as a new
Unsafe reader type (
https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/ArrowReaderUnsafe.java#L31
)

3) Optimize the bitmap operation that checks whether a bit is set or not (
https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/ArrowReaderUnsafe.java#L23
)
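
For concreteness, here is a minimal sketch of what steps 1-3 look like
together. This is not the exact code from my benchmarking repo: the class
and method names are mine, the ArrowBuf import path assumes the Arrow/Netty
packaging I tested against, and the flag from step 1 appears only as a
comment.

import io.netty.buffer.ArrowBuf;  // may live in a different package in newer Arrow versions
import org.apache.arrow.vector.IntVector;

public final class UnsafeIntSum {
    // Step 1 (invocation assumption): start the JVM with
    //   -Ddrill.enable_unsafe_memory_access=true
    // before any Arrow/Netty class is loaded, so bounds checks are disabled.

    // Steps 2 and 3: walk the validity bitmap and the value buffer directly,
    // instead of calling isNull(i)/get(i) on the IntVector for every element.
    public static long sumNonNull(IntVector vector) {
        final ArrowBuf validity = vector.getValidityBuffer();
        final ArrowBuf data = vector.getDataBuffer();
        final int valueCount = vector.getValueCount();
        long sum = 0;
        for (int i = 0; i < valueCount; i++) {
            // one validity bit per value: byte index = i / 8, bit index = i % 8
            final byte bits = validity.getByte(i >> 3);
            // step 3: any non-zero AND result means the bit is set, no popcount
            if ((bits & (1 << (i & 7))) != 0) {
                sum += data.getInt(i << 2);  // 4-byte ints in the value buffer
            }
        }
        return sum;
    }
}

The real implementation these numbers were measured with is the
ArrowReaderUnsafe class linked in steps 2 and 3.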

A detailed write up of these steps is available here:
https://github.com/animeshtrivedi/blog/blob/master/post/2018-10-09-arrow-int.md

I have 2 follow-up questions:

1) Regarding the `isSet` function, why does it have to calculate the number
of bits that are set? (
https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java#L797).
Wouldn't just checking whether the result of the AND operation is non-zero
be sufficient? Like what I did:
https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/ArrowReaderUnsafe.java#L28
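
To make this concrete, here is a sketch of the simpler check I have in mind
(the method name and buffer handling are mine, not Arrow's API); it could
sit next to the loop in the sketch above:

static int isSetSimple(ArrowBuf validityBuffer, int index) {
    final byte b = validityBuffer.getByte(index >> 3);
    // no bit counting: a non-zero mask result already means "set"
    return (b & (1 << (index & 7))) != 0 ? 1 : 0;
}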


2) What is the reason behind the bitmap generation optimization here:
https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/BitVectorHelper.java#L179
? By the time this function is called, the bitmap vector has already been
read from storage and contains the right values (all null, all set, or a
mix). Generating the mask here for the special cases where the values are
all NULL or all set (which was the case in my benchmark) can be slower than
just returning what was read from storage.

Collectively, optimizing these two bitmap operations gives more than 1 Gbps
of gain in my benchmarking code.

Cheers,
--
Animesh


On Thu, Oct 4, 2018 at 12:52 PM Wes McKinney <we...@gmail.com> wrote:

> See e.g.
>
>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/ipc-read-write-test.cc#L222
>
>
> On Thu, Oct 4, 2018 at 6:48 AM Animesh Trivedi
> <an...@gmail.com> wrote:
> >
> > Primarily write the same microbenchmark as I have in Java in C++ for
> table
> > reading and value materialization. So just an example of equivalent
> > ArrowFileReader example code in C++. Unit tests are a good starting
> point,
> > thanks for the tip :)
> >
> > On Thu, Oct 4, 2018 at 12:39 PM Wes McKinney <we...@gmail.com>
> wrote:
> >
> > > > 3. Are there examples of Arrow in C++ read/write code that I can
> have a
> > > look?
> > >
> > > What kind of code are you looking for? I would direct you to relevant
> > > unit tests that exhibit certain functionality, but it depends on what
> > > you are trying to do
> > > On Wed, Oct 3, 2018 at 9:45 AM Animesh Trivedi
> > > <an...@gmail.com> wrote:
> > > >
> > > > Hi all - quick update on the performance investigation:
> > > >
> > > > - I spent some time looking at performance profile for a binary blob
> > > column
> > > > (1024 bytes of byte[]) and found a few favorable settings for
> delivering
> > > up
> > > > to 168 Gbps from in-memory reading benchmark on 16 cores. These
> settings
> > > > (NUMA, JVM settings, Arrow holder API, and batch size, etc.) are
> > > documented
> > > > here:
> > > >
> > >
> https://github.com/animeshtrivedi/blog/blob/master/post/2018-10-03-arrow-binary.md
> > > > - these setting also help to improved the last number that reported
> (but
> > > > not by much) for the in-memory TPC-DS store_sales table from ~39
> Gbps up
> > > to
> > > > ~45-47 Gbps (note: this number is just in-memory benchmark, i.e.,
> w/o any
> > > > networking or storage links)
> > > >
> > > > A few follow up questions that I have:
> > > > 1. Arrow reads a batch size worth of data in one go. Are there any
> > > > recommended batch sizes? In my investigation, small batch size help
> with
> > > a
> > > > better cache profile but increase number of instructions required
> (more
> > > > looping). Larger one do otherwise. Somehow ~10MB/thread seem to be
> the
> > > best
> > > > performing configuration, which is also a bit counter intuitive as
> for 16
> > > > threads this will lead to 160 MB of memory footprint. May be this is
> also
> > > > tired to the memory management logic which is my next question.
> > > > 2. Arrow use's netty's memory manager. (i) what are decent netty
> memory
> > > > management settings for "io.netty.allocator.*" parameters? I don't
> find
> > > any
> > > > decent write-up on them; (ii) Is there a provision for ArrowBuf being
> > > > re-used once a batch is consumed? As it looks for now, read read
> > > allocates
> > > > a new buffer to read the whole batch size.
> > > > 3. Are there examples of Arrow in C++ read/write code that I can
> have a
> > > > look?
> > > >
> > > > Cheers,
> > > > --
> > > > Animesh
> > > >
> > > >
> > > > On Wed, Sep 19, 2018 at 8:49 PM Wes McKinney <we...@gmail.com>
> > > wrote:
> > > >
> > > > > On Wed, Sep 19, 2018 at 2:13 PM Animesh Trivedi
> > > > > <an...@gmail.com> wrote:
> > > > > >
> > > > > > Hi Johan, Wes, and Jacques - many thanks for your comments:
> > > > > >
> > > > > > @Johan -
> > > > > > 1. I also do not suspect that there is any inherent drawback in
> Java
> > > or
> > > > > C++
> > > > > > due to the Arrow format. I mentioned C++ because Wes pointed out
> that
> > > > > Java
> > > > > > routines are not the most optimized ones (yet!). And naturally
> one
> > > would
> > > > > > expect better performance in a native language with all
> > > > > pointer/memory/SIMD
> > > > > > instruction optimizations that you mentioned. As far as I know,
> the
> > > > > > off-heap buffers are managed in ArrowBuf which implements an
> abstract
> > > > > netty
> > > > > > class. But there is nothing unusual, i.e., netty specific, about
> > > these
> > > > > > unsafe routines, they are used by many projects. Though there is
> cost
> > > > > > associated with materializing on-heap Java values from off-heap
> > > memory
> > > > > > regions. I need to benchmark that more carefully.
> > > > > >
> > > > > > 2. When you say "I've so far always been able to get similar
> > > performance
> > > > > > numbers" - do you mean the same performance of my case 3 where 16
> > > cores
> > > > > > drive close to 40 Gbps, or the same performance between your C++
> and
> > > Java
> > > > > > benchmarks. Do you have some write-up? I would be interested to
> read
> > > up
> > > > > :)
> > > > > >
> > > > > > 3. "Can you get to 100 Gbps starting from primitive arrays in
> Java"
> > > ->
> > > > > that
> > > > > > is a good idea. Let me try and report back.
> > > > > >
> > > > > > @Wes -
> > > > > >
> > > > > > Is there some benchmark template for C++ routines I can have a
> look?
> > > I
> > > > > > would be happy to get some input from Java-Arrow experts on how
> to
> > > write
> > > > > > these benchmarks more efficiently. I will have a closer look at
> the
> > > JIRA
> > > > > > tickets that you mentioned.
> > > > > >
> > > > > > So, for now I am focusing on the case 3, which is about
> establishing
> > > > > > performance when reading from a local in-memory I/O stream that I
> > > > > > implemented (
> > > > > >
> > > > >
> > >
> https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/MemoryIOChannel.java
> > > > > ).
> > > > > > In this case I first read data from parquet files, convert them
> into
> > > > > Arrow,
> > > > > > and write-out to a MemoryIOChannel, and then read back from it.
> So,
> > > the
> > > > > > performance has nothing to do with Crail or HDFS in the case 3.
> > > Once, I
> > > > > > establish the base performance in this setup (which is around ~40
> > > Gbps
> > > > > with
> > > > > > 16 cores) I will add Crail to the mix. Perhaps Crail I/O streams
> can
> > > take
> > > > > > ArrowBuf as src/dst buffers. That should be doable.
> > > > >
> > > > > Right, in any case what you are testing is the performance of
> using a
> > > > > particular Java accessor layer to JVM off-heap Arrow memory to sum
> the
> > > > > non-null values of each column. I'm not sure that a single
> bandwidth
> > > > > number produced by this benchmark is very informative for people
> > > > > contemplating what memory format to use in their system due to the
> > > > > current state of the implementation (Java) and workload measured
> > > > > (summing the non-null values with a naive algorithm). I would guess
> > > > > that a C++ version with raw pointers and a loop-unrolled,
> branch-free
> > > > > vectorized sum is going to be a lot faster.
> > > > >
> > > > > >
> > > > > > @Jacques -
> > > > > >
> > > > > > That is a good point that "Arrow's implementation is more
> focused on
> > > > > > interacting with the structure than transporting it". However,
> in any
> > > > > > distributed system one needs to move data/structure around - as
> far
> > > as I
> > > > > > understand that is another goal of the project. My investigation
> > > started
> > > > > > within the context of Spark/SQL data processing. Spark converts
> > > incoming
> > > > > > data into its own in-memory UnsafeRow representation for
> processing.
> > > So
> > > > > > naturally the performance of this data ingestion pipeline cannot
> > > > > outperform
> > > > > > the read performance of the used file format. I benchmarked
> Parquet,
> > > ORC,
> > > > > > Avro, JSON (for the specific TPC-DS store_sales table). And then
> > > > > curiously
> > > > > > benchmarked Arrow as well because its design choices are a better
> > > fit for
> > > > > > modern high-performance RDMA/NVMe/100+Gbps devices I am
> > > investigating.
> > > > > From
> > > > > > this point of view, I am trying to find out can Arrow be the file
> > > format
> > > > > > for the next generation of storage/networking devices (see Apache
> > > Crail
> > > > > > project) delivering close to the hardware speed reading/writing
> > > rates. As
> > > > > > Wes pointed out that a C++ library implementation should be
> memory-IO
> > > > > > bound, so what would it take to deliver the same performance in
> Java
> > > ;)
> > > > > > (and then, from across the network).
> > > > > >
> > > > > > I hope this makes sense.
> > > > > >
> > > > > > Cheers,
> > > > > > --
> > > > > > Animesh
> > > > > >
> > > > > > On Wed, Sep 19, 2018 at 6:28 PM Jacques Nadeau <
> jacques@apache.org>
> > > > > wrote:
> > > > > >
> > > > > > > My big question is what is the use case and how/what are you
> > > trying to
> > > > > > > compare? Arrow's implementation is more focused on interacting
> > > with the
> > > > > > > structure than transporting it. Generally speaking, when we're
> > > working
> > > > > with
> > > > > > > Arrow data we frequently are just interacting with memory
> > > locations and
> > > > > > > doing direct operations. If you have a layer that supports that
> > > type of
> > > > > > > semantic, create a movement technique that depends on that.
> Arrow
> > > > > doesn't
> > > > > > > force a particular API since the data itself is defined by its
> > > > > in-memory
> > > > > > > layout so if you have a custom use or pattern, just work with
> the
> > > > > in-memory
> > > > > > > structures.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Sep 19, 2018 at 7:49 AM Wes McKinney <
> wesmckinn@gmail.com>
> > > > > wrote:
> > > > > > >
> > > > > > > > hi Animesh,
> > > > > > > >
> > > > > > > > Per Johan's comments, the C++ library is essentially going
> to be
> > > > > > > > IO/memory bandwidth bound since you're interacting with raw
> > > pointers.
> > > > > > > >
> > > > > > > > I'm looking at your code
> > > > > > > >
> > > > > > > > private void consumeFloat4(FieldVector fv) {
> > > > > > > >     Float4Vector accessor = (Float4Vector) fv;
> > > > > > > >     int valCount = accessor.getValueCount();
> > > > > > > >     for(int i = 0; i < valCount; i++){
> > > > > > > >         if(!accessor.isNull(i)){
> > > > > > > >             float4Count+=1;
> > > > > > > >             checksum+=accessor.get(i);
> > > > > > > >         }
> > > > > > > >     }
> > > > > > > > }
> > > > > > > >
> > > > > > > > You'll want to get a Java-Arrow expert from Dremio to advise
> you
> > > the
> > > > > > > > fastest way to iterate over this data -- my understanding is
> that
> > > > > much
> > > > > > > > code in Dremio interacts with the wrapped Netty ArrowBuf
> objects
> > > > > > > > rather than going through the higher level APIs. You're also
> > > dropping
> > > > > > > > performance because memory mapping is not yet implemented in
> > > Java,
> > > > > see
> > > > > > > > https://issues.apache.org/jira/browse/ARROW-3191.
> > > > > > > >
> > > > > > > > Furthermore, the IPC reader class you are using could be made
> > > more
> > > > > > > > efficient. I described the problem in
> > > > > > > > https://issues.apache.org/jira/browse/ARROW-3192 -- this
> will be
> > > > > > > > required as soon as we have the ability to do memory mapping
> in
> > > Java
> > > > > > > >
> > > > > > > > Could Crail use the Arrow data structures in its runtime
> rather
> > > than
> > > > > > > > copying? If not, how are Crail's runtime data structures
> > > different?
> > > > > > > >
> > > > > > > > - Wes
> > > > > > > > On Wed, Sep 19, 2018 at 9:19 AM Johan Peltenburg - EWI
> > > > > > > > <J....@tudelft.nl> wrote:
> > > > > > > > >
> > > > > > > > > Hello Animesh,
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > I browsed a bit in your sources, thanks for sharing. We
> have
> > > > > performed
> > > > > > > > some similar measurements to your third case in the past for
> > > C/C++ on
> > > > > > > > collections of various basic types such as primitives and
> > > strings.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > I can say that in terms of consuming data from the Arrow
> format
> > > > > versus
> > > > > > > > language native collections in C++, I've so far always been
> able
> > > to
> > > > > get
> > > > > > > > similar performance numbers (e.g. no drawbacks due to the
> Arrow
> > > > > format
> > > > > > > > itself). Especially when accessing the data through Arrow's
> raw
> > > data
> > > > > > > > pointers (and using for example std::string_view-like
> > > constructs).
> > > > > > > > >
> > > > > > > > > In C/C++ the fast data structures are engineered in such a
> way
> > > > > that as
> > > > > > > > little pointer traversals are required and they take up an as
> > > small
> > > > > as
> > > > > > > > possible memory footprint. Thus each memory access is
> relatively
> > > > > > > efficient
> > > > > > > > (in terms of obtaining the data of interest). The same can
> > > > > absolutely be
> > > > > > > > said for Arrow, if not even more efficient in some cases
> where
> > > object
> > > > > > > > fields are of variable length.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > In the JVM case, the Arrow data is stored off-heap. This
> > > requires
> > > > > the
> > > > > > > > JVM to interface to it through some calls to Unsafe hidden
> under
> > > the
> > > > > > > Netty
> > > > > > > > layer (but please correct me if I'm wrong, I'm not an expert
> on
> > > > > this).
> > > > > > > > Those calls are the only reason I can think of that would
> > > degrade the
> > > > > > > > performance a bit compared to a pure JAva case. I don't know
> if
> > > the
> > > > > > > Unsafe
> > > > > > > > calls are inlined during JIT compilation. If they aren't they
> > > will
> > > > > > > increase
> > > > > > > > access latency to any data a little bit.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > I don't have a similar machine so it's not easy to relate
> my
> > > > > numbers to
> > > > > > > > yours, but if you can get that data consumed with 100 Gbps
> in a
> > > pure
> > > > > Java
> > > > > > > > case, I don't see any reason (resulting from Arrow format /
> > > off-heap
> > > > > > > > storage) why you wouldn't be able to get at least really
> close.
> > > Can
> > > > > you
> > > > > > > get
> > > > > > > > to 100 Gbps starting from primitive arrays in Java with your
> > > > > consumption
> > > > > > > > functions in the first place?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > I'm interested to see your progress on this.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Kind regards,
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Johan Peltenburg
> > > > > > > > >
> > > > > > > > > ________________________________
> > > > > > > > > From: Animesh Trivedi <an...@gmail.com>
> > > > > > > > > Sent: Wednesday, September 19, 2018 2:08:50 PM
> > > > > > > > > To: dev@arrow.apache.org; dev@crail.apache.org
> > > > > > > > > Subject: [JAVA] Arrow performance measurement
> > > > > > > > >
> > > > > > > > > Hi all,
> > > > > > > > >
> > > > > > > > > A week ago, Wes and I had a discussion about the
> performance
> > > of the
> > > > > > > > > Arrow/Java implementation on the Apache Crail (Incubating)
> > > mailing
> > > > > > > list (
> > > > > > > > >
> > > > >
> http://mail-archives.apache.org/mod_mbox/crail-dev/201809.mbox/browser
> > > > > > > ).
> > > > > > > > In
> > > > > > > > > a nutshell: I am investigating the performance of various
> file
> > > > > formats
> > > > > > > > > (including Arrow) on high-performance NVMe and
> > > RDMA/100Gbps/RoCE
> > > > > > > setups.
> > > > > > > > I
> > > > > > > > > benchmarked how long does it take to materialize values
> (ints,
> > > > > longs,
> > > > > > > > > doubles) of the store_sales table, the largest table in the
> > > TPC-DS
> > > > > > > > dataset
> > > > > > > > > stored on different file formats. Here is a write-up on
> this -
> > > > > > > > >
> https://crail.incubator.apache.org/blog/2018/08/sql-p1.html. I
> > > > > found
> > > > > > > > that
> > > > > > > > > between a pair of machine connected over a 100 Gbps link,
> Arrow
> > > > > (using
> > > > > > > > as a
> > > > > > > > > file format on HDFS) delivered close to ~30 Gbps of
> bandwidth
> > > with
> > > > > all
> > > > > > > 16
> > > > > > > > > cores engaged. Wes pointed out that (i) Arrow is in-memory
> IPC
> > > > > format,
> > > > > > > > and
> > > > > > > > > has not been optimized for storage interfaces/APIs like
> HDFS;
> > > (ii)
> > > > > the
> > > > > > > > > performance I am measuring is for the java implementation.
> > > > > > > > >
> > > > > > > > > Wes, I hope I summarized our discussion correctly.
> > > > > > > > >
> > > > > > > > > That brings us to this email where I promised to follow up
> on
> > > the
> > > > > Arrow
> > > > > > > > > mailing list to understand and optimize the performance of
> > > > > Arrow/Java
> > > > > > > > > implementation on high-performance devices. I wrote a small
> > > > > stand-alone
> > > > > > > > > benchmark (
> > > https://github.com/animeshtrivedi/benchmarking-arrow)
> > > > > with
> > > > > > > > three
> > > > > > > > > implementations of WritableByteChannel, SeekableByteChannel
> > > > > interfaces:
> > > > > > > > >
> > > > > > > > > 1. Arrow data is stored in HDFS/tmpfs - this gives me ~30
> Gbps
> > > > > > > > performance
> > > > > > > > > 2. Arrow data is stored in Crail/DRAM - this gives me
> ~35-36
> > > Gbps
> > > > > > > > > performance
> > > > > > > > > 3. Arrow data is stored in on-heap byte[] - this gives me
> ~39
> > > Gbps
> > > > > > > > > performance
> > > > > > > > >
> > > > > > > > > I think the order makes sense. To better understand the
> > > > > performance of
> > > > > > > > > Arrow/Java we can focus on the option 3.
> > > > > > > > >
> > > > > > > > > The key question I am trying to answer is "what would it
> take
> > > for
> > > > > > > > > Arrow/Java to deliver 100+ Gbps of performance"? Is it
> > > possible? If
> > > > > > > yes,
> > > > > > > > > then what is missing/or mis-interpreted by me? If not, then
> > > where
> > > > > is
> > > > > > > the
> > > > > > > > > performance lost? Does anyone have any performance
> measurements
> > > > > for C++
> > > > > > > > > implementation? if they have seen/expect better numbers.
> > > > > > > > >
> > > > > > > > > As a next step, I will profile the read path of Arrow/Java
> for
> > > the
> > > > > > > option
> > > > > > > > > 3. I will report my findings.
> > > > > > > > >
> > > > > > > > > Any thoughts and feedback on this investigation are very
> > > welcome.
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > --
> > > > > > > > > Animesh
> > > > > > > > >
> > > > > > > > > PS~ Cross-posting on the dev@crail.apache.org list as
> well.
> > > > > > > >
> > > > > > >
> > > > >
> > >
>

Re: [JAVA] Arrow performance measurement

Posted by Wes McKinney <we...@gmail.com>.
See e.g.

https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/ipc-read-write-test.cc#L222


On Thu, Oct 4, 2018 at 6:48 AM Animesh Trivedi
<an...@gmail.com> wrote:
>
> Primarily write the same microbenchmark as I have in Java in C++ for table
> reading and value materialization. So just an example of equivalent
> ArrowFileReader example code in C++. Unit tests are a good starting point,
> thanks for the tip :)
>
> On Thu, Oct 4, 2018 at 12:39 PM Wes McKinney <we...@gmail.com> wrote:
>
> > > 3. Are there examples of Arrow in C++ read/write code that I can have a
> > look?
> >
> > What kind of code are you looking for? I would direct you to relevant
> > unit tests that exhibit certain functionality, but it depends on what
> > you are trying to do
> > On Wed, Oct 3, 2018 at 9:45 AM Animesh Trivedi
> > <an...@gmail.com> wrote:
> > >
> > > Hi all - quick update on the performance investigation:
> > >
> > > - I spent some time looking at performance profile for a binary blob
> > column
> > > (1024 bytes of byte[]) and found a few favorable settings for delivering
> > up
> > > to 168 Gbps from in-memory reading benchmark on 16 cores. These settings
> > > (NUMA, JVM settings, Arrow holder API, and batch size, etc.) are
> > documented
> > > here:
> > >
> > https://github.com/animeshtrivedi/blog/blob/master/post/2018-10-03-arrow-binary.md
> > > - these setting also help to improved the last number that reported (but
> > > not by much) for the in-memory TPC-DS store_sales table from ~39 Gbps up
> > to
> > > ~45-47 Gbps (note: this number is just in-memory benchmark, i.e., w/o any
> > > networking or storage links)
> > >
> > > A few follow up questions that I have:
> > > 1. Arrow reads a batch size worth of data in one go. Are there any
> > > recommended batch sizes? In my investigation, small batch size help with
> > a
> > > better cache profile but increase number of instructions required (more
> > > looping). Larger one do otherwise. Somehow ~10MB/thread seem to be the
> > best
> > > performing configuration, which is also a bit counter intuitive as for 16
> > > threads this will lead to 160 MB of memory footprint. May be this is also
> > > tired to the memory management logic which is my next question.
> > > 2. Arrow use's netty's memory manager. (i) what are decent netty memory
> > > management settings for "io.netty.allocator.*" parameters? I don't find
> > any
> > > decent write-up on them; (ii) Is there a provision for ArrowBuf being
> > > re-used once a batch is consumed? As it looks for now, read read
> > allocates
> > > a new buffer to read the whole batch size.
> > > 3. Are there examples of Arrow in C++ read/write code that I can have a
> > > look?
> > >
> > > Cheers,
> > > --
> > > Animesh
> > >
> > >
> > > On Wed, Sep 19, 2018 at 8:49 PM Wes McKinney <we...@gmail.com>
> > wrote:
> > >
> > > > On Wed, Sep 19, 2018 at 2:13 PM Animesh Trivedi
> > > > <an...@gmail.com> wrote:
> > > > >
> > > > > Hi Johan, Wes, and Jacques - many thanks for your comments:
> > > > >
> > > > > @Johan -
> > > > > 1. I also do not suspect that there is any inherent drawback in Java
> > or
> > > > C++
> > > > > due to the Arrow format. I mentioned C++ because Wes pointed out that
> > > > Java
> > > > > routines are not the most optimized ones (yet!). And naturally one
> > would
> > > > > expect better performance in a native language with all
> > > > pointer/memory/SIMD
> > > > > instruction optimizations that you mentioned. As far as I know, the
> > > > > off-heap buffers are managed in ArrowBuf which implements an abstract
> > > > netty
> > > > > class. But there is nothing unusual, i.e., netty specific, about
> > these
> > > > > unsafe routines, they are used by many projects. Though there is cost
> > > > > associated with materializing on-heap Java values from off-heap
> > memory
> > > > > regions. I need to benchmark that more carefully.
> > > > >
> > > > > 2. When you say "I've so far always been able to get similar
> > performance
> > > > > numbers" - do you mean the same performance of my case 3 where 16
> > cores
> > > > > drive close to 40 Gbps, or the same performance between your C++ and
> > Java
> > > > > benchmarks. Do you have some write-up? I would be interested to read
> > up
> > > > :)
> > > > >
> > > > > 3. "Can you get to 100 Gbps starting from primitive arrays in Java"
> > ->
> > > > that
> > > > > is a good idea. Let me try and report back.
> > > > >
> > > > > @Wes -
> > > > >
> > > > > Is there some benchmark template for C++ routines I can have a look?
> > I
> > > > > would be happy to get some input from Java-Arrow experts on how to
> > write
> > > > > these benchmarks more efficiently. I will have a closer look at the
> > JIRA
> > > > > tickets that you mentioned.
> > > > >
> > > > > So, for now I am focusing on the case 3, which is about establishing
> > > > > performance when reading from a local in-memory I/O stream that I
> > > > > implemented (
> > > > >
> > > >
> > https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/MemoryIOChannel.java
> > > > ).
> > > > > In this case I first read data from parquet files, convert them into
> > > > Arrow,
> > > > > and write-out to a MemoryIOChannel, and then read back from it. So,
> > the
> > > > > performance has nothing to do with Crail or HDFS in the case 3.
> > Once, I
> > > > > establish the base performance in this setup (which is around ~40
> > Gbps
> > > > with
> > > > > 16 cores) I will add Crail to the mix. Perhaps Crail I/O streams can
> > take
> > > > > ArrowBuf as src/dst buffers. That should be doable.
> > > >
> > > > Right, in any case what you are testing is the performance of using a
> > > > particular Java accessor layer to JVM off-heap Arrow memory to sum the
> > > > non-null values of each column. I'm not sure that a single bandwidth
> > > > number produced by this benchmark is very informative for people
> > > > contemplating what memory format to use in their system due to the
> > > > current state of the implementation (Java) and workload measured
> > > > (summing the non-null values with a naive algorithm). I would guess
> > > > that a C++ version with raw pointers and a loop-unrolled, branch-free
> > > > vectorized sum is going to be a lot faster.
> > > >
> > > > >
> > > > > @Jacques -
> > > > >
> > > > > That is a good point that "Arrow's implementation is more focused on
> > > > > interacting with the structure than transporting it". However, in any
> > > > > distributed system one needs to move data/structure around - as far
> > as I
> > > > > understand that is another goal of the project. My investigation
> > started
> > > > > within the context of Spark/SQL data processing. Spark converts
> > incoming
> > > > > data into its own in-memory UnsafeRow representation for processing.
> > So
> > > > > naturally the performance of this data ingestion pipeline cannot
> > > > outperform
> > > > > the read performance of the used file format. I benchmarked Parquet,
> > ORC,
> > > > > Avro, JSON (for the specific TPC-DS store_sales table). And then
> > > > curiously
> > > > > benchmarked Arrow as well because its design choices are a better
> > fit for
> > > > > modern high-performance RDMA/NVMe/100+Gbps devices I am
> > investigating.
> > > > From
> > > > > this point of view, I am trying to find out can Arrow be the file
> > format
> > > > > for the next generation of storage/networking devices (see Apache
> > Crail
> > > > > project) delivering close to the hardware speed reading/writing
> > rates. As
> > > > > Wes pointed out that a C++ library implementation should be memory-IO
> > > > > bound, so what would it take to deliver the same performance in Java
> > ;)
> > > > > (and then, from across the network).
> > > > >
> > > > > I hope this makes sense.
> > > > >
> > > > > Cheers,
> > > > > --
> > > > > Animesh
> > > > >
> > > > > On Wed, Sep 19, 2018 at 6:28 PM Jacques Nadeau <ja...@apache.org>
> > > > wrote:
> > > > >
> > > > > > My big question is what is the use case and how/what are you
> > trying to
> > > > > > compare? Arrow's implementation is more focused on interacting
> > with the
> > > > > > structure than transporting it. Generally speaking, when we're
> > working
> > > > with
> > > > > > Arrow data we frequently are just interacting with memory
> > locations and
> > > > > > doing direct operations. If you have a layer that supports that
> > type of
> > > > > > semantic, create a movement technique that depends on that. Arrow
> > > > doesn't
> > > > > > force a particular API since the data itself is defined by its
> > > > in-memory
> > > > > > layout so if you have a custom use or pattern, just work with the
> > > > in-memory
> > > > > > structures.
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Wed, Sep 19, 2018 at 7:49 AM Wes McKinney <we...@gmail.com>
> > > > wrote:
> > > > > >
> > > > > > > hi Animesh,
> > > > > > >
> > > > > > > Per Johan's comments, the C++ library is essentially going to be
> > > > > > > IO/memory bandwidth bound since you're interacting with raw
> > pointers.
> > > > > > >
> > > > > > > I'm looking at your code
> > > > > > >
> > > > > > > private void consumeFloat4(FieldVector fv) {
> > > > > > >     Float4Vector accessor = (Float4Vector) fv;
> > > > > > >     int valCount = accessor.getValueCount();
> > > > > > >     for(int i = 0; i < valCount; i++){
> > > > > > >         if(!accessor.isNull(i)){
> > > > > > >             float4Count+=1;
> > > > > > >             checksum+=accessor.get(i);
> > > > > > >         }
> > > > > > >     }
> > > > > > > }
> > > > > > >
> > > > > > > You'll want to get a Java-Arrow expert from Dremio to advise you
> > the
> > > > > > > fastest way to iterate over this data -- my understanding is that
> > > > much
> > > > > > > code in Dremio interacts with the wrapped Netty ArrowBuf objects
> > > > > > > rather than going through the higher level APIs. You're also
> > dropping
> > > > > > > performance because memory mapping is not yet implemented in
> > Java,
> > > > see
> > > > > > > https://issues.apache.org/jira/browse/ARROW-3191.
> > > > > > >
> > > > > > > Furthermore, the IPC reader class you are using could be made
> > more
> > > > > > > efficient. I described the problem in
> > > > > > > https://issues.apache.org/jira/browse/ARROW-3192 -- this will be
> > > > > > > required as soon as we have the ability to do memory mapping in
> > Java
> > > > > > >
> > > > > > > Could Crail use the Arrow data structures in its runtime rather
> > than
> > > > > > > copying? If not, how are Crail's runtime data structures
> > different?
> > > > > > >
> > > > > > > - Wes
> > > > > > > On Wed, Sep 19, 2018 at 9:19 AM Johan Peltenburg - EWI
> > > > > > > <J....@tudelft.nl> wrote:
> > > > > > > >
> > > > > > > > Hello Animesh,
> > > > > > > >
> > > > > > > >
> > > > > > > > I browsed a bit in your sources, thanks for sharing. We have
> > > > performed
> > > > > > > some similar measurements to your third case in the past for
> > C/C++ on
> > > > > > > collections of various basic types such as primitives and
> > strings.
> > > > > > > >
> > > > > > > >
> > > > > > > > I can say that in terms of consuming data from the Arrow format
> > > > versus
> > > > > > > language native collections in C++, I've so far always been able
> > to
> > > > get
> > > > > > > similar performance numbers (e.g. no drawbacks due to the Arrow
> > > > format
> > > > > > > itself). Especially when accessing the data through Arrow's raw
> > data
> > > > > > > pointers (and using for example std::string_view-like
> > constructs).
> > > > > > > >
> > > > > > > > In C/C++ the fast data structures are engineered in such a way
> > > > that as
> > > > > > > little pointer traversals are required and they take up an as
> > small
> > > > as
> > > > > > > possible memory footprint. Thus each memory access is relatively
> > > > > > efficient
> > > > > > > (in terms of obtaining the data of interest). The same can
> > > > absolutely be
> > > > > > > said for Arrow, if not even more efficient in some cases where
> > object
> > > > > > > fields are of variable length.
> > > > > > > >
> > > > > > > >
> > > > > > > > In the JVM case, the Arrow data is stored off-heap. This
> > requires
> > > > the
> > > > > > > JVM to interface to it through some calls to Unsafe hidden under
> > the
> > > > > > Netty
> > > > > > > layer (but please correct me if I'm wrong, I'm not an expert on
> > > > this).
> > > > > > > Those calls are the only reason I can think of that would
> > degrade the
> > > > > > > performance a bit compared to a pure Java case. I don't know if
> > the
> > > > > > Unsafe
> > > > > > > calls are inlined during JIT compilation. If they aren't they
> > will
> > > > > > increase
> > > > > > > access latency to any data a little bit.
> > > > > > > >
> > > > > > > >
> > > > > > > > I don't have a similar machine so it's not easy to relate my
> > > > numbers to
> > > > > > > yours, but if you can get that data consumed with 100 Gbps in a
> > pure
> > > > Java
> > > > > > > case, I don't see any reason (resulting from Arrow format /
> > off-heap
> > > > > > > storage) why you wouldn't be able to get at least really close.
> > Can
> > > > you
> > > > > > get
> > > > > > > to 100 Gbps starting from primitive arrays in Java with your
> > > > consumption
> > > > > > > functions in the first place?
> > > > > > > >
> > > > > > > >
> > > > > > > > I'm interested to see your progress on this.
> > > > > > > >
> > > > > > > >
> > > > > > > > Kind regards,
> > > > > > > >
> > > > > > > >
> > > > > > > > Johan Peltenburg
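
Johan's last question can be answered with a few lines of plain Java that do
not touch Arrow at all. A baseline sketch along these lines (array size, null
ratio and the single-pass timing are arbitrary illustration choices, and a
real measurement would repeat the loop so the JIT can warm up):

// Sum non-null values straight from a primitive float[] plus a boolean[]
// null mask, with no Arrow involved. If this loop (times 16 cores) cannot
// reach 100+ Gbps, the Arrow accessor layer is not the only limit.
public class PrimitiveArrayBaseline {
    public static void main(String[] args) {
        final int n = 16 * 1024 * 1024;            // 16M floats = 64 MB of data
        float[] column = new float[n];
        boolean[] valid = new boolean[n];
        for (int i = 0; i < n; i++) {
            column[i] = i;
            valid[i] = (i % 100) != 0;             // ~1% nulls, arbitrary
        }
        double checksum = 0;
        long count = 0;
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            if (valid[i]) {
                count += 1;
                checksum += column[i];
            }
        }
        long elapsedNs = System.nanoTime() - start;
        double gbps = (n * 4L * 8) / (double) elapsedNs;   // bits per nanosecond == Gbit/s
        System.out.println("count=" + count + " checksum=" + checksum + " ~" + gbps + " Gbps");
    }
}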
> > > > > > > >
> > > > > > > > ________________________________
> > > > > > > > From: Animesh Trivedi <an...@gmail.com>
> > > > > > > > Sent: Wednesday, September 19, 2018 2:08:50 PM
> > > > > > > > To: dev@arrow.apache.org; dev@crail.apache.org
> > > > > > > > Subject: [JAVA] Arrow performance measurement
> > > > > > > >
> > > > > > > > Hi all,
> > > > > > > >
> > > > > > > > A week ago, Wes and I had a discussion about the performance
> > of the
> > > > > > > > Arrow/Java implementation on the Apache Crail (Incubating)
> > mailing
> > > > > > list (
> > > > > > > >
> > > > http://mail-archives.apache.org/mod_mbox/crail-dev/201809.mbox/browser
> > > > > > ).
> > > > > > > In
> > > > > > > > a nutshell: I am investigating the performance of various file
> > > > formats
> > > > > > > > (including Arrow) on high-performance NVMe and
> > RDMA/100Gbps/RoCE
> > > > > > setups.
> > > > > > > I
> > > > > > > > benchmarked how long it takes to materialize values (ints,
> > > > longs,
> > > > > > > > doubles) of the store_sales table, the largest table in the
> > TPC-DS
> > > > > > > dataset
> > > > > > > > stored on different file formats. Here is a write-up on this -
> > > > > > > > https://crail.incubator.apache.org/blog/2018/08/sql-p1.html. I
> > > > found
> > > > > > > that
> > > > > > > > between a pair of machines connected over a 100 Gbps link, Arrow
> > > > (used
> > > > > > > as a
> > > > > > > > file format on HDFS) delivered close to ~30 Gbps of bandwidth
> > with
> > > > all
> > > > > > 16
> > > > > > > > cores engaged. Wes pointed out that (i) Arrow is an in-memory IPC
> > > > format,
> > > > > > > and
> > > > > > > > has not been optimized for storage interfaces/APIs like HDFS;
> > (ii)
> > > > the
> > > > > > > > performance I am measuring is for the Java implementation.
> > > > > > > >
> > > > > > > > Wes, I hope I summarized our discussion correctly.
> > > > > > > >
> > > > > > > > That brings us to this email where I promised to follow up on
> > the
> > > > Arrow
> > > > > > > > mailing list to understand and optimize the performance of
> > > > Arrow/Java
> > > > > > > > implementation on high-performance devices. I wrote a small
> > > > stand-alone
> > > > > > > > benchmark (
> > https://github.com/animeshtrivedi/benchmarking-arrow)
> > > > with
> > > > > > > three
> > > > > > > > implementations of WritableByteChannel, SeekableByteChannel
> > > > interfaces:
> > > > > > > >
> > > > > > > > 1. Arrow data is stored in HDFS/tmpfs - this gives me ~30 Gbps
> > > > > > > performance
> > > > > > > > 2. Arrow data is stored in Crail/DRAM - this gives me ~35-36
> > Gbps
> > > > > > > > performance
> > > > > > > > 3. Arrow data is stored in on-heap byte[] - this gives me ~39
> > Gbps
> > > > > > > > performance
> > > > > > > >
> > > > > > > > I think the order makes sense. To better understand the
> > > > performance of
> > > > > > > > Arrow/Java we can focus on the option 3.
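
Concretely, case 3 amounts to handing the Arrow reader a SeekableByteChannel
backed by plain on-heap bytes. A minimal read-only version could look like
the following (an illustration only, not the MemoryIOChannel from the
benchmark repository):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;

// In-memory, read-only SeekableByteChannel over a byte[]: the Arrow file
// bytes sit in DRAM, so reading through it measures only the Arrow/Java
// read path, with no storage or network underneath.
public class ByteArrayChannel implements SeekableByteChannel {
    private final byte[] data;
    private long position = 0;
    private boolean open = true;

    public ByteArrayChannel(byte[] data) { this.data = data; }

    @Override public int read(ByteBuffer dst) {
        if (position >= data.length) { return -1; }            // end of "file"
        int n = (int) Math.min(dst.remaining(), data.length - position);
        dst.put(data, (int) position, n);
        position += n;
        return n;
    }

    @Override public int write(ByteBuffer src) throws IOException {
        throw new IOException("read-only sketch");
    }

    @Override public long position() { return position; }
    @Override public SeekableByteChannel position(long newPosition) { position = newPosition; return this; }
    @Override public long size() { return data.length; }
    @Override public SeekableByteChannel truncate(long size) { return this; }
    @Override public boolean isOpen() { return open; }
    @Override public void close() { open = false; }
}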
> > > > > > > >
> > > > > > > > The key question I am trying to answer is "what would it take
> > for
> > > > > > > > Arrow/Java to deliver 100+ Gbps of performance"? Is it possible? If
> > > > > > > > yes, then what is missing or mis-interpreted by me? If not, then where
> > > > > > > > is the performance lost? Does anyone have performance measurements for
> > > > > > > > the C++ implementation, i.e., have they seen or do they expect better
> > > > > > > > numbers?
> > > > > > > >
> > > > > > > > As a next step, I will profile the read path of Arrow/Java for
> > the
> > > > > > option
> > > > > > > > 3. I will report my findings.
> > > > > > > >
> > > > > > > > Any thoughts and feedback on this investigation are very
> > welcome.
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > --
> > > > > > > > Animesh
> > > > > > > >
> > > > > > > > PS~ Cross-posting on the dev@crail.apache.org list as well.
> > > > > > >
> > > > > >
> > > >
> >

Re: [JAVA] Arrow performance measurement

Posted by Wes McKinney <we...@gmail.com>.
See e.g.

https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/ipc-read-write-test.cc#L222


On Thu, Oct 4, 2018 at 6:48 AM Animesh Trivedi
<an...@gmail.com> wrote:
>
> Primarily write the same microbenchmark as I have in Java in C++ for table
> reading and value materialization. So just an example of equivalent
> ArrowFileReader example code in C++. Unit tests are a good starting point,
> thanks for the tip :)
>
> On Thu, Oct 4, 2018 at 12:39 PM Wes McKinney <we...@gmail.com> wrote:
>
> > > 3. Are there examples of Arrow in C++ read/write code that I can have a
> > look?
> >
> > What kind of code are you looking for? I would direct you to relevant
> > unit tests that exhibit certain functionality, but it depends on what
> > you are trying to do
> > On Wed, Oct 3, 2018 at 9:45 AM Animesh Trivedi
> > <an...@gmail.com> wrote:
> > >
> > > Hi all - quick update on the performance investigation:
> > >
> > > - I spent some time looking at performance profile for a binary blob
> > column
> > > (1024 bytes of byte[]) and found a few favorable settings for delivering
> > up
> > > to 168 Gbps from in-memory reading benchmark on 16 cores. These settings
> > > (NUMA, JVM settings, Arrow holder API, and batch size, etc.) are
> > documented
> > > here:
> > >
> > https://github.com/animeshtrivedi/blog/blob/master/post/2018-10-03-arrow-binary.md
> > > - these settings also helped to improve the last number that I reported
> > > (but not by much) for the in-memory TPC-DS store_sales table, from ~39 Gbps
> > > up to ~45-47 Gbps (note: this number is just the in-memory benchmark, i.e.,
> > > w/o any networking or storage links)
> > >
> > > A few follow-up questions that I have:
> > > 1. Arrow reads a batch size worth of data in one go. Are there any
> > > recommended batch sizes? In my investigation, small batch sizes help with a
> > > better cache profile but increase the number of instructions required (more
> > > looping). Larger ones do the opposite. Somehow ~10 MB/thread seems to be the
> > > best-performing configuration, which is also a bit counter-intuitive, as for
> > > 16 threads this leads to a 160 MB memory footprint. Maybe this is also tied
> > > to the memory management logic, which is my next question.
> > > 2. Arrow uses Netty's memory manager. (i) What are decent Netty memory
> > > management settings for the "io.netty.allocator.*" parameters? I can't find
> > > any decent write-up on them. (ii) Is there a provision for an ArrowBuf being
> > > re-used once a batch is consumed? As it looks for now, each read allocates a
> > > new buffer for the whole batch.
> > > 3. Are there examples of Arrow in C++ read/write code that I can have a
> > > look?
> > >
> > > Cheers,
> > > --
> > > Animesh
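
For context, the Java read loop behind questions 1 and 2 has roughly this
shape (a sketch only: class and package names moved around between Arrow
releases of this period, and the per-batch buffer allocation noted in the
comments is the behaviour described above rather than a documented
guarantee):

import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.FieldVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowFileReader;

// Read an Arrow file one record batch at a time. The batch size is fixed at
// write time, so "batch size" is really a property of the writer; the reader
// just loads whatever the next batch happens to be.
public class BatchReadSketch {
    public static void main(String[] args) throws Exception {
        try (FileChannel channel = FileChannel.open(Paths.get(args[0]), StandardOpenOption.READ);
             RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             ArrowFileReader reader = new ArrowFileReader(channel, allocator)) {

            // The VectorSchemaRoot and its FieldVectors belong to the reader and
            // are re-populated on every loadNextBatch(); the underlying ArrowBufs
            // appear to be freshly allocated per batch in this path.
            VectorSchemaRoot root = reader.getVectorSchemaRoot();
            long rows = 0;
            while (reader.loadNextBatch()) {
                rows += root.getRowCount();
                for (FieldVector vector : root.getFieldVectors()) {
                    // consume the vector here, e.g. consumeFloat4(vector)
                }
            }
            System.out.println("rows read: " + rows);
        }
    }
}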
> > >
> > >
> > > On Wed, Sep 19, 2018 at 8:49 PM Wes McKinney <we...@gmail.com>
> > wrote:
> > >
> > > > On Wed, Sep 19, 2018 at 2:13 PM Animesh Trivedi
> > > > <an...@gmail.com> wrote:
> > > > >
> > > > > Hi Johan, Wes, and Jacques - many thanks for your comments:
> > > > >
> > > > > @Johan -
> > > > > 1. I also do not suspect that there is any inherent drawback in Java
> > or
> > > > C++
> > > > > due to the Arrow format. I mentioned C++ because Wes pointed out that
> > > > Java
> > > > > routines are not the most optimized ones (yet!). And naturally one
> > would
> > > > > expect better performance in a native language with all
> > > > pointer/memory/SIMD
> > > > > instruction optimizations that you mentioned. As far as I know, the
> > > > > off-heap buffers are managed in ArrowBuf which implements an abstract
> > > > netty
> > > > > class. But there is nothing unusual, i.e., netty specific, about
> > these
> > > > > unsafe routines, they are used by many projects. Though there is cost
> > > > > associated with materializing on-heap Java values from off-heap
> > memory
> > > > > regions. I need to benchmark that more carefully.
> > > > >
> > > > > 2. When you say "I've so far always been able to get similar
> > performance
> > > > > numbers" - do you mean the same performance of my case 3 where 16
> > cores
> > > > > drive close to 40 Gbps, or the same performance between your C++ and
> > Java
> > > > > benchmarks. Do you have some write-up? I would be interested to read
> > up
> > > > :)
> > > > >
> > > > > 3. "Can you get to 100 Gbps starting from primitive arrays in Java"
> > ->
> > > > that
> > > > > is a good idea. Let me try and report back.
> > > > >
> > > > > @Wes -
> > > > >
> > > > > Is there some benchmark template for C++ routines I can have a look?
> > I
> > > > > would be happy to get some input from Java-Arrow experts on how to
> > write
> > > > > these benchmarks more efficiently. I will have a closer look at the
> > JIRA
> > > > > tickets that you mentioned.
> > > > >
> > > > > So, for now I am focusing on the case 3, which is about establishing
> > > > > performance when reading from a local in-memory I/O stream that I
> > > > > implemented (
> > > > >
> > > >
> > https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/MemoryIOChannel.java
> > > > ).
> > > > > In this case I first read data from parquet files, convert them into
> > > > Arrow,
> > > > > and write-out to a MemoryIOChannel, and then read back from it. So,
> > the
> > > > > performance has nothing to do with Crail or HDFS in the case 3.
> > Once, I
> > > > > establish the base performance in this setup (which is around ~40
> > Gbps
> > > > with
> > > > > 16 cores) I will add Crail to the mix. Perhaps Crail I/O streams can
> > take
> > > > > ArrowBuf as src/dst buffers. That should be doable.
> > > >
> > > > Right, in any case what you are testing is the performance of using a
> > > > particular Java accessor layer to JVM off-heap Arrow memory to sum the
> > > > non-null values of each column. I'm not sure that a single bandwidth
> > > > number produced by this benchmark is very informative for people
> > > > contemplating what memory format to use in their system due to the
> > > > current state of the implementation (Java) and workload measured
> > > > (summing the non-null values with a naive algorithm). I would guess
> > > > that a C++ version with raw pointers and a loop-unrolled, branch-free
> > > > vectorized sum is going to be a lot faster.
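
On the Java side, one small step toward that kind of tight loop is to skip
the per-element null check whenever a vector reports no nulls at all. A
sketch that mirrors the consumeFloat4 method quoted earlier in the thread and
assumes the same fields:

// When getNullCount() is zero (common for many columns), sum without any
// per-element isNull() branch; otherwise fall back to the checked loop.
private void consumeFloat4NullAware(Float4Vector vector) {
    final int valCount = vector.getValueCount();
    if (vector.getNullCount() == 0) {
        for (int i = 0; i < valCount; i++) {
            checksum += vector.get(i);
        }
        float4Count += valCount;
    } else {
        for (int i = 0; i < valCount; i++) {
            if (!vector.isNull(i)) {
                float4Count += 1;
                checksum += vector.get(i);
            }
        }
    }
}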
> > > >
> > > > >
> > > > > @Jacques -
> > > > >
> > > > > That is a good point that "Arrow's implementation is more focused on
> > > > > interacting with the structure than transporting it". However, in any
> > > > > distributed system one needs to move data/structure around - as far
> > as I
> > > > > understand that is another goal of the project. My investigation
> > started
> > > > > within the context of Spark/SQL data processing. Spark converts
> > incoming
> > > > > data into its own in-memory UnsafeRow representation for processing.
> > So
> > > > > naturally the performance of this data ingestion pipeline cannot
> > > > outperform
> > > > > the read performance of the used file format. I benchmarked Parquet,
> > ORC,
> > > > > Avro, JSON (for the specific TPC-DS store_sales table). And then
> > > > curiously
> > > > > benchmarked Arrow as well because its design choices are a better
> > fit for
> > > > > modern high-performance RDMA/NVMe/100+Gbps devices I am
> > investigating.
> > > > From
> > > > > this point of view, I am trying to find out whether Arrow can be the file
> > format
> > > > > for the next generation of storage/networking devices (see Apache
> > Crail
> > > > > project) delivering close to the hardware speed reading/writing
> > rates. As
> > > > > Wes pointed out that a C++ library implementation should be memory-IO
> > > > > bound, so what would it take to deliver the same performance in Java
> > ;)
> > > > > (and then, from across the network).
> > > > >
> > > > > I hope this makes sense.
> > > > >
> > > > > Cheers,
> > > > > --
> > > > > Animesh
> > > > >
> > > > > On Wed, Sep 19, 2018 at 6:28 PM Jacques Nadeau <ja...@apache.org>
> > > > wrote:
> > > > >
> > > > > > My big question is what is the use case and how/what are you
> > trying to
> > > > > > compare? Arrow's implementation is more focused on interacting
> > with the
> > > > > > structure than transporting it. Generally speaking, when we're
> > working
> > > > with
> > > > > > Arrow data we frequently are just interacting with memory
> > locations and
> > > > > > doing direct operations. If you have a layer that supports that
> > type of
> > > > > > semantic, create a movement technique that depends on that. Arrow
> > > > doesn't
> > > > > > force a particular API since the data itself is defined by its
> > > > in-memory
> > > > > > layout so if you have a custom use or pattern, just work with the
> > > > in-memory
> > > > > > structures.
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Wed, Sep 19, 2018 at 7:49 AM Wes McKinney <we...@gmail.com>
> > > > wrote:
> > > > > >
> > > > > > > hi Animesh,
> > > > > > >
> > > > > > > Per Johan's comments, the C++ library is essentially going to be
> > > > > > > IO/memory bandwidth bound since you're interacting with raw
> > pointers.
> > > > > > >
> > > > > > > I'm looking at your code
> > > > > > >
> > > > > > > private void consumeFloat4(FieldVector fv) {
> > > > > > >     Float4Vector accessor = (Float4Vector) fv;
> > > > > > >     int valCount = accessor.getValueCount();
> > > > > > >     for(int i = 0; i < valCount; i++){
> > > > > > >         if(!accessor.isNull(i)){
> > > > > > >             float4Count+=1;
> > > > > > >             checksum+=accessor.get(i);
> > > > > > >         }
> > > > > > >     }
> > > > > > > }
> > > > > > >
> > > > > > > You'll want to get a Java-Arrow expert from Dremio to advise you
> > the
> > > > > > > fastest way to iterate over this data -- my understanding is that
> > > > much
> > > > > > > code in Dremio interacts with the wrapped Netty ArrowBuf objects
> > > > > > > rather than going through the higher level APIs. You're also
> > dropping
> > > > > > > performance because memory mapping is not yet implemented in
> > Java,
> > > > see
> > > > > > > https://issues.apache.org/jira/browse/ARROW-3191.
> > > > > > >
> > > > > > > Furthermore, the IPC reader class you are using could be made
> > more
> > > > > > > efficient. I described the problem in
> > > > > > > https://issues.apache.org/jira/browse/ARROW-3192 -- this will be
> > > > > > > required as soon as we have the ability to do memory mapping in
> > Java
> > > > > > >
> > > > > > > Could Crail use the Arrow data structures in its runtime rather
> > than
> > > > > > > copying? If not, how are Crail's runtime data structures
> > different?
> > > > > > >
> > > > > > > - Wes
> > > > > > > On Wed, Sep 19, 2018 at 9:19 AM Johan Peltenburg - EWI
> > > > > > > <J....@tudelft.nl> wrote:
> > > > > > > >
> > > > > > > > Hello Animesh,
> > > > > > > >
> > > > > > > >
> > > > > > > > I browsed a bit in your sources, thanks for sharing. We have
> > > > performed
> > > > > > > some similar measurements to your third case in the past for
> > C/C++ on
> > > > > > > collections of various basic types such as primitives and
> > strings.
> > > > > > > >
> > > > > > > >
> > > > > > > > I can say that in terms of consuming data from the Arrow format
> > > > versus
> > > > > > > language native collections in C++, I've so far always been able
> > to
> > > > get
> > > > > > > similar performance numbers (e.g. no drawbacks due to the Arrow
> > > > format
> > > > > > > itself). Especially when accessing the data through Arrow's raw
> > data
> > > > > > > pointers (and using for example std::string_view-like
> > constructs).
> > > > > > > >
> > > > > > > > In C/C++ the fast data structures are engineered in such a way
> > > > that as
> > > > > > > little pointer traversals are required and they take up an as
> > small
> > > > as
> > > > > > > possible memory footprint. Thus each memory access is relatively
> > > > > > efficient
> > > > > > > (in terms of obtaining the data of interest). The same can
> > > > absolutely be
> > > > > > > said for Arrow, if not even more efficient in some cases where
> > object
> > > > > > > fields are of variable length.
> > > > > > > >
> > > > > > > >
> > > > > > > > In the JVM case, the Arrow data is stored off-heap. This
> > requires
> > > > the
> > > > > > > JVM to interface to it through some calls to Unsafe hidden under
> > the
> > > > > > Netty
> > > > > > > layer (but please correct me if I'm wrong, I'm not an expert on
> > > > this).
> > > > > > > Those calls are the only reason I can think of that would
> > degrade the
> > > > > > performance a bit compared to a pure Java case. I don't know if
> > the
> > > > > > Unsafe
> > > > > > > calls are inlined during JIT compilation. If they aren't they
> > will
> > > > > > increase
> > > > > > > access latency to any data a little bit.
> > > > > > > >
> > > > > > > >
> > > > > > > > I don't have a similar machine so it's not easy to relate my
> > > > numbers to
> > > > > > > yours, but if you can get that data consumed with 100 Gbps in a
> > pure
> > > > Java
> > > > > > > case, I don't see any reason (resulting from Arrow format /
> > off-heap
> > > > > > > storage) why you wouldn't be able to get at least really close.
> > Can
> > > > you
> > > > > > get
> > > > > > > to 100 Gbps starting from primitive arrays in Java with your
> > > > consumption
> > > > > > > functions in the first place?
> > > > > > > >
> > > > > > > >
> > > > > > > > I'm interested to see your progress on this.
> > > > > > > >
> > > > > > > >
> > > > > > > > Kind regards,
> > > > > > > >
> > > > > > > >
> > > > > > > > Johan Peltenburg
> > > > > > > >
> > > > > > > > ________________________________
> > > > > > > > From: Animesh Trivedi <an...@gmail.com>
> > > > > > > > Sent: Wednesday, September 19, 2018 2:08:50 PM
> > > > > > > > To: dev@arrow.apache.org; dev@crail.apache.org
> > > > > > > > Subject: [JAVA] Arrow performance measurement
> > > > > > > >
> > > > > > > > Hi all,
> > > > > > > >
> > > > > > > > A week ago, Wes and I had a discussion about the performance
> > of the
> > > > > > > > Arrow/Java implementation on the Apache Crail (Incubating)
> > mailing
> > > > > > list (
> > > > > > > >
> > > > http://mail-archives.apache.org/mod_mbox/crail-dev/201809.mbox/browser
> > > > > > ).
> > > > > > > In
> > > > > > > > a nutshell: I am investigating the performance of various file
> > > > formats
> > > > > > > > (including Arrow) on high-performance NVMe and
> > RDMA/100Gbps/RoCE
> > > > > > setups.
> > > > > > > I
> > > > > > > > benchmarked how long it takes to materialize values (ints,
> > > > longs,
> > > > > > > > doubles) of the store_sales table, the largest table in the
> > TPC-DS
> > > > > > > dataset
> > > > > > > > stored on different file formats. Here is a write-up on this -
> > > > > > > > https://crail.incubator.apache.org/blog/2018/08/sql-p1.html. I
> > > > found
> > > > > > > that
> > > > > > > > between a pair of machines connected over a 100 Gbps link, Arrow
> > > > (used
> > > > > > > as a
> > > > > > > > file format on HDFS) delivered close to ~30 Gbps of bandwidth
> > with
> > > > all
> > > > > > 16
> > > > > > > > cores engaged. Wes pointed out that (i) Arrow is an in-memory IPC
> > > > format,
> > > > > > > and
> > > > > > > > has not been optimized for storage interfaces/APIs like HDFS;
> > (ii)
> > > > the
> > > > > > > > performance I am measuring is for the Java implementation.
> > > > > > > >
> > > > > > > > Wes, I hope I summarized our discussion correctly.
> > > > > > > >
> > > > > > > > That brings us to this email where I promised to follow up on
> > the
> > > > Arrow
> > > > > > > > mailing list to understand and optimize the performance of
> > > > Arrow/Java
> > > > > > > > implementation on high-performance devices. I wrote a small
> > > > stand-alone
> > > > > > > > benchmark (
> > https://github.com/animeshtrivedi/benchmarking-arrow)
> > > > with
> > > > > > > three
> > > > > > > > implementations of WritableByteChannel, SeekableByteChannel
> > > > interfaces:
> > > > > > > >
> > > > > > > > 1. Arrow data is stored in HDFS/tmpfs - this gives me ~30 Gbps
> > > > > > > performance
> > > > > > > > 2. Arrow data is stored in Crail/DRAM - this gives me ~35-36
> > Gbps
> > > > > > > > performance
> > > > > > > > 3. Arrow data is stored in on-heap byte[] - this gives me ~39
> > Gbps
> > > > > > > > performance
> > > > > > > >
> > > > > > > > I think the order makes sense. To better understand the
> > > > performance of
> > > > > > > > Arrow/Java we can focus on the option 3.
> > > > > > > >
> > > > > > > > The key question I am trying to answer is "what would it take
> > for
> > > > > > > > Arrow/Java to deliver 100+ Gbps of performance"? Is it
> > possible? If
> > > > > > yes,
> > > > > > > > then what is missing/or mis-interpreted by me? If not, then
> > where
> > > > is
> > > > > > the
> > > > > > > > performance lost? Does anyone have any performance measurements
> > > > for C++
> > > > > > > > implementation? if they have seen/expect better numbers.
> > > > > > > >
> > > > > > > > As a next step, I will profile the read path of Arrow/Java for
> > the
> > > > > > option
> > > > > > > > 3. I will report my findings.
> > > > > > > >
> > > > > > > > Any thoughts and feedback on this investigation are very
> > welcome.
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > --
> > > > > > > > Animesh
> > > > > > > >
> > > > > > > > PS~ Cross-posting on the dev@crail.apache.org list as well.
> > > > > > >
> > > > > >
> > > >
> >

Re: [JAVA] Arrow performance measurement

Posted by Animesh Trivedi <an...@gmail.com>.
Primarily write the same microbenchmark as I have in Java in C++ for table
reading and value materialization. So just an example of equivalent
ArrowFileReader example code in C++. Unit tests are a good starting point,
thanks for the tip :)

On Thu, Oct 4, 2018 at 12:39 PM Wes McKinney <we...@gmail.com> wrote:

> > 3. Are there examples of Arrow in C++ read/write code that I can have a
> look?
>
> What kind of code are you looking for? I would direct you to relevant
> unit tests that exhibit certain functionality, but it depends on what
> you are trying to do
> On Wed, Oct 3, 2018 at 9:45 AM Animesh Trivedi
> <an...@gmail.com> wrote:
> >
> > Hi all - quick update on the performance investigation:
> >
> > - I spent some time looking at performance profile for a binary blob
> column
> > (1024 bytes of byte[]) and found a few favorable settings for delivering
> up
> > to 168 Gbps from in-memory reading benchmark on 16 cores. These settings
> > (NUMA, JVM settings, Arrow holder API, and batch size, etc.) are
> documented
> > here:
> >
> https://github.com/animeshtrivedi/blog/blob/master/post/2018-10-03-arrow-binary.md
> > - these settings also helped to improve the last number that I reported
> > (but not by much) for the in-memory TPC-DS store_sales table, from ~39 Gbps
> > up to ~45-47 Gbps (note: this number is just the in-memory benchmark, i.e.,
> > w/o any networking or storage links)
> >
> > A few follow-up questions that I have:
> > 1. Arrow reads a batch size worth of data in one go. Are there any
> > recommended batch sizes? In my investigation, small batch sizes help with a
> > better cache profile but increase the number of instructions required (more
> > looping). Larger ones do the opposite. Somehow ~10 MB/thread seems to be the
> > best-performing configuration, which is also a bit counter-intuitive, as for
> > 16 threads this leads to a 160 MB memory footprint. Maybe this is also tied
> > to the memory management logic, which is my next question.
> > 2. Arrow uses Netty's memory manager. (i) What are decent Netty memory
> > management settings for the "io.netty.allocator.*" parameters? I can't find
> > any decent write-up on them. (ii) Is there a provision for an ArrowBuf being
> > re-used once a batch is consumed? As it looks for now, each read allocates a
> > new buffer for the whole batch.
> > 3. Are there examples of Arrow in C++ read/write code that I can have a
> > look?
> >
> > Cheers,
> > --
> > Animesh
> >
> >
> > On Wed, Sep 19, 2018 at 8:49 PM Wes McKinney <we...@gmail.com>
> wrote:
> >
> > > On Wed, Sep 19, 2018 at 2:13 PM Animesh Trivedi
> > > <an...@gmail.com> wrote:
> > > >
> > > > Hi Johan, Wes, and Jacques - many thanks for your comments:
> > > >
> > > > @Johan -
> > > > 1. I also do not suspect that there is any inherent drawback in Java
> or
> > > C++
> > > > due to the Arrow format. I mentioned C++ because Wes pointed out that
> > > Java
> > > > routines are not the most optimized ones (yet!). And naturally one
> would
> > > > expect better performance in a native language with all
> > > pointer/memory/SIMD
> > > > instruction optimizations that you mentioned. As far as I know, the
> > > > off-heap buffers are managed in ArrowBuf which implements an abstract
> > > netty
> > > > class. But there is nothing unusual, i.e., netty specific, about
> these
> > > > unsafe routines, they are used by many projects. Though there is cost
> > > > associated with materializing on-heap Java values from off-heap
> memory
> > > > regions. I need to benchmark that more carefully.
> > > >
> > > > 2. When you say "I've so far always been able to get similar
> performance
> > > > numbers" - do you mean the same performance of my case 3 where 16
> cores
> > > > drive close to 40 Gbps, or the same performance between your C++ and
> Java
> > > > benchmarks. Do you have some write-up? I would be interested to read
> up
> > > :)
> > > >
> > > > 3. "Can you get to 100 Gbps starting from primitive arrays in Java"
> ->
> > > that
> > > > is a good idea. Let me try and report back.
> > > >
> > > > @Wes -
> > > >
> > > > Is there some benchmark template for C++ routines I can have a look?
> I
> > > > would be happy to get some input from Java-Arrow experts on how to
> write
> > > > these benchmarks more efficiently. I will have a closer look at the
> JIRA
> > > > tickets that you mentioned.
> > > >
> > > > So, for now I am focusing on the case 3, which is about establishing
> > > > performance when reading from a local in-memory I/O stream that I
> > > > implemented (
> > > >
> > >
> https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/MemoryIOChannel.java
> > > ).
> > > > In this case I first read data from parquet files, convert them into
> > > Arrow,
> > > > and write-out to a MemoryIOChannel, and then read back from it. So,
> the
> > > > performance has nothing to do with Crail or HDFS in the case 3.
> Once, I
> > > > establish the base performance in this setup (which is around ~40
> Gbps
> > > with
> > > > 16 cores) I will add Crail to the mix. Perhaps Crail I/O streams can
> take
> > > > ArrowBuf as src/dst buffers. That should be doable.
> > >
> > > Right, in any case what you are testing is the performance of using a
> > > particular Java accessor layer to JVM off-heap Arrow memory to sum the
> > > non-null values of each column. I'm not sure that a single bandwidth
> > > number produced by this benchmark is very informative for people
> > > contemplating what memory format to use in their system due to the
> > > current state of the implementation (Java) and workload measured
> > > (summing the non-null values with a naive algorithm). I would guess
> > > that a C++ version with raw pointers and a loop-unrolled, branch-free
> > > vectorized sum is going to be a lot faster.
> > >
> > > >
> > > > @Jacques -
> > > >
> > > > That is a good point that "Arrow's implementation is more focused on
> > > > interacting with the structure than transporting it". However, in any
> > > > distributed system one needs to move data/structure around - as far
> as I
> > > > understand that is another goal of the project. My investigation
> started
> > > > within the context of Spark/SQL data processing. Spark converts
> incoming
> > > > data into its own in-memory UnsafeRow representation for processing.
> So
> > > > naturally the performance of this data ingestion pipeline cannot
> > > outperform
> > > > the read performance of the used file format. I benchmarked Parquet,
> ORC,
> > > > Avro, JSON (for the specific TPC-DS store_sales table). And then
> > > curiously
> > > > benchmarked Arrow as well because its design choices are a better
> fit for
> > > > modern high-performance RDMA/NVMe/100+Gbps devices I am
> investigating.
> > > From
> > > > this point of view, I am trying to find out whether Arrow can be the file
> format
> > > > for the next generation of storage/networking devices (see Apache
> Crail
> > > > project) delivering close to the hardware speed reading/writing
> rates. As
> > > > Wes pointed out that a C++ library implementation should be memory-IO
> > > > bound, so what would it take to deliver the same performance in Java
> ;)
> > > > (and then, from across the network).
> > > >
> > > > I hope this makes sense.
> > > >
> > > > Cheers,
> > > > --
> > > > Animesh
> > > >
> > > > On Wed, Sep 19, 2018 at 6:28 PM Jacques Nadeau <ja...@apache.org>
> > > wrote:
> > > >
> > > > > My big question is what is the use case and how/what are you
> trying to
> > > > > compare? Arrow's implementation is more focused on interacting
> with the
> > > > > structure than transporting it. Generally speaking, when we're
> working
> > > with
> > > > > Arrow data we frequently are just interacting with memory
> locations and
> > > > > doing direct operations. If you have a layer that supports that
> type of
> > > > > semantic, create a movement technique that depends on that. Arrow
> > > doesn't
> > > > > force a particular API since the data itself is defined by its
> > > in-memory
> > > > > layout so if you have a custom use or pattern, just work with the
> > > in-memory
> > > > > structures.
> > > > >
> > > > >
> > > > >
> > > > > On Wed, Sep 19, 2018 at 7:49 AM Wes McKinney <we...@gmail.com>
> > > wrote:
> > > > >
> > > > > > hi Animesh,
> > > > > >
> > > > > > Per Johan's comments, the C++ library is essentially going to be
> > > > > > IO/memory bandwidth bound since you're interacting with raw
> pointers.
> > > > > >
> > > > > > I'm looking at your code
> > > > > >
> > > > > > private void consumeFloat4(FieldVector fv) {
> > > > > >     Float4Vector accessor = (Float4Vector) fv;
> > > > > >     int valCount = accessor.getValueCount();
> > > > > >     for(int i = 0; i < valCount; i++){
> > > > > >         if(!accessor.isNull(i)){
> > > > > >             float4Count+=1;
> > > > > >             checksum+=accessor.get(i);
> > > > > >         }
> > > > > >     }
> > > > > > }
> > > > > >
> > > > > > You'll want to get a Java-Arrow expert from Dremio to advise you
> the
> > > > > > fastest way to iterate over this data -- my understanding is that
> > > much
> > > > > > code in Dremio interacts with the wrapped Netty ArrowBuf objects
> > > > > > rather than going through the higher level APIs. You're also
> dropping
> > > > > > performance because memory mapping is not yet implemented in
> Java,
> > > see
> > > > > > https://issues.apache.org/jira/browse/ARROW-3191.
> > > > > >
> > > > > > Furthermore, the IPC reader class you are using could be made
> more
> > > > > > efficient. I described the problem in
> > > > > > https://issues.apache.org/jira/browse/ARROW-3192 -- this will be
> > > > > > required as soon as we have the ability to do memory mapping in
> Java
> > > > > >
> > > > > > Could Crail use the Arrow data structures in its runtime rather
> than
> > > > > > copying? If not, how are Crail's runtime data structures
> different?
> > > > > >
> > > > > > - Wes
> > > > > > On Wed, Sep 19, 2018 at 9:19 AM Johan Peltenburg - EWI
> > > > > > <J....@tudelft.nl> wrote:
> > > > > > >
> > > > > > > Hello Animesh,
> > > > > > >
> > > > > > >
> > > > > > > I browsed a bit in your sources, thanks for sharing. We have
> > > performed
> > > > > > some similar measurements to your third case in the past for
> C/C++ on
> > > > > > collections of various basic types such as primitives and
> strings.
> > > > > > >
> > > > > > >
> > > > > > > I can say that in terms of consuming data from the Arrow format
> > > versus
> > > > > > language native collections in C++, I've so far always been able
> to
> > > get
> > > > > > similar performance numbers (e.g. no drawbacks due to the Arrow
> > > format
> > > > > > itself). Especially when accessing the data through Arrow's raw
> data
> > > > > > pointers (and using for example std::string_view-like
> constructs).
> > > > > > >
> > > > > > > In C/C++ the fast data structures are engineered in such a way
> > > that as
> > > > > > little pointer traversals are required and they take up an as
> small
> > > as
> > > > > > possible memory footprint. Thus each memory access is relatively
> > > > > efficient
> > > > > > (in terms of obtaining the data of interest). The same can
> > > absolutely be
> > > > > > said for Arrow, if not even more efficient in some cases where
> object
> > > > > > fields are of variable length.
> > > > > > >
> > > > > > >
> > > > > > > In the JVM case, the Arrow data is stored off-heap. This
> requires
> > > the
> > > > > > JVM to interface to it through some calls to Unsafe hidden under
> the
> > > > > Netty
> > > > > > layer (but please correct me if I'm wrong, I'm not an expert on
> > > this).
> > > > > > Those calls are the only reason I can think of that would
> degrade the
> > > > > performance a bit compared to a pure Java case. I don't know if
> the
> > > > > Unsafe
> > > > > > calls are inlined during JIT compilation. If they aren't they
> will
> > > > > increase
> > > > > > access latency to any data a little bit.
> > > > > > >
> > > > > > >
> > > > > > > I don't have a similar machine so it's not easy to relate my
> > > numbers to
> > > > > > yours, but if you can get that data consumed with 100 Gbps in a
> pure
> > > Java
> > > > > > case, I don't see any reason (resulting from Arrow format /
> off-heap
> > > > > > storage) why you wouldn't be able to get at least really close.
> Can
> > > you
> > > > > get
> > > > > > to 100 Gbps starting from primitive arrays in Java with your
> > > consumption
> > > > > > functions in the first place?
> > > > > > >
> > > > > > >
> > > > > > > I'm interested to see your progress on this.
> > > > > > >
> > > > > > >
> > > > > > > Kind regards,
> > > > > > >
> > > > > > >
> > > > > > > Johan Peltenburg
> > > > > > >
> > > > > > > ________________________________
> > > > > > > From: Animesh Trivedi <an...@gmail.com>
> > > > > > > Sent: Wednesday, September 19, 2018 2:08:50 PM
> > > > > > > To: dev@arrow.apache.org; dev@crail.apache.org
> > > > > > > Subject: [JAVA] Arrow performance measurement
> > > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > A week ago, Wes and I had a discussion about the performance
> of the
> > > > > > > Arrow/Java implementation on the Apache Crail (Incubating)
> mailing
> > > > > list (
> > > > > > >
> > > http://mail-archives.apache.org/mod_mbox/crail-dev/201809.mbox/browser
> > > > > ).
> > > > > > In
> > > > > > > a nutshell: I am investigating the performance of various file
> > > formats
> > > > > > > (including Arrow) on high-performance NVMe and
> RDMA/100Gbps/RoCE
> > > > > setups.
> > > > > > I
> > > > > > > benchmarked how long it takes to materialize values (ints,
> > > longs,
> > > > > > > doubles) of the store_sales table, the largest table in the
> TPC-DS
> > > > > > dataset
> > > > > > > stored on different file formats. Here is a write-up on this -
> > > > > > > https://crail.incubator.apache.org/blog/2018/08/sql-p1.html. I
> > > found
> > > > > > that
> > > > > > > between a pair of machines connected over a 100 Gbps link, Arrow
> > > (used
> > > > > > as a
> > > > > > > file format on HDFS) delivered close to ~30 Gbps of bandwidth
> with
> > > all
> > > > > 16
> > > > > > > cores engaged. Wes pointed out that (i) Arrow is an in-memory IPC
> > > format,
> > > > > > and
> > > > > > > has not been optimized for storage interfaces/APIs like HDFS;
> (ii)
> > > the
> > > > > > > performance I am measuring is for the Java implementation.
> > > > > > >
> > > > > > > Wes, I hope I summarized our discussion correctly.
> > > > > > >
> > > > > > > That brings us to this email where I promised to follow up on
> the
> > > Arrow
> > > > > > > mailing list to understand and optimize the performance of
> > > Arrow/Java
> > > > > > > implementation on high-performance devices. I wrote a small
> > > stand-alone
> > > > > > > benchmark (
> https://github.com/animeshtrivedi/benchmarking-arrow)
> > > with
> > > > > > three
> > > > > > > implementations of WritableByteChannel, SeekableByteChannel
> > > interfaces:
> > > > > > >
> > > > > > > 1. Arrow data is stored in HDFS/tmpfs - this gives me ~30 Gbps
> > > > > > performance
> > > > > > > 2. Arrow data is stored in Crail/DRAM - this gives me ~35-36
> Gbps
> > > > > > > performance
> > > > > > > 3. Arrow data is stored in on-heap byte[] - this gives me ~39
> Gbps
> > > > > > > performance
> > > > > > >
> > > > > > > I think the order makes sense. To better understand the
> > > performance of
> > > > > > > Arrow/Java we can focus on the option 3.
> > > > > > >
> > > > > > > The key question I am trying to answer is "what would it take
> for
> > > > > > > Arrow/Java to deliver 100+ Gbps of performance"? Is it
> possible? If
> > > > > yes,
> > > > > > > then what is missing/or mis-interpreted by me? If not, then
> where
> > > is
> > > > > the
> > > > > > > performance lost? Does anyone have any performance measurements
> > > for C++
> > > > > > > implementation? if they have seen/expect better numbers.
> > > > > > >
> > > > > > > As a next step, I will profile the read path of Arrow/Java for
> the
> > > > > option
> > > > > > > 3. I will report my findings.
> > > > > > >
> > > > > > > Any thoughts and feedback on this investigation are very
> welcome.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > --
> > > > > > > Animesh
> > > > > > >
> > > > > > > PS~ Cross-posting on the dev@crail.apache.org list as well.
> > > > > >
> > > > >
> > >
>

Re: [JAVA] Arrow performance measurement

Posted by Wes McKinney <we...@gmail.com>.
> 3. Are there examples of Arrow in C++ read/write code that I can have a look?

What kind of code are you looking for? I would direct you to relevant
unit tests that exhibit certain functionality, but it depends on what
you are trying to do
On Wed, Oct 3, 2018 at 9:45 AM Animesh Trivedi
<an...@gmail.com> wrote:
>
> Hi all - quick update on the performance investigation:
>
> - I spent some time looking at the performance profile for a binary blob
> column (1024 bytes of byte[]) and found a few favorable settings for
> delivering up to 168 Gbps from the in-memory reading benchmark on 16 cores.
> These settings (NUMA, JVM settings, Arrow holder API, batch size, etc.) are
> documented here:
> https://github.com/animeshtrivedi/blog/blob/master/post/2018-10-03-arrow-binary.md
> - these settings also helped to improve the last number that I reported (but
> not by much) for the in-memory TPC-DS store_sales table: from ~39 Gbps up to
> ~45-47 Gbps (note: this number is just an in-memory benchmark, i.e., w/o any
> networking or storage links)
>
> A few follow-up questions that I have:
> 1. Arrow reads a batch size worth of data in one go. Are there any
> recommended batch sizes? In my investigation, small batch sizes help with a
> better cache profile but increase the number of instructions required (more
> looping). Larger ones do the opposite. Somehow ~10MB/thread seems to be the
> best performing configuration, which is also a bit counter-intuitive, as for
> 16 threads this leads to a 160 MB memory footprint. Maybe this is also tied
> to the memory management logic, which is my next question.
> 2. Arrow uses Netty's memory manager. (i) What are decent Netty memory
> management settings for the "io.netty.allocator.*" parameters? I don't find
> any decent write-up on them; (ii) Is there a provision for an ArrowBuf being
> re-used once a batch is consumed? As it looks for now, each read allocates a
> new buffer for the whole batch.
> 3. Are there examples of Arrow C++ read/write code that I can have a look
> at?
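
Regarding question 2(i): Netty's pooled allocator reads its tuning knobs from
JVM system properties at class-initialization time, so they are normally
passed as -D flags on the java command line. The property names below are the
standard Netty 4.x ones; whether Arrow's wrapped allocator honors all of them
is an assumption here, and the values are placeholders to show the shape of
the configuration rather than recommendations:

    java -Dio.netty.allocator.numDirectArenas=16 \
         -Dio.netty.allocator.pageSize=8192 \
         -Dio.netty.allocator.maxOrder=11 \
         -Dio.netty.allocator.useCacheForAllThreads=true \
         -cp ... <benchmark main class>

The pool's chunk size is roughly pageSize * 2^maxOrder (8 KB * 2^11 = 16 MB
with the defaults), and allocations larger than a chunk bypass the arenas, so
these knobs interact with the per-thread batch-size question above.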
>
> Cheers,
> --
> Animesh
>
>
> On Wed, Sep 19, 2018 at 8:49 PM Wes McKinney <we...@gmail.com> wrote:
>
> > On Wed, Sep 19, 2018 at 2:13 PM Animesh Trivedi
> > <an...@gmail.com> wrote:
> > >
> > > Hi Johan, Wes, and Jacques - many thanks for your comments:
> > >
> > > @Johan -
> > > 1. I also do not suspect that there is any inherent drawback in Java or
> > C++
> > > due to the Arrow format. I mentioned C++ because Wes pointed out that
> > Java
> > > routines are not the most optimized ones (yet!). And naturally one would
> > > expect better performance in a native language with all
> > pointer/memory/SIMD
> > > instruction optimizations that you mentioned. As far as I know, the
> > > off-heap buffers are managed in ArrowBuf which implements an abstract
> > netty
> > > class. But there is nothing unusual, i.e., netty specific, about these
> > > unsafe routines, they are used by many projects. Though there is cost
> > > associated with materializing on-heap Java values from off-heap memory
> > > regions. I need to benchmark that more carefully.
> > >
> > > 2. When you say "I've so far always been able to get similar performance
> > > numbers" - do you mean the same performance of my case 3 where 16 cores
> > > drive close to 40 Gbps, or the same performance between your C++ and Java
> > > benchmarks. Do you have some write-up? I would be interested to read up
> > :)
> > >
> > > 3. "Can you get to 100 Gbps starting from primitive arrays in Java" ->
> > that
> > > is a good idea. Let me try and report back.
> > >
> > > @Wes -
> > >
> > > Is there some benchmark template for C++ routines I can have a look? I
> > > would be happy to get some input from Java-Arrow experts on how to write
> > > these benchmarks more efficiently. I will have a closer look at the JIRA
> > > tickets that you mentioned.
> > >
> > > So, for now I am focusing on case 3, which is about establishing
> > > performance when reading from a local in-memory I/O stream that I
> > > implemented (
> > > https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/MemoryIOChannel.java
> > > ). In this case I first read data from parquet files, convert them into
> > > Arrow, write them out to a MemoryIOChannel, and then read back from it.
> > > So, the performance has nothing to do with Crail or HDFS in case 3.
> > > Once I establish the base performance in this setup (which is around
> > > ~40 Gbps with 16 cores) I will add Crail to the mix. Perhaps Crail I/O
> > > streams can take ArrowBuf as src/dst buffers. That should be doable.
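
The MemoryIOChannel linked above is the actual implementation used in the
benchmark; the sketch below is only a minimal illustration of the idea, a
SeekableByteChannel backed by a plain byte[] so that Arrow IPC reads never
touch the network or a file system (class name and bodies here are simplified
and are not the benchmark's code):

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.SeekableByteChannel;

    // Illustrative only: a byte[]-backed, read-only channel.
    public class InMemoryChannel implements SeekableByteChannel {
        private final byte[] data;
        private long position;
        private boolean open = true;

        public InMemoryChannel(byte[] data) { this.data = data; }

        @Override
        public int read(ByteBuffer dst) {
            if (position >= data.length) return -1;                    // EOF
            int n = (int) Math.min(dst.remaining(), data.length - position);
            dst.put(data, (int) position, n);
            position += n;
            return n;
        }

        @Override
        public int write(ByteBuffer src) throws IOException {
            throw new IOException("read-only channel");  // write side omitted
        }

        @Override
        public long position() { return position; }

        @Override
        public SeekableByteChannel position(long newPosition) {
            this.position = newPosition;
            return this;
        }

        @Override
        public long size() { return data.length; }

        @Override
        public SeekableByteChannel truncate(long size) {
            throw new UnsupportedOperationException();
        }

        @Override
        public boolean isOpen() { return open; }

        @Override
        public void close() { open = false; }
    }

The write side and resizing are omitted; the point is simply that the Arrow
reader sees an ordinary channel while all bytes stay in memory.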
> >
> > Right, in any case what you are testing is the performance of using a
> > particular Java accessor layer to JVM off-heap Arrow memory to sum the
> > non-null values of each column. I'm not sure that a single bandwidth
> > number produced by this benchmark is very informative for people
> > contemplating what memory format to use in their system due to the
> > current state of the implementation (Java) and workload measured
> > (summing the non-null values with a naive algorithm). I would guess
> > that a C++ version with raw pointers and a loop-unrolled, branch-free
> > vectorized sum is going to be a lot faster.
> >
> > >
> > > @Jacques -
> > >
> > > That is a good point that "Arrow's implementation is more focused on
> > > interacting with the structure than transporting it". However, in any
> > > distributed system one needs to move data/structure around - as far as I
> > > understand that is another goal of the project. My investigation started
> > > within the context of Spark/SQL data processing. Spark converts incoming
> > > data into its own in-memory UnsafeRow representation for processing. So
> > > naturally the performance of this data ingestion pipeline cannot
> > outperform
> > > the read performance of the used file format. I benchmarked Parquet, ORC,
> > > Avro, JSON (for the specific TPC-DS store_sales table). And then
> > curiously
> > > benchmarked Arrow as well because its design choices are a better fit
> > > for the modern high-performance RDMA/NVMe/100+Gbps devices I am
> > > investigating. From this point of view, I am trying to find out whether
> > > Arrow can be the file format for the next generation of
> > > storage/networking devices (see the Apache Crail project), delivering
> > > close to hardware-speed reading/writing rates. As Wes pointed out, a C++
> > > library implementation should be memory-IO bound, so what would it take
> > > to deliver the same performance in Java ;) (and then, from across the
> > > network).
> > >
> > > I hope this makes sense.
> > >
> > > Cheers,
> > > --
> > > Animesh
> > >
> > > On Wed, Sep 19, 2018 at 6:28 PM Jacques Nadeau <ja...@apache.org>
> > wrote:
> > >
> > > > My big question is what is the use case and how/what are you trying to
> > > > compare? Arrow's implementation is more focused on interacting with the
> > > > structure than transporting it. Generally speaking, when we're working
> > with
> > > > Arrow data we frequently are just interacting with memory locations and
> > > > doing direct operations. If you have a layer that supports that type of
> > > > semantic, create a movement technique that depends on that. Arrow
> > doesn't
> > > > force a particular API since the data itself is defined by its
> > in-memory
> > > > layout so if you have a custom use or pattern, just work with the
> > in-memory
> > > > structures.
> > > >
> > > >
> > > >
> > > > On Wed, Sep 19, 2018 at 7:49 AM Wes McKinney <we...@gmail.com>
> > wrote:
> > > >
> > > > > hi Animesh,
> > > > >
> > > > > Per Johan's comments, the C++ library is essentially going to be
> > > > > IO/memory bandwidth bound since you're interacting with raw pointers.
> > > > >
> > > > > I'm looking at your code
> > > > >
> > > > > private void consumeFloat4(FieldVector fv) {
> > > > >     Float4Vector accessor = (Float4Vector) fv;
> > > > >     int valCount = accessor.getValueCount();
> > > > >     for(int i = 0; i < valCount; i++){
> > > > >         if(!accessor.isNull(i)){
> > > > >             float4Count+=1;
> > > > >             checksum+=accessor.get(i);
> > > > >         }
> > > > >     }
> > > > > }
> > > > >
> > > > > You'll want to get a Java-Arrow expert from Dremio to advise you on the
> > > > > fastest way to iterate over this data -- my understanding is that
> > much
> > > > > code in Dremio interacts with the wrapped Netty ArrowBuf objects
> > > > > rather than going through the higher level APIs. You're also dropping
> > > > > performance because memory mapping is not yet implemented in Java,
> > see
> > > > > https://issues.apache.org/jira/browse/ARROW-3191.
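
As a rough illustration of that lower-level path, a sketch against a recent
Arrow Java API (not the benchmark's code; in the 0.x releases ArrowBuf lived
under io.netty.buffer and took int indices before moving to
org.apache.arrow.memory) might look like:

    import org.apache.arrow.memory.ArrowBuf;
    import org.apache.arrow.vector.BitVectorHelper;
    import org.apache.arrow.vector.Float4Vector;

    // Variant of consumeFloat4() that walks the validity and data ArrowBufs
    // directly instead of calling isNull()/get() per element.
    static double sumFloat4Direct(Float4Vector vector) {
        int valCount = vector.getValueCount();
        ArrowBuf validity = vector.getValidityBuffer();
        ArrowBuf data = vector.getDataBuffer();
        double checksum = 0;
        for (int i = 0; i < valCount; i++) {
            // BitVectorHelper.get() returns 1 when the slot is set (non-null).
            if (BitVectorHelper.get(validity, i) == 1) {
                checksum += data.getFloat((long) i * Float4Vector.TYPE_WIDTH);
            }
        }
        return checksum;
    }

Whether this is actually faster than the accessor path depends on how well
the JIT inlines the checks inside ArrowBuf, which is exactly the kind of
thing the read-path profile should show.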
> > > > >
> > > > > Furthermore, the IPC reader class you are using could be made more
> > > > > efficient. I described the problem in
> > > > > https://issues.apache.org/jira/browse/ARROW-3192 -- this will be
> > > > > required as soon as we have the ability to do memory mapping in Java
> > > > >
> > > > > Could Crail use the Arrow data structures in its runtime rather than
> > > > > copying? If not, how are Crail's runtime data structures different?
> > > > >
> > > > > - Wes
> > > > > On Wed, Sep 19, 2018 at 9:19 AM Johan Peltenburg - EWI
> > > > > <J....@tudelft.nl> wrote:
> > > > > >
> > > > > > Hello Animesh,
> > > > > >
> > > > > >
> > > > > > I browsed a bit in your sources, thanks for sharing. We have
> > performed
> > > > > some similar measurements to your third case in the past for C/C++ on
> > > > > collections of various basic types such as primitives and strings.
> > > > > >
> > > > > >
> > > > > > I can say that in terms of consuming data from the Arrow format
> > versus
> > > > > language native collections in C++, I've so far always been able to
> > get
> > > > > similar performance numbers (e.g. no drawbacks due to the Arrow
> > format
> > > > > itself). Especially when accessing the data through Arrow's raw data
> > > > > pointers (and using for example std::string_view-like constructs).
> > > > > >
> > > > > > In C/C++ the fast data structures are engineered in such a way
> > > > > > that as few pointer traversals as possible are required and they
> > > > > > take up as small a memory footprint as possible. Thus each memory
> > > > > > access is relatively efficient (in terms of obtaining the data of
> > > > > > interest). The same can absolutely be said for Arrow, if not even
> > > > > > more efficient in some cases where object fields are of variable
> > > > > > length.
> > > > > >
> > > > > >
> > > > > > In the JVM case, the Arrow data is stored off-heap. This requires
> > the
> > > > > JVM to interface to it through some calls to Unsafe hidden under the
> > > > Netty
> > > > > layer (but please correct me if I'm wrong, I'm not an expert on
> > this).
> > > > > Those calls are the only reason I can think of that would degrade the
> > > > > performance a bit compared to a pure Java case. I don't know if the
> > > > Unsafe
> > > > > calls are inlined during JIT compilation. If they aren't they will
> > > > increase
> > > > > access latency to any data a little bit.
> > > > > >
> > > > > >
> > > > > > I don't have a similar machine so it's not easy to relate my
> > numbers to
> > > > > yours, but if you can get that data consumed at 100 Gbps in a pure
> > Java
> > > > > case, I don't see any reason (resulting from Arrow format / off-heap
> > > > > storage) why you wouldn't be able to get at least really close. Can
> > you
> > > > get
> > > > > to 100 Gbps starting from primitive arrays in Java with your
> > consumption
> > > > > functions in the first place?
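
For comparison, the kind of pure-Java baseline Johan is suggesting could look
roughly like the sketch below (the array and bitmap layout here are chosen
for illustration; the benchmark repository may shape them differently):

    // Same per-value work as consumeFloat4(), but over plain Java arrays:
    // 'values' holds the floats, 'validity' is a packed bitmap with one bit
    // per value, bit set = non-null.
    static double sumPrimitiveBaseline(float[] values, long[] validity) {
        double checksum = 0;
        for (int i = 0; i < values.length; i++) {
            if (((validity[i >>> 6] >>> (i & 63)) & 1L) != 0) {
                checksum += values[i];
            }
        }
        return checksum;
    }

If this loop saturates memory bandwidth on the same machine, the gap to the
Arrow accessor path gives a rough upper bound on what the Java API layer
itself is costing.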
> > > > > >
> > > > > >
> > > > > > I'm interested to see your progress on this.
> > > > > >
> > > > > >
> > > > > > Kind regards,
> > > > > >
> > > > > >
> > > > > > Johan Peltenburg
> > > > > >
> > > > > > ________________________________
> > > > > > From: Animesh Trivedi <an...@gmail.com>
> > > > > > Sent: Wednesday, September 19, 2018 2:08:50 PM
> > > > > > To: dev@arrow.apache.org; dev@crail.apache.org
> > > > > > Subject: [JAVA] Arrow performance measurement
> > > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > A week ago, Wes and I had a discussion about the performance of the
> > > > > > Arrow/Java implementation on the Apache Crail (Incubating) mailing
> > > > list (
> > > > > >
> > http://mail-archives.apache.org/mod_mbox/crail-dev/201809.mbox/browser
> > > > ).
> > > > > In
> > > > > > a nutshell: I am investigating the performance of various file
> > formats
> > > > > > (including Arrow) on high-performance NVMe and RDMA/100Gbps/RoCE
> > > > setups.
> > > > > I
> > > > > > benchmarked how long it takes to materialize values (ints,
> > longs,
> > > > > > doubles) of the store_sales table, the largest table in the TPC-DS
> > > > > dataset
> > > > > > stored on different file formats. Here is a write-up on this -
> > > > > > https://crail.incubator.apache.org/blog/2018/08/sql-p1.html. I
> > > > > > found that between a pair of machines connected over a 100 Gbps
> > > > > > link, Arrow (used as a file format on HDFS) delivered close to
> > > > > > ~30 Gbps of bandwidth with all 16 cores engaged. Wes pointed out
> > > > > > that (i) Arrow is an in-memory IPC format, and has not been
> > > > > > optimized for storage interfaces/APIs like HDFS; (ii) the
> > > > > > performance I am measuring is for the Java implementation.
> > > > > >
> > > > > > Wes, I hope I summarized our discussion correctly.
> > > > > >
> > > > > > That brings us to this email where I promised to follow up on the
> > Arrow
> > > > > > mailing list to understand and optimize the performance of
> > Arrow/Java
> > > > > > implementation on high-performance devices. I wrote a small
> > stand-alone
> > > > > > benchmark (https://github.com/animeshtrivedi/benchmarking-arrow)
> > with
> > > > > three
> > > > > > implementations of WritableByteChannel, SeekableByteChannel
> > interfaces:
> > > > > >
> > > > > > 1. Arrow data is stored in HDFS/tmpfs - this gives me ~30 Gbps
> > > > > performance
> > > > > > 2. Arrow data is stored in Crail/DRAM - this gives me ~35-36 Gbps
> > > > > > performance
> > > > > > 3. Arrow data is stored in on-heap byte[] - this gives me ~39 Gbps
> > > > > > performance
> > > > > >
> > > > > > I think the order makes sense. To better understand the
> > > > > > performance of Arrow/Java we can focus on option 3.
> > > > > >
> > > > > > The key question I am trying to answer is "what would it take for
> > > > > > Arrow/Java to deliver 100+ Gbps of performance"? Is it possible?
> > > > > > If yes, then what is missing or mis-interpreted by me? If not,
> > > > > > then where is the performance lost? Does anyone have performance
> > > > > > measurements for the C++ implementation, if they have seen/expect
> > > > > > better numbers?
> > > > > >
> > > > > > As a next step, I will profile the read path of Arrow/Java for
> > > > > > option 3. I will report my findings.
> > > > > >
> > > > > > Any thoughts and feedback on this investigation are very welcome.
> > > > > >
> > > > > > Cheers,
> > > > > > --
> > > > > > Animesh
> > > > > >
> > > > > > PS~ Cross-posting on the dev@crail.apache.org list as well.
> > > > >
> > > >
> >
