Posted to dev@crail.apache.org by Animesh Trivedi <an...@gmail.com> on 2018/11/26 13:23:58 UTC

Re: [JAVA] Arrow performance measurement

Hi all,

Apologies for the silence, I have been occupied. Here is part 3 of the
investigation. This time, I compared the performance of the Java and C++
readers. The benchmark reads and checksums 10 billion integers (~40 GB of
data, with some null values) on a single core. You can read the full
investigation here:
https://github.com/animeshtrivedi/blog/blob/master/post/2018-11-22-arrow-cpp.md.
The best numbers that I have are (see the table at the end of the blog for
more details):

C++ (file)    : 21.5 Gbps
C++ (mmap)    : 30.4 Gbps
Java (unsafe) : 12.48 Gbps

The key insights are:

* The C++ code (with file I/O) is generally 2x faster than the Java code. I
have narrowed the difference down, via profiling, to JITing issues, memory
management, etc. Most of these issues are independent of Arrow (as far as I
can tell). The standalone benchmark to debug these issues is here:
https://github.com/animeshtrivedi/java-cpp-fun. The workload pattern is:
(1) you have a large integer array and a bitmap array; (2) you go over the
array, check for NULL, and sum the non-null values (a rough sketch of this
pattern follows the list below). I broadly understand why the Java code is
slow, but what I want to know is how to make it as fast as, or at least
comparable to, the C++ code. Any input on this is appreciated.

* The C++ code has a no-null-values fast-path branch around the IsValid()
check. The Java code could benefit from the same optimization, but it is not
implemented there yet. I will open a pull request for this.

* The C++ bitmap code can be optimized further by using unsigned integers
instead of "int64_t" for the bitmap checks, and by eliminating the kBitmap.
See https://godbolt.org/z/deq0_q and compare the size of the assembly code;
the performance measurements in the blog show up to 50% performance gains.
Alternatively, if the signed-to-unsigned switch is not possible (perhaps not
in every language), then the C++ code should use the bitwise operations
directly (`>> 3` for division by 8, and `& 0x7` for modulo by 8) instead of
`/` and `%`.
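
To make this concrete, below is a minimal, stand-alone C++ sketch of the loop
I keep referring to. It is not the Arrow code itself - the buffer layout, the
variable names, and the way the null count is obtained are simplified
assumptions for illustration - but it shows the sum-with-null-check workload
from the first point, the no-nulls fast path from the second point, and the
`>> 3` / `& 0x7` bitmap indexing from the third point:

#include <cstdint>
#include <iostream>
#include <vector>

// Check whether bit `i` of an LSB-ordered validity bitmap is set.
// `i >> 3` is the byte index (division by 8) and `i & 0x7` is the bit
// offset within that byte (modulo by 8) - no `/` or `%`, and the index
// is unsigned.
static inline bool IsSet(const uint8_t* bitmap, uint64_t i) {
  return (bitmap[i >> 3] >> (i & 0x7)) & 1;
}

int main() {
  const uint64_t n = 1000000;                      // stand-in for the 10B-value run
  std::vector<int32_t> values(n, 1);               // the value buffer
  std::vector<uint8_t> bitmap((n + 7) / 8, 0xFF);  // the validity buffer
  bitmap[0] = 0xFE;                                // mark value 0 as null
  uint64_t null_count = 1;                         // assumed to come from batch metadata

  int64_t checksum = 0;
  if (null_count == 0) {
    // No-nulls fast path: skip the per-value validity check entirely.
    for (uint64_t i = 0; i < n; i++) checksum += values[i];
  } else {
    // General path: consult the bitmap before consuming each value.
    for (uint64_t i = 0; i < n; i++) {
      if (IsSet(bitmap.data(), i)) checksum += values[i];
    }
  }
  std::cout << "checksum = " << checksum << std::endl;
  return 0;
}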

Thanks,
--
Animesh


On Thu, Oct 11, 2018 at 6:55 PM Animesh Trivedi <an...@gmail.com>
wrote:

> Hi Li - thanks for setting up the jiras. I'll prepare the pull requests
> soon.
>
> Hi Wes, I think it is a good idea to write a blog about performance,
> pointing to the improvements in the code base (fixes and
> microbenchmarks). I can prepare a first draft as we go along. This will
> definitely serve the large Java community well.
>
> Cheers,
> --
> Animesh
>
>
> On Thu, 11 Oct 2018, 17:55 Li Jin, <ic...@gmail.com> wrote:
>
>> I have created these as the first step. Animesh, feel free to submit PR
>> for
>> these. I will look into your micro benchmarks soon.
>>
>>
>>
>>    1. ARROW-3497 [Java] Add user documentation for achieving better
>>    performance <https://jira.apache.org/jira/browse/ARROW-3497>
>>    2. ARROW-3496 [Java] Add microbenchmark code to Java
>>    <https://jira.apache.org/jira/browse/ARROW-3496>
>>    3. ARROW-3495 [Java] Optimize bit operations performance
>>    <https://jira.apache.org/jira/browse/ARROW-3495>
>>    4. ARROW-3493 [Java] Document BOUNDS_CHECKING_ENABLED
>>    <https://jira.apache.org/jira/browse/ARROW-3493>
>>
>>
>> On Thu, Oct 11, 2018 at 10:00 AM Li Jin <ic...@gmail.com> wrote:
>>
>> > Hi Wes and Animesh,
>> >
>> > Thanks for the analysis and discussion. I am happy to look into
>> > this. I will create some JIRAs soon.
>> >
>> > Li
>> >
>> > On Thu, Oct 11, 2018 at 5:49 AM Wes McKinney <we...@gmail.com>
>> wrote:
>> >
>> >> hey Animesh,
>> >>
>> >> Thank you for doing this analysis. If you'd like to share some of the
>> >> analysis more broadly e.g. on the Apache Arrow blog or social media,
>> >> let us know.
>> >>
>> >> Seems like there might be a few follow ups here for the Arrow Java
>> >> community:
>> >>
>> >> * Documentation about achieving better performance
>> >> * Writing some microperformance benchmarks
>> >> * Making some improvements to the code to facilitate better performance
>> >>
>> >> Feel free to create some JIRA issues. Are any Java developers
>> >> interested in digging a little more into these issues?
>> >>
>> >> Thanks,
>> >> Wes
>> >> On Tue, Oct 9, 2018 at 7:18 AM Animesh Trivedi
>> >> <an...@gmail.com> wrote:
>> >> >
>> >> > Hi Wes and all,
>> >> >
>> >> > Here is another round of updates:
>> >> >
>> >> > Quick recap - previously we established that for 1kB binary blobs,
>> Arrow
>> >> > can deliver > 160 Gbps performance from in-memory buffers.
>> >> >
>> >> > In this round I looked at the performance of materializing
>> "integers".
>> >> In
>> >> > my benchmarks, I found that with careful
>> optimizations/code-rewriting we
>> >> > can push the performance of integer reading from 5.42 Gbps/core to
>> 13.61
>> >> > Gbps/core (~2.5x). The peak performance with 16 cores scales up to
>> >> > 110+ Gbps. The key things to do are:
>> >> >
>> >> > 1) Disable memory access checks in Arrow and Netty buffers. This gave
>> >> > significant performance boost. However, for such an important
>> >> performance
>> >> > flag, it is very poorly documented
>> >> > ("drill.enable_unsafe_memory_access=true").
>> >> >
>> >> > 2) Materialize values from Validity and Value direct buffers instead
>> of
>> >> > calling getInt() function on the IntVector. This is implemented as a
>> new
>> >> > Unsafe reader type (
>> >> >
>> >>
>> https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/ArrowReaderUnsafe.java#L31
>> >> > )
>> >> >
>> >> > 3) Optimize bitmap operation to check if a bit is set or not (
>> >> >
>> >>
>> https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/ArrowReaderUnsafe.java#L23
>> >> > )
>> >> >
>> >> > A detailed write up of these steps is available here:
>> >> >
>> >>
>> https://github.com/animeshtrivedi/blog/blob/master/post/2018-10-09-arrow-int.md
>> >> >
>> >> > I have 2 follow-up questions:
>> >> >
>> >> > 1) Regarding the `isSet` function, why does it have to calculate the
>> >> > number of bits set? (
>> >> >
>> >>
>> https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java#L797
>> >> ).
>> >> > Wouldn't just checking if the result of the AND operation is zero or
>> >> not be
>> >> > sufficient? Like what I did :
>> >> >
>> >>
>> https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/ArrowReaderUnsafe.java#L28
>> >> >
>> >> >
>> >> > 2) What is the reason behind this bitmap generation optimization here
>> >> >
>> >>
>> https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/BitVectorHelper.java#L179
>> >> > ? At this point when this function is called, the bitmap vector is
>> >> already
>> >> > read from the storage, and contains the right values (either all
>> null,
>> >> all
>> >> > set, or whatever). Generating this mask here for the special cases
>> when
>> >> the
>> >> > values are all NULL or all set (this was the case in my benchmark),
>> can
>> >> be
>> >> > slower than just returning what one has read from the storage.
>> >> >
>> >> > Collectively, optimizing these two bitmap operations gives more than 1
>> >> > Gbps of gains in my benchmarking code.
>> >> >
>> >> > Cheers,
>> >> > --
>> >> > Animesh
>> >> >
>> >> >
>> >> > On Thu, Oct 4, 2018 at 12:52 PM Wes McKinney <we...@gmail.com>
>> >> wrote:
>> >> >
>> >> > > See e.g.
>> >> > >
>> >> > >
>> >> > >
>> >>
>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/ipc-read-write-test.cc#L222
>> >> > >
>> >> > >
>> >> > > On Thu, Oct 4, 2018 at 6:48 AM Animesh Trivedi
>> >> > > <an...@gmail.com> wrote:
>> >> > > >
>> >> > > > Primarily write the same microbenchmark as I have in Java in C++
>> for
>> >> > > table
>> >> > > > reading and value materialization. So just an example of
>> equivalent
>> >> > > > ArrowFileReader example code in C++. Unit tests are a good
>> starting
>> >> > > point,
>> >> > > > thanks for the tip :)
>> >> > > >
>> >> > > > On Thu, Oct 4, 2018 at 12:39 PM Wes McKinney <
>> wesmckinn@gmail.com>
>> >> > > wrote:
>> >> > > >
>> >> > > > > > 3. Are there examples of Arrow in C++ read/write code that I
>> can
>> >> > > have a
>> >> > > > > look?
>> >> > > > >
>> >> > > > > What kind of code are you looking for? I would direct you to
>> >> relevant
>> >> > > > > unit tests that exhibit certain functionality, but it depends
>> on
>> >> what
>> >> > > > > you are trying to do
>> >> > > > > On Wed, Oct 3, 2018 at 9:45 AM Animesh Trivedi
>> >> > > > > <an...@gmail.com> wrote:
>> >> > > > > >
>> >> > > > > > Hi all - quick update on the performance investigation:
>> >> > > > > >
>> >> > > > > > - I spent some time looking at performance profile for a
>> binary
>> >> blob
>> >> > > > > column
>> >> > > > > > (1024 bytes of byte[]) and found a few favorable settings for
>> >> > > delivering
>> >> > > > > up
>> >> > > > > > to 168 Gbps from in-memory reading benchmark on 16 cores.
>> These
>> >> > > settings
>> >> > > > > > (NUMA, JVM settings, Arrow holder API, and batch size, etc.)
>> are
>> >> > > > > documented
>> >> > > > > > here:
>> >> > > > > >
>> >> > > > >
>> >> > >
>> >>
>> https://github.com/animeshtrivedi/blog/blob/master/post/2018-10-03-arrow-binary.md
>> >> > > > > > - these settings also helped improve the last number that I
>> >> reported
>> >> > > (but
>> >> > > > > > not by much) for the in-memory TPC-DS store_sales table from
>> ~39
>> >> > > Gbps up
>> >> > > > > to
>> >> > > > > > ~45-47 Gbps (note: this number is just in-memory benchmark,
>> >> i.e.,
>> >> > > w/o any
>> >> > > > > > networking or storage links)
>> >> > > > > >
>> >> > > > > > A few follow up questions that I have:
>> >> > > > > > 1. Arrow reads a batch size worth of data in one go. Are
>> there
>> >> any
>> >> > > > > > recommended batch sizes? In my investigation, small batch
>> size
>> >> help
>> >> > > with
>> >> > > > > a
>> >> > > > > > better cache profile but increase number of instructions
>> >> required
>> >> > > (more
>> >> > > > > > looping). Larger ones do the opposite. Somehow ~10MB/thread seems
>> to
>> >> be
>> >> > > the
>> >> > > > > best
>> >> > > > > > performing configuration, which is also a bit counter
>> intuitive
>> >> as
>> >> > > for 16
>> >> > > > > > threads this will lead to 160 MB of memory footprint. Maybe
>> >> this is
>> >> > > also
>> >> > > > > > tied to the memory management logic, which is my next
>> question.
>> >> > > > > > 2. Arrow uses Netty's memory manager. (i) what are decent
>> netty
>> >> > > memory
>> >> > > > > > management settings for "io.netty.allocator.*" parameters? I
>> >> don't
>> >> > > find
>> >> > > > > any
>> >> > > > > > decent write-up on them; (ii) Is there a provision for
>> ArrowBuf
>> >> being
>> >> > > > > > re-used once a batch is consumed? As it looks for now, read
>> read
>> >> > > > > allocates
>> >> > > > > > a new buffer to read the whole batch size.
>> >> > > > > > 3. Are there examples of Arrow in C++ read/write code that I
>> can
>> >> > > have a
>> >> > > > > > look?
>> >> > > > > >
>> >> > > > > > Cheers,
>> >> > > > > > --
>> >> > > > > > Animesh
>> >> > > > > >
>> >> > > > > >
>> >> > > > > > On Wed, Sep 19, 2018 at 8:49 PM Wes McKinney <
>> >> wesmckinn@gmail.com>
>> >> > > > > wrote:
>> >> > > > > >
>> >> > > > > > > On Wed, Sep 19, 2018 at 2:13 PM Animesh Trivedi
>> >> > > > > > > <an...@gmail.com> wrote:
>> >> > > > > > > >
>> >> > > > > > > > Hi Johan, Wes, and Jacques - many thanks for your
>> comments:
>> >> > > > > > > >
>> >> > > > > > > > @Johan -
>> >> > > > > > > > 1. I also do not suspect that there is any inherent
>> >> drawback in
>> >> > > Java
>> >> > > > > or
>> >> > > > > > > C++
>> >> > > > > > > > due to the Arrow format. I mentioned C++ because Wes
>> >> pointed out
>> >> > > that
>> >> > > > > > > Java
>> >> > > > > > > > routines are not the most optimized ones (yet!). And
>> >> naturally
>> >> > > one
>> >> > > > > would
>> >> > > > > > > > expect better performance in a native language with all
>> >> > > > > > > pointer/memory/SIMD
>> >> > > > > > > > instruction optimizations that you mentioned. As far as I
>> >> know,
>> >> > > the
>> >> > > > > > > > off-heap buffers are managed in ArrowBuf which
>> implements an
>> >> > > abstract
>> >> > > > > > > netty
>> >> > > > > > > > class. But there is nothing unusual, i.e., netty
>> specific,
>> >> about
>> >> > > > > these
>> >> > > > > > > > unsafe routines, they are used by many projects. Though
>> >> there is
>> >> > > cost
>> >> > > > > > > > associated with materializing on-heap Java values from
>> >> off-heap
>> >> > > > > memory
>> >> > > > > > > > regions. I need to benchmark that more carefully.
>> >> > > > > > > >
>> >> > > > > > > > 2. When you say "I've so far always been able to get
>> similar
>> >> > > > > performance
>> >> > > > > > > > numbers" - do you mean the same performance of my case 3
>> >> where 16
>> >> > > > > cores
>> >> > > > > > > > drive close to 40 Gbps, or the same performance between
>> >> your C++
>> >> > > and
>> >> > > > > Java
>> >> > > > > > > > benchmarks. Do you have some write-up? I would be
>> >> interested to
>> >> > > read
>> >> > > > > up
>> >> > > > > > > :)
>> >> > > > > > > >
>> >> > > > > > > > 3. "Can you get to 100 Gbps starting from primitive
>> arrays
>> >> in
>> >> > > Java"
>> >> > > > > ->
>> >> > > > > > > that
>> >> > > > > > > > is a good idea. Let me try and report back.
>> >> > > > > > > >
>> >> > > > > > > > @Wes -
>> >> > > > > > > >
>> >> > > > > > > > Is there some benchmark template for C++ routines I can
>> >> have a
>> >> > > look?
>> >> > > > > I
>> >> > > > > > > > would be happy to get some input from Java-Arrow experts
>> on
>> >> how
>> >> > > to
>> >> > > > > write
>> >> > > > > > > > these benchmarks more efficiently. I will have a closer
>> >> look at
>> >> > > the
>> >> > > > > JIRA
>> >> > > > > > > > tickets that you mentioned.
>> >> > > > > > > >
>> >> > > > > > > > So, for now I am focusing on the case 3, which is about
>> >> > > establishing
>> >> > > > > > > > performance when reading from a local in-memory I/O
>> stream
>> >> that I
>> >> > > > > > > > implemented (
>> >> > > > > > > >
>> >> > > > > > >
>> >> > > > >
>> >> > >
>> >>
>> https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/MemoryIOChannel.java
>> >> > > > > > > ).
>> >> > > > > > > > In this case I first read data from parquet files,
>> convert
>> >> them
>> >> > > into
>> >> > > > > > > Arrow,
>> >> > > > > > > > and write-out to a MemoryIOChannel, and then read back
>> from
>> >> it.
>> >> > > So,
>> >> > > > > the
>> >> > > > > > > > performance has nothing to do with Crail or HDFS in the
>> >> case 3.
>> >> > > > > Once, I
>> >> > > > > > > > establish the base performance in this setup (which is
>> >> around ~40
>> >> > > > > Gbps
>> >> > > > > > > with
>> >> > > > > > > > 16 cores) I will add Crail to the mix. Perhaps Crail I/O
>> >> streams
>> >> > > can
>> >> > > > > take
>> >> > > > > > > > ArrowBuf as src/dst buffers. That should be doable.
>> >> > > > > > >
>> >> > > > > > > Right, in any case what you are testing is the performance
>> of
>> >> > > using a
>> >> > > > > > > particular Java accessor layer to JVM off-heap Arrow memory
>> >> to sum
>> >> > > the
>> >> > > > > > > non-null values of each column. I'm not sure that a single
>> >> > > bandwidth
>> >> > > > > > > number produced by this benchmark is very informative for
>> >> people
>> >> > > > > > > contemplating what memory format to use in their system due
>> >> to the
>> >> > > > > > > current state of the implementation (Java) and workload
>> >> measured
>> >> > > > > > > (summing the non-null values with a naive algorithm). I
>> would
>> >> guess
>> >> > > > > > > that a C++ version with raw pointers and a loop-unrolled,
>> >> > > branch-free
>> >> > > > > > > vectorized sum is going to be a lot faster.
>> >> > > > > > >
>> >> > > > > > > >
>> >> > > > > > > > @Jacques -
>> >> > > > > > > >
>> >> > > > > > > > That is a good point that "Arrow's implementation is more
>> >> > > focused on
>> >> > > > > > > > interacting with the structure than transporting it".
>> >> However,
>> >> > > in any
>> >> > > > > > > > distributed system one needs to move data/structure
>> around
>> >> - as
>> >> > > far
>> >> > > > > as I
>> >> > > > > > > > understand that is another goal of the project. My
>> >> investigation
>> >> > > > > started
>> >> > > > > > > > within the context of Spark/SQL data processing. Spark
>> >> converts
>> >> > > > > incoming
>> >> > > > > > > > data into its own in-memory UnsafeRow representation for
>> >> > > processing.
>> >> > > > > So
>> >> > > > > > > > naturally the performance of this data ingestion pipeline
>> >> cannot
>> >> > > > > > > outperform
>> >> > > > > > > > the read performance of the used file format. I
>> benchmarked
>> >> > > Parquet,
>> >> > > > > ORC,
>> >> > > > > > > > Avro, JSON (for the specific TPC-DS store_sales table).
>> And
>> >> then
>> >> > > > > > > curiously
>> >> > > > > > > > benchmarked Arrow as well because its design choices are
>> a
>> >> better
>> >> > > > > fit for
>> >> > > > > > > > modern high-performance RDMA/NVMe/100+Gbps devices I am
>> >> > > > > investigating.
>> >> > > > > > > From
>> >> > > > > > > > this point of view, I am trying to find out can Arrow be
>> >> the file
>> >> > > > > format
>> >> > > > > > > > for the next generation of storage/networking devices
>> (see
>> >> Apache
>> >> > > > > Crail
>> >> > > > > > > > project) delivering close to the hardware speed
>> >> reading/writing
>> >> > > > > rates. As
>> >> > > > > > > > Wes pointed out that a C++ library implementation should
>> be
>> >> > > memory-IO
>> >> > > > > > > > bound, so what would it take to deliver the same
>> >> performance in
>> >> > > Java
>> >> > > > > ;)
>> >> > > > > > > > (and then, from across the network).
>> >> > > > > > > >
>> >> > > > > > > > I hope this makes sense.
>> >> > > > > > > >
>> >> > > > > > > > Cheers,
>> >> > > > > > > > --
>> >> > > > > > > > Animesh
>> >> > > > > > > >
>> >> > > > > > > > On Wed, Sep 19, 2018 at 6:28 PM Jacques Nadeau <
>> >> > > jacques@apache.org>
>> >> > > > > > > wrote:
>> >> > > > > > > >
>> >> > > > > > > > > My big question is what is the use case and how/what
>> are
>> >> you
>> >> > > > > trying to
>> >> > > > > > > > > compare? Arrow's implementation is more focused on
>> >> interacting
>> >> > > > > with the
>> >> > > > > > > > > structure than transporting it. Generally speaking,
>> when
>> >> we're
>> >> > > > > working
>> >> > > > > > > with
>> >> > > > > > > > > Arrow data we frequently are just interacting with
>> memory
>> >> > > > > locations and
>> >> > > > > > > > > doing direct operations. If you have a layer that
>> >> supports that
>> >> > > > > type of
>> >> > > > > > > > > semantic, create a movement technique that depends on
>> >> that.
>> >> > > Arrow
>> >> > > > > > > doesn't
>> >> > > > > > > > > force a particular API since the data itself is defined
>> >> by its
>> >> > > > > > > in-memory
>> >> > > > > > > > > layout so if you have a custom use or pattern, just
>> work
>> >> with
>> >> > > the
>> >> > > > > > > in-memory
>> >> > > > > > > > > structures.
>> >> > > > > > > > >
>> >> > > > > > > > >
>> >> > > > > > > > >
>> >> > > > > > > > > On Wed, Sep 19, 2018 at 7:49 AM Wes McKinney <
>> >> > > wesmckinn@gmail.com>
>> >> > > > > > > wrote:
>> >> > > > > > > > >
>> >> > > > > > > > > > hi Animesh,
>> >> > > > > > > > > >
>> >> > > > > > > > > > Per Johan's comments, the C++ library is essentially
>> >> going
>> >> > > to be
>> >> > > > > > > > > > IO/memory bandwidth bound since you're interacting
>> with
>> >> raw
>> >> > > > > pointers.
>> >> > > > > > > > > >
>> >> > > > > > > > > > I'm looking at your code
>> >> > > > > > > > > >
>> >> > > > > > > > > > private void consumeFloat4(FieldVector fv) {
>> >> > > > > > > > > >     Float4Vector accessor = (Float4Vector) fv;
>> >> > > > > > > > > >     int valCount = accessor.getValueCount();
>> >> > > > > > > > > >     for(int i = 0; i < valCount; i++){
>> >> > > > > > > > > >         if(!accessor.isNull(i)){
>> >> > > > > > > > > >             float4Count+=1;
>> >> > > > > > > > > >             checksum+=accessor.get(i);
>> >> > > > > > > > > >         }
>> >> > > > > > > > > >     }
>> >> > > > > > > > > > }
>> >> > > > > > > > > >
>> >> > > > > > > > > > You'll want to get a Java-Arrow expert from Dremio to
>> >> advise
>> >> > > you
>> >> > > > > the
>> >> > > > > > > > > > fastest way to iterate over this data -- my
>> >> understanding is
>> >> > > that
>> >> > > > > > > much
>> >> > > > > > > > > > code in Dremio interacts with the wrapped Netty
>> ArrowBuf
>> >> > > objects
>> >> > > > > > > > > > rather than going through the higher level APIs.
>> You're
>> >> also
>> >> > > > > dropping
>> >> > > > > > > > > > performance because memory mapping is not yet
>> >> implemented in
>> >> > > > > Java,
>> >> > > > > > > see
>> >> > > > > > > > > > https://issues.apache.org/jira/browse/ARROW-3191.
>> >> > > > > > > > > >
>> >> > > > > > > > > > Furthermore, the IPC reader class you are using could
>> >> be made
>> >> > > > > more
>> >> > > > > > > > > > efficient. I described the problem in
>> >> > > > > > > > > > https://issues.apache.org/jira/browse/ARROW-3192 --
>> >> this
>> >> > > will be
>> >> > > > > > > > > > required as soon as we have the ability to do memory
>> >> mapping
>> >> > > in
>> >> > > > > Java
>> >> > > > > > > > > >
>> >> > > > > > > > > > Could Crail use the Arrow data structures in its
>> runtime
>> >> > > rather
>> >> > > > > than
>> >> > > > > > > > > > copying? If not, how are Crail's runtime data
>> structures
>> >> > > > > different?
>> >> > > > > > > > > >
>> >> > > > > > > > > > - Wes
>> >> > > > > > > > > > On Wed, Sep 19, 2018 at 9:19 AM Johan Peltenburg -
>> EWI
>> >> > > > > > > > > > <J....@tudelft.nl> wrote:
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > Hello Animesh,
>> >> > > > > > > > > > >
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > I browsed a bit in your sources, thanks for
>> sharing.
>> >> We
>> >> > > have
>> >> > > > > > > performed
>> >> > > > > > > > > > some similar measurements to your third case in the
>> >> past for
>> >> > > > > C/C++ on
>> >> > > > > > > > > > collections of various basic types such as primitives
>> >> and
>> >> > > > > strings.
>> >> > > > > > > > > > >
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > I can say that in terms of consuming data from the
>> >> Arrow
>> >> > > format
>> >> > > > > > > versus
>> >> > > > > > > > > > language native collections in C++, I've so far
>> always
>> >> been
>> >> > > able
>> >> > > > > to
>> >> > > > > > > get
>> >> > > > > > > > > > similar performance numbers (e.g. no drawbacks due to
>> >> the
>> >> > > Arrow
>> >> > > > > > > format
>> >> > > > > > > > > > itself). Especially when accessing the data through
>> >> Arrow's
>> >> > > raw
>> >> > > > > data
>> >> > > > > > > > > > pointers (and using for example std::string_view-like
>> >> > > > > constructs).
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > In C/C++ the fast data structures are engineered in
>> >> such a
>> >> > > way
>> >> > > > > > > that as
>> >> > > > > > > > > > little pointer traversals are required and they take
>> up
>> >> an as
>> >> > > > > small
>> >> > > > > > > as
>> >> > > > > > > > > > possible memory footprint. Thus each memory access is
>> >> > > relatively
>> >> > > > > > > > > efficient
>> >> > > > > > > > > > (in terms of obtaining the data of interest). The
>> same
>> >> can
>> >> > > > > > > absolutely be
>> >> > > > > > > > > > said for Arrow, if not even more efficient in some
>> cases
>> >> > > where
>> >> > > > > object
>> >> > > > > > > > > > fields are of variable length.
>> >> > > > > > > > > > >
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > In the JVM case, the Arrow data is stored off-heap.
>> >> This
>> >> > > > > requires
>> >> > > > > > > the
>> >> > > > > > > > > > JVM to interface to it through some calls to Unsafe
>> >> hidden
>> >> > > under
>> >> > > > > the
>> >> > > > > > > > > Netty
>> >> > > > > > > > > > layer (but please correct me if I'm wrong, I'm not an
>> >> expert
>> >> > > on
>> >> > > > > > > this).
>> >> > > > > > > > > > Those calls are the only reason I can think of that
>> >> would
>> >> > > > > degrade the
>> >> > > > > > > > > > performance a bit compared to a pure JAva case. I
>> don't
>> >> know
>> >> > > if
>> >> > > > > the
>> >> > > > > > > > > Unsafe
>> >> > > > > > > > > > calls are inlined during JIT compilation. If they
>> >> aren't they
>> >> > > > > will
>> >> > > > > > > > > increase
>> >> > > > > > > > > > access latency to any data a little bit.
>> >> > > > > > > > > > >
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > I don't have a similar machine so it's not easy to
>> >> relate
>> >> > > my
>> >> > > > > > > numbers to
>> >> > > > > > > > > > yours, but if you can get that data consumed with 100
>> >> Gbps
>> >> > > in a
>> >> > > > > pure
>> >> > > > > > > Java
>> >> > > > > > > > > > case, I don't see any reason (resulting from Arrow
>> >> format /
>> >> > > > > off-heap
>> >> > > > > > > > > > storage) why you wouldn't be able to get at least
>> really
>> >> > > close.
>> >> > > > > Can
>> >> > > > > > > you
>> >> > > > > > > > > get
>> >> > > > > > > > > > to 100 Gbps starting from primitive arrays in Java
>> with
>> >> your
>> >> > > > > > > consumption
>> >> > > > > > > > > > functions in the first place?
>> >> > > > > > > > > > >
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > I'm interested to see your progress on this.
>> >> > > > > > > > > > >
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > Kind regards,
>> >> > > > > > > > > > >
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > Johan Peltenburg
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > ________________________________
>> >> > > > > > > > > > > From: Animesh Trivedi <an...@gmail.com>
>> >> > > > > > > > > > > Sent: Wednesday, September 19, 2018 2:08:50 PM
>> >> > > > > > > > > > > To: dev@arrow.apache.org; dev@crail.apache.org
>> >> > > > > > > > > > > Subject: [JAVA] Arrow performance measurement
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > Hi all,
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > A week ago, Wes and I had a discussion about the
>> >> > > performance
>> >> > > > > of the
>> >> > > > > > > > > > > Arrow/Java implementation on the Apache Crail
>> >> (Incubating)
>> >> > > > > mailing
>> >> > > > > > > > > list (
>> >> > > > > > > > > > >
>> >> > > > > > >
>> >> > >
>> >> http://mail-archives.apache.org/mod_mbox/crail-dev/201809.mbox/browser
>> >> > > > > > > > > ).
>> >> > > > > > > > > > In
>> >> > > > > > > > > > > a nutshell: I am investigating the performance of
>> >> various
>> >> > > file
>> >> > > > > > > formats
>> >> > > > > > > > > > > (including Arrow) on high-performance NVMe and
>> >> > > > > RDMA/100Gbps/RoCE
>> >> > > > > > > > > setups.
>> >> > > > > > > > > > I
>> >> > > > > > > > > > > benchmarked how long does it take to materialize
>> >> values
>> >> > > (ints,
>> >> > > > > > > longs,
>> >> > > > > > > > > > > doubles) of the store_sales table, the largest
>> table
>> >> in the
>> >> > > > > TPC-DS
>> >> > > > > > > > > > dataset
>> >> > > > > > > > > > > stored on different file formats. Here is a
>> write-up
>> >> on
>> >> > > this -
>> >> > > > > > > > > > >
>> >> > > https://crail.incubator.apache.org/blog/2018/08/sql-p1.html. I
>> >> > > > > > > found
>> >> > > > > > > > > > that
>> >> > > > > > > > > > > between a pair of machine connected over a 100 Gbps
>> >> link,
>> >> > > Arrow
>> >> > > > > > > (using
>> >> > > > > > > > > > as a
>> >> > > > > > > > > > > file format on HDFS) delivered close to ~30 Gbps of
>> >> > > bandwidth
>> >> > > > > with
>> >> > > > > > > all
>> >> > > > > > > > > 16
>> >> > > > > > > > > > > cores engaged. Wes pointed out that (i) Arrow is
>> >> in-memory
>> >> > > IPC
>> >> > > > > > > format,
>> >> > > > > > > > > > and
>> >> > > > > > > > > > > has not been optimized for storage interfaces/APIs
>> >> like
>> >> > > HDFS;
>> >> > > > > (ii)
>> >> > > > > > > the
>> >> > > > > > > > > > > performance I am measuring is for the java
>> >> implementation.
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > Wes, I hope I summarized our discussion correctly.
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > That brings us to this email where I promised to
>> >> follow up
>> >> > > on
>> >> > > > > the
>> >> > > > > > > Arrow
>> >> > > > > > > > > > > mailing list to understand and optimize the
>> >> performance of
>> >> > > > > > > Arrow/Java
>> >> > > > > > > > > > > implementation on high-performance devices. I
>> wrote a
>> >> small
>> >> > > > > > > stand-alone
>> >> > > > > > > > > > > benchmark (
>> >> > > > > https://github.com/animeshtrivedi/benchmarking-arrow)
>> >> > > > > > > with
>> >> > > > > > > > > > three
>> >> > > > > > > > > > > implementations of WritableByteChannel,
>> >> SeekableByteChannel
>> >> > > > > > > interfaces:
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > 1. Arrow data is stored in HDFS/tmpfs - this gives
>> me
>> >> ~30
>> >> > > Gbps
>> >> > > > > > > > > > performance
>> >> > > > > > > > > > > 2. Arrow data is stored in Crail/DRAM - this gives
>> me
>> >> > > ~35-36
>> >> > > > > Gbps
>> >> > > > > > > > > > > performance
>> >> > > > > > > > > > > 3. Arrow data is stored in on-heap byte[] - this
>> >> gives me
>> >> > > ~39
>> >> > > > > Gbps
>> >> > > > > > > > > > > performance
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > I think the order makes sense. To better understand
>> >> the
>> >> > > > > > > performance of
>> >> > > > > > > > > > > Arrow/Java we can focus on the option 3.
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > The key question I am trying to answer is "what
>> would
>> >> it
>> >> > > take
>> >> > > > > for
>> >> > > > > > > > > > > Arrow/Java to deliver 100+ Gbps of performance"?
>> Is it
>> >> > > > > possible? If
>> >> > > > > > > > > yes,
>> >> > > > > > > > > > > then what is missing/or mis-interpreted by me? If
>> >> not, then
>> >> > > > > where
>> >> > > > > > > is
>> >> > > > > > > > > the
>> >> > > > > > > > > > > performance lost? Does anyone have any performance
>> >> > > measurements
>> >> > > > > > > for C++
>> >> > > > > > > > > > > implementation? if they have seen/expect better
>> >> numbers.
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > As a next step, I will profile the read path of
>> >> Arrow/Java
>> >> > > for
>> >> > > > > the
>> >> > > > > > > > > option
>> >> > > > > > > > > > > 3. I will report my findings.
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > Any thoughts and feedback on this investigation are
>> >> very
>> >> > > > > welcome.
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > Cheers,
>> >> > > > > > > > > > > --
>> >> > > > > > > > > > > Animesh
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > PS~ Cross-posting on the dev@crail.apache.org
>> list as
>> >> > > well.
>> >> > > > > > > > > >
>> >> > > > > > > > >
>> >> > > > > > >
>> >> > > > >
>> >> > >
>> >>
>> >
>>
>>

Re: [JAVA] Arrow performance measurement

Posted by Animesh Trivedi <an...@gmail.com>.
Hi Wes, sure. I opened a ticket and will do a pull request.

Cheers,
--
Animesh

On Fri, Nov 30, 2018 at 5:28 PM Wes McKinney <we...@gmail.com> wrote:

> hi Animesh -- can you link to JIRA issues about the C++ improvements
> you're describing? Want to make sure this doesn't fall through the
> cracks
>
> Thanks
> Wes
> On Mon, Nov 26, 2018 at 7:54 AM Antoine Pitrou <an...@python.org> wrote:
> >
> >
> > Hi Animesh,
> >
> > On 26/11/2018 14:23, Animesh Trivedi wrote:
> > >
> > > * The C++ bitmap code can be optimized further by using unsigned integers
> > > instead of "int64_t" for the bitmap checks, and by eliminating the kBitmap.
> > > See https://godbolt.org/z/deq0_q and compare the size of the assembly code;
> > > the performance measurements in the blog show up to 50% performance gains.
> > > Alternatively, if the signed-to-unsigned switch is not possible (perhaps not
> > > in every language), then the C++ code should use the bitwise operations
> > > directly (`>> 3` for division by 8, and `& 0x7` for modulo by 8) instead of
> > > `/` and `%`.
> >
> > Thank you for noticing this.  Switching to unsigned (and do a
> > static_cast to unsigned) sounds good to me.
> >
> > Regards
> >
> > Antoine.
>

Re: [JAVA] Arrow performance measurement

Posted by Wes McKinney <we...@gmail.com>.
hi Animesh -- can you link to JIRA issues about the C++ improvements
you're describing? Want to make sure this doesn't fall through the
cracks

Thanks
Wes
On Mon, Nov 26, 2018 at 7:54 AM Antoine Pitrou <an...@python.org> wrote:
>
>
> Hi Animesh,
>
> On 26/11/2018 14:23, Animesh Trivedi wrote:
> >
> > * The C++ bitmap code can be optimized further by using unsigned integers
> > instead of "int64_t" for the bitmap checks, and by eliminating the kBitmap.
> > See https://godbolt.org/z/deq0_q and compare the size of the assembly code;
> > the performance measurements in the blog show up to 50% performance gains.
> > Alternatively, if the signed-to-unsigned switch is not possible (perhaps not
> > in every language), then the C++ code should use the bitwise operations
> > directly (`>> 3` for division by 8, and `& 0x7` for modulo by 8) instead of
> > `/` and `%`.
>
> Thank you for noticing this.  Switching to unsigned (and do a
> static_cast to unsigned) sounds good to me.
>
> Regards
>
> Antoine.

Re: [JAVA] Arrow performance measurement

Posted by Antoine Pitrou <an...@python.org>.
Hi Animesh,

On 26/11/2018 14:23, Animesh Trivedi wrote:
> 
> * The C++ bitmap code can be optimized further by using unsigned integers
> instead of "int64_t" for the bitmap checks, and by eliminating the kBitmap.
> See https://godbolt.org/z/deq0_q and compare the size of the assembly code;
> the performance measurements in the blog show up to 50% performance gains.
> Alternatively, if the signed-to-unsigned switch is not possible (perhaps not
> in every language), then the C++ code should use the bitwise operations
> directly (`>> 3` for division by 8, and `& 0x7` for modulo by 8) instead of
> `/` and `%`.

Thank you for noticing this.  Switching to unsigned (and do a
static_cast to unsigned) sounds good to me.

Regards

Antoine.