Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/10/12 17:35:39 UTC

[GitHub] [spark] liuzqt commented on pull request #38064: [SPARK-40622][SQL][CORE]Result of a single task in collect() must fit in 2GB

liuzqt commented on PR #38064:
URL: https://github.com/apache/spark/pull/38064#issuecomment-1276518707

   Some general comments on the performance implications of replacing `Array[Byte]` and `ByteBuffer` (backed by `Array[Byte]`) with `ChunkedByteBuffer`:
   - When reading from a stream (i.e., `ByteArrayInputStream` vs. `ChunkedByteBufferInputStream`), there is not much difference. `ByteArrayInputStream` might win slightly on cache locality because its memory is contiguous, but `ChunkedByteBuffer` should not be much worse as long as the chunk size is reasonable.
   - When writing to a stream (i.e., `ByteArrayOutputStream` vs. `ChunkedByteBufferOutputStream`):
     - `ByteArrayOutputStream` starts with a small buffer (32 bytes) and grows exponentially (2x), and it has to do an **array copy** on every grow.
     - `ChunkedByteBufferOutputStream` grows by a fixed `chunk size` (which you can specify when you create the stream), and the growth is **append style** instead of **copy style**.
     - I ran some manual benchmarks on large data, and `ChunkedByteBufferOutputStream` was much faster: across data sizes from 100MB to 1GB and chunk sizes from 1KB to 1MB, I saw at least a ~2x speedup, which I would attribute mostly to array copy overhead.
     - When the data is eventually dumped to a `ByteBuffer` (or raw byte array) vs. a `ChunkedByteBuffer`, the latter might waste some memory in the last chunk, but I don't believe that's a big deal. In serialization they're the same.
     - After all, result collection is a small portion of the whole end-to-end query.
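
To make the copy-style vs. append-style difference concrete, here is a minimal Java sketch. It is not Spark's `ChunkedByteBufferOutputStream`: the names `ChunkedSketch`, `SimpleChunkedOutputStream`, and `doublingGrowthCost` are hypothetical, and the doubling model is a simplification that assumes writes arrive in small pieces, so every grow exactly doubles the capacity and copies the whole old array.

```java
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; NOT Spark's ChunkedByteBufferOutputStream.
class ChunkedSketch {

    /** Append-style growth: a full chunk is left in place and a new one is allocated. */
    static final class SimpleChunkedOutputStream extends OutputStream {
        private final int chunkSize;
        private final List<byte[]> chunks = new ArrayList<>();
        private int posInLastChunk; // bytes already used in the newest chunk
        private long size;          // total bytes written so far

        SimpleChunkedOutputStream(int chunkSize) {
            if (chunkSize <= 0) {
                throw new IllegalArgumentException("chunkSize must be positive");
            }
            this.chunkSize = chunkSize;
            this.posInLastChunk = chunkSize; // forces an allocation on the first write
        }

        @Override
        public void write(int b) {
            write(new byte[] {(byte) b}, 0, 1);
        }

        @Override
        public void write(byte[] src, int off, int len) {
            if (off < 0 || len < 0 || len > src.length - off) {
                throw new IndexOutOfBoundsException();
            }
            int written = 0;
            while (written < len) {
                if (posInLastChunk == chunkSize) {
                    // Grow by appending a chunk; existing chunks are never copied.
                    chunks.add(new byte[chunkSize]);
                    posInLastChunk = 0;
                }
                int n = Math.min(len - written, chunkSize - posInLastChunk);
                byte[] last = chunks.get(chunks.size() - 1);
                System.arraycopy(src, off + written, last, posInLastChunk, n);
                posInLastChunk += n;
                written += n;
            }
            size += len;
        }

        long size() { return size; }
        int chunkCount() { return chunks.size(); }
        long allocatedBytes() { return (long) chunks.size() * chunkSize; }
        /** Unused tail of the last chunk: the "wasted" space mentioned above. */
        long wastedBytes() { return allocatedBytes() - size; }
    }

    /**
     * Simplified model of a copy-style doubling buffer.
     * Returns {number of grows, total bytes copied while growing}.
     */
    static long[] doublingGrowthCost(long totalBytes, long initialCapacity) {
        if (initialCapacity <= 0) {
            throw new IllegalArgumentException("initialCapacity must be positive");
        }
        long capacity = initialCapacity;
        long grows = 0;
        long copied = 0;
        while (capacity < totalBytes) {
            copied += capacity; // the whole old array is copied into the new one
            capacity *= 2;
            grows++;
        }
        return new long[] {grows, copied};
    }

    public static void main(String[] args) {
        SimpleChunkedOutputStream out = new SimpleChunkedOutputStream(4);
        byte[] data = new byte[10];
        out.write(data, 0, data.length);
        System.out.println("chunks=" + out.chunkCount() + " wasted=" + out.wastedBytes());

        long[] cost = doublingGrowthCost(1000, 32);
        System.out.println("grows=" + cost[0] + " bytesCopied=" + cost[1]);
    }
}
```

In this sketch, writing 10 bytes with a 4-byte chunk size allocates 3 chunks and leaves 2 bytes unused in the last one, while the doubling model needs 5 grows and copies 992 bytes to hold 1000 bytes starting from a 32-byte buffer. The chunked variant pays no copy on growth, at the cost of at most one partially filled chunk.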


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
For additional commands, e-mail: reviews-help@spark.apache.org