Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/02/10 12:57:59 UTC

[GitHub] [arrow-datafusion] e-dard commented on issue #1708: Introduce a `Vec` based row-wise representation for DataFusion

e-dard commented on issue #1708:
URL: https://github.com/apache/arrow-datafusion/issues/1708#issuecomment-1034891519


   @alamb highlighted this thread internally and I saw a couple of interesting points. I work on IOx's Read Buffer, which is an in-memory columnar engine that currently implements DataFusion's table provider (so for now it only supports scans with predicate pushdown, etc.).
   
   I have experimented with a prototype that can do grouping/aggregation directly on encoded columnar data (e.g., on integer representations of RLE/dictionary encodings) and I found a couple of things mentioned already in this thread:
   
   Using a `Vec<SomeEnum>` had a big overhead on hashing performance (as @alamb mentioned). However, in the Read Buffer's case it was possible to use the encoded representations of all the group column values directly, and these were `u32` values [^1].
   
   Using `Vec<u32>` made a significant improvement to performance. Further, as a special case optimisation I found that if one were grouping on four or fewer columns then there was another big bump in performance by packing the encoded group key values into a single `u128`, and using that as the key in the hashmap. This is where I see the similarities to using a binary representation of the group key. 
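   A minimal sketch of the packing idea described above, not the Read Buffer's actual code. The names `pack_key`, `unpack_key` and `count_groups` are hypothetical, and the sketch assumes every row in one query has the same number of group columns (at most four), so zero-padding the unused bits cannot make two different keys collide.

```rust
use std::collections::HashMap;

/// Pack up to four `u32` encoded group-key values into a single `u128`.
/// Hypothetical helper: value `i` occupies bits `32*i .. 32*i + 32`;
/// unused high bits stay zero.
fn pack_key(encoded: &[u32]) -> u128 {
    assert!(encoded.len() <= 4, "at most four u32 values fit in a u128");
    let mut key: u128 = 0;
    for (i, &v) in encoded.iter().enumerate() {
        key |= (v as u128) << (32 * i);
    }
    key
}

/// Recover the first `n` encoded values from a packed key.
fn unpack_key(key: u128, n: usize) -> Vec<u32> {
    // `as u32` keeps only the low 32 bits after the shift.
    (0..n).map(|i| (key >> (32 * i)) as u32).collect()
}

/// Count rows per group, using the packed `u128` as the hash map key
/// instead of a `Vec<u32>` (assumes all rows have the same length).
fn count_groups(rows: &[Vec<u32>]) -> HashMap<u128, u64> {
    let mut counts: HashMap<u128, u64> = HashMap::new();
    for row in rows {
        *counts.entry(pack_key(row)).or_insert(0) += 1;
    }
    counts
}
```

   The key is a fixed-size integer, so hashing it does not need to walk a heap-allocated `Vec`, and the original encoded values can still be recovered with `unpack_key` when the results are materialised.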
   
   Anyway, just some anecdotal thoughts :-). Whilst there are some significant constraints the Read Buffer can take advantage of that DataFusion can't, based on my experience from playing around with similar ideas, I suspect the direction @yjshen has proposed here will significantly improve grouping performance 👍
   
   [^1]: Because all group columns in the Read Buffer are dictionary or RLE encoded, such that the encoded representations have the same ordinal properties.
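
   As an illustration only: one common way for an encoded representation to keep ordinal properties is a sorted dictionary, where each value's code is its position among the sorted distinct values. The sketch below assumes that reading of the footnote; whether the Read Buffer builds its dictionaries this way is not stated in this comment, and `build_sorted_dictionary` and `encode` are hypothetical names.

```rust
use std::collections::BTreeSet;

/// Build an order-preserving dictionary: the sorted distinct values.
/// A value's code is its index in the returned vector.
fn build_sorted_dictionary(values: &[&str]) -> Vec<String> {
    // BTreeSet deduplicates and iterates in ascending order.
    let distinct: BTreeSet<&str> = values.iter().copied().collect();
    distinct.into_iter().map(String::from).collect()
}

/// Look up the `u32` code for a value, or `None` if it is not present.
fn encode(dictionary: &[String], value: &str) -> Option<u32> {
    dictionary
        .binary_search_by(|entry| entry.as_str().cmp(value))
        .ok()
        .map(|idx| idx as u32)
}
```

   With such a dictionary, comparing two codes gives the same result as comparing the two original values, so the `u32` codes can stand in for the values directly.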

