Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/30 09:19:55 UTC

[GitHub] [arrow-datafusion] yjshen opened a new issue #1708: Introduce a `Vec` based row-wise representation for DataFusion

yjshen opened a new issue #1708:
URL: https://github.com/apache/arrow-datafusion/issues/1708


   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   **_Many pipeline-breaking operators are inherently row-based:_** 
   
   For sort, which shuffles records around, re-ordering causes random memory access patterns for each column in the current columnar organization, and performance deteriorates as the number of columns grows. Besides, a compound sort key also requires us to access several columns per comparison.
   A row-based representation, on the other hand, avoids this problem. See [here](https://dl.acm.org/doi/10.1145/1409360.1409380) for more explanation.
   
   For the hash table entries in which we buffer aggregation state, we already use a row-based format indirectly -- we use `Vec<ScalarValue>` as the state for each key. A vector of `ScalarValue` is mostly stored contiguously in memory but suffers two kinds of inefficiency: 1. memory overhead introduced by the `ScalarValue` enum (16 bytes per field according to @alamb); 2. strings and other non-primitive values are stored elsewhere on the heap and accessed through pointers. 
   
   ```text
   ┌───────────────────────────────────────────────────────┐
   │                                                       │
   │ ┌────────────────┬────────────────┬────────────────┐  │
   │ │  ScalarValue   │  ScalarValue   │  ScalarValue   │  │
   │ │    ::Int(5)    │   ::Int(10)    │    ::Int(3)    │  │
   │ └────────────────┴────────────────┴────────────────┘  │
   │   Hash Table Entry                                    │
   │   Vec<ScalarValue>                                    │
   └───────────────────────────────────────────────────────┘
    When the keys are primitive values, they are stored     
                  contiguously in the Vec                   
   
   
                              ┌ ─ ─ ─ ─ ─ ─ ─ ─ ┐           
                                     "foo"                  
                              │(heap allocation)│           
                               ─ ─ ─ ─ ─ ─ ─ ─ ─            
                                       ▲                    
                               ┌───────┘                    
   ┌───────────────────────────┼───────────────────────────┐
   │                           │                           │
   │ ┌────────────────┬────────────────┬────────────────┐  │
   │ │  ScalarValue   │  ScalarValue   │  ScalarValue   │  │
   │ │    ::Int(5)    │ ::Utf8("foo")  │    ::Int(3)    │  │
   │ └────────────────┴────────────────┴────────────────┘  │
   │   Hash Table Entry                                    │
   │   Vec<ScalarValue>                                    │
   └───────────────────────────────────────────────────────┘
     When the keys have strings/binary data, the variable   
      length data is stored non contiguously in the Vec     
   ```
   _The two great diagrams above are quoted from @alamb. Thanks again!_
   
   
   Joins, whether hash-based or sort-based, suffer from similar problems as above.
   
   **Describe the solution you'd like**
   
   1. A `Vec<u8>` based representation for tuples that stores all columns contiguously in memory, for row-wise operations.
   2. Efficient encoding/decoding methods from/to columnar Arrow data.
   3. Efficient access to individual cells within a `Vec<u8>` tuple. 
   
   We could refer to PostgreSQL / DuckDB / Spark for the row format design. But note that Spark's `UnsafeRow` incurs a lot of memory overhead due to its 8-byte alignment.
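   
   As a minimal, self-contained sketch of point 1 (hypothetical helpers, not an existing DataFusion API): a fixed-width tuple is encoded into a contiguous `Vec<u8>`, and a cell is read back at an offset known from the schema alone.
   
   ```rust
   use std::convert::TryInto;
   
   // Hypothetical sketch: encode an (i64, f64) tuple into one contiguous buffer.
   fn encode_row(a: i64, b: f64) -> Vec<u8> {
       let mut buf = Vec::with_capacity(16);
       buf.extend_from_slice(&a.to_le_bytes()); // bytes 0..8
       buf.extend_from_slice(&b.to_le_bytes()); // bytes 8..16
       buf
   }
   
   // The offset (8) is a pure function of the schema: the i64 occupies bytes 0..8.
   fn read_second_field(row: &[u8]) -> f64 {
       f64::from_le_bytes(row[8..16].try_into().unwrap())
   }
   
   fn main() {
       let row = encode_row(59, 2.5);
       assert_eq!(read_second_field(&row), 2.5);
   }
   ```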
   
   **Describe alternatives you've considered**
   
   





[GitHub] [arrow-datafusion] yjshen commented on issue #1708: Introduce a `Vec` based row-wise representation for DataFusion

Posted by GitBox <gi...@apache.org>.
yjshen commented on issue #1708:
URL: https://github.com/apache/arrow-datafusion/issues/1708#issuecomment-1028963254


   Thank you all for your comments and for fixing my mistakes ❤️
   
   I agree that for a small hashtable -- i.e., one where the whole data structure is CPU cache resident, or only a few times larger than the cache -- the columnar organization gives us all the benefits at no cost. 
   
   However, as the group-by key cardinality grows, the bottleneck of hash aggregation or hash join row concatenation becomes dominated by the memory access pattern. A columnar-structured hashtable causes N times the cache misses/loads, since we constantly access one tuple at a time. 
   
   The problem is identified as [the memory wall](https://dl.acm.org/doi/10.1145/1409360.1409380). As a result, quite a lot of systems use a row-wise hashtable even though they employ a vectorized execution model: [Vertica](https://15721.courses.cs.cmu.edu/spring2020/papers/13-execution/shrinivas-icde2013.pdf), [Typer and Tectorwise](https://www.vldb.org/pvldb/vol11/p2209-kersten.pdf), [Operator Fusion](http://www.vldb.org/pvldb/vol11/p1-menon.pdf), and also the latest [DuckDB](https://github.com/duckdb/duckdb/blob/master/src/common/types/row_layout.cpp#L32). 
   
   For sort, the problem is similar; re-ordering will cause a random access pattern in memory for each column. If there are many payload columns, this will be slow.
   
   The row <-> columnar conversion comes with a price: extra computation. But considering that we need to buffer intermediate data in memory anyway, and can compose/decompose while the data is cache resident, the conversion could be pretty efficient.
   
   I'm not proposing to change all these data structures to be row-wise at once, but to point out existing theory and practice on using a row-wise organization for these pipeline breakers' states. We should always implement and benchmark before deciding on this performance-critical code, as we did in the past.
   
   For the row structure, I'd like a combination of all three systems: 
   
   - byte-aligned null bit set
   - store all attributes sequentially
     - for data columns, we do no padding
     - for aggregate state columns, we add padding to keep them aligned, which makes the CPU happier when we repeatedly check and update them
     - for var-len attributes, we store (offset + length) in the fixed-width slot and append the actual value at the end of each row
       - we could add a short-string representation later for space efficiency, at the cost of extra complexity
   
   The rationale for the proposed structure: each attribute's offset can be determined from the schema alone, and we trade extra padding space for better computation performance.
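   
   A minimal sketch of that offset computation (hypothetical types, not actual DataFusion code), assuming a byte-aligned null bit set followed by fixed-width slots, with var-len attributes occupying an 8-byte (offset + length) slot:
   
   ```rust
   #[derive(Clone, Copy)]
   enum FieldType {
       Fixed(usize), // width in bytes, e.g. Int32 => Fixed(4)
       VarLen,       // stored in place as 4-byte offset + 4-byte length
   }
   
   fn field_offsets(fields: &[FieldType], pad_for_agg_state: bool) -> Vec<usize> {
       let null_bytes = (fields.len() + 7) / 8; // byte-aligned null bit set
       let mut offset = null_bytes;
       let mut offsets = Vec::with_capacity(fields.len());
       for f in fields {
           let width = match f {
               FieldType::Fixed(w) => *w,
               FieldType::VarLen => 8,
           };
           if pad_for_agg_state {
               // round up so repeated read-modify-write stays naturally aligned
               offset = (offset + width - 1) / width * width;
           }
           offsets.push(offset);
           offset += width;
       }
       offsets
   }
   
   fn main() {
       // (bool, char, int32, string): 1 byte of null bits, then 1 + 1 + 4 + 8 bytes
       let fields = [
           FieldType::Fixed(1),
           FieldType::Fixed(1),
           FieldType::Fixed(4),
           FieldType::VarLen,
       ];
       assert_eq!(field_offsets(&fields, false), vec![1, 2, 3, 7]); // data columns: no padding
       assert_eq!(field_offsets(&fields, true), vec![1, 2, 4, 8]);  // aggregate state: padded
   }
   ```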
   





[GitHub] [arrow-datafusion] alamb commented on issue #1708: Introduce a `Vec` based row-wise representation for DataFusion

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #1708:
URL: https://github.com/apache/arrow-datafusion/issues/1708#issuecomment-1026262077


   Thanks @yjshen  
   
   Some other notes
   1. We had a `Vec<u8>` based implementation prior to https://github.com/apache/arrow-datafusion/pull/808 which I think @Dandandan originally contributed (see the `create_key` function -- it even has ascii art)
   2. the `ScalarValue` overhead is actually worse than 16 bytes: it is `48` or `64` bytes depending on the architecture 😱 : https://github.com/apache/arrow-datafusion/blob/e92225d6f4660363eec3b1e767a188fccebb7ed9/datafusion/src/scalar.rs#L2301-L2309
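   
   A quick way to see this kind of enum overhead (an illustrative stand-in, not the actual `ScalarValue` definition): an enum is at least as large as its biggest variant plus discriminant/padding, and that cost is paid per value even for small integers.
   
   ```rust
   #[allow(dead_code)]
   enum Scalar {
       Int32(Option<i32>),
       Utf8(Option<String>),
       // the real ScalarValue has many more (and larger) variants
   }
   
   fn main() {
       println!("size_of::<Scalar>() = {}", std::mem::size_of::<Scalar>());
       println!("size_of::<i32>()    = {}", std::mem::size_of::<i32>());
   }
   ```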





[GitHub] [arrow-datafusion] alamb closed issue #1708: Introduce a `Vec` based row-wise representation for DataFusion

Posted by GitBox <gi...@apache.org>.
alamb closed issue #1708:
URL: https://github.com/apache/arrow-datafusion/issues/1708


   





[GitHub] [arrow-datafusion] houqp commented on issue #1708: Introduce a `Vec` based row-wise representation for DataFusion

Posted by GitBox <gi...@apache.org>.
houqp commented on issue #1708:
URL: https://github.com/apache/arrow-datafusion/issues/1708#issuecomment-1028677253


   Thanks @yjshen for the detailed research. It looks like Postgres's design might be better, assuming we only access row values sequentially the majority of the time. I think this is the case for our current hash aggregate and sort implementations?





[GitHub] [arrow-datafusion] alamb edited a comment on issue #1708: Introduce a `Vec` based row-wise representation for DataFusion

Posted by GitBox <gi...@apache.org>.
alamb edited a comment on issue #1708:
URL: https://github.com/apache/arrow-datafusion/issues/1708#issuecomment-1029014217


   > However, as the group-by key cardinality grows, the bottleneck of hash aggregation or hash join row concatenation becomes dominated by the memory access pattern. 
   
   One thing we might consider is not storing the group key values directly in the hash table, but separately. Something like:
   
   ```text
    ┌─────────────┐               ┌─────────┬─────────┐   
    │             │               │Group Key│Group Key│   
    │  HashTable  │               │  Col 1  │  Col 2  │   
    │             │               ├─────────┼─────────┤   
    ├──────┬──────┤               │         │         │   
    │      │      │               │         │         │   
    │      │      │               │         │         │   
    │      │      │               │         │         │   
    │      │      │               ├ ─ ─ ─ ─ ┴ ─ ─ ─ ─ ┤   
    │      │      │       ┌───────▶        idx        │   
    │ key  │value │       │       ├ ─ ─ ─ ─ ┬ ─ ─ ─ ─ ┤   
    │      ├──────┤       │       │         │         │   
    │      │ idx  │───────┘       │         │         │   
    │      ├──────┤               │         │         │   
    │      │      │               │         │         │   
    │      │      │               │         │         │   
    └──────┴──────┘               │         │         │   
                                  │         │         │   
   HashTable holds indexes        │         │         │   
   to mutable array               │         │         │   
                                  │   ...   │   ...   │   
                                  └─────────┴─────────┘   
                                                          
                                                          
                                Mutable (Appendable) Array
                                New group keys are        
                                appended at the end       
   ```
   
   The current hash aggregate code takes this approach -- but instead of using Mutable/Appendable arrays it uses a Vec of `GroupState`: https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_plan/hash_aggregate.rs#L615
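   
   A minimal sketch of the diagram's idea (simplified: serialized keys in one appendable `Vec` rather than per-column mutable arrays): the hash map stores only an index, and the group key values live in separate append-only storage.
   
   ```rust
   use std::collections::HashMap;
   
   struct GroupIndex {
       map: HashMap<Vec<u8>, usize>, // serialized group key -> group index
       keys: Vec<Vec<u8>>,           // appendable store of keys, in insertion order
   }
   
   impl GroupIndex {
       fn intern(&mut self, key: &[u8]) -> usize {
           if let Some(&idx) = self.map.get(key) {
               return idx; // existing group
           }
           let idx = self.keys.len();
           self.keys.push(key.to_vec()); // new group keys are appended at the end
           self.map.insert(key.to_vec(), idx);
           idx
       }
   }
   
   fn main() {
       let mut groups = GroupIndex { map: HashMap::new(), keys: Vec::new() };
       assert_eq!(groups.intern(b"foo"), 0);
       assert_eq!(groups.intern(b"bar"), 1);
       assert_eq!(groups.intern(b"foo"), 0);
   }
   ```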
   
   > I'm not proposing to change all these data structures to be row-wise at once, but to point out existing theory and practice on using a row-wise organization for these pipeline breakers' states. We should always implement and benchmark before deciding on this performance-critical code, as we did in the past.
   
   I agree with this 💯 -- it is among the reasons I enjoy working with you (and the rest of the people on this chain!)
   
   BTW the DuckDB sorting blog https://duckdb.org/2021/08/27/external-sorting.html (and the paper it references by Goetz Graefe) has a good treatment of sorting (specifically the calculation of sort keys, followed by a secondary memory shuffle to move the original data into sorted order)
   





[GitHub] [arrow-datafusion] e-dard commented on issue #1708: Introduce a `Vec` based row-wise representation for DataFusion

Posted by GitBox <gi...@apache.org>.
e-dard commented on issue #1708:
URL: https://github.com/apache/arrow-datafusion/issues/1708#issuecomment-1034891519


   @alamb highlighted this thread internally and I saw a couple of interesting points. I work on IOx's Read Buffer, an in-memory columnar engine that implements Datafusion's table provider (so it currently only supports scans with predicate pushdown etc).
   
   I have experimented with a prototype that can do grouping/aggregation directly on encoded columnar data (e.g., on integer representations of RLE/dictionary encodings) and I found a couple of things mentioned already in this thread:
   
   Using a `Vec<SomeEnum>` had a big overhead (as @alamb mentioned) on hashing performance. However, in the Read Buffer's case it was possible to use every group column value's encoded representation (`u32`) directly [^1].
   
   Using `Vec<u32>` made a significant improvement to performance. Further, as a special-case optimisation I found that when grouping on four or fewer columns there was another big bump in performance from packing the encoded group key values into a single `u128` and using that as the key in the hashmap. This is where I see the similarities to using a binary representation of the group key. 
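   
   A sketch of that packing trick (assuming at most four `u32`-encoded group columns):
   
   ```rust
   // Pack up to four u32-encoded group column values into one u128 hash key.
   fn pack_group_key(encoded: &[u32]) -> u128 {
       assert!(encoded.len() <= 4);
       let mut key: u128 = 0;
       for (i, &v) in encoded.iter().enumerate() {
           key |= (v as u128) << (32 * i);
       }
       key
   }
   
   fn main() {
       // two group columns, dictionary-encoded to 7 and 42
       assert_eq!(pack_group_key(&[7, 42]), 7u128 | (42u128 << 32));
   }
   ```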
   
   Anyway, just some anecdotal thoughts :-). Whilst there are some significant constraints the Read Buffer can take advantage of that Datafusion can't, based on my experience playing around with similar ideas, I suspect the direction @yjshen has proposed here will bring a significant improvement to grouping performance 👍 
   
   [^1]: Because all group columns in the Read Buffer are dictionary or RLE encoded such that the encoded representations have the same ordinal properties as the values they encode





[GitHub] [arrow-datafusion] yjshen edited a comment on issue #1708: Introduce a `Vec` based row-wise representation for DataFusion

Posted by GitBox <gi...@apache.org>.
yjshen edited a comment on issue #1708:
URL: https://github.com/apache/arrow-datafusion/issues/1708#issuecomment-1027779249


   After checking code and docs of the existing systems, the three systems' row layouts are:
   
   **PostgreSQL:** var-length tuple
   - null-bits first (byte aligned)
   - store **all** attributes sequentially
       - add **extra padding if needed** before each attribute
           - E.g., for table A (bool, char, int32): no padding between bool and char, since both are 1-byte aligned, but 2 bytes of padding after char and before int32, since int32 is 4-byte aligned.
       - store var-length attributes in place (length first, then content) if the value is not too big (otherwise it is "TOAST"ed, in Postgres terms). A 1-byte length header is used for varlena values of up to 126 bytes.
   - **Value access:** the most difficult: O(n) complexity, since deducing the start offset of an attribute requires walking all previous attributes of the tuple to account for their padding and lengths.
       
   Check [Data Alignment in PostgreSQL](https://www.enterprisedb.com/postgres-tutorials/data-alignment-postgresql), [Column Storage Internals](https://momjian.us/main/blogs/pgblog/2017.html#March_15_2017), [CodeSample in Page16](https://momjian.us/main/writings/pgsql/inside_shmem.pdf) for more details.
   
   **DuckDB:** fixed-length tuple
   - null-bits first (byte aligned)
   - store fixed-size attributes sequentially; for var-length attributes, store an 8-byte pointer (on x64)
        - ~~**no padding between** attributes~~ no padding between data columns, but padding for each aggregate value
        - the var-length attribute pointer
              - points into a separate store called the "row heap"
              - in the row heap, the var-length attributes/strings of one tuple are stored contiguously
   - **Value access:** an extra `vector<idx_t> offsets` is employed to achieve O(1) fixed-width attribute access, and O(1 + 1) var-len attribute access (one extra indirection).
        
   Check [Source Code](https://github.com/duckdb/duckdb/blob/master/src/common/types/row_layout.cpp#L32-L66) and [a related blog post/external sorting section](https://duckdb.org/2021/08/27/external-sorting.html) for more details.
   
   **SparkSQL:** var-length tuple
   - null-bits first (8-byte aligned)
   - store each attribute sequentially, **8-byte aligned for each** attribute
       - for var-length attributes, pack (offset + length) into 8 bytes and store it in place; the actual var-length values are stored after all the fixed-width fields (each var-len value itself is again 8-byte aligned)
   - **Value access:** no extra structure needed; O(1) for fixed-width attribute access, O(1 + 1) for var-len attribute access.
       
   Check [Source Code](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java#L46-L61) for more details.
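   
   To make the access-cost difference concrete, here is a sketch of the Spark-style O(1) offset computation (my reading of the layout above, not code from any of these systems): with an 8-byte aligned null bit set and one 8-byte slot per field, a field's offset is a pure function of the schema and the field index.
   
   ```rust
   fn unsafe_row_style_field_offset(num_fields: usize, field_index: usize) -> usize {
       let null_bits_bytes = ((num_fields + 63) / 64) * 8; // 8-byte aligned bitmap
       null_bits_bytes + 8 * field_index
   }
   
   fn main() {
       // 4 fields -> 8 bytes of null bits; field 2 starts at byte 24
       assert_eq!(unsafe_row_style_field_offset(4, 2), 24);
   }
   ```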
   





[GitHub] [arrow-datafusion] Dandandan commented on issue #1708: Introduce a `Vec` based row-wise representation for DataFusion

Posted by GitBox <gi...@apache.org>.
Dandandan commented on issue #1708:
URL: https://github.com/apache/arrow-datafusion/issues/1708#issuecomment-1028712924


   Thanks for the research/overview!
   
   Taking inspiration from DuckDB / PostgreSQL sounds reasonable to me.
   
   I am wondering whether for certain operations, e.g. hash aggregate over fixed-size input, the data is better stored in a columnar format (mutable arrays, with offsets), which can have faster (vectorized) operations (batch updating of state values) and faster (free) conversion to a columnar array -- see the sketch at the end of this comment. Another row-based format (like we have now, or a more "advanced" one) would spend some extra time on:
   
   * Converting values to the row-wise format
   * Interpreting the row-wise format (accessing cells based on data types)
   * Generating columnar data
   
   The story is probably very different for sorting; I still need to read the DuckDB post in detail.
   
   > Joins, whether hash-based or sort-based, suffer from similar problems as above
   
   I think this isn't the case for hash join: there is no need for a row representation, as we can keep the left-side data in columnar format in memory (we don't mutate it).
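   
   A minimal sketch of the mutable columnar state idea mentioned above (hypothetical structure: one Vec per aggregate state column, indexed by group id), so that a batch of updates walks each column sequentially:
   
   ```rust
   struct SumCountState {
       sums: Vec<i64>,   // one slot per group
       counts: Vec<u64>, // one slot per group
   }
   
   impl SumCountState {
       // vectorized-style batch update: group ids and values arrive as slices
       fn update_batch(&mut self, group_ids: &[usize], values: &[i64]) {
           for (&g, &v) in group_ids.iter().zip(values) {
               self.sums[g] += v;
               self.counts[g] += 1;
           }
       }
   }
   
   fn main() {
       let mut state = SumCountState { sums: vec![0; 2], counts: vec![0; 2] };
       state.update_batch(&[0, 1, 0], &[10, 20, 30]);
       assert_eq!(state.sums, vec![40, 20]);
       assert_eq!(state.counts, vec![2, 1]);
   }
   ```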
   
   





[GitHub] [arrow-datafusion] alamb commented on issue #1708: Introduce a `Vec` based row-wise representation for DataFusion

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #1708:
URL: https://github.com/apache/arrow-datafusion/issues/1708#issuecomment-1028909182


   Depending on when people want to try a "mutable array" approach for improving hash performance, I can probably whip up something (based on arrow-rs), so let me know





[GitHub] [arrow-datafusion] yjshen edited a comment on issue #1708: Introduce a `Vec` based row-wise representation for DataFusion

Posted by GitBox <gi...@apache.org>.
yjshen edited a comment on issue #1708:
URL: https://github.com/apache/arrow-datafusion/issues/1708#issuecomment-1027821721


   Given `Table A (bool a, char b, int c, string d)` with `row_value (true, 'W', 59, "XYZ")` as an example, a tuple in each system looks like:
   
   ```
                                   
                   ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
    Postgres       │ 0F│ 1 │ W │ **│ **│ 00│ 00│ 00│ 3B│ 03│ X │ Y │ Z │
                   └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘                        
                                                                                                                        
                                                   Pointer                                                              
                                                                                                                        
                   ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐                                                        
     DuckDB        │ 0F│ 1 │ W │ 00│ 00│ 00│ 3B│ 00│ 00│ 00│ 5F│                                                        
                   └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘                                                        
                                                             │                                                          
                                     ┌───────────────────────┘                                                          
                                     ▼─────────┐                                                                        
                                     │   XYZ   │                                                                        
                                     └─────────┘                                                                        
                                                                                                                        
                   ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐    
     SparkSQL      │ 00│ 00│ 00│ 00│ 00│ 00│ 00│ 0F│ 00│ 00│ 00│ 00│ 00│ 00│ 00│ 1 │ 00│ 00│ 00│ 00│ 00│ 00│ 00│ W ├───┐
                   └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘   │
                                                                                                                       │
               ┌───────────────────────────────────────────────────────────────────────────────────────────────────────┘
               │                                                                                                        
               │   ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐                                    
               └───▶ 00│ 00│ 00│ 28│ 00│ 00│ 00│ 03│ X │ Y │ Z │ 00│ 00│ 00│ 00│ 00│                                    
                   └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘                                                          
   ```





[GitHub] [arrow-datafusion] alamb commented on issue #1708: Introduce a `Vec` based row-wise representation for DataFusion

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #1708:
URL: https://github.com/apache/arrow-datafusion/issues/1708#issuecomment-1028906123


   💯  with what @Dandandan  and @houqp  said; Thank you for writing this up @yjshen ❤️ 
   
   > I am wondering whether for certain operations, e.g. hash aggregate over fixed-size input, the data is better stored in a columnar format (mutable arrays, with offsets)
   
   I agree with @Dandandan that for HashAggregate this would be super helpful -- the group keys and aggregates could be computed "in place" (so output would be free)
   
   Sorting is indeed different because the sort key differs from what appears in the output. For example `SELECT a, b, c ... ORDER BY a+b` needs to compare on `a+b`, but still produce tuples of `(a, b, c)`.
   
   The grouping values, by contrast, are exactly what is produced. For example `SELECT a+b, sum(c) .. GROUP BY a+b` produces tuples of `(a+b, sum)`
   
   
   p.s. for what it is worth, I think DuckDB has a short string optimization, so the key may look something more like
   
   
   ```text
   Table A (bool a, char b, int c, string d) row_value (true, 'W', 59, "XYZ")                
                                                                                             
                                                                                             
                                                                                             
          ┌────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┐       
          │ 0F │ 1  │ W  │ 00 │ 00 │ 00 │ 3B │ 03 │ 00 │ 00 │ 00 │ 00 │ X  │ Y  │ Z  │       
          └────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┘       
                                                                                             
                                                  8                                          
                                                                                             
                                                                                             
                                                                                             
   Table A (bool a, char b, int c, string d) row_value (true, 'W', 59, "XYZXYZXYZ")          
                                                                                             
          ┌────┬────┬────┬────┬────┬────┬────┬─────────────────────────────────────────────┐ 
          │ 0F │ 1  │ W  │ 00 │ 00 │ 00 │ 3B │                     PTR                     │ 
          └────┴────┴────┴────┴────┴────┴────┴─────────────────────────────────────────────┘ 
                                                                    │                        
                                                  8                 └───┐                    
                                                                        ▼                    
                                                                                             
                                                                   "XYZXYZXYZ"               
                                                                                             
   ```
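   
   A sketch of that inline-or-pointer idea (thresholds and layout are illustrative, not DuckDB's actual format): short strings are stored inline in the fixed-width slot; longer ones spill to a separate heap, and the slot stores their offset.
   
   ```rust
   const INLINE_CAP: usize = 12;
   
   #[allow(dead_code)]
   enum StrSlot {
       Inline { len: u8, bytes: [u8; INLINE_CAP] },
       Spilled { len: u32, heap_offset: u32 },
   }
   
   fn encode_str(s: &[u8], heap: &mut Vec<u8>) -> StrSlot {
       if s.len() <= INLINE_CAP {
           let mut bytes = [0u8; INLINE_CAP];
           bytes[..s.len()].copy_from_slice(s);
           StrSlot::Inline { len: s.len() as u8, bytes }
       } else {
           let heap_offset = heap.len() as u32;
           heap.extend_from_slice(s); // spill the long value to the row heap
           StrSlot::Spilled { len: s.len() as u32, heap_offset }
       }
   }
   
   fn main() {
       let mut heap = Vec::new();
       assert!(matches!(encode_str(b"XYZ", &mut heap), StrSlot::Inline { len: 3, .. }));
       assert!(matches!(encode_str(b"XYZXYZXYZXYZX", &mut heap), StrSlot::Spilled { .. }));
   }
   ```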
   





[GitHub] [arrow-datafusion] alamb commented on issue #1708: Introduce a `Vec` based row-wise representation for DataFusion

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #1708:
URL: https://github.com/apache/arrow-datafusion/issues/1708#issuecomment-1026263442


   Here is the old implementation in case it is of value:
   
   ```rust
   /// Appends a sequence of [u8] bytes for the value in `col[row]` to
   /// `vec` to be used as a key into the hash map for a dictionary type
   ///
   /// Note that ideally, for dictionary encoded columns, we would be
   /// able to simply use the dictionary indices themselves (no need to
   /// look up values) or possibly simply build the hash table entirely
   /// on the dictionary indexes.
   ///
   /// This approach would likely work (very) well for the common case,
   /// but it also has to handle the case where the dictionary itself
   /// is not the same across all record batches (and thus indexes in one
   /// record batch may not correspond to the same index in another)
   fn dictionary_create_key_for_col<K: ArrowDictionaryKeyType>(
       col: &ArrayRef,
       row: usize,
       vec: &mut Vec<u8>,
   ) -> Result<()> {
       let dict_col = col.as_any().downcast_ref::<DictionaryArray<K>>().unwrap();
   
       // look up the index in the values dictionary
       let keys_col = dict_col.keys();
       let values_index = keys_col.value(row).to_usize().ok_or_else(|| {
           DataFusionError::Internal(format!(
               "Can not convert index to usize in dictionary of type creating group by value {:?}",
               keys_col.data_type()
           ))
       })?;
   
       create_key_for_col(dict_col.values(), values_index, vec)
   }
   
   /// Appends a sequence of [u8] bytes for the value in `col[row]` to
   /// `vec` to be used as a key into the hash map.
   ///
   /// NOTE: This function does not check col.is_valid(). Caller must do so
   fn create_key_for_col(col: &ArrayRef, row: usize, vec: &mut Vec<u8>) -> Result<()> {
       match col.data_type() {
           DataType::Boolean => {
               let array = col.as_any().downcast_ref::<BooleanArray>().unwrap();
               vec.extend_from_slice(&[array.value(row) as u8]);
           }
           DataType::Float32 => {
               let array = col.as_any().downcast_ref::<Float32Array>().unwrap();
               vec.extend_from_slice(&array.value(row).to_le_bytes());
           }
           DataType::Float64 => {
               let array = col.as_any().downcast_ref::<Float64Array>().unwrap();
               vec.extend_from_slice(&array.value(row).to_le_bytes());
           }
           DataType::UInt8 => {
               let array = col.as_any().downcast_ref::<UInt8Array>().unwrap();
               vec.extend_from_slice(&array.value(row).to_le_bytes());
           }
           DataType::UInt16 => {
               let array = col.as_any().downcast_ref::<UInt16Array>().unwrap();
               vec.extend_from_slice(&array.value(row).to_le_bytes());
           }
           DataType::UInt32 => {
               let array = col.as_any().downcast_ref::<UInt32Array>().unwrap();
               vec.extend_from_slice(&array.value(row).to_le_bytes());
           }
           DataType::UInt64 => {
               let array = col.as_any().downcast_ref::<UInt64Array>().unwrap();
               vec.extend_from_slice(&array.value(row).to_le_bytes());
           }
           DataType::Int8 => {
               let array = col.as_any().downcast_ref::<Int8Array>().unwrap();
               vec.extend_from_slice(&array.value(row).to_le_bytes());
           }
           DataType::Int16 => {
               let array = col.as_any().downcast_ref::<Int16Array>().unwrap();
               vec.extend(array.value(row).to_le_bytes().iter());
           }
           DataType::Int32 => {
               let array = col.as_any().downcast_ref::<Int32Array>().unwrap();
               vec.extend_from_slice(&array.value(row).to_le_bytes());
           }
           DataType::Int64 => {
               let array = col.as_any().downcast_ref::<Int64Array>().unwrap();
               vec.extend_from_slice(&array.value(row).to_le_bytes());
           }
           DataType::Timestamp(TimeUnit::Millisecond, None) => {
               let array = col
                   .as_any()
                   .downcast_ref::<TimestampMillisecondArray>()
                   .unwrap();
               vec.extend_from_slice(&array.value(row).to_le_bytes());
           }
           DataType::Timestamp(TimeUnit::Microsecond, None) => {
               let array = col
                   .as_any()
                   .downcast_ref::<TimestampMicrosecondArray>()
                   .unwrap();
               vec.extend_from_slice(&array.value(row).to_le_bytes());
           }
           DataType::Timestamp(TimeUnit::Nanosecond, None) => {
               let array = col
                   .as_any()
                   .downcast_ref::<TimestampNanosecondArray>()
                   .unwrap();
               vec.extend_from_slice(&array.value(row).to_le_bytes());
           }
           DataType::Utf8 => {
               let array = col.as_any().downcast_ref::<StringArray>().unwrap();
               let value = array.value(row);
               // store the size
               vec.extend_from_slice(&value.len().to_le_bytes());
               // store the string value
               vec.extend_from_slice(value.as_bytes());
           }
           DataType::LargeUtf8 => {
               let array = col.as_any().downcast_ref::<LargeStringArray>().unwrap();
               let value = array.value(row);
               // store the size
               vec.extend_from_slice(&value.len().to_le_bytes());
               // store the string value
               vec.extend_from_slice(value.as_bytes());
           }
           DataType::Date32 => {
               let array = col.as_any().downcast_ref::<Date32Array>().unwrap();
               vec.extend_from_slice(&array.value(row).to_le_bytes());
           }
           DataType::Dictionary(index_type, _) => match **index_type {
               DataType::Int8 => {
                   dictionary_create_key_for_col::<Int8Type>(col, row, vec)?;
               }
               DataType::Int16 => {
                   dictionary_create_key_for_col::<Int16Type>(col, row, vec)?;
               }
               DataType::Int32 => {
                   dictionary_create_key_for_col::<Int32Type>(col, row, vec)?;
               }
               DataType::Int64 => {
                   dictionary_create_key_for_col::<Int64Type>(col, row, vec)?;
               }
               DataType::UInt8 => {
                   dictionary_create_key_for_col::<UInt8Type>(col, row, vec)?;
               }
               DataType::UInt16 => {
                   dictionary_create_key_for_col::<UInt16Type>(col, row, vec)?;
               }
               DataType::UInt32 => {
                   dictionary_create_key_for_col::<UInt32Type>(col, row, vec)?;
               }
               DataType::UInt64 => {
                   dictionary_create_key_for_col::<UInt64Type>(col, row, vec)?;
               }
               _ => {
                   return Err(DataFusionError::Internal(format!(
                   "Unsupported GROUP BY type (dictionary index type not supported creating key) {}",
                   col.data_type(),
               )))
               }
           },
           _ => {
               // This is internal because we should have caught this before.
               return Err(DataFusionError::Internal(format!(
                   "Unsupported GROUP BY type creating key {}",
                   col.data_type(),
               )));
           }
       }
       Ok(())
   }
   
   /// Create a key `Vec<u8>` that is used as key for the hashmap
   ///
   /// This looks like
   /// [null_byte][col_value_bytes][null_byte][col_value_bytes]
   ///
   /// Note that relatively uncommon patterns (e.g. not 0x00) are chosen
   /// for the null_byte to make debugging easier. The actual values are
   /// arbitrary.
   ///
   /// For a NULL value in a column, the key looks like
   /// [0xFE]
   ///
   /// For a Non-NULL value in a column, this looks like:
   /// [0xFF][byte representation of column value]
   ///
   /// Example of a key with no NULL values:
   /// ```text
   ///                        0xFF byte at the start of each column
   ///                           signifies the value is non-null
   ///                                          │
   ///
   ///                      ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┴ ─ ─ ─ ─ ─ ─ ─ ┐
   ///
   ///                      │        string len                 │  0x1234
   /// {                    ▼       (as usize le)      "foo"    ▼(as u16 le)
   ///   k1: "foo"        ╔ ═┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──╦ ═┌──┬──┐
   ///   k2: 0x1234u16     FF║03│00│00│00│00│00│00│00│"f│"o│"o│FF║34│12│
   /// }                  ╚ ═└──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──╩ ═└──┴──┘
   ///                     0  1  2  3  4  5  6  7  8  9  10 11 12 13 14
   /// ```
   ///
   ///  Example of a key with NULL values:
   ///
   ///```text
   ///                         0xFE byte at the start of k1 column
   ///                     ┌ ─     signifies the value is NULL
   ///
   ///                     └ ┐
   ///                              0x1234
   /// {                     ▼    (as u16 le)
   ///   k1: NULL          ╔ ═╔ ═┌──┬──┐
   ///   k2: 0x1234u16      FE║FF║12│34│
   /// }                   ╚ ═╚ ═└──┴──┘
   ///                       0  1  2  3
   ///```
   pub(crate) fn create_key(
       group_by_keys: &[ArrayRef],
       row: usize,
       vec: &mut Vec<u8>,
   ) -> Result<()> {
       vec.clear();
       for col in group_by_keys {
           if !col.is_valid(row) {
               vec.push(0xFE);
           } else {
               vec.push(0xFF);
               create_key_for_col(col, row, vec)?
           }
       }
       Ok(())
    }
    ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


