You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@arrow.apache.org by al...@apache.org on 2022/11/07 12:18:39 UTC

[arrow-site] branch master updated: [WEBSITE] Blog posts on multi-column sorting implementation (#264)

This is an automated email from the ASF dual-hosted git repository.

alamb pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-site.git


The following commit(s) were added to refs/heads/master by this push:
     new 4920b06c28 [WEBSITE] Blog posts on multi-column sorting implementation (#264)
4920b06c28 is described below

commit 4920b06c289f4c48f7f522f42cbbd6479daf08e9
Author: Andrew Lamb <an...@nerdnetworks.org>
AuthorDate: Mon Nov 7 07:18:34 2022 -0500

    [WEBSITE] Blog posts on multi-column sorting implementation (#264)
    
    * [WEBSITE] Blog posts on multi-column sorting implementation
    
    * Apply suggestions from code review from @tustvold
    
    Co-authored-by: Raphael Taylor-Davies <17...@users.noreply.github.com>
    
    * Update _posts/2022-10-30-multi-column-sorts-in-arrow-rust-part-1.md
    
    Co-authored-by: Raphael Taylor-Davies <17...@users.noreply.github.com>
    
    * fix: Use hex in signed integer example
    
    * fix smart quots
    
    * Update example to use ascending sorts
    
    * Update example to use ascending sorts
    
    * Wordsmithing, diagram tweaks, add performance summary to introduction
    
    * Apply suggestions from code review
    
    Co-authored-by: Paddy Horan <57...@users.noreply.github.com>
    Co-authored-by: Raphael Taylor-Davies <17...@users.noreply.github.com>
    
    * Update _posts/2022-10-30-multi-column-sorts-in-arrow-rust-part-1.md
    
    Co-authored-by: Raphael Taylor-Davies <17...@users.noreply.github.com>
    
    * Apply suggestions from code review
    
    Co-authored-by: Raphael Taylor-Davies <17...@users.noreply.github.com>
    
    * Apply suggestions from code review
    
    Co-authored-by: Raphael Taylor-Davies <17...@users.noreply.github.com>
    
    * Add sentence motivating why escaping is unecessary in row format
    
    * whitespace engineering
    
    * Apply suggestions from code review
    
    Co-authored-by: Raphael Taylor-Davies <17...@users.noreply.github.com>
    
    * Update _posts/2022-10-30-multi-column-sorts-in-arrow-rust-part-1.md
    
    Co-authored-by: Raphael Taylor-Davies <17...@users.noreply.github.com>
    
    * Apply suggestions from code review
    
    Co-authored-by: Sutou Kouhei <ko...@cozmixng.org>
    
    * Update date to 2022-11-07
    
    * Apply final edits from @tustvold
    
    Co-authored-by: Raphael Taylor-Davies <17...@users.noreply.github.com>
    Co-authored-by: Paddy Horan <57...@users.noreply.github.com>
    Co-authored-by: Sutou Kouhei <ko...@cozmixng.org>
---
 ...1-07-multi-column-sorts-in-arrow-rust-part-1.md | 232 +++++++++++++++++++
 ...1-07-multi-column-sorts-in-arrow-rust-part-2.md | 245 +++++++++++++++++++++
 2 files changed, 477 insertions(+)

diff --git a/_posts/2022-11-07-multi-column-sorts-in-arrow-rust-part-1.md b/_posts/2022-11-07-multi-column-sorts-in-arrow-rust-part-1.md
new file mode 100644
index 0000000000..1b3a6f89f0
--- /dev/null
+++ b/_posts/2022-11-07-multi-column-sorts-in-arrow-rust-part-1.md
@@ -0,0 +1,232 @@
+---
+layout: post
+title: "Fast and Memory Efficient Multi-Column Sorts in Apache Arrow Rust, Part 1"
+date: "2022-11-07 00:00:00"
+author: "tustvold and alamb"
+categories: [arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+Sorting is one of the most fundamental operations in modern databases and other analytic systems, underpinning important operators such as aggregates, joins, window functions, merge, and more. By some estimates, more than half of the execution time in data processing systems is spent sorting. Optimizing sorts is therefore vital to improving query performance and overall system efficiency.
+
+Sorting is also one of the most well studied topics in computer science. The classic survey paper for databases is [Implementing Sorting in Database Systems](https://dl.acm.org/doi/10.1145/1132960.1132964) by Goetz Graefe which provides a thorough academic treatment and is still very applicable today. However, it may not be obvious how to apply the wisdom and advanced techniques described in that paper to modern systems. In addition, the excellent [DuckDB blog on sorting](https://duckdb. [...]
+
+In this series we explain in detail the new [row format](https://docs.rs/arrow/25.0.0/arrow/row/index.html) in the [Rust implementation](https://github.com/apache/arrow-rs) of [Apache Arrow](https://arrow.apache.org/), and how we used to make sorting more than [3x](https://github.com/apache/arrow-rs/pull/2929) faster than an alternate comparator based approach. The benefits are especially pronounced for strings, dictionary encoded data, and sorts with large numbers of columns.
+
+
+## Multicolumn / Lexicographical Sort Problem
+
+Most languages have native, optimized operations to sort a single column (array) of data, which are specialized based on the type of data being sorted. The reason that sorting is typically more challenging in analytic systems is that it must:
+
+1. They must support multiple columns of data
+2. The column types are not knowable at compile time, and thus the compiler can not typically generate optimized code.
+
+Multicolumn sorting is also referred to as lexicographical sorting in some libraries.
+
+For example, given sales data for various customers and their state of residence, a user might want to find the lowest 10 orders for each state.
+
+```text
+Customer | State | Orders
+—--------+-------+-------
+12345    |  MA   |  10.12
+532432   |  MA   |  8.44
+12345    |  CA   |  3.25
+56232    |  WA   |  6.00
+23442    |  WA   |  132.50
+7844     |  CA   |  9.33
+852353   |  MA   |  1.30
+```
+
+One way to do so is to order the data first by `State` and then by `Orders`:
+```text
+Customer | State | Orders
+—--------+-------+-------
+12345    |  CA   |  3.25
+7844     |  CA   |  9.33
+852353   |  MA   |  1.30
+532432   |  MA   |  8.44
+12345    |  MA   |  10.12
+56232    |  WA   |  6.00
+23442    |  WA   |  132.50
+```
+
+(Note: While there are specialized ways for computing this particular query other than fully sorting the entire input (e.g. "TopK"), they typically need the same multi-column comparison operation described below. Thus while we will use the simplified example in this series, it applies much more broadly)
+
+## Basic Implementation
+
+Let us take the example of a basic sort kernel which takes a set of columns as input, and returns a list of indices identifying a sorted order.
+
+```python
+> lexsort_to_indices([
+    ["MA", "MA", "CA", "WA", "WA", "CA", "MA"]
+  ])
+
+[2, 5, 0, 1, 6, 3, 4]
+
+> lexsort_to_indices([
+    ["MA", "MA", "CA", "WA", "WA",   "CA", "MA"],
+    [10.10, 8.44, 3.25, 6.00, 132.50, 9.33, 1.30]
+  ])
+
+[2, 5, 6, 1, 0, 3, 4]
+```
+
+This function returns a list of indices instead of sorting the columns directly because it:
+1. Avoids expensive copying data during the sorting process
+2. Allows deferring copying of values until the latest possible moment
+3. Can be used to reorder additional columns that weren’t part of the sort key
+
+
+A straightforward implementation of lexsort_to_indices uses a comparator function,
+
+```text
+   row
+  index
+        ┌─────┐   ┌─────┐   ┌─────┐     compare(left_index, right_index)
+      0 │     │   │     │   │     │
+       ┌├─────┤─ ─├─────┤─ ─├─────┤┐                   │             │
+        │     │   │     │   │     │ ◀──────────────────┘             │
+       └├─────┤─ ─├─────┤─ ─├─────┤┘                                 │
+        │     │   │     │   │     │Comparator function compares one  │
+        ├─────┤   ├─────┤   ├─────┤ multi-column row with another.   │
+        │     │   │     │   │     │                                  │
+        ├─────┤   ├─────┤   ├─────┤ The data types of the columns    │
+        │     │   │     │   │     │  and the sort options are not    │
+        └─────┘   └─────┘   └─────┘  known at compile time, only     │
+                    ...                        runtime               │
+                                                                     │
+       ┌┌─────┐─ ─┌─────┐─ ─┌─────┐┐                                 │
+        │     │   │     │   │     │ ◀────────────────────────────────┘
+       └├─────┤─ ─├─────┤─ ─├─────┤┘
+        │     │   │     │   │     │
+        ├─────┤   ├─────┤   ├─────┤
+    N-1 │     │   │     │   │     │
+        └─────┘   └─────┘   └─────┘
+        Customer    State    Orders
+         UInt64      Utf8     F64
+```
+
+
+The comparator function compares each row a column at a time, based on the column types
+
+```text
+                         ┌────────────────────────────────┐
+                         │                                │
+                         ▼                                │
+                     ┌ ─ ─ ─ ┐ ┌ ─ ─ ─ ┐                  │
+                                                          │
+            ┌─────┐  │┌─────┐│ │┌─────┐│                  │
+left_index  │     │   │     │   │     │                   │
+            └─────┘  │└─────┘│ │└─────┘│   Step 1: Compare State
+                                                    (UInt64)
+                     │       │ │       │
+
+                     │       │ │       │
+            ┌─────┐   ┌─────┐   ┌─────┐
+ right_index│     │  ││     ││ ││     ││
+            └─────┘   └─────┘   └─────┘    Step 2: If State values equal
+                     │       │ │       │   compare Orders (F64)
+            Customer   State     Orders                     │
+             UInt64  │  Utf8 │ │  F64  │                    │
+                      ─ ─ ─ ─   ─ ─ ─ ─                     │
+                                    ▲                       │
+                                    │                       │
+                                    └───────────────────────┘
+```
+
+Pseudocode for this operation might look something like
+
+```python
+# Takes a list of columns and returns the lexicographically
+# sorted order as a list of indices
+def lexsort_to_indices(columns):
+  comparator = build_comparator(columns)
+
+  # Construct a list of integers from 0 to the number of rows
+  # and sort it according to the comparator
+  [0..columns.num_rows()].sort_by(comparator)
+
+# Build a function that given indexes (left_idx, right_idx)
+# returns the comparison of the sort keys at the left
+# and right indices respectively
+def build_comparator(columns):
+  def comparator(left_idx, right_idx):
+    for column in columns:
+      # call a compare function which performs
+      # dynamic dispatch on type of left and right columns
+      ordering = compare(column, left_idx,right_idx)
+      if ordering != Equal {
+        return ordering
+      }
+    # All values equal
+    Equal
+  # Return comparator function
+  comparator
+
+  # compares the values in a single column at left_idx and right_idx
+  def compare(column, left_idx, right_idx):
+    # Choose comparison based on type of column ("dynamic dispatch")
+    if column.type == Int:
+     cmp(column[left_idx].as_int(), column[right_idx].as_int())
+    elif column.type == Float:
+     cmp(column[left_idx].as_float(), column[right_idx].as_float())
+    ...
+```
+
+Greater detail is beyond the scope of this post, but in general the more predictable the behavior of a block of code, the better its performance will be. In the case of this pseudocode,  there is clear room for improvement:
+
+1. `comparator` performs a large number of unpredictable conditional branches, where the path execution takes depends on the data values
+2. `comparator` and `compare` use dynamic dispatch, which not only adds further conditional branches, but also function call overhead
+3. `comparator` performs a large number of reads of memory at unpredictable locations
+
+You can find the complete implementation of multi-column comparator construction in arrow-rs in [sort.rs](https://github.com/apache/arrow-rs/blob/f629a2ebe08033e7b78585d82e98c50a4439e7a2/arrow/src/compute/kernels/sort.rs#L905-L1036) and [ord.rs](https://github.com/apache/arrow-rs/blob/f629a2e/arrow/src/array/ord.rs#L178-L313).
+
+
+# Normalized Keys / Byte Array Comparisons
+
+Now imagine we had a way to represent each logical row of data as a sequence of bytes, and that byte-wise comparison of that sequence yielded the same result as comparing the actual column values using the code above. Such a representation would require no switching on column types, and the kernel would become
+
+```python
+def lexsort_to_indices(columns):
+  rows = convert_to_rows(columns)
+  [0..columns.num_rows()].sort_by(lambda l, r: cmp(rows[l], rows[r]))
+```
+
+While this approach does require converting to/from the byte array representation, it has some major advantages:
+
+* Rows can be compared by comparing bytes in memory, which modern computer hardware excels at with the extremely well optimized [memcmp](https://www.man7.org/linux/man-pages/man3/memcmp.3.html)
+* Memory accesses are largely predictable
+* There is no dynamic dispatch overhead
+* Extends straightforwardly to more sophisticated sorting strategies such as
+    * Distribution-based sorting techniques such as radix sort
+    * Parallel merge sort
+    * External sort
+    * ...
+
+You can find more information on how to leverage such representation in the "Binary String Comparison" section of the [DuckDB blog post](https://duckdb.org/2021/08/27/external-sorting.html) on the topic as well as [Graefe’s paper](https://dl.acm.org/doi/10.1145/1132960.1132964). However, we found it wasn’t immediately obvious how to apply this technique to variable length string or dictionary encoded data, which we will explain in the next post in this series.
+
+
+## Next up: Row Format
+
+This post has introduced the concept and challenges of multi column sorting, and shown why a comparable byte array representation, such as the [row format](https://docs.rs/arrow/25.0.0/arrow/row/index.html) introduced to the [Rust implementation](https://github.com/apache/arrow-rs) of [Apache Arrow](https://arrow.apache.org/), is such a compelling primitive.
+
+In [the next post]({% post_url 2022-11-07-multi-column-sorts-in-arrow-rust-part-2 %}) we explain how this encoding works, but if you just want to use it, check out the [docs](https://docs.rs/arrow/latest/arrow/row/index.html) for getting started, and report any issues on our [bugtracker](https://github.com/apache/arrow-rs/issues). As always, the [Arrow community](https://github.com/apache/arrow-rs#arrow-rust-community) very much looks forward to seeing what you build with it!
diff --git a/_posts/2022-11-07-multi-column-sorts-in-arrow-rust-part-2.md b/_posts/2022-11-07-multi-column-sorts-in-arrow-rust-part-2.md
new file mode 100644
index 0000000000..161abc8fc0
--- /dev/null
+++ b/_posts/2022-11-07-multi-column-sorts-in-arrow-rust-part-2.md
@@ -0,0 +1,245 @@
+---
+layout: post
+title: "Fast and Memory Efficient Multi-Column Sorts in Apache Arrow Rust, Part 2"
+date: "2022-10-30 00:00:00"
+author: "tustvold and alamb"
+categories: [arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+In [Part 1]({% post_url 2022-11-07-multi-column-sorts-in-arrow-rust-part-1 %}) of this post, we described the problem of Multi-Column Sorting and the challenges of implementing it efficiently. This second post explains how the new [row format](https://docs.rs/arrow/25.0.0/arrow/row/index.html) in the [Rust implementation](https://github.com/apache/arrow-rs) of [Apache Arrow](https://arrow.apache.org/) works and is constructed.
+
+
+## Row Format
+
+The row format is a variable length byte sequence created by concatenating the encoded form of each column. The encoding for each column depends on its datatype (and sort options).
+
+```
+   ┌─────┐   ┌─────┐   ┌─────┐
+   │     │   │     │   │     │
+   ├─────┤ ┌ ┼─────┼ ─ ┼─────┼ ┐              ┏━━━━━━━━━━━━━┓
+   │     │   │     │   │     │  ─────────────▶┃             ┃
+   ├─────┤ └ ┼─────┼ ─ ┼─────┼ ┘              ┗━━━━━━━━━━━━━┛
+   │     │   │     │   │     │
+   └─────┘   └─────┘   └─────┘
+               ...
+   ┌─────┐ ┌ ┬─────┬ ─ ┬─────┬ ┐              ┏━━━━━━━━┓
+   │     │   │     │   │     │  ─────────────▶┃        ┃
+   └─────┘ └ ┴─────┴ ─ ┴─────┴ ┘              ┗━━━━━━━━┛
+   Customer    State    Orders
+    UInt64      Utf8     F64
+
+          Input Arrays                          Row Format
+           (Columns)
+```
+
+The encoding is carefully designed in such a way that escaping is unnecessary: it is never ambiguous as to whether a byte is part of a sentinel (e.g. null) or a value.
+
+### Unsigned Integers
+
+To encode a non-null unsigned integer, the byte `0x01` is written, followed by the integer’s bytes starting with the most significant, i.e. big endian. A null is encoded as a `0x00` byte, followed by the encoded bytes of the integer’s zero value
+
+```
+              ┌──┬──┬──┬──┐      ┌──┬──┬──┬──┬──┐
+   3          │03│00│00│00│      │01│00│00│00│03│
+              └──┴──┴──┴──┘      └──┴──┴──┴──┴──┘
+              ┌──┬──┬──┬──┐      ┌──┬──┬──┬──┬──┐
+  258         │02│01│00│00│      │01│00│00│01│02│
+              └──┴──┴──┴──┘      └──┴──┴──┴──┴──┘
+              ┌──┬──┬──┬──┐      ┌──┬──┬──┬──┬──┐
+ 23423        │7F│5B│00│00│      │01│00│00│5B│7F│
+              └──┴──┴──┴──┘      └──┴──┴──┴──┴──┘
+              ┌──┬──┬──┬──┐      ┌──┬──┬──┬──┬──┐
+ NULL         │??│??│??│??│      │00│00│00│00│00│
+              └──┴──┴──┴──┘      └──┴──┴──┴──┴──┘
+
+             32-bit (4 bytes)        Row Format
+ Value        Little Endian
+```
+
+### Signed Integers
+
+In Rust and most modern computer architectures, signed integers are encoded using [two's complement](https://en.wikipedia.org/wiki/Two%27s_complement), where a number is negated by flipping all the bits, and adding 1. Therefore, flipping the top-most bit and treating the result as an unsigned integer preserves the order. This unsigned integer can then be encoded using the same encoding for unsigned integers described in the previous section. For example
+
+```
+       ┌──┬──┬──┬──┐       ┌──┬──┬──┬──┐       ┌──┬──┬──┬──┬──┐
+    5  │05│00│00│00│       │05│00│00│80│       │01│80│00│00│05│
+       └──┴──┴──┴──┘       └──┴──┴──┴──┘       └──┴──┴──┴──┴──┘
+       ┌──┬──┬──┬──┐       ┌──┬──┬──┬──┐       ┌──┬──┬──┬──┬──┐
+   -5  │FB│FF│FF│FF│       │FB│FF│FF│7F│       │01│7F│FF│FF│FB│
+       └──┴──┴──┴──┘       └──┴──┴──┴──┘       └──┴──┴──┴──┴──┘
+
+ Value  32-bit (4 bytes)    High bit flipped      Row Format
+         Little Endian
+```
+
+### Floating Point
+
+Floating point values can be ordered according to the [IEEE 754 totalOrder predicate](https://en.wikipedia.org/wiki/IEEE_754#Total-ordering_predicate) (implemented in Rust by [f32::total_cmp](https://doc.rust-lang.org/std/primitive.f32.html#method.total_cmp)). This ordering interprets the bytes of the floating point value as the correspondingly sized, signed, little-endian integer, flipping all the bits except the sign bit in the case of negatives.
+
+Floating point values are therefore encoded to row format by converting them to the appropriate sized signed integer representation, and then using the same encoding for signed integers described in the previous section.
+
+### Byte Arrays (Including Strings)
+
+Unlike primitive types above, byte arrays are variable length. For short strings, such as `state` in our example above, it is possible to pad all values to the length of the longest one with some fixed value such as `0x00` and produce a fixed length row. This is the approach described in the DuckDB blog for encoding `c_birth_country`.
+
+However, often values in string columns differ substantially in length or the maximum length is not known at the start of execution, making it inadvisable and/or impractical to pad the strings to a fixed length. The Rust Arrow row format therefore uses a variable length encoding.
+
+We need an encoding that unambiguously terminates the end of the byte array. This not only permits recovering the original value from the row format, but ensures that bytes of a longer byte array are not compared against bytes from a different column when compared against a row containing a shorter byte array.
+
+A null byte array is encoded as a single `0x00` byte. Similarly, an empty byte array is encoded as a single `0x01` byte.
+
+To encode a non-null, non-empty array, first a single `0x02` byte  is written. Then the array is written in 32-byte blocks, with each complete block followed by a `0xFF` byte as a continuation token. The final block is padded to 32-bytes with `0x00`, and is then followed by the unpadded length of this final block as a single byte in place of a continuation token
+
+Note the following example encodings use a block size of 4 bytes, as opposed to 32 bytes for brevity
+
+```
+                      ┌───┬───┬───┬───┬───┬───┐
+ "MEEP"               │02 │'M'│'E'│'E'│'P'│04 │
+                      └───┴───┴───┴───┴───┴───┘
+
+                      ┌───┐
+ ""                   │01 |
+                      └───┘
+
+ NULL                 ┌───┐
+                      │00 │
+                      └───┘
+
+"Defenestration"      ┌───┬───┬───┬───┬───┬───┐
+                      │02 │'D'│'e'│'f'│'e'│FF │
+                      └───┼───┼───┼───┼───┼───┤
+                          │'n'│'e'│'s'│'t'│FF │
+                          ├───┼───┼───┼───┼───┤
+                          │'r'│'a'│'t'│'r'│FF │
+                          ├───┼───┼───┼───┼───┤
+                          │'a'│'t'│'i'│'o'│FF │
+                          ├───┼───┼───┼───┼───┤
+                          │'n'│00 │00 │00 │17 │
+                          └───┴───┴───┴───┴───┘
+```
+
+This approach is loosely inspired by [COBS encoding](https://en.wikipedia.org/wiki/Consistent_Overhead_Byte_Stuffing), and chosen over more traditional [byte stuffing](https://en.wikipedia.org/wiki/High-Level_Data_Link_Control#Asynchronous_framing) as it is more amenable to vectorization, in particular hardware with AVX-256 can copy a 32-byte block in a single instruction.
+
+### Dictionary Arrays
+Dictionary Encoded Data (called [categorical](https://pandas.pydata.org/docs/user_guide/categorical.html) in pandas) is increasingly important because they can store and process low cardinality data very efficiently.
+
+A simple approach to encoding dictionary arrays would be to encode the logical values directly using the encodings for primitive values described previously. However, this would lose the benefits of dictionary encoding to reduce memory and CPU consumption.
+
+To further complicate matters, the [Arrow implementation of Dictionary encoding](https://arrow.apache.org/docs/format/Columnar.html#dictionary-encoded-layout) is quite general, and we can make no assumptions about the contents of the dictionaries. In particular, we cannot assume that the dictionary values are sorted, nor that the same dictionary is used for all arrays within a column
+
+The following example shows how a string column might be encoded in two arrays using two different dictionaries. The dictionary keys `0`, `1`, and `2` in the first batch correspond to different values than the same keys in the second dictionary.
+
+```
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+  ┌───────────┐ ┌─────┐    │
+│ │"Fabulous" │ │  0  │
+  ├───────────┤ ├─────┤    │
+│ │   "Bar"   │ │  2  │
+  ├───────────┤ ├─────┤    │       ┌───────────┐
+│ │  "Soup"   │ │  2  │            │"Fabulous" │
+  └───────────┘ ├─────┤    │       ├───────────┤
+│               │  0  │            │  "Soup"   │
+                ├─────┤    │       ├───────────┤
+│               │  1  │            │  "Soup"   │
+                └─────┘    │       ├───────────┤
+│                                  │"Fabulous" │
+                 Values    │       ├───────────┤
+│ Dictionary   (indexes in         │   "Bar"   │
+               dictionary) │       ├───────────┤
+│                                  │   "ZZ"    │
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘       ├───────────┤
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─        │   "Bar"   │
+                           │       ├───────────┤
+│ ┌───────────┐ ┌─────┐            │   "ZZ"    │
+  │"Fabulous" │ │  1  │    │       ├───────────┤
+│ ├───────────┤ ├─────┤            │"Fabulous" │
+  │   "ZZ"    │ │  2  │    │       └───────────┘
+│ ├───────────┤ ├─────┤
+  │   "Bar"   │ │  1  │    │
+│ └───────────┘ ├─────┤
+                │  0  │    │      Logical column
+│               └─────┘               values
+                Values     │
+│  Dictionary (indexes in
+              dictionary)  │
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+```
+
+The key observation which allows us to efficiently create a row format for this kind of data is that given a byte array, a new byte array can always be created which comes before or after it in the sort order by adding an additional byte.
+
+Therefore we can incrementally build an order-preserving mapping from dictionary values to variable length byte arrays, without needing to know all possible dictionary values beforehand, instead introducing mappings for new dictionary values as we encounter them.
+
+```
+┌──────────┐                 ┌─────┐
+│  "Bar"   │ ───────────────▶│ 01  │
+└──────────┘                 └─────┘
+┌──────────┐                 ┌─────┬─────┐
+│"Fabulous"│ ───────────────▶│ 01  │ 02  │
+└──────────┘                 └─────┴─────┘
+┌──────────┐                 ┌─────┐
+│  "Soup"  │ ───────────────▶│ 05  │
+└──────────┘                 └─────┘
+┌──────────┐                 ┌─────┐
+│   "ZZ"   │ ───────────────▶│ 07  │
+└──────────┘                 └─────┘
+
+    Example Order Preserving Mapping
+```
+
+The details of the data structure used to generate this mapping are beyond the scope of this blog post, but may be the topic of a future post. You can find [the code here](https://github.com/apache/arrow-rs/blob/07024f6a16b870fda81cba5779b8817b20386ebf/arrow/src/row/interner.rs).
+
+The data structure also ensures that no values contain `0x00` and therefore we can encode the arrays directly using `0x00` as an end-delimiter.
+
+A null value is encoded as a single `0x00` byte, and a non-null value encoded as a single `0x01` byte, followed by the `0x00` terminated byte array determined by the order preserving mapping
+
+```
+                          ┌─────┬─────┬─────┬─────┐
+   "Fabulous"             │ 01  │ 03  │ 05  │ 00  │
+                          └─────┴─────┴─────┴─────┘
+
+                          ┌─────┬─────┬─────┐
+   "ZZ"                   │ 01  │ 07  │ 00  │
+                          └─────┴─────┴─────┘
+
+                          ┌─────┐
+    NULL                  │ 00  │
+                          └─────┘
+
+     Input                  Row Format
+```
+
+### Sort Options
+
+One detail we have so far ignored over is how to support ascending and descending sorts (e.g. `ASC` or `DESC` in SQL). The Arrow Rust row format supports these options by simply inverting the bytes of the encoded representation, except the initial byte used for nullability encoding, on a per column basis.
+
+Similarly, supporting SQL compatible sorting also requires a format that can specify the order of `NULL`s (before or after all non `NULL` values). The row format supports this option by optionally encoding nulls as `0xFF` instead of `0x00` on a per column basis.
+
+## Conclusion
+
+Hopefully these two articles have given you a flavor of what is possible with a comparable row format and how it works. Feel free to check out the [docs](https://docs.rs/arrow/latest/arrow/row/index.html) for instructions on getting started, and report any issues on our [bugtracker](https://github.com/apache/arrow-rs/issues).
+
+Using this format for lexicographic sorting is more than [3x](https://github.com/apache/arrow-rs/pull/2929) faster than the comparator based approach, with the benefits especially pronounced for strings, dictionaries and sorts with large numbers of columns.
+
+We have also already used it to more than [double](https://github.com/apache/arrow-datafusion/pull/3386) the performance of sort preserving merge in the [DataFusion project](https://arrow.apache.org/datafusion/), and expect similar or greater performance uplift as we apply it to sort, grouping, join, and window function operators as well.
+
+As always, the [Arrow community](https://github.com/apache/arrow-rs#arrow-rust-community) very much looks forward to seeing what you build with it!