You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/01 20:31:46 UTC

[GitHub] [arrow-site] alamb commented on a diff in pull request #264: [WEBSITE] Blog posts on multi-column sorting implementation

alamb commented on code in PR #264:
URL: https://github.com/apache/arrow-site/pull/264#discussion_r1010871430


##########
_posts/2022-10-30-multi-column-sorts-in-arrow-rust-part-1.md:
##########
@@ -0,0 +1,210 @@
+---
+layout: post
+title: "Fast and Memory Efficient Multi-Column Sorts in Apache Arrow Rust, Part 1"
+date: "2022-10-30 00:00:00"
+author: "tustvold and alamb"
+categories: [arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+Sorting is one of the most fundamental operations in modern databases and other analytic systems, underpinning common operators such as aggregates, joins, window functions, merge, and more. By some estimates, more than half of the execution time in data processing systems is spent sorting. Optimizing sorts is therefore vital to improving query performance and overall system efficiency.
+
+Sorting is also one of the most well studied topics in computer science. The classic survey paper for databases is [Implementing Sorting in Database Systems](https://dl.acm.org/doi/10.1145/1132960.1132964) by Goetz Graefe which provides a thorough academic treatment that is still very applicable today. However, it may not be obvious how to apply the wisdom and advanced techniques described in that paper to modern systems.
+
+In this blog post we explain in detail the new [row format](https://docs.rs/arrow/25.0.0/arrow/row/index.html) in the [Rust implementation](https://github.com/apache/arrow-rs) of [Apache Arrow](https://arrow.apache.org/), and how this can be used to perform blazingly fast multi-column sorts. The excellent [DuckDB blog on sorting](https://duckdb.org/2021/08/27/external-sorting.html) highlights several sorting techniques, and mentions such a comparable row format, but it does not explain how to efficiently sort variable length strings or dictionary encoded data, which we do in this post.
+
+## Multicolumn / Lexicographical Sort Problem
+
+Most languages have native optimized operations to sort a single column (array) of data and are specialized based on the type of data. The reason that sorting is typically more challenging in analytic systems is that they must:
+1. Support sorting by multiple columns of data
+2. The column types are not knowable at compile time, and thus the compiler can not typically generate optimized code.
+
+Multicolumn sorting is also referred to as lexicographical sorting in some libraries.
+
+For example, given sales data for various customers and their state of residence, a user might want to find the top 10 orders for each state. One way to do so is to order the data first by State (ascending) and then by Orders (descending)
+
+```sql
+Customer | State | Orders
+—--------+-------+-------
+12345    |  MA   |  10.12
+532432   |  MA   |  8.44
+12345    |  CA   |  3.25
+56232    |  WA   |  6.00
+23442    |  WA   |  132.50
+7844     |  CA   |  9.33
+852353   |  MA   |  1.30
+```
+
+(Note: While there are specialized ways for computing this particular query other that sorting the entire input (“TopK”), they typically need the same multi-column comparison operation described below, so we will use the simplified example in our post but it does apply more broadly)
+
+## Basic Implementation
+
+Let us take the example of a basic sort kernel which takes a set of columns as input, and returns a list of indices identifying the sorted order.
+
+```python
+> lexsort_to_indices([
+  [“MA”, “MA”, “CA”, “WA”, “WA”, “CA”, “MA”]
+])
+[2, 5, 0, 1, 6, 3, 4]
+
+> lexsort_to_indices([

Review Comment:
   in bf7760b51d



##########
_posts/2022-10-30-multi-column-sorts-in-arrow-rust-part-2.md:
##########
@@ -0,0 +1,240 @@
+---
+layout: post
+title: "Fast and Memory Efficient Multi-Column Sorts in Apache Arrow Rust, Part 2"
+date: "2022-10-30 00:00:00"
+author: "tustvold and alamb"
+categories: [arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+In Part 1 <!-- TODO add link -->  of this post, we described the problem of Multi-Column Sorting and challenges to implementing it efficiently. This second post explains how the new [row format](https://docs.rs/arrow/25.0.0/arrow/row/index.html) in the [Rust implementation](https://github.com/apache/arrow-rs) of [Apache Arrow](https://arrow.apache.org/) sorts more quickly.
+
+
+## Row Format
+    Now that you have a taste for why a comparable byte array representation is such a compelling primitive, you will be pleased to learn that the row format added to arrow-rs is such a representation. The rest of this article will explain  how it works, but if you just want to use it, check out the [docs](https://docs.rs/arrow/latest/arrow/row/index.html), the [code](https://github.com/apache/arrow-rs/blob/07024f6/arrow/src/row/mod.rs#L105-L187), and [bugtracker](https://github.com/apache/arrow-rs/issues).
+
+The Row Format is a variable length byte sequence created by concatenating the encoded form of each column. The encoding for each column depends on both the datatype as well as the requested sort options (ascending, descending, nulls first and null last).
+
+```
+   ┌─────┐   ┌─────┐   ┌─────┐
+   │     │   │     │   │     │
+   ├─────┤ ┌ ┼─────┼ ─ ┼─────┼ ┐             ┏━━━━━┳━━━━━━━━┓
+   │     │   │     │   │     │  ────────────▶┃     ┃        ┃
+   ├─────┤ └ ┼─────┼ ─ ┼─────┼ ┘             ┗━━━━━┻━━━━━━━━┛
+   │     │   │     │   │     │
+   └─────┘   └─────┘   └─────┘
+               ...
+   ┌─────┐ ┌ ┬─────┬ ─ ┬─────┬ ┐             ┏━━━━━━━━┓
+   │     │   │     │   │     │  ────────────▶┃        ┃
+   └─────┘ └ ┴─────┴ ─ ┴─────┴ ┘             ┗━━━━━━━━┛
+   Customer    State    Orders
+    UInt64      Utf8     F64
+
+          Input Arrays                          Row Format
+           (Columns)
+```
+
+### Unsigned Integers
+
+To encode a non-null unsigned integer, the byte `0x01` is written, followed by the integer’s bytes starting with the most significant, i.e. big endian. A null is encoded as a `0x00` byte, followed by the encoded bytes of the integer’s zero value
+
+```
+              ┌──┬──┬──┬──┐      ┌──┬──┬──┬──┬──┐
+   3          │03│00│00│00│      │01│00│00│00│03│
+              └──┴──┴──┴──┘      └──┴──┴──┴──┴──┘
+              ┌──┬──┬──┬──┐      ┌──┬──┬──┬──┬──┐
+  258         │02│01│00│00│      │01│00│00│01│02│
+              └──┴──┴──┴──┘      └──┴──┴──┴──┴──┘
+              ┌──┬──┬──┬──┐      ┌──┬──┬──┬──┬──┐
+ 23423        │7F│5B│00│00│      │01│00│00│5B│7F│
+              └──┴──┴──┴──┘      └──┴──┴──┴──┴──┘
+              ┌──┬──┬──┬──┐      ┌──┬──┬──┬──┬──┐
+ NULL         │??│??│??│??│      │00│00│00│00│00│
+              └──┴──┴──┴──┘      └──┴──┴──┴──┴──┘
+
+             32-bit (4 bytes)        Row Format
+ Value        Little Endian
+```
+
+### Signed Integers
+
+In Rust and most modern computer architectures, signed integers are encoded using [two's complement](https://en.wikipedia.org/wiki/Two%27s_complement), where a number is negated by flipping all the bits, and adding 1. Therefore, flipping the top-most bit and treating the result as an unsigned integer preserves the order. This unsigned integer can then be encoded using the same encoding for unsigned integers described in the previous section. For example
+
+```

Review Comment:
   done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org