Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/08/29 19:49:12 UTC

[GitHub] [arrow-rs] tustvold commented on a diff in pull request #2593: POC: Comparable Row Format

tustvold commented on code in PR #2593:
URL: https://github.com/apache/arrow-rs/pull/2593#discussion_r957735173


##########
arrow/src/row/mod.rs:
##########
@@ -0,0 +1,577 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! A comparable row-oriented representation of a [`RecordBatch`]
+
+use half::f16;
+
+use crate::array::{
+    as_boolean_array, as_generic_binary_array, as_largestring_array, as_primitive_array,
+    as_string_array, Array, ArrayRef, Decimal128Array, Decimal256Array,
+};
+use crate::datatypes::*;
+use crate::record_batch::RecordBatch;
+use crate::util::decimal::{Decimal128, Decimal256};
+
+/// A row-oriented representation of a [`RecordBatch`]
+///
+/// # Format
+///
+/// The encoding of the row format should not be considered stable, but is documented here
+/// for reference.
+///
+/// The key property it provides is that a byte-wise comparison, e.g. [`memcmp`], is sufficient
+/// to establish the ordering of two rows, allowing for extremely fast comparisons
+///
+/// ## Unsigned Integer Encoding
+///
+/// A null integer is encoded as a `0_u8`, followed by zeroed bytes matching the
+/// integer's width
+///
+/// A valid integer is encoded as `1_u8`, followed by the big-endian representation of the
+/// integer
+///
+/// ## Signed Integer Encoding
+///
+/// Signed integers have their most significant (sign) bit flipped, and are then encoded in the
+/// same manner as an unsigned integer
+///
+/// ## Float Encoding
+///
+/// Floats are converted from IEEE 754 representation to a signed integer representation
+/// by flipping all bits except the sign bit if they are negative.
+///
+/// They are then encoded in the same manner as a signed integer
+///
+/// ## Variable Length Bytes Encoding
+///
+/// A null is encoded as a big endian encoded `0_u32`
+///
+/// A valid value is encoded as a big endian length, with the most significant bit set, followed
+/// by the byte values
+///
+/// ## Dictionary Encoding
+///
+/// **Not currently implemented**
+///
+/// Dictionaries are materialized to their values in the row format, which may require dramatically
+/// more memory than the source [`Array`]. It is recommended that the batch size is kept
+/// sufficiently small that this doesn't cause issues
+///
+/// # Ordering
+///
+/// ## Float Ordering
+///
+/// Floats are totally ordered in accordance with the `totalOrder` predicate as defined
+/// in the IEEE 754 (2008 revision) floating point standard.
+///
+/// The ordering established by this does not always agree with the
+/// [`PartialOrd`] and [`PartialEq`] implementations of `f32`. For example,
+/// they consider negative and positive zero equal, while this does not
+///
+/// ## Null Ordering
+///
+/// The row format currently orders nulls as less than non-null values. A future extension
+/// could allow configuring this, by inverting the representation of the null bit
+///
+/// ## Reverse Column Ordering
+///
+/// The row format does not currently support reversing the ordering of a specific column. A
+/// future extension could allow configuring this by inverting the bit representation of values
+/// for the column in question
+///
+/// ## Reconstruction
+///
+/// Given a schema it would theoretically be possible to reconstruct the columnar data from
+/// the row format, however, this is currently not supported. It is recommended that the row
+/// format is instead used to obtain a sorted list of row indices, which can then be used
+/// with [`take`](crate::compute::take) to obtain a sorted [`RecordBatch`]
+///
+/// [`memcmp`]: https://www.man7.org/linux/man-pages/man3/memcmp.3.html
+///
+#[derive(Debug)]
+pub struct RowBatch {
+    buffer: Box<[u8]>,
+    offsets: Box<[usize]>,
+}
+
+impl RowBatch {
+    /// Create a [`RowBatch`] from the provided [`RecordBatch`]
+    pub fn new(batch: &RecordBatch) -> Self {

Review Comment:
   Yeah, I was being lazy, supporting SortOptions is on my radar 👍
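
For illustration only, here is a minimal sketch of how per-column sort options might be layered on top of the signed integer encoding described in the doc comment. It is not the code in this PR: `SketchSortOptions` and `encode_i32` are hypothetical names, and using `0xFF` as the "nulls last" marker is an assumption about one way to invert the null representation.

```rust
/// Hypothetical stand-in for per-column sort options (not the arrow type).
#[derive(Clone, Copy)]
pub struct SketchSortOptions {
    pub descending: bool,
    pub nulls_first: bool,
}

/// Encode an optional `i32` as 5 bytes whose byte-wise order matches the
/// requested sort order. Hypothetical helper, for illustration only.
pub fn encode_i32(value: Option<i32>, opts: SketchSortOptions) -> [u8; 5] {
    let mut out = [0_u8; 5];
    match value {
        None => {
            // Assumption: 0x00 sorts before the valid marker (1), 0xFF after it.
            // The value bytes of a null stay zeroed.
            out[0] = if opts.nulls_first { 0x00 } else { 0xFF };
        }
        Some(v) => {
            out[0] = 1;
            // Flip the sign bit so negatives sort before positives, then store
            // big-endian so the most significant byte is compared first.
            let mut bytes = ((v as u32) ^ 0x8000_0000).to_be_bytes();
            if opts.descending {
                // Inverting every value bit reverses the byte-wise order.
                for b in bytes.iter_mut() {
                    *b = !*b;
                }
            }
            out[1..].copy_from_slice(&bytes);
        }
    }
    out
}
```

Because the options only change the bytes that are written, comparisons remain a plain byte-wise compare of the encoded rows regardless of the chosen direction.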

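As a reference point for the "Float Encoding" section of the doc comment, the following sketch shows one way the described transformation could be written for `f32`. The function name `encode_f32` is hypothetical, the null marker byte is omitted for brevity, and this is an illustration of the documented scheme rather than the PR's actual implementation.

```rust
/// Encode an `f32` as 4 bytes whose byte-wise order follows the IEEE 754
/// `totalOrder` predicate. Hypothetical helper; null handling is omitted.
pub fn encode_f32(value: f32) -> [u8; 4] {
    let bits = value.to_bits() as i32;
    // The arithmetic shift yields all ones for negative values; dropping the
    // top bit of that mask leaves the sign bit itself untouched.
    let mask = (((bits >> 31) as u32) >> 1) as i32;
    // Negative floats now order like negative two's-complement integers.
    let signed = bits ^ mask;
    // Same as the signed integer encoding: flip the sign bit, store big-endian.
    ((signed as u32) ^ 0x8000_0000).to_be_bytes()
}
```

Under this encoding negative zero sorts strictly before positive zero, which matches the documented difference from the `PartialOrd` implementation of `f32`.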