You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/18 10:23:13 UTC

[GitHub] [arrow-datafusion] yjshen opened a new pull request, #2261: Introduce RowLayout to represent rows for different purposes

yjshen opened a new pull request, #2261:
URL: https://github.com/apache/arrow-datafusion/pull/2261

   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   The second part of #2189.
   
    # Rationale for this change
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   To support an 8-byte aligned row layout for grouping states of hash aggregation.
   
   # What changes are included in this PR?
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   Enable the reading and writing raw-bytes rows with two possible layouts.
   
   # Are there any user-facing changes?
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   An API change might have no effects since it's on the optional feature `row`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] yjshen commented on a diff in pull request #2261: Introduce RowLayout to represent rows for different purposes

Posted by GitBox <gi...@apache.org>.
yjshen commented on code in PR #2261:
URL: https://github.com/apache/arrow-datafusion/pull/2261#discussion_r852027987


##########
datafusion/core/src/row/layout.rs:
##########
@@ -17,26 +17,94 @@
 
 //! Various row layout for different use case
 
-use crate::row::{schema_null_free, var_length};
+use crate::row::{row_supported, schema_null_free, var_length};
 use arrow::datatypes::{DataType, Schema};
 use arrow::util::bit_util::{ceil, round_upto_power_of_2};
+use std::fmt::{Debug, Formatter};
 use std::sync::Arc;
 
 const UTF8_DEFAULT_SIZE: usize = 20;
 const BINARY_DEFAULT_SIZE: usize = 100;
 
+#[derive(Copy, Clone, Debug)]

Review Comment:
   The main changes are below. Other files changes are almost mechanical.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #2261: Introduce RowLayout to represent rows for different purposes

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #2261:
URL: https://github.com/apache/arrow-datafusion/pull/2261#discussion_r853481992


##########
datafusion/core/src/row/layout.rs:
##########
@@ -50,13 +118,23 @@ fn type_width(dt: &DataType) -> usize {
     }
 }
 
+fn word_aligned_offsets(null_width: usize, schema: &Arc<Schema>) -> (Vec<usize>, usize) {
+    let mut offsets = vec![];
+    let mut offset = null_width;
+    for _ in schema.fields() {
+        offsets.push(offset);
+        offset += 8; // a 8-bytes word for each field

Review Comment:
   Sorry -- I should have been clearer -- what I think is important is that the code will assert / fail in some obvious way if/when someone decides to add `DataType::Decimal` support in the future. I don't think we actually need to support any additional sizes now, I am just thinking of our future selves (or some other new contributor) trying to debug a partial field overwrite because the fields were too small 🤯 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #2261: Introduce RowLayout to represent rows for different purposes

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #2261:
URL: https://github.com/apache/arrow-datafusion/pull/2261#discussion_r853256368


##########
datafusion/core/src/row/layout.rs:
##########
@@ -50,13 +118,23 @@ fn type_width(dt: &DataType) -> usize {
     }
 }
 
+fn word_aligned_offsets(null_width: usize, schema: &Arc<Schema>) -> (Vec<usize>, usize) {
+    let mut offsets = vec![];
+    let mut offset = null_width;
+    for _ in schema.fields() {
+        offsets.push(offset);
+        offset += 8; // a 8-bytes word for each field

Review Comment:
   It just seemed like a landmine for the future. It is fine



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] yjshen commented on a diff in pull request #2261: Introduce RowLayout to represent rows for different purposes

Posted by GitBox <gi...@apache.org>.
yjshen commented on code in PR #2261:
URL: https://github.com/apache/arrow-datafusion/pull/2261#discussion_r853283598


##########
datafusion/core/src/row/layout.rs:
##########
@@ -50,13 +118,23 @@ fn type_width(dt: &DataType) -> usize {
     }
 }
 
+fn word_aligned_offsets(null_width: usize, schema: &Arc<Schema>) -> (Vec<usize>, usize) {
+    let mut offsets = vec![];
+    let mut offset = null_width;
+    for _ in schema.fields() {
+        offsets.push(offset);
+        offset += 8; // a 8-bytes word for each field

Review Comment:
   Let me make it a function here to choose between 8 and 16 based on DataTypes. 
   
   > Also it would be nice to know variable length data works here
   
   I haven't settled down my mind if I'd store grouping keys as well as group state as one row or two separate rows and have a JoinedRow { key: vec<u8>, state: vec<u8> } for the hashtable in aggregate.
   
   But for the group states only, I prefer to not handle variable length states, such as an array, median, or sketch in approxs', since it's not efficient to do in-place updates for these states, we'd better keep with Vec<ScalarValue> states for them.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #2261: Introduce RowLayout to represent rows for different purposes

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2261:
URL: https://github.com/apache/arrow-datafusion/pull/2261#discussion_r854534284


##########
datafusion/core/tests/sql/explain_analyze.rs:
##########
@@ -136,8 +137,16 @@ async fn explain_analyze_baseline_metrics() {
             }
             let metrics = plan.metrics().unwrap().aggregate_by_partition();
 
-            assert!(metrics.output_rows().unwrap() > 0);
-            assert!(metrics.elapsed_compute().unwrap() > 0);
+            assert!(
+                metrics.output_rows().unwrap() > 0,
+                "plan: {}",
+                DisplayableExecutionPlan::with_metrics(plan).one_line()
+            );
+            assert!(

Review Comment:
   
   This change appears to be intermittently failing
   
   https://github.com/apache/arrow-datafusion/runs/6100177086?check_suite_focus=true
   
   https://github.com/apache/arrow-datafusion/runs/6100277301?check_suite_focus=true
   
   https://github.com/apache/arrow-datafusion/runs/6100969854?check_suite_focus=true
   
   Edit: This would appear to occur if the operator is too fast to record time passing, I'll get a PR up to fix



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] yjshen commented on a diff in pull request #2261: Introduce RowLayout to represent rows for different purposes

Posted by GitBox <gi...@apache.org>.
yjshen commented on code in PR #2261:
URL: https://github.com/apache/arrow-datafusion/pull/2261#discussion_r853253315


##########
datafusion/core/src/row/layout.rs:
##########
@@ -50,13 +118,23 @@ fn type_width(dt: &DataType) -> usize {
     }
 }
 
+fn word_aligned_offsets(null_width: usize, schema: &Arc<Schema>) -> (Vec<usize>, usize) {
+    let mut offsets = vec![];
+    let mut offset = null_width;
+    for _ in schema.fields() {
+        offsets.push(offset);
+        offset += 8; // a 8-bytes word for each field

Review Comment:
   Row format doesn't support Decimal for now, so just 8 bytes currently.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #2261: Introduce RowLayout to represent rows for different purposes

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #2261:
URL: https://github.com/apache/arrow-datafusion/pull/2261#issuecomment-1101766241

   I plan to review this PR first thing tomorrow morning US eastern time (~ 6AM or so)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #2261: Introduce RowLayout to represent rows for different purposes

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #2261:
URL: https://github.com/apache/arrow-datafusion/pull/2261#discussion_r854092749


##########
datafusion/core/src/row/layout.rs:
##########
@@ -109,20 +121,24 @@ fn compact_type_width(dt: &DataType) -> usize {
 fn word_aligned_offsets(null_width: usize, schema: &Schema) -> (Vec<usize>, usize) {
     let mut offsets = vec![];
     let mut offset = null_width;
-    for _ in schema.fields() {
+    for f in schema.fields() {
         offsets.push(offset);
-        offset += 8; // a 8-bytes word for each field
+        assert!(!matches!(f.data_type(), DataType::Decimal(_, _)));

Review Comment:
   👍 



##########
datafusion/core/src/row/jit/mod.rs:
##########
@@ -92,85 +93,190 @@ mod tests {
     fn_test_single_type!(
         BooleanArray,
         Boolean,
-        vec![Some(true), Some(false), None, Some(true), None]
+        vec![Some(true), Some(false), None, Some(true), None],
+        Compact
+    );
+
+    fn_test_single_type!(
+        BooleanArray,
+        Boolean,
+        vec![Some(true), Some(false), None, Some(true), None],
+        WordAligned

Review Comment:
   👌  very nice



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #2261: Introduce RowLayout to represent rows for different purposes

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #2261:
URL: https://github.com/apache/arrow-datafusion/pull/2261#discussion_r853006965


##########
datafusion/core/src/row/layout.rs:
##########
@@ -17,26 +17,94 @@
 
 //! Various row layout for different use case
 
-use crate::row::{schema_null_free, var_length};
+use crate::row::{row_supported, schema_null_free, var_length};
 use arrow::datatypes::{DataType, Schema};
 use arrow::util::bit_util::{ceil, round_upto_power_of_2};
+use std::fmt::{Debug, Formatter};
 use std::sync::Arc;
 
 const UTF8_DEFAULT_SIZE: usize = 20;
 const BINARY_DEFAULT_SIZE: usize = 100;
 
+#[derive(Copy, Clone, Debug)]
+/// Type of a RowLayout
+pub enum RowType {

Review Comment:
   👍 



##########
datafusion/core/src/row/layout.rs:
##########
@@ -17,26 +17,94 @@
 
 //! Various row layout for different use case
 
-use crate::row::{schema_null_free, var_length};
+use crate::row::{row_supported, schema_null_free, var_length};
 use arrow::datatypes::{DataType, Schema};
 use arrow::util::bit_util::{ceil, round_upto_power_of_2};
+use std::fmt::{Debug, Formatter};
 use std::sync::Arc;
 
 const UTF8_DEFAULT_SIZE: usize = 20;
 const BINARY_DEFAULT_SIZE: usize = 100;
 
+#[derive(Copy, Clone, Debug)]
+/// Type of a RowLayout
+pub enum RowType {
+    /// This type of layout will store each field with minimum bytes for space efficiency.
+    /// Its typical use case represents a sorting payload that accesses all row fields as a unit.
+    Compact,
+    /// This type of layout will store one 8-byte word per field for CPU-friendly,
+    /// It is mainly used to represent the rows with frequently updated content,
+    /// for example, grouping state for hash aggregation.
+    WordAligned,
+    // RawComparable,
+}
+
+/// Reveals how the fields of a record are stored in the raw-bytes format
+pub(crate) struct RowLayout {
+    /// Type of the layout
+    type_: RowType,
+    /// If a row is null free according to its schema
+    pub(crate) null_free: bool,
+    /// The number of bytes used to store null bits for each field.
+    pub(crate) null_width: usize,
+    /// Length in bytes for `values` part of the current tuple.
+    pub(crate) values_width: usize,
+    /// Total number of fields for each tuple.
+    pub(crate) field_count: usize,
+    /// Starting offset for each fields in the raw bytes.
+    pub(crate) field_offsets: Vec<usize>,
+}
+
+impl RowLayout {
+    pub(crate) fn new(schema: &Arc<Schema>, type_: RowType) -> Self {

Review Comment:
   ```suggestion
       pub(crate) fn new(schema: &Schema, type_: RowType) -> Self {
   ```



##########
datafusion/core/src/row/layout.rs:
##########
@@ -17,26 +17,94 @@
 
 //! Various row layout for different use case
 
-use crate::row::{schema_null_free, var_length};
+use crate::row::{row_supported, schema_null_free, var_length};
 use arrow::datatypes::{DataType, Schema};
 use arrow::util::bit_util::{ceil, round_upto_power_of_2};
+use std::fmt::{Debug, Formatter};
 use std::sync::Arc;
 
 const UTF8_DEFAULT_SIZE: usize = 20;
 const BINARY_DEFAULT_SIZE: usize = 100;
 
+#[derive(Copy, Clone, Debug)]
+/// Type of a RowLayout
+pub enum RowType {
+    /// This type of layout will store each field with minimum bytes for space efficiency.
+    /// Its typical use case represents a sorting payload that accesses all row fields as a unit.
+    Compact,
+    /// This type of layout will store one 8-byte word per field for CPU-friendly,
+    /// It is mainly used to represent the rows with frequently updated content,
+    /// for example, grouping state for hash aggregation.
+    WordAligned,
+    // RawComparable,
+}
+
+/// Reveals how the fields of a record are stored in the raw-bytes format
+pub(crate) struct RowLayout {
+    /// Type of the layout
+    type_: RowType,
+    /// If a row is null free according to its schema
+    pub(crate) null_free: bool,
+    /// The number of bytes used to store null bits for each field.
+    pub(crate) null_width: usize,
+    /// Length in bytes for `values` part of the current tuple.
+    pub(crate) values_width: usize,
+    /// Total number of fields for each tuple.
+    pub(crate) field_count: usize,
+    /// Starting offset for each fields in the raw bytes.
+    pub(crate) field_offsets: Vec<usize>,
+}
+
+impl RowLayout {
+    pub(crate) fn new(schema: &Arc<Schema>, type_: RowType) -> Self {
+        assert!(row_supported(schema));
+        let null_free = schema_null_free(schema);
+        let field_count = schema.fields().len();
+        let null_width = if null_free { 0 } else { ceil(field_count, 8) };
+        let (field_offsets, values_width) = match type_ {
+            RowType::Compact => compact_offsets(null_width, schema),
+            RowType::WordAligned => word_aligned_offsets(null_width, schema),
+        };
+        Self {
+            type_,
+            null_free,
+            null_width,
+            values_width,
+            field_count,
+            field_offsets,
+        }
+    }
+
+    #[inline(always)]
+    pub(crate) fn init_varlena_offset(&self) -> usize {
+        self.null_width + self.values_width
+    }
+}
+
+impl Debug for RowLayout {
+    fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
+        f.debug_struct("RowLayout")
+            .field("type", &self.type_)
+            .field("null_width", &self.null_width)
+            .field("values_width", &self.values_width)
+            .field("field_count", &self.field_count)
+            .field("offsets", &self.field_offsets)
+            .finish()
+    }
+}
+
 /// Get relative offsets for each field and total width for values
-pub fn get_offsets(null_width: usize, schema: &Arc<Schema>) -> (Vec<usize>, usize) {
+fn compact_offsets(null_width: usize, schema: &Arc<Schema>) -> (Vec<usize>, usize) {

Review Comment:
   ```suggestion
   fn compact_offsets(null_width: usize, schema: &Schema) -> (Vec<usize>, usize) {
   ```



##########
datafusion/core/src/row/layout.rs:
##########
@@ -17,26 +17,94 @@
 
 //! Various row layout for different use case
 
-use crate::row::{schema_null_free, var_length};
+use crate::row::{row_supported, schema_null_free, var_length};
 use arrow::datatypes::{DataType, Schema};
 use arrow::util::bit_util::{ceil, round_upto_power_of_2};
+use std::fmt::{Debug, Formatter};
 use std::sync::Arc;
 
 const UTF8_DEFAULT_SIZE: usize = 20;
 const BINARY_DEFAULT_SIZE: usize = 100;
 
+#[derive(Copy, Clone, Debug)]
+/// Type of a RowLayout
+pub enum RowType {
+    /// This type of layout will store each field with minimum bytes for space efficiency.
+    /// Its typical use case represents a sorting payload that accesses all row fields as a unit.
+    Compact,
+    /// This type of layout will store one 8-byte word per field for CPU-friendly,
+    /// It is mainly used to represent the rows with frequently updated content,
+    /// for example, grouping state for hash aggregation.
+    WordAligned,
+    // RawComparable,
+}
+
+/// Reveals how the fields of a record are stored in the raw-bytes format
+pub(crate) struct RowLayout {
+    /// Type of the layout
+    type_: RowType,
+    /// If a row is null free according to its schema
+    pub(crate) null_free: bool,
+    /// The number of bytes used to store null bits for each field.
+    pub(crate) null_width: usize,
+    /// Length in bytes for `values` part of the current tuple.
+    pub(crate) values_width: usize,
+    /// Total number of fields for each tuple.
+    pub(crate) field_count: usize,
+    /// Starting offset for each fields in the raw bytes.
+    pub(crate) field_offsets: Vec<usize>,
+}
+
+impl RowLayout {
+    pub(crate) fn new(schema: &Arc<Schema>, type_: RowType) -> Self {
+        assert!(row_supported(schema));
+        let null_free = schema_null_free(schema);
+        let field_count = schema.fields().len();
+        let null_width = if null_free { 0 } else { ceil(field_count, 8) };
+        let (field_offsets, values_width) = match type_ {
+            RowType::Compact => compact_offsets(null_width, schema),
+            RowType::WordAligned => word_aligned_offsets(null_width, schema),
+        };
+        Self {
+            type_,
+            null_free,
+            null_width,
+            values_width,
+            field_count,
+            field_offsets,
+        }
+    }
+
+    #[inline(always)]
+    pub(crate) fn init_varlena_offset(&self) -> usize {
+        self.null_width + self.values_width
+    }
+}
+
+impl Debug for RowLayout {

Review Comment:
   I wonder if there is a reason you chose not to use `#[derive(Debug)]` here rather than a manual implementation
   



##########
datafusion/core/src/row/reader.rs:
##########
@@ -85,52 +85,36 @@ macro_rules! fn_get_idx_opt {
 
 /// Read the tuple `data[base_offset..]` we are currently pointing to
 pub struct RowReader<'a> {
+    /// Layout on how to read each field

Review Comment:
   👍  this is nice



##########
datafusion/core/src/row/layout.rs:
##########
@@ -50,13 +118,23 @@ fn type_width(dt: &DataType) -> usize {
     }
 }
 
+fn word_aligned_offsets(null_width: usize, schema: &Arc<Schema>) -> (Vec<usize>, usize) {
+    let mut offsets = vec![];
+    let mut offset = null_width;
+    for _ in schema.fields() {
+        offsets.push(offset);
+        offset += 8; // a 8-bytes word for each field

Review Comment:
   Are we sure that all field types can fit within 8 bytes?  I am thinking about types like `DataType::Decimal` stored in a `u128` / `16 bytes`)?
   
   Also it would be nice to know variable length data works here



##########
datafusion/core/src/row/layout.rs:
##########
@@ -17,26 +17,94 @@
 
 //! Various row layout for different use case
 
-use crate::row::{schema_null_free, var_length};
+use crate::row::{row_supported, schema_null_free, var_length};
 use arrow::datatypes::{DataType, Schema};
 use arrow::util::bit_util::{ceil, round_upto_power_of_2};
+use std::fmt::{Debug, Formatter};
 use std::sync::Arc;
 
 const UTF8_DEFAULT_SIZE: usize = 20;
 const BINARY_DEFAULT_SIZE: usize = 100;
 
+#[derive(Copy, Clone, Debug)]
+/// Type of a RowLayout
+pub enum RowType {
+    /// This type of layout will store each field with minimum bytes for space efficiency.
+    /// Its typical use case represents a sorting payload that accesses all row fields as a unit.
+    Compact,
+    /// This type of layout will store one 8-byte word per field for CPU-friendly,
+    /// It is mainly used to represent the rows with frequently updated content,
+    /// for example, grouping state for hash aggregation.
+    WordAligned,
+    // RawComparable,
+}
+
+/// Reveals how the fields of a record are stored in the raw-bytes format
+pub(crate) struct RowLayout {
+    /// Type of the layout
+    type_: RowType,
+    /// If a row is null free according to its schema
+    pub(crate) null_free: bool,
+    /// The number of bytes used to store null bits for each field.
+    pub(crate) null_width: usize,
+    /// Length in bytes for `values` part of the current tuple.
+    pub(crate) values_width: usize,
+    /// Total number of fields for each tuple.
+    pub(crate) field_count: usize,
+    /// Starting offset for each fields in the raw bytes.
+    pub(crate) field_offsets: Vec<usize>,
+}
+
+impl RowLayout {
+    pub(crate) fn new(schema: &Arc<Schema>, type_: RowType) -> Self {

Review Comment:
   Minor stylistic comment



##########
datafusion/core/src/row/writer.rs:
##########
@@ -96,51 +96,34 @@ macro_rules! fn_set_idx {
 
 /// Reusable row writer backed by Vec<u8>
 pub struct RowWriter {
+    /// Layout on how to write each field

Review Comment:
   ❤️  for reduced repetition



##########
datafusion/core/src/row/mod.rs:
##########
@@ -176,8 +176,8 @@ mod tests {
                     let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(a)])?;
                     let mut vector = vec![0; 1024];
                     let row_offsets =

Review Comment:
   Would it be possible to extend these tests to include the new `WordAligned` format as well?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb merged pull request #2261: Introduce RowLayout to represent rows for different purposes

Posted by GitBox <gi...@apache.org>.
alamb merged PR #2261:
URL: https://github.com/apache/arrow-datafusion/pull/2261


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #2261: Introduce RowLayout to represent rows for different purposes

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2261:
URL: https://github.com/apache/arrow-datafusion/pull/2261#discussion_r854534284


##########
datafusion/core/tests/sql/explain_analyze.rs:
##########
@@ -136,8 +137,16 @@ async fn explain_analyze_baseline_metrics() {
             }
             let metrics = plan.metrics().unwrap().aggregate_by_partition();
 
-            assert!(metrics.output_rows().unwrap() > 0);
-            assert!(metrics.elapsed_compute().unwrap() > 0);
+            assert!(
+                metrics.output_rows().unwrap() > 0,
+                "plan: {}",
+                DisplayableExecutionPlan::with_metrics(plan).one_line()
+            );
+            assert!(

Review Comment:
   
   
   This change appears to be intermittently failing
   
   https://github.com/apache/arrow-datafusion/runs/6100177086?check_suite_focus=true
   
   https://github.com/apache/arrow-datafusion/runs/6100277301?check_suite_focus=true
   
   https://github.com/apache/arrow-datafusion/runs/6100969854?check_suite_focus=true



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] yjshen commented on a diff in pull request #2261: Introduce RowLayout to represent rows for different purposes

Posted by GitBox <gi...@apache.org>.
yjshen commented on code in PR #2261:
URL: https://github.com/apache/arrow-datafusion/pull/2261#discussion_r853283598


##########
datafusion/core/src/row/layout.rs:
##########
@@ -50,13 +118,23 @@ fn type_width(dt: &DataType) -> usize {
     }
 }
 
+fn word_aligned_offsets(null_width: usize, schema: &Arc<Schema>) -> (Vec<usize>, usize) {
+    let mut offsets = vec![];
+    let mut offset = null_width;
+    for _ in schema.fields() {
+        offsets.push(offset);
+        offset += 8; // a 8-bytes word for each field

Review Comment:
   Let me make it a function here to choose between 8 and 16 based on DataTypes. 
   
   > Also it would be nice to know variable length data works here
   
   I haven't settled down my mind if I'd store grouping keys as well as group state as one row or two separate rows and have a JoinedRow { key: `vec<u8>`, state: `vec<u8>` } for the hashtable in aggregate.
   
   But for the group states only, I prefer to not handle variable length states, such as an array, median, or sketch in approxs', since it's not efficient to do in-place updates for these varlena states, we'd better keep with Vec<ScalarValue> states for them.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] yjshen commented on a diff in pull request #2261: Introduce RowLayout to represent rows for different purposes

Posted by GitBox <gi...@apache.org>.
yjshen commented on code in PR #2261:
URL: https://github.com/apache/arrow-datafusion/pull/2261#discussion_r853283598


##########
datafusion/core/src/row/layout.rs:
##########
@@ -50,13 +118,23 @@ fn type_width(dt: &DataType) -> usize {
     }
 }
 
+fn word_aligned_offsets(null_width: usize, schema: &Arc<Schema>) -> (Vec<usize>, usize) {
+    let mut offsets = vec![];
+    let mut offset = null_width;
+    for _ in schema.fields() {
+        offsets.push(offset);
+        offset += 8; // a 8-bytes word for each field

Review Comment:
   Let me make it a function here to choose between 8 and 16 based on DataTypes. 
   
   > Also it would be nice to know variable length data works here
   
   I haven't settled down my mind if I'd store grouping keys as well as group state as one row or two separate rows and have a JoinedRow { key: `vec<u8>`, state: `vec<u8>` } for the hashtable in aggregate.
   
   But for the group states only, I prefer to not handle variable length states, such as an array, median, or sketch in approxs', since it's not efficient to do in-place updates for these states, we'd better keep with Vec<ScalarValue> states for them.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org