You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/12 15:55:10 UTC

[GitHub] [arrow-rs] tustvold opened a new pull request, #1552: Create RecordBatch With Non-Zero Row Count But No Columns (#1536)

tustvold opened a new pull request, #1552:
URL: https://github.com/apache/arrow-rs/pull/1552

   # Which issue does this PR close?
   
   Closes #1536
   
   # Rationale for this change
    
   See ticket
   
   # What changes are included in this PR?
   
   Makes it possible to construct a RecordBatch with no columns, but a non-zero row count. I had a brief search in the codebase and couldn't find anything obvious that this would break, but I guess there is only one way to find out :smile: 
   
   # Are there any user-facing changes?
   
   This technically makes a breaking change to `RecordBatchOptions` as it wasn't marked as non-exhaustive.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1552: Create RecordBatch With Non-Zero Row Count But No Columns (#1536)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1552:
URL: https://github.com/apache/arrow-rs/pull/1552#discussion_r848650182


##########
arrow/src/record_batch.rs:
##########
@@ -402,15 +402,20 @@ impl RecordBatch {
 
 /// Options that control the behaviour used when creating a [`RecordBatch`].
 #[derive(Debug)]
+#[non_exhaustive]

Review Comment:
   non_exhaustive should make additive changes non-breaking, which should be good enough hopefully 😀



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1552: Create RecordBatch With Non-Zero Row Count But No Columns (#1536)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1552:
URL: https://github.com/apache/arrow-rs/pull/1552#discussion_r848650182


##########
arrow/src/record_batch.rs:
##########
@@ -402,15 +402,20 @@ impl RecordBatch {
 
 /// Options that control the behaviour used when creating a [`RecordBatch`].
 #[derive(Debug)]
+#[non_exhaustive]

Review Comment:
   non_exhaustive should make additive changes non-breaking, which should be good enough hopefully 😀
   
   Happy to change if you feel strongly



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1552: Create RecordBatch With Non-Zero Row Count But No Columns (#1536)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1552:
URL: https://github.com/apache/arrow-rs/pull/1552#discussion_r848648673


##########
arrow/src/record_batch.rs:
##########
@@ -243,7 +248,7 @@ impl RecordBatch {
     /// # }
     /// ```
     pub fn num_rows(&self) -> usize {
-        self.columns[0].data().len()
+        self.row_count

Review Comment:
   Yup, it's basically a way to support projecting no columns and just getting the row count. This is needed for #1537



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1552: Create RecordBatch With Non-Zero Row Count But No Columns (#1536)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1552:
URL: https://github.com/apache/arrow-rs/pull/1552#discussion_r848640104


##########
arrow/src/record_batch.rs:
##########
@@ -402,15 +402,20 @@ impl RecordBatch {
 
 /// Options that control the behaviour used when creating a [`RecordBatch`].
 #[derive(Debug)]
+#[non_exhaustive]

Review Comment:
   I don't feel particularly strongly either way, although have to write and maintain setters and accessors is a bit of a PIA



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on a diff in pull request #1552: Create RecordBatch With Non-Zero Row Count But No Columns (#1536)

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #1552:
URL: https://github.com/apache/arrow-rs/pull/1552#discussion_r848632777


##########
arrow/src/record_batch.rs:
##########
@@ -41,6 +41,7 @@ use crate::error::{ArrowError, Result};
 pub struct RecordBatch {
     schema: SchemaRef,
     columns: Vec<Arc<dyn Array>>,
+    row_count: usize,

Review Comment:
   maybe leave a comment here about why `row_count` is added?



##########
arrow/src/record_batch.rs:
##########
@@ -128,7 +125,15 @@ impl RecordBatch {
         }
 
         // check that all columns have the same row count
-        let row_count = columns[0].data().len();
+        let row_count = options
+            .row_count
+            .or_else(|| columns.first().map(|col| col.len()))
+            .ok_or_else(|| {
+                ArrowError::InvalidArgumentError(
+                    "must either specify a row count or at least one column".to_string(),
+                )
+            })?;
+
         if columns.iter().any(|c| c.len() != row_count) {
             return Err(ArrowError::InvalidArgumentError(
                 "all columns in a record batch must have the same length".to_string(),

Review Comment:
   This error will now also happen when the specified row count doesn't match the array counts (so the wording might be good to update?)



##########
arrow/src/record_batch.rs:
##########
@@ -243,7 +248,7 @@ impl RecordBatch {
     /// # }
     /// ```
     pub fn num_rows(&self) -> usize {
-        self.columns[0].data().len()
+        self.row_count

Review Comment:
   Maybe we could avoid adding `row_count` with something like
   
   ```rust
   self
     .columns
     .get(0)
     .map(|col| col.data.len())
     .unwrap_or(0)
   ```
   ?



##########
arrow/src/record_batch.rs:
##########
@@ -402,15 +402,20 @@ impl RecordBatch {
 
 /// Options that control the behaviour used when creating a [`RecordBatch`].
 #[derive(Debug)]
+#[non_exhaustive]

Review Comment:
   Or maybe it is time to make these fields non `pub` and add `with_match_field_name(&mut self, val: bool)` type accessors?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1552: Create RecordBatch With Non-Zero Row Count But No Columns (#1536)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1552:
URL: https://github.com/apache/arrow-rs/pull/1552#discussion_r848637520


##########
arrow/src/record_batch.rs:
##########
@@ -243,7 +248,7 @@ impl RecordBatch {
     /// # }
     /// ```
     pub fn num_rows(&self) -> usize {
-        self.columns[0].data().len()
+        self.row_count

Review Comment:
   This would always return 0 if there are no columns?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1552: Create RecordBatch With Non-Zero Row Count But No Columns (#1536)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1552:
URL: https://github.com/apache/arrow-rs/pull/1552#discussion_r848602337


##########
arrow/src/record_batch.rs:
##########
@@ -101,23 +100,21 @@ impl RecordBatch {
             .iter()
             .map(|field| new_empty_array(field.data_type()))
             .collect();
-        RecordBatch { schema, columns }
+
+        RecordBatch {
+            schema,
+            columns,
+            row_count: 0,
+        }
     }
 
     /// Validate the schema and columns using [`RecordBatchOptions`]. Returns an error
-    /// if any validation check fails.
-    fn validate_new_batch(
-        schema: &SchemaRef,
-        columns: &[ArrayRef],
+    /// if any validation check fails, otherwise returns the created [`RecordBatch`]
+    fn try_new_impl(

Review Comment:
   I'm not really sure why we need a separate method for this, but I avoided changing it to keep the diff down



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on a diff in pull request #1552: Create RecordBatch With Non-Zero Row Count But No Columns (#1536)

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #1552:
URL: https://github.com/apache/arrow-rs/pull/1552#discussion_r848646320


##########
arrow/src/record_batch.rs:
##########
@@ -243,7 +248,7 @@ impl RecordBatch {
     /// # }
     /// ```
     pub fn num_rows(&self) -> usize {
-        self.columns[0].data().len()
+        self.row_count

Review Comment:
   I see -- I think I was a little confused -- is the idea that the schema also has 0 fields? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on a diff in pull request #1552: Create RecordBatch With Non-Zero Row Count But No Columns (#1536)

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #1552:
URL: https://github.com/apache/arrow-rs/pull/1552#discussion_r848953681


##########
arrow/src/record_batch.rs:
##########
@@ -402,15 +402,20 @@ impl RecordBatch {
 
 /// Options that control the behaviour used when creating a [`RecordBatch`].
 #[derive(Debug)]
+#[non_exhaustive]

Review Comment:
   I don't feel strongly



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb merged pull request #1552: Create RecordBatch With Non-Zero Row Count But No Columns (#1536)

Posted by GitBox <gi...@apache.org>.
alamb merged PR #1552:
URL: https://github.com/apache/arrow-rs/pull/1552


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1552: Create RecordBatch With Non-Zero Row Count But No Columns (#1536)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1552:
URL: https://github.com/apache/arrow-rs/pull/1552#discussion_r848602711


##########
arrow/src/record_batch.rs:
##########
@@ -402,15 +402,20 @@ impl RecordBatch {
 
 /// Options that control the behaviour used when creating a [`RecordBatch`].
 #[derive(Debug)]
+#[non_exhaustive]

Review Comment:
   Adding a new option imo shouldn't be a breaking change, this should give us that



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on a diff in pull request #1552: Create RecordBatch With Non-Zero Row Count But No Columns (#1536)

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #1552:
URL: https://github.com/apache/arrow-rs/pull/1552#discussion_r848647062


##########
arrow/src/record_batch.rs:
##########
@@ -402,15 +402,20 @@ impl RecordBatch {
 
 /// Options that control the behaviour used when creating a [`RecordBatch`].
 #[derive(Debug)]
+#[non_exhaustive]

Review Comment:
   > although have to write and maintain setters and accessors is a bit of a PIA
   
   The price for backwards compatibility?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org