You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/07/27 01:02:56 UTC

[GitHub] [arrow-datafusion] comphead opened a new pull request, #2968: fix writer index out of bounds

comphead opened a new pull request, #2968:
URL: https://github.com/apache/arrow-datafusion/pull/2968

   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #2910 .
   
    # Rationale for this change
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   The weird bug was reported, it can be reproduced locally in `row_writer_resize_test`, the reason when the column is not nullable it gives 1 extra byte to initial offset, and this breaks the buffer resize logic, causing out of range issues on `        self.data[varlena_offset..varlena_offset + size].copy_from_slice(bytes);
   `
   
   # What changes are included in this PR?
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are there any user-facing changes?
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] liukun4515 commented on a diff in pull request #2968: fix `RowWriter` index out of bounds error

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on code in PR #2968:
URL: https://github.com/apache/arrow-datafusion/pull/2968#discussion_r931731720


##########
datafusion/row/src/writer.rs:
##########
@@ -349,7 +349,7 @@ pub(crate) fn write_field_utf8(
     let from = from.as_any().downcast_ref::<StringArray>().unwrap();
     let s = from.value(row_idx);
     let new_width = to.current_width() + s.as_bytes().len();

Review Comment:
   I figure out the corner case to panic for your fix.
   You can add the test with string data type, but the each value of array is empty, just like that
   ```
   arrow::array::StringArray::from(vec!["","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","",""]);
   ```
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] kmitchener commented on pull request #2968: fix writer index out of bounds

Posted by GitBox <gi...@apache.org>.
kmitchener commented on PR #2968:
URL: https://github.com/apache/arrow-datafusion/pull/2968#issuecomment-1196760007

   Per my note on #2963, this PR fixes that issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] comphead commented on a diff in pull request #2968: fix writer index out of bounds

Posted by GitBox <gi...@apache.org>.
comphead commented on code in PR #2968:
URL: https://github.com/apache/arrow-datafusion/pull/2968#discussion_r931283945


##########
datafusion/row/src/writer.rs:
##########
@@ -349,7 +349,7 @@ pub(crate) fn write_field_utf8(
     let from = from.as_any().downcast_ref::<StringArray>().unwrap();
     let s = from.value(row_idx);
     let new_width = to.current_width() + s.as_bytes().len();

Review Comment:
   > And in the `set_utf8` or `set_binary`, we can use
   > 
   > ```
   >         self.varlena_offset += (size + 8);
   >         self.varlena_width += (size +8);
   > ```
   > 
   > to update the offset and width.
   
   That was also my vision. But my concern here if we add 8 extra bytes per field, it may bloat the row size significantly?
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] liukun4515 commented on a diff in pull request #2968: fix `RowWriter` index out of bounds error

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on code in PR #2968:
URL: https://github.com/apache/arrow-datafusion/pull/2968#discussion_r931731720


##########
datafusion/row/src/writer.rs:
##########
@@ -349,7 +349,7 @@ pub(crate) fn write_field_utf8(
     let from = from.as_any().downcast_ref::<StringArray>().unwrap();
     let s = from.value(row_idx);
     let new_width = to.current_width() + s.as_bytes().len();

Review Comment:
   I figure out the corner case to panic for your fix.
   You can add the test with string data type, but the each value of string empty, just like that
   ```
   arrow::array::StringArray::from(vec!["","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","",""]);
   ```
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] ursabot commented on pull request #2968: fix `RowWriter` index out of bounds error

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #2968:
URL: https://github.com/apache/arrow-datafusion/pull/2968#issuecomment-1199570589

   Benchmark runs are scheduled for baseline = 176f4329dad5800c2f0c29edd21086f899bef676 and contender = 811bad406f5ab4d58352cedc64e28f2bcacba9a9. 811bad406f5ab4d58352cedc64e28f2bcacba9a9 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/65ec6218d1a04553804aa2624f7f293a...9023b1ddd41a4efe9ebf0a94f3913239/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/b967cdf2a8ea4f169bd115db86ca377e...0a2c6cf87bb54a95bfb887d50518ae57/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/fbb07730d13c44edb8706c293febf4b5...4d5642f9ffdd4aa7b403b5cf2677e297/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/8a697ecc5d9c40409faf0b9f1045cf78...fa866d9e87f142e1a770a175e4c3bd8f/)
   Buildkite builds:
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] yjshen commented on a diff in pull request #2968: fix `RowWriter` index out of bounds error

Posted by GitBox <gi...@apache.org>.
yjshen commented on code in PR #2968:
URL: https://github.com/apache/arrow-datafusion/pull/2968#discussion_r931737325


##########
datafusion/row/src/writer.rs:
##########
@@ -349,7 +349,7 @@ pub(crate) fn write_field_utf8(
     let from = from.as_any().downcast_ref::<StringArray>().unwrap();
     let s = from.value(row_idx);
     let new_width = to.current_width() + s.as_bytes().len();

Review Comment:
   @liukun4515 can elaborate on why there are two '+ 8' suggested?
   
   ```
       self.varlena_offset += (size + 8);
       self.varlena_width += (size +8);
   ```
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] comphead commented on a diff in pull request #2968: fix `RowWriter` index out of bounds error

Posted by GitBox <gi...@apache.org>.
comphead commented on code in PR #2968:
URL: https://github.com/apache/arrow-datafusion/pull/2968#discussion_r932376056


##########
datafusion/row/src/writer.rs:
##########
@@ -269,29 +269,29 @@ impl RowWriter {
 
 /// Stitch attributes of tuple in `batch` at `row_idx` and returns the tuple width
 pub fn write_row(
-    row: &mut RowWriter,
+    row_writer: &mut RowWriter,

Review Comment:
   I thought its better naming, because when reading the code I supposed the `row`  is either incoming row or row buffer, both  are not correct. row_writer better reflects the object imho.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] yjshen commented on pull request #2968: fix writer index out of bounds

Posted by GitBox <gi...@apache.org>.
yjshen commented on PR #2968:
URL: https://github.com/apache/arrow-datafusion/pull/2968#issuecomment-1196993266

   I'm aware of this PR but ran out of time today. Will come back tomorrow morning.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] yjshen commented on a diff in pull request #2968: fix `RowWriter` index out of bounds error

Posted by GitBox <gi...@apache.org>.
yjshen commented on code in PR #2968:
URL: https://github.com/apache/arrow-datafusion/pull/2968#discussion_r932786946


##########
datafusion/row/src/writer.rs:
##########
@@ -349,9 +349,9 @@ pub(crate) fn write_field_utf8(
     let from = from.as_any().downcast_ref::<StringArray>().unwrap();
     let s = from.value(row_idx);
     let new_width = to.current_width() + s.as_bytes().len();
-    if new_width > to.data.capacity() {
+    if new_width > to.data.len() {
         // double the capacity to avoid repeated resize

Review Comment:
   outdated code comments



##########
datafusion/row/src/writer.rs:
##########
@@ -365,9 +365,9 @@ pub(crate) fn write_field_binary(
     let from = from.as_any().downcast_ref::<BinaryArray>().unwrap();
     let s = from.value(row_idx);
     let new_width = to.current_width() + s.len();
-    if new_width > to.data.capacity() {
+    if new_width > to.data.len() {
         // double the capacity to avoid repeated resize

Review Comment:
   same here



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] liukun4515 commented on a diff in pull request #2968: fix `RowWriter` index out of bounds error

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on code in PR #2968:
URL: https://github.com/apache/arrow-datafusion/pull/2968#discussion_r931725250


##########
datafusion/row/src/writer.rs:
##########
@@ -349,7 +349,7 @@ pub(crate) fn write_field_utf8(
     let from = from.as_any().downcast_ref::<StringArray>().unwrap();
     let s = from.value(row_idx);
     let new_width = to.current_width() + s.as_bytes().len();

Review Comment:
   @comphead you can go through the code of `set_utf8`.
   ```
       fn set_utf8(&mut self, idx: usize, value: &str) {
           self.assert_index_valid(idx);
           let bytes = value.as_bytes();
           let size = bytes.len();
           // this line: write the offset and size to the row_buffer
           self.set_offset_size(idx, size as u32);
           let varlena_offset = self.varlena_offset;
           self.data[varlena_offset..varlena_offset + size].copy_from_slice(bytes);
           self.varlena_offset += size;
           self.varlena_width += size;
       }
   ```
   Not per field, and variable length data type like utf8 and binary.
   From the @yjshen  codebase, I think the layout of utf8 and binary should be
   ```
   ─ ─ ─ ─ ─ ─  ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ 
   │ offset and size│     utf8 data or binary data │
   ─ ─ ─ ─ ─ ─  ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ 
   ```
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] liukun4515 commented on a diff in pull request #2968: fix `RowWriter` index out of bounds error

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on code in PR #2968:
URL: https://github.com/apache/arrow-datafusion/pull/2968#discussion_r931771668


##########
datafusion/row/src/writer.rs:
##########
@@ -349,7 +349,7 @@ pub(crate) fn write_field_utf8(
     let from = from.as_any().downcast_ref::<StringArray>().unwrap();
     let s = from.value(row_idx);
     let new_width = to.current_width() + s.as_bytes().len();
-    if new_width > to.data.capacity() {
+    if new_width > to.data.len() {

Review Comment:
   thanks for your explain.
   Please ignore my comments. @comphead 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] liukun4515 commented on a diff in pull request #2968: fix writer index out of bounds

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on code in PR #2968:
URL: https://github.com/apache/arrow-datafusion/pull/2968#discussion_r931081569


##########
datafusion/row/src/writer.rs:
##########
@@ -349,7 +349,7 @@ pub(crate) fn write_field_utf8(
     let from = from.as_any().downcast_ref::<StringArray>().unwrap();
     let s = from.value(row_idx);
     let new_width = to.current_width() + s.as_bytes().len();
-    if new_width > to.data.capacity() {
+    if new_width > to.data.len() {

Review Comment:
   Do we need to change the  `write_field_binary` like this?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] yjshen commented on a diff in pull request #2968: fix `RowWriter` index out of bounds error

Posted by GitBox <gi...@apache.org>.
yjshen commented on code in PR #2968:
URL: https://github.com/apache/arrow-datafusion/pull/2968#discussion_r931748678


##########
datafusion/row/src/writer.rs:
##########
@@ -349,7 +349,7 @@ pub(crate) fn write_field_utf8(
     let from = from.as_any().downcast_ref::<StringArray>().unwrap();
     let s = from.value(row_idx);
     let new_width = to.current_width() + s.as_bytes().len();
-    if new_width > to.data.capacity() {
+    if new_width > to.data.len() {

Review Comment:
   >  if the new width or new length is less than the capacity of data, why need to expand the space.
   
   I think it's because `Vec` in rust doesn't own the space between (len, capacity].



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] yjshen commented on pull request #2968: fix `RowWriter` index out of bounds error

Posted by GitBox <gi...@apache.org>.
yjshen commented on PR #2968:
URL: https://github.com/apache/arrow-datafusion/pull/2968#issuecomment-1198948522

   @alamb wants to take another look?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] comphead commented on pull request #2968: fix writer index out of bounds

Posted by GitBox <gi...@apache.org>.
comphead commented on PR #2968:
URL: https://github.com/apache/arrow-datafusion/pull/2968#issuecomment-1196155678

   @yjshen please check this PR draft, as I noticed the row_writer logic is mostly done by you


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] liukun4515 commented on a diff in pull request #2968: fix writer index out of bounds

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on code in PR #2968:
URL: https://github.com/apache/arrow-datafusion/pull/2968#discussion_r931081569


##########
datafusion/row/src/writer.rs:
##########
@@ -349,7 +349,7 @@ pub(crate) fn write_field_utf8(
     let from = from.as_any().downcast_ref::<StringArray>().unwrap();
     let s = from.value(row_idx);
     let new_width = to.current_width() + s.as_bytes().len();
-    if new_width > to.data.capacity() {
+    if new_width > to.data.len() {

Review Comment:
   Do we need to change the  `write_field_utf8` like this?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] liukun4515 commented on a diff in pull request #2968: fix `RowWriter` index out of bounds error

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on code in PR #2968:
URL: https://github.com/apache/arrow-datafusion/pull/2968#discussion_r931772254


##########
datafusion/row/src/writer.rs:
##########
@@ -349,7 +349,7 @@ pub(crate) fn write_field_utf8(
     let from = from.as_any().downcast_ref::<StringArray>().unwrap();
     let s = from.value(row_idx);
     let new_width = to.current_width() + s.as_bytes().len();

Review Comment:
   Please ignore my comment, I have a error thought about the row layout when writing the comments



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb merged pull request #2968: fix `RowWriter` index out of bounds error

Posted by GitBox <gi...@apache.org>.
alamb merged PR #2968:
URL: https://github.com/apache/arrow-datafusion/pull/2968


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] comphead commented on a diff in pull request #2968: fix writer index out of bounds

Posted by GitBox <gi...@apache.org>.
comphead commented on code in PR #2968:
URL: https://github.com/apache/arrow-datafusion/pull/2968#discussion_r931283945


##########
datafusion/row/src/writer.rs:
##########
@@ -349,7 +349,7 @@ pub(crate) fn write_field_utf8(
     let from = from.as_any().downcast_ref::<StringArray>().unwrap();
     let s = from.value(row_idx);
     let new_width = to.current_width() + s.as_bytes().len();

Review Comment:
   > And in the `set_utf8` or `set_binary`, we can use
   > 
   > ```
   >         self.varlena_offset += (size + 8);
   >         self.varlena_width += (size +8);
   > ```
   > 
   > to update the offset and width.
   
   @liukun4515 That was my vision too. But my concern here if we add 8 extra bytes per field, it may bloat the row size significantly?
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #2968: fix writer index out of bounds

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #2968:
URL: https://github.com/apache/arrow-datafusion/pull/2968#discussion_r931335900


##########
datafusion/core/src/dataframe.rs:
##########
@@ -1266,4 +1266,45 @@ mod tests {
 
         Ok(())
     }
+
+    #[tokio::test]
+    async fn row_writer_resize_test() -> Result<()> {
+        let schema = Arc::new(Schema::new(vec![arrow::datatypes::Field::new(
+            "column_1",
+            DataType::Utf8,
+            false,
+        )]));
+
+        let data = RecordBatch::try_new(
+            schema.clone(),
+            vec![
+                Arc::new(arrow::array::StringArray::from(vec![
+                    Some("2a0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000"),
+                    Some("3a0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000800"),
+                ])),
+            ],
+        )?;
+
+        let table = crate::datasource::MemTable::try_new(schema, vec![vec![data]])?;
+
+        let ctx = SessionContext::new();
+        ctx.register_table("test", Arc::new(table))?;
+
+        let sql = r#"
+        SELECT 
+            COUNT(1)
+        FROM 
+            test
+        GROUP BY
+            column_1"#;
+
+        // let df = ctx.sql("SELECT * FROM test").await.unwrap();
+        // df.show_limit(10).await.unwrap();
+        // dbg!(df.schema());

Review Comment:
   Should we leave it in?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] yjshen commented on a diff in pull request #2968: fix `RowWriter` index out of bounds error

Posted by GitBox <gi...@apache.org>.
yjshen commented on code in PR #2968:
URL: https://github.com/apache/arrow-datafusion/pull/2968#discussion_r931755169


##########
datafusion/core/src/dataframe.rs:
##########
@@ -1266,4 +1266,45 @@ mod tests {
 
         Ok(())
     }
+
+    #[tokio::test]
+    async fn row_writer_resize_test() -> Result<()> {
+        let schema = Arc::new(Schema::new(vec![arrow::datatypes::Field::new(
+            "column_1",
+            DataType::Utf8,
+            false,
+        )]));
+
+        let data = RecordBatch::try_new(
+            schema.clone(),
+            vec![
+                Arc::new(arrow::array::StringArray::from(vec![
+                    Some("2a0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000"),
+                    Some("3a0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000800"),
+                ])),
+            ],
+        )?;
+
+        let table = crate::datasource::MemTable::try_new(schema, vec![vec![data]])?;
+
+        let ctx = SessionContext::new();
+        ctx.register_table("test", Arc::new(table))?;
+
+        let sql = r#"
+        SELECT 
+            COUNT(1)
+        FROM 
+            test
+        GROUP BY
+            column_1"#;
+
+        // let df = ctx.sql("SELECT * FROM test").await.unwrap();
+        // df.show_limit(10).await.unwrap();
+        // dbg!(df.schema());

Review Comment:
   We'd better remove this I suppose.



##########
datafusion/row/src/writer.rs:
##########
@@ -269,29 +269,29 @@ impl RowWriter {
 
 /// Stitch attributes of tuple in `batch` at `row_idx` and returns the tuple width
 pub fn write_row(
-    row: &mut RowWriter,
+    row_writer: &mut RowWriter,

Review Comment:
   This renaming is unrelated, I think?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] codecov-commenter commented on pull request #2968: fix writer index out of bounds

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on PR #2968:
URL: https://github.com/apache/arrow-datafusion/pull/2968#issuecomment-1196171172

   # [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/2968?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#2968](https://codecov.io/gh/apache/arrow-datafusion/pull/2968?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (908fb70) into [master](https://codecov.io/gh/apache/arrow-datafusion/commit/4005076d8e3e4fa07541da62f7a6c9c755029da1?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (4005076) will **increase** coverage by `0.01%`.
   > The diff coverage is `100.00%`.
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #2968      +/-   ##
   ==========================================
   + Coverage   85.71%   85.72%   +0.01%     
   ==========================================
     Files         280      280              
     Lines       51313    51330      +17     
   ==========================================
   + Hits        43983    44003      +20     
   + Misses       7330     7327       -3     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-datafusion/pull/2968?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [datafusion/core/src/dataframe.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2968/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9jb3JlL3NyYy9kYXRhZnJhbWUucnM=) | `89.05% <100.00%> (+0.76%)` | :arrow_up: |
   | [datafusion/row/src/writer.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2968/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9yb3cvc3JjL3dyaXRlci5ycw==) | `91.21% <100.00%> (+1.35%)` | :arrow_up: |
   | [datafusion/expr/src/logical\_plan/plan.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2968/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9leHByL3NyYy9sb2dpY2FsX3BsYW4vcGxhbi5ycw==) | `77.43% <0.00%> (-0.35%)` | :arrow_down: |
   | [datafusion/core/src/physical\_plan/metrics/value.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2968/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9jb3JlL3NyYy9waHlzaWNhbF9wbGFuL21ldHJpY3MvdmFsdWUucnM=) | `87.43% <0.00%> (+0.50%)` | :arrow_up: |
   | [datafusion/expr/src/window\_frame.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2968/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9leHByL3NyYy93aW5kb3dfZnJhbWUucnM=) | `93.27% <0.00%> (+0.84%)` | :arrow_up: |
   
   Help us with your feedback. Take ten seconds to tell us [how you rate us](https://about.codecov.io/nps?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] liukun4515 commented on a diff in pull request #2968: fix `RowWriter` index out of bounds error

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on code in PR #2968:
URL: https://github.com/apache/arrow-datafusion/pull/2968#discussion_r931727919


##########
datafusion/row/src/writer.rs:
##########
@@ -349,7 +349,7 @@ pub(crate) fn write_field_utf8(
     let from = from.as_any().downcast_ref::<StringArray>().unwrap();
     let s = from.value(row_idx);
     let new_width = to.current_width() + s.as_bytes().len();
-    if new_width > to.data.capacity() {
+    if new_width > to.data.len() {

Review Comment:
   In my opinion, this fix is not right.
   The `data` is used to buffer fixed size of layout and row data, if the new width or new length is less than the capacity of data, why need to expand the space.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] thomas-k-cameron commented on pull request #2968: fix writer index out of bounds

Posted by GitBox <gi...@apache.org>.
thomas-k-cameron commented on PR #2968:
URL: https://github.com/apache/arrow-datafusion/pull/2968#issuecomment-1196620916

   Hi,
   
   I'm the one reported the bug.
   
   If it is necessary for the resizing logic to decide on the basis of length, not the capacity, of the `data` field, then shouldn't we do the same for the `pub(crate) fn write_field_binary` as well?
   
   I wonder if `pub(crate) fn write_field_binary`  could create a similar bug.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] yjshen commented on a diff in pull request #2968: fix `RowWriter` index out of bounds error

Posted by GitBox <gi...@apache.org>.
yjshen commented on code in PR #2968:
URL: https://github.com/apache/arrow-datafusion/pull/2968#discussion_r932786946


##########
datafusion/row/src/writer.rs:
##########
@@ -349,9 +349,9 @@ pub(crate) fn write_field_utf8(
     let from = from.as_any().downcast_ref::<StringArray>().unwrap();
     let s = from.value(row_idx);
     let new_width = to.current_width() + s.as_bytes().len();
-    if new_width > to.data.capacity() {
+    if new_width > to.data.len() {
         // double the capacity to avoid repeated resize

Review Comment:
   outdated code comment



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] comphead commented on pull request #2968: fix `RowWriter` index out of bounds error

Posted by GitBox <gi...@apache.org>.
comphead commented on PR #2968:
URL: https://github.com/apache/arrow-datafusion/pull/2968#issuecomment-1197290876

   > Thanks @comphead
   > 
   > I also think @yjshen should review this if he has time before we merge this in
   
   Hi @alamb I'm waiting for @yjshen and after his review I can do the same for binary datatype


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] liukun4515 commented on a diff in pull request #2968: fix writer index out of bounds

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on code in PR #2968:
URL: https://github.com/apache/arrow-datafusion/pull/2968#discussion_r931142134


##########
datafusion/row/src/writer.rs:
##########
@@ -349,7 +349,7 @@ pub(crate) fn write_field_utf8(
     let from = from.as_any().downcast_ref::<StringArray>().unwrap();
     let s = from.value(row_idx);
     let new_width = to.current_width() + s.as_bytes().len();

Review Comment:
   Here, we can use `let new_width = to.current_width() + s.as_bytes().len() + 8`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] liukun4515 commented on pull request #2968: fix writer index out of bounds

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on PR #2968:
URL: https://github.com/apache/arrow-datafusion/pull/2968#issuecomment-1196798024

   @yjshen PTAL


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] liukun4515 commented on a diff in pull request #2968: fix writer index out of bounds

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on code in PR #2968:
URL: https://github.com/apache/arrow-datafusion/pull/2968#discussion_r931143930


##########
datafusion/row/src/writer.rs:
##########
@@ -349,7 +349,7 @@ pub(crate) fn write_field_utf8(
     let from = from.as_any().downcast_ref::<StringArray>().unwrap();
     let s = from.value(row_idx);
     let new_width = to.current_width() + s.as_bytes().len();

Review Comment:
   And in the `set_utf8` or `set_binary`, we can use 
   
   ```
           self.varlena_offset += (size + 8);
           self.varlena_width += (size +8);
   ```
   to update the offset and width.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] liukun4515 commented on pull request #2968: fix writer index out of bounds

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on PR #2968:
URL: https://github.com/apache/arrow-datafusion/pull/2968#issuecomment-1196840689

   @comphead  Thanks for your contribution.
   But I'm confuse about the method to fix this issue.
   
   For binary and utf8, we need to write the `offset and size` before the data, why not update the `new_width` using `to.current_width() + s.as_bytes().len() + 8`.
   The 8 is the size of `offset and size` which is encoded as a u64.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] liukun4515 commented on a diff in pull request #2968: fix `RowWriter` index out of bounds error

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on code in PR #2968:
URL: https://github.com/apache/arrow-datafusion/pull/2968#discussion_r931725250


##########
datafusion/row/src/writer.rs:
##########
@@ -349,7 +349,7 @@ pub(crate) fn write_field_utf8(
     let from = from.as_any().downcast_ref::<StringArray>().unwrap();
     let s = from.value(row_idx);
     let new_width = to.current_width() + s.as_bytes().len();

Review Comment:
   @comphead you can go through the code of `set_utf8`.
   ```
       fn set_utf8(&mut self, idx: usize, value: &str) {
           self.assert_index_valid(idx);
           let bytes = value.as_bytes();
           let size = bytes.len();
           // this line: write the offset and size to the row_buffer
           self.set_offset_size(idx, size as u32);
           let varlena_offset = self.varlena_offset;
           self.data[varlena_offset..varlena_offset + size].copy_from_slice(bytes);
           self.varlena_offset += size;
           self.varlena_width += size;
       }
   ```
   Not per field, just variable length data type like utf8 and binary.
   From the @yjshen  codebase, I think the layout of utf8 and binary should be
   ```
   ─ ─ ─ ─ ─ ─  ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ 
   │ offset and size│     utf8 data or binary data │
   ─ ─ ─ ─ ─ ─  ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ 
   ```
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] liukun4515 commented on pull request #2968: fix `RowWriter` index out of bounds error

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on PR #2968:
URL: https://github.com/apache/arrow-datafusion/pull/2968#issuecomment-1197648845

   > Thanks @comphead!
   > 
   > I propose we fix the followings:
   > 
   > * The `new_width > to.data.len()` change makes sense to me, but on the resize part:
   > 
   > ```rust
   >     if new_width > to.data.len() {
   >         to.data.resize(max(to.data.capacity(), new_width), 0);
   >     }
   > ```
   > 
   > I think we could just resize to capacity to avoid reallocating, instead of the previous double capacity way.
   > 
   > * `write_field_binary` should be adapted accordingly as well.
   > * The capacity() bug also exists in the `end_padding` logic, it should be:
   > 
   > ```rust
   >     /// End each row at 8-byte word boundary.
   >     pub(crate) fn end_padding(&mut self) {
   >         let payload_width = self.current_width();
   >         self.row_width = round_upto_power_of_2(payload_width, 8);
   >         if self.data.len() < self.row_width {
   >             self.data.resize(self.row_width, 0);
   >         }
   >     }
   > ```
   > 
   > @liukun4515 I cannot quite get your `+8` logic while calculating size in the comments above, I didn't get panic with your 110 empty string case either. Please correct me if I have missed something important.
   
   Sorry for the error comments with my wrong understanding for row layout.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org