You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/10/28 16:38:20 UTC

[GitHub] [arrow] nevi-me opened a new pull request #8548: ARROW-8421: [Rust] [Parquet] Merge in-progress writer & reader

nevi-me opened a new pull request #8548:
URL: https://github.com/apache/arrow/pull/8548


   This merges the in-progress Arrow <> Parquet IO branch into master, so we can continue with development here.
   
   Please do not squash, as we'd like to preserve the individual commits as they're already squashed into their respective JIRAs.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #8548: ARROW-8421: [Rust] [Parquet] Merge in-progress writer & reader

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #8548:
URL: https://github.com/apache/arrow/pull/8548#issuecomment-718383631


   That's okay, I understand.
   I suppose it's becoming less of an issue as we seem to be growing commiters and reviewers. We'll also be doing the rest of the parquet work on the main branch, so this won't happen again.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me merged pull request #8548: ARROW-8421: [Rust] [Parquet] Merge in-progress writer & reader

Posted by GitBox <gi...@apache.org>.
nevi-me merged pull request #8548:
URL: https://github.com/apache/arrow/pull/8548


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #8548: ARROW-8421: [Rust] [Parquet] Merge in-progress writer & reader

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8548:
URL: https://github.com/apache/arrow/pull/8548#issuecomment-718373762


   > > There was probably some mis-understanding during the call yesterday, but I was expecting this to be a pull request and approved by another committer _before being merged_. This merge IMO did not follow our review practice and did not gave time to other committers to go through it.
   > 
   > Apologies Jorge,
   > 
   > The intention was to bring in the changes into master, on the rebuttable presumption that they've been reviewed already.
   > We don't have a lot of review bandwidth, so I wasn't expecting anyone to review the commits again, as we also wouldn't easily change them (maybe except the last commit with a `git commit --amend`.
   > 
   > After creating the PR, I noticed that the other merge options have been disabled in the UI (we can only squash now), so the aim of being able to use the UI to merge this was defeated; and so I rebased and pushed from my end.
   > 
   > I've gone through your questions around arraydata equality, so I'll have a look at what you've done on #8541, and I can prioritise any follow-up work needed to cover what might still be outstanding.
   
   No worries, was a mis-communication, and I am sorry for not being very explicit about it: I was aware of the writer branch, but since it was moving fast, I deemed it as "WIP" and waited for a PR to master before going through it. Once I saw the PR from @carols10cents to master, I understood it to be the "ready to review" version of the branch, and reviewed it ASAP, given the clock ticking on parquet stuff and the call.
   
   Only afterwards I realize that that PR was actually a different thing, and that there has been a merge of the main changes already.
   
   I would consider beneficial to have a review on any push to master from anyone. If anything, it gives time to others to comment on the PR and get accustomed to the changes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #8548: ARROW-8421: [Rust] [Parquet] Merge in-progress writer & reader

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #8548:
URL: https://github.com/apache/arrow/pull/8548#discussion_r513946133



##########
File path: rust/arrow/src/array/data.rs
##########
@@ -209,6 +209,61 @@ impl ArrayData {
     }
 }
 
+impl PartialEq for ArrayData {
+    fn eq(&self, other: &Self) -> bool {
+        assert_eq!(
+            self.data_type(),
+            other.data_type(),
+            "Data types not the same"
+        );
+        assert_eq!(self.len(), other.len(), "Lengths not the same");
+        // TODO: when adding tests for this, test that we can compare with arrays that have offsets
+        assert_eq!(self.offset(), other.offset(), "Offsets not the same");
+        assert_eq!(self.null_count(), other.null_count());
+        // compare buffers excluding padding
+        let self_buffers = self.buffers();
+        let other_buffers = other.buffers();
+        assert_eq!(self_buffers.len(), other_buffers.len());
+        self_buffers.iter().zip(other_buffers).for_each(|(s, o)| {
+            compare_buffer_regions(
+                s,
+                self.offset(), // TODO mul by data length
+                o,
+                other.offset(), // TODO mul by data len
+            );
+        });
+        // assert_eq!(self.buffers(), other.buffers());
+
+        assert_eq!(self.child_data(), other.child_data());
+        // null arrays can skip the null bitmap, thus only compare if there are no nulls
+        if self.null_count() != 0 || other.null_count() != 0 {
+            compare_buffer_regions(
+                self.null_buffer().unwrap(),
+                self.offset(),
+                other.null_buffer().unwrap(),
+                other.offset(),
+            )
+        }
+        true
+    }
+}
+
+/// A helper to compare buffer regions of 2 buffers.
+/// Compares the length of the shorter buffer.
+fn compare_buffer_regions(
+    left: &Buffer,
+    left_offset: usize,
+    right: &Buffer,
+    right_offset: usize,
+) {
+    // for convenience, we assume that the buffer lengths are only unequal if one has padding,
+    // so we take the shorter length so we can discard the padding from the longer length
+    let shorter_len = left.len().min(right.len());
+    let s_sliced = left.bit_slice(left_offset, shorter_len);
+    let o_sliced = right.bit_slice(right_offset, shorter_len);
+    assert_eq!(s_sliced, o_sliced);
+}
+

Review comment:
       This is not correct, see [this comment](https://github.com/apache/arrow/pull/8546#discussion_r513933731)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8548: ARROW-8421: [Rust] [Parquet] Merge in-progress writer & reader

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8548:
URL: https://github.com/apache/arrow/pull/8548#issuecomment-718069239


   https://issues.apache.org/jira/browse/ARROW-8421


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #8548: ARROW-8421: [Rust] [Parquet] Merge in-progress writer & reader

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8548:
URL: https://github.com/apache/arrow/pull/8548#issuecomment-718347031


   There was probably some mis-understanding during the call yesterday, but I was expecting this to be a pull request and approved by another committer _before being merged_. This merge IMO did not follow our review practice and did not gave time to other committers to go through it.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #8548: ARROW-8421: [Rust] [Parquet] Merge in-progress writer & reader

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #8548:
URL: https://github.com/apache/arrow/pull/8548#issuecomment-718370707


   > There was probably some mis-understanding during the call yesterday, but I was expecting this to be a pull request and approved by another committer _before being merged_. This merge IMO did not follow our review practice and did not gave time to other committers to go through it.
   
   Apologies Jorge,
   
   The intention was to bring in the changes into master, on the rebuttable presumption that they've been reviewed already.
   We don't have a lot of review bandwidth, so I wasn't expecting anyone to review the commits again, as we also wouldn't easily change them (maybe except the last commit with a `git commit --amend`.
   
   After creating the PR, I noticed that the other merge options have been disabled in the UI (we can only squash now), so the aim of being able to use the UI to merge this was defeated; and so I rebased and pushed from my end.
   
   I've gone through your questions around arraydata equality, so I'll have a look at what you've done on #8541, and I can prioritise any follow-up work needed to cover what might still be outstanding.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #8548: ARROW-8421: [Rust] [Parquet] Merge in-progress writer & reader

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #8548:
URL: https://github.com/apache/arrow/pull/8548#discussion_r513946419



##########
File path: rust/arrow/src/array/data.rs
##########
@@ -209,6 +209,61 @@ impl ArrayData {
     }
 }
 
+impl PartialEq for ArrayData {
+    fn eq(&self, other: &Self) -> bool {
+        assert_eq!(
+            self.data_type(),
+            other.data_type(),
+            "Data types not the same"
+        );
+        assert_eq!(self.len(), other.len(), "Lengths not the same");
+        // TODO: when adding tests for this, test that we can compare with arrays that have offsets
+        assert_eq!(self.offset(), other.offset(), "Offsets not the same");
+        assert_eq!(self.null_count(), other.null_count());
+        // compare buffers excluding padding
+        let self_buffers = self.buffers();
+        let other_buffers = other.buffers();
+        assert_eq!(self_buffers.len(), other_buffers.len());
+        self_buffers.iter().zip(other_buffers).for_each(|(s, o)| {
+            compare_buffer_regions(
+                s,
+                self.offset(), // TODO mul by data length
+                o,
+                other.offset(), // TODO mul by data len
+            );
+        });
+        // assert_eq!(self.buffers(), other.buffers());
+
+        assert_eq!(self.child_data(), other.child_data());
+        // null arrays can skip the null bitmap, thus only compare if there are no nulls
+        if self.null_count() != 0 || other.null_count() != 0 {
+            compare_buffer_regions(
+                self.null_buffer().unwrap(),
+                self.offset(),
+                other.null_buffer().unwrap(),
+                other.offset(),
+            )
+        }
+        true
+    }
+}
+
+/// A helper to compare buffer regions of 2 buffers.
+/// Compares the length of the shorter buffer.
+fn compare_buffer_regions(
+    left: &Buffer,
+    left_offset: usize,
+    right: &Buffer,
+    right_offset: usize,
+) {
+    // for convenience, we assume that the buffer lengths are only unequal if one has padding,
+    // so we take the shorter length so we can discard the padding from the longer length
+    let shorter_len = left.len().min(right.len());
+    let s_sliced = left.bit_slice(left_offset, shorter_len);
+    let o_sliced = right.bit_slice(right_offset, shorter_len);
+    assert_eq!(s_sliced, o_sliced);
+}
+

Review comment:
       IMO this is not correct, see [this comment](https://github.com/apache/arrow/pull/8546#discussion_r513933731)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #8548: ARROW-8421: [Rust] [Parquet] Merge in-progress writer & reader

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8548:
URL: https://github.com/apache/arrow/pull/8548#issuecomment-718385685


   Anyways, I do not want this to downplay any of the work that mostly you and @carols10cents have been taking: the changes look really great. Thanks a lot for taking this on. It is a monumental effort and the outcome is great. 💯  Overall, this brings the rust implementation to its full potential of read and write to parquet.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org