You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/21 12:05:30 UTC

[GitHub] [arrow-rs] tustvold opened a new pull request #1219: Do not concatenate identical dictionaries

tustvold opened a new pull request #1219:
URL: https://github.com/apache/arrow-rs/pull/1219


   # Which issue does this PR close?
   
   Closes #504
   
   # Rationale for this change
    
   See ticket
   
   # What changes are included in this PR?
   
   Adds `ArrayData::ptr_eq` and then uses this to elide dictionary concatenation in MutableArrayData
   
   # Are there any user-facing changes?
   
   This changes the behaviour of concat, to no longer concatenate the dictionaries if they are the same. Not sure if this counts
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on a change in pull request #1219: Do not concatenate identical dictionaries

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #1219:
URL: https://github.com/apache/arrow-rs/pull/1219#discussion_r790282646



##########
File path: arrow/src/array/data.rs
##########
@@ -1155,6 +1155,41 @@ impl ArrayData {
             Ok(())
         })
     }
+
+    /// Returns true if this `ArrayData` is equal to `other`, using pointer comparisons

Review comment:
       👍 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] codecov-commenter commented on pull request #1219: Do not concatenate identical dictionaries

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on pull request #1219:
URL: https://github.com/apache/arrow-rs/pull/1219#issuecomment-1019555733


   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1219?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#1219](https://codecov.io/gh/apache/arrow-rs/pull/1219?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (c9417ed) into [master](https://codecov.io/gh/apache/arrow-rs/commit/fcd37ee1f8a8c3f5af099419aa9667bf0a515595?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (fcd37ee) will **increase** coverage by `0.00%`.
   > The diff coverage is `87.23%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-rs/pull/1219/graphs/tree.svg?width=650&height=150&src=pr&token=pq9V9qWZ1N&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-rs/pull/1219?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@           Coverage Diff           @@
   ##           master    #1219   +/-   ##
   =======================================
     Coverage   82.70%   82.71%           
   =======================================
     Files         175      175           
     Lines       51711    51787   +76     
   =======================================
   + Hits        42769    42837   +68     
   - Misses       8942     8950    +8     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/1219?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [arrow/src/array/transform/mod.rs](https://codecov.io/gh/apache/arrow-rs/pull/1219/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS9tb2QucnM=) | `85.41% <76.00%> (-0.15%)` | :arrow_down: |
   | [arrow/src/array/data.rs](https://codecov.io/gh/apache/arrow-rs/pull/1219/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2RhdGEucnM=) | `82.48% <87.23%> (+0.20%)` | :arrow_up: |
   | [arrow/src/compute/kernels/concat.rs](https://codecov.io/gh/apache/arrow-rs/pull/1219/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2NvbXB1dGUva2VybmVscy9jb25jYXQucnM=) | `100.00% <100.00%> (ø)` | |
   | [parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow-rs/pull/1219/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZW5jb2RpbmdzL2VuY29kaW5nLnJz) | `93.52% <0.00%> (-0.20%)` | :arrow_down: |
   | [arrow/src/csv/reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/1219/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2Nzdi9yZWFkZXIucnM=) | `88.12% <0.00%> (+0.01%)` | :arrow_up: |
   | [arrow/src/datatypes/field.rs](https://codecov.io/gh/apache/arrow-rs/pull/1219/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2RhdGF0eXBlcy9maWVsZC5ycw==) | `54.10% <0.00%> (+0.30%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1219?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1219?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [fcd37ee...c9417ed](https://codecov.io/gh/apache/arrow-rs/pull/1219?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on a change in pull request #1219: Do not concatenate identical dictionaries

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #1219:
URL: https://github.com/apache/arrow-rs/pull/1219#discussion_r790283770



##########
File path: arrow/src/compute/kernels/concat.rs
##########
@@ -525,4 +525,44 @@ mod tests {
 
         Ok(())
     }
+
+    #[test]
+    fn test_dictionary_concat_reuse() {
+        let array: DictionaryArray<Int8Type> =
+            vec!["a", "a", "b", "c"].into_iter().collect();
+        let array_copy: DictionaryArray<Int8Type> = array.data().clone().into();
+
+        // dictionary is "a", "b", "c"
+        assert_eq!(
+            array.values(),
+            &(Arc::new(StringArray::from(vec!["a", "b", "c"])) as ArrayRef)
+        );
+        assert_eq!(array.keys(), &Int8Array::from(vec![0, 0, 1, 2]));
+
+        // concatenate it with itself
+        let combined = concat(&[&array_copy as _, &array as _]).unwrap();
+
+        let combined = combined
+            .as_any()
+            .downcast_ref::<DictionaryArray<Int8Type>>()
+            .unwrap();
+
+        assert_eq!(
+            combined.values(),
+            &(Arc::new(StringArray::from(vec!["a", "b", "c"])) as ArrayRef),
+            "Actual: {:#?}",
+            combined
+        );
+
+        assert_eq!(
+            combined.keys(),
+            &Int8Array::from(vec![0, 0, 1, 2, 0, 0, 1, 2])
+        );
+
+        // Should have reused the dictionary
+        assert!(array.data().child_data()[0].ptr_eq(&combined.data().child_data()[0]));
+        assert!(
+            array_copy.data().child_data()[0].ptr_eq(&combined.data().child_data()[0])
+        );
+    }

Review comment:
       Can we also add a test of concatenating three dictionaries -- where 2 use the same dictionary and one is a different dictionary?

##########
File path: arrow/src/array/data.rs
##########
@@ -1155,6 +1155,41 @@ impl ArrayData {
             Ok(())
         })
     }
+
+    /// Returns true if this `ArrayData` is equal to `other`, using pointer comparisons
+    /// to determine buffer equality. This is cheaper than `PartialEq::eq` but may
+    /// return false negatives

Review comment:
       in what case would this return a false negative (to the "are these two pointers the same" question)?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb merged pull request #1219: Do not concatenate identical dictionaries

Posted by GitBox <gi...@apache.org>.
alamb merged pull request #1219:
URL: https://github.com/apache/arrow-rs/pull/1219


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on a change in pull request #1219: Do not concatenate identical dictionaries

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #1219:
URL: https://github.com/apache/arrow-rs/pull/1219#discussion_r790662987



##########
File path: arrow/src/array/data.rs
##########
@@ -1155,6 +1155,41 @@ impl ArrayData {
             Ok(())
         })
     }
+
+    /// Returns true if this `ArrayData` is equal to `other`, using pointer comparisons
+    /// to determine buffer equality. This is cheaper than `PartialEq::eq` but may
+    /// return false when the arrays are logically equal

Review comment:
       👍 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] tustvold commented on a change in pull request #1219: Do not concatenate identical dictionaries

Posted by GitBox <gi...@apache.org>.
tustvold commented on a change in pull request #1219:
URL: https://github.com/apache/arrow-rs/pull/1219#discussion_r790321744



##########
File path: arrow/src/array/data.rs
##########
@@ -1155,6 +1155,41 @@ impl ArrayData {
             Ok(())
         })
     }
+
+    /// Returns true if this `ArrayData` is equal to `other`, using pointer comparisons
+    /// to determine buffer equality. This is cheaper than `PartialEq::eq` but may
+    /// return false negatives

Review comment:
       I updated the wording

##########
File path: arrow/src/compute/kernels/concat.rs
##########
@@ -525,4 +525,44 @@ mod tests {
 
         Ok(())
     }
+
+    #[test]
+    fn test_dictionary_concat_reuse() {
+        let array: DictionaryArray<Int8Type> =
+            vec!["a", "a", "b", "c"].into_iter().collect();
+        let array_copy: DictionaryArray<Int8Type> = array.data().clone().into();
+
+        // dictionary is "a", "b", "c"
+        assert_eq!(
+            array.values(),
+            &(Arc::new(StringArray::from(vec!["a", "b", "c"])) as ArrayRef)
+        );
+        assert_eq!(array.keys(), &Int8Array::from(vec![0, 0, 1, 2]));
+
+        // concatenate it with itself
+        let combined = concat(&[&array_copy as _, &array as _]).unwrap();
+
+        let combined = combined
+            .as_any()
+            .downcast_ref::<DictionaryArray<Int8Type>>()
+            .unwrap();
+
+        assert_eq!(
+            combined.values(),
+            &(Arc::new(StringArray::from(vec!["a", "b", "c"])) as ArrayRef),
+            "Actual: {:#?}",
+            combined
+        );
+
+        assert_eq!(
+            combined.keys(),
+            &Int8Array::from(vec![0, 0, 1, 2, 0, 0, 1, 2])
+        );
+
+        // Should have reused the dictionary
+        assert!(array.data().child_data()[0].ptr_eq(&combined.data().child_data()[0]));
+        assert!(
+            array_copy.data().child_data()[0].ptr_eq(&combined.data().child_data()[0])
+        );
+    }

Review comment:
       Added




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb merged pull request #1219: Do not concatenate identical dictionaries

Posted by GitBox <gi...@apache.org>.
alamb merged pull request #1219:
URL: https://github.com/apache/arrow-rs/pull/1219


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] jhorstmann commented on pull request #1219: Do not concatenate identical dictionaries

Posted by GitBox <gi...@apache.org>.
jhorstmann commented on pull request #1219:
URL: https://github.com/apache/arrow-rs/pull/1219#issuecomment-1019447411


   Looks good to me. I initially thought `ptr_eq` for dictionary values could get away with only comparing the data buffer, but in theory the same buffer could be combined with different offsets or validity. So really need to compare all buffers and child data as int this PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] tustvold commented on a change in pull request #1219: Do not concatenate identical dictionaries

Posted by GitBox <gi...@apache.org>.
tustvold commented on a change in pull request #1219:
URL: https://github.com/apache/arrow-rs/pull/1219#discussion_r790284597



##########
File path: arrow/src/array/data.rs
##########
@@ -1155,6 +1155,41 @@ impl ArrayData {
             Ok(())
         })
     }
+
+    /// Returns true if this `ArrayData` is equal to `other`, using pointer comparisons
+    /// to determine buffer equality. This is cheaper than `PartialEq::eq` but may
+    /// return false negatives

Review comment:
       It may say two things are not equal when logically they are?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org