You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/21 12:05:30 UTC
[GitHub] [arrow-rs] tustvold opened a new pull request #1219: Do not concatenate identical dictionaries
tustvold opened a new pull request #1219:
URL: https://github.com/apache/arrow-rs/pull/1219
# Which issue does this PR close?
Closes #504
# Rationale for this change
See ticket
# What changes are included in this PR?
Adds `ArrayData::ptr_eq` and then uses this to elide dictionary concatenation in MutableArrayData
# Are there any user-facing changes?
This changes the behaviour of concat, to no longer concatenate the dictionaries if they are the same. Not sure if this counts
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] alamb commented on a change in pull request #1219: Do not concatenate identical dictionaries
Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #1219:
URL: https://github.com/apache/arrow-rs/pull/1219#discussion_r790282646
##########
File path: arrow/src/array/data.rs
##########
@@ -1155,6 +1155,41 @@ impl ArrayData {
Ok(())
})
}
+
+ /// Returns true if this `ArrayData` is equal to `other`, using pointer comparisons
Review comment:
👍
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] codecov-commenter commented on pull request #1219: Do not concatenate identical dictionaries
Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on pull request #1219:
URL: https://github.com/apache/arrow-rs/pull/1219#issuecomment-1019555733
# [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1219?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
> Merging [#1219](https://codecov.io/gh/apache/arrow-rs/pull/1219?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (c9417ed) into [master](https://codecov.io/gh/apache/arrow-rs/commit/fcd37ee1f8a8c3f5af099419aa9667bf0a515595?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (fcd37ee) will **increase** coverage by `0.00%`.
> The diff coverage is `87.23%`.
[![Impacted file tree graph](https://codecov.io/gh/apache/arrow-rs/pull/1219/graphs/tree.svg?width=650&height=150&src=pr&token=pq9V9qWZ1N&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-rs/pull/1219?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
```diff
@@ Coverage Diff @@
## master #1219 +/- ##
=======================================
Coverage 82.70% 82.71%
=======================================
Files 175 175
Lines 51711 51787 +76
=======================================
+ Hits 42769 42837 +68
- Misses 8942 8950 +8
```
| [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/1219?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
|---|---|---|
| [arrow/src/array/transform/mod.rs](https://codecov.io/gh/apache/arrow-rs/pull/1219/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS9tb2QucnM=) | `85.41% <76.00%> (-0.15%)` | :arrow_down: |
| [arrow/src/array/data.rs](https://codecov.io/gh/apache/arrow-rs/pull/1219/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2RhdGEucnM=) | `82.48% <87.23%> (+0.20%)` | :arrow_up: |
| [arrow/src/compute/kernels/concat.rs](https://codecov.io/gh/apache/arrow-rs/pull/1219/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2NvbXB1dGUva2VybmVscy9jb25jYXQucnM=) | `100.00% <100.00%> (ø)` | |
| [parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow-rs/pull/1219/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZW5jb2RpbmdzL2VuY29kaW5nLnJz) | `93.52% <0.00%> (-0.20%)` | :arrow_down: |
| [arrow/src/csv/reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/1219/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2Nzdi9yZWFkZXIucnM=) | `88.12% <0.00%> (+0.01%)` | :arrow_up: |
| [arrow/src/datatypes/field.rs](https://codecov.io/gh/apache/arrow-rs/pull/1219/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2RhdGF0eXBlcy9maWVsZC5ycw==) | `54.10% <0.00%> (+0.30%)` | :arrow_up: |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1219?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1219?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [fcd37ee...c9417ed](https://codecov.io/gh/apache/arrow-rs/pull/1219?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] alamb commented on a change in pull request #1219: Do not concatenate identical dictionaries
Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #1219:
URL: https://github.com/apache/arrow-rs/pull/1219#discussion_r790283770
##########
File path: arrow/src/compute/kernels/concat.rs
##########
@@ -525,4 +525,44 @@ mod tests {
Ok(())
}
+
+ #[test]
+ fn test_dictionary_concat_reuse() {
+ let array: DictionaryArray<Int8Type> =
+ vec!["a", "a", "b", "c"].into_iter().collect();
+ let array_copy: DictionaryArray<Int8Type> = array.data().clone().into();
+
+ // dictionary is "a", "b", "c"
+ assert_eq!(
+ array.values(),
+ &(Arc::new(StringArray::from(vec!["a", "b", "c"])) as ArrayRef)
+ );
+ assert_eq!(array.keys(), &Int8Array::from(vec![0, 0, 1, 2]));
+
+ // concatenate it with itself
+ let combined = concat(&[&array_copy as _, &array as _]).unwrap();
+
+ let combined = combined
+ .as_any()
+ .downcast_ref::<DictionaryArray<Int8Type>>()
+ .unwrap();
+
+ assert_eq!(
+ combined.values(),
+ &(Arc::new(StringArray::from(vec!["a", "b", "c"])) as ArrayRef),
+ "Actual: {:#?}",
+ combined
+ );
+
+ assert_eq!(
+ combined.keys(),
+ &Int8Array::from(vec![0, 0, 1, 2, 0, 0, 1, 2])
+ );
+
+ // Should have reused the dictionary
+ assert!(array.data().child_data()[0].ptr_eq(&combined.data().child_data()[0]));
+ assert!(
+ array_copy.data().child_data()[0].ptr_eq(&combined.data().child_data()[0])
+ );
+ }
Review comment:
Can we also add a test of concatenating three dictionaries -- where 2 use the same dictionary and one is a different dictionary?
##########
File path: arrow/src/array/data.rs
##########
@@ -1155,6 +1155,41 @@ impl ArrayData {
Ok(())
})
}
+
+ /// Returns true if this `ArrayData` is equal to `other`, using pointer comparisons
+ /// to determine buffer equality. This is cheaper than `PartialEq::eq` but may
+ /// return false negatives
Review comment:
in what case would this return a false negative (to the "are these two pointers the same" question)?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] alamb merged pull request #1219: Do not concatenate identical dictionaries
Posted by GitBox <gi...@apache.org>.
alamb merged pull request #1219:
URL: https://github.com/apache/arrow-rs/pull/1219
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] alamb commented on a change in pull request #1219: Do not concatenate identical dictionaries
Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #1219:
URL: https://github.com/apache/arrow-rs/pull/1219#discussion_r790662987
##########
File path: arrow/src/array/data.rs
##########
@@ -1155,6 +1155,41 @@ impl ArrayData {
Ok(())
})
}
+
+ /// Returns true if this `ArrayData` is equal to `other`, using pointer comparisons
+ /// to determine buffer equality. This is cheaper than `PartialEq::eq` but may
+ /// return false when the arrays are logically equal
Review comment:
👍
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] tustvold commented on a change in pull request #1219: Do not concatenate identical dictionaries
Posted by GitBox <gi...@apache.org>.
tustvold commented on a change in pull request #1219:
URL: https://github.com/apache/arrow-rs/pull/1219#discussion_r790321744
##########
File path: arrow/src/array/data.rs
##########
@@ -1155,6 +1155,41 @@ impl ArrayData {
Ok(())
})
}
+
+ /// Returns true if this `ArrayData` is equal to `other`, using pointer comparisons
+ /// to determine buffer equality. This is cheaper than `PartialEq::eq` but may
+ /// return false negatives
Review comment:
I updated the wording
##########
File path: arrow/src/compute/kernels/concat.rs
##########
@@ -525,4 +525,44 @@ mod tests {
Ok(())
}
+
+ #[test]
+ fn test_dictionary_concat_reuse() {
+ let array: DictionaryArray<Int8Type> =
+ vec!["a", "a", "b", "c"].into_iter().collect();
+ let array_copy: DictionaryArray<Int8Type> = array.data().clone().into();
+
+ // dictionary is "a", "b", "c"
+ assert_eq!(
+ array.values(),
+ &(Arc::new(StringArray::from(vec!["a", "b", "c"])) as ArrayRef)
+ );
+ assert_eq!(array.keys(), &Int8Array::from(vec![0, 0, 1, 2]));
+
+ // concatenate it with itself
+ let combined = concat(&[&array_copy as _, &array as _]).unwrap();
+
+ let combined = combined
+ .as_any()
+ .downcast_ref::<DictionaryArray<Int8Type>>()
+ .unwrap();
+
+ assert_eq!(
+ combined.values(),
+ &(Arc::new(StringArray::from(vec!["a", "b", "c"])) as ArrayRef),
+ "Actual: {:#?}",
+ combined
+ );
+
+ assert_eq!(
+ combined.keys(),
+ &Int8Array::from(vec![0, 0, 1, 2, 0, 0, 1, 2])
+ );
+
+ // Should have reused the dictionary
+ assert!(array.data().child_data()[0].ptr_eq(&combined.data().child_data()[0]));
+ assert!(
+ array_copy.data().child_data()[0].ptr_eq(&combined.data().child_data()[0])
+ );
+ }
Review comment:
Added
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] alamb merged pull request #1219: Do not concatenate identical dictionaries
Posted by GitBox <gi...@apache.org>.
alamb merged pull request #1219:
URL: https://github.com/apache/arrow-rs/pull/1219
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] jhorstmann commented on pull request #1219: Do not concatenate identical dictionaries
Posted by GitBox <gi...@apache.org>.
jhorstmann commented on pull request #1219:
URL: https://github.com/apache/arrow-rs/pull/1219#issuecomment-1019447411
Looks good to me. I initially thought `ptr_eq` for dictionary values could get away with only comparing the data buffer, but in theory the same buffer could be combined with different offsets or validity. So really need to compare all buffers and child data as int this PR.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] tustvold commented on a change in pull request #1219: Do not concatenate identical dictionaries
Posted by GitBox <gi...@apache.org>.
tustvold commented on a change in pull request #1219:
URL: https://github.com/apache/arrow-rs/pull/1219#discussion_r790284597
##########
File path: arrow/src/array/data.rs
##########
@@ -1155,6 +1155,41 @@ impl ArrayData {
Ok(())
})
}
+
+ /// Returns true if this `ArrayData` is equal to `other`, using pointer comparisons
+ /// to determine buffer equality. This is cheaper than `PartialEq::eq` but may
+ /// return false negatives
Review comment:
It may say two things are not equal when logically they are?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org