You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/06/28 16:48:46 UTC

[GitHub] [arrow-rs] alamb opened a new issue #504: Do not copy dictionary values when they are the same in `concat`

alamb opened a new issue #504:
URL: https://github.com/apache/arrow-rs/issues/504


   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   For dictionary arrays, the code for the `concat` simply copies the data from multiple places into a single contigous block: https://github.com/apache/arrow-rs/blob/master/arrow/src/compute/kernels/concat.rs#L55
   
   For dictionary arrays, this results in potentially duplicating the dictionary values. For example:
   
   Array1:  "A, C, A, B"
   Dictionary: `{A, B, C}`
   Values: `{0, 2, 0, 1}`
   
   Array2:  "C, C, A"
   Dictionary: `{A, B, C}`
   Values: `{2, 2, 0}`
   
   Note the dictionary is the same  between Array 1 and Array 2. Concatenating them together results in
   Array3:  "A, C, A, B, C, C, A"
   Dictionary: `{A, B, C, A, B, C}`
   Values: `{0, 2, 0, 1, 6, 6, 4}`
   
   In IOx it turns out that we concatenate rather large numbers of such arrays leading to *very* large duplicated dictionaries (read all about the gory details here): https://github.com/influxdata/influxdb_iox/issues/1822
   
   **Describe the solution you'd like**
   When the dictionary has the exact same contents (e.g. same pointer) there is no need to copy the dictionary
   
   
   Here is a reproducer showing the actual issue:
   ```rust
       fn test_string_dictionary_repeated_dictionary() {
           let array : DictionaryArray<Int8Type> = vec!["a", "a", "b", "c"].into_iter().collect();
           let array_copy: DictionaryArray<Int8Type> = array.data().clone().into();
   
           // dictionary is "a", "b", "c"
           assert_eq!(array.values(), &(Arc::new(StringArray::from(vec!["a", "b", "c"])) as ArrayRef));
           assert_eq!(array.keys(), &Int8Array::from(vec![0, 0, 1, 2]));
   
           // concatenate it with itself
           let combined = concat(&[&array_copy as _, &array as _])
               .unwrap();
           let combined = combined
               .as_any()
               .downcast_ref::<DictionaryArray<Int8Type>>()
               .unwrap();
   
           // Since dictionares are the same, no need a copy -- it could be the same "a", "b", "c"
           assert_eq!(combined.values(), &(Arc::new(StringArray::from(vec!["a", "b", "c"])) as ArrayRef),
                      "Actual: {:#?}", combined);
           assert_eq!(combined.keys(), &Int8Array::from(vec![0, 0, 1, 2, 0, 0, 1, 2]));
       }
   ```
   
   The actual output shows that the resulting dictionary is `"a", "b", "c", "a", "b", "c"` when it was literally the same pointer
   
   ```
   ---- compute::kernels::concat::tests::test_string_dictionary_repeated_dictionary stdout ----
   thread 'compute::kernels::concat::tests::test_string_dictionary_repeated_dictionary' panicked at 'assertion failed: `(left == right)`
     left: `StringArray
   [
     "a",
     "b",
     "c",
     "a",
     "b",
     "c",
   ]`,
    right: `StringArray
   [
     "a",
     "b",
     "c",
   ]`: Actual: DictionaryArray {keys: PrimitiveArray<Int8>
   [
     0,
     0,
     1,
     2,
     3,
     3,
     4,
     5,
   ] values: StringArray
   [
     "a",
     "b",
     "c",
     "a",
     "b",
     "c",
   ]}
   ', arrow/src/compute/kernels/concat.rs:528:9
   note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
   
   ```
   
   **Describe alternatives you've considered**
   There are times when the dictionary may not be the exact same pointer, but contain mostly/all the same values which would also be useful to compact, but I will file that under a separate ticket as it will likely involve a tradeoff between CPU and memory
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb closed issue #504: Do not copy dictionary values when they are the same in `concat`

Posted by GitBox <gi...@apache.org>.
alamb closed issue #504:
URL: https://github.com/apache/arrow-rs/issues/504


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] jhorstmann commented on issue #504: Do not copy dictionary values when they are the same in `concat`

Posted by GitBox <gi...@apache.org>.
jhorstmann commented on issue #504:
URL: https://github.com/apache/arrow-rs/issues/504#issuecomment-869882802


   FYI, in our product we use our own concat kernel for dictionary arrays because our TableProvider ensures that dictionaries across chunks and batches are "consistent". Dictionaries across chunks are not necessarily equal, but our loader makes sure that all values from the previous dictionary occur with the same key in later batches. We will likely make that guarantee stronger soon so that it will actually be the same dictionary, the proposed change would then allow us to use the standard kernel. I think when we switched to our own kernel, dictionaries were not supported at all by concat.
   
   I never got around to proposing this, but my initial idea for a solution was to have such a guarantee somehow on metadata level, as  it would also allow more optimizations for example in the group by operation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] jhorstmann commented on issue #504: Do not copy dictionary values when they are the same in `concat`

Posted by GitBox <gi...@apache.org>.
jhorstmann commented on issue #504:
URL: https://github.com/apache/arrow-rs/issues/504#issuecomment-869882802


   FYI, in our product we use our own concat kernel for dictionary arrays because our TableProvider ensures that dictionaries across chunks and batches are "consistent". Dictionaries across chunks are not necessarily equal, but our loader makes sure that all values from the previous dictionary occur with the same key in later batches. We will likely make that guarantee stronger soon so that it will actually be the same dictionary, the proposed change would then allow us to use the standard kernel. I think when we switched to our own kernel, dictionaries were not supported at all by concat.
   
   I never got around to proposing this, but my initial idea for a solution was to have such a guarantee somehow on metadata level, as  it would also allow more optimizations for example in the group by operation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb closed issue #504: Do not copy dictionary values when they are the same in `concat`

Posted by GitBox <gi...@apache.org>.
alamb closed issue #504:
URL: https://github.com/apache/arrow-rs/issues/504


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on issue #504: Do not copy dictionary values when they are the same in `concat`

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #504:
URL: https://github.com/apache/arrow-rs/issues/504#issuecomment-1019496497


   FYI @tustvold  has a proposed implementation in https://github.com/apache/arrow-rs/pull/1219


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org