You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/15 01:31:40 UTC

[GitHub] [arrow] westonpace opened a new pull request #12160: ARROW-13467: [C++] Support delta dictionaries in the IPC file format

westonpace opened a new pull request #12160:
URL: https://github.com/apache/arrow/pull/12160


   [] Add test cases


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on a change in pull request #12160: ARROW-13467: [C++] Support delta dictionaries in the IPC file format

Posted by GitBox <gi...@apache.org>.
rok commented on a change in pull request #12160:
URL: https://github.com/apache/arrow/pull/12160#discussion_r789073033



##########
File path: python/pyarrow/tests/test_ipc.py
##########
@@ -136,6 +140,22 @@ def stream_fixture():
     return StreamFormatFixture()
 
 
+@pytest.fixture(params=[
+    FileFormatFixture(),
+    StreamFormatFixture()
+    # pytest.param(
+    #     pytest.lazy_fixture('file_fixture'),
+    #     id='File Format'
+    # ),
+    # pytest.param(
+    #     pytest.lazy_fixture('stream_fixture'),
+    #     id='Stream Format'
+    # )

Review comment:
       This is probably redundant?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on pull request #12160: ARROW-13467: [C++] Support delta dictionaries in the IPC file format

Posted by GitBox <gi...@apache.org>.
rok commented on pull request #12160:
URL: https://github.com/apache/arrow/pull/12160#issuecomment-1022366001


   @westonpace can this be merged?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #12160: ARROW-13467: [C++] Support delta dictionaries in the IPC file format

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #12160:
URL: https://github.com/apache/arrow/pull/12160#discussion_r793768579



##########
File path: python/pyarrow/ipc.pxi
##########
@@ -124,6 +124,15 @@ cdef class IpcWriteOptions(_Weakrefable):
     emit_dictionary_deltas : bool
         Whether to emit dictionary deltas.  Default is false for maximum
         stream compatibility.
+    unify_dictionaries : bool
+        If true then calls to write_table will attempt to unify dictionaries
+        across all batches in the table.  This can help avoid the need for
+        replacement dictionaries (which the file format does not support)
+        but requires computing the unified dictionary and then remapping
+        the indices arrays.
+
+        This property is ignored when writing to the streaming format as
+        the streaming format can support replacement dictionaries.

Review comment:
       ```suggestion
           This parameter is ignored when writing to the IPC stream format as
           the IPC stream format can support replacement dictionaries.
   ```

##########
File path: cpp/src/arrow/ipc/options.h
##########
@@ -87,9 +87,9 @@ struct ARROW_EXPORT IpcWriteOptions {
 
   /// \brief Whether to unify dictionaries for the IPC file format
   ///
-  /// The IPC file format doesn't support dictionary replacements or deltas.
+  /// The IPC file format doesn't support dictionary replacements.
   /// Therefore, chunks of a column with a dictionary type must have the same
-  /// dictionary in each record batch.
+  /// dictionary in each record batch (or an extended dictionary + delta).
   ///
   /// If this option is true, RecordBatchWriter::WriteTable will attempt
   /// to unify dictionaries across each table column.  If this option is

Review comment:
       Does the sentence below need to be updated? ("unequal dictionaries")




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] westonpace commented on a change in pull request #12160: ARROW-13467: [C++] Support delta dictionaries in the IPC file format

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #12160:
URL: https://github.com/apache/arrow/pull/12160#discussion_r794182177



##########
File path: cpp/src/arrow/ipc/options.h
##########
@@ -87,9 +87,9 @@ struct ARROW_EXPORT IpcWriteOptions {
 
   /// \brief Whether to unify dictionaries for the IPC file format
   ///
-  /// The IPC file format doesn't support dictionary replacements or deltas.
+  /// The IPC file format doesn't support dictionary replacements.
   /// Therefore, chunks of a column with a dictionary type must have the same
-  /// dictionary in each record batch.
+  /// dictionary in each record batch (or an extended dictionary + delta).
   ///
   /// If this option is true, RecordBatchWriter::WriteTable will attempt
   /// to unify dictionaries across each table column.  If this option is

Review comment:
       I changed from unequal to incompatible.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #12160: ARROW-13467: [C++] Support delta dictionaries in the IPC file format

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12160:
URL: https://github.com/apache/arrow/pull/12160#issuecomment-1024007598


   Benchmark runs are scheduled for baseline = 8905de9b3db1667eff7678a3cad2de0b64ff46bf and contender = 3663971f17cc5cc32bb389ad959eb5b30dacb1e1. 3663971f17cc5cc32bb389ad959eb5b30dacb1e1 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/7cc234dc3b814d458aea17d6a3f5d5dd...69f8be77e75049108670eb249c832996/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/16eafaae15dd4f8097cf3b98e9ac15d5...9752d6b458324331a1ad492b012253a3/)
   [Finished :arrow_down:0.22% :arrow_up:0.17%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/de59571d9af8427dbbaec1e2c4f97cc7...b38b93db8b7449d78f31a382fe964945/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on pull request #12160: ARROW-13467: [C++] Support delta dictionaries in the IPC file format

Posted by GitBox <gi...@apache.org>.
lidavidm commented on pull request #12160:
URL: https://github.com/apache/arrow/pull/12160#issuecomment-1020116938


   (Also, sorry, do you mind updating the PR description for the eventual commit message?)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] westonpace commented on pull request #12160: ARROW-13467: [C++] Support delta dictionaries in the IPC file format

Posted by GitBox <gi...@apache.org>.
westonpace commented on pull request #12160:
URL: https://github.com/apache/arrow/pull/12160#issuecomment-1023858993


   Thanks for the reminder @rok and the suggestions @pitrou .  I'll merge on green and tackle ARROW-15425 tomorrow.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #12160: ARROW-13467: [C++] Support delta dictionaries in the IPC file format

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #12160:
URL: https://github.com/apache/arrow/pull/12160#issuecomment-1013575092






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on pull request #12160: ARROW-13467: [C++] Support delta dictionaries in the IPC file format

Posted by GitBox <gi...@apache.org>.
lidavidm commented on pull request #12160:
URL: https://github.com/apache/arrow/pull/12160#issuecomment-1020116300


   It doesn't seem so, filed ARROW-15425.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] westonpace closed pull request #12160: ARROW-13467: [C++] Support delta dictionaries in the IPC file format

Posted by GitBox <gi...@apache.org>.
westonpace closed pull request #12160:
URL: https://github.com/apache/arrow/pull/12160


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot commented on pull request #12160: ARROW-13467: [C++] Support delta dictionaries in the IPC file format

Posted by GitBox <gi...@apache.org>.
ursabot commented on pull request #12160:
URL: https://github.com/apache/arrow/pull/12160#issuecomment-1024007598


   Benchmark runs are scheduled for baseline = 8905de9b3db1667eff7678a3cad2de0b64ff46bf and contender = 3663971f17cc5cc32bb389ad959eb5b30dacb1e1. 3663971f17cc5cc32bb389ad959eb5b30dacb1e1 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Scheduled] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/7cc234dc3b814d458aea17d6a3f5d5dd...69f8be77e75049108670eb249c832996/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/16eafaae15dd4f8097cf3b98e9ac15d5...9752d6b458324331a1ad492b012253a3/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/de59571d9af8427dbbaec1e2c4f97cc7...b38b93db8b7449d78f31a382fe964945/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #12160: ARROW-13467: [C++] Support delta dictionaries in the IPC file format

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12160:
URL: https://github.com/apache/arrow/pull/12160#issuecomment-1024007598


   Benchmark runs are scheduled for baseline = 8905de9b3db1667eff7678a3cad2de0b64ff46bf and contender = 3663971f17cc5cc32bb389ad959eb5b30dacb1e1. 3663971f17cc5cc32bb389ad959eb5b30dacb1e1 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/7cc234dc3b814d458aea17d6a3f5d5dd...69f8be77e75049108670eb249c832996/)
   [Finished :arrow_down:5.0% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/16eafaae15dd4f8097cf3b98e9ac15d5...9752d6b458324331a1ad492b012253a3/)
   [Finished :arrow_down:0.22% :arrow_up:0.17%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/de59571d9af8427dbbaec1e2c4f97cc7...b38b93db8b7449d78f31a382fe964945/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] westonpace commented on a change in pull request #12160: ARROW-13467: [C++] Support delta dictionaries in the IPC file format

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #12160:
URL: https://github.com/apache/arrow/pull/12160#discussion_r789254214



##########
File path: python/pyarrow/tests/test_ipc.py
##########
@@ -136,6 +140,22 @@ def stream_fixture():
     return StreamFormatFixture()
 
 
+@pytest.fixture(params=[
+    FileFormatFixture(),
+    StreamFormatFixture()
+    # pytest.param(
+    #     pytest.lazy_fixture('file_fixture'),
+    #     id='File Format'
+    # ),
+    # pytest.param(
+    #     pytest.lazy_fixture('stream_fixture'),
+    #     id='Stream Format'
+    # )

Review comment:
       Good catch thanks :).  I had thought this was causing problems at one point but it wasn't so I needed to uncomment and use this instead as it has nicer names.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #12160: ARROW-13467: [C++] Support delta dictionaries in the IPC file format

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12160:
URL: https://github.com/apache/arrow/pull/12160#issuecomment-1024007598


   Benchmark runs are scheduled for baseline = 8905de9b3db1667eff7678a3cad2de0b64ff46bf and contender = 3663971f17cc5cc32bb389ad959eb5b30dacb1e1. 3663971f17cc5cc32bb389ad959eb5b30dacb1e1 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/7cc234dc3b814d458aea17d6a3f5d5dd...69f8be77e75049108670eb249c832996/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/16eafaae15dd4f8097cf3b98e9ac15d5...9752d6b458324331a1ad492b012253a3/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/de59571d9af8427dbbaec1e2c4f97cc7...b38b93db8b7449d78f31a382fe964945/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org