You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/15 01:31:40 UTC
[GitHub] [arrow] westonpace opened a new pull request #12160: ARROW-13467: [C++] Support delta dictionaries in the IPC file format
westonpace opened a new pull request #12160:
URL: https://github.com/apache/arrow/pull/12160
[] Add test cases
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] rok commented on a change in pull request #12160: ARROW-13467: [C++] Support delta dictionaries in the IPC file format
Posted by GitBox <gi...@apache.org>.
rok commented on a change in pull request #12160:
URL: https://github.com/apache/arrow/pull/12160#discussion_r789073033
##########
File path: python/pyarrow/tests/test_ipc.py
##########
@@ -136,6 +140,22 @@ def stream_fixture():
return StreamFormatFixture()
+@pytest.fixture(params=[
+ FileFormatFixture(),
+ StreamFormatFixture()
+ # pytest.param(
+ # pytest.lazy_fixture('file_fixture'),
+ # id='File Format'
+ # ),
+ # pytest.param(
+ # pytest.lazy_fixture('stream_fixture'),
+ # id='Stream Format'
+ # )
Review comment:
This is probably redundant?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] rok commented on pull request #12160: ARROW-13467: [C++] Support delta dictionaries in the IPC file format
Posted by GitBox <gi...@apache.org>.
rok commented on pull request #12160:
URL: https://github.com/apache/arrow/pull/12160#issuecomment-1022366001
@westonpace can this be merged?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou commented on a change in pull request #12160: ARROW-13467: [C++] Support delta dictionaries in the IPC file format
Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #12160:
URL: https://github.com/apache/arrow/pull/12160#discussion_r793768579
##########
File path: python/pyarrow/ipc.pxi
##########
@@ -124,6 +124,15 @@ cdef class IpcWriteOptions(_Weakrefable):
emit_dictionary_deltas : bool
Whether to emit dictionary deltas. Default is false for maximum
stream compatibility.
+ unify_dictionaries : bool
+ If true then calls to write_table will attempt to unify dictionaries
+ across all batches in the table. This can help avoid the need for
+ replacement dictionaries (which the file format does not support)
+ but requires computing the unified dictionary and then remapping
+ the indices arrays.
+
+ This property is ignored when writing to the streaming format as
+ the streaming format can support replacement dictionaries.
Review comment:
```suggestion
This parameter is ignored when writing to the IPC stream format as
the IPC stream format can support replacement dictionaries.
```
##########
File path: cpp/src/arrow/ipc/options.h
##########
@@ -87,9 +87,9 @@ struct ARROW_EXPORT IpcWriteOptions {
/// \brief Whether to unify dictionaries for the IPC file format
///
- /// The IPC file format doesn't support dictionary replacements or deltas.
+ /// The IPC file format doesn't support dictionary replacements.
/// Therefore, chunks of a column with a dictionary type must have the same
- /// dictionary in each record batch.
+ /// dictionary in each record batch (or an extended dictionary + delta).
///
/// If this option is true, RecordBatchWriter::WriteTable will attempt
/// to unify dictionaries across each table column. If this option is
Review comment:
Does the sentence below need to be updated? ("unequal dictionaries")
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] westonpace commented on a change in pull request #12160: ARROW-13467: [C++] Support delta dictionaries in the IPC file format
Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #12160:
URL: https://github.com/apache/arrow/pull/12160#discussion_r794182177
##########
File path: cpp/src/arrow/ipc/options.h
##########
@@ -87,9 +87,9 @@ struct ARROW_EXPORT IpcWriteOptions {
/// \brief Whether to unify dictionaries for the IPC file format
///
- /// The IPC file format doesn't support dictionary replacements or deltas.
+ /// The IPC file format doesn't support dictionary replacements.
/// Therefore, chunks of a column with a dictionary type must have the same
- /// dictionary in each record batch.
+ /// dictionary in each record batch (or an extended dictionary + delta).
///
/// If this option is true, RecordBatchWriter::WriteTable will attempt
/// to unify dictionaries across each table column. If this option is
Review comment:
I changed from unequal to incompatible.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] ursabot edited a comment on pull request #12160: ARROW-13467: [C++] Support delta dictionaries in the IPC file format
Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12160:
URL: https://github.com/apache/arrow/pull/12160#issuecomment-1024007598
Benchmark runs are scheduled for baseline = 8905de9b3db1667eff7678a3cad2de0b64ff46bf and contender = 3663971f17cc5cc32bb389ad959eb5b30dacb1e1. 3663971f17cc5cc32bb389ad959eb5b30dacb1e1 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/7cc234dc3b814d458aea17d6a3f5d5dd...69f8be77e75049108670eb249c832996/)
[Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/16eafaae15dd4f8097cf3b98e9ac15d5...9752d6b458324331a1ad492b012253a3/)
[Finished :arrow_down:0.22% :arrow_up:0.17%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/de59571d9af8427dbbaec1e2c4f97cc7...b38b93db8b7449d78f31a382fe964945/)
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] lidavidm commented on pull request #12160: ARROW-13467: [C++] Support delta dictionaries in the IPC file format
Posted by GitBox <gi...@apache.org>.
lidavidm commented on pull request #12160:
URL: https://github.com/apache/arrow/pull/12160#issuecomment-1020116938
(Also, sorry, do you mind updating the PR description for the eventual commit message?)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] westonpace commented on pull request #12160: ARROW-13467: [C++] Support delta dictionaries in the IPC file format
Posted by GitBox <gi...@apache.org>.
westonpace commented on pull request #12160:
URL: https://github.com/apache/arrow/pull/12160#issuecomment-1023858993
Thanks for the reminder @rok and the suggestions @pitrou . I'll merge on green and tackle ARROW-15425 tomorrow.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #12160: ARROW-13467: [C++] Support delta dictionaries in the IPC file format
Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #12160:
URL: https://github.com/apache/arrow/pull/12160#issuecomment-1013575092
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] lidavidm commented on pull request #12160: ARROW-13467: [C++] Support delta dictionaries in the IPC file format
Posted by GitBox <gi...@apache.org>.
lidavidm commented on pull request #12160:
URL: https://github.com/apache/arrow/pull/12160#issuecomment-1020116300
It doesn't seem so, filed ARROW-15425.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] westonpace closed pull request #12160: ARROW-13467: [C++] Support delta dictionaries in the IPC file format
Posted by GitBox <gi...@apache.org>.
westonpace closed pull request #12160:
URL: https://github.com/apache/arrow/pull/12160
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] ursabot commented on pull request #12160: ARROW-13467: [C++] Support delta dictionaries in the IPC file format
Posted by GitBox <gi...@apache.org>.
ursabot commented on pull request #12160:
URL: https://github.com/apache/arrow/pull/12160#issuecomment-1024007598
Benchmark runs are scheduled for baseline = 8905de9b3db1667eff7678a3cad2de0b64ff46bf and contender = 3663971f17cc5cc32bb389ad959eb5b30dacb1e1. 3663971f17cc5cc32bb389ad959eb5b30dacb1e1 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Scheduled] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/7cc234dc3b814d458aea17d6a3f5d5dd...69f8be77e75049108670eb249c832996/)
[Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/16eafaae15dd4f8097cf3b98e9ac15d5...9752d6b458324331a1ad492b012253a3/)
[Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/de59571d9af8427dbbaec1e2c4f97cc7...b38b93db8b7449d78f31a382fe964945/)
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] ursabot edited a comment on pull request #12160: ARROW-13467: [C++] Support delta dictionaries in the IPC file format
Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12160:
URL: https://github.com/apache/arrow/pull/12160#issuecomment-1024007598
Benchmark runs are scheduled for baseline = 8905de9b3db1667eff7678a3cad2de0b64ff46bf and contender = 3663971f17cc5cc32bb389ad959eb5b30dacb1e1. 3663971f17cc5cc32bb389ad959eb5b30dacb1e1 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/7cc234dc3b814d458aea17d6a3f5d5dd...69f8be77e75049108670eb249c832996/)
[Finished :arrow_down:5.0% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/16eafaae15dd4f8097cf3b98e9ac15d5...9752d6b458324331a1ad492b012253a3/)
[Finished :arrow_down:0.22% :arrow_up:0.17%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/de59571d9af8427dbbaec1e2c4f97cc7...b38b93db8b7449d78f31a382fe964945/)
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] westonpace commented on a change in pull request #12160: ARROW-13467: [C++] Support delta dictionaries in the IPC file format
Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #12160:
URL: https://github.com/apache/arrow/pull/12160#discussion_r789254214
##########
File path: python/pyarrow/tests/test_ipc.py
##########
@@ -136,6 +140,22 @@ def stream_fixture():
return StreamFormatFixture()
+@pytest.fixture(params=[
+ FileFormatFixture(),
+ StreamFormatFixture()
+ # pytest.param(
+ # pytest.lazy_fixture('file_fixture'),
+ # id='File Format'
+ # ),
+ # pytest.param(
+ # pytest.lazy_fixture('stream_fixture'),
+ # id='Stream Format'
+ # )
Review comment:
Good catch thanks :). I had thought this was causing problems at one point but it wasn't so I needed to uncomment and use this instead as it has nicer names.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] ursabot edited a comment on pull request #12160: ARROW-13467: [C++] Support delta dictionaries in the IPC file format
Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12160:
URL: https://github.com/apache/arrow/pull/12160#issuecomment-1024007598
Benchmark runs are scheduled for baseline = 8905de9b3db1667eff7678a3cad2de0b64ff46bf and contender = 3663971f17cc5cc32bb389ad959eb5b30dacb1e1. 3663971f17cc5cc32bb389ad959eb5b30dacb1e1 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/7cc234dc3b814d458aea17d6a3f5d5dd...69f8be77e75049108670eb249c832996/)
[Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/16eafaae15dd4f8097cf3b98e9ac15d5...9752d6b458324331a1ad492b012253a3/)
[Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/de59571d9af8427dbbaec1e2c4f97cc7...b38b93db8b7449d78f31a382fe964945/)
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org