You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/09/29 17:07:02 UTC

[GitHub] [arrow] pitrou opened a new pull request #8302: ARROW-10121: [C++] Fix emission of new dictionaries in IPC writer

pitrou opened a new pull request #8302:
URL: https://github.com/apache/arrow/pull/8302


   When a dictionary changes from the previous batch, we should re-emit it.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on pull request #8302: ARROW-10121: [C++] Fix emission of new dictionaries in IPC writer

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #8302:
URL: https://github.com/apache/arrow/pull/8302#issuecomment-703641811


   I can take a final look this morning. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8302: ARROW-10121: [C++] Fix emission of new dictionaries in IPC writer

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8302:
URL: https://github.com/apache/arrow/pull/8302#issuecomment-700861455


   https://issues.apache.org/jira/browse/ARROW-10121


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #8302: ARROW-10121: [C++] Fix emission of new dictionaries in IPC writer

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #8302:
URL: https://github.com/apache/arrow/pull/8302#issuecomment-703512955


   @wesm Do you want to review this again? Otherwise perhaps @bkietz ?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on a change in pull request #8302: ARROW-10121: [C++] Fix emission of new dictionaries in IPC writer

Posted by GitBox <gi...@apache.org>.
wesm commented on a change in pull request #8302:
URL: https://github.com/apache/arrow/pull/8302#discussion_r496941604



##########
File path: cpp/src/arrow/ipc/writer.cc
##########
@@ -982,13 +982,7 @@ class ARROW_EXPORT IpcFormatWriter : public RecordBatchWriter {
 
     RETURN_NOT_OK(CheckStarted());
 
-    if (!wrote_dictionaries_) {
-      RETURN_NOT_OK(WriteDictionaries(batch));
-      wrote_dictionaries_ = true;
-    }
-
-    // TODO: Check for delta dictionaries. Can we scan for deltas while computing
-    // the RecordBatch payload to save time?
+    RETURN_NOT_OK(WriteDictionaries(batch));

Review comment:
       I assume this will populate `last_dictionaries_` entries even when the length of the dictionary is zero? I recall there was some discussion about empty dictionaries (and whether they need to be written at all) at the start of a stream

##########
File path: cpp/src/arrow/ipc/writer.cc
##########
@@ -1020,13 +1014,30 @@ class ARROW_EXPORT IpcFormatWriter : public RecordBatchWriter {
   Status WriteDictionaries(const RecordBatch& batch) {
     ARROW_ASSIGN_OR_RAISE(const auto dictionaries, CollectDictionaries(batch, mapper_));
 
+    // TODO: Check for delta dictionaries. Can we scan for deltas while computing
+    // the RecordBatch payload to save time?
+
     for (const auto& pair : dictionaries) {
       IpcPayload payload;
       int64_t dictionary_id = pair.first;
       const auto& dictionary = pair.second;
 
+      // If a dictionary with this id was already emitted, check if it was the same.
+      // Note we don't compare dictionaries by value because it may be costly.
+      auto* last_dictionary = &last_dictionaries_[dictionary_id];
+      {
+        const auto maybe_last_dict = last_dictionary->lock();
+        if (maybe_last_dict && maybe_last_dict->data() == dictionary->data()) {
+          // Same dictionary data by pointer => no need to emit it again
+          continue;
+        }
+      }
+
       RETURN_NOT_OK(GetDictionaryPayload(dictionary_id, dictionary, options_, &payload));
       RETURN_NOT_OK(payload_writer_->WritePayload(payload));

Review comment:
       This should only be allowable if we are writing a stream? Dictionary replacements aren't allowed for IPC files




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on a change in pull request #8302: ARROW-10121: [C++] Fix emission of new dictionaries in IPC writer

Posted by GitBox <gi...@apache.org>.
wesm commented on a change in pull request #8302:
URL: https://github.com/apache/arrow/pull/8302#discussion_r499856714



##########
File path: cpp/src/arrow/ipc/writer.cc
##########
@@ -982,13 +982,7 @@ class ARROW_EXPORT IpcFormatWriter : public RecordBatchWriter {
 
     RETURN_NOT_OK(CheckStarted());
 
-    if (!wrote_dictionaries_) {
-      RETURN_NOT_OK(WriteDictionaries(batch));
-      wrote_dictionaries_ = true;
-    }
-
-    // TODO: Check for delta dictionaries. Can we scan for deltas while computing
-    // the RecordBatch payload to save time?
+    RETURN_NOT_OK(WriteDictionaries(batch));

Review comment:
       I'm not sure if skipped dictionaries will work, but they are permitted by the format, so it should probably be investigated in a follow up issue




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #8302: ARROW-10121: [C++] Fix emission of new dictionaries in IPC writer

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #8302:
URL: https://github.com/apache/arrow/pull/8302#issuecomment-700852780


   @wesm This is a first draft. It is functional (i.e. fixes the underlying issue) but doesn't try to detect deltas.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #8302: ARROW-10121: [C++] Fix emission of new dictionaries in IPC writer

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8302:
URL: https://github.com/apache/arrow/pull/8302#discussion_r497538522



##########
File path: cpp/src/arrow/ipc/writer.cc
##########
@@ -1020,13 +1014,30 @@ class ARROW_EXPORT IpcFormatWriter : public RecordBatchWriter {
   Status WriteDictionaries(const RecordBatch& batch) {
     ARROW_ASSIGN_OR_RAISE(const auto dictionaries, CollectDictionaries(batch, mapper_));
 
+    // TODO: Check for delta dictionaries. Can we scan for deltas while computing
+    // the RecordBatch payload to save time?
+
     for (const auto& pair : dictionaries) {
       IpcPayload payload;
       int64_t dictionary_id = pair.first;
       const auto& dictionary = pair.second;
 
+      // If a dictionary with this id was already emitted, check if it was the same.
+      // Note we don't compare dictionaries by value because it may be costly.
+      auto* last_dictionary = &last_dictionaries_[dictionary_id];
+      {
+        const auto maybe_last_dict = last_dictionary->lock();
+        if (maybe_last_dict && maybe_last_dict->data() == dictionary->data()) {
+          // Same dictionary data by pointer => no need to emit it again
+          continue;
+        }
+      }
+
       RETURN_NOT_OK(GetDictionaryPayload(dictionary_id, dictionary, options_, &payload));
       RETURN_NOT_OK(payload_writer_->WritePayload(payload));

Review comment:
       Done.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm closed pull request #8302: ARROW-10121: [C++] Fix emission of new dictionaries in IPC writer

Posted by GitBox <gi...@apache.org>.
wesm closed pull request #8302:
URL: https://github.com/apache/arrow/pull/8302


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #8302: ARROW-10121: [C++] Fix emission of new dictionaries in IPC writer

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8302:
URL: https://github.com/apache/arrow/pull/8302#discussion_r496942764



##########
File path: cpp/src/arrow/ipc/writer.cc
##########
@@ -982,13 +982,7 @@ class ARROW_EXPORT IpcFormatWriter : public RecordBatchWriter {
 
     RETURN_NOT_OK(CheckStarted());
 
-    if (!wrote_dictionaries_) {
-      RETURN_NOT_OK(WriteDictionaries(batch));
-      wrote_dictionaries_ = true;
-    }
-
-    // TODO: Check for delta dictionaries. Can we scan for deltas while computing
-    // the RecordBatch payload to save time?
+    RETURN_NOT_OK(WriteDictionaries(batch));

Review comment:
       Ah, we can skip zero-length dictionaries explicitly, but won't it confuse the IPC reader?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #8302: ARROW-10121: [C++] Fix emission of new dictionaries in IPC writer

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8302:
URL: https://github.com/apache/arrow/pull/8302#discussion_r496943047



##########
File path: cpp/src/arrow/ipc/writer.cc
##########
@@ -1020,13 +1014,30 @@ class ARROW_EXPORT IpcFormatWriter : public RecordBatchWriter {
   Status WriteDictionaries(const RecordBatch& batch) {
     ARROW_ASSIGN_OR_RAISE(const auto dictionaries, CollectDictionaries(batch, mapper_));
 
+    // TODO: Check for delta dictionaries. Can we scan for deltas while computing
+    // the RecordBatch payload to save time?
+
     for (const auto& pair : dictionaries) {
       IpcPayload payload;
       int64_t dictionary_id = pair.first;
       const auto& dictionary = pair.second;
 
+      // If a dictionary with this id was already emitted, check if it was the same.
+      // Note we don't compare dictionaries by value because it may be costly.
+      auto* last_dictionary = &last_dictionaries_[dictionary_id];
+      {
+        const auto maybe_last_dict = last_dictionary->lock();
+        if (maybe_last_dict && maybe_last_dict->data() == dictionary->data()) {
+          // Same dictionary data by pointer => no need to emit it again
+          continue;
+        }
+      }
+
       RETURN_NOT_OK(GetDictionaryPayload(dictionary_id, dictionary, options_, &payload));
       RETURN_NOT_OK(payload_writer_->WritePayload(payload));

Review comment:
       Hmm, right. I should check how the file writer deals with this.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #8302: ARROW-10121: [C++] Fix emission of new dictionaries in IPC writer

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #8302:
URL: https://github.com/apache/arrow/pull/8302#issuecomment-701527565


   I still need to add tests for this (I'll do that once #8309 is merged). Detecting and writing delta dictionaries is not implemented, but that's a feature, not a bug, so it may also wait for another PR.
   (also, it will have to be disabled by default, for compatibility with other IPC readers)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org