You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "lidavidm (via GitHub)" <gi...@apache.org> on 2023/06/09 15:14:52 UTC

[GitHub] [arrow] lidavidm commented on a diff in pull request #35920: GH-18547: [Java] Support re-emitting dictionaries in ArrowStreamWriter

lidavidm commented on code in PR #35920:
URL: https://github.com/apache/arrow/pull/35920#discussion_r1224427836


##########
java/tools/src/main/java/org/apache/arrow/tools/EchoServer.java:
##########
@@ -135,6 +135,7 @@ public void run() throws IOException {
         Preconditions.checkState(reader.bytesRead() == writer.bytesWritten());
         LOGGER.debug(String.format("Echoed %d records", echoed));
         reader.close(false);
+        writer.close();

Review Comment:
   Thanks. I think it's reasonable to expect that you must close the writer anyways (and that you should generally try-with-resources these writers).



##########
java/vector/src/main/java/org/apache/arrow/vector/ipc/ArrowStreamWriter.java:
##########
@@ -121,4 +129,45 @@ public static void writeEndOfStream(WriteChannel out, IpcOption option) throws I
   protected void endInternal(WriteChannel out) throws IOException {
     writeEndOfStream(out, option);
   }
+
+  @Override
+  protected void ensureDictionariesWritten(DictionaryProvider provider, Set<Long> dictionaryIdsUsed)

Review Comment:
   Oof, so for each batch we write, we need to re-compare all dictionaries? That may also be worth noting in the PR description.



##########
java/vector/src/main/java/org/apache/arrow/vector/ipc/ArrowStreamWriter.java:
##########
@@ -121,4 +129,45 @@ public static void writeEndOfStream(WriteChannel out, IpcOption option) throws I
   protected void endInternal(WriteChannel out) throws IOException {
     writeEndOfStream(out, option);
   }
+
+  @Override
+  protected void ensureDictionariesWritten(DictionaryProvider provider, Set<Long> dictionaryIdsUsed)

Review Comment:
   That said, this is broadly in line with the C++ implementation: https://github.com/apache/arrow/blob/766e25440250dde9117e19245389badb5ed7678c/cpp/src/arrow/ipc/writer.cc#L1121
   
   The C++ implementation takes a number of shortcuts to avoid having to compare the full data buffers on each batch. However, because vectors are mutable in the Java implementation, we can't easily make the same optimization here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org