You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "manolama (via GitHub)" <gi...@apache.org> on 2023/10/09 21:56:35 UTC

[PR] GH-38168: [Java] Fix multi-batch Dictionary in Arrow{Reader|Writer} [arrow]

manolama opened a new pull request, #38169:
URL: https://github.com/apache/arrow/pull/38169

   <!--
   Thanks for opening a pull request!
   If this is your first pull request you can find detailed information on how 
   to contribute here:
     * [New Contributor's Guide](https://arrow.apache.org/docs/dev/developers/guide/step_by_step/pr_lifecycle.html#reviews-and-merge-of-the-pull-request)
     * [Contributing Overview](https://arrow.apache.org/docs/dev/developers/overview.html)
   
   
   If this is not a [minor PR](https://github.com/apache/arrow/blob/main/CONTRIBUTING.md#Minor-Fixes). Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose
   
   Opening GitHub issues ahead of time contributes to the [Openness](http://theapacheway.com/open/#:~:text=Openness%20allows%20new%20users%20the,must%20happen%20in%20the%20open.) of the Apache Arrow project.
   
   Then could you also rename the pull request title in the following format?
   
       GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
   
   or
   
       MINOR: [${COMPONENT}] ${SUMMARY}
   
   In the case of PARQUET issues on JIRA the title also supports:
   
       PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
   
   -->
   
   ### Rationale for this change
   
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   Previously, writing a file using the `ArrowFileWriter` class would only flush the initial state of dictionary vectors to the file so that subsequent updates were ignored. Likewise the `ArrowFileReader` only read the first dictionary block and re-used it for all subsequent data blocks.
   
   ### What changes are included in this PR?
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   The `ArrowFileWriter` now flushes dictionary vectors on each call to `writeBatch()`. The `ArrowFileReader` will now load the dictionaries for each block on `loadNextBatch()` or `loadRecordBatch(1)`.
   
   ### Are these changes tested?
   
   <!--
   We typically require tests for all PRs in order to:
   1. Prevent the code from being accidentally broken by subsequent changes
   2. Serve as another way to document the expected behavior of the code
   
   If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
   -->
   
   Yes
   
   ### Are there any user-facing changes?
   
   If users relied on previous behavior to encode a single dictionary for use across multiple batches without the `isDelta` flag set, this change may break that behavior. However per the documentation at https://arrow.apache.org/docs/format/Columnar.html#dictionary-messages, the previous behavior was in error. 
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please uncomment the line below and explain which changes are breaking.
   -->
   <!-- **This PR includes breaking changes to public APIs.** -->
   
   <!--
   Please uncomment the line below (and provide explanation) if the changes fix either (a) a security vulnerability, (b) a bug that caused incorrect or invalid data to be produced, or (c) a bug that causes a crash (even when the API contract is upheld). We use this to highlight fixes to issues that may affect users without their knowledge. For this reason, fixing bugs that cause errors don't count, since those are usually obvious.
   -->
   <!-- **This PR contains a "Critical Fix".** -->


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-38168: [Java] Fix multi-batch Dictionary in ArrowFile{Reader|Writer} [arrow]

Posted by "manolama (via GitHub)" <gi...@apache.org>.
manolama commented on PR #38169:
URL: https://github.com/apache/arrow/pull/38169#issuecomment-1758173077

   > The main thing I'd make sure of is that this writes delta dictionaries and not replacement dictionaries
   @lidavidm I mentioned in the commit comment that this is _not_ a fix for delta writing (though it _may_ help if we add an API to set the delta flag. It's always false.) Instead I need replacement dictionaries for parallel processing of the blocks. 
   
   I can see if the repo has a delta encoded dictionary somewhere to test decoding and maybe try and test delta encoding somehow.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-38168: [Java] Fix multi-batch Dictionary in ArrowFile{Reader|Writer} [arrow]

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on PR #38169:
URL: https://github.com/apache/arrow/pull/38169#issuecomment-1758243630

   Er, sorry, I should say deltas are allowed in streams, but not files.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-38168: [Java] Fix multi-batch Dictionary in Arrow{Reader|Writer} [arrow]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #38169:
URL: https://github.com/apache/arrow/pull/38169#issuecomment-1753937921

   :warning: GitHub issue #38168 **has been automatically assigned in GitHub** to PR creator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-38168: [Java] Fix multi-batch Dictionary in ArrowFile{Reader|Writer} [arrow]

Posted by "vibhatha (via GitHub)" <gi...@apache.org>.
vibhatha commented on PR #38169:
URL: https://github.com/apache/arrow/pull/38169#issuecomment-1757770202

   I will check @lidavidm 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-38168: [Java] Fix multi-batch Dictionary in ArrowFile{Reader|Writer} [arrow]

Posted by "manolama (via GitHub)" <gi...@apache.org>.
manolama commented on PR #38169:
URL: https://github.com/apache/arrow/pull/38169#issuecomment-1765339201

   Heard back from Micah and we'd need a format change first. Will take a look at that later if we get a bit more involved with Arrow. Closing this for now, thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-38168: [Java] Fix multi-batch Dictionary in ArrowFile{Reader|Writer} [arrow]

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on PR #38169:
URL: https://github.com/apache/arrow/pull/38169#issuecomment-1758177402

   > [Further more, it is invalid to have more than one non-delta dictionary batch per dictionary ID (i.e. dictionary replacement is not supported).](https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-38168: [Java] Fix multi-batch Dictionary in ArrowFile{Reader|Writer} [arrow]

Posted by "vibhatha (via GitHub)" <gi...@apache.org>.
vibhatha commented on code in PR #38169:
URL: https://github.com/apache/arrow/pull/38169#discussion_r1357645095


##########
java/vector/src/main/java/org/apache/arrow/vector/ipc/ArrowFileWriter.java:
##########
@@ -129,12 +128,6 @@ protected void endInternal(WriteChannel out) throws IOException {
   @Override
   protected void ensureDictionariesWritten(DictionaryProvider provider, Set<Long> dictionaryIdsUsed)
       throws IOException {
-    if (dictionariesWritten) {
-      return;
-    }
-    dictionariesWritten = true;
-    // Write out all dictionaries required.
-    // Replacement dictionaries are not supported in the IPC file format.

Review Comment:
   Just for my clarification:
   
   Referring to [Apache Arrow Columnar specification](https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format). 
   > Further more, it is invalid to have more than one non-delta dictionary batch per dictionary ID (i.e. dictionary replacement is not supported). 
   
   Here the relevant part of the code to provide that functionality has been removed. Does this change that functionality? Does it need a documentation update?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-38168: [Java] Fix multi-batch Dictionary in ArrowFile{Reader|Writer} [arrow]

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on PR #38169:
URL: https://github.com/apache/arrow/pull/38169#issuecomment-1758315988

   It's linked above. https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format
   
   Sorry, I don't know what's with me today. My original statement was right. "Further more, it is invalid to have more than one non-delta dictionary batch per dictionary ID (i.e. dictionary replacement is not supported)." -> deltas are allowed in files, just not replacements


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-38168: [Java] Fix multi-batch Dictionary in ArrowFile{Reader|Writer} [arrow]

Posted by "manolama (via GitHub)" <gi...@apache.org>.
manolama commented on PR #38169:
URL: https://github.com/apache/arrow/pull/38169#issuecomment-1758211676

   Oof, ok thanks for that info. I'll tweak this for the dictionary then and see if I can get the folks to allow replacements in the file format as well. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-38168: [Java] Fix multi-batch Dictionary in ArrowFile{Reader|Writer} [arrow]

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on PR #38169:
URL: https://github.com/apache/arrow/pull/38169#issuecomment-1758208886

   Replacement is allowed in streams, but not files.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-38168: [Java] Fix multi-batch Dictionary in ArrowFile{Reader|Writer} [arrow]

Posted by "manolama (via GitHub)" <gi...@apache.org>.
manolama commented on PR #38169:
URL: https://github.com/apache/arrow/pull/38169#issuecomment-1758309883

   Could you point me to the spec that talks about files vs streams please?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-38168: [Java] Fix multi-batch Dictionary in ArrowFile{Reader|Writer} [arrow]

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on PR #38169:
URL: https://github.com/apache/arrow/pull/38169#issuecomment-1757735031

   @vibhatha @davisusanibar can one of you review the PR? The main thing I'd make sure of is that this writes delta dictionaries and not replacement dictionaries


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-38168: [Java] Fix multi-batch Dictionary in ArrowFile{Reader|Writer} [arrow]

Posted by "manolama (via GitHub)" <gi...@apache.org>.
manolama commented on PR #38169:
URL: https://github.com/apache/arrow/pull/38169#issuecomment-1758201546

   That's insanely confusing:
   
   > Alternatively, if isDelta is set to false, then the dictionary replaces the existing dictionary for the same ID.
   
   Did the spec change to disallow replacements or is the doc correct and the spec now allows replacement? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-38168: [Java] Fix multi-batch Dictionary in ArrowFile{Reader|Writer} [arrow]

Posted by "manolama (via GitHub)" <gi...@apache.org>.
manolama commented on PR #38169:
URL: https://github.com/apache/arrow/pull/38169#issuecomment-1758441825

   No worries. So nothing more formal defining the spec than that doc? Guess I'll go through Jiras to find where that came from. Files should support replacement if you ask me.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-38168: [Java] Fix multi-batch Dictionary in ArrowFile{Reader|Writer} [arrow]

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on PR #38169:
URL: https://github.com/apache/arrow/pull/38169#issuecomment-1758176694

   The IPC file format explicitly disallows replacement dictionaries.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-38168: [Java] Fix multi-batch Dictionary in ArrowFile{Reader|Writer} [arrow]

Posted by "manolama (via GitHub)" <gi...@apache.org>.
manolama closed pull request #38169: GH-38168: [Java] Fix multi-batch Dictionary in ArrowFile{Reader|Writer}
URL: https://github.com/apache/arrow/pull/38169


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org