You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@arrow.apache.org by GitBox <gi...@apache.org> on 2021/03/04 11:20:59 UTC

[GitHub] [arrow-testing] jmgpeeters opened a new pull request #59: ARROW-11838: files for testing IPC reads with shared dictionaries.

jmgpeeters opened a new pull request #59:
URL: https://github.com/apache/arrow-testing/pull/59


   ARROW-11838 aims to add C++ read capability for IPC data with shared dictionaries. 
   
   Write support isn't available yet either, and out of scope here, so these files - allowing testing of the read functionality - were generated from Java.
   
   Ideally this PR is reviewed alongside the upcoming one in apache/arrow that contains the src and test changes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-testing] pitrou commented on pull request #59: ARROW-11838: files for testing IPC reads with shared dictionaries.

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #59:
URL: https://github.com/apache/arrow-testing/pull/59#issuecomment-796719583


   @jmgpeeters It seems these should go into the "golden files" used for integration testing, see https://github.com/apache/arrow-testing/tree/master/data/arrow-ipc-stream/integration
   Integration testing is documented here: https://arrow.apache.org/docs/format/Integration.html
   The integration testing machinery is maintained here: https://github.com/apache/arrow/tree/master/dev/archery/archery/integration
   
   Don't hesitate to ask questions if you have trouble navigating this.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-testing] pitrou merged pull request #59: ARROW-11838: files for testing IPC reads with shared dictionaries.

Posted by GitBox <gi...@apache.org>.
pitrou merged pull request #59:
URL: https://github.com/apache/arrow-testing/pull/59


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-testing] pitrou commented on pull request #59: ARROW-11838: files for testing IPC reads with shared dictionaries.

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #59:
URL: https://github.com/apache/arrow-testing/pull/59#issuecomment-801043407


   @jmgpeeters That sounds right. AFAIU, the version number is the Arrow release (at least C++ or Java) that these files are supposed to be compatible with.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-testing] jmgpeeters commented on pull request #59: ARROW-11838: files for testing IPC reads with shared dictionaries.

Posted by GitBox <gi...@apache.org>.
jmgpeeters commented on pull request #59:
URL: https://github.com/apache/arrow-testing/pull/59#issuecomment-796777534


   Agreed. I'll make the changes and get back to you.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-testing] jmgpeeters commented on pull request #59: ARROW-11838: files for testing IPC reads with shared dictionaries.

Posted by GitBox <gi...@apache.org>.
jmgpeeters commented on pull request #59:
URL: https://github.com/apache/arrow-testing/pull/59#issuecomment-796737078


   Ah, thanks, I wasn't aware of the Archery integration suite. Had a quick glance, and seems to make sense. Was a bit worried it would require support in all languages for shared dicts, but it seems easy to disable languages per folder etc.
   
   One thing I noticed from the JSON format is that it doesn't (appear to) support dictionary restatement, i.e. schema -> dict_batch[id=1] -> batch -> dict_batch[id=1] -> batch -> ... as we have in the streaming format, and that I'm currently explicitly testing in the bespoke tests. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-testing] jmgpeeters commented on pull request #59: ARROW-11838: files for testing IPC reads with shared dictionaries.

Posted by GitBox <gi...@apache.org>.
jmgpeeters commented on pull request #59:
URL: https://github.com/apache/arrow-testing/pull/59#issuecomment-801014168


   @pitrou I've moved the golden files into 4.0.0-shareddicts and created a json.gz to validate against.
   Is 4.0.0 right? Not quite sure what the convention is, but - assuming this all gets merged eventually - the integration test should work with library version 4.0.0 and above.
   
   I've tested locally with Archery against C++, which works fine.  To make the changes in archery (in the sister PR over in apache/arrow), it would be easiest to get this one merged first, so I can rely on the github checks to figure out which implementations might need to be skipped, rather than having to build e.g. Java, Julia, Rust, ... locally.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-testing] pitrou commented on pull request #59: ARROW-11838: files for testing IPC reads with shared dictionaries.

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #59:
URL: https://github.com/apache/arrow-testing/pull/59#issuecomment-796768922


   Indeed, the JSON format doesn't support it, so that will be a problem if we want to do roundtripping tests with the integration machinery.
   However, I think we can still use the "golden files" part of integration testing, because there the logic for each implementation is (see [here](https://github.com/apache/arrow/blob/master/cpp/src/arrow/testing/json_integration_test.cc#L225-L234) for the C++ implementation):
   * read the JSON file and convert it into a series of record batches
   * read the Arrow file and decode it into a series of record batches
   * compare respective record batches for equality
   
   Comparing for equality doesn't care if the dictionaries are shared, so this should be ok for testing the ability to read IPC files with shared dictionaries.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org