You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/07/07 19:00:46 UTC

[GitHub] [arrow] pitrou opened a new pull request #7664: ARROW-9265: [C++] Allow writing and reading V4-compliant IPC data

pitrou opened a new pull request #7664:
URL: https://github.com/apache/arrow/pull/7664


   V4 Union arrays with top-level null slots are disallowed, though.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on a change in pull request #7664: ARROW-9265: [C++] Allow writing and reading V4-compliant IPC data

Posted by GitBox <gi...@apache.org>.
wesm commented on a change in pull request #7664:
URL: https://github.com/apache/arrow/pull/7664#discussion_r452544857



##########
File path: python/pyarrow/ipc.pxi
##########
@@ -18,6 +18,32 @@
 import warnings
 
 
+cpdef enum MetadataVersion:
+    V1 = <char> CMetadataVersion_V1
+    V2 = <char> CMetadataVersion_V2
+    V3 = <char> CMetadataVersion_V3
+    V4 = <char> CMetadataVersion_V4
+    V5 = <char> CMetadataVersion_V5
+
+
+cdef object _wrap_metadata_version(CMetadataVersion version):
+    return MetadataVersion(<char> version)

Review comment:
       Sounds like it (Java does not check)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on pull request #7664: ARROW-9265: [C++] Allow writing and reading V4-compliant IPC data

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7664:
URL: https://github.com/apache/arrow/pull/7664#issuecomment-655530738


   @lidavidm @pitrou I haven't looked at the patch yet, but V5 must be the default produced by writers but V4 can be opted in to. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on a change in pull request #7664: ARROW-9265: [C++] Allow writing and reading V4-compliant IPC data

Posted by GitBox <gi...@apache.org>.
wesm commented on a change in pull request #7664:
URL: https://github.com/apache/arrow/pull/7664#discussion_r452565269



##########
File path: python/pyarrow/ipc.pxi
##########
@@ -18,6 +18,32 @@
 import warnings
 
 
+cpdef enum MetadataVersion:
+    V1 = <char> CMetadataVersion_V1
+    V2 = <char> CMetadataVersion_V2
+    V3 = <char> CMetadataVersion_V3
+    V4 = <char> CMetadataVersion_V4
+    V5 = <char> CMetadataVersion_V5
+
+
+cdef object _wrap_metadata_version(CMetadataVersion version):
+    return MetadataVersion(<char> version)

Review comment:
       See https://github.com/apache/arrow/blob/maint-0.17.x/cpp/src/arrow/ipc/message.cc#L57
   
   So it only checks for old versions but new versions pass silently, which is quite scary to me. But on the other hand the risk of V5 data breaking a V4 application (e.g. Spark) at the moment is low. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on a change in pull request #7664: ARROW-9265: [C++] Allow writing and reading V4-compliant IPC data

Posted by GitBox <gi...@apache.org>.
wesm commented on a change in pull request #7664:
URL: https://github.com/apache/arrow/pull/7664#discussion_r452565638



##########
File path: python/pyarrow/ipc.pxi
##########
@@ -18,6 +18,32 @@
 import warnings
 
 
+cpdef enum MetadataVersion:
+    V1 = <char> CMetadataVersion_V1
+    V2 = <char> CMetadataVersion_V2
+    V3 = <char> CMetadataVersion_V3
+    V4 = <char> CMetadataVersion_V4
+    V5 = <char> CMetadataVersion_V5
+
+
+cdef object _wrap_metadata_version(CMetadataVersion version):
+    return MetadataVersion(<char> version)

Review comment:
       https://issues.apache.org/jira/browse/ARROW-9399




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #7664: ARROW-9265: [C++] Allow writing and reading V4-compliant IPC data

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7664:
URL: https://github.com/apache/arrow/pull/7664#issuecomment-655061168


   Note that https://github.com/apache/arrow-testing/pull/35 needs to be merged first.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #7664: ARROW-9265: [C++] Allow writing and reading V4-compliant IPC data

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7664:
URL: https://github.com/apache/arrow/pull/7664#issuecomment-655379174


   The ASAN/UBSAN Ci failure should be fixed when merging #7644.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on a change in pull request #7664: ARROW-9265: [C++] Allow writing and reading V4-compliant IPC data

Posted by GitBox <gi...@apache.org>.
lidavidm commented on a change in pull request #7664:
URL: https://github.com/apache/arrow/pull/7664#discussion_r452558073



##########
File path: python/pyarrow/ipc.pxi
##########
@@ -18,6 +18,32 @@
 import warnings
 
 
+cpdef enum MetadataVersion:
+    V1 = <char> CMetadataVersion_V1
+    V2 = <char> CMetadataVersion_V2
+    V3 = <char> CMetadataVersion_V3
+    V4 = <char> CMetadataVersion_V4
+    V5 = <char> CMetadataVersion_V5
+
+
+cdef object _wrap_metadata_version(CMetadataVersion version):
+    return MetadataVersion(<char> version)

Review comment:
       Well, neither checks, because #7685 passes (Java sending V5 while C++ sends V4) and this passed (Java sends V4 while C++ sends V5)
   
   Though that may be reassuring to anyone planning to use 0.17.1 with 1.0.0.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on a change in pull request #7664: ARROW-9265: [C++] Allow writing and reading V4-compliant IPC data

Posted by GitBox <gi...@apache.org>.
wesm commented on a change in pull request #7664:
URL: https://github.com/apache/arrow/pull/7664#discussion_r452379377



##########
File path: cpp/src/arrow/ipc/file_to_stream.cc
##########
@@ -39,7 +39,11 @@ Status ConvertToStream(const char* path) {
 
   ARROW_ASSIGN_OR_RAISE(auto in_file, io::ReadableFile::Open(path));
   ARROW_ASSIGN_OR_RAISE(auto reader, ipc::RecordBatchFileReader::Open(in_file.get()));
-  ARROW_ASSIGN_OR_RAISE(auto writer, ipc::NewStreamWriter(&sink, reader->schema()));
+  auto options = IpcWriteOptions::Defaults();
+  // Use V5 to get up-to-date Union buffer layout
+  options.metadata_version = MetadataVersion::V5;

Review comment:
       This should be the default, so not needed

##########
File path: cpp/src/arrow/ipc/stream_to_file.cc
##########
@@ -37,7 +37,10 @@ Status ConvertToFile() {
   io::StdoutStream sink;
 
   ARROW_ASSIGN_OR_RAISE(auto reader, RecordBatchStreamReader::Open(&input));
-  ARROW_ASSIGN_OR_RAISE(auto writer, NewFileWriter(&sink, reader->schema()));
+  auto options = IpcWriteOptions::Defaults();
+  // Use V5 to get up-to-date Union buffer layout
+  options.metadata_version = MetadataVersion::V5;

Review comment:
       Can remove this

##########
File path: cpp/src/arrow/testing/json_integration_test.cc
##########
@@ -83,8 +83,11 @@ static Status ConvertJsonToArrow(const std::string& json_path,
               << reader->schema()->ToString(/* show_metadata = */ true) << std::endl;
   }
 
+  auto options = IpcWriteOptions::Defaults();
+  // Use V5 to get up-to-date Union buffer layout
+  options.metadata_version = MetadataVersion::V5;

Review comment:
       Can remove

##########
File path: python/pyarrow/ipc.pxi
##########
@@ -18,6 +18,32 @@
 import warnings
 
 
+cpdef enum MetadataVersion:
+    V1 = <char> CMetadataVersion_V1
+    V2 = <char> CMetadataVersion_V2
+    V3 = <char> CMetadataVersion_V3
+    V4 = <char> CMetadataVersion_V4
+    V5 = <char> CMetadataVersion_V5
+
+
+cdef object _wrap_metadata_version(CMetadataVersion version):
+    return MetadataVersion(<char> version)

Review comment:
       Out of curiosity, where will we break (presumably earlier than this) if we were to encounter an unrecognized version?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on a change in pull request #7664: ARROW-9265: [C++] Allow writing and reading V4-compliant IPC data

Posted by GitBox <gi...@apache.org>.
wesm commented on a change in pull request #7664:
URL: https://github.com/apache/arrow/pull/7664#discussion_r452564868



##########
File path: python/pyarrow/ipc.pxi
##########
@@ -18,6 +18,32 @@
 import warnings
 
 
+cpdef enum MetadataVersion:
+    V1 = <char> CMetadataVersion_V1
+    V2 = <char> CMetadataVersion_V2
+    V3 = <char> CMetadataVersion_V3
+    V4 = <char> CMetadataVersion_V4
+    V5 = <char> CMetadataVersion_V5
+
+
+cdef object _wrap_metadata_version(CMetadataVersion version):
+    return MetadataVersion(<char> version)

Review comment:
       Yikes, well I will open a JIRA about adding appropriate checks at least for 1.0.0




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #7664: ARROW-9265: [C++] Allow writing and reading V4-compliant IPC data

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7664:
URL: https://github.com/apache/arrow/pull/7664#issuecomment-655358696


   That is right indeed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on a change in pull request #7664: ARROW-9265: [C++] Allow writing and reading V4-compliant IPC data

Posted by GitBox <gi...@apache.org>.
wesm commented on a change in pull request #7664:
URL: https://github.com/apache/arrow/pull/7664#discussion_r452544857



##########
File path: python/pyarrow/ipc.pxi
##########
@@ -18,6 +18,32 @@
 import warnings
 
 
+cpdef enum MetadataVersion:
+    V1 = <char> CMetadataVersion_V1
+    V2 = <char> CMetadataVersion_V2
+    V3 = <char> CMetadataVersion_V3
+    V4 = <char> CMetadataVersion_V4
+    V5 = <char> CMetadataVersion_V5
+
+
+cdef object _wrap_metadata_version(CMetadataVersion version):
+    return MetadataVersion(<char> version)

Review comment:
       Sounds like it




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on pull request #7664: ARROW-9265: [C++] Allow writing and reading V4-compliant IPC data

Posted by GitBox <gi...@apache.org>.
lidavidm commented on pull request #7664:
URL: https://github.com/apache/arrow/pull/7664#issuecomment-655193980


   Just a high level comment: if I'm reading this right, V4 is still the default metadata version and applications opt in to V5 when they want to read/write unions. Am I understanding this right?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on a change in pull request #7664: ARROW-9265: [C++] Allow writing and reading V4-compliant IPC data

Posted by GitBox <gi...@apache.org>.
lidavidm commented on a change in pull request #7664:
URL: https://github.com/apache/arrow/pull/7664#discussion_r452537636



##########
File path: python/pyarrow/ipc.pxi
##########
@@ -18,6 +18,32 @@
 import warnings
 
 
+cpdef enum MetadataVersion:
+    V1 = <char> CMetadataVersion_V1
+    V2 = <char> CMetadataVersion_V2
+    V3 = <char> CMetadataVersion_V3
+    V4 = <char> CMetadataVersion_V4
+    V5 = <char> CMetadataVersion_V5
+
+
+cdef object _wrap_metadata_version(CMetadataVersion version):
+    return MetadataVersion(<char> version)

Review comment:
       The fact that integration still passes in the corresponding Java PR #7685 (before rebasing on top of this) would imply that the version doesn't get checked, no?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on pull request #7664: ARROW-9265: [C++] Allow writing and reading V4-compliant IPC data

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7664:
URL: https://github.com/apache/arrow/pull/7664#issuecomment-656341760


   I'm quickly taking care of these small things so this can be merged


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on a change in pull request #7664: ARROW-9265: [C++] Allow writing and reading V4-compliant IPC data

Posted by GitBox <gi...@apache.org>.
lidavidm commented on a change in pull request #7664:
URL: https://github.com/apache/arrow/pull/7664#discussion_r452537636



##########
File path: python/pyarrow/ipc.pxi
##########
@@ -18,6 +18,32 @@
 import warnings
 
 
+cpdef enum MetadataVersion:
+    V1 = <char> CMetadataVersion_V1
+    V2 = <char> CMetadataVersion_V2
+    V3 = <char> CMetadataVersion_V3
+    V4 = <char> CMetadataVersion_V4
+    V5 = <char> CMetadataVersion_V5
+
+
+cdef object _wrap_metadata_version(CMetadataVersion version):
+    return MetadataVersion(<char> version)

Review comment:
       The fact that integration still passes in the corresponding Java PR #7685 would imply that the version doesn't get checked, no?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm closed pull request #7664: ARROW-9265: [C++] Allow writing and reading V4-compliant IPC data

Posted by GitBox <gi...@apache.org>.
wesm closed pull request #7664:
URL: https://github.com/apache/arrow/pull/7664


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on pull request #7664: ARROW-9265: [C++] Allow writing and reading V4-compliant IPC data

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7664:
URL: https://github.com/apache/arrow/pull/7664#issuecomment-655536712


   I just sent an e-mail to the ML. We don't want to continue producing V4 metadata unless we need to for forward compatibility reasons. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #7664: ARROW-9265: [C++] Allow writing and reading V4-compliant IPC data

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7664:
URL: https://github.com/apache/arrow/pull/7664#issuecomment-656033704


   Rebased.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #7664: ARROW-9265: [C++] Allow writing and reading V4-compliant IPC data

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7664:
URL: https://github.com/apache/arrow/pull/7664#issuecomment-655533965


   I see, I will update the PR then.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #7664: ARROW-9265: [C++] Allow writing and reading V4-compliant IPC data

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7664:
URL: https://github.com/apache/arrow/pull/7664#issuecomment-655061788


   https://issues.apache.org/jira/browse/ARROW-9265


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org