You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/05/20 10:09:26 UTC

[GitHub] [arrow] jorisvandenbossche opened a new pull request #7233: ARROW-8860: [C++] Fix IPC/Feather decompression for nested types (with child_data)

jorisvandenbossche opened a new pull request #7233:
URL: https://github.com/apache/arrow/pull/7233


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7233: ARROW-8860: [C++] Fix IPC/Feather decompression for nested types (with child_data)

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #7233:
URL: https://github.com/apache/arrow/pull/7233#discussion_r428283327



##########
File path: cpp/src/arrow/ipc/reader.cc
##########
@@ -347,45 +347,55 @@ Status DecompressBuffers(Compression::type compression, const IpcReadOptions& op
   std::unique_ptr<util::Codec> codec;
   ARROW_ASSIGN_OR_RAISE(codec, util::Codec::Create(compression));
 
-  auto DecompressOne = [&](int i) {
-    ArrayData* arr = (*fields)[i].get();
-    for (size_t i = 0; i < arr->buffers.size(); ++i) {
-      if (arr->buffers[i] == nullptr) {
-        continue;
-      }
-      if (arr->buffers[i]->size() == 0) {
-        continue;
-      }
-      if (arr->buffers[i]->size() < 8) {
-        return Status::Invalid(
-            "Likely corrupted message, compressed buffers "
-            "are larger than 8 bytes by construction");
-      }
-      const uint8_t* data = arr->buffers[i]->data();
-      int64_t compressed_size = arr->buffers[i]->size() - sizeof(int64_t);
-      int64_t uncompressed_size =
-          BitUtil::FromLittleEndian(util::SafeLoadAs<int64_t>(data));
+  auto DecompressOneBuffer = [&](std::shared_ptr<Buffer>* buf) {
+    if ((*buf) == nullptr) {
+      return Status::OK();
+    }
+    if ((*buf)->size() == 0) {
+      return Status::OK();
+    }
+    if ((*buf)->size() < 8) {
+      return Status::Invalid(
+          "Likely corrupted message, compressed buffers "
+          "are larger than 8 bytes by construction");
+    }
+    const uint8_t* data = (*buf)->data();
+    int64_t compressed_size = (*buf)->size() - sizeof(int64_t);
+    int64_t uncompressed_size =
+        BitUtil::FromLittleEndian(util::SafeLoadAs<int64_t>(data));
 
-      ARROW_ASSIGN_OR_RAISE(auto uncompressed,
-                            AllocateBuffer(uncompressed_size, options.memory_pool));
+    ARROW_ASSIGN_OR_RAISE(auto uncompressed,
+                          AllocateBuffer(uncompressed_size, options.memory_pool));
 
-      int64_t actual_decompressed;
-      ARROW_ASSIGN_OR_RAISE(
-          actual_decompressed,
-          codec->Decompress(compressed_size, data + sizeof(int64_t), uncompressed_size,
-                            uncompressed->mutable_data()));
-      if (actual_decompressed != uncompressed_size) {
-        return Status::Invalid("Failed to fully decompress buffer, expected ",
-                               uncompressed_size, " bytes but decompressed ",
-                               actual_decompressed);
+    int64_t actual_decompressed;
+    ARROW_ASSIGN_OR_RAISE(
+        actual_decompressed,
+        codec->Decompress(compressed_size, data + sizeof(int64_t), uncompressed_size,
+                          uncompressed->mutable_data()));
+    if (actual_decompressed != uncompressed_size) {
+      return Status::Invalid("Failed to fully decompress buffer, expected ",
+                             uncompressed_size, " bytes but decompressed ",
+                             actual_decompressed);
+    }
+    *buf = std::move(uncompressed);
+    return Status::OK();
+  };
+
+  auto DecompressOneField = [&](int i) {
+    ArrayData* arr = (*fields)[i].get();
+    for (size_t i = 0; i < arr->buffers.size(); ++i) {
+      ARROW_RETURN_NOT_OK(DecompressOneBuffer(&arr->buffers[i]));
+    }
+    for (size_t i = 0; i < arr->child_data.size(); ++i) {
+      for (size_t j = 0; j < arr->child_data[i]->buffers.size(); ++j) {
+        ARROW_RETURN_NOT_OK(DecompressOneBuffer(&arr->child_data[i]->buffers[j]));

Review comment:
       TODO: on second thought, for multiple levels of nesting, this will need to be a recursive loop for the child_data's child_data




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] BryanCutler commented on pull request #7233: ARROW-8860: [C++] Fix IPC/Feather decompression for nested types (with child_data)

Posted by GitBox <gi...@apache.org>.
BryanCutler commented on pull request #7233:
URL: https://github.com/apache/arrow/pull/7233#issuecomment-643463846


   > I think we can apply this patch on conda-forge so at least those builds are OK
   
   @wesm I hit this issue saving a DataFrame to Feather. Was it possible to patch the conda-forge release or do I need to wait for 1.0.0?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on pull request #7233: ARROW-8860: [C++] Fix IPC/Feather decompression for nested types (with child_data)

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7233:
URL: https://github.com/apache/arrow/pull/7233#issuecomment-632094064


   We should discuss strategies for handling the Python wheels


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] JakeWilliams22 commented on pull request #7233: ARROW-8860: [C++] Fix IPC/Feather decompression for nested types (with child_data)

Posted by GitBox <gi...@apache.org>.
JakeWilliams22 commented on pull request #7233:
URL: https://github.com/apache/arrow/pull/7233#issuecomment-634090344


   Just FYI I experience this same issue using the Python API (not sure if that is based on the code mentioned here). 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on pull request #7233: ARROW-8860: [C++] Fix IPC/Feather decompression for nested types (with child_data)

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7233:
URL: https://github.com/apache/arrow/pull/7233#issuecomment-634100588


   Yes same code


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #7233: ARROW-8860: [C++] Fix IPC/Feather decompression for nested types (with child_data)

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #7233:
URL: https://github.com/apache/arrow/pull/7233#issuecomment-642442732


   From my / Ben's part, I think so yes, but it could use another review.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] BryanCutler commented on pull request #7233: ARROW-8860: [C++] Fix IPC/Feather decompression for nested types (with child_data)

Posted by GitBox <gi...@apache.org>.
BryanCutler commented on pull request #7233:
URL: https://github.com/apache/arrow/pull/7233#issuecomment-643485591


   I thought that, no problem to wait for 1.0.0.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #7233: ARROW-8860: [C++] Fix IPC/Feather decompression for nested types (with child_data)

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7233:
URL: https://github.com/apache/arrow/pull/7233#discussion_r428583079



##########
File path: python/pyarrow/tests/test_feather.py
##########
@@ -753,3 +760,18 @@ def test_read_column_duplicated_in_file(tempdir):
     # selection with column names errors
     with pytest.raises(ValueError):
         read_table(path, columns=['a', 'b'])
+
+
+def test_nested_types(compression):
+    # https://issues.apache.org/jira/browse/ARROW-8860
+    table = pa.table({'col': pa.StructArray.from_arrays(
+        [[0, 1, 2], [1, 2, 3]], names=["f1", "f2"])})
+    _check_arrow_roundtrip(table, compression=compression)
+
+    table = pa.table({'col': pa.array([[1, 2], [3, 4]])})
+    _check_arrow_roundtrip(table, compression=compression)
+
+
+@h.given(past.all_tables)
+def test_roundtrip(table):
+    _check_arrow_roundtrip(table)

Review comment:
       Should this test different compressions as well?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7233: ARROW-8860: [C++] Fix IPC/Feather decompression for nested types (with child_data)

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #7233:
URL: https://github.com/apache/arrow/pull/7233#discussion_r427895147



##########
File path: cpp/src/arrow/ipc/reader.cc
##########
@@ -381,6 +381,42 @@ Status DecompressBuffers(Compression::type compression, const IpcReadOptions& op
       }
       arr->buffers[i] = std::move(uncompressed);
     }
+    for (size_t j = 0; j < arr->child_data.size(); ++j) {
+      for (size_t i = 0; i < arr->child_data[j]->buffers.size(); ++i) {
+        if (arr->child_data[j]->buffers[i] == nullptr) {
+          continue;
+        }
+        if (arr->child_data[j]->buffers[i]->size() == 0) {
+          continue;
+        }
+        if (arr->child_data[j]->buffers[i]->size() < 8) {
+          return Status::Invalid(
+              "Likely corrupted message, compressed buffers "
+              "are larger than 8 bytes by construction");
+        }
+        const uint8_t* data = arr->child_data[j]->buffers[i]->data();
+        int64_t compressed_size =
+            arr->child_data[j]->buffers[i]->size() - sizeof(int64_t);
+        int64_t uncompressed_size =
+            BitUtil::FromLittleEndian(util::SafeLoadAs<int64_t>(data));
+
+        ARROW_ASSIGN_OR_RAISE(auto uncompressed,
+                              AllocateBuffer(uncompressed_size, options.memory_pool));
+
+        int64_t actual_decompressed;
+        ARROW_ASSIGN_OR_RAISE(
+            actual_decompressed,
+            codec->Decompress(compressed_size, data + sizeof(int64_t), uncompressed_size,
+                              uncompressed->mutable_data()));
+        if (actual_decompressed != uncompressed_size) {
+          return Status::Invalid("Failed to fully decompress buffer, expected ",
+                                 uncompressed_size, " bytes but decompressed ",
+                                 actual_decompressed);
+        }
+        arr->child_data[j]->buffers[i] = std::move(uncompressed);

Review comment:
       TODO: I need to rewrite this to factor out the common logic in a separate function (now this loop just duplicates the loop above to decompress `arr->buffers`)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on pull request #7233: ARROW-8860: [C++] Fix IPC/Feather decompression for nested types (with child_data)

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7233:
URL: https://github.com/apache/arrow/pull/7233#issuecomment-632085945


   Shoot. This issue is a bummer 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on pull request #7233: ARROW-8860: [C++] Fix IPC/Feather decompression for nested types (with child_data)

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7233:
URL: https://github.com/apache/arrow/pull/7233#issuecomment-632093727


   I think we can apply this patch on conda-forge so at least those builds are OK


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #7233: ARROW-8860: [C++] Fix IPC/Feather decompression for nested types (with child_data)

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7233:
URL: https://github.com/apache/arrow/pull/7233#issuecomment-632100671


   We should certainly not apply different patches to same-named versions.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #7233: ARROW-8860: [C++] Fix IPC/Feather decompression for nested types (with child_data)

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7233:
URL: https://github.com/apache/arrow/pull/7233#issuecomment-631389422


   https://issues.apache.org/jira/browse/ARROW-8860


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] BryanCutler edited a comment on pull request #7233: ARROW-8860: [C++] Fix IPC/Feather decompression for nested types (with child_data)

Posted by GitBox <gi...@apache.org>.
BryanCutler edited a comment on pull request #7233:
URL: https://github.com/apache/arrow/pull/7233#issuecomment-643485591


   I thought that, no problem to wait for 1.0.0. Thanks!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on pull request #7233: ARROW-8860: [C++] Fix IPC/Feather decompression for nested types (with child_data)

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on pull request #7233:
URL: https://github.com/apache/arrow/pull/7233#issuecomment-642137750


   Is this done?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz commented on a change in pull request #7233: ARROW-8860: [C++] Fix IPC/Feather decompression for nested types (with child_data)

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #7233:
URL: https://github.com/apache/arrow/pull/7233#discussion_r434704664



##########
File path: cpp/src/arrow/ipc/reader.cc
##########
@@ -344,50 +344,73 @@ class ArrayLoader {
   ArrayData* out_;
 };
 
+Result<std::shared_ptr<Buffer>> DecompressBuffer(const std::shared_ptr<Buffer>& buf,
+                                                 const IpcReadOptions& options,
+                                                 util::Codec* codec) {
+  if (buf == nullptr || buf->size() == 0) {
+    return buf;
+  }
+
+  if (buf->size() < 8) {
+    return Status::Invalid(
+        "Likely corrupted message, compressed buffers "
+        "are larger than 8 bytes by construction");
+  }
+
+  const uint8_t* data = buf->data();
+  int64_t compressed_size = buf->size() - sizeof(int64_t);
+  int64_t uncompressed_size = BitUtil::FromLittleEndian(util::SafeLoadAs<int64_t>(data));
+
+  ARROW_ASSIGN_OR_RAISE(auto uncompressed,
+                        AllocateBuffer(uncompressed_size, options.memory_pool));
+
+  ARROW_ASSIGN_OR_RAISE(
+      int64_t actual_decompressed,
+      codec->Decompress(compressed_size, data + sizeof(int64_t), uncompressed_size,
+                        uncompressed->mutable_data()));
+  if (actual_decompressed != uncompressed_size) {
+    return Status::Invalid("Failed to fully decompress buffer, expected ",
+                           uncompressed_size, " bytes but decompressed ",
+                           actual_decompressed);
+  }
+
+  return std::move(uncompressed);
+}
+
 Status DecompressBuffers(Compression::type compression, const IpcReadOptions& options,
                          std::vector<std::shared_ptr<ArrayData>>* fields) {
-  std::unique_ptr<util::Codec> codec;
-  ARROW_ASSIGN_OR_RAISE(codec, util::Codec::Create(compression));
+  struct BufferAccumulator {
+    using BufferPtrVector = std::vector<std::shared_ptr<Buffer>*>;
 
-  auto DecompressOne = [&](int i) {
-    ArrayData* arr = (*fields)[i].get();
-    for (size_t i = 0; i < arr->buffers.size(); ++i) {
-      if (arr->buffers[i] == nullptr) {
-        continue;
-      }
-      if (arr->buffers[i]->size() == 0) {
-        continue;
-      }
-      if (arr->buffers[i]->size() < 8) {
-        return Status::Invalid(
-            "Likely corrupted message, compressed buffers "
-            "are larger than 8 bytes by construction");
+    void AppendFrom(const std::vector<std::shared_ptr<ArrayData>>& fields) {
+      for (const auto& field : fields) {
+        for (auto& buffer : field->buffers) {
+          buffers_.push_back(&buffer);
+        }
+        AppendFrom(field->child_data);
       }
-      const uint8_t* data = arr->buffers[i]->data();
-      int64_t compressed_size = arr->buffers[i]->size() - sizeof(int64_t);
-      int64_t uncompressed_size =
-          BitUtil::FromLittleEndian(util::SafeLoadAs<int64_t>(data));
-
-      ARROW_ASSIGN_OR_RAISE(auto uncompressed,
-                            AllocateBuffer(uncompressed_size, options.memory_pool));
+    }
 
-      int64_t actual_decompressed;
-      ARROW_ASSIGN_OR_RAISE(
-          actual_decompressed,
-          codec->Decompress(compressed_size, data + sizeof(int64_t), uncompressed_size,
-                            uncompressed->mutable_data()));
-      if (actual_decompressed != uncompressed_size) {
-        return Status::Invalid("Failed to fully decompress buffer, expected ",
-                               uncompressed_size, " bytes but decompressed ",
-                               actual_decompressed);
-      }
-      arr->buffers[i] = std::move(uncompressed);
+    BufferPtrVector Get(const std::vector<std::shared_ptr<ArrayData>>& fields) && {
+      AppendFrom(fields);
+      return std::move(buffers_);
     }
-    return Status::OK();
+
+    BufferPtrVector buffers_;
   };
 
+  // flatten all buffers
+  auto buffers = BufferAccumulator{}.Get(*fields);

Review comment:
       Accumulating all buffers to decompress should provide better utilization of a thread pool. Previously, in the case of a single nested field all but one thread would be idle regardless of how many buffers needed to be decompressed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou closed pull request #7233: ARROW-8860: [C++] Fix IPC/Feather decompression for nested types (with child_data)

Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #7233:
URL: https://github.com/apache/arrow/pull/7233


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on pull request #7233: ARROW-8860: [C++] Fix IPC/Feather decompression for nested types (with child_data)

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7233:
URL: https://github.com/apache/arrow/pull/7233#issuecomment-643471463


   It sounds like people aren't super keen on patching conda-forge, I'd suggest waiting for 1.0.0 in the meantime or using `compression='uncompressed'` as a workaround for nested data


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #7233: ARROW-8860: [C++] Fix IPC/Feather decompression for nested types (with child_data)

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #7233:
URL: https://github.com/apache/arrow/pull/7233#issuecomment-632089275


   Yeah, if releasing was not so much work, I would directly say to do a 0.17.2 ..


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org