You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/11/04 09:53:39 UTC

[GitHub] [arrow] chrisavl opened a new pull request #8589: ARROW-10493: [C++][Parquet] Fix offset lost in MaybeReplaceValidity

chrisavl opened a new pull request #8589:
URL: https://github.com/apache/arrow/pull/8589


   This PR fixes the offset being lost when the validity bitmap is updated.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield closed pull request #8589: ARROW-10493: [C++][Parquet] Fix offset lost in MaybeReplaceValidity

Posted by GitBox <gi...@apache.org>.
emkornfield closed pull request #8589:
URL: https://github.com/apache/arrow/pull/8589


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on pull request #8589: ARROW-10493: [C++][Parquet] Fix offset lost in MaybeReplaceValidity

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #8589:
URL: https://github.com/apache/arrow/pull/8589#issuecomment-722906035


   just realized there are some constness issues I need to address.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on pull request #8589: ARROW-10493: [C++][Parquet] Fix offset lost in MaybeReplaceValidity

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #8589:
URL: https://github.com/apache/arrow/pull/8589#issuecomment-722909212


   NM, I'm going to take a different approach, because the one here doesn't account for dictionaries.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8589: ARROW-10493: [C++][Parquet] Fix offset lost in MaybeReplaceValidity

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8589:
URL: https://github.com/apache/arrow/pull/8589#issuecomment-721637384


   https://issues.apache.org/jira/browse/ARROW-10493


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on pull request #8589: ARROW-10493: [C++][Parquet] Fix offset lost in MaybeReplaceValidity

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #8589:
URL: https://github.com/apache/arrow/pull/8589#issuecomment-724444564


   I will move the slice method into parquet, expect some pushes to the branch over the next couple of hours.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz commented on a change in pull request #8589: ARROW-10493: [C++][Parquet] Fix offset lost in MaybeReplaceValidity

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #8589:
URL: https://github.com/apache/arrow/pull/8589#discussion_r519899025



##########
File path: cpp/src/parquet/column_writer.cc
##########
@@ -1222,12 +1223,17 @@ class TypedColumnWriterImpl : public ColumnWriterImpl, public TypedColumnWriter<
   }
 
   std::shared_ptr<Array> MaybeReplaceValidity(std::shared_ptr<Array> array,
-                                              int64_t new_null_count) {
+                                              int64_t new_null_count,
+                                              arrow::MemoryPool* memory_pool) {
     if (bits_buffer_ == nullptr) {
       return array;
     }
     std::vector<std::shared_ptr<Buffer>> buffers = array->data()->buffers;
     buffers[0] = bits_buffer_;
+    DCHECK_GT(buffers.size(), 1);
+    PARQUET_ASSIGN_OR_THROW(
+        buffers[1],
+        checked_pointer_cast<arrow::FlatArray>(array)->SlicedValues(memory_pool));
     // Should be a leaf array.
     DCHECK_EQ(array->num_fields(), 0);

Review comment:
       Stylistically, the above hunk should follow this assertion of leafiness since that hunk assumes leafiness by casting to FlatArray

##########
File path: cpp/src/arrow/array/array_base.h
##########
@@ -207,6 +207,14 @@ static inline std::ostream& operator<<(std::ostream& os, const Array& x) {
 
 /// Base class for non-nested arrays
 class ARROW_EXPORT FlatArray : public Array {
+ public:
+  /// Returns the slice buffer for the second buffer in the
+  /// array (values buffer for primitives, offsets buffers
+  /// for variable length binary types.
+  ///
+  /// \note API EXPERIMENTAL
+  virtual Result<std::shared_ptr<Buffer>> SlicedValues(MemoryPool* pool) const = 0;

Review comment:
       Since this is only used in one place and uses "Slice" with very different semantics from other slices, I think it'd be better to
   1. move it into `column_writer.cc` or
   2. provide a more encapsulated and easily understood function, for example `Result<std::shared_ptr<ArrayData>> arrow::internal::ReplaceNullBitmap(const ArrayData& data, std::shared_ptr<Buffer> null_bitmap, MemoryPool* pool)`

##########
File path: cpp/src/parquet/arrow/arrow_reader_writer_test.cc
##########
@@ -3199,6 +3200,36 @@ TEST(TestArrowReaderAdHoc, HandleDictPageOffsetZero) {
   TryReadDataFile(test::get_data_file("dict-page-offset-zero.parquet"));
 }
 
+TEST(TestArrowReaderAdHoc, WriteBatchedNestedNullableStringColumn) {
+  // ARROW-10493
+  auto type =
+      ::arrow::struct_({::arrow::field("inner", ::arrow::utf8(), /*nullable=*/true)});
+  auto outer_array = ::arrow::ArrayFromJSON(type,
+                                            R"([{"inner": "a"},
+                                                {"inner": null},
+                                                {"inner": "b"},
+                                                {"inner": "c"},
+                                                {"inner": null},
+                                                {"inner": null},
+                                                {"inner": "d"},
+                                                null])");
+  auto inner_array =
+      std::static_pointer_cast<::arrow::StructArray>(outer_array)->field(0);
+  ASSERT_OK_AND_ASSIGN(inner_array->data()->buffers[0],
+                       ::arrow::internal::BytesToBits({1, 0, 1, 1, 0, 0, 1, 0}));
+  ASSERT_OK_AND_ASSIGN(outer_array->data()->buffers[0],
+                       ::arrow::internal::BytesToBits({1, 1, 1, 1, 1, 1, 1, 0}));

Review comment:
       I think you mean to assert equality here, but you're overwriting the null bitmaps instead.
   ```suggestion
     ASSERT_OK_AND_ASSIGN(auto expected_inner_validity,
                          ::arrow::internal::BytesToBits({1, 0, 1, 1, 0, 0, 1, 0}));
     ASSERT_OK_AND_ASSIGN(auto expected_outer_validity,
                          ::arrow::internal::BytesToBits({1, 1, 1, 1, 1, 1, 1, 0}));
     AssertBuffersEqual(*inner_array->data()->buffers[0], *expected_inner_validity);
     AssertBuffersEqual(*outer_array->data()->buffers[0], *expected_outer_validity);
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] chrisavl commented on a change in pull request #8589: ARROW-10493: [C++][Parquet] Fix offset lost in MaybeReplaceValidity

Posted by GitBox <gi...@apache.org>.
chrisavl commented on a change in pull request #8589:
URL: https://github.com/apache/arrow/pull/8589#discussion_r517942949



##########
File path: cpp/src/parquet/column_writer.cc
##########
@@ -1230,8 +1230,9 @@ class TypedColumnWriterImpl : public ColumnWriterImpl, public TypedColumnWriter<
     buffers[0] = bits_buffer_;
     // Should be a leaf array.
     DCHECK_EQ(array->num_fields(), 0);
-    return arrow::MakeArray(std::make_shared<ArrayData>(
-        array->type(), array->length(), std::move(buffers), new_null_count));
+    return arrow::MakeArray(std::make_shared<ArrayData>(array->type(), array->length(),
+                                                        std::move(buffers),

Review comment:
       Ah, you are right the offset also applies to the null buffer. But I am not sure how to best slice the other buffers, since ArrayData::Slice won't help here and using SliceBuffer directly would require figuring out the physical offset from the logical offset in ArrayData. Any suggestions or pointers here would be appreciated!
   
   Fow now, I pushed a not so elegant workaround that allocates a bitmap for the whole column instead of for only a batch_size. That way the offset for the bitmap and the other buffers should be the same, I think.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on pull request #8589: ARROW-10493: [C++][Parquet] Fix offset lost in MaybeReplaceValidity

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #8589:
URL: https://github.com/apache/arrow/pull/8589#issuecomment-724611663


   @chrisavl I'm done pushing.  I expanded the test case to try to cover all branches in the new slice method and reverted my previous changes.  Let me know if the new code looks reasonable.  @bkietz could you take a look also?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] chrisavl commented on pull request #8589: ARROW-10493: [C++][Parquet] Fix offset lost in MaybeReplaceValidity

Posted by GitBox <gi...@apache.org>.
chrisavl commented on pull request #8589:
URL: https://github.com/apache/arrow/pull/8589#issuecomment-725191152


   @emkornfield thanks, LGTM!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #8589: ARROW-10493: [C++][Parquet] Fix offset lost in MaybeReplaceValidity

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #8589:
URL: https://github.com/apache/arrow/pull/8589#discussion_r517512787



##########
File path: cpp/src/parquet/column_writer.cc
##########
@@ -1230,8 +1230,9 @@ class TypedColumnWriterImpl : public ColumnWriterImpl, public TypedColumnWriter<
     buffers[0] = bits_buffer_;
     // Should be a leaf array.
     DCHECK_EQ(array->num_fields(), 0);
-    return arrow::MakeArray(std::make_shared<ArrayData>(
-        array->type(), array->length(), std::move(buffers), new_null_count));
+    return arrow::MakeArray(std::make_shared<ArrayData>(array->type(), array->length(),
+                                                        std::move(buffers),

Review comment:
       Nice catch.
   
   I think there is an edge case here.  I think offset applies to the null buffer also, so this would cause an illegal access if any values are actually null.  unfortunately, I think we need to slice all the buffers according to offset.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on pull request #8589: ARROW-10493: [C++][Parquet] Fix offset lost in MaybeReplaceValidity

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #8589:
URL: https://github.com/apache/arrow/pull/8589#issuecomment-722870334


   @chrisavl I added a new interface that you should be able to use for replacing the second buffer for arrow writing.  I'm going to push one more commit which will add tests.  Could you update your code to use it.
   
   @bkietz could you review/check the interface code?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] chrisavl commented on a change in pull request #8589: ARROW-10493: [C++][Parquet] Fix offset lost in MaybeReplaceValidity

Posted by GitBox <gi...@apache.org>.
chrisavl commented on a change in pull request #8589:
URL: https://github.com/apache/arrow/pull/8589#discussion_r520225236



##########
File path: cpp/src/parquet/arrow/arrow_reader_writer_test.cc
##########
@@ -3199,6 +3200,36 @@ TEST(TestArrowReaderAdHoc, HandleDictPageOffsetZero) {
   TryReadDataFile(test::get_data_file("dict-page-offset-zero.parquet"));
 }
 
+TEST(TestArrowReaderAdHoc, WriteBatchedNestedNullableStringColumn) {
+  // ARROW-10493
+  auto type =
+      ::arrow::struct_({::arrow::field("inner", ::arrow::utf8(), /*nullable=*/true)});
+  auto outer_array = ::arrow::ArrayFromJSON(type,
+                                            R"([{"inner": "a"},
+                                                {"inner": null},
+                                                {"inner": "b"},
+                                                {"inner": "c"},
+                                                {"inner": null},
+                                                {"inner": null},
+                                                {"inner": "d"},
+                                                null])");
+  auto inner_array =
+      std::static_pointer_cast<::arrow::StructArray>(outer_array)->field(0);
+  ASSERT_OK_AND_ASSIGN(inner_array->data()->buffers[0],
+                       ::arrow::internal::BytesToBits({1, 0, 1, 1, 0, 0, 1, 0}));
+  ASSERT_OK_AND_ASSIGN(outer_array->data()->buffers[0],
+                       ::arrow::internal::BytesToBits({1, 1, 1, 1, 1, 1, 1, 0}));

Review comment:
       Thanks, I got confused by the validity of the inner array being 1 when the outer array validity is 0 for the last element. But this is not actually relevant to this issue, so I will just remove this bit of code.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #8589: ARROW-10493: [C++][Parquet] Fix offset lost in MaybeReplaceValidity

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #8589:
URL: https://github.com/apache/arrow/pull/8589#discussion_r517514089



##########
File path: cpp/src/parquet/arrow/arrow_reader_writer_test.cc
##########
@@ -3199,6 +3199,27 @@ TEST(TestArrowReaderAdHoc, HandleDictPageOffsetZero) {
   TryReadDataFile(test::get_data_file("dict-page-offset-zero.parquet"));
 }
 
+TEST(TestArrowReaderAdHoc, WriteBatchedNestedNullableStringColumn) {
+  // ARROW-10493
+  auto string_array =
+      ::arrow::ArrayFromJSON(::arrow::utf8(), "[\"a\", \"b\", \"c\", \"d\"]");

Review comment:
       array from json supports structs as {"column_name": value} entries, we could clean this up a little but creating the type first and using ArrayFromJson by itself without use StructArray::Make




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] chrisavl commented on a change in pull request #8589: ARROW-10493: [C++][Parquet] Fix offset lost in MaybeReplaceValidity

Posted by GitBox <gi...@apache.org>.
chrisavl commented on a change in pull request #8589:
URL: https://github.com/apache/arrow/pull/8589#discussion_r517943148



##########
File path: cpp/src/parquet/arrow/arrow_reader_writer_test.cc
##########
@@ -3199,6 +3199,27 @@ TEST(TestArrowReaderAdHoc, HandleDictPageOffsetZero) {
   TryReadDataFile(test::get_data_file("dict-page-offset-zero.parquet"));
 }
 
+TEST(TestArrowReaderAdHoc, WriteBatchedNestedNullableStringColumn) {
+  // ARROW-10493
+  auto string_array =
+      ::arrow::ArrayFromJSON(::arrow::utf8(), "[\"a\", \"b\", \"c\", \"d\"]");

Review comment:
       Nice, thanks!




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on pull request #8589: ARROW-10493: [C++][Parquet] Fix offset lost in MaybeReplaceValidity

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #8589:
URL: https://github.com/apache/arrow/pull/8589#issuecomment-722939001


   Okay, I'm done pushing.  @chrisavl I think the new API should make the change for parquet_writer smaller (and probably more efficient).  See unit tests added for intended usage.  Note you probably need to check that array length > for assignment (we might need to do that for zero size also to test NA arrays).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield removed a comment on pull request #8589: ARROW-10493: [C++][Parquet] Fix offset lost in MaybeReplaceValidity

Posted by GitBox <gi...@apache.org>.
emkornfield removed a comment on pull request #8589:
URL: https://github.com/apache/arrow/pull/8589#issuecomment-722909212


   NM, I'm going to take a different approach, because the one here doesn't account for dictionaries.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #8589: ARROW-10493: [C++][Parquet] Fix offset lost in MaybeReplaceValidity

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #8589:
URL: https://github.com/apache/arrow/pull/8589#discussion_r518528887



##########
File path: cpp/src/parquet/column_writer.cc
##########
@@ -1230,8 +1230,9 @@ class TypedColumnWriterImpl : public ColumnWriterImpl, public TypedColumnWriter<
     buffers[0] = bits_buffer_;
     // Should be a leaf array.
     DCHECK_EQ(array->num_fields(), 0);
-    return arrow::MakeArray(std::make_shared<ArrayData>(
-        array->type(), array->length(), std::move(buffers), new_null_count));
+    return arrow::MakeArray(std::make_shared<ArrayData>(array->type(), array->length(),
+                                                        std::move(buffers),

Review comment:
       yeah, I'm not sure I like that change that way, I'll try to make a new commit on your branch with something that I think will be cleaner (adding an API to Array to get the right sliced buffer.)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org