Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/20 20:09:04 UTC

[GitHub] [arrow] wjones127 opened a new pull request #12216: ARROW-14047: [C++] Parquet Arrow read table can produce invalid array [WIP]

wjones127 opened a new pull request #12216:
URL: https://github.com/apache/arrow/pull/12216


   This is just some testing I've done so far. It requires the parquet file attached in the Jira.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #12216: ARROW-14047: [C++] Parquet Arrow read table can produce invalid array [WIP]

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #12216:
URL: https://github.com/apache/arrow/pull/12216#discussion_r789152407



##########
File path: cpp/src/parquet/arrow/arrow_reader_writer_test.cc
##########
@@ -3719,6 +3719,77 @@ TEST(TestArrowReaderAdHoc, WriteBatchedNestedNullableStringColumn) {
   ::arrow::AssertTablesEqual(*expected, *actual, /*same_chunk_layout=*/false);
 }
 
+TEST(TestArrowReaderAdHoc, RepeatReadNullableStructColumnFile) {
+  // ARROW-14047
+  auto pool = default_memory_pool();
+
+  // Special parquet file; see attachment in Jira
+  auto file_path = test::get_data_file("writeReadRowGroup.parquet");
+  PARQUET_ASSIGN_OR_THROW(auto infile, ::arrow::io::ReadableFile::Open(file_path, pool));
+
+  std::unique_ptr<parquet::arrow::FileReader> arrow_reader;
+  ASSERT_OK(OpenFile(infile, pool, &arrow_reader));
+
+  std::shared_ptr<::arrow::Table> first_read;
+  ASSERT_OK(arrow_reader->ReadTable(&first_read));
+
+  ASSERT_OK(first_read->column(1)->chunk(0)->ValidateFull()); // So it's in theory valid
+
+  std::shared_ptr<::arrow::Table> second_read;
+  ASSERT_OK(arrow_reader->ReadTable(&second_read));
+  ASSERT_OK(arrow_reader->ReadTable(&second_read)); // (Actually third read)
+
+  // 'first_read->column(1)->chunk(0)->ValidateFull()' failed with Invalid: List child array invalid: Invalid: Struct child array #0 invalid: Invalid: null_count value (854) doesn't match actual number of nulls in array (861)
+  // ASSERT_OK(first_read->column(1)->chunk(0)->ValidateFull()); 
+
+  ::arrow::AssertTablesEqual(*first_read.get(), *second_read.get());
+
+  // To eliminate possibility of parquet writer oddities, rewrite table to new parquet:
+  using ::arrow::io::BufferOutputStream;
+  ASSERT_OK_AND_ASSIGN(auto outs, BufferOutputStream::Create(1 << 10, pool));
+  auto props = default_writer_properties();
+  std::unique_ptr<arrow::FileWriter> writer;
+  ASSERT_OK(
+      arrow::FileWriter::Open(*first_read->schema().get(), pool, outs, props, &writer));
+  ASSERT_OK(writer->WriteTable(*first_read.get(), std::numeric_limits<int64_t>::max()));
+  ASSERT_OK(writer->Close());
+  ASSERT_OK_AND_ASSIGN(auto buffer, outs->Finish());
+
+  // Read from table
+  std::unique_ptr<parquet::arrow::FileReader> arrow_reader2;
+  ASSERT_OK(parquet::arrow::OpenFile(std::make_shared<BufferReader>(buffer), pool, &arrow_reader2));
+
+  ASSERT_OK(arrow_reader2->ReadTable(&first_read));
+  ASSERT_OK(first_read->column(1)->chunk(0)->ValidateFull()); // So it's in theory valid
+
+  ASSERT_OK(arrow_reader2->ReadTable(&second_read));
+  // 'first_read->column(1)->chunk(0)->ValidateFull()' failed with Invalid: List child array invalid: 
+  // Invalid: Struct child array #0 invalid: Invalid: null_count value (854) doesn't match actual number 
+  // of nulls in array (861)
+  // ASSERT_OK(second_read->column(1)->chunk(0)->ValidateFull()); 
+
+  ::arrow::AssertTablesEqual(*first_read.get(), *second_read.get());

Review comment:
       So reads 0, 1, 2 and then all even reads are valid? Do they all agree with each other?




[GitHub] [arrow] emkornfield commented on pull request #12216: ARROW-14047: [C++] Parquet Arrow read table can produce invalid array [WIP]

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #12216:
URL: https://github.com/apache/arrow/pull/12216#issuecomment-1022895612


   Deleted a previous comment.  Structs that have a repeated parent but no repeated children should be handled correctly in DefLevelsToBitmap.
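
   (A minimal sketch of the level-to-bitmap idea for a struct under a repeated parent, assuming hypothetical level thresholds for a nullable list<struct<...>> column; this is not Arrow's actual `DefLevelsToBitmap` signature.)

   ```cpp
   #include <cstdint>
   #include <vector>

   // Assumed encoding: def_level >= 2 means the list entry exists, and
   // def_level >= 3 means the struct value at that entry is non-null.
   // Levels below 2 contribute no slot to the struct child at all, which
   // is why the child bitmap must be sized by emitted slots rather than
   // by the number of def levels.
   std::vector<bool> StructValidityFromDefLevels(
       const std::vector<int16_t>& def_levels) {
     std::vector<bool> validity;
     for (int16_t level : def_levels) {
       if (level >= 2) {                  // a struct slot exists
         validity.push_back(level >= 3);  // non-null iff fully defined
       }
     }
     return validity;
   }
   ```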


[GitHub] [arrow] wjones127 commented on pull request #12216: ARROW-14047: [C++] Parquet Arrow reader sets null values in buffer overflow

Posted by GitBox <gi...@apache.org>.
wjones127 commented on pull request #12216:
URL: https://github.com/apache/arrow/pull/12216#issuecomment-1029112711


   > In the PR description you say that validating the table catches the issue. Is it sufficient or is the custom MemoryPool actually needed?
   
   @pitrou I felt that was a very indirect way of testing the issue. It's only one possible consequence of the buffer overflow; most of the time another validity bitmap doesn't seem to be immediately after. I tried for several days to come up with another (shorter) record batch that produces this issue, but all I could find was the one in that parquet file, and then only with `.Slice(159, 160)`. This is a lot of code, but it seemed like the most direct way to test this.
   
   > I'm working on https://issues.apache.org/jira/browse/ARROW-15550 which will allow exercising this bug without additional test.
   
   Yeah I think if we implemented that then this test would be obsolete.


[GitHub] [arrow] emkornfield commented on pull request #12216: ARROW-14047: [C++] Parquet Arrow read table can produce invalid array [WIP]

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #12216:
URL: https://github.com/apache/arrow/pull/12216#issuecomment-1023749294


   > Yeah I think it's something like that. Both the capacity_ and size_ of the null_bitmap buffer are 320, which to me suggests that ZeroPadding() might be failing to provide the null termination needed to separate these buffers. Does that sound plausible?
   
   ZeroPadding should be ensuring everything is zeroed at the end.  Based on the evidence, I think this is a quirk of how memory is getting re-used by the memory allocator on each iteration.  I think there is probably a buffer overflow every time; it doesn't affect the first time through because the overflow is into blank memory, or something like that.
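
   (A minimal sketch of the size/capacity distinction in play, using the public buffer API: `ZeroPadding()` only clears bytes between `size()` and `capacity()`, so when the two are equal, as with the 320/320 buffer above, there is nothing to zero and any write past the end lands in whatever the allocator placed next.)

   ```cpp
   #include <arrow/buffer.h>
   #include <arrow/result.h>
   #include <arrow/status.h>

   arrow::Status PaddingDemo() {
     // 320 bytes requested; the pool may or may not round capacity up.
     ARROW_ASSIGN_OR_RAISE(auto buf, arrow::AllocateResizableBuffer(320));
     buf->ZeroPadding();  // zeroes [size, capacity); a no-op if they are equal
     return arrow::Status::OK();
   }
   ```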


[GitHub] [arrow] zeroshade commented on pull request #12216: ARROW-14047: [C++] Parquet Arrow reader sets null values in buffer overflow

Posted by GitBox <gi...@apache.org>.
zeroshade commented on pull request #12216:
URL: https://github.com/apache/arrow/pull/12216#issuecomment-1030755720


   > CC @zeroshade
   
   @emkornfield I'll double check that the Go implementation doesn't exhibit this bug; I don't think it can, because Go would panic if it tried to write past the end of a slice.
   
   When you reach consensus on the test, I'll try adding the same test to the Go parquet implementation to ensure it isn't affected.


[GitHub] [arrow] pitrou commented on pull request #12216: ARROW-14047: [C++] Parquet Arrow reader sets null values in buffer overflow

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #12216:
URL: https://github.com/apache/arrow/pull/12216#issuecomment-1032932498


   So, for the record, #12330 adds a debugging option for memory pools and also includes the fix for this issue (since otherwise CI would fail).


[GitHub] [arrow] emkornfield commented on a change in pull request #12216: ARROW-14047: [C++] Parquet Arrow read table can produce invalid array [WIP]

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #12216:
URL: https://github.com/apache/arrow/pull/12216#discussion_r789148107



##########
File path: cpp/src/parquet/arrow/arrow_reader_writer_test.cc
##########
@@ -3719,6 +3719,77 @@ TEST(TestArrowReaderAdHoc, WriteBatchedNestedNullableStringColumn) {
[... same test hunk as quoted in full earlier in this thread ...]
+  ::arrow::AssertTablesEqual(*first_read.get(), *second_read.get());

Review comment:
       So if I read this correctly, there is a `list<struct<x: double>>` Arrow schema we are trying to read back?




[GitHub] [arrow] github-actions[bot] commented on pull request #12216: ARROW-14047: [C++] Parquet Arrow read table can produce invalid array [WIP]

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #12216:
URL: https://github.com/apache/arrow/pull/12216#issuecomment-1017878570


   https://issues.apache.org/jira/browse/ARROW-14047


[GitHub] [arrow] wjones127 commented on pull request #12216: ARROW-14047: [C++] Parquet Arrow read table can produce invalid array [WIP]

Posted by GitBox <gi...@apache.org>.
wjones127 commented on pull request #12216:
URL: https://github.com/apache/arrow/pull/12216#issuecomment-1023419574


   > Deleted a previous comment. structs that have a repeated parent but no repeated children should be handled correctly in DefLevelsToBitmap
   
   Well my debugging so far is pointing to `DefLevelsToBitmap` as the point where the data is corrupted. 
   
   > Just curious what CPU are you running this on? Or put another way if it is intel with AVX2 can you try disabling SIMD (or vice versa)?
   
   I'm currently testing with an M1 Mac. I had `ARROW_SIMD_LEVEL=NEON` but just recompiled with that set to `NONE`, and the test failure is the same.


[GitHub] [arrow] emkornfield commented on pull request #12216: ARROW-14047: [C++] Parquet Arrow read table can produce invalid array [WIP]

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #12216:
URL: https://github.com/apache/arrow/pull/12216#issuecomment-1022896774


   Just curious, what CPU are you running this on?


[GitHub] [arrow] emkornfield edited a comment on pull request #12216: ARROW-14047: [C++] Parquet Arrow read table can produce invalid array [WIP]

Posted by GitBox <gi...@apache.org>.
emkornfield edited a comment on pull request #12216:
URL: https://github.com/apache/arrow/pull/12216#issuecomment-1023871928


   > If we instead passed the (correct?) length of 2560 bits, then the problematic write on this line will not happen:
   
   Realized I was saying the same thing.  I think this is correct, nice catch.
   
   If this is more than an off-by-one overwrite error, I think there might be deeper problems.  Just need to think about whether it is an error in the levels-to-bitmap function that is writing when it shouldn't be, or whether we are somehow getting the math wrong and passing down the wrong number for the upper bound on values.
   
   But i do think https://github.com/apache/arrow/blob/b1ae6029687aa3bb756c61a189c88125128cf026/cpp/src/parquet/level_conversion_inc.h#L336
   
   should be taking values_read_upper_bound instead of num_def_levels. 
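
   (A hedged arithmetic sketch of the mismatch being described; none of this is Arrow's actual code. If the bitmap buffer is sized from the upper bound on values but the writer is constructed with one bit per def level, the last byte it touches can lie past the allocation whenever there are more def levels than values, e.g. for empty or null lists.)

   ```cpp
   #include <cstdint>

   // Returns true when a bitmap writer told to write `num_def_levels` bits
   // stays inside a buffer allocated for `values_read_upper_bound` bits.
   bool WriterStaysInBounds(int64_t values_read_upper_bound,
                            int64_t num_def_levels) {
     int64_t allocated_bytes = (values_read_upper_bound + 7) / 8;
     int64_t last_byte_touched = (num_def_levels + 7) / 8;
     return last_byte_touched <= allocated_bytes;
   }
   ```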


[GitHub] [arrow] wjones127 commented on a change in pull request #12216: ARROW-14047: [C++] Parquet Arrow reader sets null values in buffer overflow

Posted by GitBox <gi...@apache.org>.
wjones127 commented on a change in pull request #12216:
URL: https://github.com/apache/arrow/pull/12216#discussion_r794940851



##########
File path: cpp/src/parquet/arrow/arrow_reader_writer_test.cc
##########
@@ -3675,6 +3675,116 @@ TEST(TestArrowReaderAdHoc, LARGE_MEMORY_TEST(LargeStringColumn)) {
   AssertTablesEqual(*table, *batched_table, /*same_chunk_layout=*/false);
 }
 
+class NoOverflowMemoryPool : public MemoryPool {
+ public:
+  explicit NoOverflowMemoryPool(MemoryPool* pool, int64_t padding);
+  ~NoOverflowMemoryPool() override = default;
+
+  Status Allocate(int64_t size, uint8_t** out) override;
+  Status Reallocate(int64_t old_size, int64_t new_size, uint8_t** ptr) override;
+
+  void Free(uint8_t* buffer, int64_t size) override;
+
+  int64_t bytes_allocated() const override;
+
+  int64_t max_memory() const override;
+
+  std::string backend_name() const override;
+
+ private:
+  MemoryPool* pool_;
+  int64_t padding_;
+  std::unordered_map<uint8_t**, int64_t> buffers_;
+  const uint8_t check_value_ = '\xff';
+};
+
+NoOverflowMemoryPool::NoOverflowMemoryPool(MemoryPool* pool, int64_t padding)
+    : pool_(pool), padding_(padding) {}
+
+Status NoOverflowMemoryPool::Allocate(int64_t size, uint8_t** out) {
+  Status s = pool_->Allocate(size + padding_, out);
+  if (s.ok()) {
+    buffers_[out] = size;
+
+    for (int i = size; i < size + padding_; ++i) {
+      (*out)[i] = check_value_;
+    }
+  }
+  return s;
+}
+
+Status NoOverflowMemoryPool::Reallocate(int64_t old_size, int64_t new_size,
+                                        uint8_t** ptr) {
+  Status s = pool_->Reallocate(old_size + padding_, new_size + padding_, ptr);
+  if (s.ok()) {
+    buffers_[ptr] = new_size;
+
+    for (int i = new_size; i < new_size + padding_; ++i) {
+      (*ptr)[i] = check_value_;
+    }
+  }
+  return s;
+}
+
+void NoOverflowMemoryPool::Free(uint8_t* buffer, int64_t size) {
+  // DCHECK_EQ(size, buffers_[&buffer]);

Review comment:
       I found it's common for `Free` to be called with a size different from what was allocated, so I disabled this line. Is that fine?




[GitHub] [arrow] pitrou commented on pull request #12216: ARROW-14047: [C++] Parquet Arrow reader sets null values in buffer overflow

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #12216:
URL: https://github.com/apache/arrow/pull/12216#issuecomment-1029086111


   I'm working on https://issues.apache.org/jira/browse/ARROW-15550 which will allow exercising this bug without additional test.


[GitHub] [arrow] emkornfield commented on pull request #12216: ARROW-14047: [C++] Parquet Arrow read table can produce invalid array [WIP]

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #12216:
URL: https://github.com/apache/arrow/pull/12216#issuecomment-1022869849


    > It seems that LoadBatch() on the leaf node is reading the correct data, but by the time MakeArray() is called the data has been somehow corrupted. I will debug further to try and narrow down where that is from.
   
   Note that LoadBatch and MakeArray are called recursively, from the List, then the Struct, and then finally the leaf.  LoadBatch is pretty simple (but I guess there could be some issues there).  Struct and list reconstruction are more complicated (especially lists).
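
   (A sketch of that recursion with hypothetical names: batches are loaded top-down and arrays are assembled bottom-up, so corruption observed at the top can originate in any child's reconstruction step.)

   ```cpp
   #include <cstdint>
   #include <memory>
   #include <vector>

   struct ReaderNode {
     std::vector<std::unique_ptr<ReaderNode>> children;

     void LoadBatch(int64_t num_records) {
       for (auto& child : children) child->LoadBatch(num_records);
       // ...leaves decode their levels and values here
     }

     void MakeArray() {
       for (auto& child : children) child->MakeArray();  // children build first
       // ...then this node assembles its array from the child arrays plus
       // its own validity/offsets, which is where list and struct
       // reconstruction gets complicated
     }
   };
   ```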


[GitHub] [arrow] emkornfield commented on pull request #12216: ARROW-14047: [C++] Parquet Arrow read table can produce invalid array [WIP]

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #12216:
URL: https://github.com/apache/arrow/pull/12216#issuecomment-1023871928


   If this is more than an off-by-one overwrite error, I think there might be deeper problems.  Just need to think about whether it is an error in the levels-to-bitmap function that is writing when it shouldn't be, or whether we are somehow getting the math wrong and passing down the wrong number for the upper bound on values.
   
   But i do think https://github.com/apache/arrow/blob/b1ae6029687aa3bb756c61a189c88125128cf026/cpp/src/parquet/level_conversion_inc.h#L336
   
   should be taking values_read_upper_bound instead of num_def_levels. 


[GitHub] [arrow] wjones127 commented on pull request #12216: ARROW-14047: [C++] Parquet Arrow reader sets null values in buffer overflow

Posted by GitBox <gi...@apache.org>.
wjones127 commented on pull request #12216:
URL: https://github.com/apache/arrow/pull/12216#issuecomment-1024520797


   > Is it possible to add additional test coverage that would have caught this issue?
   
   I had thought about including a test with that parquet file, but it seemed to be a very indirect way to test the issue.
   
   Yet, at the same time, I haven't been able to design a more direct failing test for this issue. Does this sound like something we should be able to test as part of the ASAN tests?


[GitHub] [arrow] emkornfield commented on pull request #12216: ARROW-14047: [C++] Parquet Arrow read table can produce invalid array [WIP]

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #12216:
URL: https://github.com/apache/arrow/pull/12216#issuecomment-1022894186


   OK, I found one bug.  We don't handle the case when structs have [repeated parents](https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.cc#L757).  I thought I had handled this case, but maybe a commit got dropped.  I would have expected [this test](https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reconstruct_internal_test.cc#L1141) to have caught it.  I wonder if re-arranging the data might have caught it, or if this is just a red herring.  Let me know if this makes sense and whether you want to try to reproduce/fix the issue; otherwise I can try to do it.
   
   


[GitHub] [arrow] emkornfield commented on a change in pull request #12216: ARROW-14047: [C++] Parquet Arrow read table can produce invalid array [WIP]

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #12216:
URL: https://github.com/apache/arrow/pull/12216#discussion_r789133300



##########
File path: cpp/src/parquet/arrow/arrow_reader_writer_test.cc
##########
@@ -3719,6 +3719,77 @@ TEST(TestArrowReaderAdHoc, WriteBatchedNestedNullableStringColumn) {
[... same test hunk as quoted in full earlier in this thread ...]
+  ::arrow::AssertTablesEqual(*first_read.get(), *second_read.get());

Review comment:
       So this line fails?




[GitHub] [arrow] emkornfield commented on a change in pull request #12216: ARROW-14047: [C++] Parquet Arrow read table can produce invalid array [WIP]

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #12216:
URL: https://github.com/apache/arrow/pull/12216#discussion_r789136429



##########
File path: cpp/src/parquet/arrow/arrow_reader_writer_test.cc
##########
@@ -3719,6 +3719,77 @@ TEST(TestArrowReaderAdHoc, WriteBatchedNestedNullableStringColumn) {
[... same test hunk as quoted in full earlier in this thread ...]
+  // 'first_read->column(1)->chunk(0)->ValidateFull()' failed with Invalid: List child array invalid:

Review comment:
       Is this first_read or second_read?




[GitHub] [arrow] wjones127 commented on a change in pull request #12216: ARROW-14047: [C++] Parquet Arrow read table can produce invalid array [WIP]

Posted by GitBox <gi...@apache.org>.
wjones127 commented on a change in pull request #12216:
URL: https://github.com/apache/arrow/pull/12216#discussion_r789154890



##########
File path: cpp/src/parquet/arrow/arrow_reader_writer_test.cc
##########
@@ -3719,6 +3719,77 @@ TEST(TestArrowReaderAdHoc, WriteBatchedNestedNullableStringColumn) {
[... same test hunk as quoted in full earlier in this thread ...]
+  ::arrow::AssertTablesEqual(*first_read.get(), *second_read.get());

Review comment:
       Yes, all evens are equal.




[GitHub] [arrow] emkornfield commented on a change in pull request #12216: ARROW-14047: [C++] Parquet Arrow reader sets null values in buffer overflow

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #12216:
URL: https://github.com/apache/arrow/pull/12216#discussion_r794950106



##########
File path: cpp/src/parquet/arrow/arrow_reader_writer_test.cc
##########
@@ -3675,6 +3675,116 @@ TEST(TestArrowReaderAdHoc, LARGE_MEMORY_TEST(LargeStringColumn)) {
[... same NoOverflowMemoryPool hunk as quoted in full earlier in this thread ...]
+void NoOverflowMemoryPool::Free(uint8_t* buffer, int64_t size) {
+  // DCHECK_EQ(size, buffers_[&buffer]);

Review comment:
       Haven't reviewed the rest yet, but it seems like there is potentially a bug someplace.  @pitrou, are there any cases where you can think of this happening?
   
   Also, does this actually fail without the change?




[GitHub] [arrow] emkornfield removed a comment on pull request #12216: ARROW-14047: [C++] Parquet Arrow read table can produce invalid array [WIP]

Posted by GitBox <gi...@apache.org>.
emkornfield removed a comment on pull request #12216:
URL: https://github.com/apache/arrow/pull/12216#issuecomment-1022894186


   OK, I found one bug.  We don't handle the case when structs have [repeated parents](https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.cc#L757).  I thought I had handled this case, but maybe a commit got dropped.  I would have expected [this test](https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reconstruct_internal_test.cc#L1141) to have caught it.  I wonder if re-arranging the data might have caught it, or if this is just a red herring.  Let me know if this makes sense and whether you want to try to reproduce/fix the issue; otherwise I can try to do it.
   
   


[GitHub] [arrow] wjones127 commented on pull request #12216: ARROW-14047: [C++] Parquet Arrow read table can produce invalid array [WIP]

Posted by GitBox <gi...@apache.org>.
wjones127 commented on pull request #12216:
URL: https://github.com/apache/arrow/pull/12216#issuecomment-1022834715


   @emkornfield I have debugged further and I believe I have narrowed it down to the approximate place where the data is being corrupted, though it's very strange. I have added two `ValidateFull()` calls that seem to be before and after this corruption occurs. The one on `parquet/arrow/reader_internal.cc:780` passes, but the one on `parquet/arrow/reader.cc:482` fails. The error I get when I run it:
   
   ```
   56: /Users/willjones/Documents/arrows/arrow/cpp/src/parquet/arrow/reader.cc:482:  Check failed: _s.ok() Operation failed: out_->ValidateFull()
   56: Bad status: Invalid: In chunk 0: Invalid: null_count value (854) doesn't match actual number of nulls in array (861)
   56: /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/array/validate.cc:118  ValidateNulls(*data.type)
   ```
   
   The problem is that between those two points I see nothing that could alter the array. I am running this with `OMP_NUM_THREADS=1` and `OMP_THREAD_LIMIT=1`, and I can confirm that in my debugger I only see two threads (one worker and one worker loop). So I'm very confused as to what could be going on in between.
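
   (The bisection technique in use here, in sketch form: validate the same column at successive points in the read path, and the first failing call brackets where the corruption happens. `CheckColumn` is a hypothetical helper, not code from the PR.)

   ```cpp
   #include <arrow/array.h>
   #include <arrow/chunked_array.h>
   #include <arrow/status.h>

   #include <memory>

   arrow::Status CheckColumn(const std::shared_ptr<arrow::ChunkedArray>& column) {
     for (const auto& chunk : column->chunks()) {
       // Deep validation, including recomputing null counts from the bitmap.
       ARROW_RETURN_NOT_OK(chunk->ValidateFull());
     }
     return arrow::Status::OK();
   }
   ```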


[GitHub] [arrow] wjones127 commented on pull request #12216: ARROW-14047: [C++] Parquet Arrow reader sets null values in buffer overflow

Posted by GitBox <gi...@apache.org>.
wjones127 commented on pull request #12216:
URL: https://github.com/apache/arrow/pull/12216#issuecomment-1024724745


   Just uploaded a test that uses a custom MemoryPool that validates that bytes past the end of the allocated buffers are not modified. Seems like overkill, but it works :shrug:.
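
   (In sketch form, the check such a guard pool performs: the padding bytes written at allocation time must still hold the sentinel when the buffer is freed, and any mismatch means something wrote past the logical end of the allocation. A hypothetical standalone helper, not the test's exact code.)

   ```cpp
   #include <cassert>
   #include <cstdint>

   void CheckGuardBytes(const uint8_t* buffer, int64_t size, int64_t padding,
                        uint8_t sentinel) {
     for (int64_t i = size; i < size + padding; ++i) {
       // Fails if anything scribbled past the allocation's logical end.
       assert(buffer[i] == sentinel);
     }
   }
   ```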


[GitHub] [arrow] westonpace commented on a change in pull request #12216: ARROW-14047: [C++] Parquet Arrow reader sets null values in buffer overflow

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #12216:
URL: https://github.com/apache/arrow/pull/12216#discussion_r794956943



##########
File path: cpp/src/parquet/arrow/arrow_reader_writer_test.cc
##########
@@ -3675,6 +3675,116 @@ TEST(TestArrowReaderAdHoc, LARGE_MEMORY_TEST(LargeStringColumn)) {
[... same NoOverflowMemoryPool hunk as quoted in full earlier in this thread ...]
+void NoOverflowMemoryPool::Free(uint8_t* buffer, int64_t size) {
+  // DCHECK_EQ(size, buffers_[&buffer]);

Review comment:
       I think you want `buffers_` to be `std::unordered_map<uint8_t*, int64_t> buffers_` instead of `uint8_t**`
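
   (The suggested fix in sketch form: `&buffer` inside `Free` is the address of a local parameter, and the `out`/`ptr` arguments differ per call site, so a `uint8_t**` key never matches; keying by the buffer address itself makes the lookup work. Hypothetical helper names.)

   ```cpp
   #include <cstdint>
   #include <unordered_map>

   class AllocationSizes {
    public:
     // Record at Allocate() time, keyed by the buffer address itself.
     void Record(uint8_t* buffer, int64_t size) { sizes_[buffer] = size; }

     // Look up at Free() time with the same key.
     bool MatchesRecorded(uint8_t* buffer, int64_t size) const {
       auto it = sizes_.find(buffer);
       return it != sizes_.end() && it->second == size;
     }

    private:
     std::unordered_map<uint8_t*, int64_t> sizes_;  // not uint8_t**
   };
   ```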




[GitHub] [arrow] wjones127 commented on a change in pull request #12216: ARROW-14047: [C++] Parquet Arrow reader sets null values in buffer overflow

Posted by GitBox <gi...@apache.org>.
wjones127 commented on a change in pull request #12216:
URL: https://github.com/apache/arrow/pull/12216#discussion_r795772660



##########
File path: cpp/src/parquet/arrow/arrow_reader_writer_test.cc
##########
@@ -3675,6 +3675,116 @@ TEST(TestArrowReaderAdHoc, LARGE_MEMORY_TEST(LargeStringColumn)) {
[... same NoOverflowMemoryPool hunk as quoted in full earlier in this thread ...]
+void NoOverflowMemoryPool::Free(uint8_t* buffer, int64_t size) {
+  // DCHECK_EQ(size, buffers_[&buffer]);

Review comment:
       Thanks, that was it. That DCHECK passes now.




[GitHub] [arrow] wjones127 commented on pull request #12216: ARROW-14047: [C++] Parquet Arrow read table can produce invalid array [WIP]

Posted by GitBox <gi...@apache.org>.
wjones127 commented on pull request #12216:
URL: https://github.com/apache/arrow/pull/12216#issuecomment-1023679834


   > https://github.com/apache/arrow/blob/03f3cf986314654e932587d01df59ad145faf5b9/cpp/src/arrow/util/bitmap_writer.h#L139
   
   Yeah I think it's something like that. Both the `capacity_` and `size_` of the `null_bitmap` buffer are 320, which to me suggests that `ZeroPadding()` might be failing to provide the null termination needed to separate these buffers. Does that sound plausible? 


[GitHub] [arrow] wjones127 commented on pull request #12216: ARROW-14047: [C++] Parquet Arrow read table can produce invalid array [WIP]

Posted by GitBox <gi...@apache.org>.
wjones127 commented on pull request #12216:
URL: https://github.com/apache/arrow/pull/12216#issuecomment-1023637292


   Here's the particular line where the array becomes invalid, though I don't think it's the root cause:
   
   https://github.com/apache/arrow/blob/03f3cf986314654e932587d01df59ad145faf5b9/cpp/src/arrow/util/bitmap_writer.h#L168
   
   On reads that are corrupted, the `bitmap_` value seems to be duplicated. When the final byte is set to `\0` on the above line, the first 8 values of the array become null. So the validity bitmap must be pointing to the second portion of the duplicated bitmap. Perhaps it's not being cleared somehow? Here is the content of `bitmap_` in the sequence of reads:
   
   ```
   print bitmap_ 1
   (uint8_t *) $4 = 0x00000001016a83c0 "\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\
 xb5u"
   
   print bitmap_ 2
   (uint8_t *) $10 = 0x00000001016a8280 "\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee
 \xb5u\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u"
   
   print bitmap_ 3
   (uint8_t *) $16 = 0x00000001016a83c0 "\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee
 \xb5u"
   
   print bitmap_ 4
   (uint8_t *) $24 = 0x00000001016a8280 "\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee
 \xb5u\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u"
   ```
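
   (A self-contained illustration of the symptom described above, assuming two validity bitmaps sit back-to-back in reused memory: zeroing a "final" byte at an index computed from an oversized bit length lands in byte 0 of the neighbouring bitmap, nulling its first 8 values.)

   ```cpp
   #include <cstdint>
   #include <cstring>

   void ClobberDemo() {
     uint8_t arena[80];                   // stand-in for reused pool memory
     std::memset(arena, 0xFF, sizeof(arena));
     uint8_t* bitmap_a = arena;           // 40 bytes = 320 bits
     uint8_t* bitmap_b = arena + 40;      // neighbour starts immediately after
     int64_t wrong_length_bits = 321;     // one def level more than 320 values
     bitmap_a[(wrong_length_bits - 1) / 8] = 0;  // byte index 40: bitmap_b[0]
     // bitmap_b's first 8 entries now read as null.
   }
   ```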
   
   


[GitHub] [arrow] emkornfield commented on pull request #12216: ARROW-14047: [C++] Parquet Arrow read table can produce invalid array [WIP]

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #12216:
URL: https://github.com/apache/arrow/pull/12216#issuecomment-1023674858


   I wonder if we are undersizing the buffer for when the [memcpy](https://github.com/apache/arrow/blob/03f3cf986314654e932587d01df59ad145faf5b9/cpp/src/arrow/util/bitmap_writer.h#L139) happens.  Does the problem go away if we allocate more space for the validity bitmap?


[GitHub] [arrow] emkornfield commented on a change in pull request #12216: ARROW-14047: [C++] Parquet Arrow read table can produce invalid array [WIP]

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #12216:
URL: https://github.com/apache/arrow/pull/12216#discussion_r790222470



##########
File path: cpp/src/parquet/arrow/arrow_reader_writer_test.cc
##########
@@ -3719,6 +3719,77 @@ TEST(TestArrowReaderAdHoc, WriteBatchedNestedNullableStringColumn) {
[... same test hunk as quoted in full earlier in this thread ...]
+  ::arrow::AssertTablesEqual(*first_read.get(), *second_read.get());

Review comment:
       Wow, that is strange.  Also, just to check: are they also equal after round-tripping, or is there data loss?  Most of the state should be constructed from scratch on read, so I can't think of anything off the top of my head that would cause this.  Trying to run the test under ASAN or UBSAN might highlight something.










[GitHub] [arrow] emkornfield edited a comment on pull request #12216: ARROW-14047: [C++] Parquet Arrow read table can produce invalid array [WIP]

Posted by GitBox <gi...@apache.org>.
emkornfield edited a comment on pull request #12216:
URL: https://github.com/apache/arrow/pull/12216#issuecomment-1022896774


   Just curious, what CPU are you running this on? Or, put another way, if it is Intel with AVX2, can you try disabling SIMD (or vice versa)?





[GitHub] [arrow] emkornfield commented on pull request #12216: ARROW-14047: [C++] Parquet Arrow read table can produce invalid array [WIP]

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #12216:
URL: https://github.com/apache/arrow/pull/12216#issuecomment-1023867239


   > Another possible interpretation is that we are passing the incorrect length to the `FirstTimeBitmapWriter`. I see on the following lines we are passing in `num_def_levels=2800`, but we only allocated a buffer of 320 bytes / 2560 bits.
   
   That might or might not be a problem: if there is a repeated parent, there are potentially fewer values for that call.  The question is whether [value_read and null count](https://github.com/apache/arrow/blob/b1ae6029687aa3bb756c61a189c88125128cf026/cpp/src/parquet/level_conversion_inc.h#L350-L351) can fit in the allocated bitmap (or at least that was the intent of the code).  Empty lists would not be included, so the calculation is `num_def_levels - num empty lists above me`.
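   
   As a quick sanity check on that calculation, here is a hedged worked example; the 2800 def levels and the 320-byte allocation are the numbers reported above, while the empty-list count is an assumed value chosen only to make the arithmetic line up:
   
   ```
   #include <cstdint>
   
   constexpr int64_t num_def_levels = 2800;
   constexpr int64_t num_empty_lists = 240;  // hypothetical, for illustration
   constexpr int64_t values_in_bitmap = num_def_levels - num_empty_lists;  // 2560 bits
   static_assert((values_in_bitmap + 7) / 8 == 320, "matches the 320-byte allocation");
   
   int main() { return 0; }
   ```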





[GitHub] [arrow] wjones127 commented on pull request #12216: ARROW-14047: [C++] Parquet Arrow read table can produce invalid array [WIP]

Posted by GitBox <gi...@apache.org>.
wjones127 commented on pull request #12216:
URL: https://github.com/apache/arrow/pull/12216#issuecomment-1023657080


   This issue seems to be related to the memory pool. I've been using jemalloc for these tests so far. We seem to get a different allocation for `null_bitmap` on these lines:
   
   https://github.com/apache/arrow/blob/20e9e935899c8e11439ee16fb41d24190f2fabd6/cpp/src/parquet/arrow/reader.cc#L766-L768
   
   On a valid read, it allocates:
   
   ```
   print null_bitmap
   (length 320 bytes)
   (uint8_t *) $18 = 0x00000001016a83c0 "\xbc\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\
 xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xbc"
   ```
   
   On an invalid read, the buffer starts with data matching the previous bitmap:
   
   ```
   print null_bitmap
   (length 640 bytes)
   (uint8_t *) $46 = 0x00000001016a8280 "\xbc\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\
 xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xa5\xbc\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u
 \xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u\xfb\xaa\xedU\xednuw\xfb\xaa\xf9U\xdb\xee\xd5v\xfb\xaa\xf5U\xb6\xee\xb5u"
   ```
   
   
   An issue that I now realize is probably related: I couldn't get this test to run using mimalloc, because it would error with:
   
   ```
   mimalloc: error: buffer overflow in heap block 0x200000a0600 of size 320: write after 320 bytes
   ```
   
   If I ran the debugger with mimalloc, I could actually run through the test with no errors or invalid data, as long as I waited 1-2 seconds between each read iteration.
   





[GitHub] [arrow] wjones127 commented on a change in pull request #12216: ARROW-14047: [C++] Parquet Arrow read table can produce invalid array [WIP]

Posted by GitBox <gi...@apache.org>.
wjones127 commented on a change in pull request #12216:
URL: https://github.com/apache/arrow/pull/12216#discussion_r789149962



##########
File path: cpp/src/parquet/arrow/arrow_reader_writer_test.cc
##########
@@ -3719,6 +3719,77 @@ TEST(TestArrowReaderAdHoc, WriteBatchedNestedNullableStringColumn) {
   ::arrow::AssertTablesEqual(*expected, *actual, /*same_chunk_layout=*/false);
 }
 
+TEST(TestArrowReaderAdHoc, RepeatReadNullableStructColumnFile) {
+  // ARROW-14047
+  auto pool = default_memory_pool();
+
+  // Special parquet file; see attachment in Jira
+  auto file_path = test::get_data_file("writeReadRowGroup.parquet");
+  PARQUET_ASSIGN_OR_THROW(auto infile, ::arrow::io::ReadableFile::Open(file_path, pool));
+
+  std::unique_ptr<parquet::arrow::FileReader> arrow_reader;
+  ASSERT_OK(OpenFile(infile, pool, &arrow_reader));
+
+  std::shared_ptr<::arrow::Table> first_read;
+  ASSERT_OK(arrow_reader->ReadTable(&first_read));
+
+  ASSERT_OK(first_read->column(1)->chunk(0)->ValidateFull()); // So it's in theory valid
+
+  std::shared_ptr<::arrow::Table> second_read;
+  ASSERT_OK(arrow_reader->ReadTable(&second_read));
+  ASSERT_OK(arrow_reader->ReadTable(&second_read)); // (Actually third read)
+
+  // 'first_read->column(1)->chunk(0)->ValidateFull()' failed with Invalid: List child array invalid: Invalid: Struct child array #0 invalid: Invalid: null_count value (854) doesn't match actual number of nulls in array (861)
+  // ASSERT_OK(first_read->column(1)->chunk(0)->ValidateFull()); 
+
+  ::arrow::AssertTablesEqual(*first_read.get(), *second_read.get());
+
+  // To eliminate possibility of parquet writer oddities, rewrite table to new parquet:
+  using ::arrow::io::BufferOutputStream;
+  ASSERT_OK_AND_ASSIGN(auto outs, BufferOutputStream::Create(1 << 10, pool));
+  auto props = default_writer_properties();
+  std::unique_ptr<arrow::FileWriter> writer;
+  ASSERT_OK(
+      arrow::FileWriter::Open(*first_read->schema().get(), pool, outs, props, &writer));
+  ASSERT_OK(writer->WriteTable(*first_read.get(), std::numeric_limits<int64_t>::max()));
+  ASSERT_OK(writer->Close());
+  ASSERT_OK_AND_ASSIGN(auto buffer, outs->Finish());
+
+  // Read from table
+  std::unique_ptr<parquet::arrow::FileReader> arrow_reader2;
+  ASSERT_OK(parquet::arrow::OpenFile(std::make_shared<BufferReader>(buffer), pool, &arrow_reader2));
+
+  ASSERT_OK(arrow_reader2->ReadTable(&first_read));
+  ASSERT_OK(first_read->column(1)->chunk(0)->ValidateFull()); // So it's in theory valid
+
+  ASSERT_OK(arrow_reader2->ReadTable(&second_read));
+  // 'first_read->column(1)->chunk(0)->ValidateFull()' failed with Invalid: List child array invalid: 
+  // Invalid: Struct child array #0 invalid: Invalid: null_count value (854) doesn't match actual number 
+  // of nulls in array (861)
+  // ASSERT_OK(second_read->column(1)->chunk(0)->ValidateFull()); 
+
+  ::arrow::AssertTablesEqual(*first_read.get(), *second_read.get());

Review comment:
       Yes.
   
   And I just found that the resulting array is invalid on every other read, starting with the third. That is, the following reads (1-indexed) are invalid: 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, ...
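   
   A minimal sketch of one way to check this, in the style of the test above (it assumes `arrow_reader` is an open reader on the problem file and `<iostream>` is available):
   
   ```
   for (int i = 1; i <= 30; ++i) {
     std::shared_ptr<::arrow::Table> table;
     ASSERT_OK(arrow_reader->ReadTable(&table));
     auto status = table->column(1)->chunk(0)->ValidateFull();
     if (!status.ok()) {
       // On this file, this fires on reads 3, 5, 7, ... only.
       std::cout << "read " << i << " is invalid: " << status.ToString() << "\n";
     }
   }
   ```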







[GitHub] [arrow] wjones127 commented on a change in pull request #12216: ARROW-14047: [C++] Parquet Arrow read table can produce invalid array [WIP]

Posted by GitBox <gi...@apache.org>.
wjones127 commented on a change in pull request #12216:
URL: https://github.com/apache/arrow/pull/12216#discussion_r793041384



##########
File path: cpp/src/parquet/arrow/arrow_reader_writer_test.cc
##########
@@ -3719,6 +3719,77 @@ TEST(TestArrowReaderAdHoc, WriteBatchedNestedNullableStringColumn) {
   ::arrow::AssertTablesEqual(*expected, *actual, /*same_chunk_layout=*/false);
 }
 
+TEST(TestArrowReaderAdHoc, RepeatReadNullableStructColumnFile) {
+  // ARROW-14047
+  auto pool = default_memory_pool();
+
+  // Special parquet file; see attachment in Jira
+  auto file_path = test::get_data_file("writeReadRowGroup.parquet");
+  PARQUET_ASSIGN_OR_THROW(auto infile, ::arrow::io::ReadableFile::Open(file_path, pool));
+
+  std::unique_ptr<parquet::arrow::FileReader> arrow_reader;
+  ASSERT_OK(OpenFile(infile, pool, &arrow_reader));
+
+  std::shared_ptr<::arrow::Table> first_read;
+  ASSERT_OK(arrow_reader->ReadTable(&first_read));
+
+  ASSERT_OK(first_read->column(1)->chunk(0)->ValidateFull()); // So it's in theory valid
+
+  std::shared_ptr<::arrow::Table> second_read;
+  ASSERT_OK(arrow_reader->ReadTable(&second_read));
+  ASSERT_OK(arrow_reader->ReadTable(&second_read)); // (Actually third read)
+
+  // 'first_read->column(1)->chunk(0)->ValidateFull()' failed with Invalid: List child array invalid: Invalid: Struct child array #0 invalid: Invalid: null_count value (854) doesn't match actual number of nulls in array (861)
+  // ASSERT_OK(first_read->column(1)->chunk(0)->ValidateFull()); 
+
+  ::arrow::AssertTablesEqual(*first_read.get(), *second_read.get());
+
+  // To eliminate possibility of parquet writer oddities, rewrite table to new parquet:
+  using ::arrow::io::BufferOutputStream;
+  ASSERT_OK_AND_ASSIGN(auto outs, BufferOutputStream::Create(1 << 10, pool));
+  auto props = default_writer_properties();
+  std::unique_ptr<arrow::FileWriter> writer;
+  ASSERT_OK(
+      arrow::FileWriter::Open(*first_read->schema().get(), pool, outs, props, &writer));
+  ASSERT_OK(writer->WriteTable(*first_read.get(), std::numeric_limits<int64_t>::max()));
+  ASSERT_OK(writer->Close());
+  ASSERT_OK_AND_ASSIGN(auto buffer, outs->Finish());
+
+  // Read from table
+  std::unique_ptr<parquet::arrow::FileReader> arrow_reader2;
+  ASSERT_OK(parquet::arrow::OpenFile(std::make_shared<BufferReader>(buffer), pool, &arrow_reader2));
+
+  ASSERT_OK(arrow_reader2->ReadTable(&first_read));
+  ASSERT_OK(first_read->column(1)->chunk(0)->ValidateFull()); // So it's in theory valid
+
+  ASSERT_OK(arrow_reader2->ReadTable(&second_read));
+  // 'first_read->column(1)->chunk(0)->ValidateFull()' failed with Invalid: List child array invalid: 
+  // Invalid: Struct child array #0 invalid: Invalid: null_count value (854) doesn't match actual number 
+  // of nulls in array (861)
+  // ASSERT_OK(second_read->column(1)->chunk(0)->ValidateFull()); 
+
+  ::arrow::AssertTablesEqual(*first_read.get(), *second_read.get());

Review comment:
       Thanks for looking at it, Micah. I'll try ASAN and UBSAN, as well as see if I can repro with fuzz testing.







[GitHub] [arrow] pitrou commented on pull request #12216: ARROW-14047: [C++] Parquet Arrow reader sets null values in buffer overflow

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #12216:
URL: https://github.com/apache/arrow/pull/12216#issuecomment-1028928255


   Note that a debug MemoryPool may be useful anyway.





[GitHub] [arrow] wjones127 commented on a change in pull request #12216: ARROW-14047: [C++] Parquet Arrow read table can produce invalid array [WIP]

Posted by GitBox <gi...@apache.org>.
wjones127 commented on a change in pull request #12216:
URL: https://github.com/apache/arrow/pull/12216#discussion_r789143572



##########
File path: cpp/src/parquet/arrow/arrow_reader_writer_test.cc
##########
@@ -3719,6 +3719,77 @@ TEST(TestArrowReaderAdHoc, WriteBatchedNestedNullableStringColumn) {
   ::arrow::AssertTablesEqual(*expected, *actual, /*same_chunk_layout=*/false);
 }
 
+TEST(TestArrowReaderAdHoc, RepeatReadNullableStructColumnFile) {
+  // ARROW-14047
+  auto pool = default_memory_pool();
+
+  // Special parquet file; see attachment in Jira
+  auto file_path = test::get_data_file("writeReadRowGroup.parquet");
+  PARQUET_ASSIGN_OR_THROW(auto infile, ::arrow::io::ReadableFile::Open(file_path, pool));
+
+  std::unique_ptr<parquet::arrow::FileReader> arrow_reader;
+  ASSERT_OK(OpenFile(infile, pool, &arrow_reader));
+
+  std::shared_ptr<::arrow::Table> first_read;
+  ASSERT_OK(arrow_reader->ReadTable(&first_read));
+
+  ASSERT_OK(first_read->column(1)->chunk(0)->ValidateFull()); // So it's in theory valid
+
+  std::shared_ptr<::arrow::Table> second_read;
+  ASSERT_OK(arrow_reader->ReadTable(&second_read));
+  ASSERT_OK(arrow_reader->ReadTable(&second_read)); // (Actually third read)
+
+  // 'first_read->column(1)->chunk(0)->ValidateFull()' failed with Invalid: List child array invalid: Invalid: Struct child array #0 invalid: Invalid: null_count value (854) doesn't match actual number of nulls in array (861)
+  // ASSERT_OK(first_read->column(1)->chunk(0)->ValidateFull()); 
+
+  ::arrow::AssertTablesEqual(*first_read.get(), *second_read.get());
+
+  // To eliminate possibility of parquet writer oddities, rewrite table to new parquet:
+  using ::arrow::io::BufferOutputStream;
+  ASSERT_OK_AND_ASSIGN(auto outs, BufferOutputStream::Create(1 << 10, pool));
+  auto props = default_writer_properties();
+  std::unique_ptr<arrow::FileWriter> writer;
+  ASSERT_OK(
+      arrow::FileWriter::Open(*first_read->schema().get(), pool, outs, props, &writer));
+  ASSERT_OK(writer->WriteTable(*first_read.get(), std::numeric_limits<int64_t>::max()));
+  ASSERT_OK(writer->Close());
+  ASSERT_OK_AND_ASSIGN(auto buffer, outs->Finish());
+
+  // Read from table
+  std::unique_ptr<parquet::arrow::FileReader> arrow_reader2;
+  ASSERT_OK(parquet::arrow::OpenFile(std::make_shared<BufferReader>(buffer), pool, &arrow_reader2));
+
+  ASSERT_OK(arrow_reader2->ReadTable(&first_read));
+  ASSERT_OK(first_read->column(1)->chunk(0)->ValidateFull()); // So it's in theory valid
+
+  ASSERT_OK(arrow_reader2->ReadTable(&second_read));
+  // 'first_read->column(1)->chunk(0)->ValidateFull()' failed with Invalid: List child array invalid: 

Review comment:
       Should be `second_read` for that line below; just reran with line 3769 uncommented.
   
   ```
   56: /Users/willjones/Documents/arrows/arrow/cpp/src/parquet/arrow/arrow_reader_writer_test.cc:3797: Failure
   56: Failed
   56: 'second_read->column(1)->chunk(0)->ValidateFull()' failed with Invalid: List child array invalid: Invalid: Struct child array #0 invalid: Invalid: null_count value (854) doesn't match actual number of nulls in array (861)
   56: /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/array/validate.cc:118  ValidateNulls(*data.type)
   ```







[GitHub] [arrow] wjones127 commented on a change in pull request #12216: ARROW-14047: [C++] Parquet Arrow reader sets null values in buffer overflow

Posted by GitBox <gi...@apache.org>.
wjones127 commented on a change in pull request #12216:
URL: https://github.com/apache/arrow/pull/12216#discussion_r794953683



##########
File path: cpp/src/parquet/arrow/arrow_reader_writer_test.cc
##########
@@ -3675,6 +3675,116 @@ TEST(TestArrowReaderAdHoc, LARGE_MEMORY_TEST(LargeStringColumn)) {
   AssertTablesEqual(*table, *batched_table, /*same_chunk_layout=*/false);
 }
 
+class NoOverflowMemoryPool : public MemoryPool {
+ public:
+  explicit NoOverflowMemoryPool(MemoryPool* pool, int64_t padding);
+  ~NoOverflowMemoryPool() override = default;
+
+  Status Allocate(int64_t size, uint8_t** out) override;
+  Status Reallocate(int64_t old_size, int64_t new_size, uint8_t** ptr) override;
+
+  void Free(uint8_t* buffer, int64_t size) override;
+
+  int64_t bytes_allocated() const override;
+
+  int64_t max_memory() const override;
+
+  std::string backend_name() const override;
+
+ private:
+  MemoryPool* pool_;
+  int64_t padding_;
+  std::unordered_map<uint8_t**, int64_t> buffers_;
+  const uint8_t check_value_ = '\xff';
+};
+
+NoOverflowMemoryPool::NoOverflowMemoryPool(MemoryPool* pool, int64_t padding)
+    : pool_(pool), padding_(padding) {}
+
+Status NoOverflowMemoryPool::Allocate(int64_t size, uint8_t** out) {
+  Status s = pool_->Allocate(size + padding_, out);
+  if (s.ok()) {
+    buffers_[out] = size;
+
+    for (int i = size; i < size + padding_; ++i) {
+      (*out)[i] = check_value_;
+    }
+  }
+  return s;
+}
+
+Status NoOverflowMemoryPool::Reallocate(int64_t old_size, int64_t new_size,
+                                        uint8_t** ptr) {
+  Status s = pool_->Reallocate(old_size + padding_, new_size + padding_, ptr);
+  if (s.ok()) {
+    buffers_[ptr] = new_size;
+
+    for (int i = new_size; i < new_size + padding_; ++i) {
+      (*ptr)[i] = check_value_;
+    }
+  }
+  return s;
+}
+
+void NoOverflowMemoryPool::Free(uint8_t* buffer, int64_t size) {
+  // DCHECK_EQ(size, buffers_[&buffer]);

Review comment:
       > Also does this actually fail without the change?
   
   I commented out the fix in the current push, so you can see the failure in CI. Once CI finishes, I will push with the fix, which will show success (and also fix the formatting issue :facepalm:).







[GitHub] [arrow] wjones127 edited a comment on pull request #12216: ARROW-14047: [C++] Parquet Arrow reader sets null values in buffer overflow

Posted by GitBox <gi...@apache.org>.
wjones127 edited a comment on pull request #12216:
URL: https://github.com/apache/arrow/pull/12216#issuecomment-1024724745


   Just uploaded a test that uses a custom MemoryPool to validate that bytes past the end of the allocated buffers are not modified. Seems like overkill, but it works :shrug:.
   
   Here's the failing test: https://github.com/apache/arrow/runs/4987334242?check_suite_focus=true#step:6:4532
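   
   The core of the idea, as a minimal standalone sketch (illustrative names only, not Arrow APIs): pad every allocation with guard bytes and verify afterwards that they were not touched.
   
   ```
   #include <cstdint>
   #include <cstdio>
   #include <cstring>
   #include <vector>
   
   constexpr int64_t kPadding = 64;
   constexpr uint8_t kGuard = 0xFF;
   
   // Allocate `size` usable bytes plus a guard region filled with kGuard.
   std::vector<uint8_t> GuardedAlloc(int64_t size) {
     std::vector<uint8_t> buf(size + kPadding, 0);
     std::memset(buf.data() + size, kGuard, kPadding);
     return buf;
   }
   
   // Returns false if anything wrote past the logical end of the buffer.
   bool GuardIntact(const std::vector<uint8_t>& buf, int64_t size) {
     for (int64_t i = 0; i < kPadding; ++i) {
       if (buf[size + i] != kGuard) return false;
     }
     return true;
   }
   
   int main() {
     auto buf = GuardedAlloc(320);
     buf[320] = 0;  // simulate a one-byte overflow past a 320-byte buffer
     std::printf("guard intact: %s\n", GuardIntact(buf, 320) ? "yes" : "no");
     return 0;
   }
   ```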





[GitHub] [arrow] wjones127 commented on pull request #12216: ARROW-14047: [C++] Parquet Arrow read table can produce invalid array [WIP]

Posted by GitBox <gi...@apache.org>.
wjones127 commented on pull request #12216:
URL: https://github.com/apache/arrow/pull/12216#issuecomment-1023682053


   Yes, this change to https://github.com/apache/arrow/blob/20e9e935899c8e11439ee16fb41d24190f2fabd6/cpp/src/parquet/arrow/reader.cc#L766-L768 fixes the issue:
   
   ```
       ARROW_ASSIGN_OR_RAISE(
           null_bitmap,
           AllocateResizableBuffer(bit_util::BytesForBits(length_upper_bound + 100), ctx_->pool));
   ```





[GitHub] [arrow] pitrou commented on pull request #12216: ARROW-14047: [C++] Parquet Arrow reader sets null values in buffer overflow

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #12216:
URL: https://github.com/apache/arrow/pull/12216#issuecomment-1033624753


   Closed in favour of #12330.





[GitHub] [arrow] pitrou closed pull request #12216: ARROW-14047: [C++] Parquet Arrow reader sets null values in buffer overflow

Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #12216:
URL: https://github.com/apache/arrow/pull/12216


   





[GitHub] [arrow] emkornfield commented on pull request #12216: ARROW-14047: [C++] Parquet Arrow reader sets null values in buffer overflow

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #12216:
URL: https://github.com/apache/arrow/pull/12216#issuecomment-1024838490


   CC @zeroshade 





[GitHub] [arrow] wjones127 commented on a change in pull request #12216: ARROW-14047: [C++] Parquet Arrow read table can produce invalid array [WIP]

Posted by GitBox <gi...@apache.org>.
wjones127 commented on a change in pull request #12216:
URL: https://github.com/apache/arrow/pull/12216#discussion_r789135683



##########
File path: cpp/src/parquet/arrow/arrow_reader_writer_test.cc
##########
@@ -3719,6 +3719,77 @@ TEST(TestArrowReaderAdHoc, WriteBatchedNestedNullableStringColumn) {
   ::arrow::AssertTablesEqual(*expected, *actual, /*same_chunk_layout=*/false);
 }
 
+TEST(TestArrowReaderAdHoc, RepeatReadNullableStructColumnFile) {
+  // ARROW-14047
+  auto pool = default_memory_pool();
+
+  // Special parquet file; see attachment in Jira
+  auto file_path = test::get_data_file("writeReadRowGroup.parquet");
+  PARQUET_ASSIGN_OR_THROW(auto infile, ::arrow::io::ReadableFile::Open(file_path, pool));
+
+  std::unique_ptr<parquet::arrow::FileReader> arrow_reader;
+  ASSERT_OK(OpenFile(infile, pool, &arrow_reader));
+
+  std::shared_ptr<::arrow::Table> first_read;
+  ASSERT_OK(arrow_reader->ReadTable(&first_read));
+
+  ASSERT_OK(first_read->column(1)->chunk(0)->ValidateFull()); // So it's in theory valid
+
+  std::shared_ptr<::arrow::Table> second_read;
+  ASSERT_OK(arrow_reader->ReadTable(&second_read));
+  ASSERT_OK(arrow_reader->ReadTable(&second_read)); // (Actually third read)
+
+  // 'first_read->column(1)->chunk(0)->ValidateFull()' failed with Invalid: List child array invalid: Invalid: Struct child array #0 invalid: Invalid: null_count value (854) doesn't match actual number of nulls in array (861)
+  // ASSERT_OK(first_read->column(1)->chunk(0)->ValidateFull()); 
+
+  ::arrow::AssertTablesEqual(*first_read.get(), *second_read.get());
+
+  // To eliminate possibility of parquet writer oddities, rewrite table to new parquet:
+  using ::arrow::io::BufferOutputStream;
+  ASSERT_OK_AND_ASSIGN(auto outs, BufferOutputStream::Create(1 << 10, pool));
+  auto props = default_writer_properties();
+  std::unique_ptr<arrow::FileWriter> writer;
+  ASSERT_OK(
+      arrow::FileWriter::Open(*first_read->schema().get(), pool, outs, props, &writer));
+  ASSERT_OK(writer->WriteTable(*first_read.get(), std::numeric_limits<int64_t>::max()));
+  ASSERT_OK(writer->Close());
+  ASSERT_OK_AND_ASSIGN(auto buffer, outs->Finish());
+
+  // Read from table
+  std::unique_ptr<parquet::arrow::FileReader> arrow_reader2;
+  ASSERT_OK(parquet::arrow::OpenFile(std::make_shared<BufferReader>(buffer), pool, &arrow_reader2));
+
+  ASSERT_OK(arrow_reader2->ReadTable(&first_read));
+  ASSERT_OK(first_read->column(1)->chunk(0)->ValidateFull()); // So it's in theory valid
+
+  ASSERT_OK(arrow_reader2->ReadTable(&second_read));
+  // 'first_read->column(1)->chunk(0)->ValidateFull()' failed with Invalid: List child array invalid: 
+  // Invalid: Struct child array #0 invalid: Invalid: null_count value (854) doesn't match actual number 
+  // of nulls in array (861)
+  // ASSERT_OK(second_read->column(1)->chunk(0)->ValidateFull()); 
+
+  ::arrow::AssertTablesEqual(*first_read.get(), *second_read.get());

Review comment:
       Yes, if you comment out line 3739. As is, it will error on line 3763 (ValidateFull). (Sorry, that's a little confusing.)
   
   Output from the error on line 3771:
   
   ```
   56: /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/testing/gtest_util.cc:205: Failure
   56: Failed
   56: # chunk 0
   56: 
   56: @@ -0, +0 @@
   56: -[{}, {}, {}, {}, {}, {}, {}]
   56: -[{}, {}, {x: 13}, {}, {x: 13.2}, {}, {x: 13.4}, {}]
   56: +[{x: 12.7}, {x: 12.8}, {}, {x: 13}, {x: 13.1}, {x: 13.2}, {x: 13.3}]
   56: +[{x: 12.8}, {}, {x: 13}, {}, {x: 13.2}, {}, {x: 13.4}, {}]
   ```







[GitHub] [arrow] wjones127 commented on a change in pull request #12216: ARROW-14047: [C++] Parquet Arrow read table can produce invalid array [WIP]

Posted by GitBox <gi...@apache.org>.
wjones127 commented on a change in pull request #12216:
URL: https://github.com/apache/arrow/pull/12216#discussion_r789153383



##########
File path: cpp/src/parquet/arrow/arrow_reader_writer_test.cc
##########
@@ -3719,6 +3719,77 @@ TEST(TestArrowReaderAdHoc, WriteBatchedNestedNullableStringColumn) {
   ::arrow::AssertTablesEqual(*expected, *actual, /*same_chunk_layout=*/false);
 }
 
+TEST(TestArrowReaderAdHoc, RepeatReadNullableStructColumnFile) {
+  // ARROW-14047
+  auto pool = default_memory_pool();
+
+  // Special parquet file; see attachment in Jira
+  auto file_path = test::get_data_file("writeReadRowGroup.parquet");
+  PARQUET_ASSIGN_OR_THROW(auto infile, ::arrow::io::ReadableFile::Open(file_path, pool));
+
+  std::unique_ptr<parquet::arrow::FileReader> arrow_reader;
+  ASSERT_OK(OpenFile(infile, pool, &arrow_reader));
+
+  std::shared_ptr<::arrow::Table> first_read;
+  ASSERT_OK(arrow_reader->ReadTable(&first_read));
+
+  ASSERT_OK(first_read->column(1)->chunk(0)->ValidateFull()); // So it's in theory valid
+
+  std::shared_ptr<::arrow::Table> second_read;
+  ASSERT_OK(arrow_reader->ReadTable(&second_read));
+  ASSERT_OK(arrow_reader->ReadTable(&second_read)); // (Actually third read)
+
+  // 'first_read->column(1)->chunk(0)->ValidateFull()' failed with Invalid: List child array invalid: Invalid: Struct child array #0 invalid: Invalid: null_count value (854) doesn't match actual number of nulls in array (861)
+  // ASSERT_OK(first_read->column(1)->chunk(0)->ValidateFull()); 
+
+  ::arrow::AssertTablesEqual(*first_read.get(), *second_read.get());
+
+  // To eliminate possibility of parquet writer oddities, rewrite table to new parquet:
+  using ::arrow::io::BufferOutputStream;
+  ASSERT_OK_AND_ASSIGN(auto outs, BufferOutputStream::Create(1 << 10, pool));
+  auto props = default_writer_properties();
+  std::unique_ptr<arrow::FileWriter> writer;
+  ASSERT_OK(
+      arrow::FileWriter::Open(*first_read->schema().get(), pool, outs, props, &writer));
+  ASSERT_OK(writer->WriteTable(*first_read.get(), std::numeric_limits<int64_t>::max()));
+  ASSERT_OK(writer->Close());
+  ASSERT_OK_AND_ASSIGN(auto buffer, outs->Finish());
+
+  // Read from table
+  std::unique_ptr<parquet::arrow::FileReader> arrow_reader2;
+  ASSERT_OK(parquet::arrow::OpenFile(std::make_shared<BufferReader>(buffer), pool, &arrow_reader2));
+
+  ASSERT_OK(arrow_reader2->ReadTable(&first_read));
+  ASSERT_OK(first_read->column(1)->chunk(0)->ValidateFull()); // So it's in theory valid
+
+  ASSERT_OK(arrow_reader2->ReadTable(&second_read));
+  // 'first_read->column(1)->chunk(0)->ValidateFull()' failed with Invalid: List child array invalid: 
+  // Invalid: Struct child array #0 invalid: Invalid: null_count value (854) doesn't match actual number 
+  // of nulls in array (861)
+  // ASSERT_OK(second_read->column(1)->chunk(0)->ValidateFull()); 
+
+  ::arrow::AssertTablesEqual(*first_read.get(), *second_read.get());

Review comment:
       That's what I'm seeing so far. And I believe the original test showed that the valid ones do, but I will do another check.







[GitHub] [arrow] pitrou commented on pull request #12216: ARROW-14047: [C++] Parquet Arrow reader sets null values in buffer overflow

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #12216:
URL: https://github.com/apache/arrow/pull/12216#issuecomment-1028928025


   @wjones127 In the PR description you say that validating the table catches the issue. Is that sufficient, or is the custom MemoryPool actually needed?





[GitHub] [arrow] wjones127 commented on pull request #12216: ARROW-14047: [C++] Parquet Arrow read table can produce invalid array [WIP]

Posted by GitBox <gi...@apache.org>.
wjones127 commented on pull request #12216:
URL: https://github.com/apache/arrow/pull/12216#issuecomment-1023857312


   Another possible interpretation is that we are passing the incorrect length to the `FirstTimeBitmapWriter`. I see on the following lines we are passing in `num_def_levels=2800`, but we only allocated a buffer of 320 bytes / 2560 bits.
   
   https://github.com/apache/arrow/blob/b1ae6029687aa3bb756c61a189c88125128cf026/cpp/src/parquet/level_conversion_inc.h#L333-L336
   
   If we instead passed the (correct?) length of 2560 bits, then the problematic write on this line would not happen:
   
   https://github.com/apache/arrow/blob/03f3cf986314654e932587d01df59ad145faf5b9/cpp/src/arrow/util/bitmap_writer.h#L166-L168
   
   I'm pretty unsure whether I'm interpreting this correctly, so your guidance is much appreciated.
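   
   Spelling out the byte math behind that reading, as a small self-checking sketch using the same rounding as `bit_util::BytesForBits`:
   
   ```
   #include <cstdint>
   
   constexpr int64_t BytesForBits(int64_t bits) { return (bits + 7) / 8; }
   
   static_assert(BytesForBits(2800) == 350, "the writer length implies 350 bytes");
   static_assert(BytesForBits(2560) == 320, "the allocation only covers 2560 bits");
   
   int main() { return 0; }
   ```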





[GitHub] [arrow] wjones127 edited a comment on pull request #12216: ARROW-14047: [C++] Parquet Arrow read table can produce invalid array [WIP]

Posted by GitBox <gi...@apache.org>.
wjones127 edited a comment on pull request #12216:
URL: https://github.com/apache/arrow/pull/12216#issuecomment-1022834715


   @emkornfield I have debugged further and I believe I have narrowed it down to the approximate place where the data is being corrupted. I have added two `ValidateFull()` calls that seem to fall before and after the point where the corruption occurs. The one at `parquet/arrow/reader_internal.cc:780` passes, but the one at `parquet/arrow/reader.cc:482` fails. The error I get when I run:
   
   ```
   56: /Users/willjones/Documents/arrows/arrow/cpp/src/parquet/arrow/reader.cc:482:  Check failed: _s.ok() Operation failed: out_->ValidateFull()
   56: Bad status: Invalid: In chunk 0: Invalid: null_count value (854) doesn't match actual number of nulls in array (861)
   56: /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/array/validate.cc:118  ValidateNulls(*data.type)
   ```
   
   It seems that `LoadBatch()` on the leaf node is reading the correct data, but by the time `MakeArray()` is called, the data has somehow been corrupted. I will debug further to try to narrow down where that happens.





[GitHub] [arrow] emkornfield commented on pull request #12216: ARROW-14047: [C++] Parquet Arrow reader sets null values in buffer overflow

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #12216:
URL: https://github.com/apache/arrow/pull/12216#issuecomment-1024472998


   Is it possible to add additional test coverage that would have caught this issue?

