Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/07/22 21:01:20 UTC

[GitHub] [arrow] lidavidm commented on a diff in pull request #13665: ARROW-17100: [C++][Parquet] Fix backwards compatibility for ParquetV2 data pages written prior to 3.0.0 per ARROW-10353

lidavidm commented on code in PR #13665:
URL: https://github.com/apache/arrow/pull/13665#discussion_r927983874


##########
cpp/src/parquet/column_reader.h:
##########
@@ -105,10 +105,12 @@ class PARQUET_EXPORT PageReader {
 
   static std::unique_ptr<PageReader> Open(
       std::shared_ptr<ArrowInputStream> stream, int64_t total_num_rows,
-      Compression::type codec, ::arrow::MemoryPool* pool = ::arrow::default_memory_pool(),
+      Compression::type codec, bool compression_always_true,

Review Comment:
   nit: default to false?



##########
cpp/src/parquet/column_io_benchmark.cc:
##########
@@ -130,7 +130,8 @@ std::shared_ptr<Int64Reader> BuildReader(std::shared_ptr<Buffer>& buffer,
                                          int64_t num_values, Compression::type codec,
                                          ColumnDescriptor* schema) {
   auto source = std::make_shared<::arrow::io::BufferReader>(buffer);
-  std::unique_ptr<PageReader> page_reader = PageReader::Open(source, num_values, codec);
+  std::unique_ptr<PageReader> page_reader =
+      PageReader::Open(source, num_values, codec, false);

Review Comment:
   nit: add `/*param_name=*/ false` so readers can more easily tell what's going on
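
For illustration, a minimal sketch of the two nits so far (a defaulted flag plus an argument-name comment at call sites that pass it explicitly). `OpenSketch` and `always_compressed` are hypothetical stand-ins, not the real `PageReader::Open` signature or the PR's parameter name.

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-in for PageReader::Open; NOT the real signature.
// `always_compressed` is an illustrative name, defaulted to false so that
// existing call sites do not need to change.
std::string OpenSketch(int64_t total_num_rows, bool always_compressed = false) {
  return std::to_string(total_num_rows) +
         (always_compressed ? ":always" : ":header");
}

// Existing callers keep compiling unchanged thanks to the default.
std::string CallWithDefault() { return OpenSketch(10); }

// When the flag is passed explicitly, the inline comment names the parameter
// so the bare boolean is readable at the call site.
std::string CallAnnotated() {
  return OpenSketch(10, /*always_compressed=*/true);
}
```

With the default in place, only call sites that actually need the non-default behavior have to mention the flag at all.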



##########
cpp/src/parquet/arrow/arrow_reader_writer_test.cc:
##########
@@ -3943,6 +3943,19 @@ TEST(TestArrowReaderAdHoc, WriteBatchedNestedNullableStringColumn) {
   ::arrow::AssertTablesEqual(*expected, *actual, /*same_chunk_layout=*/false);
 }
 
+TEST(TestArrowReaderAdHoc, OldDataPageV2) {
+  // ARROW-17100
+  const char* c_root = std::getenv("ARROW_TEST_DATA");
+  if (!c_root) {
+    GTEST_SKIP() << "ARROW_TEST_DATA not set.";
+  }

Review Comment:
   This also needs a SKIP like the one for Snappy above



##########
cpp/src/parquet/arrow/arrow_reader_writer_test.cc:
##########
@@ -3943,6 +3943,19 @@ TEST(TestArrowReaderAdHoc, WriteBatchedNestedNullableStringColumn) {
   ::arrow::AssertTablesEqual(*expected, *actual, /*same_chunk_layout=*/false);
 }
 
+TEST(TestArrowReaderAdHoc, OldDataPageV2) {
+  // ARROW-17100
+  const char* c_root = std::getenv("ARROW_TEST_DATA");
+  if (!c_root) {
+    GTEST_SKIP() << "ARROW_TEST_DATA not set.";
+  }

Review Comment:
   FWIW, everything else in `parquet` uses `PARQUET_TEST_DATA`; should it have gone there instead?



##########
cpp/src/parquet/column_writer_test.cc:
##########
@@ -85,7 +85,7 @@ class TestPrimitiveWriter : public PrimitiveTypedTest<TestType> {
     ASSERT_OK_AND_ASSIGN(auto buffer, sink_->Finish());
     auto source = std::make_shared<::arrow::io::BufferReader>(buffer);
     std::unique_ptr<PageReader> page_reader =
-        PageReader::Open(std::move(source), num_rows, compression);
+        PageReader::Open(std::move(source), num_rows, compression, false);

Review Comment:
   Same as the comment above: annotate the bare `false` with the parameter name (though adding the default would also fix it, since this call would not need to change)



##########
cpp/src/parquet/column_reader.cc:
##########
@@ -449,7 +452,10 @@ std::shared_ptr<Page> SerializedPageReader::NextPage() {
           header.repetition_levels_byte_length < 0) {
         throw ParquetException("Invalid page header (negative levels byte length)");
       }
-      bool is_compressed = header.__isset.is_compressed ? header.is_compressed : false;
+      // Some implementations set is_compressed to false but still compressed.

Review Comment:
   "Some implementations" -> Specifically, Arrow prior to 3.0.0?
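
For context, one plausible shape of the check being discussed, written as a self-contained sketch. `HeaderSketch`, `PageIsCompressed`, and `always_compressed` are hypothetical names; the fallback branch mirrors the removed line in the hunk above, and the override branch is an assumption about what the replacement does, since the new line is not shown here.

```cpp
// Hypothetical stand-in for the Thrift-generated data page V2 header fields.
struct HeaderSketch {
  bool has_is_compressed;  // mirrors header.__isset.is_compressed
  bool is_compressed;      // mirrors header.is_compressed
};

// Decide whether a V2 data page should be treated as compressed.
// `always_compressed` is an assumed override for files from writers that set
// is_compressed to false but compressed the page anyway.
bool PageIsCompressed(const HeaderSketch& header, bool always_compressed) {
  if (always_compressed) {
    return true;  // ignore the header flag for such legacy files
  }
  // Same fallback as the removed line: trust the flag only when it is set.
  return header.has_is_compressed ? header.is_compressed : false;
}
```

Naming the affected writer in the code comment would make it clear when the override applies.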


