You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "wjones127 (via GitHub)" <gi...@apache.org> on 2023/03/21 22:04:42 UTC

[GitHub] [arrow] wjones127 opened a new pull request, #34668: GH-34667: [C++][Parquet] Test DeltaLengthByteArrayDecoder with invalid inputs

wjones127 opened a new pull request, #34668:
URL: https://github.com/apache/arrow/pull/34668

   ### Rationale for this change
   
   We should make sure that we are defensive when it comes to invalid inputs. Otherwise malicious actors might send invalid Parquet files as a way to crash a users service.
   
   ### What changes are included in this PR?
   
   Adds a guard against trying to read empty pages. Also adds a DCHECK that validates `num_valid_values_` was initialized (it wasn't when `len = 0`). Also adds a few more general tests cases for the encoding, covering empty arrays and empty strings.
   
   ### Are these changes tested?
   
   Yes, several unit tests have been added.
   
   ### Are there any user-facing changes?
   
   **This PR contains a "Critical Fix".**


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #34668: GH-34667: [C++][Parquet] Test DeltaLengthByteArrayDecoder with invalid inputs

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #34668:
URL: https://github.com/apache/arrow/pull/34668#discussion_r1144176635


##########
cpp/src/parquet/encoding.cc:
##########
@@ -2708,7 +2708,14 @@ class DeltaLengthByteArrayDecoder : public DecoderImpl,
 
   void SetData(int num_values, const uint8_t* data, int len) override {
     num_values_ = num_values;
-    if (len == 0) return;
+    if (len == 0) {
+      if (num_values > 0) {
+        throw ParquetException(
+            "Delta length byte array page was empty, but expected {num_values} values.");

Review Comment:
   I think num_values could greater than 0, let me write a test for that



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #34668: GH-34667: [C++][Parquet] Test DeltaLengthByteArrayDecoder with invalid inputs

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #34668:
URL: https://github.com/apache/arrow/pull/34668#discussion_r1145595301


##########
cpp/src/parquet/encoding.cc:
##########
@@ -2708,7 +2708,6 @@ class DeltaLengthByteArrayDecoder : public DecoderImpl,
 
   void SetData(int num_values, const uint8_t* data, int len) override {
     num_values_ = num_values;
-    if (len == 0) return;

Review Comment:
   nice catch!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rok commented on a diff in pull request #34668: GH-34667: [C++][Parquet] Test DeltaLengthByteArrayDecoder with invalid inputs

Posted by "rok (via GitHub)" <gi...@apache.org>.
rok commented on code in PR #34668:
URL: https://github.com/apache/arrow/pull/34668#discussion_r1144110440


##########
cpp/src/parquet/encoding.cc:
##########
@@ -2708,7 +2708,14 @@ class DeltaLengthByteArrayDecoder : public DecoderImpl,
 
   void SetData(int num_values, const uint8_t* data, int len) override {
     num_values_ = num_values;
-    if (len == 0) return;
+    if (len == 0) {

Review Comment:
   ```suggestion
       if (ARROW_PREDICT_FALSE(len <= 0)) {
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #34668: GH-34667: [C++][Parquet] Test DeltaLengthByteArrayDecoder with invalid inputs

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #34668:
URL: https://github.com/apache/arrow/pull/34668#issuecomment-1478845206

   The `num_values` is a tricky word, num_values is from DataPageHeader, and means that `num of non-null value + num of null values`. So it includes null value.
   
   In DeltaLengthByteArray, seems real num_values can comes from DELTA encoding.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on pull request #34668: GH-34667: [C++][Parquet] Test DeltaLengthByteArrayDecoder with invalid inputs

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #34668:
URL: https://github.com/apache/arrow/pull/34668#issuecomment-1478993844

   Thanks @mapleFU for investigation. This is what I mean exactly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wjones127 commented on a diff in pull request #34668: GH-34667: [C++][Parquet] Test DeltaLengthByteArrayDecoder with invalid inputs

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on code in PR #34668:
URL: https://github.com/apache/arrow/pull/34668#discussion_r1144053408


##########
cpp/src/parquet/encoding.cc:
##########
@@ -2723,6 +2730,7 @@ class DeltaLengthByteArrayDecoder : public DecoderImpl,
     // Decode up to `max_values` strings into an internal buffer
     // and reference them into `buffer`.
     max_values = std::min(max_values, num_valid_values_);
+    DCHECK_GE(max_values, 0);

Review Comment:
   That might be safer, but as-is we should never trip that. I added this in case someone refactors the class and adds another code path where `num_valid_values_` might not be initialized. Do you still think it's worthwhile?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #34668: GH-34667: [C++][Parquet] Test DeltaLengthByteArrayDecoder with invalid inputs

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #34668:
URL: https://github.com/apache/arrow/pull/34668#discussion_r1144166948


##########
cpp/src/parquet/encoding.cc:
##########
@@ -2708,7 +2708,14 @@ class DeltaLengthByteArrayDecoder : public DecoderImpl,
 
   void SetData(int num_values, const uint8_t* data, int len) override {
     num_values_ = num_values;
-    if (len == 0) return;
+    if (len == 0) {
+      if (num_values > 0) {

Review Comment:
   I did a quick check. `num_values` is the number of values with nulls considered in a data page.
   
   https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_reader.cc#L922
   ```cpp
       current_decoder_->SetData(static_cast<int>(num_buffered_values_), buffer,
                                 static_cast<int>(data_size));
   ```
   
   https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_reader.cc#L949
   ```cpp
     // The total number of values stored in the data page. This is the maximum of
     // the number of encoded definition levels or encoded values. For
     // non-repeated, required columns, this is equal to the number of encoded
     // values. For repeated or optional values, there may be fewer data values
     // than levels, and this tells you how many encoded levels there are in that
     // case.
     int64_t num_buffered_values_;
   ```
   
   Therefore, the input `num_values > 0 && len == 0` looks valid to me. Should we not throw here but decode nothing from the decoder instead?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wjones127 commented on a diff in pull request #34668: GH-34667: [C++][Parquet] Test DeltaLengthByteArrayDecoder with invalid inputs

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on code in PR #34668:
URL: https://github.com/apache/arrow/pull/34668#discussion_r1144035990


##########
cpp/src/parquet/encoding_test.cc:
##########
@@ -1639,6 +1639,79 @@ TYPED_TEST(TestDeltaLengthByteArrayEncoding, BasicRoundTrip) {
       /*null_probability*/ 0.1));
 }
 
+std::shared_ptr<Buffer> DeltaEncode(std::vector<int32_t> lengths) {
+  auto encoder = MakeTypedEncoder<Int32Type>(Encoding::DELTA_BINARY_PACKED);
+  encoder->Put(lengths.data(), static_cast<int>(lengths.size()));
+  return encoder->FlushValues();
+}
+
+TEST(TestDeltaLengthByteArrayEncoding, AdHocRoundTrip) {
+  const std::shared_ptr<::arrow::Array> cases[] = {
+      ::arrow::ArrayFromJSON(::arrow::utf8(), R"([])"),
+      ::arrow::ArrayFromJSON(::arrow::utf8(), R"(["abc", "de", ""])"),
+      ::arrow::ArrayFromJSON(::arrow::utf8(), R"(["", "", ""])"),
+      ::arrow::ArrayFromJSON(::arrow::utf8(), R"(["abc", "", "xyz"])")->Slice(1),
+  };
+
+  std::string expected_encoded_vals[] = {
+      DeltaEncode({})->ToString(),
+      DeltaEncode({3, 2, 0})->ToString() + "abcde",
+      DeltaEncode({0, 0, 0})->ToString(),
+      DeltaEncode({0, 3})->ToString() + "xyz",
+  };
+
+  auto encoder = MakeTypedEncoder<ByteArrayType>(Encoding::DELTA_LENGTH_BYTE_ARRAY,
+                                                 /*use_dictionary=*/false);
+  auto decoder = MakeTypedDecoder<ByteArrayType>(Encoding::DELTA_LENGTH_BYTE_ARRAY);
+  for (int i = 0; i < 4; ++i) {
+    auto array = cases[i];
+    auto expected_encoded = expected_encoded_vals[i];
+
+    encoder->Put(*array);
+    std::shared_ptr<Buffer> buffer = encoder->FlushValues();
+
+    ASSERT_EQ(expected_encoded, buffer->ToString());
+
+    int num_values = static_cast<int>(array->length() - array->null_count());
+    decoder->SetData(num_values, buffer->data(), static_cast<int>(buffer->size()));
+
+    typename EncodingTraits<ByteArrayType>::Accumulator acc;
+    acc.builder.reset(new ::arrow::StringBuilder);
+
+    int values_decoded = decoder->DecodeArrow(num_values, 0, nullptr, 0, &acc);
+    ASSERT_EQ(values_decoded, num_values);
+
+    std::shared_ptr<::arrow::Array> roundtripped_array;
+    ASSERT_OK(acc.builder->Finish(&roundtripped_array));
+    ASSERT_ARRAYS_EQUAL(*array, *roundtripped_array);
+  }
+}
+
+TEST(DeltaLengthByteArrayEncoding, RejectBadBuffer) {

Review Comment:
   I'd welcome any other ideas for bad buffers we should try.



##########
cpp/src/parquet/encoding_test.cc:
##########
@@ -1759,12 +1832,10 @@ TEST(DeltaLengthByteArrayEncodingAdHoc, ArrowDirectPut) {
   CheckEncode(::arrow::ArrayFromJSON(::arrow::binary(), values), lengths);
   CheckEncode(::arrow::ArrayFromJSON(::arrow::large_binary(), values), lengths);
 
-  const uint8_t data[] = {
-      0x80, 0x01, 0x04, 0x04, 0x0a, 0x00, 0x01, 0x00, 0x00, 0x00, 0x02, 0x00,
-      0x00, 0x00, 0x48, 0x65, 0x6c, 0x6c, 0x6f, 0x57, 0x6f, 0x72, 0x6c, 0x64,
-      0x46, 0x6f, 0x6f, 0x62, 0x61, 0x72, 0x41, 0x44, 0x42, 0x43, 0x45, 0x46,
-  };
-  auto encoded = Buffer::Wrap(data, sizeof(data));
+  auto encoded =
+      ::arrow::ConcatenateBuffers(
+          {DeltaEncode({5, 5, 6, 6}), std::make_shared<Buffer>("HelloWorldFoobarADBCEF")})
+          .ValueOrDie();

Review Comment:
   Added for readability.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rok commented on a diff in pull request #34668: GH-34667: [C++][Parquet] Test DeltaLengthByteArrayDecoder with invalid inputs

Posted by "rok (via GitHub)" <gi...@apache.org>.
rok commented on code in PR #34668:
URL: https://github.com/apache/arrow/pull/34668#discussion_r1144125532


##########
cpp/src/parquet/encoding_test.cc:
##########
@@ -1639,6 +1639,79 @@ TYPED_TEST(TestDeltaLengthByteArrayEncoding, BasicRoundTrip) {
       /*null_probability*/ 0.1));
 }
 
+std::shared_ptr<Buffer> DeltaEncode(std::vector<int32_t> lengths) {
+  auto encoder = MakeTypedEncoder<Int32Type>(Encoding::DELTA_BINARY_PACKED);
+  encoder->Put(lengths.data(), static_cast<int>(lengths.size()));
+  return encoder->FlushValues();
+}
+
+TEST(TestDeltaLengthByteArrayEncoding, AdHocRoundTrip) {
+  const std::shared_ptr<::arrow::Array> cases[] = {
+      ::arrow::ArrayFromJSON(::arrow::utf8(), R"([])"),
+      ::arrow::ArrayFromJSON(::arrow::utf8(), R"(["abc", "de", ""])"),
+      ::arrow::ArrayFromJSON(::arrow::utf8(), R"(["", "", ""])"),
+      ::arrow::ArrayFromJSON(::arrow::utf8(), R"(["abc", "", "xyz"])")->Slice(1),
+  };
+
+  std::string expected_encoded_vals[] = {
+      DeltaEncode({})->ToString(),
+      DeltaEncode({3, 2, 0})->ToString() + "abcde",
+      DeltaEncode({0, 0, 0})->ToString(),
+      DeltaEncode({0, 3})->ToString() + "xyz",
+  };
+
+  auto encoder = MakeTypedEncoder<ByteArrayType>(Encoding::DELTA_LENGTH_BYTE_ARRAY,
+                                                 /*use_dictionary=*/false);
+  auto decoder = MakeTypedDecoder<ByteArrayType>(Encoding::DELTA_LENGTH_BYTE_ARRAY);
+  for (int i = 0; i < 4; ++i) {
+    auto array = cases[i];
+    auto expected_encoded = expected_encoded_vals[i];
+
+    encoder->Put(*array);
+    std::shared_ptr<Buffer> buffer = encoder->FlushValues();
+
+    ASSERT_EQ(expected_encoded, buffer->ToString());
+
+    int num_values = static_cast<int>(array->length() - array->null_count());
+    decoder->SetData(num_values, buffer->data(), static_cast<int>(buffer->size()));
+
+    typename EncodingTraits<ByteArrayType>::Accumulator acc;
+    acc.builder.reset(new ::arrow::StringBuilder);
+
+    int values_decoded = decoder->DecodeArrow(num_values, 0, nullptr, 0, &acc);
+    ASSERT_EQ(values_decoded, num_values);
+
+    std::shared_ptr<::arrow::Array> roundtripped_array;
+    ASSERT_OK(acc.builder->Finish(&roundtripped_array));
+    ASSERT_ARRAYS_EQUAL(*array, *roundtripped_array);
+  }
+}
+
+TEST(DeltaLengthByteArrayEncoding, RejectBadBuffer) {

Review Comment:
   Perhaps we should also check for `num_values` or `len` being negative as well?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wjones127 commented on pull request #34668: GH-34667: [C++][Parquet] Test DeltaLengthByteArrayDecoder with invalid inputs

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on PR #34668:
URL: https://github.com/apache/arrow/pull/34668#issuecomment-1480212638

   So I found that the DeltaBitPackDecoder will error if the header is missing, so I think it makes sense for us to also error if it's missing at the beginning of DeltaLengthByteArray. To accomplish this, I actually deleted the branch, so it can get to the `DecodeLengths()` which already handles this error:
    https://github.com/wjones127/arrow/blob/f59fe0a2267b12069586e7c6b0351c48bf296371/cpp/src/parquet/encoding.cc#L2424


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wjones127 commented on pull request #34668: GH-34667: [C++][Parquet] Test DeltaLengthByteArrayDecoder with invalid inputs

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on PR #34668:
URL: https://github.com/apache/arrow/pull/34668#issuecomment-1480178540

   @mapleFU 
   > Would you mind providing me the input file or test case? Or where does this segfault? I think the program would be prevent from fatal in somewhere
   
   You are right it didn't actually crash, it just caused a buffer overflow, which tripped the sanitizers. This was discovered in our fuzzing CI.
   
   Thanks for providing the test. It doesn't actually trigger the codepath for `len <= 0` since there is the header for the encoded lengths present:
   
   ```
   "\x80\U00000001\U00000004\0\0"
   ```
   
   That does make me wonder whether we should reject an empty buffer as corrupt (missing required header?). But also just loading no values makes sense too.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rok commented on a diff in pull request #34668: GH-34667: [C++][Parquet] Test DeltaLengthByteArrayDecoder with invalid inputs

Posted by "rok (via GitHub)" <gi...@apache.org>.
rok commented on code in PR #34668:
URL: https://github.com/apache/arrow/pull/34668#discussion_r1144118212


##########
cpp/src/parquet/encoding.cc:
##########
@@ -2708,7 +2708,14 @@ class DeltaLengthByteArrayDecoder : public DecoderImpl,
 
   void SetData(int num_values, const uint8_t* data, int len) override {
     num_values_ = num_values;
-    if (len == 0) return;
+    if (len == 0) {
+      if (num_values > 0) {
+        throw ParquetException(
+            "Delta length byte array page was empty, but expected {num_values} values.");

Review Comment:
   Does this print `num_values`? Perhaps we can check the message with `EXPECT_THROW_THAT`



##########
cpp/src/parquet/encoding_test.cc:
##########
@@ -1759,12 +1832,10 @@ TEST(DeltaLengthByteArrayEncodingAdHoc, ArrowDirectPut) {
   CheckEncode(::arrow::ArrayFromJSON(::arrow::binary(), values), lengths);
   CheckEncode(::arrow::ArrayFromJSON(::arrow::large_binary(), values), lengths);
 
-  const uint8_t data[] = {
-      0x80, 0x01, 0x04, 0x04, 0x0a, 0x00, 0x01, 0x00, 0x00, 0x00, 0x02, 0x00,
-      0x00, 0x00, 0x48, 0x65, 0x6c, 0x6c, 0x6f, 0x57, 0x6f, 0x72, 0x6c, 0x64,
-      0x46, 0x6f, 0x6f, 0x62, 0x61, 0x72, 0x41, 0x44, 0x42, 0x43, 0x45, 0x46,
-  };
-  auto encoded = Buffer::Wrap(data, sizeof(data));
+  auto encoded =
+      ::arrow::ConcatenateBuffers(
+          {DeltaEncode({5, 5, 6, 6}), std::make_shared<Buffer>("HelloWorldFoobarADBCEF")})
+          .ValueOrDie();

Review Comment:
   Much more readable now :D



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #34668: GH-34667: [C++][Parquet] Test DeltaLengthByteArrayDecoder with invalid inputs

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #34668:
URL: https://github.com/apache/arrow/pull/34668#issuecomment-1478845203

   The `num_values` is a tricky word, num_values is from DataPageHeader, and means that `num of non-null value + num of null values`. So it includes null value.
   
   In DeltaLengthByteArray, seems real num_values can comes from DELTA encoding.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #34668: GH-34667: [C++][Parquet] Test DeltaLengthByteArrayDecoder with invalid inputs

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #34668:
URL: https://github.com/apache/arrow/pull/34668#issuecomment-1478981972

   And I think you can add a test case:
   
   ```
   diff --git a/cpp/src/parquet/encoding_test.cc b/cpp/src/parquet/encoding_test.cc
   index 7dcee1152..31ebd4188 100644
   --- a/cpp/src/parquet/encoding_test.cc
   +++ b/cpp/src/parquet/encoding_test.cc
   @@ -1713,7 +1713,7 @@ class TestDeltaLengthByteArrayEncoding : public TestEncodingBase<Type> {
    
        encoder->PutSpaced(draws_, num_values_, valid_bits, valid_bits_offset);
        encode_buffer_ = encoder->FlushValues();
   -    decoder->SetData(num_values_ - null_count, encode_buffer_->data(),
   +    decoder->SetData(num_values_, encode_buffer_->data(),
   :
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #34668: GH-34667: [C++][Parquet] Test DeltaLengthByteArrayDecoder with invalid inputs

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #34668:
URL: https://github.com/apache/arrow/pull/34668#issuecomment-1478853200

   @wjones127 Would you mind providing me the input file or test case? Or where does this segfault? I think the program would be prevent from fatal in somewhere


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wjones127 merged pull request #34668: GH-34667: [C++][Parquet] Test DeltaLengthByteArrayDecoder with invalid inputs

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 merged PR #34668:
URL: https://github.com/apache/arrow/pull/34668


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rok commented on a diff in pull request #34668: GH-34667: [C++][Parquet] Test DeltaLengthByteArrayDecoder with invalid inputs

Posted by "rok (via GitHub)" <gi...@apache.org>.
rok commented on code in PR #34668:
URL: https://github.com/apache/arrow/pull/34668#discussion_r1144110440


##########
cpp/src/parquet/encoding.cc:
##########
@@ -2708,7 +2708,14 @@ class DeltaLengthByteArrayDecoder : public DecoderImpl,
 
   void SetData(int num_values, const uint8_t* data, int len) override {
     num_values_ = num_values;
-    if (len == 0) return;
+    if (len == 0) {

Review Comment:
   ```suggestion
       if (ARROW_PREDICT_FALSE(len == 0)) {
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #34668: GH-34667: [C++][Parquet] Test DeltaLengthByteArrayDecoder with invalid inputs

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #34668:
URL: https://github.com/apache/arrow/pull/34668#issuecomment-1482227769

   Benchmark runs are scheduled for baseline = e5e33395c9ec7eb8abeb2decf040de72fb40b39e and contender = 5bbaf6bade2232a0d60e912defb97579a11db83f. 5bbaf6bade2232a0d60e912defb97579a11db83f is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/03883233e0e1408ab2e9a9ac287fa465...4ec227782738474eb10e3d316022686a/)
   [Failed :arrow_down:0.39% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/7d7d3d79e1944c52bbcbc5c3227314fc...7e7a3326e2a84729a774d99757350bf6/)
   [Finished :arrow_down:2.04% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/c95c0c55dafc4e78b6e519b6d67b5713...62560cfb406b4aaaa517c2c5e4141103/)
   [Finished :arrow_down:0.06% :arrow_up:0.0%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/d7b52359f50644178b222dd59f0edd79...481f7ef438cd4b0881d4e7e4213d6aba/)
   Buildkite builds:
   [Finished] [`5bbaf6ba` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2566)
   [Failed] [`5bbaf6ba` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2596)
   [Finished] [`5bbaf6ba` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2564)
   [Finished] [`5bbaf6ba` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2587)
   [Finished] [`e5e33395` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2565)
   [Failed] [`e5e33395` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2595)
   [Finished] [`e5e33395` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2563)
   [Finished] [`e5e33395` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2586)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] emkornfield commented on a diff in pull request #34668: GH-34667: [C++][Parquet] Test DeltaLengthByteArrayDecoder with invalid inputs

Posted by "emkornfield (via GitHub)" <gi...@apache.org>.
emkornfield commented on code in PR #34668:
URL: https://github.com/apache/arrow/pull/34668#discussion_r1144042752


##########
cpp/src/parquet/encoding.cc:
##########
@@ -2723,6 +2730,7 @@ class DeltaLengthByteArrayDecoder : public DecoderImpl,
     // Decode up to `max_values` strings into an internal buffer
     // and reference them into `buffer`.
     max_values = std::min(max_values, num_valid_values_);
+    DCHECK_GE(max_values, 0);

Review Comment:
   should this be an exception?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #34668: GH-34667: [C++][Parquet] Test DeltaLengthByteArrayDecoder with invalid inputs

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #34668:
URL: https://github.com/apache/arrow/pull/34668#issuecomment-1478650561

   * Closes: #34667


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org