Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/12/17 00:36:36 UTC

[GitHub] [arrow] tachyonwill opened a new pull request #11984: PARQUET-2109: Check if Parquet page has too few values

tachyonwill opened a new pull request #11984:
URL: https://github.com/apache/arrow/pull/11984


   The column reader uses the value count reported in the page header to
   decide whether more values remain to be read. However, a corrupted page
   header might overstate the number of values. This can cause an infinite
   loop when reading a column with something like:
   while (reader.HasNext()) {
     reader.ReadBatch(...);
   }
   
   Ideally HasNext() would return false in these cases, but that seems
   non-trivial. Instead, we change ReadBatch to throw an exception in these
   cases.
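
A minimal sketch of the failure mode described above, using a toy reader rather than the real Parquet classes. ToyPageReader, Drain, and the throw_on_empty_read flag are hypothetical names introduced only for illustration; the iteration cap exists solely so that the unfixed case terminates in a demo.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-in for a column reader over one corrupted page.
// None of these names are the Arrow/Parquet API.
struct ToyPageReader {
  int64_t header_values;     // value count claimed by the page header
  int64_t actual_values;     // values really present in the page body
  int64_t decoded;           // values handed out so far
  bool throw_on_empty_read;  // true models the behavior proposed here

  // Trusts the header count, so it keeps returning true on a short page.
  bool HasNext() const { return decoded < header_values; }

  // Returns how many values were read; optionally throws on an empty read.
  int64_t ReadBatch(int64_t batch_size) {
    assert(batch_size > 0);
    int64_t available = std::max<int64_t>(actual_values - decoded, 0);
    int64_t n = std::min(batch_size, available);
    if (n == 0 && throw_on_empty_read) {
      throw std::runtime_error("Read 0 values");
    }
    decoded += n;
    return n;
  }
};

// Runs the HasNext()/ReadBatch() loop, giving up after max_iterations.
// Returns the number of iterations actually executed.
int64_t Drain(ToyPageReader& reader, int64_t max_iterations) {
  int64_t iterations = 0;
  while (reader.HasNext() && iterations < max_iterations) {
    reader.ReadBatch(4);
    ++iterations;
  }
  return iterations;
}
```

With a header claiming 10 values and only 6 present, the unguarded reader spins until the cap is hit, while the guarded one throws on the first empty read.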


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #11984: PARQUET-2109: [C++] Check if Parquet page has too few values

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #11984:
URL: https://github.com/apache/arrow/pull/11984#discussion_r779308089



##########
File path: cpp/src/parquet/column_reader.cc
##########
@@ -993,6 +996,9 @@ int64_t TypedColumnReaderImpl<DType>::ReadBatch(int64_t batch_size, int16_t* def
 
   *values_read = this->ReadValues(values_to_read, values);
   int64_t total_values = std::max(num_def_levels, *values_read);
+  if (total_values == 0) {
+    ParquetException::EofException("Read 0 values");

Review comment:
       Is it possible to add the number of additional values expected, or is it not knowable at this point? Are there issues if somehow a data page/row group has different values?







[GitHub] [arrow] ursabot edited a comment on pull request #11984: PARQUET-2109: [C++] Check if Parquet page has too few values

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #11984:
URL: https://github.com/apache/arrow/pull/11984#issuecomment-1012117682


   Benchmark runs are scheduled for baseline = d67a210b8c50b1d109e3d7780591d010e94cc9cc and contender = 77fc23fcae0331da3adf94619a381a371a6e414f. 77fc23fcae0331da3adf94619a381a371a6e414f is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/760418bc15784f6a9336c73c15968402...198ee6a2eca74809902e856f30e71d88/)
   [Failed :arrow_down:0.0% :arrow_up:0.45%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/80cbbc68b21c4fef84126c03d3073090...fca90948d95b4c958825d9d60099adab/)
   [Finished :arrow_down:2.42% :arrow_up:0.04%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/c833ede75dcf4a85ab852b02cc4b4abb...12c3816bdd5b49fa921a30ae4cf39c65/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   





[GitHub] [arrow] ursabot commented on pull request #11984: PARQUET-2109: [C++] Check if Parquet page has too few values

Posted by GitBox <gi...@apache.org>.
ursabot commented on pull request #11984:
URL: https://github.com/apache/arrow/pull/11984#issuecomment-1012117682


   Benchmark runs are scheduled for baseline = d67a210b8c50b1d109e3d7780591d010e94cc9cc and contender = 77fc23fcae0331da3adf94619a381a371a6e414f. 77fc23fcae0331da3adf94619a381a371a6e414f is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Scheduled] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/760418bc15784f6a9336c73c15968402...198ee6a2eca74809902e856f30e71d88/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/80cbbc68b21c4fef84126c03d3073090...fca90948d95b4c958825d9d60099adab/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/c833ede75dcf4a85ab852b02cc4b4abb...12c3816bdd5b49fa921a30ae4cf39c65/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   





[GitHub] [arrow] ursabot edited a comment on pull request #11984: PARQUET-2109: [C++] Check if Parquet page has too few values

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #11984:
URL: https://github.com/apache/arrow/pull/11984#issuecomment-1012117682


   Benchmark runs are scheduled for baseline = d67a210b8c50b1d109e3d7780591d010e94cc9cc and contender = 77fc23fcae0331da3adf94619a381a371a6e414f. 77fc23fcae0331da3adf94619a381a371a6e414f is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/760418bc15784f6a9336c73c15968402...198ee6a2eca74809902e856f30e71d88/)
   [Failed :arrow_down:0.0% :arrow_up:0.45%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/80cbbc68b21c4fef84126c03d3073090...fca90948d95b4c958825d9d60099adab/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/c833ede75dcf4a85ab852b02cc4b4abb...12c3816bdd5b49fa921a30ae4cf39c65/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   





[GitHub] [arrow] pitrou closed pull request #11984: PARQUET-2109: [C++] Check if Parquet page has too few values

Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #11984:
URL: https://github.com/apache/arrow/pull/11984


   





[GitHub] [arrow] github-actions[bot] commented on pull request #11984: PARQUET-2109: Check if Parquet page has too few values

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #11984:
URL: https://github.com/apache/arrow/pull/11984#issuecomment-996308796









[GitHub] [arrow] tachyonwill commented on a change in pull request #11984: PARQUET-2109: [C++] Check if Parquet page has too few values

Posted by GitBox <gi...@apache.org>.
tachyonwill commented on a change in pull request #11984:
URL: https://github.com/apache/arrow/pull/11984#discussion_r779722792



##########
File path: cpp/src/parquet/column_reader.cc
##########
@@ -993,6 +996,9 @@ int64_t TypedColumnReaderImpl<DType>::ReadBatch(int64_t batch_size, int16_t* def
 
   *values_read = this->ReadValues(values_to_read, values);
   int64_t total_values = std::max(num_def_levels, *values_read);
+  if (total_values == 0) {
+    ParquetException::EofException("Read 0 values");

Review comment:
       I would expect that we should see min(batch_size, num_buffered_values_ - num_decoded_values_) values. I will add this to the error message.
   
   I don't entirely understand your second question. If a page has a number of values inconsistent with what is in the header, bad things can happen. This PR addresses one of those scenarios, where the reported number is too high, causing an infinite loop. There are other scenarios, however, that I am not addressing. For example, if the reported number of values is lower than what actually exists, we can drop them.
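
A rough sketch of how the expected count from the first paragraph could be folded into the error text. EmptyReadMessage is a hypothetical helper, and the message wording is an assumption rather than the text that was actually committed; the parameters mirror the num_buffered_values_ and num_decoded_values_ counters named above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical helper: builds the message for a read that returned nothing.
// The expected count is min(batch_size, buffered - decoded), as stated above.
std::string EmptyReadMessage(int64_t batch_size, int64_t num_buffered_values,
                             int64_t num_decoded_values) {
  assert(num_decoded_values <= num_buffered_values);
  int64_t expected =
      std::min(batch_size, num_buffered_values - num_decoded_values);
  std::stringstream ss;
  ss << "Read 0 values, expected " << expected;
  return ss.str();
}
```

The caller would pass the resulting string to whatever exception it raises, so the report shows how far the page fell short of its header.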







[GitHub] [arrow] tachyonwill commented on pull request #11984: PARQUET-2109: Check if Parquet page has too few values

Posted by GitBox <gi...@apache.org>.
tachyonwill commented on pull request #11984:
URL: https://github.com/apache/arrow/pull/11984#issuecomment-996343072


   Another option is to have ReadBatch set num_decoded_values_ to num_buffered_values_ in these cases. It would break the loop and allow us to accept these malformed pages by just taking what values are there. 
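
A small sketch of that lenient alternative, assuming a pair of counters shaped like the ones named above. ToyCounters and ConsumeLenient are hypothetical and only model the bookkeeping, not the real reader.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical counters modeled on num_buffered_values_ / num_decoded_values_.
struct ToyCounters {
  int64_t num_buffered_values;  // count claimed by the page header
  int64_t num_decoded_values;   // values consumed so far
};

// Records the result of one read and returns how many values the reader
// still believes are pending. An empty read marks the page as exhausted
// instead of raising an error, so a HasNext()-style check would move on.
int64_t ConsumeLenient(ToyCounters& c, int64_t values_read) {
  assert(values_read >= 0);
  if (values_read == 0) {
    // The page ran dry early: treat the phantom values as already decoded.
    c.num_decoded_values = c.num_buffered_values;
  } else {
    c.num_decoded_values += values_read;
  }
  return c.num_buffered_values - c.num_decoded_values;
}
```

The trade-off is that a malformed page is silently accepted, which is why the choice between this and throwing is raised as a question in the review.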





[GitHub] [arrow] tachyonwill commented on a change in pull request #11984: PARQUET-2109: Check if Parquet page has too few values

Posted by GitBox <gi...@apache.org>.
tachyonwill commented on a change in pull request #11984:
URL: https://github.com/apache/arrow/pull/11984#discussion_r775573699



##########
File path: cpp/src/parquet/column_reader.cc
##########
@@ -940,7 +940,7 @@ int64_t TypedColumnReaderImpl<DType>::ReadBatchWithDictionary(
     int64_t* indices_read, const T** dict, int32_t* dict_len) {
   bool has_dict_output = dict != nullptr && dict_len != nullptr;
   // Similar logic as ReadValues to get pages.
-  if (!HasNext()) {
+  if (batch_size == 0 || !HasNext()) {

Review comment:
       We should probably support this. I think the more important question is whether we should throw an exception for bad pages or just skip over the phantom values. Parquet-mr seems to do the latter but does other validation on the reported value counts. It would also make the batch size 0 case easier.







[GitHub] [arrow] emkornfield commented on a change in pull request #11984: PARQUET-2109: [C++] Check if Parquet page has too few values

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #11984:
URL: https://github.com/apache/arrow/pull/11984#discussion_r781479274



##########
File path: cpp/src/parquet/column_reader.cc
##########
@@ -993,6 +996,9 @@ int64_t TypedColumnReaderImpl<DType>::ReadBatch(int64_t batch_size, int16_t* def
 
   *values_read = this->ReadValues(values_to_read, values);
   int64_t total_values = std::max(num_def_levels, *values_read);
+  if (total_values == 0) {
+    ParquetException::EofException("Read 0 values");

Review comment:
       Sorry, for the second question I meant to ask about the case where it has 0 values. I think you answered this below.







[GitHub] [arrow] tachyonwill commented on a change in pull request #11984: PARQUET-2109: [C++] Check if Parquet page has too few values

Posted by GitBox <gi...@apache.org>.
tachyonwill commented on a change in pull request #11984:
URL: https://github.com/apache/arrow/pull/11984#discussion_r776498284



##########
File path: cpp/src/parquet/column_reader.cc
##########
@@ -940,7 +940,7 @@ int64_t TypedColumnReaderImpl<DType>::ReadBatchWithDictionary(
     int64_t* indices_read, const T** dict, int32_t* dict_len) {
   bool has_dict_output = dict != nullptr && dict_len != nullptr;
   // Similar logic as ReadValues to get pages.
-  if (!HasNext()) {
+  if (batch_size == 0 || !HasNext()) {

Review comment:
       Changed check to allow batch size 0 when reading dict.







[GitHub] [arrow] tachyonwill commented on a change in pull request #11984: PARQUET-2109: [C++] Check if Parquet page has too few values

Posted by GitBox <gi...@apache.org>.
tachyonwill commented on a change in pull request #11984:
URL: https://github.com/apache/arrow/pull/11984#discussion_r779717505



##########
File path: cpp/src/parquet/column_reader.cc
##########
@@ -970,6 +970,9 @@ int64_t TypedColumnReaderImpl<DType>::ReadBatchWithDictionary(
   // Read dictionary indices.
   *indices_read = ReadDictionaryIndices(indices_to_read, indices);
   int64_t total_indices = std::max(num_def_levels, *indices_read);
+  if (total_indices == 0 && batch_size != 0) {
+    ParquetException::EofException("Read 0 values");

Review comment:
       The PR doesn't change the behavior on length 0 pages (assuming the page is correctly formed). At the start of the ReadBatch* methods, HasNext() is called and we gracefully bail out if it returns false. Size 0 pages will cause HasNext() to return false, hence we stop. Is this the right thing to do? I don't know. It can cause weird behavior and, looking at some parquet-mr JIRAs, size 0 pages might not be entirely legal.
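
The order of checks described here, together with the condition in the hunk above, can be sketched as a pure function. ReadOutcome and Classify are hypothetical names used only to make the three cases explicit; the real methods return counts and throw rather than returning an enum.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical labels for the three ways a ReadBatch*-style call can end.
enum class ReadOutcome { kNoMoreData, kOk, kEof };

// has_next:      what a HasNext()-style check reported before reading
// batch_size:    number of values the caller asked for
// total_indices: max(levels read, indices read), as in the hunk above
ReadOutcome Classify(bool has_next, int64_t batch_size, int64_t total_indices) {
  assert(batch_size >= 0 && total_indices >= 0);
  if (!has_next) {
    // Graceful bail-out, which also covers well-formed size 0 pages.
    return ReadOutcome::kNoMoreData;
  }
  if (total_indices == 0 && batch_size != 0) {
    // Values were requested and the header promised them, but none arrived.
    return ReadOutcome::kEof;
  }
  // Includes batch_size == 0, where reading nothing is expected.
  return ReadOutcome::kOk;
}
```

Keeping batch_size != 0 in the second condition is what lets a zero-sized request pass through without being mistaken for a truncated page.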







[GitHub] [arrow] pitrou commented on pull request #11984: PARQUET-2109: [C++] Check if Parquet page has too few values

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #11984:
URL: https://github.com/apache/arrow/pull/11984#issuecomment-1012111524


   Merged, thanks for the review @emkornfield !





[GitHub] [arrow] emkornfield commented on a change in pull request #11984: PARQUET-2109: [C++] Check if Parquet page has too few values

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #11984:
URL: https://github.com/apache/arrow/pull/11984#discussion_r779308341



##########
File path: cpp/src/parquet/column_reader.cc
##########
@@ -970,6 +970,9 @@ int64_t TypedColumnReaderImpl<DType>::ReadBatchWithDictionary(
   // Read dictionary indices.
   *indices_read = ReadDictionaryIndices(indices_to_read, indices);
   int64_t total_indices = std::max(num_def_levels, *indices_read);
+  if (total_indices == 0 && batch_size != 0) {
+    ParquetException::EofException("Read 0 values");

Review comment:
       Same comment as below about edge cases with zero-length data pages.







[GitHub] [arrow] maqister commented on a change in pull request #11984: PARQUET-2109: Check if Parquet page has too few values

Posted by GitBox <gi...@apache.org>.
maqister commented on a change in pull request #11984:
URL: https://github.com/apache/arrow/pull/11984#discussion_r775112597



##########
File path: cpp/src/parquet/column_reader.cc
##########
@@ -940,7 +940,7 @@ int64_t TypedColumnReaderImpl<DType>::ReadBatchWithDictionary(
     int64_t* indices_read, const T** dict, int32_t* dict_len) {
   bool has_dict_output = dict != nullptr && dict_len != nullptr;
   // Similar logic as ReadValues to get pages.
-  if (!HasNext()) {
+  if (batch_size == 0 || !HasNext()) {

Review comment:
       This change breaks a use case at my company where we call this API with batch_size = 0 explicitly, just to obtain the dictionary alone.
   It is OK from our perspective to change it, as we can just add a dedicated API to obtain the dictionary in our fork of the code.
   
   I am just bringing this up in case other devs are impacted by the public API change.
   
   https://www.hyrumslaw.com/








[GitHub] [arrow] tachyonwill commented on a change in pull request #11984: PARQUET-2109: [C++] Check if Parquet page has too few values

Posted by GitBox <gi...@apache.org>.
tachyonwill commented on a change in pull request #11984:
URL: https://github.com/apache/arrow/pull/11984#discussion_r781554824



##########
File path: cpp/src/parquet/column_reader.cc
##########
@@ -993,6 +996,9 @@ int64_t TypedColumnReaderImpl<DType>::ReadBatch(int64_t batch_size, int16_t* def
 
   *values_read = this->ReadValues(values_to_read, values);
   int64_t total_values = std::max(num_def_levels, *values_read);
+  if (total_values == 0) {
+    ParquetException::EofException("Read 0 values");

Review comment:
       OK, yes. I don't think page size 0 was a problem (or legal), but I updated the code to check for the case.







[GitHub] [arrow] maqister commented on a change in pull request #11984: PARQUET-2109: Check if Parquet page has too few values

Posted by GitBox <gi...@apache.org>.
maqister commented on a change in pull request #11984:
URL: https://github.com/apache/arrow/pull/11984#discussion_r775112597



##########
File path: cpp/src/parquet/column_reader.cc
##########
@@ -940,7 +940,7 @@ int64_t TypedColumnReaderImpl<DType>::ReadBatchWithDictionary(
     int64_t* indices_read, const T** dict, int32_t* dict_len) {
   bool has_dict_output = dict != nullptr && dict_len != nullptr;
   // Similar logic as ReadValues to get pages.
-  if (!HasNext()) {
+  if (batch_size == 0 || !HasNext()) {

Review comment:
       This change breaks a use case at my company where we call this API with batch_size = 0 explicitly, just to obtain the dictionary alone. We use it when loading .parquet files into our proprietary in-memory column store.
   
   It is OK from our perspective to change it, as we can just add a dedicated API to obtain the dictionary in our fork of the code.
   
   I am just bringing this up in case other devs are impacted by the public API change.
   
   https://www.hyrumslaw.com/







[GitHub] [arrow] emkornfield commented on a change in pull request #11984: PARQUET-2109: [C++] Check if Parquet page has too few values

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #11984:
URL: https://github.com/apache/arrow/pull/11984#discussion_r779308172



##########
File path: cpp/src/parquet/column_reader.cc
##########
@@ -980,7 +983,7 @@ int64_t TypedColumnReaderImpl<DType>::ReadBatch(int64_t batch_size, int16_t* def
                                                 int16_t* rep_levels, T* values,
                                                 int64_t* values_read) {
   // HasNext invokes ReadNewPage
-  if (!HasNext()) {
+  if (batch_size == 0 || !HasNext()) {

Review comment:
       Maybe add a comment documenting the use case for batch_size == 0 here.







[GitHub] [arrow] tachyonwill commented on a change in pull request #11984: PARQUET-2109: [C++] Check if Parquet page has too few values

Posted by GitBox <gi...@apache.org>.
tachyonwill commented on a change in pull request #11984:
URL: https://github.com/apache/arrow/pull/11984#discussion_r779722792



##########
File path: cpp/src/parquet/column_reader.cc
##########
@@ -993,6 +996,9 @@ int64_t TypedColumnReaderImpl<DType>::ReadBatch(int64_t batch_size, int16_t* def
 
   *values_read = this->ReadValues(values_to_read, values);
   int64_t total_values = std::max(num_def_levels, *values_read);
+  if (total_values == 0) {
+    ParquetException::EofException("Read 0 values");

Review comment:
       I would expect that we should see min(batch_size, num_buffered_values_ - num_decoded_values_) values. I will add this to the error message.
   
   I don't entirely understand your second question. If a page has a number of values inconsistent with what is in the header, bad things can happen. This PR addresses one of those scenarios, where the reported number is too high, causing an infinite loop. There are other scenarios, however, that I am not addressing. For example, if the reported number of values is lower than what actually exists, we can skip them.







[GitHub] [arrow] ursabot edited a comment on pull request #11984: PARQUET-2109: [C++] Check if Parquet page has too few values

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #11984:
URL: https://github.com/apache/arrow/pull/11984#issuecomment-1012117682


   Benchmark runs are scheduled for baseline = d67a210b8c50b1d109e3d7780591d010e94cc9cc and contender = 77fc23fcae0331da3adf94619a381a371a6e414f. 77fc23fcae0331da3adf94619a381a371a6e414f is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/760418bc15784f6a9336c73c15968402...198ee6a2eca74809902e856f30e71d88/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/80cbbc68b21c4fef84126c03d3073090...fca90948d95b4c958825d9d60099adab/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/c833ede75dcf4a85ab852b02cc4b4abb...12c3816bdd5b49fa921a30ae4cf39c65/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   





[GitHub] [arrow] maqister commented on a change in pull request #11984: PARQUET-2109: Check if Parquet page has too few values

Posted by GitBox <gi...@apache.org>.
maqister commented on a change in pull request #11984:
URL: https://github.com/apache/arrow/pull/11984#discussion_r775112597



##########
File path: cpp/src/parquet/column_reader.cc
##########
@@ -940,7 +940,7 @@ int64_t TypedColumnReaderImpl<DType>::ReadBatchWithDictionary(
     int64_t* indices_read, const T** dict, int32_t* dict_len) {
   bool has_dict_output = dict != nullptr && dict_len != nullptr;
   // Similar logic as ReadValues to get pages.
-  if (!HasNext()) {
+  if (batch_size == 0 || !HasNext()) {

Review comment:
       This change breaks a use case at my company where we call this API with batch_size = 0 explicitly, just to obtain the dictionary alone. We use it in our flow for loading .parquet files into a proprietary in-memory column store.
   
   It is OK from our perspective to change it, as we can just add a dedicated API to obtain the dictionary in our fork of the code.
   
   I am just bringing this up in case other devs are impacted by the public API change.
   
   https://www.hyrumslaw.com/



