Posted to github@arrow.apache.org by "YoungRX (via GitHub)" <gi...@apache.org> on 2023/02/24 02:52:33 UTC

[GitHub] [arrow] YoungRX commented on issue #34313: [C++] CountRows() in ParquetFileFormat class is unreasonable

YoungRX commented on issue #34313:
URL: https://github.com/apache/arrow/issues/34313#issuecomment-1442724440

   > I think id < 12345 actually scans two row groups, the correct result for CountRows should be 20000
   > Shouldn't the correct result be 12344?
   
   Yes, the final result of the scan should be 12344 rows. But as you said, Parquet predicate pushdown filters at row-group granularity, so for `id < 12345` we actually scan two row groups. `ParquetFileFormat::CountRows` calls `ParquetFileFragment::FilterRowGroups` to select those two row groups, which together hold 20000 rows, and that is the value `ParquetFileFormat::CountRows` should return.
   
   In other words, `ParquetFileFormat::CountRows` counts the rows that remain after predicate pushdown filtering, not the rows that match the predicate exactly.
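   
   To make the difference between the two counts concrete, here is a minimal standalone sketch. It does not use the Arrow API: `RowGroupStats`, `MakeRowGroups`, `CountRowsInSelectedRowGroups` and `CountRowsExactly` are hypothetical names, and the data layout (ids 1..30000 stored in order, 10000 rows per row group) is only an assumption chosen to match the numbers in this thread.
   
   ```cpp
   #include <algorithm>
   #include <cassert>
   #include <cstdint>
   #include <vector>
   
   // Hypothetical stand-in for the per-row-group statistics of column `id`.
   struct RowGroupStats {
     int64_t min_id;
     int64_t max_id;
     int64_t num_rows;
   };
   
   // Assumed layout: ids 1..total_rows, stored in order, split into row groups
   // of at most rows_per_group rows.
   std::vector<RowGroupStats> MakeRowGroups(int64_t total_rows, int64_t rows_per_group) {
     assert(total_rows > 0 && rows_per_group > 0);
     std::vector<RowGroupStats> groups;
     for (int64_t first = 1; first <= total_rows; first += rows_per_group) {
       const int64_t last = std::min(first + rows_per_group - 1, total_rows);
       groups.push_back({first, last, last - first + 1});
     }
     return groups;
   }
   
   // Row-group-level count for `id < bound`: a row group is kept in full
   // whenever its statistics cannot rule the predicate out.
   int64_t CountRowsInSelectedRowGroups(const std::vector<RowGroupStats>& groups,
                                        int64_t bound) {
     int64_t total = 0;
     for (const auto& g : groups) {
       if (g.min_id < bound) total += g.num_rows;
     }
     return total;
   }
   
   // Exact count for `id < bound`. This shortcut is only valid under the
   // assumed layout, where ids inside a row group are dense and consecutive.
   int64_t CountRowsExactly(const std::vector<RowGroupStats>& groups, int64_t bound) {
     int64_t total = 0;
     for (const auto& g : groups) {
       if (g.min_id >= bound) continue;
       const int64_t last_match = std::min(g.max_id, bound - 1);
       total += last_match - g.min_id + 1;
     }
     return total;
   }
   ```
   
   Under these assumptions, `CountRowsInSelectedRowGroups(MakeRowGroups(30000, 10000), 12345)` gives 20000, while `CountRowsExactly` with the same arguments gives 12344.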
   
   A possible fix for `ParquetFileFormat::CountRows`:
   
   > Delete `if (expressions[i] != compute::literal(true)) return util::nullopt;` from `ParquetFileFragment::TryCountRows`.
   
   Thanks for your answers.

