Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/08/11 23:53:43 UTC

[GitHub] [arrow-datafusion] andygrove opened a new issue #863: Simple predicates against Parquet relations are not working

andygrove opened a new issue #863:
URL: https://github.com/apache/arrow-datafusion/issues/863


   **Describe the bug**
   
   ```
   > create external table customer stored as parquet location '/mnt/bigdata/tpch-sf1000-parquet/customer';
   0 rows in set. Query took 0.000 seconds.
   
   > SELECT c_mktsegment, COUNT(*) FROM customer GROUP BY c_mktsegment;
   +--------------+-----------------+
   | c_mktsegment | COUNT(UInt8(1)) |
   +--------------+-----------------+
   | HOUSEHOLD    | 30003565        |
   | BUILDING     | 29998146        |
   | FURNITURE    | 29999758        |
   | MACHINERY    | 30003128        |
   | AUTOMOBILE   | 29995355        |
   +--------------+-----------------+
   5 rows in set. Query took 0.758 seconds.
   
   > SELECT concat('[', c_mktsegment, ']') from customer limit 5;
   +------------------------------------------+
   | concat(Utf8("["),c_mktsegment,Utf8("]")) |
   +------------------------------------------+
   | [MACHINERY]                              |
   | [HOUSEHOLD]                              |
   | [BUILDING]                               |
   | [AUTOMOBILE]                             |
   | [HOUSEHOLD]                              |
   +------------------------------------------+
   5 rows in set. Query took 0.028 seconds.
   
   > SELECT COUNT(*) FROM customer WHERE c_mktsegment = 'BUILDING';
   +-----------------+
   | COUNT(UInt8(1)) |
   +-----------------+
   | 0               |
   +-----------------+
   1 row in set. Query took 0.398 seconds.
   
   > SELECT COUNT(*) FROM customer WHERE c_mktsegment = 'HOUSEHOLD';
   +-----------------+
   | COUNT(UInt8(1)) |
   +-----------------+
   | 0               |
   +-----------------+
   1 row in set. Query took 0.386 seconds.
   
   > SELECT COUNT(*) FROM customer WHERE c_mktsegment LIKE 'HOUSEHOLD';
   +-----------------+
   | COUNT(UInt8(1)) |
   +-----------------+
   | 30003565        |
   +-----------------+
   1 row in set. Query took 0.663 seconds.
   ```
   
   **To Reproduce**
   Run the above queries in datafusion-cli
   
   **Expected behavior**
   Simple predicates should work.
   
   **Additional context**
   The parquet files were generated by the conversion utility in the tpch benchmarks.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] houqp edited a comment on issue #863: tpch conversion generates parquet files that cannot be queried on simple predicates

Posted by GitBox <gi...@apache.org>.
houqp edited a comment on issue #863:
URL: https://github.com/apache/arrow-datafusion/issues/863#issuecomment-898208798


   That's a very good point @alamb :D This certainly makes me less stressed about the patch release.
   
   UPDATE: Actually, I take it back; it looks like datafusion currently has arrow pinned to 5.1 :(





[GitHub] [arrow-datafusion] alamb closed issue #863: tpch conversion generates parquet files that cannot be queried on simple predicates

Posted by GitBox <gi...@apache.org>.
alamb closed issue #863:
URL: https://github.com/apache/arrow-datafusion/issues/863


   





[GitHub] [arrow-datafusion] andygrove commented on issue #863: tpch conversion generates parquet files that cannot be queried on simple predicates

Posted by GitBox <gi...@apache.org>.
andygrove commented on issue #863:
URL: https://github.com/apache/arrow-datafusion/issues/863#issuecomment-897690131


   Thanks @houqp, yes, that is it. I tested with Arrow master and it works. Perhaps we can put out a patch release of DataFusion/Ballista with this fix soon.





[GitHub] [arrow-datafusion] houqp commented on issue #863: tpch conversion generates parquet files that cannot be queried on simple predicates

Posted by GitBox <gi...@apache.org>.
houqp commented on issue #863:
URL: https://github.com/apache/arrow-datafusion/issues/863#issuecomment-897374135


   Might be related to https://github.com/apache/arrow-rs/pull/643. Have you tried running it with arrow master?





[GitHub] [arrow-datafusion] alamb commented on issue #863: tpch conversion generates parquet files that cannot be queried on simple predicates

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #863:
URL: https://github.com/apache/arrow-datafusion/issues/863#issuecomment-897804288


   I plan to make an arrow-rs 5.2.0 RC this afternoon. The cool thing about using semver is that `cargo update` should be sufficient to get the fix for arrow without any changes to datafusion.
   
   If we want to ensure that anyone running datafusion is also using at least arrow 5.2.0 (and thus has the fix from https://github.com/apache/arrow-rs/pull/643), we would need a new datafusion release.
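
To illustrate why `cargo update` could be enough, here is a minimal Python sketch of how a default (caret) version requirement accepts newer compatible releases. It assumes the dependency is declared with a plain requirement such as `"5.1"`; an exact `=` requirement would behave differently. The helper names `parse` and `caret_allows` are hypothetical, and the sketch only covers major versions of 1 or higher.

```python
def parse(version):
    """Turn "5.1" or "5.2.0" into a (major, minor, patch) tuple, padding with zeros."""
    parts = [int(p) for p in version.split(".")]
    return tuple(parts + [0] * (3 - len(parts)))


def caret_allows(requirement, candidate):
    """Return True if `candidate` satisfies a caret requirement like "5.1".

    Simplified illustration: for major versions >= 1, a caret requirement
    accepts anything from the stated version up to, but excluding, the next
    major version. The special rules for 0.x versions are not modelled here.
    """
    low = parse(requirement)
    if low[0] == 0:
        raise ValueError("this sketch only covers major versions >= 1")
    high = (low[0] + 1, 0, 0)
    return low <= parse(candidate) < high


print(caret_allows("5.1", "5.2.0"))  # a newer minor release is accepted
print(caret_allows("5.1", "6.0.0"))  # the next major release is not
```

Under that assumption, a project that already requires arrow `5.1` would pick up 5.2.0 on its next `cargo update`, while raising the minimum to 5.2.0 for every user requires publishing a new release with an updated requirement.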





[GitHub] [arrow-datafusion] houqp commented on issue #863: tpch conversion generates parquet files that cannot be queried on simple predicates

Posted by GitBox <gi...@apache.org>.
houqp commented on issue #863:
URL: https://github.com/apache/arrow-datafusion/issues/863#issuecomment-898208798


   That's a very good point @alamb :D This certainly makes me less stressed about the patch release.





[GitHub] [arrow-datafusion] andygrove edited a comment on issue #863: tpch conversion generates parquet files that cannot be queried on simple predicates

Posted by GitBox <gi...@apache.org>.
andygrove edited a comment on issue #863:
URL: https://github.com/apache/arrow-datafusion/issues/863#issuecomment-897273083


   If I comment out the code in our parquet reader that filters out row groups based on predicates then I see the expected results.
   
   ```
   > SELECT COUNT(*) FROM customer WHERE c_mktsegment = 'BUILDING';
   +-----------------+
   | COUNT(UInt8(1)) |
   +-----------------+
   | 29998146        |
   +-----------------+
   1 row in set. Query took 0.874 seconds.
   ```
   
   My conclusion is that we have a bug in our DataFusion/Parquet writer where we are writing incorrect statistics somehow.
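
As a rough illustration of how incorrect statistics could produce this symptom, the following Python sketch models row-group pruning on min/max values. It is not DataFusion's reader code: the `RowGroup` class, the `may_contain` and `count_equal` helpers, and the specific wrong statistics are all hypothetical, chosen only to show the mechanism.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroup:
    min_value: str      # minimum recorded in the row group's statistics
    max_value: str      # maximum recorded in the row group's statistics
    values: List[str]   # the actual column values stored in the row group


def may_contain(row_group, literal):
    """Statistics check: can `column = literal` match anything in this row group?"""
    return row_group.min_value <= literal <= row_group.max_value


def count_equal(row_groups, literal, use_statistics=True):
    """Count rows equal to `literal`, optionally skipping row groups via statistics."""
    total = 0
    for rg in row_groups:
        if use_statistics and not may_contain(rg, literal):
            continue  # row group pruned without reading its values
        total += sum(1 for v in rg.values if v == literal)
    return total


values = ["MACHINERY", "HOUSEHOLD", "BUILDING", "AUTOMOBILE", "HOUSEHOLD"]

# Correct statistics: min and max really bound the stored values.
good = RowGroup(min(values), max(values), values)

# Hypothetical incorrect statistics, as a buggy writer might record them.
bad = RowGroup("MACHINERY", "MACHINERY", values)

print(count_equal([good], "BUILDING"))                       # 1
print(count_equal([bad], "BUILDING"))                        # 0, row group wrongly pruned
print(count_equal([bad], "BUILDING", use_statistics=False))  # 1, pruning disabled
```

In this model, a predicate that is never checked against statistics always scans every row group, so it is unaffected by bad min/max values. That would be consistent with the `LIKE` query above returning the full count while the `=` queries return 0.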





[GitHub] [arrow-datafusion] andygrove edited a comment on issue #863: tpch conversion generates parquet files that cannot be queried on simple predicates

Posted by GitBox <gi...@apache.org>.
andygrove edited a comment on issue #863:
URL: https://github.com/apache/arrow-datafusion/issues/863#issuecomment-897273083


   If I comment out the code in our parquet reader that filters out row groups based on predicates then I see the expected results.
   
   ```
   > SELECT COUNT(*) FROM customer WHERE c_mktsegment = 'BUILDING';
   +-----------------+
   | COUNT(UInt8(1)) |
   +-----------------+
   | 29998146        |
   +-----------------+
   1 row in set. Query took 0.874 seconds.
   ```
   
   My conclusion is that we have a bug in our Parquet writer where we are writing incorrect statistics somehow.





[GitHub] [arrow-datafusion] houqp commented on issue #863: tpch conversion generates parquet files that cannot be queried on simple predicates

Posted by GitBox <gi...@apache.org>.
houqp commented on issue #863:
URL: https://github.com/apache/arrow-datafusion/issues/863#issuecomment-897718347


   Yep, once the arrow 5.2 release is out, I will start a patch release for datafusion.





[GitHub] [arrow-datafusion] andygrove commented on issue #863: tpch conversion generates parquet files that cannot be queried on simple predicates

Posted by GitBox <gi...@apache.org>.
andygrove commented on issue #863:
URL: https://github.com/apache/arrow-datafusion/issues/863#issuecomment-897273083


   If I comment out the code in our parquet reader that filters out row groups then I see the expected results.
   
   ```
   > SELECT COUNT(*) FROM customer WHERE c_mktsegment = 'BUILDING';
   +-----------------+
   | COUNT(UInt8(1)) |
   +-----------------+
   | 29998146        |
   +-----------------+
   1 row in set. Query took 0.874 seconds.
   ```
   
   My conclusion is that we have a bug in our Parquet writer where we are writing incorrect statistics somehow.

