Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/07/14 15:03:12 UTC

[GitHub] [arrow-datafusion] Jimexist opened a new issue #725: Global limit isn't really limiting parquet file reads and stops earlier

Jimexist opened a new issue #725:
URL: https://github.com/apache/arrow-datafusion/issues/725


   **Describe the bug**
   
   When given a global limit:
   
   ```sql
   select * from some_large_data limit 50;
   ```
   
   even with `-c` (batch size) set to a small value such as 100, the global limit does not meaningfully reduce the running time.
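The expected behavior can be sketched outside of DataFusion: a global-limit operator consumes record batches and stops polling its input as soon as enough rows have been produced, so with a batch size of 100 a `limit 50` should touch only the first batch. A toy Python sketch (hypothetical names, not the DataFusion implementation):

```python
def global_limit(batches, limit):
    """Consume record batches, yielding rows until `limit` is reached;
    once satisfied, the input iterator is never polled again."""
    remaining = limit
    for batch in batches:
        take = batch[:remaining]
        remaining -= len(take)
        yield from take
        if remaining <= 0:
            break

def make_batches(n_batches, batch_size, counter):
    """Toy batch source; `counter` records which batches were pulled."""
    for b in range(n_batches):
        counter.append(b)
        yield list(range(b * batch_size, (b + 1) * batch_size))

pulled = []
rows = list(global_limit(make_batches(1000, 100, pulled), 50))
# Only the first of the 1000 batches is ever produced.
```

If the scan does not short-circuit like this, the running time grows with the file size regardless of the limit, which matches the symptom reported here.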
   
   **To Reproduce**
   Steps to reproduce the behavior:
   
   This is easily reproducible with a sufficiently large data set.
   
   **Expected behavior**
   With `limit 50`, the query should stop reading the parquet file as soon as 50 rows have been produced, so the running time should be largely independent of the file size.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Jimexist commented on issue #725: Global limit isn't really limiting parquet file reads and stops earlier

Posted by GitBox <gi...@apache.org>.
Jimexist commented on issue #725:
URL: https://github.com/apache/arrow-datafusion/issues/725#issuecomment-880780387


   > I added some form of limit push down to parquet some time ago.
   > Might be that it isn't applied to your dataset somehow? Or maybe getting the metadata / statistics itself might be slow?
   > 
   > [apache/arrow#9672](https://github.com/apache/arrow/pull/9672)
   
   I tried generating another file using https://gist.github.com/Jimexist/82717bc3ef32a366e11ef60e6e876fcc and it turns out that the limit indeed works. That file has 6,405,008 rows; `select * from table limit 10` returns within 0.5s, and selecting only one column returns in under 0.03s, so the limit is indeed taking effect.
   
   I guess my original slow case was due to the parquet file being pulled directly from HDFS; maybe the statistics don't work in that case?





[GitHub] [arrow-datafusion] Dandandan commented on issue #725: Global limit isn't really limiting parquet file reads and stops earlier

Posted by GitBox <gi...@apache.org>.
Dandandan commented on issue #725:
URL: https://github.com/apache/arrow-datafusion/issues/725#issuecomment-880011030


   I added some form of limit push down to parquet some time ago.
   Might be that it isn't applied to your dataset somehow? Or maybe getting the metadata / statistics itself might be slow?
   
   https://github.com/apache/arrow/pull/9672





[GitHub] [arrow-datafusion] Jimexist closed issue #725: Global limit isn't really limiting parquet file reads and stops earlier

Posted by GitBox <gi...@apache.org>.
Jimexist closed issue #725:
URL: https://github.com/apache/arrow-datafusion/issues/725


   







[GitHub] [arrow-datafusion] Dandandan commented on issue #725: Global limit isn't really limiting parquet file reads and stops earlier

Posted by GitBox <gi...@apache.org>.
Dandandan commented on issue #725:
URL: https://github.com/apache/arrow-datafusion/issues/725#issuecomment-881024855


   > > I added some form of limit push down to parquet some time ago.
   > > Might be that it isn't applied to your dataset somehow? Or maybe getting the metadata / statistics itself might be slow?
   > > [apache/arrow#9672](https://github.com/apache/arrow/pull/9672)
   > 
   > I tried generating another file using https://gist.github.com/Jimexist/82717bc3ef32a366e11ef60e6e876fcc and it turns out that limit indeed works. that's 6405008 rows and `select * from table limit 10` returns within 0.5s, and selecting only one column returns in less than 0.03s, so i guess that's indeed taking effect.
   > 
   > i guess my original slow case was due to that the parquet file was directly pulled from HDFS, in which case the statistics are not working?
   
   Hm, that's weird. We don't use the statistics for the limit; instead we reduce the amount scanned per file.
   It would be great to have a reproduction of this. Maybe it has to do with large row groups?
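The per-file scan reduction described above can be sketched as row-group selection: a parquet footer records each row group's row count, so a reader only needs the prefix of row groups that covers the limit. A toy Python sketch (hypothetical names, not the actual DataFusion code):

```python
def row_groups_to_scan(row_group_sizes, limit):
    """Given per-row-group row counts from the parquet footer, pick the
    prefix of row groups needed to cover `limit` rows. Row groups are
    the unit of IO, so a single huge row group must still be read in
    full -- a plausible cause of slow LIMIT queries on such files."""
    selected, covered = [], 0
    for idx, n_rows in enumerate(row_group_sizes):
        selected.append(idx)
        covered += n_rows
        if covered >= limit:
            break
    return selected

# Three row groups of 100 rows, LIMIT 150: only the first two are read.
small_groups = row_groups_to_scan([100, 100, 100], 150)
# One row group holding all 6,405,008 rows, LIMIT 10: the whole group
# must be scheduled anyway.
one_big_group = row_groups_to_scan([6405008], 10)
```

This is why large row groups would defeat the optimization: the selection happens at row-group granularity, not row granularity.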




