Posted to github@arrow.apache.org by "sundy-li (via GitHub)" <gi...@apache.org> on 2023/02/26 10:22:38 UTC

[GitHub] [arrow-datafusion] sundy-li commented on issue #5404: Datafusion v19.rc1 scan parquet 20x slower than DuckDB

sundy-li commented on issue #5404:
URL: https://github.com/apache/arrow-datafusion/issues/5404#issuecomment-1445319733

   TopK is only part of the explanation.
   
   1.  Lazy projection (aka later projection) can improve this case: fetch only the `URL` column in a first pass, apply the order and limit, then project the other columns by row ID for the rows that survive.
   2.  `URL` is a large binary column in the hits dataset, and DuckDB has optimized how it reads Parquet into its memory model. You can verify this with `select max(URL) from table`, which reads only that column and involves no TopK.
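
A minimal sketch of the lazy projection idea from point 1, in plain Python rather than DataFusion or DuckDB code. The in-memory `table`, its column values, and the `lazy_topk` helper are all hypothetical and exist only for illustration; a real engine would apply the same two passes to Parquet column chunks.

```python
import heapq

# Hypothetical column-oriented table: column name -> list of values.
# A row ID is simply the position in each list.
table = {
    "URL": ["http://c", "http://a", "http://d", "http://b"],
    "Title": ["C page", "A page", "D page", "B page"],
    "UserID": [3, 1, 4, 2],
}


def lazy_topk(table, order_col, k):
    """Emulate ORDER BY order_col LIMIT k with late projection."""
    # Pass 1: scan only the ordering column, keeping the k smallest
    # (value, row_id) pairs. No other column is touched here.
    winners = heapq.nsmallest(
        k, ((value, row_id) for row_id, value in enumerate(table[order_col]))
    )
    row_ids = [row_id for _, row_id in winners]

    # Pass 2: project every column, but only for the k surviving row IDs.
    return [{name: col[i] for name, col in table.items()} for i in row_ids]


result = lazy_topk(table, "URL", 2)
for row in result:
    print(row)
```

The point of the split is that the wide columns are read for only `k` rows instead of for the whole table; the cost of the first pass still depends on how fast the ordering column itself can be scanned.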
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org