Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/07/11 13:17:32 UTC

[GitHub] [arrow-datafusion] ParadoxShmaradox commented on issue #2845: [Question] Optimize multiple reads on same DataFrame

ParadoxShmaradox commented on issue #2845:
URL: https://github.com/apache/arrow-datafusion/issues/2845#issuecomment-1180398779

   What I ended up doing was to collect the record batches from the DataFrame; because I know the record batches are pre-sorted by the id column from the Parquet file they were read from, I could skip whole batches and apply the compute kernel filters by hand.
   
   This cut the filtering time dramatically, from 5ms to 1ms on average, across roughly 100 partitions.
   I wonder whether a record batch could hold statistics on its data, either precomputed or computed on demand, which DataFusion could then use during physical plan optimization.
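   The batch-skipping idea above can be sketched as follows. This is a minimal illustration only, not DataFusion's actual API: it uses plain `Vec`s in place of Arrow `RecordBatch`es, and the `Batch`/`filter_by_id` names are invented for the example. Because the ids are sorted within each batch, the min/max "statistics" are just the first and last elements, so a batch whose range cannot contain the target id is skipped without scanning its rows.
   
   ```rust
   /// Stand-in for a RecordBatch: rows sorted by `id`. (Illustrative only.)
   struct Batch {
       ids: Vec<u64>,
       values: Vec<f64>,
   }
   
   impl Batch {
       /// On-demand min/max "statistics": since ids are sorted,
       /// the range is simply the first and last element.
       fn id_range(&self) -> Option<(u64, u64)> {
           Some((*self.ids.first()?, *self.ids.last()?))
       }
   }
   
   /// Skip batches whose id range cannot contain `target`,
   /// then filter the surviving batches row by row.
   fn filter_by_id(batches: &[Batch], target: u64) -> Vec<f64> {
       batches
           .iter()
           // Pruning step: drop whole batches using the range statistic.
           .filter(|b| matches!(b.id_range(), Some((lo, hi)) if lo <= target && target <= hi))
           // Scan step: filter rows only in the batches that survived.
           .flat_map(|b| {
               b.ids
                   .iter()
                   .zip(&b.values)
                   .filter(move |(id, _)| **id == target)
                   .map(|(_, v)| *v)
           })
           .collect()
   }
   
   fn main() {
       let batches = vec![
           Batch { ids: vec![1, 2, 3], values: vec![10.0, 20.0, 30.0] },
           Batch { ids: vec![4, 5, 6], values: vec![40.0, 50.0, 60.0] },
       ];
       // The first batch is skipped by its range; only the second is scanned.
       println!("{:?}", filter_by_id(&batches, 5));
   }
   ```
   
   In the real workload the row-level filter would be an Arrow compute kernel rather than a hand-rolled loop, but the pruning-then-scanning shape is the same.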
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org