You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2023/03/31 19:35:28 UTC
[GitHub] [arrow-datafusion] alamb commented on pull request #5790: Revert pr #5020
alamb commented on PR #5790:
URL: https://github.com/apache/arrow-datafusion/pull/5790#issuecomment-1492496143
For anyone else following along the original PR that added this change was https://github.com/apache/arrow-datafusion/pull/5020
@yahoNanJing -- looking at the screen shot you provided,
![228734458-ddd80ca3-f59d-47a6-bd55-141ed31924e3](https://user-images.githubusercontent.com/490673/229212240-66994f1b-b811-4c6c-af9b-b0f8678c8fb2.png)
It looks to me like the parquet file in question is being read using 2 streams (aka the file was opened by two different tasks which are reading it in parallel)
Thus while the wall clock time takes only 4 seconds it may be possible that the total cpu time is actually 7 seconds
You could potentially disable the `repartition_file_scans` option
https://docs.rs/datafusion/latest/datafusion/config/struct.OptimizerOptions.html#structfield.repartition_file_scans
And see if the metrics were more like what you expected
Perhaps @thinkharderdev or @tustvold have additional thoughts they could share
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org