
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/05/01 17:46:07 UTC

[GitHub] [spark] tgravescs commented on issue #23695: [SPARK-26780][CORE]Improve shuffle read using ReadAheadInputStream

tgravescs commented on issue #23695: [SPARK-26780][CORE]Improve shuffle read using ReadAheadInputStream
URL: https://github.com/apache/spark/pull/23695#issuecomment-488355875

Glad to see someone working on this. I started looking at it a long time back, but I was looking at using the Linux readahead call. I assume you didn't compare Linux readahead against this approach to see the performance difference? I know ReadAheadInputStream was already here and so is easier to use, but I would think the OS call might be more efficient: it just pulls the data into the page cache, whereas ReadAheadInputStream copies the data into user space and then you copy it again to send it out.
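
To make the OS-side idea concrete, here is a rough sketch, in Python for brevity and not Spark code. It uses `posix_fadvise` with `POSIX_FADV_WILLNEED`, the portable relative of the Linux `readahead(2)` call, to ask the kernel to start loading the next chunk into the page cache while the current one is being read. The function name `read_with_os_readahead` and the chunk size are made up for illustration, and whether this beats a user-space read-ahead buffer would still have to be measured.

```python
import os
import tempfile


def read_with_os_readahead(path, chunk_size=1 << 20):
    """Read a file sequentially, hinting the kernel one chunk ahead.

    Returns the number of bytes read. The hint is advisory: if the
    platform lacks posix_fadvise or the call fails, we fall back to
    plain sequential reads.
    """
    can_hint = hasattr(os, "posix_fadvise")  # not available on every OS
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        offset = 0
        while True:
            if can_hint:
                try:
                    # Ask the kernel to pull the *next* chunk into the page
                    # cache; no data is copied into user space by this call.
                    os.posix_fadvise(fd, offset + chunk_size, chunk_size,
                                     os.POSIX_FADV_WILLNEED)
                except OSError:
                    can_hint = False  # advisory only, so just stop hinting
            data = os.read(fd, chunk_size)
            if not data:
                break
            offset += len(data)
            total += len(data)
    finally:
        os.close(fd)
    return total


if __name__ == "__main__":
    # Hypothetical usage on a throwaway file.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"\0" * (3 * 1024 * 1024))
    print(read_with_os_readahead(tmp.name))
    os.remove(tmp.name)
```

The point of the sketch is only that the prefetch stays inside the kernel; the bytes are copied into user space once, when they are actually read.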

In your performance tests, how did you test? If the data was already in the page cache, you wouldn't see much benefit; the gain comes when the data actually has to be pulled from disk and the readahead gets it into memory before you need it.
Have you run this through any real Spark applications or on a real cluster to see results?
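
For the page cache concern, one possible way to get a cold-cache measurement on a single file is sketched below, again in Python and under assumptions: the helper names `drop_from_page_cache` and `timed_read` are hypothetical, and `POSIX_FADV_DONTNEED` is only a best-effort request that the kernel may ignore. Dropping all caches through `/proc/sys/vm/drop_caches` is the heavier alternative and needs root.

```python
import os
import time


def drop_from_page_cache(path):
    """Best-effort eviction of one file's pages from the OS page cache.

    Returns True if the request was issued, False if the platform does
    not support it or the call failed. The kernel is free to ignore it.
    """
    if not hasattr(os, "posix_fadvise"):
        return False
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)  # dirty pages cannot be dropped, so flush them first
        # offset=0, len=0 means "the whole file"
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        return True
    except OSError:
        return False
    finally:
        os.close(fd)


def timed_read(path, chunk_size=1 << 20):
    """Read the whole file and return (bytes_read, elapsed_seconds)."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total, time.perf_counter() - start
```

Calling `drop_from_page_cache(path)` before each `timed_read(path)` gives a rough cold-cache number; timing a second read without the eviction gives the warm-cache number to compare against.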
