You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/07/14 22:23:41 UTC

[GitHub] [spark] JoshRosen commented on issue #20184: [SPARK-22987][Core] UnsafeExternalSorter cases OOM when invoking `getIterator` function.

JoshRosen commented on issue #20184: [SPARK-22987][Core] UnsafeExternalSorter cases OOM when invoking `getIterator` function.
URL: https://github.com/apache/spark/pull/20184#issuecomment-511240279

I stumbled across this PR while looking through the open Spark shuffle PRs.

It sounds like the problem here is that we don't need to allocate the input stream and read buffer until it's actually time to read the spill, but we're currently doing that too early:

- In `getSortedIterator()`, we have to construct all readers before we can return the first record because we must find the first record according to the sorted ordering and that requires looking at all spill files.
- However, we do not have this constraint in `getIterator()`, which returns an unsorted iterator and is used in `ExternalAppendOnlyUnsafeRowArray` (which uses the sorter only for its spilling capabilities, not for sorting).

Given this context, lazy initialization makes sense to me. However, this PR is a bit outdated and has some merge conflicts. I would be supportive of this change if the conflicts are resolved and the PR description is updated.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org