Posted to reviews@spark.apache.org by JoshRosen <gi...@git.apache.org> on 2015/11/11 16:06:51 UTC
[GitHub] spark pull request: [SPARK-7041] Avoid writing empty files in Bypa...
GitHub user JoshRosen reopened a pull request:
https://github.com/apache/spark/pull/5622
[SPARK-7041] Avoid writing empty files in BypassMergeSortShuffleWriter
In BypassMergeSortShuffleWriter, we may end up opening disk writer files for empty partitions. This occurs because we manually call `open()` after creating the writer, which causes the serialization and compression output streams to be created; these streams may write headers to the output stream, resulting in non-zero-length files for partitions that contain no records. This eager `open()` call is unnecessary, since the disk object writer will automatically open itself when the first write is performed. Removing the eager `open()` call and rewriting the consumers to cope with the non-existence of empty files yields a large performance benefit for certain sparse workloads when using sort-based shuffle, and has a noticeable impact on small-scale Spark SQL jobs in unit tests and `spark-shell`.
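The lazy-open idea described above can be sketched in plain Java. The class below is illustrative only (it is not Spark's actual DiskObjectWriter API, and the header bytes are a made-up stand-in for a compression/serialization stream header): the writer defers creating its output (and therefore writing any header bytes) until the first record arrives, so a partition that receives no records produces zero bytes instead of a header-only file.

```java
import java.io.ByteArrayOutputStream;

// Illustrative sketch of lazy opening: no output is created, and no header is
// written, until the first record is actually written. An empty partition
// therefore produces no bytes at all.
class LazyPartitionWriter {
    // Stands in for the on-disk file stream in the real writer.
    private ByteArrayOutputStream out;
    // Hypothetical two-byte header a compression stream might emit on open.
    private static final byte[] HEADER = {0x1f, (byte) 0x8b};

    private void open() {
        out = new ByteArrayOutputStream();
        // Header bytes are only emitted once the stream is actually opened.
        out.write(HEADER, 0, HEADER.length);
    }

    void write(byte[] record) {
        if (out == null) {
            open(); // open lazily on the first write
        }
        out.write(record, 0, record.length);
    }

    /** Total bytes produced; zero when no record was ever written. */
    int bytesWritten() {
        return out == null ? 0 : out.size();
    }
}
```

With an eager `open()` in the constructor, `bytesWritten()` would already be 2 (the header) even for partitions that never see a record, which is exactly the non-zero-length empty file problem this patch removes.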
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/JoshRosen/spark file-handle-optimizations
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/5622.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #5622
----
commit 00bcf8a893c021fa4a949c5ac077a34881870ace
Author: Josh Rosen <jo...@databricks.com>
Date: 2015-04-21T18:55:11Z
Avoid IO operations on empty files in BlockObjectWriter.
commit 8fd89b47efbef6325e0bc45bad0b74bf8ead4a6d
Author: Josh Rosen <jo...@databricks.com>
Date: 2015-04-21T19:10:46Z
Do not create empty files at all.
commit 0db87c341686e7b24e760583bcc9fe9054d3095a
Author: Josh Rosen <jo...@databricks.com>
Date: 2015-04-21T20:30:00Z
Reduce scope of FileOutputStream in ExternalSorter
commit 7e2340d05721d6374e78069baa5870e87cd0cfb1
Author: Josh Rosen <jo...@databricks.com>
Date: 2015-04-22T00:45:45Z
Revert "Reduce scope of FileOutputStream in ExternalSorter"
This reverts commit 3c9c9447d4d4e8ddeb036167390073e3b67fb621.
commit 54cd5ceb025f635552a14e6241d43e9858fb095d
Author: Josh Rosen <jo...@databricks.com>
Date: 2015-06-05T21:38:38Z
Merge remote-tracking branch 'origin/master' into file-handle-optimizations
commit 5c777cf40ee1f70092639a8abfe8b9598d6d3636
Author: Josh Rosen <jo...@databricks.com>
Date: 2015-06-05T21:35:53Z
Rework SPARK-7041 for BypassMergeSort split
commit c7caa5c6b54c86895a2f57ba448b1dd626ce5cf4
Author: Josh Rosen <jo...@databricks.com>
Date: 2015-06-06T05:21:20Z
Merge remote-tracking branch 'origin/master' into file-handle-optimizations
commit aaa51bf58f286f0c1dbb0a38afb514e7a38b1183
Author: Josh Rosen <jo...@databricks.com>
Date: 2015-06-09T17:50:35Z
Actually avoid calling open() :)
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org