Posted to reviews@spark.apache.org by JoshRosen <gi...@git.apache.org> on 2015/11/11 16:06:51 UTC

[GitHub] spark pull request: [SPARK-7041] Avoid writing empty files in Bypa...

GitHub user JoshRosen reopened a pull request:

    https://github.com/apache/spark/pull/5622

    [SPARK-7041] Avoid writing empty files in BypassMergeSortShuffleWriter

    In BypassMergeSortShuffleWriter, we may end up opening disk writers for empty partitions. This occurs because we manually call `open()` after creating the writer, which causes the serialization and compression streams to be created; these streams may write headers to the output stream, resulting in non-zero-length files for partitions that contain no records. This eager `open()` call is unnecessary, since the disk object writer will automatically open itself when the first write is performed. Removing the eager `open()` call and rewriting the consumers to cope with the non-existence of empty files results in a large performance benefit for certain sparse workloads when using sort-based shuffle. This has an impact for small-scale Spark SQL jobs in unit tests and `spark-shell`.
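    The lazy-open idea described above can be sketched as follows. Note this is a minimal illustration, not Spark's actual `DiskBlockObjectWriter` API: the class `LazyPartitionWriter` and its methods are hypothetical names chosen for the example. The point is that the output file is only created on the first write, so a partition that receives no records leaves no file behind.

    ```java
    import java.io.*;

    // Hypothetical sketch: a writer that defers file creation until the
    // first record arrives, so empty partitions never produce a file
    // (not even one containing only stream headers).
    class LazyPartitionWriter {
        private final File file;
        private OutputStream out;  // null until the first write

        LazyPartitionWriter(File file) {
            this.file = file;
        }

        void write(byte[] record) throws IOException {
            if (out == null) {
                // Open lazily: only now is the file created on disk.
                out = new FileOutputStream(file);
            }
            out.write(record);
        }

        // Close the stream if it was ever opened; report bytes written.
        // A never-opened writer has written zero bytes and created no file.
        long commit() throws IOException {
            if (out == null) {
                return 0L;
            }
            out.close();
            return file.length();
        }
    }

    public class Main {
        public static void main(String[] args) throws IOException {
            File empty = new File("part-empty.tmp");
            File full  = new File("part-full.tmp");

            LazyPartitionWriter w0 = new LazyPartitionWriter(empty);
            LazyPartitionWriter w1 = new LazyPartitionWriter(full);

            // w0 receives no records; w1 receives one 6-byte record.
            w1.write("record".getBytes());

            System.out.println("empty exists: " + empty.exists()); // false
            System.out.println("full bytes: " + w1.commit());      // 6

            full.delete();
        }
    }
    ```

    Consumers of the shuffle output must then tolerate a missing file for a partition and treat it as zero-length, which is the other half of the change this pull request describes.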

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JoshRosen/spark file-handle-optimizations

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5622.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5622
    
----
commit 00bcf8a893c021fa4a949c5ac077a34881870ace
Author: Josh Rosen <jo...@databricks.com>
Date:   2015-04-21T18:55:11Z

    Avoid IO operations on empty files in BlockObjectWriter.

commit 8fd89b47efbef6325e0bc45bad0b74bf8ead4a6d
Author: Josh Rosen <jo...@databricks.com>
Date:   2015-04-21T19:10:46Z

    Do not create empty files at all.

commit 0db87c341686e7b24e760583bcc9fe9054d3095a
Author: Josh Rosen <jo...@databricks.com>
Date:   2015-04-21T20:30:00Z

    Reduce scope of FileOutputStream in ExternalSorter

commit 7e2340d05721d6374e78069baa5870e87cd0cfb1
Author: Josh Rosen <jo...@databricks.com>
Date:   2015-04-22T00:45:45Z

    Revert "Reduce scope of FileOutputStream in ExternalSorter"
    
    This reverts commit 3c9c9447d4d4e8ddeb036167390073e3b67fb621.

commit 54cd5ceb025f635552a14e6241d43e9858fb095d
Author: Josh Rosen <jo...@databricks.com>
Date:   2015-06-05T21:38:38Z

    Merge remote-tracking branch 'origin/master' into file-handle-optimizations

commit 5c777cf40ee1f70092639a8abfe8b9598d6d3636
Author: Josh Rosen <jo...@databricks.com>
Date:   2015-06-05T21:35:53Z

    Rework SPARK-7041 for BypassMergeSort split

commit c7caa5c6b54c86895a2f57ba448b1dd626ce5cf4
Author: Josh Rosen <jo...@databricks.com>
Date:   2015-06-06T05:21:20Z

    Merge remote-tracking branch 'origin/master' into file-handle-optimizations

commit aaa51bf58f286f0c1dbb0a38afb514e7a38b1183
Author: Josh Rosen <jo...@databricks.com>
Date:   2015-06-09T17:50:35Z

    Actually avoid calling open() :)

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org