Posted to issues@spark.apache.org by "Shirish (Jira)" <ji...@apache.org> on 2019/12/18 02:39:00 UTC
[jira] [Comment Edited] (SPARK-1476) 2GB limit in spark for blocks
[ https://issues.apache.org/jira/browse/SPARK-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16998752#comment-16998752 ]
Shirish edited comment on SPARK-1476 at 12/18/19 2:38 AM:
----------------------------------------------------------
This is an old thread that I happened to land on. I am interested in the following points mentioned by [~mridulm80]. Did anyone ever get around to implementing a MultiOutputs-style map without needing to use cache? If not, can anyone give me a pointer on how to get started?
"[~matei] Interesting that you should mention splitting the output of a map into multiple blocks.
We are actually thinking about that in a different context - akin to MultiOutputs in Hadoop or SPLIT in Pig: without needing to cache the intermediate output, but directly emitting values to different blocks/RDDs based on the output of a map or some such."
> 2GB limit in spark for blocks
> -----------------------------
>
> Key: SPARK-1476
> URL: https://issues.apache.org/jira/browse/SPARK-1476
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Environment: all
> Reporter: Mridul Muralidharan
> Priority: Critical
> Attachments: 2g_fix_proposal.pdf
>
>
> The underlying abstraction for blocks in Spark is a ByteBuffer, which limits the size of a block to 2GB.
> This has implications not just for managed blocks in use, but also for shuffle blocks (memory-mapped blocks are limited to 2GB, even though the API takes a long), ser/deser via byte-array-backed output streams (SPARK-1391), etc.
> This is a severe limitation on the use of Spark with non-trivial datasets.
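To make the limit in the description above concrete, here is a minimal sketch (not from the original issue): `ByteBuffer` capacity is an `int`, so no buffer can exceed Integer.MAX_VALUE bytes (2^31 - 1, just under 2GB), and a larger request cannot even be expressed without overflowing into a negative value that `ByteBuffer.allocate` rejects.

```java
import java.nio.ByteBuffer;

public class TwoGbLimit {
    public static void main(String[] args) {
        // ByteBuffer capacity is an int, so the hard ceiling is
        // Integer.MAX_VALUE bytes (2^31 - 1), just under 2 GB.
        System.out.println("max block size = " + Integer.MAX_VALUE);

        // One byte past the int range overflows to Integer.MIN_VALUE
        // when narrowed, and allocate() rejects negative capacities.
        long wanted = Integer.MAX_VALUE + 1L; // 2 GB, fine as a long
        try {
            ByteBuffer.allocate((int) wanted); // (int) wanted is negative
        } catch (IllegalArgumentException e) {
            System.out.println("allocate failed: capacity is negative");
        }
    }
}
```

The same `int` ceiling shows up in `FileChannel.map`, which is why the description notes that memory-mapped shuffle blocks are capped at 2GB even though the method's `size` parameter is declared as a `long`.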
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org