Posted to issues@spark.apache.org by "Brian Husted (JIRA)" <ji...@apache.org> on 2014/10/03 14:37:33 UTC

[jira] [Commented] (SPARK-2421) Spark should treat writable as serializable for keys

    [ https://issues.apache.org/jira/browse/SPARK-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14157936#comment-14157936 ] 

Brian Husted commented on SPARK-2421:
-------------------------------------

To work around the problem, one must map the Writable to a String (org.apache.hadoop.io.Text in the case below).  This is an issue when sorting large amounts of data, since Spark will attempt to write out (spill) the entire dataset to perform the data conversion.  On a 500GB file this fills up more than 100GB of space on each node in our 12-node cluster, which is very inefficient.  We are currently using Spark 1.0.2.  Any thoughts here are appreciated.

Our code, which attempts to mimic a MapReduce sort in Spark:

    import org.apache.hadoop.io.Text
    import org.apache.hadoop.io.compress.DefaultCodec

    // read in the Hadoop sequence file to sort
    val file = sc.sequenceFile(input, classOf[Text], classOf[Text])

    // this is the conversion we would like to avoid: mapping the Hadoop
    // Text input to Strings just so sortByKey will run
    val converted = file.map { case (k, v) => (k.toString, v.toString) }

    // perform the sort on the converted data
    val sortedOutput = converted.sortByKey(true, 1)

    // write out the results as a sequence file
    sortedOutput.saveAsSequenceFile(output, Some(classOf[DefaultCodec]))
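
For reference, one way we could imagine avoiding the full String conversion is a small wrapper class that is Java-serializable by delegating to the Writable interface, and Ordered so sortByKey accepts it as a key.  This is only a sketch: SerializableText is a hypothetical class, not part of Spark or Hadoop, and it reuses sc, file, and output from the snippet above.

    import java.io.{ObjectInputStream, ObjectOutputStream}
    import org.apache.hadoop.io.Text

    // Hypothetical wrapper: serializes the underlying Text via its own
    // Writable methods and orders keys the way Text does, so sortByKey
    // can shuffle it without converting every record to Strings.
    class SerializableText(var text: Text)
        extends Serializable with Ordered[SerializableText] {

      def compare(that: SerializableText): Int = text.compareTo(that.text)

      // custom Java serialization hooks that reuse Writable's wire format
      private def writeObject(out: ObjectOutputStream): Unit = text.write(out)

      private def readObject(in: ObjectInputStream): Unit = {
        text = new Text()
        text.readFields(in)
      }
    }

    // copy each reused Writable (Hadoop record readers recycle instances)
    // and wrap both sides so the shuffle can serialize them
    val sorted = file
      .map { case (k, v) =>
        (new SerializableText(new Text(k)), new SerializableText(new Text(v))) }
      .sortByKey(true, 1)
      .map { case (k, v) => (new Text(k.text), new Text(v.text)) }

    sorted.saveAsSequenceFile(output, Some(classOf[DefaultCodec]))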

> Spark should treat writable as serializable for keys
> ----------------------------------------------------
>
>                 Key: SPARK-2421
>                 URL: https://issues.apache.org/jira/browse/SPARK-2421
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output, Java API
>    Affects Versions: 1.0.0
>            Reporter: Xuefu Zhang
>
> It seems that Spark requires the key to be serializable (i.e. the class implements the Serializable interface). In the Hadoop world, the Writable interface is used for the same purpose. A lot of existing classes, while Writable, are not considered by Spark to be Serializable. It would be nice if Spark could treat Writable as serializable and automatically serialize and de-serialize these classes using the Writable interface.
> This was identified in HIVE-7279, but its benefits apply globally.
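
For illustration, one shape the proposed mechanism could take is a Kryo serializer that round-trips any Writable through its own write/readFields methods, registered for the classes a job uses.  This is a sketch only: it assumes Kryo serialization is enabled (spark.serializer set to org.apache.spark.serializer.KryoSerializer and spark.kryo.registrator pointing at the registrator), and WritableKryoSerializer / TextRegistrator are hypothetical names, not existing classes.

    import java.io.{DataInputStream, DataOutputStream}
    import com.esotericsoftware.kryo.{Kryo, Serializer}
    import com.esotericsoftware.kryo.io.{Input, Output}
    import org.apache.hadoop.io.{Text, Writable}
    import org.apache.spark.serializer.KryoRegistrator

    // Hypothetical: serialize any Writable with the Writable interface
    // itself, the behavior the issue asks Spark to apply automatically.
    // Assumes the Writable class has a no-arg constructor, as is
    // conventional for Writables.
    class WritableKryoSerializer[T <: Writable] extends Serializer[T] {
      override def write(kryo: Kryo, output: Output, value: T): Unit = {
        val out = new DataOutputStream(output)
        value.write(out)   // Writable.write
        out.flush()
      }
      override def read(kryo: Kryo, input: Input, cls: Class[T]): T = {
        val value = cls.newInstance()
        value.readFields(new DataInputStream(input))   // Writable.readFields
        value
      }
    }

    // Hypothetical registrator wired up via spark.kryo.registrator
    class TextRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit =
        kryo.register(classOf[Text], new WritableKryoSerializer[Text])
    }

    // sortByKey would additionally need an Ordering[Text] in implicit
    // scope, since Text is Comparable[BinaryComparable], not Comparable[Text]
    implicit val textOrdering: Ordering[Text] = new Ordering[Text] {
      def compare(a: Text, b: Text): Int = a.compareTo(b)
    }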



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
