Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:03:08 UTC

[jira] [Updated] (SPARK-22338) namedtuple serialization is inefficient

     [ https://issues.apache.org/jira/browse/SPARK-22338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-22338:
---------------------------------
    Labels: bulk-closed  (was: )

> namedtuple serialization is inefficient
> ---------------------------------------
>
>                 Key: SPARK-22338
>                 URL: https://issues.apache.org/jira/browse/SPARK-22338
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 2.2.0
>            Reporter: Joel Croteau
>            Priority: Minor
>              Labels: bulk-closed
>
> I greatly appreciate the level of hack that PySpark contains in order to make namedtuples serializable, but I feel like it could be done a little better. In particular, say I create a namedtuple class with a few long field names like this:
> {code:JiraShouldReallySupportPython}
> MyTuple = namedtuple('MyTuple', ('longarga', 'longargb', 'longargc'))
> {code}
> If I create an instance of this, here is how PySpark serializes it:
> {code:JiraShouldReallySupportPython}
> mytuple = MyTuple(1, 2, 3)
> pickle.dumps(mytuple, pickle.HIGHEST_PROTOCOL)
> b'\x80\x04\x95]\x00\x00\x00\x00\x00\x00\x00\x8c\x13pyspark.serializers\x94\x8c\x08_restore\x94\x93\x94\x8c\x07MyTuple\x94\x8c\x08longarga\x94\x8c\x08longargb\x94\x8c\x08longargc\x94\x87\x94K\x01K\x02K\x03\x87\x94\x87\x94R\x94.'
> {code}
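> The payload can be disassembled with the standard library's pickletools to see exactly what it carries (a quick check, assuming the same setup as above, i.e. PySpark imported so its namedtuple hack is active):
> {code:python}
> import pickle
> import pickletools
> from collections import namedtuple
>
> # Same objects as above. The pyspark.serializers references only appear
> # when PySpark's namedtuple hack is in effect; a plain namedtuple would
> # instead pickle by class reference, without the field names.
> MyTuple = namedtuple('MyTuple', ('longarga', 'longargb', 'longargc'))
> mytuple = MyTuple(1, 2, 3)
>
> # Under the hack, the opcode listing shows short-unicode entries for
> # 'pyspark.serializers', '_restore', 'MyTuple', and all three field names.
> pickletools.dis(pickle.dumps(mytuple, pickle.HIGHEST_PROTOCOL))
> {code}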
> This serialization includes the name of the namedtuple class, the names of all of its fields, and references to internal functions in pyspark.serializers. By comparison, this is what I get if I serialize the bare tuple:
> {code:JiraShouldReallySupportPython}
> shorttuple = (1, 2, 3)
> pickle.dumps(shorttuple, pickle.HIGHEST_PROTOCOL)
> b'\x80\x04\x95\t\x00\x00\x00\x00\x00\x00\x00K\x01K\x02K\x03\x87\x94.'
> {code}
> Much shorter. For another comparison, here is what it looks like if I build a dict with the same data, keyed by the same field names:
> {code:JiraShouldReallySupportPython}
> mydict = {'longarga': 1, 'longargb': 2, 'longargc': 3}
> pickle.dumps(mydict, pickle.HIGHEST_PROTOCOL)
> b'\x80\x04\x95,\x00\x00\x00\x00\x00\x00\x00}\x94(\x8c\x08longarga\x94K\x01\x8c\x08longargb\x94K\x02\x8c\x08longargc\x94K\x03u.'
> {code}
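> To put rough numbers on the difference, here is a small comparison of the pickled sizes of the three objects above (exact byte counts will vary with the pickle protocol and with whether PySpark's namedtuple hack is active):
> {code:python}
> import pickle
> from collections import namedtuple
>
> MyTuple = namedtuple('MyTuple', ('longarga', 'longargb', 'longargc'))
>
> objects = {
>     'namedtuple': MyTuple(1, 2, 3),
>     'tuple': (1, 2, 3),
>     'dict': {'longarga': 1, 'longargb': 2, 'longargc': 3},
> }
>
> # Print the serialized size of each object; the namedtuple entry is by
> # far the largest when the PySpark hack is installed.
> for label, obj in objects.items():
>     print(label, len(pickle.dumps(obj, pickle.HIGHEST_PROTOCOL)))
> {code}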
> In other words, even using a dict is substantially shorter than using a namedtuple in its current form. There shouldn't be any need for namedtuples to carry this much serialization overhead. For one thing, if the class object is being broadcast to the nodes, there should be no need for each namedtuple instance to include all of the field names; the class name should be enough. If you use namedtuples heavily, this adds a lot of overhead in memory and on disk. I am going to try to improve the serialization and submit a patch if I can find the time, but I don't know the PySpark code very well, so if anyone has suggestions for where to start, I would love to hear them.
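> One possible direction, purely as a sketch (register_namedtuple, _reduce_compact, and _restore_compact are hypothetical names, not anything that exists in PySpark today): register each namedtuple class once, assume the class definitions already reach the executors, and have every instance reduce to just the class name plus its values:
> {code:python}
> import copyreg
> import pickle
> from collections import namedtuple
>
> # Hypothetical registry of namedtuple classes. This assumes the class
> # objects themselves already reach the executors some other way (e.g.
> # inside the serialized function closure), so instances only need to
> # carry the class name and the field values.
> _NAMEDTUPLE_REGISTRY = {}
>
> def _reduce_compact(obj):
>     # Pickle an instance as (class name, values) instead of replaying
>     # the full field-name list for every instance.
>     return _restore_compact, (obj.__class__.__name__, tuple(obj))
>
> def _restore_compact(name, values):
>     return _NAMEDTUPLE_REGISTRY[name](*values)
>
> def register_namedtuple(cls):
>     _NAMEDTUPLE_REGISTRY[cls.__name__] = cls
>     copyreg.pickle(cls, _reduce_compact)
>     return cls
>
> MyTuple = register_namedtuple(
>     namedtuple('MyTuple', ('longarga', 'longargb', 'longargc')))
>
> payload = pickle.dumps(MyTuple(1, 2, 3), pickle.HIGHEST_PROTOCOL)
> # The payload still names _restore_compact and 'MyTuple', but no longer
> # repeats 'longarga'/'longargb'/'longargc' for every instance.
> print(len(payload), pickle.loads(payload))
> {code}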



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org