Posted to reviews@spark.apache.org by JoshRosen <gi...@git.apache.org> on 2014/08/02 03:30:43 UTC

[GitHub] spark pull request: [SPARK-1687] [PySpark] pickable namedtuple

Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1623#discussion_r15726127
  
    --- Diff: python/pyspark/serializers.py ---
    @@ -277,6 +308,19 @@ class PickleSerializer(FramedSerializer):
         not be as fast as more specialized serializers.
         """
     
    +    def _hack_namedtuple(self):
    +        # namedtuple created in other module can be pickled normal
    +        # hack namedtuple in __main__ module
    +        for n, o in sys.modules["__main__"].__dict__.iteritems():
    +            if (type(o) is type and o.__base__ is tuple
    +                    and hasattr(o, "_fields")
    +                    and "__reduce__" not in o.__dict__):
    +                hack_namedtuple(o)
    +
    +    def dump_stream(self, iterator, stream):
    +        self._hack_namedtuple()
    --- End diff --
    
    I was going to suggest a boolean flag that records whether we've already hacked namedtuple, but we probably don't need it: _hack_namedtuple() is idempotent, and the flag would likely be premature optimization, since we only pay the hack cost once per stream.
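
To illustrate the idempotency point, here is a minimal sketch in Python 3. The body of hack_namedtuple() is not shown in the diff, so hack_namedtuple_sketch(), _restore(), and patch_namespace() are hypothetical stand-ins, not the PR's code. Only the guard condition is copied from the diff. The sketch scans an arbitrary dict with .values() rather than the `__main__` module with iteritems().

```python
import collections


def _restore(name, fields, values):
    """Rebuild a namedtuple instance from its class name, fields, and values."""
    cls = collections.namedtuple(name, fields)
    return cls(*values)


def hack_namedtuple_sketch(cls):
    """Hypothetical stand-in for hack_namedtuple(): attach a __reduce__
    so that instances are pickled by value instead of by class reference."""
    name, fields = cls.__name__, cls._fields

    def __reduce__(self):
        return (_restore, (name, fields, tuple(self)))

    cls.__reduce__ = __reduce__
    return cls


def is_unpatched_namedtuple(obj):
    # Same guard as in the diff: a direct tuple subclass with _fields
    # that does not yet define its own __reduce__.
    return (type(obj) is type and obj.__base__ is tuple
            and hasattr(obj, "_fields")
            and "__reduce__" not in obj.__dict__)


def patch_namespace(namespace):
    """Patch every unpatched namedtuple class found in a namespace dict.

    Returns the number of classes patched by this call.
    """
    patched = 0
    for obj in list(namespace.values()):
        if is_unpatched_namedtuple(obj):
            hack_namedtuple_sketch(obj)
            patched += 1
    return patched
```

Calling patch_namespace() twice on the same dict patches each class only on the first call. Once `__reduce__` is in the class `__dict__`, the guard skips that class, so a separate "already hacked" flag would only save the cost of the scan itself.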

