Posted to reviews@spark.apache.org by kbourgoin <gi...@git.apache.org> on 2017/05/31 01:00:37 UTC

[GitHub] spark pull request #18139: [SPARK-20787][PYTHON] PySpark can't handle dateti...

Github user kbourgoin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18139#discussion_r119250658
  
    --- Diff: python/pyspark/sql/types.py ---
    @@ -187,8 +187,11 @@ def needConversion(self):
     
         def toInternal(self, dt):
             if dt is not None:
    -            seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
    -                       else time.mktime(dt.timetuple()))
    +            seconds = calendar.timegm(dt.utctimetuple())
    +            # Avoiding the invalid range of years (100-1899) for mktime in Python < 3
    +            if dt.year not in range(100, 1900):
    --- End diff --
    
    Thanks for putting this together. I filed the original ticket, but didn't have time to look into how to actually fix it. This looks like a good fix, but there's a small performance optimization that would help. Changing this to `if dt.year < 100 or dt.year >= 1900:` (equivalent to `not in range(100, 1900)`, but without building a range) gives a significant improvement.
    
    ```
    # Running 1,000,000 iterations of `toInternal`
    
    Timing of datetime(16,1,1)
    ------------------------------
    Before: 33.7063019276s
    After: 1.56738209724s
    
    
    Timing of datetime(1600,1,1)
    --------------------------------
    Before: 28.1155149937s
    After: 1.55464696884s
    
    
    Timing of datetime(1900,1,1)
    --------------------------------
    Before: 33.3729829788s
    After: 1.54201412201s
    ```
    
    With the call to `range`, you have to instantiate a new range object for every single serialization (in Python 2 this builds the whole list in memory at once) and then search it for membership. The simple comparison above avoids all of those steps.
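    For reference, here's a minimal standalone sketch of the two checks and a rough timing harness (the function names and the `timeit` setup are mine, not from the patch):
    
    ```python
    import timeit
    from datetime import datetime
    
    def with_range(dt):
        # Builds a range object per call; on Python 2 this is a full list,
        # on Python 3 integer membership is O(1) but the object is still allocated.
        return dt.year in range(100, 1900)
    
    def with_compare(dt):
        # Chained comparison: two integer compares, no allocation.
        return 100 <= dt.year < 1900
    
    # Both checks agree, including on the boundary years.
    for year in (99, 100, 1899, 1900):
        d = datetime(year, 1, 1)
        assert with_range(d) == with_compare(d)
    
    # Rough timing over 1,000,000 calls, mirroring the numbers above.
    dt = datetime(1600, 1, 1)
    print(timeit.timeit(lambda: with_range(dt), number=1000000))
    print(timeit.timeit(lambda: with_compare(dt), number=1000000))
    ```
    
    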


