Posted to reviews@spark.apache.org by rberenguel <gi...@git.apache.org> on 2017/05/29 19:44:20 UTC

[GitHub] spark pull request #18139: Spark 20787 invalid years

GitHub user rberenguel opened a pull request:

    https://github.com/apache/spark/pull/18139

    Spark 20787 invalid years

    `time.mktime` can't handle dates in the year range 100-1899; according to the documentation, this is by design. `calendar.timegm` is equivalent in the cases both functions handle, and also supports those years.
    
    ## What changes were proposed in this pull request?
    
    Replace `time.mktime` with the more capable `calendar.timegm` to address cases like:
    ```python
    import datetime as dt
    sqlContext.createDataFrame(sc.parallelize([[dt.datetime(1899,12,31)]])).count()
    ```
    which fail due to an internal conversion error when the datetime object carries no timezone information. When timezone information is present, `calendar` was already being used.
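    
    For reference, here is a minimal standalone sketch of the underlying difference (illustrative only; the exact failure mode of `mktime` is platform-dependent):
    ```python
    import calendar
    import datetime as dt
    import time
    
    d = dt.datetime(1899, 12, 31)
    
    # calendar.timegm is implemented in pure Python and accepts any
    # proleptic Gregorian year; for 1899 it returns negative seconds
    # relative to the Unix epoch.
    print(calendar.timegm(d.utctimetuple()))
    
    # time.mktime delegates to the platform C library and rejects years
    # outside its supported range (ValueError or OverflowError, depending
    # on the platform and Python version).
    try:
        print(time.mktime(d.timetuple()))
    except (ValueError, OverflowError) as e:
        print("mktime failed:", e)
    ```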
    
    ## How was this patch tested?
    
    Existing test cases cover this change, since it does not alter existing functionality. A test was added to confirm the conversion works in the problematic year range.
    
    This PR is original work and I license it to the Spark project.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rberenguel/spark SPARK-20787-invalid-years

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18139.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18139
    
----
commit 6c0312f94e3fce2bf4d6a30055bd747be535bb0f
Author: Ruben Berenguel Montoro <ru...@mostlymaths.net>
Date:   2017-05-29T15:46:21Z

    SPARK-20787 time.mktime can’t handle dates from 1899-100, by construction. calendar.timegm is equivalent in shared cases, but can handle those

commit d3c41b5f18971168870524ad3a5fac876859bf4b
Author: Ruben Berenguel Montoro <ru...@mostlymaths.net>
Date:   2017-05-29T19:42:54Z

    SPARK-20787 Technically a hack. Using gmtime everywhere does not work well with DST shifts. So, for timeranges that don’t work well with mktime, use gmtime

----


---


[GitHub] spark issue #18139: [SPARK-20787][PYTHON] PySpark can't handle datetimes bef...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18139
  
    Can one of the admins verify this patch?


---



[GitHub] spark issue #18139: [SPARK-20787][PYTHON] PySpark can't handle datetimes bef...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18139
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #18139: [SPARK-20787][PYTHON] PySpark can't handle datetimes bef...

Posted by rberenguel <gi...@git.apache.org>.
Github user rberenguel commented on the issue:

    https://github.com/apache/spark/pull/18139
  
    Yes, most solutions are concerned only with `strftime` (virtualtime and matplotlib, above). Since `calendar.timegm` handles this better (at least for pre-1900 dates), it's just a matter of not messing up horribly with how dates work (it's also problematic in Java, after all).


---





[GitHub] spark issue #18139: [SPARK-20787][PYTHON] PySpark can't handle datetimes bef...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18139
  
    **[Test build #77641 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77641/testReport)** for PR 18139 at commit [`4a6f77d`](https://github.com/apache/spark/commit/4a6f77d54158caefd008b8db9f2aa6a651373ae0).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark pull request #18139: [SPARK-20787][PYTHON] PySpark can't handle dateti...

Posted by rberenguel <gi...@git.apache.org>.
Github user rberenguel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18139#discussion_r119350488
  
    --- Diff: python/pyspark/sql/types.py ---
    @@ -187,8 +187,11 @@ def needConversion(self):
     
         def toInternal(self, dt):
             if dt is not None:
    -            seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
    -                       else time.mktime(dt.timetuple()))
    +            seconds = calendar.timegm(dt.utctimetuple())
    +            # Avoiding the invalid range of years (100-1899) for mktime in Python < 3
    +            if dt.year > 1899 or dt.year <= 100:
    +                seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
    +                           else time.mktime(dt.timetuple()))
    --- End diff --
    
    Makes sense. I always forget that `if` blocks don't create new scopes in Python.
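    
    (A toy illustration of the scoping point, nothing Spark-specific:)
    ```python
    def f(flag):
        if flag:
            x = 1  # the if block introduces no new scope...
        return x   # ...so this returns 1 when flag is truthy and
                   # raises UnboundLocalError otherwise
    ```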


---


[GitHub] spark issue #18139: [SPARK-20787][PYTHON] PySpark can't handle datetimes bef...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/18139
  
    Gentle follow up, if you have some time to update this @rberenguel :)


---



[GitHub] spark pull request #18139: [SPARK-20787][PYTHON] PySpark can't handle dateti...

Posted by rberenguel <gi...@git.apache.org>.
Github user rberenguel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18139#discussion_r119583288
  
    --- Diff: python/pyspark/sql/types.py ---
    @@ -187,8 +187,12 @@ def needConversion(self):
     
         def toInternal(self, dt):
             if dt is not None:
    -            seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
    -                       else time.mktime(dt.timetuple()))
    +            # Avoiding the invalid range of years (100-1899) for mktime in Python < 3
    +            if dt.year > 1899 or dt.year <= 100:
    --- End diff --
    
    Uh, right, thanks!


---


[GitHub] spark pull request #18139: [SPARK-20787][PYTHON] PySpark can't handle dateti...

Posted by rberenguel <gi...@git.apache.org>.
Github user rberenguel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18139#discussion_r119583612
  
    --- Diff: python/pyspark/sql/types.py ---
    @@ -187,8 +187,12 @@ def needConversion(self):
     
         def toInternal(self, dt):
             if dt is not None:
    -            seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
    -                       else time.mktime(dt.timetuple()))
    +            # Avoiding the invalid range of years (100-1899) for mktime in Python < 3
    +            if dt.year > 1899 or dt.year <= 100:
    +                seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
    +                           else time.mktime(dt.timetuple()))
    +            else:
    +                seconds = calendar.timegm(dt.utctimetuple())
    --- End diff --
    
    Well, this was the previous functionality, and I'm not 100% sure of the rationale behind doing it this way either. If we don't have `tzinfo` and use `dt.timetuple` instead of `utctimetuple`, I guess it's using local time via the underlying C implementation of `mktime`, or at least that's what the Python docs seem to suggest, though it's not that clear. So I'm not convinced we need to offset, since it would be a platform/locale-dependent offset.


---


[GitHub] spark issue #18139: [SPARK-20787][PYTHON] PySpark can't handle datetimes bef...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18139
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77642/
    Test PASSed.


---


[GitHub] spark issue #18139: [SPARK-20787][PYTHON] PySpark can't handle datetimes bef...

Posted by rberenguel <gi...@git.apache.org>.
Github user rberenguel commented on the issue:

    https://github.com/apache/spark/pull/18139
  
    For instance, [matplotlib handles strftime as a special case for pre-1900](https://github.com/matplotlib/matplotlib/blob/c166fa1afc40cf5fc211b584f8abc107744e9417/lib/matplotlib/dates.py#L573)


---



[GitHub] spark issue #18139: [SPARK-20787][PYTHON] PySpark can't handle datetimes bef...

Posted by rberenguel <gi...@git.apache.org>.
Github user rberenguel commented on the issue:

    https://github.com/apache/spark/pull/18139
  
    Thanks @ueshin, I forgot to run `dev/lint-python` on the tests (still getting used to which tricks help avoid a full local test-suite run, which takes more than an hour on my machine).


---


[GitHub] spark issue #18139: [SPARK-20787][PYTHON] PySpark can't handle datetimes bef...

Posted by rberenguel <gi...@git.apache.org>.
Github user rberenguel commented on the issue:

    https://github.com/apache/spark/pull/18139
  
    @j-bennet I'm not sure about that change. My suggested approach just fixes what the Python API offers (as in, `mktime` has known limitations from Python's point of view, independent of what the platform's `mktime` might do). A better way might be a platform-dependent check for the 1900/1901/1902 problem (which I suspect should be a separate issue from this one, even though it touches the same area of the code).


---


[GitHub] spark issue #18139: [SPARK-20787][PYTHON] PySpark can't handle datetimes bef...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18139
  
    **[Test build #77641 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77641/testReport)** for PR 18139 at commit [`4a6f77d`](https://github.com/apache/spark/commit/4a6f77d54158caefd008b8db9f2aa6a651373ae0).


---


[GitHub] spark issue #18139: [SPARK-20787][PYTHON] PySpark can't handle datetimes bef...

Posted by rberenguel <gi...@git.apache.org>.
Github user rberenguel commented on the issue:

    https://github.com/apache/spark/pull/18139
  
    More ideas welcome indeed. We could just replicate what others are doing to fix the problem, if we can't introduce external dependencies.


---



[GitHub] spark issue #18139: [SPARK-20787][PYTHON] PySpark can't handle datetimes bef...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/18139
  
    cc @ueshin 


---



[GitHub] spark issue #18139: [SPARK-20787][PYTHON] PySpark can't handle datetimes bef...

Posted by rberenguel <gi...@git.apache.org>.
Github user rberenguel commented on the issue:

    https://github.com/apache/spark/pull/18139
  
    @HyukjinKwon I would definitely not mind, I'm actually curious to see how they do it. I also saw it's not that big (and we don't need all the functionality). I'll extract the pieces we need for our case


---



[GitHub] spark issue #18139: [SPARK-20787][PYTHON] PySpark can't handle datetimes bef...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/18139
  
    @rberenguel is this still on your radar?
    Also jenkins ok to test.


---



[GitHub] spark pull request #18139: [SPARK-20787][PYTHON] PySpark can't handle dateti...

Posted by rberenguel <gi...@git.apache.org>.
Github user rberenguel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18139#discussion_r119344927
  
    --- Diff: python/pyspark/sql/types.py ---
    @@ -187,8 +187,11 @@ def needConversion(self):
     
         def toInternal(self, dt):
             if dt is not None:
    -            seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
    -                       else time.mktime(dt.timetuple()))
    +            seconds = calendar.timegm(dt.utctimetuple())
    +            # Avoiding the invalid range of years (100-1899) for mktime in Python < 3
    +            if dt.year not in range(100, 1900):
    --- End diff --
    
    Thanks! Indeed, I thought about it, but creating PRs too late in the afternoon dizzies my optimisation logic. Great catch and check, although I think `100 >= dt.year > 1900` is an and-condition (a chained comparison), so it can never be true for any year. I changed it to an or (the `< 100` and the `> 1899` parts).
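    
    (For anyone following along, a quick sketch of the chained-comparison point:)
    ```python
    # `100 >= year > 1900` desugars to `100 >= year and year > 1900`,
    # which no single year can satisfy, so that branch would never run.
    for year in (16, 1600, 1900, 2017):
        print(year, 100 >= year > 1900, year > 1899 or year <= 100)
    # 16 False True
    # 1600 False False
    # 1900 False True
    # 2017 False True
    ```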


---


[GitHub] spark issue #18139: [SPARK-20787][PYTHON] PySpark can't handle datetimes bef...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/18139
  
    I'm not sure if we can use external libraries here. Do you have any ideas? @holdenk 
    I'd cc @HyukjinKwon @viirya for more opinions.


---



[GitHub] spark pull request #18139: [SPARK-20787][PYTHON] PySpark can't handle dateti...

Posted by facaiy <gi...@git.apache.org>.
Github user facaiy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18139#discussion_r119346663
  
    --- Diff: python/pyspark/sql/types.py ---
    @@ -187,8 +187,11 @@ def needConversion(self):
     
         def toInternal(self, dt):
             if dt is not None:
    -            seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
    -                       else time.mktime(dt.timetuple()))
    +            seconds = calendar.timegm(dt.utctimetuple())
    +            # Avoiding the invalid range of years (100-1899) for mktime in Python < 3
    +            if dt.year > 1899 or dt.year <= 100:
    +                seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
    +                           else time.mktime(dt.timetuple()))
    --- End diff --
    
    How about adding an `else` branch to eliminate the repeated computation of `seconds`?


---






[GitHub] spark issue #18139: [SPARK-20787][PYTHON] PySpark can't handle datetimes bef...

Posted by rberenguel <gi...@git.apache.org>.
Github user rberenguel commented on the issue:

    https://github.com/apache/spark/pull/18139
  
    Hi @holdenk, yup, I still have a to-do to address this cleanly; I got sidetracked with other things. Helping Python since the 18th century!


---



[GitHub] spark issue #18139: [SPARK-20787][PYTHON] PySpark can't handle datetimes bef...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18139
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77641/
    Test FAILed.


---


[GitHub] spark pull request #18139: [SPARK-20787][PYTHON] PySpark can't handle dateti...

Posted by facaiy <gi...@git.apache.org>.
Github user facaiy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18139#discussion_r119263608
  
    --- Diff: python/pyspark/sql/types.py ---
    @@ -187,8 +187,11 @@ def needConversion(self):
     
         def toInternal(self, dt):
             if dt is not None:
    -            seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
    -                       else time.mktime(dt.timetuple()))
    +            seconds = calendar.timegm(dt.utctimetuple())
    --- End diff --
    
    Perhaps this needs to be confirmed by @davies?


---


[GitHub] spark issue #18139: [SPARK-20787][PYTHON] PySpark can't handle datetimes bef...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/18139
  
    Thanks for cc'ing me too. This seems like a very specific problem. If we can introduce a similarly small fix to solve it in PySpark, that would be better in my view.
    
    I'm not sure virtualtime solves this; maybe I missed it, but in a rough scan I didn't find it patching `time.mktime` the way it overlays other functions, though it does patch `time.strftime`.


---



[GitHub] spark issue #18139: [SPARK-20787][PYTHON] PySpark can't handle datetimes bef...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/18139
  
    One option, which wouldn't be super efficient in border cases but would allow for the general case, is simply falling back on `OverflowError` (on top of avoiding the range the Python docs say is known to cause errors).
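    
    A minimal sketch of that fallback (the helper name is hypothetical, and the `timegm` fallback treats a naive datetime as UTC, which sidesteps the local-offset question discussed elsewhere in this thread):
    ```python
    import calendar
    import time
    
    def datetime_to_epoch_seconds(dt):
        """Prefer mktime for naive datetimes; fall back to timegm when
        the platform rejects the year."""
        if dt.tzinfo:
            return calendar.timegm(dt.utctimetuple())
        try:
            return time.mktime(dt.timetuple())
        except (OverflowError, ValueError):
            # Outside the platform's mktime range: treat the naive
            # datetime as UTC rather than failing outright.
            return calendar.timegm(dt.timetuple())
    ```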


---


[GitHub] spark issue #18139: [SPARK-20787][PYTHON] PySpark can't handle datetimes bef...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18139
  
    **[Test build #77642 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77642/testReport)** for PR 18139 at commit [`aeef87f`](https://github.com/apache/spark/commit/aeef87f1bf6a724116a50d2324ca22906b8e0bf4).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #18139: [SPARK-20787][PYTHON] PySpark can't handle datetimes bef...

Posted by j-bennet <gi...@git.apache.org>.
Github user j-bennet commented on the issue:

    https://github.com/apache/spark/pull/18139
  
    The range that mktime will accept is platform-dependent. I am on a Mac, and it looks like mktime can't handle 1900 and 1901:
    
    ```
    Python 2.7.13 (default, Dec 18 2016, 07:03:39)
    Type "copyright", "credits" or "license" for more information.
    
    IPython 5.3.0 -- An enhanced Interactive Python.
    ?         -> Introduction and overview of IPython's features.
    %quickref -> Quick reference.
    help      -> Python's own help system.
    object?   -> Details about 'object', use 'object??' for extra details.
    
    In [1]: import time
    
    In [2]: import datetime as dt
    
    In [3]: time.mktime(dt.datetime(1900, 1, 1).timetuple())
    ---------------------------------------------------------------------------
    OverflowError                             Traceback (most recent call last)
    <ipython-input-3-0332545829df> in <module>()
    ----> 1 time.mktime(dt.datetime(1900, 1, 1).timetuple())
    
    OverflowError: mktime argument out of range
    
    In [4]: time.mktime(dt.datetime(1901, 1, 1).timetuple())
    ---------------------------------------------------------------------------
    OverflowError                             Traceback (most recent call last)
    <ipython-input-4-fb80b633a52b> in <module>()
    ----> 1 time.mktime(dt.datetime(1901, 1, 1).timetuple())
    
    OverflowError: mktime argument out of range
    
    In [5]: time.mktime(dt.datetime(1902, 1, 1).timetuple())
    Out[5]: -2145888000.0
    ```
    
    I would suggest, to be on the safe side, only calling `mktime` for years 1902 and up.


---


[GitHub] spark issue #18139: [SPARK-20787][PYTHON] PySpark can't handle datetimes bef...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18139
  
    Merged build finished. Test FAILed.


---




[GitHub] spark issue #18139: [SPARK-20787][PYTHON] PySpark can't handle datetimes bef...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18139
  
    **[Test build #77642 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77642/testReport)** for PR 18139 at commit [`aeef87f`](https://github.com/apache/spark/commit/aeef87f1bf6a724116a50d2324ca22906b8e0bf4).


---


[GitHub] spark pull request #18139: [SPARK-20787][PYTHON] PySpark can't handle dateti...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18139#discussion_r119570547
  
    --- Diff: python/pyspark/sql/types.py ---
    @@ -187,8 +187,12 @@ def needConversion(self):
     
         def toInternal(self, dt):
             if dt is not None:
    -            seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
    -                       else time.mktime(dt.timetuple()))
    +            # Avoiding the invalid range of years (100-1899) for mktime in Python < 3
    +            if dt.year > 1899 or dt.year <= 100:
    --- End diff --
    
    Here `dt.year <= 100` should be `dt.year < 100`.


---


[GitHub] spark issue #18139: [SPARK-20787][PYTHON] PySpark can't handle datetimes bef...

Posted by rberenguel <gi...@git.apache.org>.
Github user rberenguel commented on the issue:

    https://github.com/apache/spark/pull/18139
  
    @holdenk @ueshin Digging a bit into other solutions to this problem (it's not specifically Spark-related once you dig deeper, so other libraries have had to find their own ways to fix it), maybe adding [this dependency](https://pypi.python.org/pypi/virtualtime) is the best way to fix it for good (while also not reinventing wheels that already exist in the Python ecosystem).


---



[GitHub] spark pull request #18139: [SPARK-20787][PYTHON] PySpark can't handle dateti...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18139#discussion_r119580044
  
    --- Diff: python/pyspark/sql/types.py ---
    @@ -187,8 +187,12 @@ def needConversion(self):
     
         def toInternal(self, dt):
             if dt is not None:
    -            seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
    -                       else time.mktime(dt.timetuple()))
    +            # Avoiding the invalid range of years (100-1899) for mktime in Python < 3
    +            if dt.year > 1899 or dt.year <= 100:
    +                seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
    +                           else time.mktime(dt.timetuple()))
    +            else:
    +                seconds = calendar.timegm(dt.utctimetuple())
    --- End diff --
    
    I guess if `dt.tzinfo` doesn't exist, `dt` loses the local timezone offset, so we need to adjust the timestamp using `time.timezone` or something.
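    
    A sketch of that adjustment, assuming the naive datetime represents local standard time (`time.timezone` ignores DST, which is the weakness the second commit message mentions):
    ```python
    import calendar
    import time
    
    def naive_local_to_epoch_seconds(dt):
        # timegm interprets the tuple as UTC; adding the local offset
        # west of UTC (time.timezone, in seconds) recovers the instant.
        return calendar.timegm(dt.timetuple()) + time.timezone
    ```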


---


[GitHub] spark pull request #18139: [SPARK-20787][PYTHON] PySpark can't handle dateti...

Posted by kbourgoin <gi...@git.apache.org>.
Github user kbourgoin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18139#discussion_r119250658
  
    --- Diff: python/pyspark/sql/types.py ---
    @@ -187,8 +187,11 @@ def needConversion(self):
     
         def toInternal(self, dt):
             if dt is not None:
    -            seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
    -                       else time.mktime(dt.timetuple()))
    +            seconds = calendar.timegm(dt.utctimetuple())
    +            # Avoiding the invalid range of years (100-1899) for mktime in Python < 3
    +            if dt.year not in range(100, 1900):
    --- End diff --
    
    Thanks for putting this together. I filed the original ticket, but didn't have time to look into how to actually fix it.  This looks like a good fix, but there's a small performance optimization that would help. Changing this to `if 100 >= dt.year > 1900:` gives a significant improvement.
    
    ```
    # Running 1,000,000 iteration of `toInternal`
    
    Timing of datetime(16,1,1)
    ------------------------------
    Before: 33.7063019276s
    After: 1.56738209724s
    
    
    Timing of datetime(1600,1,1)
    --------------------------------
    Before: 28.1155149937s
    After: 1.55464696884s
    
    
    Timing of datetime(1900,1,1)
    --------------------------------
    Before: 33.3729829788s
    After: 1.54201412201s
    ```
    
    With the call to `range`, you have to instantiate the range object for every single serialization (in Python 2 this creates the whole list at once in memory) and then iterate over it in order to do the search. The simple test above avoids all those steps.
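    
    (A rough way to reproduce this kind of measurement; absolute numbers will vary by machine and Python version:)
    ```python
    import timeit
    
    setup = "year = 1600"
    # Membership test: builds and scans a range on every call
    # (on Python 2, range() materialises the whole list).
    print(timeit.timeit("year not in range(100, 1900)", setup=setup, number=1000000))
    # Plain comparisons: two constant-time integer checks.
    print(timeit.timeit("year > 1899 or year <= 100", setup=setup, number=1000000))
    ```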


---


[GitHub] spark issue #18139: [SPARK-20787][PYTHON] PySpark can't handle datetimes bef...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/18139
  
    Thank you for cc'ing me. I am actually quite against adding an external Python package to fix one specific problem, particularly as a hard dependency; to my knowledge, the PySpark case is a bit complicated (e.g., cloudpickle and py4j).
    
    Looking at the [virtualtime repo](https://github.com/j5int/virtualtime), it doesn't look very big. Would it require many changes to resemble what they do? I haven't looked closely into the discussion, the change, or the JIRA yet, but I wanted to quickly check, if you don't mind.


---





[GitHub] spark issue #18139: [SPARK-20787][PYTHON] PySpark can't handle datetimes bef...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/18139
  
    ok to test


---

