Posted to reviews@spark.apache.org by squito <gi...@git.apache.org> on 2017/03/06 19:22:46 UTC
[GitHub] spark pull request #16781: [SPARK-12297][SQL][POC] Hive compatibility for Pa...

GitHub user squito reopened a pull request:

    https://github.com/apache/spark/pull/16781

    [SPARK-12297][SQL][POC] Hive compatibility for Parquet Timestamps

    ## What changes were proposed in this pull request?
    
    Hive has unusual behavior when writing timestamps to Parquet data: it always converts them from the local timezone to UTC, and then reverses that conversion when reading the data back.  For compatibility with Hive, Spark should provide an option to apply the same conversion, when necessary, based on table metadata.  This goes along with HIVE-12767.
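
As an illustration of the round trip described above, here is a minimal Python sketch. It is not code from this patch or from Hive. The function names are hypothetical, and the two zones are modeled as fixed UTC offsets (no DST) purely to keep the example self-contained.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed offsets stand in for real zones (no DST handling).
PST = timezone(timedelta(hours=-8))
CET = timezone(timedelta(hours=1))

def write_like_hive(wall_clock, writer_zone):
    """Treat a naive wall-clock value as local to writer_zone and
    return the naive UTC value that would be stored in the file."""
    localized = wall_clock.replace(tzinfo=writer_zone)
    return localized.astimezone(timezone.utc).replace(tzinfo=None)

def read_like_hive(stored_utc, reader_zone):
    """Reverse the write-side conversion: treat the stored value as UTC
    and return the naive wall-clock value in reader_zone."""
    as_utc = stored_utc.replace(tzinfo=timezone.utc)
    return as_utc.astimezone(reader_zone).replace(tzinfo=None)

original = datetime(2017, 1, 15, 12, 0, 0)
stored = write_like_hive(original, PST)   # value on disk, shifted to UTC
same_zone = read_like_hive(stored, PST)   # round trip in the writer's zone
other_zone = read_like_hive(stored, CET)  # a reader in a different zone
unadjusted = stored                       # a reader that skips the reversal
```

In this sketch only the reader that reverses the conversion in the writer's zone recovers the original wall-clock value. A reader that skips the reversal, or that applies it in another zone, sees a shifted timestamp, which is the incompatibility the option is meant to address.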
    
    Note that the default for Spark remains unchanged; newly created tables are marked as UTC, which means the read and write paths remain unchanged (and avoid slow timezone logic).  The major use case is that *legacy* tables written by Hive can now be read correctly by Spark, as long as the appropriate table properties are set.
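
The idea of driving the conversion from table metadata, with a fast path for UTC tables, could be sketched as follows. Everything here is an assumption made for illustration: the property key `example.timestamp.zone`, the zone labels, and the choice to treat a missing property as "no adjustment" are hypothetical and are not taken from the patch.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical property key; the real key used by the patch is not shown here.
TABLE_TZ_KEY = "example.timestamp.zone"

# Assumption: a tiny lookup of fixed offsets (in hours) stands in for a real
# timezone database.
FIXED_OFFSETS = {"UTC": 0, "GMT": 0, "UTC-08:00": -8, "UTC+01:00": 1}

def resolve_adjustment(table_properties):
    """Return None when no conversion is needed, else the table's zone."""
    zone = table_properties.get(TABLE_TZ_KEY, "UTC")
    if zone not in FIXED_OFFSETS:
        raise ValueError(f"unrecognized timezone in table property: {zone!r}")
    hours = FIXED_OFFSETS[zone]
    if hours == 0:
        return None  # UTC or GMT: skip the timezone logic entirely
    return timezone(timedelta(hours=hours))

def read_timestamp(stored_utc, table_properties):
    """Apply the table's zone to a stored naive UTC value, if one is set."""
    zone = resolve_adjustment(table_properties)
    if zone is None:
        return stored_utc  # default path, value returned untouched
    as_utc = stored_utc.replace(tzinfo=timezone.utc)
    return as_utc.astimezone(zone).replace(tzinfo=None)

stored = datetime(2017, 1, 15, 20, 0, 0)
default_table = read_timestamp(stored, {})
utc_table = read_timestamp(stored, {TABLE_TZ_KEY: "UTC"})
legacy_table = read_timestamp(stored, {TABLE_TZ_KEY: "UTC-08:00"})
```

Resolving the property once and returning `None` for UTC keeps the per-value work off the default path, which mirrors the point above about avoiding slow timezone logic for tables marked as UTC.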
    
    ## How was this patch tested?
    
    Added a unit test that creates tables and reads and writes data under a variety of permutations (conf on whether or not to add a default tz to new tables, different explicit timezones, vectorized reading on and off).
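
A test matrix like the one described above can be enumerated as a cross product. The sketch below is only an illustration in Python. The dictionary keys and the specific zone values are assumptions, and the actual test in the patch may be organized differently.

```python
from itertools import product

# Assumed axes, following the three dimensions named in the description.
default_tz_conf = [True, False]  # add a default tz to new tables?
explicit_zones = [None, "UTC", "America/Los_Angeles", "Europe/Berlin"]
vectorized_read = [True, False]  # vectorized reading on or off

cases = [
    {"default_tz_conf": conf, "explicit_zone": zone, "vectorized": vec}
    for conf, zone, vec in product(default_tz_conf, explicit_zones, vectorized_read)
]

for case in cases:
    # A real test would create a table, write and read data, and compare
    # results for this combination; here the combination is only printed.
    print(case)
```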
    
    TODO
    * [ ] predicate pushdown

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/squito/spark SPARK-12297

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16781.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16781
    
----
commit 53d0744f8cdb9404bfe84f1e0154606d3442639c
Author: Imran Rashid <ir...@cloudera.com>
Date:   2017-01-27T02:37:27Z

    very basic test for adjusting read parquet data

commit 69a3c8cb6c4efb35d817c214a43b217daddade4b
Author: Imran Rashid <ir...@cloudera.com>
Date:   2017-01-27T20:18:17Z

    wip

commit 51e24f28359b807f46e93975941330d5d93e3875
Author: Imran Rashid <ir...@cloudera.com>
Date:   2017-01-31T02:55:09Z

    working version for non-vectorized read -- lots of garbage too

commit 7e618411c83002ca098526b0691f8b184295a216
Author: Imran Rashid <ir...@cloudera.com>
Date:   2017-01-31T19:03:34Z

    working for vectorized reads -- not sure about all code paths

commit 9fbde13cbc431ff955564a0695c7a1c3e64e158f
Author: Imran Rashid <ir...@cloudera.com>
Date:   2017-02-01T15:39:39Z

    more tests for write path

commit bac9eb0ed3a65fcdaab458ef3bd52aef5af01b68
Author: Imran Rashid <ir...@cloudera.com>
Date:   2017-02-01T20:23:07Z

    expand tests; fix some metastore interaction; cleanup a lot of garbage

commit 1b05978dc9ee1de5f6d1d7031510ab6b91a6e5b9
Author: Imran Rashid <ir...@cloudera.com>
Date:   2017-02-01T20:49:31Z

    more cleanup

commit b622d278d7a451846dcde28ed01c9618b7a00662
Author: Imran Rashid <ir...@cloudera.com>
Date:   2017-02-01T22:03:38Z

    handle bad timezones; include unit test

commit 0604403e0d67d59c9c586b72b340db5c157d817b
Author: Imran Rashid <ir...@cloudera.com>
Date:   2017-02-02T05:43:08Z

    write support; lots more unit tests

commit f45516da3ee5adf6300085a807b7acd4193cbb36
Author: Imran Rashid <ir...@cloudera.com>
Date:   2017-02-02T16:17:39Z

    add tests for alter table

commit d4511a68a881c0f2b1238d644e4e6fa1f5578154
Author: Imran Rashid <ir...@cloudera.com>
Date:   2017-02-02T16:25:52Z

    utc or gmt; cleanup

commit 223ce2c25b122707c64e4eda77a11bff71fd0cbe
Author: Imran Rashid <ir...@cloudera.com>
Date:   2017-02-02T16:27:13Z

    more cleanup

commit 5b49ae026044b46f0899a9e792e2b71733c4cb8a
Author: Imran Rashid <ir...@cloudera.com>
Date:   2017-02-02T18:02:31Z

    fix compatibility

commit 9ef60a4fdb0164f6eed75d40897fddce33f96c23
Author: Imran Rashid <ir...@cloudera.com>
Date:   2017-03-01T19:31:54Z

    Merge branch 'master' into SPARK-12297

commit 0b6883c944ed0400a62414ebd605d7114a2f135d
Author: Imran Rashid <ir...@cloudera.com>
Date:   2017-03-02T20:41:46Z

    Merge branch 'master' into SPARK-12297

commit 69b81425c9b68240ddc5411a83ba39ca7d1a74e3
Author: Imran Rashid <ir...@cloudera.com>
Date:   2017-03-03T03:05:06Z

    wip

commit 7ca2c864de3b1a34e2e77e72f9ae51cfada88d65
Author: Imran Rashid <ir...@cloudera.com>
Date:   2017-03-03T22:33:42Z

    fix

commit 6f982d30c7cd4f8ee6e28024c45dcaeaa72bd874
Author: Imran Rashid <ir...@cloudera.com>
Date:   2017-03-06T19:12:00Z

    fixes; passes tests now

commit 1ad2f8302ee0c9f1caa801b5a75ddf610930e299
Author: Imran Rashid <ir...@cloudera.com>
Date:   2017-03-06T19:14:55Z

    Merge branch 'master' into SPARK-12297

commit 2c8a22811f404c751841e9d9f2e8b22780d60f99
Author: Imran Rashid <ir...@cloudera.com>
Date:   2017-03-06T19:22:11Z

    fix merge

----

