Posted to reviews@spark.apache.org by alexbaretta <gi...@git.apache.org> on 2014/12/31 08:30:18 UTC

[GitHub] spark pull request: [SPARK-4985][SQL Parquet] Parquet date support

GitHub user alexbaretta opened a pull request:

    https://github.com/apache/spark/pull/3855

    [SPARK-4985][SQL Parquet] Parquet date support

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/alexbaretta/spark parquet-date-support

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3855.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3855
    
----
commit 0ebe356bceff169fe89134bed603a17514dc1108
Author: Daoyuan Wang <da...@intel.com>
Date:   2014-12-29T07:59:37Z

    parquet support for date type

commit 81f466d3a6d0b7654d66205c018bd9496a82ad3b
Author: Alex Baretta <al...@gmail.com>
Date:   2014-12-31T02:13:34Z

    [SPARK-4985][SQL Parquet] Make DateType a subtype of PrimitiveType

commit d29224e14b424bdace41e13888cd7b2e9edc1c03
Author: Alex Baretta <al...@gmail.com>
Date:   2014-12-31T07:13:26Z

    [SPARK-4985][SQL Parquet] Fix 'Unsupported datatype DateType, cannot write to consumer'

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4985][SQL Parquet] Parquet date support

Posted by AmplabJenkins <gi...@git.apache.org>.
GitHub user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3855#issuecomment-68477625
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24979/




[GitHub] spark pull request: [SPARK-4985][SQL Parquet] Parquet date support

Posted by SparkQA <gi...@git.apache.org>.
GitHub user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3855#issuecomment-69867653
  
      [Test build #25509 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25509/consoleFull) for   PR 3855 at commit [`8851d1a`](https://github.com/apache/spark/commit/8851d1af5bf568770b4ec409d42ddbf45a69f22c).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-4985][SQL Parquet] Parquet date support

Posted by SparkQA <gi...@git.apache.org>.
GitHub user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3855#issuecomment-68477622
  
      [Test build #24979 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24979/consoleFull) for   PR 3855 at commit [`d29224e`](https://github.com/apache/spark/commit/d29224e14b424bdace41e13888cd7b2e9edc1c03).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-4985][SQL Parquet] Parquet date support

Posted by AmplabJenkins <gi...@git.apache.org>.
GitHub user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3855#issuecomment-68428076
  
    Can one of the admins verify this patch?




[GitHub] spark pull request: [SPARK-4985][SQL Parquet] Parquet date support

Posted by ash211 <gi...@git.apache.org>.
GitHub user ash211 commented on the pull request:

    https://github.com/apache/spark/pull/3855#issuecomment-68475839
  
    Jenkins, this is ok to test




[GitHub] spark pull request: [SPARK-4985][SQL Parquet] Parquet date support

Posted by rdblue <gi...@git.apache.org>.
GitHub user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3855#discussion_r28160082
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableSupport.scala ---
    @@ -207,6 +207,7 @@ private[parquet] class RowWriteSupport extends WriteSupport[Row] with Logging {
             case DoubleType => writer.addDouble(value.asInstanceOf[Double])
             case FloatType => writer.addFloat(value.asInstanceOf[Float])
             case BooleanType => writer.addBoolean(value.asInstanceOf[Boolean])
    +        case DateType => writer.addInteger(value.asInstanceOf[java.sql.Date].getTime.toInt)
    --- End diff --
    
    This doesn't conform to the [Parquet specification for date](https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md#date) and produces invalid data.
    
    When using the `DATE` annotation, the value must be the number of days from the Unix epoch, 1 January 1970. `java.sql.Date` and `java.util.Date` are backed by a long timestamp, the number of milliseconds from the Unix epoch (which corresponds to a Parquet `TIMESTAMP_MILLIS`), and casting that value to an integer makes it impossible to recover the real date.
    
    I recommend using `TIMESTAMP_MILLIS` instead of date here (you won't need the `toInt` part). That seems to be what you want, if you're interested in using `java.sql.Date`. The reason why there is a name mismatch is that the Parquet types mirror SQL types more closely than Java objects.
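
A minimal sketch of the point made in the comment above, assuming the millisecond value is interpreted as UTC and ignoring any time-zone adjustment a real writer would need. The object and method names (`ParquetDateSketch`, `millisToEpochDays`, `epochDaysToMillis`) are hypothetical illustrations; they are not taken from this pull request, from #3822, or from the Parquet API.

```scala
import java.util.concurrent.TimeUnit

// Hypothetical helper, for illustration only.
object ParquetDateSketch {
  // Number of milliseconds in one day (86,400,000).
  private val MillisPerDay: Long = TimeUnit.DAYS.toMillis(1)

  // Convert milliseconds since the Unix epoch to whole days since the epoch,
  // which is what the Parquet DATE annotation expects.
  // floorDiv rounds toward negative infinity, so instants before 1970
  // map to the correct (negative) day number.
  def millisToEpochDays(millisUtc: Long): Int =
    java.lang.Math.floorDiv(millisUtc, MillisPerDay).toInt

  // Inverse conversion: midnight UTC of the given day, in milliseconds.
  def epochDaysToMillis(days: Int): Long =
    days.toLong * MillisPerDay

  def main(args: Array[String]): Unit = {
    // 2014-12-31T00:00:00Z expressed in milliseconds since the epoch.
    val millis = 1419984000000L

    // Narrowing the Long to Int keeps only the low 32 bits,
    // so the original timestamp cannot be recovered from it.
    val truncated: Int = millis.toInt
    println(s"round-trips after toInt: ${truncated.toLong == millis}")

    // Dividing by the length of a day gives a value that fits in an Int
    // and can be converted back to the same calendar day.
    val days = millisToEpochDays(millis)
    println(s"days since epoch: $days") // 16435
    println(s"round-trips via days: ${epochDaysToMillis(days) == millis}")
  }
}
```

The choice between the two encodings depends on what the column is meant to hold: a day count suits a calendar date, while a 64-bit millisecond value suits an instant and needs no narrowing at all.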




[GitHub] spark pull request: [SPARK-4985][SQL Parquet] Parquet date support

Posted by liancheng <gi...@git.apache.org>.
GitHub user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3855#discussion_r23207916
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/parquet/ParquetSchemaSuite.scala ---
    @@ -55,7 +55,7 @@ class ParquetSchemaSuite extends FunSuite with ParquetTest {
           |}
         """.stripMargin)
     
    -  testSchema[(Byte, Short, Int, Long)](
    +  testSchema[(Byte, Short, Int, Long, java.sql.Date)](
    --- End diff --
    
    Nit: I'd usually prefer to import `java.sql.Date` and just use `Date` here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4985][SQL Parquet] Parquet date support

Posted by nchammas <gi...@git.apache.org>.
GitHub user nchammas commented on the pull request:

    https://github.com/apache/spark/pull/3855#issuecomment-69264886
  
    cc @marmbrus




[GitHub] spark pull request: [SPARK-4985][SQL Parquet] Parquet date support

Posted by SparkQA <gi...@git.apache.org>.
GitHub user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3855#issuecomment-69868221
  
      [Test build #25510 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25510/consoleFull) for   PR 3855 at commit [`929f294`](https://github.com/apache/spark/commit/929f294d9284ccb09c615823f95d006aa899edd8).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-4985][SQL Parquet] Parquet date support

Posted by rdblue <gi...@git.apache.org>.
GitHub user rdblue commented on the pull request:

    https://github.com/apache/spark/pull/3855#issuecomment-93043864
  
    @liancheng, I'll take a look as soon as I can. I'm a little swamped this week though, so I can't guarantee it'll be quick. Sorry!




[GitHub] spark pull request: [SPARK-4985][SQL Parquet] Parquet date support

Posted by AmplabJenkins <gi...@git.apache.org>.
GitHub user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3855#issuecomment-69872028
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25510/




[GitHub] spark pull request: [SPARK-4985][SQL Parquet] Parquet date support

Posted by rdblue <gi...@git.apache.org>.
GitHub user rdblue commented on the pull request:

    https://github.com/apache/spark/pull/3855#issuecomment-91621457
  
    I just looked at #3822 and it looks correct, so you can ignore my review comment above. In the future, please feel free to ping me for reviews when you're using logical types like Date, Timestamp, Decimal, etc. in either Parquet or Avro. I've been working on the specs in those communities and I'm happy to make sure the implementations look correct.




[GitHub] spark pull request: [SPARK-4985][SQL Parquet] Parquet date support

Posted by AmplabJenkins <gi...@git.apache.org>.
GitHub user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3855#issuecomment-69867816
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25509/




[GitHub] spark pull request: [SPARK-4985][SQL Parquet] Parquet date support

Posted by marmbrus <gi...@git.apache.org>.
GitHub user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/3855#issuecomment-82712480
  
    Seems like this is subsumed by #3822, and thus we can close this issue.




[GitHub] spark pull request: [SPARK-4985][SQL Parquet] Parquet date support

Posted by liancheng <gi...@git.apache.org>.
GitHub user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/3855#issuecomment-91892616
  
    @rdblue Thanks all the same for the review! BTW, it would be great if you could have a look at #5422, which refactors the Spark SQL Parquet converter (Parquet records to Spark SQL rows) and implements backwards-compatibility rules.




[GitHub] spark pull request: [SPARK-4985][SQL Parquet] Parquet date support

Posted by SparkQA <gi...@git.apache.org>.
GitHub user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3855#issuecomment-69867814
  
      [Test build #25509 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25509/consoleFull) for   PR 3855 at commit [`8851d1a`](https://github.com/apache/spark/commit/8851d1af5bf568770b4ec409d42ddbf45a69f22c).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-4985][SQL Parquet] Parquet date support

Posted by SparkQA <gi...@git.apache.org>.
GitHub user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3855#issuecomment-68475857
  
      [Test build #24979 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24979/consoleFull) for   PR 3855 at commit [`d29224e`](https://github.com/apache/spark/commit/d29224e14b424bdace41e13888cd7b2e9edc1c03).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-4985][SQL Parquet] Parquet date support

Posted by SparkQA <gi...@git.apache.org>.
GitHub user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3855#issuecomment-69872024
  
      [Test build #25510 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25510/consoleFull) for   PR 3855 at commit [`929f294`](https://github.com/apache/spark/commit/929f294d9284ccb09c615823f95d006aa899edd8).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-4985][SQL Parquet] Parquet date support

Posted by asfgit <gi...@git.apache.org>.
GitHub user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/3855

