Posted to reviews@spark.apache.org by alexbaretta <gi...@git.apache.org> on 2014/12/31 08:30:18 UTC
[GitHub] spark pull request: [SPARK-4985][SQL Parquet] Parquet date support
GitHub user alexbaretta opened a pull request:
https://github.com/apache/spark/pull/3855
[SPARK-4985][SQL Parquet] Parquet date support
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/alexbaretta/spark parquet-date-support
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/3855.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #3855
----
commit 0ebe356bceff169fe89134bed603a17514dc1108
Author: Daoyuan Wang <da...@intel.com>
Date: 2014-12-29T07:59:37Z
parquet support for date type
commit 81f466d3a6d0b7654d66205c018bd9496a82ad3b
Author: Alex Baretta <al...@gmail.com>
Date: 2014-12-31T02:13:34Z
[SPARK-4985][SQL Parquet] Make DateType a subtype of PrimitiveType
commit d29224e14b424bdace41e13888cd7b2e9edc1c03
Author: Alex Baretta <al...@gmail.com>
Date: 2014-12-31T07:13:26Z
[SPARK-4985][SQL Parquet] Fix 'Unsupported datatype DateType, cannot write to consumer'
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-4985][SQL Parquet] Parquet date support
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/3855#issuecomment-68477625
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24979/
Test PASSed.
[GitHub] spark pull request: [SPARK-4985][SQL Parquet] Parquet date support
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/3855#issuecomment-69867653
[Test build #25509 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25509/consoleFull) for PR 3855 at commit [`8851d1a`](https://github.com/apache/spark/commit/8851d1af5bf568770b4ec409d42ddbf45a69f22c).
* This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4985][SQL Parquet] Parquet date support
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/3855#issuecomment-68477622
[Test build #24979 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24979/consoleFull) for PR 3855 at commit [`d29224e`](https://github.com/apache/spark/commit/d29224e14b424bdace41e13888cd7b2e9edc1c03).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4985][SQL Parquet] Parquet date support
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/3855#issuecomment-68428076
Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-4985][SQL Parquet] Parquet date support
Posted by ash211 <gi...@git.apache.org>.
Github user ash211 commented on the pull request:
https://github.com/apache/spark/pull/3855#issuecomment-68475839
Jenkins, this is ok to test
[GitHub] spark pull request: [SPARK-4985][SQL Parquet] Parquet date support
Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/3855#discussion_r28160082
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableSupport.scala ---
@@ -207,6 +207,7 @@ private[parquet] class RowWriteSupport extends WriteSupport[Row] with Logging {
case DoubleType => writer.addDouble(value.asInstanceOf[Double])
case FloatType => writer.addFloat(value.asInstanceOf[Float])
case BooleanType => writer.addBoolean(value.asInstanceOf[Boolean])
+ case DateType => writer.addInteger(value.asInstanceOf[java.sql.Date].getTime.toInt)
--- End diff --
This doesn't conform to the [Parquet specification for date](https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md#date) and produces invalid data.
When using the `DATE` annotation, the value must be the number of days from the Unix epoch, 1 January 1970. `java.sql.Date` and `java.util.Date` are backed by a long timestamp, the number of milliseconds from the Unix epoch (which corresponds to a Parquet `TIMESTAMP_MILLIS`), and casting that value to an integer truncates it, making it impossible to recover the real date.
I recommend using `TIMESTAMP_MILLIS` instead of date here (you won't need the `toInt` part). That seems to be what you want, if you're interested in using `java.sql.Date`. The reason why there is a name mismatch is that the Parquet types mirror SQL types more closely than Java objects.
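To illustrate the difference, here is a minimal sketch of the day-based encoding that the `DATE` annotation calls for, next to the lossy cast from the diff. The object `ParquetDateSketch` and the helpers `toEpochDays` and `fromEpochDays` are hypothetical names used only for this example, not part of Spark or Parquet, and the sketch assumes Java 8 or later for `java.time`.

```scala
import java.sql.Date
import java.time.LocalDate

// Hypothetical helper object for illustration only; not Spark or Parquet API.
object ParquetDateSketch {

  // Days since 1970-01-01, the value the Parquet DATE annotation expects.
  // toLocalDate uses the calendar fields of the Date, so the result does not
  // depend on how many milliseconds the JVM's time zone puts before midnight.
  def toEpochDays(date: Date): Int =
    Math.toIntExact(date.toLocalDate.toEpochDay)

  // Inverse conversion: rebuild a java.sql.Date from a day count.
  def fromEpochDays(days: Int): Date =
    Date.valueOf(LocalDate.ofEpochDay(days.toLong))

  def main(args: Array[String]): Unit = {
    val date = Date.valueOf("2014-12-31")

    // Day-based encoding round-trips cleanly.
    val days = toEpochDays(date)
    println(s"days since epoch: $days")
    println(s"round trip:       ${fromEpochDays(days)}")

    // The cast from the diff: milliseconds since the epoch do not fit in an
    // Int, so the high bits are dropped and the original value is lost.
    val millis = date.getTime
    println(s"millis:           $millis")
    println(s"millis.toInt:     ${millis.toInt}")
    println(s"lossless cast?    ${millis.toInt.toLong == millis}")
  }
}
```

For a date such as 2014-12-31 the millisecond value is far larger than `Int.MaxValue`, so the last line prints `false`, while the day count stays well within the range of an `INT32`.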
[GitHub] spark pull request: [SPARK-4985][SQL Parquet] Parquet date support
Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:
https://github.com/apache/spark/pull/3855#discussion_r23207916
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/parquet/ParquetSchemaSuite.scala ---
@@ -55,7 +55,7 @@ class ParquetSchemaSuite extends FunSuite with ParquetTest {
|}
""".stripMargin)
- testSchema[(Byte, Short, Int, Long)](
+ testSchema[(Byte, Short, Int, Long, java.sql.Date)](
--- End diff --
Nit: I'd usually prefer to import `java.sql.Date` and just use `Date` here.
[GitHub] spark pull request: [SPARK-4985][SQL Parquet] Parquet date support
Posted by nchammas <gi...@git.apache.org>.
Github user nchammas commented on the pull request:
https://github.com/apache/spark/pull/3855#issuecomment-69264886
cc @marmbrus
[GitHub] spark pull request: [SPARK-4985][SQL Parquet] Parquet date support
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/3855#issuecomment-69868221
[Test build #25510 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25510/consoleFull) for PR 3855 at commit [`929f294`](https://github.com/apache/spark/commit/929f294d9284ccb09c615823f95d006aa899edd8).
* This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4985][SQL Parquet] Parquet date support
Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the pull request:
https://github.com/apache/spark/pull/3855#issuecomment-93043864
@liancheng, I'll take a look as soon as I can. I'm a little swamped this week though, so I can't guarantee it'll be quick. Sorry!
[GitHub] spark pull request: [SPARK-4985][SQL Parquet] Parquet date support
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/3855#issuecomment-69872028
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25510/
Test PASSed.
[GitHub] spark pull request: [SPARK-4985][SQL Parquet] Parquet date support
Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the pull request:
https://github.com/apache/spark/pull/3855#issuecomment-91621457
I just looked at #3822 and it looks correct, so you can ignore my review comment above. In the future, please feel free to ping me for reviews when you're using logical types like Date, Timestamp, Decimal, etc. in either Parquet or Avro. I've been working on the specs in those communities and I'm happy to make sure the implementations look correct.
[GitHub] spark pull request: [SPARK-4985][SQL Parquet] Parquet date support
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/3855#issuecomment-69867816
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25509/
Test FAILed.
[GitHub] spark pull request: [SPARK-4985][SQL Parquet] Parquet date support
Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:
https://github.com/apache/spark/pull/3855#issuecomment-82712480
Seems like this is subsumed by #3822, and thus we can close this issue.
[GitHub] spark pull request: [SPARK-4985][SQL Parquet] Parquet date support
Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:
https://github.com/apache/spark/pull/3855#issuecomment-91892616
@rdblue Thanks all the same for the review! BTW, it would be great if you could take a look at #5422, which refactors the Spark SQL Parquet converter (Parquet records to Spark SQL rows) and implements backwards-compatibility rules.
[GitHub] spark pull request: [SPARK-4985][SQL Parquet] Parquet date support
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/3855#issuecomment-69867814
[Test build #25509 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25509/consoleFull) for PR 3855 at commit [`8851d1a`](https://github.com/apache/spark/commit/8851d1af5bf568770b4ec409d42ddbf45a69f22c).
* This patch **fails to build**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4985][SQL Parquet] Parquet date support
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/3855#issuecomment-68475857
[Test build #24979 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24979/consoleFull) for PR 3855 at commit [`d29224e`](https://github.com/apache/spark/commit/d29224e14b424bdace41e13888cd7b2e9edc1c03).
* This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4985][SQL Parquet] Parquet date support
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/3855#issuecomment-69872024
[Test build #25510 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25510/consoleFull) for PR 3855 at commit [`929f294`](https://github.com/apache/spark/commit/929f294d9284ccb09c615823f95d006aa899edd8).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4985][SQL Parquet] Parquet date support
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/3855