You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/03/18 20:24:05 UTC

[GitHub] [spark] MaxGekk opened a new pull request #31884: [SPARK-34793][SQL] Prohibit saving of day-time and year-month intervals

MaxGekk opened a new pull request #31884:
URL: https://github.com/apache/spark/pull/31884


   ### What changes were proposed in this pull request?
   For all built-in datasources, prohibit saving of year-month and day-time intervals that were introduced by SPARK-27793. We plan to support saving of such types at the milestone 2, see SPARK-27790. 
   
   ### Why are the changes needed?
   To improve user experience with Spark SQL, and print nicer error message. Current error message might confuse users:
   ```
   scala> Seq(java.time.Period.ofMonths(1)).toDF.write.mode("overwrite").json("/Users/maximgekk/tmp/123")
   21/03/18 22:44:35 ERROR FileFormatWriter: Aborting job 8de402d7-ab69-4dc0-aa8e-14ef06bd2d6b.
   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1) (192.168.1.66 executor driver): org.apache.spark.SparkException: Task failed while writing rows.
   	at org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:418)
   	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:298)
   	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:211)
   	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
   	at org.apache.spark.scheduler.Task.run(Task.scala:131)
   	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
   	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437)
   	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501)
   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   	at java.lang.Thread.run(Thread.java:748)
   Caused by: java.lang.RuntimeException: Failed to convert value 1 (class of class java.lang.Integer}) with the type of YearMonthIntervalType to JSON.
   	at scala.sys.package$.error(package.scala:30)
   	at org.apache.spark.sql.catalyst.json.JacksonGenerator.$anonfun$makeWriter$23(JacksonGenerator.scala:179)
   	at org.apache.spark.sql.catalyst.json.JacksonGenerator.$anonfun$makeWriter$23$adapted(JacksonGenerator.scala:176)
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   Yes. After the changes, the example above:
   ```
   scala> Seq(java.time.Period.ofMonths(1)).toDF.write.mode("overwrite").json("/Users/maximgekk/tmp/123")
   org.apache.spark.sql.AnalysisException: Cannot save the 'year-month interval' data type into external storage.
   ```
   
   ### How was this patch tested?
   Manually by saving year-month and day-time intervals to the JSON datasource.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31884: [SPARK-34793][SQL] Prohibit saving of day-time and year-month intervals

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31884:
URL: https://github.com/apache/spark/pull/31884#issuecomment-802607579


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40828/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on pull request #31884: [SPARK-34793][SQL] Prohibit saving of day-time and year-month intervals

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on pull request #31884:
URL: https://github.com/apache/spark/pull/31884#issuecomment-802928141


   @yaooqinn @cloud-fan Thank you for review. I am merging this to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] yaooqinn commented on a change in pull request #31884: [SPARK-34793][SQL] Prohibit saving of day-time and year-month intervals

Posted by GitBox <gi...@apache.org>.
yaooqinn commented on a change in pull request #31884:
URL: https://github.com/apache/spark/pull/31884#discussion_r597418087



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
##########
@@ -579,6 +573,14 @@ case class DataSource(
     DataSource.checkAndGlobPathIfNecessary(allPaths.toSeq, newHadoopConfiguration(),
       checkEmptyGlobPath, checkFilesExist, enableGlobbing = globPaths)
   }
+
+  private def checkAllowedTypesInWrite(dataTypes: Seq[DataType]): Unit = {
+    dataTypes.foreach {
+      case i @ (CalendarIntervalType | DayTimeIntervalType | YearMonthIntervalType) =>

Review comment:
       There also a method here`org.apache.spark.sql.catalyst.util.TypeUtils.failWithIntervalType`, maybe we can merge them together

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
##########
@@ -579,6 +573,14 @@ case class DataSource(
     DataSource.checkAndGlobPathIfNecessary(allPaths.toSeq, newHadoopConfiguration(),
       checkEmptyGlobPath, checkFilesExist, enableGlobbing = globPaths)
   }
+
+  private def checkAllowedTypesInWrite(dataTypes: Seq[DataType]): Unit = {
+    dataTypes.foreach {
+      case i @ (CalendarIntervalType | DayTimeIntervalType | YearMonthIntervalType) =>

Review comment:
       There's also a method here`org.apache.spark.sql.catalyst.util.TypeUtils.failWithIntervalType`, maybe we can merge them together




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31884: [SPARK-34793][SQL] Prohibit saving of day-time and year-month intervals

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31884:
URL: https://github.com/apache/spark/pull/31884#issuecomment-802300454


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40807/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a change in pull request #31884: [SPARK-34793][SQL] Prohibit saving of day-time and year-month intervals

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on a change in pull request #31884:
URL: https://github.com/apache/spark/pull/31884#discussion_r597415662



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
##########
@@ -579,6 +573,14 @@ case class DataSource(
     DataSource.checkAndGlobPathIfNecessary(allPaths.toSeq, newHadoopConfiguration(),
       checkEmptyGlobPath, checkFilesExist, enableGlobbing = globPaths)
   }
+
+  private def checkAllowedTypesInWrite(dataTypes: Seq[DataType]): Unit = {
+    dataTypes.foreach {
+      case i @ (CalendarIntervalType | DayTimeIntervalType | YearMonthIntervalType) =>

Review comment:
       At least, this is compatible with current master which checks intervals **only** on top-level. :-)  See
   On top-level:
   ```
   scala> spark.range(1).selectExpr("""timestamp'2021-01-02 00:01:02' - timestamp'2021-01-01 00:00:00'""").write.mode("overwrite").parquet("/Users/maximgekk/tmp/123")
   org.apache.spark.sql.AnalysisException: Cannot save interval data type into external storage.
   ```
   
   Nested:
   ```
   scala> spark.range(1).selectExpr("""struct(timestamp'2021-01-02 00:01:02' - timestamp'2021-01-01 00:00:00')""").write.mode("overwrite").parquet("/Users/maximgekk/tmp/123")
   org.apache.spark.sql.AnalysisException: Parquet data source does not support struct<col1:interval> data type.
   ```
   The last exception come from another place. Parquet cannot save nested interval because `CalendarIntervalType` is not `AtomicType`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31884: [SPARK-34793][SQL] Prohibit saving of day-time and year-month intervals

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31884:
URL: https://github.com/apache/spark/pull/31884#discussion_r597419979



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
##########
@@ -579,6 +573,14 @@ case class DataSource(
     DataSource.checkAndGlobPathIfNecessary(allPaths.toSeq, newHadoopConfiguration(),
       checkEmptyGlobPath, checkFilesExist, enableGlobbing = globPaths)
   }
+
+  private def checkAllowedTypesInWrite(dataTypes: Seq[DataType]): Unit = {
+    dataTypes.foreach {
+      case i @ (CalendarIntervalType | DayTimeIntervalType | YearMonthIntervalType) =>

Review comment:
       since we are touching it, let's make it right and consider nested interval types as well.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk closed pull request #31884: [SPARK-34793][SQL] Prohibit saving of day-time and year-month intervals

Posted by GitBox <gi...@apache.org>.
MaxGekk closed pull request #31884:
URL: https://github.com/apache/spark/pull/31884


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31884: [SPARK-34793][SQL] Prohibit saving of day-time and year-month intervals

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31884:
URL: https://github.com/apache/spark/pull/31884#issuecomment-802301747


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136225/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31884: [SPARK-34793][SQL] Prohibit saving of day-time and year-month intervals

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31884:
URL: https://github.com/apache/spark/pull/31884#issuecomment-802600603






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31884: [SPARK-34793][SQL] Prohibit saving of day-time and year-month intervals

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31884:
URL: https://github.com/apache/spark/pull/31884#discussion_r597391806



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
##########
@@ -510,10 +510,7 @@ case class DataSource(
       physicalPlan: SparkPlan,
       metrics: Map[String, SQLMetric]): BaseRelation = {
     val outputColumns = DataWritingCommand.logicalPlanOutputWithNames(data, outputColumnNames)
-    if (outputColumns.map(_.dataType).exists(_.isInstanceOf[CalendarIntervalType])) {
-      throw QueryCompilationErrors.cannotSaveIntervalIntoExternalStorageError()
-    }
-
+    checkAllowedTypesInWrite(outputColumns.map(_.dataType))

Review comment:
       let's be more specific `disallowWritingIntervals`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31884: [SPARK-34793][SQL] Prohibit saving of day-time and year-month intervals

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31884:
URL: https://github.com/apache/spark/pull/31884#issuecomment-802608540


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136246/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31884: [SPARK-34793][SQL] Prohibit saving of day-time and year-month intervals

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31884:
URL: https://github.com/apache/spark/pull/31884#discussion_r597391929



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
##########
@@ -579,6 +573,14 @@ case class DataSource(
     DataSource.checkAndGlobPathIfNecessary(allPaths.toSeq, newHadoopConfiguration(),
       checkEmptyGlobPath, checkFilesExist, enableGlobbing = globPaths)
   }
+
+  private def checkAllowedTypesInWrite(dataTypes: Seq[DataType]): Unit = {
+    dataTypes.foreach {
+      case i @ (CalendarIntervalType | DayTimeIntervalType | YearMonthIntervalType) =>

Review comment:
        we should use `DataType.exists` to detect nested interval types.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org