You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by gengliangwang <gi...@git.apache.org> on 2018/06/29 06:55:00 UTC

[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

GitHub user gengliangwang opened a pull request:

    https://github.com/apache/spark/pull/21667

    [SPARK-24691][SQL]Add new API `supportDataType` in FileFormat

    ## What changes were proposed in this pull request?
    
    In https://github.com/apache/spark/pull/21389,  data source schema is validated before actual read/write. However,
    
    1. Putting all the validations together in `DataSourceUtils` is tricky and hard to maintain. On second thought after review, I find that the `OrcFileFormat` in hive package is not matched, so that its validation wrong.
    2.  `DataSourceUtils.verifyWriteSchema` and `DataSourceUtils.verifyReadSchema` is not supposed to be called in every file format. We can move them to some upper entry.
    So, I propose we can add a new API `supportDataType` in FileFormat. Each file format can override the method to specify its supported/non-supported data types.
    
    ## How was this patch tested?
    
    Unit test


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gengliangwang/spark refactorSchemaValidate

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21667.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21667
    
----
commit 7fdf6033b6778d06850e6ae5a0fd6e3fde76a5c2
Author: Gengliang Wang <ge...@...>
Date:   2018-06-28T16:32:44Z

    refactor schema validation

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    **[Test build #92867 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92867/testReport)** for PR 21667 at commit [`9ed3a7d`](https://github.com/apache/spark/commit/9ed3a7d3f94aed00fb8087d4a6318c0eb4f8da7d).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Dispatch the type support check...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r200303691
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala ---
    @@ -152,6 +152,11 @@ trait FileFormat {
         }
       }
     
    +  /**
    +   * Returns whether this format supports the given [[DataType]] in read/write path.
    +   * By default all data types are supported.
    +   */
    +  def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = true
    --- End diff --
    
    then why not remove this default implementation and create `HiveFileFormat#supportDataType` to return true?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    **[Test build #92678 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92678/testReport)** for PR 21667 at commit [`16beafa`](https://github.com/apache/spark/commit/16beafa4f128d3d592bf1f08c8e9b29770ae5a8a).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by gengliangwang <gi...@git.apache.org>.
Github user gengliangwang commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    retest this please
    



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/749/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r199418986
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala ---
    @@ -30,7 +30,7 @@ import org.apache.spark.sql.catalyst.json.{JacksonGenerator, JacksonParser, JSON
     import org.apache.spark.sql.catalyst.util.CompressionCodecs
     import org.apache.spark.sql.execution.datasources._
     import org.apache.spark.sql.sources._
    -import org.apache.spark.sql.types.{StringType, StructType}
    +import org.apache.spark.sql.types._
    --- End diff --
    
    If we employ the blacklist, I think it'd be better that you don't fold these imports.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by gengliangwang <gi...@git.apache.org>.
Github user gengliangwang commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    @HyukjinKwon @maropu  I have updated the code. It is now using whitelist.
    @cloud-fan Thanks for the review and +1


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    **[Test build #92680 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92680/testReport)** for PR 21667 at commit [`7266611`](https://github.com/apache/spark/commit/7266611b243000868c81f4538dd850c394fe7c20).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Add new API `supportDataType` in FileF...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    The fixes about the bug look all okay but the API thing. Mind if I ask to proceed separately for the API change if that's possible?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

Posted by gengliangwang <gi...@git.apache.org>.
Github user gengliangwang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r199524768
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala ---
    @@ -42,63 +38,27 @@ object DataSourceUtils {
     
       /**
        * Verify if the schema is supported in datasource. This verification should be done
    -   * in a driver side, e.g., `prepareWrite`, `buildReader`, and `buildReaderWithPartitionValues`
    -   * in `FileFormat`.
    -   *
    -   * Unsupported data types of csv, json, orc, and parquet are as follows;
    -   *  csv -> R/W: Interval, Null, Array, Map, Struct
    -   *  json -> W: Interval
    -   *  orc -> W: Interval, Null
    -   *  parquet -> R/W: Interval, Null
    +   * in a driver side.
        */
       private def verifySchema(format: FileFormat, schema: StructType, isReadPath: Boolean): Unit = {
    -    def throwUnsupportedException(dataType: DataType): Unit = {
    -      throw new UnsupportedOperationException(
    -        s"$format data source does not support ${dataType.simpleString} data type.")
    -    }
    -
    -    def verifyType(dataType: DataType): Unit = dataType match {
    -      case BooleanType | ByteType | ShortType | IntegerType | LongType | FloatType | DoubleType |
    -           StringType | BinaryType | DateType | TimestampType | _: DecimalType =>
    -
    -      // All the unsupported types for CSV
    -      case _: NullType | _: CalendarIntervalType | _: StructType | _: ArrayType | _: MapType
    -          if format.isInstanceOf[CSVFileFormat] =>
    -        throwUnsupportedException(dataType)
    -
    -      case st: StructType => st.foreach { f => verifyType(f.dataType) }
    -
    -      case ArrayType(elementType, _) => verifyType(elementType)
    -
    -      case MapType(keyType, valueType, _) =>
    -        verifyType(keyType)
    -        verifyType(valueType)
    -
    -      case udt: UserDefinedType[_] => verifyType(udt.sqlType)
    -
    -      // Interval type not supported in all the write path
    -      case _: CalendarIntervalType if !isReadPath =>
    -        throwUnsupportedException(dataType)
    -
    -      // JSON and ORC don't support an Interval type, but we pass it in read pass
    -      // for back-compatibility.
    -      case _: CalendarIntervalType if format.isInstanceOf[JsonFileFormat] ||
    -        format.isInstanceOf[OrcFileFormat] =>
    +    def verifyType(dataType: DataType): Unit = {
    +      if (!format.supportDataType(dataType, isReadPath)) {
    +        throw new UnsupportedOperationException(
    +          s"$format data source does not support ${dataType.simpleString} data type.")
    +      }
    +      dataType match {
    --- End diff --
    
    This is for general purpose, so that developer can skip matching arrays/maps/structs.
    I don't know about nested timestamp, but we can override supportDataType to make sure the case is unsupported, right?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by gengliangwang <gi...@git.apache.org>.
Github user gengliangwang commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    retest this please.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/674/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/851/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Dispatch the type support check...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r199863349
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala ---
    @@ -148,6 +144,28 @@ class JsonFileFormat extends TextBasedFileFormat with DataSourceRegister {
       override def hashCode(): Int = getClass.hashCode()
     
       override def equals(other: Any): Boolean = other.isInstanceOf[JsonFileFormat]
    +
    +  override def validateDataType(dataType: DataType, isReadPath: Boolean): Unit = dataType match {
    --- End diff --
    
    how about
    
    ```
    the base class
    def validateDataType(dataType: DataType, isReadPath: Boolean): Boolean = dataType match {
      case BooleanType | ByteType | ShortType | IntegerType | LongType | FloatType | DoubleType |
        StringType | BinaryType | DateType | TimestampType | _: DecimalType => true
      case _ => false
    }
    
    json
    override def validateDataType(dataType: DataType, isReadPath: Boolean): Boolean = {
      case st: StructType => st.forall { f => validateDataType(f.dataType, isReadPath) }
      case ArrayType...
      ...
      case other => super.validateDataType(other)
    }
    
    
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92686/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Dispatch the type support check...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r200621910
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala ---
    @@ -152,6 +152,11 @@ trait FileFormat {
         }
       }
     
    +  /**
    +   * Returns whether this format supports the given [[DataType]] in read/write path.
    +   * By default all data types are supported.
    +   */
    +  def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = true
    --- End diff --
    
    Technically this is an internal API but you keep the compatibility here in practice. I think I already see we are concerned about Avro, right? I will doubt if it's a good idea to expose this in this trait.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r199515847
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala ---
    @@ -42,63 +38,27 @@ object DataSourceUtils {
     
       /**
        * Verify if the schema is supported in datasource. This verification should be done
    -   * in a driver side, e.g., `prepareWrite`, `buildReader`, and `buildReaderWithPartitionValues`
    -   * in `FileFormat`.
    -   *
    -   * Unsupported data types of csv, json, orc, and parquet are as follows;
    -   *  csv -> R/W: Interval, Null, Array, Map, Struct
    -   *  json -> W: Interval
    -   *  orc -> W: Interval, Null
    -   *  parquet -> R/W: Interval, Null
    +   * in a driver side.
        */
       private def verifySchema(format: FileFormat, schema: StructType, isReadPath: Boolean): Unit = {
    -    def throwUnsupportedException(dataType: DataType): Unit = {
    -      throw new UnsupportedOperationException(
    -        s"$format data source does not support ${dataType.simpleString} data type.")
    -    }
    -
    -    def verifyType(dataType: DataType): Unit = dataType match {
    -      case BooleanType | ByteType | ShortType | IntegerType | LongType | FloatType | DoubleType |
    -           StringType | BinaryType | DateType | TimestampType | _: DecimalType =>
    -
    -      // All the unsupported types for CSV
    -      case _: NullType | _: CalendarIntervalType | _: StructType | _: ArrayType | _: MapType
    -          if format.isInstanceOf[CSVFileFormat] =>
    -        throwUnsupportedException(dataType)
    -
    -      case st: StructType => st.foreach { f => verifyType(f.dataType) }
    -
    -      case ArrayType(elementType, _) => verifyType(elementType)
    -
    -      case MapType(keyType, valueType, _) =>
    -        verifyType(keyType)
    -        verifyType(valueType)
    -
    -      case udt: UserDefinedType[_] => verifyType(udt.sqlType)
    -
    -      // Interval type not supported in all the write path
    -      case _: CalendarIntervalType if !isReadPath =>
    -        throwUnsupportedException(dataType)
    -
    -      // JSON and ORC don't support an Interval type, but we pass it in read pass
    -      // for back-compatibility.
    -      case _: CalendarIntervalType if format.isInstanceOf[JsonFileFormat] ||
    -        format.isInstanceOf[OrcFileFormat] =>
    +    def verifyType(dataType: DataType): Unit = {
    +      if (!format.supportDataType(dataType, isReadPath)) {
    +        throw new UnsupportedOperationException(
    +          s"$format data source does not support ${dataType.simpleString} data type.")
    +      }
    +      dataType match {
    --- End diff --
    
    Wait .. why we do the recursive thing here? What if the top level type is supported but nested is not? For example, Arrow doesn't currently support nested timestamp conversion for localization issue.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    **[Test build #92569 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92569/testReport)** for PR 21667 at commit [`5c590eb`](https://github.com/apache/spark/commit/5c590ebf39794c83c30dda1cdbbdc5d94dd0ff0d).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Add new API `supportDataType` in FileF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/579/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Dispatch the type support check...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r200304260
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala ---
    @@ -148,6 +144,23 @@ class JsonFileFormat extends TextBasedFileFormat with DataSourceRegister {
       override def hashCode(): Int = getClass.hashCode()
     
       override def equals(other: Any): Boolean = other.isInstanceOf[JsonFileFormat]
    +
    +  override def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = dataType match {
    +    case _: AtomicType => true
    +
    +    case st: StructType => st.forall { f => supportDataType(f.dataType, isReadPath) }
    +
    +    case ArrayType(elementType, _) => supportDataType(elementType, isReadPath)
    +
    +    case MapType(keyType, valueType, _) =>
    +      supportDataType(keyType, isReadPath) && supportDataType(valueType, isReadPath)
    +
    +    case udt: UserDefinedType[_] => supportDataType(udt.sqlType, isReadPath)
    +
    +    case _: NullType => true
    --- End diff --
    
    why JSON supports null type but CSV doesn't?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r199095666
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala ---
    @@ -152,6 +152,16 @@ trait FileFormat {
         }
       }
     
    +  /**
    +   * Returns whether this format supports the given [[DataType]] in read/write path.
    +   *
    +   * By default all data types are supported except [[CalendarIntervalType]] in write path.
    +   */
    +  def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = dataType match {
    --- End diff --
    
    Might be easier to write but it doesn't consider if we happened to have some more types on the other hand. It should better be explicit on what we support on the other hand.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/731/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

Posted by gengliangwang <gi...@git.apache.org>.
Github user gengliangwang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r199494578
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala ---
    @@ -152,6 +152,16 @@ trait FileFormat {
         }
       }
     
    +  /**
    +   * Returns whether this format supports the given [[DataType]] in read/write path.
    +   *
    +   * By default all data types are supported except [[CalendarIntervalType]] in write path.
    +   */
    +  def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = dataType match {
    --- End diff --
    
    Whitelist for all file formats is behavior change. There are external file sources like https://github.com/databricks/spark-avro , which we probably have to update the code to make it compatible.
    
    Currently exceptions are thrown in `buildReader` / `buildReaderWithPartitionValues`/ `prepareWrite` for unsupported types. 
    
    So overall I prefer blacklist.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Add new API `supportDataType` in FileF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92465/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Add new API `supportDataType` in FileF...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Add new API `supportDataType` in FileF...

Posted by gengliangwang <gi...@git.apache.org>.
Github user gengliangwang commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    @maropu @gatorsmile 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r199076368
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcSourceSuite.scala ---
    @@ -156,28 +156,6 @@ class HiveOrcSourceSuite extends OrcSuite with TestHiveSingleton {
             sql("select testType()").write.mode("overwrite").orc(orcDir)
           }.getMessage
           assert(msg.contains("ORC data source does not support calendarinterval data type."))
    -
    -      // read path
    --- End diff --
    
    Is there any write path test already?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Dispatch the type support check...

Posted by gengliangwang <gi...@git.apache.org>.
Github user gengliangwang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r200255604
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala ---
    @@ -152,6 +152,11 @@ trait FileFormat {
         }
       }
     
    +  /**
    +   * Returns whether this format supports the given [[DataType]] in read/write path.
    +   * By default all data types are supported.
    +   */
    +  def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = true
    --- End diff --
    
    `HiveFileFormat`. Currently I don't know Hive well. If someone can override it for `HiveFileFormat`, please create a follow up PR.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r199526695
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala ---
    @@ -152,6 +152,16 @@ trait FileFormat {
         }
       }
     
    +  /**
    +   * Returns whether this format supports the given [[DataType]] in read/write path.
    +   *
    +   * By default all data types are supported except [[CalendarIntervalType]] in write path.
    --- End diff --
    
    FYI, `CalendarIntervalType` isn't completely public yet .. cc @cloud-fan.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

Posted by gengliangwang <gi...@git.apache.org>.
Github user gengliangwang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r199070174
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcSourceSuite.scala ---
    @@ -156,28 +156,6 @@ class HiveOrcSourceSuite extends OrcSuite with TestHiveSingleton {
             sql("select testType()").write.mode("overwrite").orc(orcDir)
           }.getMessage
           assert(msg.contains("ORC data source does not support calendarinterval data type."))
    -
    -      // read path
    --- End diff --
    
    In read path, ORC should support `CalendarIntervalType` and `NullType`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92611/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92863/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Dispatch the type support check...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r199861581
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala ---
    @@ -152,6 +152,12 @@ trait FileFormat {
         }
       }
     
    +  /**
    +   * Validate the given [[DataType]] in read/write path for this file format.
    +   * If the [[DataType]] is not supported, an exception will be thrown.
    +   * By default all data types are supported.
    +   */
    +  def validateDataType(dataType: DataType, isReadPath: Boolean): Unit = {}
    --- End diff --
    
    it's better to return boolean here, and let the caller side to throw exception, so that we can unify the error message.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Add new API `supportDataType` in FileF...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Reading it again closely, I am actually not super happy on the proposal about introducing API change if the purpose of this is just to check the type and throw an exception. Apparently, it looks so. I am less sure how useful it is by looking the current change. It reduces the size of codes because it blacklists. I would suggest to make the API change separate with this.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Dispatch the type support check...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r199861251
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala ---
    @@ -42,65 +38,9 @@ object DataSourceUtils {
     
       /**
        * Verify if the schema is supported in datasource. This verification should be done
    -   * in a driver side, e.g., `prepareWrite`, `buildReader`, and `buildReaderWithPartitionValues`
    -   * in `FileFormat`.
    -   *
    -   * Unsupported data types of csv, json, orc, and parquet are as follows;
    -   *  csv -> R/W: Interval, Null, Array, Map, Struct
    -   *  json -> W: Interval
    -   *  orc -> W: Interval, Null
    -   *  parquet -> R/W: Interval, Null
    +   * in a driver side.
        */
       private def verifySchema(format: FileFormat, schema: StructType, isReadPath: Boolean): Unit = {
    --- End diff --
    
    do we still need this method? it's just one-line in the body.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Dispatch the type support check...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r200236630
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala ---
    @@ -141,6 +136,11 @@ class TextFileFormat extends TextBasedFileFormat with DataSourceRegister {
           }
         }
       }
    +
    +  override def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = dataType match {
    --- End diff --
    
    `dataType == StringType`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92678/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    **[Test build #92639 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92639/testReport)** for PR 21667 at commit [`44cf265`](https://github.com/apache/spark/commit/44cf265d3c3535ef4d1216c6317247d1365dca8c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r199511369
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala ---
    @@ -152,6 +152,16 @@ trait FileFormat {
         }
       }
     
    +  /**
    +   * Returns whether this format supports the given [[DataType]] in read/write path.
    +   *
    +   * By default all data types are supported except [[CalendarIntervalType]] in write path.
    +   */
    +  def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = dataType match {
    --- End diff --
    
    I meant the `case`s in the match in each implementation within Spark. I didn't mean about the semantic about the API itself. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

Posted by gengliangwang <gi...@git.apache.org>.
Github user gengliangwang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r199079744
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala ---
    @@ -152,6 +152,16 @@ trait FileFormat {
         }
       }
     
    +  /**
    +   * Returns whether this format supports the given [[DataType]] in read/write path.
    +   *
    +   * By default all data types are supported except [[CalendarIntervalType]] in write path.
    +   */
    +  def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = dataType match {
    --- End diff --
    
    Blacklist is easier.
    With whitelist , we will have to validate 
    ``` 
    BooleanType | ByteType | ShortType | IntegerType | LongType | FloatType | DoubleType |
              StringType | BinaryType | DateType | TimestampType | DecimalType
    ```
    Of course we can have a default function to process these. But if we add a new data source which didn't support all of them, the implementation will be verbose.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Dispatch the type support check...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r200236504
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala ---
    @@ -153,6 +151,16 @@ class CSVFileFormat extends TextBasedFileFormat with DataSourceRegister {
       override def hashCode(): Int = getClass.hashCode()
     
       override def equals(other: Any): Boolean = other.isInstanceOf[CSVFileFormat]
    +
    +  override def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = dataType match {
    +    case BooleanType | ByteType | ShortType | IntegerType | LongType | FloatType | DoubleType |
    +         StringType | BinaryType | DateType | TimestampType | _: DecimalType => true
    --- End diff --
    
    isn't it just `case _: AtomicType => true`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Add new API `supportDataType` in FileF...

Posted by gengliangwang <gi...@git.apache.org>.
Github user gengliangwang commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    @hvanhovell 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Dispatch the type support check...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r200630376
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala ---
    @@ -152,6 +152,11 @@ trait FileFormat {
         }
       }
     
    +  /**
    +   * Returns whether this format supports the given [[DataType]] in read/write path.
    +   * By default all data types are supported.
    +   */
    +  def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = true
    --- End diff --
    
    I meant to say it's not free to remove or change the signature later once we happen to add it. Do we plan to refactor or remove this out in the near future? If so I am okay.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92711/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Add new API `supportDataType` in FileF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    **[Test build #92459 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92459/testReport)** for PR 21667 at commit [`7fdf603`](https://github.com/apache/spark/commit/7fdf6033b6778d06850e6ae5a0fd6e3fde76a5c2).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92680/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/850/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Dispatch the type support check...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r199867594
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcSourceSuite.scala ---
    @@ -156,28 +156,6 @@ class HiveOrcSourceSuite extends OrcSuite with TestHiveSingleton {
             sql("select testType()").write.mode("overwrite").orc(orcDir)
           }.getMessage
           assert(msg.contains("ORC data source does not support calendarinterval data type."))
    -
    -      // read path
    --- End diff --
    
    +1 for ^


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Dispatch the type support check...

Posted by gengliangwang <gi...@git.apache.org>.
Github user gengliangwang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r199999766
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala ---
    @@ -152,6 +152,12 @@ trait FileFormat {
         }
       }
     
    +  /**
    +   * Validate the given [[DataType]] in read/write path for this file format.
    +   * If the [[DataType]] is not supported, an exception will be thrown.
    +   * By default all data types are supported.
    +   */
    +  def validateDataType(dataType: DataType, isReadPath: Boolean): Unit = {}
    --- End diff --
    
    Yes, that was what I did in first commit.  If the unsupported type is inside struct/array, then the error message is not accurate as the current way.
    I am OK with revert to return Boolean though.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    My major concern is when to apply this check. Ideally this should happen during analysis.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Dispatch the type support check...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r201747863
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala ---
    @@ -228,4 +224,21 @@ class OrcFileFormat
           }
         }
       }
    +
    +  override def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = dataType match {
    +    case _: AtomicType => true
    +
    +    case st: StructType => st.forall { f => supportDataType(f.dataType, isReadPath) }
    +
    +    case ArrayType(elementType, _) => supportDataType(elementType, isReadPath)
    +
    +    case MapType(keyType, valueType, _) =>
    +      supportDataType(keyType, isReadPath) && supportDataType(valueType, isReadPath)
    +
    +    case udt: UserDefinedType[_] => supportDataType(udt.sqlType, isReadPath)
    +
    +    case _: NullType => isReadPath
    +
    +    case _ => false
    --- End diff --
    
    ditto


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Dispatch the type support check...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r199862160
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ---
    @@ -306,6 +306,7 @@ case class FileSourceScanExec(
       }
     
       private lazy val inputRDD: RDD[InternalRow] = {
    +    DataSourceUtils.verifyReadSchema(relation.fileFormat, relation.dataSchema)
    --- End diff --
    
    BTW why do we need to check schema at read path?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r199416587
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ---
    @@ -306,6 +306,7 @@ case class FileSourceScanExec(
       }
     
       private lazy val inputRDD: RDD[InternalRow] = {
    +    DataSourceUtils.verifyReadSchema(relation.fileFormat, relation.dataSchema)
    --- End diff --
    
    In some formats, is this verification applied two times, right?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    **[Test build #92686 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92686/testReport)** for PR 21667 at commit [`7266611`](https://github.com/apache/spark/commit/7266611b243000868c81f4538dd850c394fe7c20).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r199096057
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala ---
    @@ -152,6 +152,16 @@ trait FileFormat {
         }
       }
     
    +  /**
    +   * Returns whether this format supports the given [[DataType]] in read/write path.
    +   *
    +   * By default all data types are supported except [[CalendarIntervalType]] in write path.
    +   */
    +  def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = dataType match {
    --- End diff --
    
    I wrote CSV's with whitelisting before per @hvanhovell's comment long time ago. I was (am still) okay either way but might be good to leave a cc for him.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r199418036
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala ---
    @@ -152,6 +152,16 @@ trait FileFormat {
         }
       }
     
    +  /**
    +   * Returns whether this format supports the given [[DataType]] in read/write path.
    +   *
    +   * By default all data types are supported except [[CalendarIntervalType]] in write path.
    +   */
    +  def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = dataType match {
    --- End diff --
    
    I like the whilelist, too. As @HyukjinKwon said, if someone implements a new type, the blacklist pass through it...


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Add new API `supportDataType` in FileF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    **[Test build #92459 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92459/testReport)** for PR 21667 at commit [`7fdf603`](https://github.com/apache/spark/commit/7fdf6033b6778d06850e6ae5a0fd6e3fde76a5c2).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    thanks, merging to master!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Dispatch the type support check...

Posted by gengliangwang <gi...@git.apache.org>.
Github user gengliangwang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r200256945
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ---
    @@ -306,6 +306,7 @@ case class FileSourceScanExec(
       }
     
       private lazy val inputRDD: RDD[InternalRow] = {
    +    DataSourceUtils.verifyReadSchema(relation.fileFormat, relation.dataSchema)
    --- End diff --
    
    Yes, for user-specified schema.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

Posted by gengliangwang <gi...@git.apache.org>.
Github user gengliangwang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r199519380
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala ---
    @@ -152,6 +152,16 @@ trait FileFormat {
         }
       }
     
    +  /**
    +   * Returns whether this format supports the given [[DataType]] in read/write path.
    +   *
    +   * By default all data types are supported except [[CalendarIntervalType]] in write path.
    +   */
    +  def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = dataType match {
    --- End diff --
    
    @HyukjinKwon Sorry I don't understand. Do you mean the default case is not supported?
     ```
    case _  =>  false
    ```
    But how to make all the external formats work?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Dispatch the type support check...

Posted by gengliangwang <gi...@git.apache.org>.
Github user gengliangwang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r200561229
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala ---
    @@ -148,6 +144,23 @@ class JsonFileFormat extends TextBasedFileFormat with DataSourceRegister {
       override def hashCode(): Int = getClass.hashCode()
     
       override def equals(other: Any): Boolean = other.isInstanceOf[JsonFileFormat]
    +
    +  override def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = dataType match {
    +    case _: AtomicType => true
    +
    +    case st: StructType => st.forall { f => supportDataType(f.dataType, isReadPath) }
    +
    +    case ArrayType(elementType, _) => supportDataType(elementType, isReadPath)
    +
    +    case MapType(keyType, valueType, _) =>
    +      supportDataType(keyType, isReadPath) && supportDataType(valueType, isReadPath)
    +
    +    case udt: UserDefinedType[_] => supportDataType(udt.sqlType, isReadPath)
    +
    +    case _: NullType => true
    --- End diff --
    
    currently null type is not handled in UnivocityParser


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r199094171
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcSourceSuite.scala ---
    @@ -156,28 +156,6 @@ class HiveOrcSourceSuite extends OrcSuite with TestHiveSingleton {
             sql("select testType()").write.mode("overwrite").orc(orcDir)
           }.getMessage
           assert(msg.contains("ORC data source does not support calendarinterval data type."))
    -
    -      // read path
    --- End diff --
    
    I mean the tests were negative tests. so I was expecting that we'd have positive tests.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/896/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r199525496
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala ---
    @@ -42,63 +38,27 @@ object DataSourceUtils {
     
       /**
        * Verify if the schema is supported in datasource. This verification should be done
    -   * in a driver side, e.g., `prepareWrite`, `buildReader`, and `buildReaderWithPartitionValues`
    -   * in `FileFormat`.
    -   *
    -   * Unsupported data types of csv, json, orc, and parquet are as follows;
    -   *  csv -> R/W: Interval, Null, Array, Map, Struct
    -   *  json -> W: Interval
    -   *  orc -> W: Interval, Null
    -   *  parquet -> R/W: Interval, Null
    +   * in a driver side.
        */
       private def verifySchema(format: FileFormat, schema: StructType, isReadPath: Boolean): Unit = {
    -    def throwUnsupportedException(dataType: DataType): Unit = {
    -      throw new UnsupportedOperationException(
    -        s"$format data source does not support ${dataType.simpleString} data type.")
    -    }
    -
    -    def verifyType(dataType: DataType): Unit = dataType match {
    -      case BooleanType | ByteType | ShortType | IntegerType | LongType | FloatType | DoubleType |
    -           StringType | BinaryType | DateType | TimestampType | _: DecimalType =>
    -
    -      // All the unsupported types for CSV
    -      case _: NullType | _: CalendarIntervalType | _: StructType | _: ArrayType | _: MapType
    -          if format.isInstanceOf[CSVFileFormat] =>
    -        throwUnsupportedException(dataType)
    -
    -      case st: StructType => st.foreach { f => verifyType(f.dataType) }
    -
    -      case ArrayType(elementType, _) => verifyType(elementType)
    -
    -      case MapType(keyType, valueType, _) =>
    -        verifyType(keyType)
    -        verifyType(valueType)
    -
    -      case udt: UserDefinedType[_] => verifyType(udt.sqlType)
    -
    -      // Interval type not supported in all the write path
    -      case _: CalendarIntervalType if !isReadPath =>
    -        throwUnsupportedException(dataType)
    -
    -      // JSON and ORC don't support an Interval type, but we pass it in read pass
    -      // for back-compatibility.
    -      case _: CalendarIntervalType if format.isInstanceOf[JsonFileFormat] ||
    -        format.isInstanceOf[OrcFileFormat] =>
    +    def verifyType(dataType: DataType): Unit = {
    +      if (!format.supportDataType(dataType, isReadPath)) {
    +        throw new UnsupportedOperationException(
    +          s"$format data source does not support ${dataType.simpleString} data type.")
    +      }
    +      dataType match {
    --- End diff --
    
    Yea.. then this code bit is not really general purpose anymore ... developers should check the codes inside and see if the nested types are automatically checked or not ..


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Dispatch the type support check...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r199879469
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala ---
    @@ -148,6 +144,28 @@ class JsonFileFormat extends TextBasedFileFormat with DataSourceRegister {
       override def hashCode(): Int = getClass.hashCode()
     
       override def equals(other: Any): Boolean = other.isInstanceOf[JsonFileFormat]
    +
    +  override def validateDataType(dataType: DataType, isReadPath: Boolean): Unit = dataType match {
    --- End diff --
    
    what do you mean by break? If people use internal API(like avro), they are responsible to update their code for the internal API changes in new Spark releases.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Dispatch the type support check...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r200236731
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
    @@ -178,6 +178,24 @@ class OrcFileFormat extends FileFormat with DataSourceRegister with Serializable
           }
         }
       }
    +
    +  override def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = dataType match {
    --- End diff --
    
    ```
    case NullType => isReadPath
    case _ => true
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Dispatch the type support check...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r199860192
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala ---
    @@ -42,65 +38,9 @@ object DataSourceUtils {
     
       /**
        * Verify if the schema is supported in datasource. This verification should be done
    -   * in a driver side, e.g., `prepareWrite`, `buildReader`, and `buildReaderWithPartitionValues`
    -   * in `FileFormat`.
    -   *
    -   * Unsupported data types of csv, json, orc, and parquet are as follows;
    -   *  csv -> R/W: Interval, Null, Array, Map, Struct
    -   *  json -> W: Interval
    -   *  orc -> W: Interval, Null
    -   *  parquet -> R/W: Interval, Null
    +   * in a driver side.
    --- End diff --
    
    `FileFormat` is internal so it's nothing about public API, but just about design choice.
    
    Generally it's ok to have a central place to put some business logic for different cases. However, here we can't access all `FileFormat` implementations, Hive ORC is in Hive module. Now the only choice is: dispatch the business logic into implementations.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Add new API `supportDataType` in FileF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    **[Test build #92465 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92465/testReport)** for PR 21667 at commit [`7fdf603`](https://github.com/apache/spark/commit/7fdf6033b6778d06850e6ae5a0fd6e3fde76a5c2).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    **[Test build #92913 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92913/testReport)** for PR 21667 at commit [`13de60e`](https://github.com/apache/spark/commit/13de60ec3916b8396e6fe04e755eeb7d70c54b3a).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    **[Test build #92913 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92913/testReport)** for PR 21667 at commit [`13de60e`](https://github.com/apache/spark/commit/13de60ec3916b8396e6fe04e755eeb7d70c54b3a).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Dispatch the type support check...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r200622296
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala ---
    @@ -148,6 +144,23 @@ class JsonFileFormat extends TextBasedFileFormat with DataSourceRegister {
       override def hashCode(): Int = getClass.hashCode()
     
       override def equals(other: Any): Boolean = other.isInstanceOf[JsonFileFormat]
    +
    +  override def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = dataType match {
    --- End diff --
    
    Yea, supported types are very specific to datasource's implementation.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r199523308
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala ---
    @@ -152,6 +152,16 @@ trait FileFormat {
         }
       }
     
    +  /**
    +   * Returns whether this format supports the given [[DataType]] in read/write path.
    +   *
    +   * By default all data types are supported except [[CalendarIntervalType]] in write path.
    +   */
    +  def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = dataType match {
    --- End diff --
    
    OK. I meant, leaving the default case `true`
    
    ```scala
    def supportDataType(...): Boolean = dataType match {
      case _ => true
    }
    ```
    
    and whitelist each type within each implementation, for example, in `CSVFileFormat.scala`
    
    ```scala
    def supportDataType(...) ...
        case _: StringType | ... => true
        case _ => false
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Add new API `supportDataType` in FileF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    **[Test build #92863 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92863/testReport)** for PR 21667 at commit [`757b82a`](https://github.com/apache/spark/commit/757b82a248f019a7f722702b7fd65f87a6ef0d17).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r199075466
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala ---
    @@ -152,6 +152,16 @@ trait FileFormat {
         }
       }
     
    +  /**
    +   * Returns whether this format supports the given [[DataType]] in read/write path.
    +   *
    +   * By default all data types are supported except [[CalendarIntervalType]] in write path.
    +   */
    +  def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = dataType match {
    --- End diff --
    
    Hm, shouldn't we better whitelist them rather then blacklist?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r199694805
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala ---
    @@ -42,63 +38,27 @@ object DataSourceUtils {
     
       /**
        * Verify if the schema is supported in datasource. This verification should be done
    -   * in a driver side, e.g., `prepareWrite`, `buildReader`, and `buildReaderWithPartitionValues`
    -   * in `FileFormat`.
    -   *
    -   * Unsupported data types of csv, json, orc, and parquet are as follows;
    -   *  csv -> R/W: Interval, Null, Array, Map, Struct
    -   *  json -> W: Interval
    -   *  orc -> W: Interval, Null
    -   *  parquet -> R/W: Interval, Null
    +   * in a driver side.
        */
       private def verifySchema(format: FileFormat, schema: StructType, isReadPath: Boolean): Unit = {
    -    def throwUnsupportedException(dataType: DataType): Unit = {
    -      throw new UnsupportedOperationException(
    -        s"$format data source does not support ${dataType.simpleString} data type.")
    -    }
    -
    -    def verifyType(dataType: DataType): Unit = dataType match {
    -      case BooleanType | ByteType | ShortType | IntegerType | LongType | FloatType | DoubleType |
    -           StringType | BinaryType | DateType | TimestampType | _: DecimalType =>
    -
    -      // All the unsupported types for CSV
    -      case _: NullType | _: CalendarIntervalType | _: StructType | _: ArrayType | _: MapType
    -          if format.isInstanceOf[CSVFileFormat] =>
    -        throwUnsupportedException(dataType)
    -
    -      case st: StructType => st.foreach { f => verifyType(f.dataType) }
    -
    -      case ArrayType(elementType, _) => verifyType(elementType)
    -
    -      case MapType(keyType, valueType, _) =>
    -        verifyType(keyType)
    -        verifyType(valueType)
    -
    -      case udt: UserDefinedType[_] => verifyType(udt.sqlType)
    -
    -      // Interval type not supported in all the write path
    -      case _: CalendarIntervalType if !isReadPath =>
    -        throwUnsupportedException(dataType)
    -
    -      // JSON and ORC don't support an Interval type, but we pass it in read pass
    -      // for back-compatibility.
    -      case _: CalendarIntervalType if format.isInstanceOf[JsonFileFormat] ||
    -        format.isInstanceOf[OrcFileFormat] =>
    +    def verifyType(dataType: DataType): Unit = {
    +      if (!format.supportDataType(dataType, isReadPath)) {
    +        throw new UnsupportedOperationException(
    +          s"$format data source does not support ${dataType.simpleString} data type.")
    +      }
    +      dataType match {
    --- End diff --
    
    It's tricky to rely on 2 places to correctly determine the unsupported type. `format.supportDataType` should handle complex types themselves, to make the code clearer and easier to maintain.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Add new API `supportDataType` in FileF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    **[Test build #92864 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92864/testReport)** for PR 21667 at commit [`757b82a`](https://github.com/apache/spark/commit/757b82a248f019a7f722702b7fd65f87a6ef0d17).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

Posted by gengliangwang <gi...@git.apache.org>.
Github user gengliangwang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r199529548
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala ---
    @@ -42,63 +38,27 @@ object DataSourceUtils {
     
       /**
        * Verify if the schema is supported in datasource. This verification should be done
    -   * in a driver side, e.g., `prepareWrite`, `buildReader`, and `buildReaderWithPartitionValues`
    -   * in `FileFormat`.
    -   *
    -   * Unsupported data types of csv, json, orc, and parquet are as follows;
    -   *  csv -> R/W: Interval, Null, Array, Map, Struct
    -   *  json -> W: Interval
    -   *  orc -> W: Interval, Null
    -   *  parquet -> R/W: Interval, Null
    +   * in a driver side.
        */
       private def verifySchema(format: FileFormat, schema: StructType, isReadPath: Boolean): Unit = {
    -    def throwUnsupportedException(dataType: DataType): Unit = {
    -      throw new UnsupportedOperationException(
    -        s"$format data source does not support ${dataType.simpleString} data type.")
    -    }
    -
    -    def verifyType(dataType: DataType): Unit = dataType match {
    -      case BooleanType | ByteType | ShortType | IntegerType | LongType | FloatType | DoubleType |
    -           StringType | BinaryType | DateType | TimestampType | _: DecimalType =>
    -
    -      // All the unsupported types for CSV
    -      case _: NullType | _: CalendarIntervalType | _: StructType | _: ArrayType | _: MapType
    -          if format.isInstanceOf[CSVFileFormat] =>
    -        throwUnsupportedException(dataType)
    -
    -      case st: StructType => st.foreach { f => verifyType(f.dataType) }
    -
    -      case ArrayType(elementType, _) => verifyType(elementType)
    -
    -      case MapType(keyType, valueType, _) =>
    -        verifyType(keyType)
    -        verifyType(valueType)
    -
    -      case udt: UserDefinedType[_] => verifyType(udt.sqlType)
    -
    -      // Interval type not supported in all the write path
    -      case _: CalendarIntervalType if !isReadPath =>
    -        throwUnsupportedException(dataType)
    -
    -      // JSON and ORC don't support an Interval type, but we pass it in read pass
    -      // for back-compatibility.
    -      case _: CalendarIntervalType if format.isInstanceOf[JsonFileFormat] ||
    -        format.isInstanceOf[OrcFileFormat] =>
    +    def verifyType(dataType: DataType): Unit = {
    +      if (!format.supportDataType(dataType, isReadPath)) {
    +        throw new UnsupportedOperationException(
    +          s"$format data source does not support ${dataType.simpleString} data type.")
    +      }
    +      dataType match {
    --- End diff --
    
    I know..But if developers didn't read inside and process the case of  arrays/maps/structs, the code should still work.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    **[Test build #92930 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92930/testReport)** for PR 21667 at commit [`13de60e`](https://github.com/apache/spark/commit/13de60ec3916b8396e6fe04e755eeb7d70c54b3a).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    **[Test build #92867 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92867/testReport)** for PR 21667 at commit [`9ed3a7d`](https://github.com/apache/spark/commit/9ed3a7d3f94aed00fb8087d4a6318c0eb4f8da7d).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Dispatch the type support check...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r200616465
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala ---
    @@ -152,6 +152,11 @@ trait FileFormat {
         }
       }
     
    +  /**
    +   * Returns whether this format supports the given [[DataType]] in read/write path.
    +   * By default all data types are supported.
    +   */
    +  def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = true
    --- End diff --
    
    makes sense


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    how about the rule `PreReadCheck` and `PreWriteCheck`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Add new API `supportDataType` in FileF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    **[Test build #92864 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92864/testReport)** for PR 21667 at commit [`757b82a`](https://github.com/apache/spark/commit/757b82a248f019a7f722702b7fd65f87a6ef0d17).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/726/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Add new API `supportDataType` in FileF...

Posted by gengliangwang <gi...@git.apache.org>.
Github user gengliangwang commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Sure, I am actually OK if we can have a different approach other than API.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    **[Test build #92611 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92611/testReport)** for PR 21667 at commit [`34134f1`](https://github.com/apache/spark/commit/34134f10f4f0191acb3e52fc7ef9ccb25eae85ad).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Dispatch the type support check...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r200629628
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala ---
    @@ -152,6 +152,11 @@ trait FileFormat {
         }
       }
     
    +  /**
    +   * Returns whether this format supports the given [[DataType]] in read/write path.
    +   * By default all data types are supported.
    +   */
    +  def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = true
    --- End diff --
    
    This is the only way to allow avro to define its supported types. BTW the default true value here is good for compatibility: if a file source doesn't know this API, it doesn't need to implement it and the behavior is unchanged, which is, no check applied.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/642/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    **[Test build #92863 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92863/testReport)** for PR 21667 at commit [`757b82a`](https://github.com/apache/spark/commit/757b82a248f019a7f722702b7fd65f87a6ef0d17).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    **[Test build #92686 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92686/testReport)** for PR 21667 at commit [`7266611`](https://github.com/apache/spark/commit/7266611b243000868c81f4538dd850c394fe7c20).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

Posted by gengliangwang <gi...@git.apache.org>.
Github user gengliangwang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r199705148
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala ---
    @@ -42,63 +38,27 @@ object DataSourceUtils {
     
       /**
        * Verify if the schema is supported in datasource. This verification should be done
    -   * in a driver side, e.g., `prepareWrite`, `buildReader`, and `buildReaderWithPartitionValues`
    -   * in `FileFormat`.
    -   *
    -   * Unsupported data types of csv, json, orc, and parquet are as follows;
    -   *  csv -> R/W: Interval, Null, Array, Map, Struct
    -   *  json -> W: Interval
    -   *  orc -> W: Interval, Null
    -   *  parquet -> R/W: Interval, Null
    +   * in a driver side.
        */
       private def verifySchema(format: FileFormat, schema: StructType, isReadPath: Boolean): Unit = {
    -    def throwUnsupportedException(dataType: DataType): Unit = {
    -      throw new UnsupportedOperationException(
    -        s"$format data source does not support ${dataType.simpleString} data type.")
    -    }
    -
    -    def verifyType(dataType: DataType): Unit = dataType match {
    -      case BooleanType | ByteType | ShortType | IntegerType | LongType | FloatType | DoubleType |
    -           StringType | BinaryType | DateType | TimestampType | _: DecimalType =>
    -
    -      // All the unsupported types for CSV
    -      case _: NullType | _: CalendarIntervalType | _: StructType | _: ArrayType | _: MapType
    -          if format.isInstanceOf[CSVFileFormat] =>
    -        throwUnsupportedException(dataType)
    -
    -      case st: StructType => st.foreach { f => verifyType(f.dataType) }
    -
    -      case ArrayType(elementType, _) => verifyType(elementType)
    -
    -      case MapType(keyType, valueType, _) =>
    -        verifyType(keyType)
    -        verifyType(valueType)
    -
    -      case udt: UserDefinedType[_] => verifyType(udt.sqlType)
    -
    -      // Interval type not supported in all the write path
    -      case _: CalendarIntervalType if !isReadPath =>
    -        throwUnsupportedException(dataType)
    -
    -      // JSON and ORC don't support an Interval type, but we pass it in read pass
    -      // for back-compatibility.
    -      case _: CalendarIntervalType if format.isInstanceOf[JsonFileFormat] ||
    -        format.isInstanceOf[OrcFileFormat] =>
    +    def verifyType(dataType: DataType): Unit = {
    +      if (!format.supportDataType(dataType, isReadPath)) {
    +        throw new UnsupportedOperationException(
    +          s"$format data source does not support ${dataType.simpleString} data type.")
    +      }
    +      dataType match {
    --- End diff --
    
    I see. I will update it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/889/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Dispatch the type support check...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r201747509
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala ---
    @@ -153,6 +151,15 @@ class CSVFileFormat extends TextBasedFileFormat with DataSourceRegister {
       override def hashCode(): Int = getClass.hashCode()
     
       override def equals(other: Any): Boolean = other.isInstanceOf[CSVFileFormat]
    +
    +  override def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = dataType match {
    +    case _: AtomicType => true
    +
    +    case udt: UserDefinedType[_] => supportDataType(udt.sqlType, isReadPath)
    --- End diff --
    
    can we do this in https://github.com/apache/spark/pull/21667/files#diff-ea05eba8c71b5596561adbbe9755ff36R46 ?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Add new API `supportDataType` in FileF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92459/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    **[Test build #92569 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92569/testReport)** for PR 21667 at commit [`5c590eb`](https://github.com/apache/spark/commit/5c590ebf39794c83c30dda1cdbbdc5d94dd0ff0d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/727/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92867/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92913/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Dispatch the type support check...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r201753187
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala ---
    @@ -276,44 +334,31 @@ class FileBasedDataSourceSuite extends QueryTest with SharedSQLContext with Befo
             }.getMessage
             assert(msg.contains("Cannot save interval data type into external storage."))
     
    -        msg = intercept[UnsupportedOperationException] {
    +        msg = intercept[AnalysisException] {
               spark.udf.register("testType", () => new IntervalData())
               sql("select testType()").write.format(format).mode("overwrite").save(tempDir)
             }.getMessage
             assert(msg.toLowerCase(Locale.ROOT)
    -          .contains(s"$format data source does not support calendarinterval data type."))
    +          .contains(s"$format data source does not support interval data type."))
           }
     
           // read path
           Seq("parquet", "csv").foreach { format =>
    -        var msg = intercept[UnsupportedOperationException] {
    +        var msg = intercept[AnalysisException] {
               val schema = StructType(StructField("a", CalendarIntervalType, true) :: Nil)
               spark.range(1).write.format(format).mode("overwrite").save(tempDir)
               spark.read.schema(schema).format(format).load(tempDir).collect()
             }.getMessage
             assert(msg.toLowerCase(Locale.ROOT)
               .contains(s"$format data source does not support calendarinterval data type."))
     
    -        msg = intercept[UnsupportedOperationException] {
    +        msg = intercept[AnalysisException] {
               val schema = StructType(StructField("a", new IntervalUDT(), true) :: Nil)
               spark.range(1).write.format(format).mode("overwrite").save(tempDir)
               spark.read.schema(schema).format(format).load(tempDir).collect()
             }.getMessage
             assert(msg.toLowerCase(Locale.ROOT)
    -          .contains(s"$format data source does not support calendarinterval data type."))
    -      }
    -
    -      // We expect the types below should be passed for backward-compatibility
    -      Seq("orc", "json").foreach { format =>
    -        // Interval type
    -        var schema = StructType(StructField("a", CalendarIntervalType, true) :: Nil)
    -        spark.range(1).write.format(format).mode("overwrite").save(tempDir)
    -        spark.read.schema(schema).format(format).load(tempDir).collect()
    --- End diff --
    
    when the user-specified schema doesn't match the physical schema, the behavior is undefined. So I don't think this is about backward compatibility, +1 to forbid interval type.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by gengliangwang <gi...@git.apache.org>.
Github user gengliangwang commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    retest this please.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    **[Test build #92711 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92711/testReport)** for PR 21667 at commit [`7266611`](https://github.com/apache/spark/commit/7266611b243000868c81f4538dd850c394fe7c20).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Dispatch the type support check...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/21667


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Dispatch the type support check...

Posted by gengliangwang <gi...@git.apache.org>.
Github user gengliangwang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r199865933
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala ---
    @@ -148,6 +144,28 @@ class JsonFileFormat extends TextBasedFileFormat with DataSourceRegister {
       override def hashCode(): Int = getClass.hashCode()
     
       override def equals(other: Any): Boolean = other.isInstanceOf[JsonFileFormat]
    +
    +  override def validateDataType(dataType: DataType, isReadPath: Boolean): Unit = dataType match {
    --- End diff --
    
    Then the base class could break other existing file formats, right?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92864/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Dispatch the type support check...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r199863675
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcSourceSuite.scala ---
    @@ -156,28 +156,6 @@ class HiveOrcSourceSuite extends OrcSuite with TestHiveSingleton {
             sql("select testType()").write.mode("overwrite").orc(orcDir)
           }.getMessage
           assert(msg.contains("ORC data source does not support calendarinterval data type."))
    -
    -      // read path
    --- End diff --
    
    Spark can never write out interval type, do we really need to support interval type at read path?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Add new API `supportDataType` in FileF...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    **[Test build #92465 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92465/testReport)** for PR 21667 at commit [`7fdf603`](https://github.com/apache/spark/commit/7fdf6033b6778d06850e6ae5a0fd6e3fde76a5c2).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/853/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92639/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92569/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

Posted by gengliangwang <gi...@git.apache.org>.
Github user gengliangwang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r199081671
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcSourceSuite.scala ---
    @@ -156,28 +156,6 @@ class HiveOrcSourceSuite extends OrcSuite with TestHiveSingleton {
             sql("select testType()").write.mode("overwrite").orc(orcDir)
           }.getMessage
           assert(msg.contains("ORC data source does not support calendarinterval data type."))
    -
    -      // read path
    --- End diff --
    
    @HyukjinKwon No, the unit test is about unsupported data types, and ORC supports all data types in read path.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    **[Test build #92639 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92639/testReport)** for PR 21667 at commit [`44cf265`](https://github.com/apache/spark/commit/44cf265d3c3535ef4d1216c6317247d1365dca8c).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Dispatch the type support check...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r199860902
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ---
    @@ -306,6 +306,7 @@ case class FileSourceScanExec(
       }
     
       private lazy val inputRDD: RDD[InternalRow] = {
    +    DataSourceUtils.verifyReadSchema(relation.fileFormat, relation.dataSchema)
    --- End diff --
    
    I'm not sure about this change. This is very late(just before execution), is there a better place for this check that happens at analysis phase?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r199695311
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala ---
    @@ -152,6 +152,16 @@ trait FileFormat {
         }
       }
     
    +  /**
    +   * Returns whether this format supports the given [[DataType]] in read/write path.
    +   *
    +   * By default all data types are supported except [[CalendarIntervalType]] in write path.
    +   */
    +  def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = dataType match {
    --- End diff --
    
    blacklist is easier, but whitelist is safer.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by gengliangwang <gi...@git.apache.org>.
Github user gengliangwang commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    retest this please.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    **[Test build #92680 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92680/testReport)** for PR 21667 at commit [`7266611`](https://github.com/apache/spark/commit/7266611b243000868c81f4538dd850c394fe7c20).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    **[Test build #92611 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92611/testReport)** for PR 21667 at commit [`34134f1`](https://github.com/apache/spark/commit/34134f10f4f0191acb3e52fc7ef9ccb25eae85ad).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/700/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by gengliangwang <gi...@git.apache.org>.
Github user gengliangwang commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    @cloud-fan I can understand you concern. But I can't find better entries. The entry in `FileFormatWriter` is the only one entry for every write action, otherwise we have to add the check in multiple places. The same for read path.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r199695118
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala ---
    @@ -152,6 +152,16 @@ trait FileFormat {
         }
       }
     
    +  /**
    +   * Returns whether this format supports the given [[DataType]] in read/write path.
    +   *
    +   * By default all data types are supported except [[CalendarIntervalType]] in write path.
    --- End diff --
    
    yea, it's not by default, we can't write out interval type. This check is in `DataSource.planForWriting`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    **[Test build #92930 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92930/testReport)** for PR 21667 at commit [`13de60e`](https://github.com/apache/spark/commit/13de60ec3916b8396e6fe04e755eeb7d70c54b3a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r199531608
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala ---
    @@ -42,63 +38,27 @@ object DataSourceUtils {
     
       /**
        * Verify if the schema is supported in datasource. This verification should be done
    -   * in a driver side, e.g., `prepareWrite`, `buildReader`, and `buildReaderWithPartitionValues`
    -   * in `FileFormat`.
    -   *
    -   * Unsupported data types of csv, json, orc, and parquet are as follows;
    -   *  csv -> R/W: Interval, Null, Array, Map, Struct
    -   *  json -> W: Interval
    -   *  orc -> W: Interval, Null
    -   *  parquet -> R/W: Interval, Null
    +   * in a driver side.
        */
       private def verifySchema(format: FileFormat, schema: StructType, isReadPath: Boolean): Unit = {
    -    def throwUnsupportedException(dataType: DataType): Unit = {
    -      throw new UnsupportedOperationException(
    -        s"$format data source does not support ${dataType.simpleString} data type.")
    -    }
    -
    -    def verifyType(dataType: DataType): Unit = dataType match {
    -      case BooleanType | ByteType | ShortType | IntegerType | LongType | FloatType | DoubleType |
    -           StringType | BinaryType | DateType | TimestampType | _: DecimalType =>
    -
    -      // All the unsupported types for CSV
    -      case _: NullType | _: CalendarIntervalType | _: StructType | _: ArrayType | _: MapType
    -          if format.isInstanceOf[CSVFileFormat] =>
    -        throwUnsupportedException(dataType)
    -
    -      case st: StructType => st.foreach { f => verifyType(f.dataType) }
    -
    -      case ArrayType(elementType, _) => verifyType(elementType)
    -
    -      case MapType(keyType, valueType, _) =>
    -        verifyType(keyType)
    -        verifyType(valueType)
    -
    -      case udt: UserDefinedType[_] => verifyType(udt.sqlType)
    -
    -      // Interval type not supported in all the write path
    -      case _: CalendarIntervalType if !isReadPath =>
    -        throwUnsupportedException(dataType)
    -
    -      // JSON and ORC don't support an Interval type, but we pass it in read pass
    -      // for back-compatibility.
    -      case _: CalendarIntervalType if format.isInstanceOf[JsonFileFormat] ||
    -        format.isInstanceOf[OrcFileFormat] =>
    +    def verifyType(dataType: DataType): Unit = {
    +      if (!format.supportDataType(dataType, isReadPath)) {
    +        throw new UnsupportedOperationException(
    +          s"$format data source does not support ${dataType.simpleString} data type.")
    +      }
    +      dataType match {
    --- End diff --
    
    In that case, this code bit becomes rather obsolete .. To me Spark's dev API is too difficult for me to understand :-) .. Personally, I don't like to be too clever when it comes to API thing.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    **[Test build #92678 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92678/testReport)** for PR 21667 at commit [`16beafa`](https://github.com/apache/spark/commit/16beafa4f128d3d592bf1f08c8e9b29770ae5a8a).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Dispatch the type support check...

Posted by gengliangwang <gi...@git.apache.org>.
Github user gengliangwang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r200593158
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala ---
    @@ -152,6 +152,11 @@ trait FileFormat {
         }
       }
     
    +  /**
    +   * Returns whether this format supports the given [[DataType]] in read/write path.
    +   * By default all data types are supported.
    +   */
    +  def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = true
    --- End diff --
    
    Then we also need to update `LibSVMFileFormat`, and several file format in unit test. I really prefer to have a default behavior here, as `FileFormat` can still work without the new method.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Add new API `supportDataType` in FileF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/571/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Dispatch the type support check...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r200236759
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala ---
    @@ -152,6 +152,11 @@ trait FileFormat {
         }
       }
     
    +  /**
    +   * Returns whether this format supports the given [[DataType]] in read/write path.
    +   * By default all data types are supported.
    +   */
    +  def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = true
    --- End diff --
    
    who does not overwrite it?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Add new API `supportDataType` in FileF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Dispatch the type support check...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r201747795
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala ---
    @@ -148,6 +144,23 @@ class JsonFileFormat extends TextBasedFileFormat with DataSourceRegister {
       override def hashCode(): Int = getClass.hashCode()
     
       override def equals(other: Any): Boolean = other.isInstanceOf[JsonFileFormat]
    +
    +  override def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = dataType match {
    +    case _: AtomicType => true
    +
    +    case st: StructType => st.forall { f => supportDataType(f.dataType, isReadPath) }
    +
    +    case ArrayType(elementType, _) => supportDataType(elementType, isReadPath)
    +
    +    case MapType(keyType, valueType, _) =>
    +      supportDataType(keyType, isReadPath) && supportDataType(valueType, isReadPath)
    +
    +    case udt: UserDefinedType[_] => supportDataType(udt.sqlType, isReadPath)
    +
    +    case _: NullType => true
    +
    +    case _ => false
    --- End diff --
    
    which type we don't support here?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Add new API `supportDataType` i...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r199515518
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala ---
    @@ -152,6 +152,16 @@ trait FileFormat {
         }
       }
     
    +  /**
    +   * Returns whether this format supports the given [[DataType]] in read/write path.
    +   *
    +   * By default all data types are supported except [[CalendarIntervalType]] in write path.
    +   */
    +  def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = dataType match {
    --- End diff --
    
    Also, do we really need this API? All what it does it is just to check the type and throw an exception.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Dispatch the type support check in Fil...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    **[Test build #92711 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92711/testReport)** for PR 21667 at commit [`7266611`](https://github.com/apache/spark/commit/7266611b243000868c81f4538dd850c394fe7c20).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21667: [SPARK-24691][SQL]Add new API `supportDataType` in FileF...

Posted by gengliangwang <gi...@git.apache.org>.
Github user gengliangwang commented on the issue:

    https://github.com/apache/spark/pull/21667
  
    I agree that making it an API is a bit over.
    But current there are problems(bug) as I listed in PR description.
    Maybe we can create another separate Trait?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21667: [SPARK-24691][SQL]Dispatch the type support check...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21667#discussion_r200640647
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala ---
    @@ -152,6 +152,11 @@ trait FileFormat {
         }
       }
     
    +  /**
    +   * Returns whether this format supports the given [[DataType]] in read/write path.
    +   * By default all data types are supported.
    +   */
    +  def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = true
    --- End diff --
    
    the entire FileFormat will be migrated to data source v2 in the future. The FileFormat will be still there for backward compatibility, but I don't think we will update it frequently.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org