Posted to reviews@spark.apache.org by dongjoon-hyun <gi...@git.apache.org> on 2017/10/25 03:53:49 UTC

[GitHub] spark pull request #19571: [SPARK-15474][SQL] Write and read back non-empty ...

GitHub user dongjoon-hyun opened a pull request:

    https://github.com/apache/spark/pull/19571

    [SPARK-15474][SQL] Write and read back non-empty schema with empty dataframe

    ## What changes were proposed in this pull request?
    
    Previously, the ORC file format could not write the correct schema for an empty DataFrame. Instead, it created an empty ORC file with an **empty** schema, `struct<>`, so Spark users could not write and read back ORC files that have a non-empty schema but no rows. This PR uses the new Apache ORC 1.4.1 to create empty ORC files with the correct schema, and it also uses ORC 1.4.1 to infer the schema in all cases.
    
    **BEFORE**
    ```scala
    scala> val emptyDf = Seq((true, 1, "str")).toDF("a", "b", "c").limit(0)
    scala> emptyDf.write.format("orc").mode("overwrite").save("/tmp/empty")
    scala> spark.read.format("orc").load("/tmp/empty").printSchema
    org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. It must be specified manually.;
    ```
    
    **AFTER**
    ```scala
    scala> val emptyDf = Seq((true, 1, "str")).toDF("a", "b", "c").limit(0)
    scala> emptyDf.write.format("orc").mode("overwrite").save("/tmp/empty")
    scala> spark.read.format("orc").load("/tmp/empty").printSchema
    root
     |-- a: boolean (nullable = true)
     |-- b: integer (nullable = true)
     |-- c: string (nullable = true)
    ```
    
    ## How was this patch tested?
    
    Passes Jenkins with the newly added test cases.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dongjoon-hyun/spark SPARK-15474

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19571.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19571
    
----
commit be7ba9b5a9c70519a7fa1b0497955fbba763e2e6
Author: Dongjoon Hyun <do...@apache.org>
Date:   2017-10-25T02:50:33Z

    [SPARK-15474][SQL] Write and read back non-empty schema with empty dataframe

----


---



[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-empty schema ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19571
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83030/
    Test PASSed.


---



[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-empty schema ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19571
  
    Sorry, I misunderstood the problem at the beginning. I thought the new ORC version just changed the existing APIs a little, but it turns out the new ORC version has a whole new set of read/write APIs.


---



[GitHub] spark pull request #19571: [SPARK-15474][SQL] Write and read back non-empty ...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19571#discussion_r146959728
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
    @@ -2127,4 +2127,18 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
           }
         }
       }
    +
    +  Seq("orc", "parquet").foreach { format =>
    +    test(s"SPARK-15474 Write and read back non-empty schema with empty dataframe - $format") {
    +      withTempDir { dir =>
    +        val path = dir.getCanonicalPath
    +        val emptyDf = Seq((true, 1, "str")).toDF.limit(0)
    +        emptyDf.write.format(format).mode("overwrite").save(path)
    --- End diff --
    
    Thanks. No problem. I'll use `withTempPath`.
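
    For context, a small sketch of what the suggested change might look like, assuming `SQLTestUtils.withTempPath` and the `format` variable from the test loop in the diff above:
    
    ```scala
    withTempPath { path =>
      // withTempPath hands the test a java.io.File that does not exist yet,
      // so mode("overwrite") is no longer needed.
      val emptyDf = Seq((true, 1, "str")).toDF.limit(0)
      emptyDf.write.format(format).save(path.getCanonicalPath)
      val readBack = spark.read.format(format).load(path.getCanonicalPath)
      assert(readBack.schema.fieldNames.toSeq === Seq("_1", "_2", "_3"))
      assert(readBack.count() === 0)
    }
    ```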


---



[GitHub] spark pull request #19571: [SPARK-15474][SQL] Write and read back non-empty ...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19571#discussion_r146959158
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
    @@ -252,6 +253,13 @@ private[orc] class OrcOutputWriter(
       override def close(): Unit = {
         if (recordWriterInstantiated) {
           recordWriter.close(Reporter.NULL)
    +    } else {
    +      // SPARK-15474 Write empty orc file with correct schema
    +      val conf = context.getConfiguration()
    +      val writer = org.apache.orc.OrcFile.createWriter(
    +        new Path(path), org.apache.orc.mapred.OrcOutputFormat.buildOptions(conf))
    +      new org.apache.orc.mapreduce.OrcMapreduceRecordWriter(writer)
    +      writer.close()
    --- End diff --
    
    Yep, that's the correct understanding. This PR intentionally focuses only on handling empty files and inferring schemas. This will help us transition safely from the old Hive ORC to the new Apache ORC 1.4.1.
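
    For readers following along, here is a standalone sketch of the same trick using the plain ORC 1.4.1 API directly (the path and schema are made up for the example): creating a writer and closing it without appending any rows leaves an empty file whose footer still carries the full schema.
    
    ```scala
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.orc.{OrcFile, TypeDescription}
    
    val conf = new Configuration()
    // TypeDescription.fromString parses the same struct<...> notation
    // that the file footer records.
    val schema = TypeDescription.fromString("struct<a:boolean,b:int,c:string>")
    val writer = OrcFile.createWriter(
      new Path("/tmp/empty-example.orc"),
      OrcFile.writerOptions(conf).setSchema(schema))
    writer.close() // no rows appended; the footer still records the schema
    ```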


---



[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-empty schema ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19571
  
    yes please


---



[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-empty schema ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19571
  
    **[Test build #83072 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83072/testReport)** for PR 19571 at commit [`8d212f0`](https://github.com/apache/spark/commit/8d212f049ccd176e5d6800d620929eed20844415).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-empty schema ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19571
  
    I checked how we introduced the new Parquet reader before, and maybe we can follow the same approach: https://github.com/apache/spark/pull/4308
    
    Basically, we leave the old ORC data source as it is and implement a new ORC 1.4.1 data source in the sql/core module. Then we add an internal config to switch between the implementations (with the new one enabled by default), and remove the old implementation after one or two releases.
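
    As a rough illustration of that proposal, here is how the switch might be flipped per session. This is a sketch only: the config key `spark.sql.orc.impl` and its values are assumptions here, not a settled API.
    
    ```scala
    import org.apache.spark.sql.SparkSession
    
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    
    // Hypothetical internal switch between the two ORC implementations,
    // following the Parquet precedent; the key name is an assumption.
    spark.conf.set("spark.sql.orc.impl", "hive")   // old Hive-based ORC in sql/hive
    spark.conf.set("spark.sql.orc.impl", "native") // new ORC 1.4.1 source in sql/core
    ```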


---



[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-empty schema ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19571
  
    **[Test build #83055 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83055/testReport)** for PR 19571 at commit [`8b4fc96`](https://github.com/apache/spark/commit/8b4fc96a2f6ed56403a68a6dc401d54380c29fa9).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-empty schema ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19571
  
    **[Test build #83055 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83055/testReport)** for PR 19571 at commit [`8b4fc96`](https://github.com/apache/spark/commit/8b4fc96a2f6ed56403a68a6dc401d54380c29fa9).


---



[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-empty schema ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19571
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #19571: [SPARK-15474][SQL] Write and read back non-empty ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19571#discussion_r146783595
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
    @@ -73,6 +70,10 @@ class OrcFileFormat extends FileFormat with DataSourceRegister with Serializable
     
         val configuration = job.getConfiguration
     
    +    configuration.set(
    +      MAPRED_OUTPUT_SCHEMA.getAttribute,
    +      org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.getSchemaString(dataSchema))
    --- End diff --
    
    Do we always need to set this?


---



[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-empty schema ...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/19571
  
    How backward compatible is ORC 1.4.1? Can we take ORC files created by previous versions and ensure they are not broken?


---



[GitHub] spark pull request #19571: [SPARK-15474][SQL] Write and read back non-empty ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19571#discussion_r146752170
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
    @@ -252,6 +253,13 @@ private[orc] class OrcOutputWriter(
       override def close(): Unit = {
         if (recordWriterInstantiated) {
           recordWriter.close(Reporter.NULL)
    +    } else {
    +      // SPARK-15474 Write empty orc file with correct schema
    +      val conf = context.getConfiguration()
    --- End diff --
    
    Looks like the behavior of skipping the creation of an empty file when no rows are written is deliberate. Is there any impact on the current behavior?


---



[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-empty schema ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19571
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #19571: [SPARK-15474][SQL] Write and read back non-empty ...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19571#discussion_r146958510
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
    @@ -73,6 +70,10 @@ class OrcFileFormat extends FileFormat with DataSourceRegister with Serializable
     
         val configuration = job.getConfiguration
     
    +    configuration.set(
    +      MAPRED_OUTPUT_SCHEMA.getAttribute,
    +      org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.getSchemaString(dataSchema))
    --- End diff --
    
    Yes. This is the correct schema to be written.
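
    For the record, a tiny sketch of what that setting amounts to with the plain ORC `OrcConf` API (the schema string is the example one from the PR description):
    
    ```scala
    import org.apache.hadoop.conf.Configuration
    import org.apache.orc.OrcConf
    
    val conf = new Configuration()
    // OrcConf.MAPRED_OUTPUT_SCHEMA.getAttribute resolves to the Hadoop key
    // from which the ORC MapReduce writers read the output schema.
    conf.set(OrcConf.MAPRED_OUTPUT_SCHEMA.getAttribute,
      "struct<a:boolean,b:int,c:string>")
    ```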


---



[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-empty schema ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19571
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83072/
    Test PASSed.


---



[GitHub] spark pull request #19571: [SPARK-15474][SQL] Write and read back non-empty ...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19571#discussion_r146958154
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
    @@ -252,6 +253,13 @@ private[orc] class OrcOutputWriter(
       override def close(): Unit = {
         if (recordWriterInstantiated) {
    --- End diff --
    
    In this PR, I'm focusing on empty files. We will eventually replace the whole writer and reader with ORC 1.4.1. The test case newly added in this PR will help us transition safely.


---



[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-empty schema ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19571
  
    > How backward compatible is ORC 1.4.1? Can we take ORC files created by previous versions and ensure they are not broken?
    
    That's a good point, and I think it's better to have these tests in the ORC project. If they don't have them, we can take over and add them.


---



[GitHub] spark pull request #19571: [SPARK-15474][SQL] Write and read back non-empty ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19571#discussion_r146752367
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
    @@ -252,6 +253,13 @@ private[orc] class OrcOutputWriter(
       override def close(): Unit = {
         if (recordWriterInstantiated) {
    --- End diff --
    
    Btw, according to the existing comment, it seems we could simply remove `recordWriterInstantiated` to allow the empty file to be created.


---



[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-empty schema ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19571
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83055/
    Test FAILed.


---



[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-empty schema ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19571
  
    **[Test build #83030 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83030/testReport)** for PR 19571 at commit [`be7ba9b`](https://github.com/apache/spark/commit/be7ba9b5a9c70519a7fa1b0497955fbba763e2e6).


---



[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-empty schema ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19571
  
    Merged build finished. Test FAILed.


---



[GitHub] spark pull request #19571: [SPARK-15474][SQL] Write and read back non-empty ...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19571#discussion_r146957321
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala ---
    @@ -39,4 +45,33 @@ private[sql] object OrcFileFormat {
         schema.fieldNames.foreach(checkFieldName)
         schema
       }
    +
    +  def getSchemaString(schema: StructType): String = {
    +    schema.fields.map(f => s"${f.name}:${f.dataType.catalogString}").mkString("struct<", ",", ">")
    +  }
    +
    +  private def readSchema(file: Path, conf: ReaderOptions): Option[TypeDescription] = {
    +    try {
    +      val reader = OrcFile.createReader(file, conf)
    +      val schema = reader.getSchema
    +      if (schema.getFieldNames.size == 0) {
    +        None
    +      } else {
    +        Some(schema)
    +      }
    +    } catch {
    +      case _: IOException => None
    +    }
    +  }
    +
    +  def readSchema(sparkSession: SparkSession, files: Seq[FileStatus]): Option[StructType] = {
    +    val conf = sparkSession.sparkContext.hadoopConfiguration
    +    val fs = FileSystem.get(conf)
    +    val options = OrcFile.readerOptions(conf).filesystem(fs)
    +    files.map(_.getPath).flatMap(readSchema(_, options))
    +      .headOption.map { schema =>
    --- End diff --
    
    Yes. This is based on the existing [OrcFileOperator.readSchema](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileOperator.scala#L74-L82).
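
    For reference, a minimal illustration of what the `getSchemaString` helper in the diff above produces, using only Spark's public `StructType` API:
    
    ```scala
    import org.apache.spark.sql.types._
    
    val schema = StructType(Seq(
      StructField("a", BooleanType),
      StructField("b", IntegerType),
      StructField("c", StringType)))
    
    // Same expression as getSchemaString in the diff above.
    val schemaString = schema.fields
      .map(f => s"${f.name}:${f.dataType.catalogString}")
      .mkString("struct<", ",", ">")
    // schemaString == "struct<a:boolean,b:int,c:string>"
    ```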


---



[GitHub] spark pull request #19571: [SPARK-15474][SQL] Write and read back non-empty ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19571#discussion_r146783934
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
    @@ -2127,4 +2127,18 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
           }
         }
       }
    +
    +  Seq("orc", "parquet").foreach { format =>
    +    test(s"SPARK-15474 Write and read back non-empty schema with empty dataframe - $format") {
    +      withTempDir { dir =>
    +        val path = dir.getCanonicalPath
    +        val emptyDf = Seq((true, 1, "str")).toDF.limit(0)
    +        emptyDf.write.format(format).mode("overwrite").save(path)
    --- End diff --
    
    Hm, why not use `withTempPath { path =>` instead, without `overwrite`?


---



[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-empty schema ...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/19571
  
    To be clear, for [ORC File Versions](https://github.com/apache/orc/blob/master/proto/orc_proto.proto#L240-L248), there exist some ORC test cases against version 0.11, but that is out of our scope because Spark (and Hive 1.2) uses `0.12 with HIVE_8732`.
    
    The 0.12 format has the following writer versions:
    
    0 = original
    **1 = HIVE-8732 fixed (fixed stripe/file maximum statistics & string statistics use utf8 for min/max)**
    2 = HIVE-4243 fixed (use real column names from Hive tables)
    3 = HIVE-12055 fixed (vectorized writer implementation)
    4 = HIVE-13083 fixed (decimals write present stream correctly)
    5 = ORC-101 fixed (bloom filters use utf8 consistently)
    6 = ORC-135 fixed (timestamp statistics use utc)
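
    As an aside, the writer version recorded in a file's footer can be inspected with the ORC 1.4 reader API; a short sketch (the path is made up):
    
    ```scala
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.orc.OrcFile
    
    val reader = OrcFile.createReader(
      new Path("/tmp/example.orc"),
      OrcFile.readerOptions(new Configuration()))
    println(reader.getFileVersion)   // file format version, e.g. V_0_12
    println(reader.getWriterVersion) // writer version, e.g. HIVE_8732
    ```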


---



[GitHub] spark pull request #19571: [SPARK-15474][SQL] Write and read back non-empty ...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun closed the pull request at:

    https://github.com/apache/spark/pull/19571


---



[GitHub] spark pull request #19571: [SPARK-15474][SQL] Write and read back non-empty ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19571#discussion_r146751242
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala ---
    @@ -39,4 +45,33 @@ private[sql] object OrcFileFormat {
         schema.fieldNames.foreach(checkFieldName)
         schema
       }
    +
    +  def getSchemaString(schema: StructType): String = {
    +    schema.fields.map(f => s"${f.name}:${f.dataType.catalogString}").mkString("struct<", ",", ">")
    +  }
    +
    +  private def readSchema(file: Path, conf: ReaderOptions): Option[TypeDescription] = {
    +    try {
    +      val reader = OrcFile.createReader(file, conf)
    +      val schema = reader.getSchema
    +      if (schema.getFieldNames.size == 0) {
    +        None
    +      } else {
    +        Some(schema)
    +      }
    +    } catch {
    +      case _: IOException => None
    +    }
    +  }
    +
    +  def readSchema(sparkSession: SparkSession, files: Seq[FileStatus]): Option[StructType] = {
    +    val conf = sparkSession.sparkContext.hadoopConfiguration
    +    val fs = FileSystem.get(conf)
    +    val options = OrcFile.readerOptions(conf).filesystem(fs)
    +    files.map(_.getPath).flatMap(readSchema(_, options))
    +      .headOption.map { schema =>
    --- End diff --
    
    Seems that you just take the first available schema. Looks like we don't need to read the other files once the first available schema has been found.
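
    A minimal sketch of that suggestion, assuming the private `readSchema(Path, ReaderOptions)` helper from the diff above: routing the lookup through a lazy view means no further files are opened once one of them yields a schema.
    
    ```scala
    import org.apache.hadoop.fs.{FileStatus, Path}
    import org.apache.orc.TypeDescription
    import org.apache.orc.OrcFile.ReaderOptions
    
    def firstSchema(
        files: Seq[FileStatus],
        options: ReaderOptions,
        readSchema: (Path, ReaderOptions) => Option[TypeDescription]): Option[TypeDescription] = {
      // .view makes map/flatMap lazy, so files after the first hit are skipped.
      files.view.map(_.getPath).flatMap(p => readSchema(p, options)).headOption
    }
    ```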


---



[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-empty schema ...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/19571
  
    @gatorsmile and @cloud-fan . 
    For ORC backward compatibility, I checked the ORC code, but it's not clearly tested there.
    I'll try to add a test suite as a separate issue.


---



[GitHub] spark pull request #19571: [SPARK-15474][SQL] Write and read back non-empty ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19571#discussion_r146784952
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
    @@ -58,10 +58,7 @@ class OrcFileFormat extends FileFormat with DataSourceRegister with Serializable
           sparkSession: SparkSession,
           options: Map[String, String],
           files: Seq[FileStatus]): Option[StructType] = {
    -    OrcFileOperator.readSchema(
    -      files.map(_.getPath.toString),
    -      Some(sparkSession.sessionState.newHadoopConf())
    -    )
    +    org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.readSchema(sparkSession, files)
    --- End diff --
    
    I am not sure about this one either. This looks like a complete rewrite of `org.apache.spark.sql.hive.orc.OrcFileOperator.readSchema`. Is this change required to fix this issue?


---



[GitHub] spark pull request #19571: [SPARK-15474][SQL] Write and read back non-empty ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19571#discussion_r146783308
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
    @@ -252,6 +253,13 @@ private[orc] class OrcOutputWriter(
       override def close(): Unit = {
         if (recordWriterInstantiated) {
           recordWriter.close(Reporter.NULL)
    +    } else {
    +      // SPARK-15474 Write empty orc file with correct schema
    +      val conf = context.getConfiguration()
    +      val writer = org.apache.orc.OrcFile.createWriter(
    +        new Path(path), org.apache.orc.mapred.OrcOutputFormat.buildOptions(conf))
    +      new org.apache.orc.mapreduce.OrcMapreduceRecordWriter(writer)
    +      writer.close()
    --- End diff --
    
    So, if I understood correctly, it will write with `org.apache.orc.mapreduce.OrcMapreduceRecordWriter` when the output is empty, but with `org.apache.hadoop.hive.ql.io.orc.OrcRecordWriter` when the output is non-empty? I thought we should use the same writer for both paths if possible, and this one looks rather like a band-aid fix. It won't block this PR, but I wonder if this is the only way we can do it for now.


---



[GitHub] spark pull request #19571: [SPARK-15474][SQL] Write and read back non-empty ...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19571#discussion_r146958429
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
    @@ -58,10 +58,7 @@ class OrcFileFormat extends FileFormat with DataSourceRegister with Serializable
           sparkSession: SparkSession,
           options: Map[String, String],
           files: Seq[FileStatus]): Option[StructType] = {
    -    OrcFileOperator.readSchema(
    -      files.map(_.getPath.toString),
    -      Some(sparkSession.sessionState.newHadoopConf())
    -    )
    +    org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.readSchema(sparkSession, files)
    --- End diff --
    
    Yes, it's intentional. `OrcFileOperator` will be completely replaced later. I made this PR as small as possible to ease review.


---



[GitHub] spark pull request #19571: [SPARK-15474][SQL] Write and read back non-empty ...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19571#discussion_r146957878
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
    @@ -252,6 +253,13 @@ private[orc] class OrcOutputWriter(
       override def close(): Unit = {
         if (recordWriterInstantiated) {
           recordWriter.close(Reporter.NULL)
    +    } else {
    +      // SPARK-15474 Write empty orc file with correct schema
    +      val conf = context.getConfiguration()
    --- End diff --
    
    Previously, only ORC skipped it, which created more issues like SPARK-22258 (#19477) and SPARK-21762. This change makes ORC consistent with the other data sources like Parquet.


---



[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-empty schema ...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/19571
  
    I see. Then, can we continue with #17980?


---



[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-empty schema ...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/19571
  
    Thank you for the review, @gatorsmile and @cloud-fan . In particular, @cloud-fan 's suggestion is my original approach in #17980 and #18953 (before Aug 16), so I couldn't agree more.
    
    > Basically we leave the old orc data source as it is, and implement a new orc 1.4.1 data source in sql core module. Then we have an internal config to switch the implementation(by default prefer the new implementation), and remove the old implementation after one or two releases.
    
    BTW, I'm wondering what changed after you left [the following comment](https://github.com/apache/spark/pull/18953#issuecomment-322827590) on that PR on Aug 16.
    
    > Are the ORC APIs changed a lot in 1.4? I was expecting a small patch to upgrade the current ORC data source, without moving it to sql/core.


---



[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-empty schema ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19571
  
    **[Test build #83072 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83072/testReport)** for PR 19571 at commit [`8d212f0`](https://github.com/apache/spark/commit/8d212f049ccd176e5d6800d620929eed20844415).


---



[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-empty schema ...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/19571
  
    I updated the PR.
    Could you review this PR again, @viirya , @HyukjinKwon , @gatorsmile , @cloud-fan ?


---



[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-empty schema ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19571
  
    **[Test build #83030 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83030/testReport)** for PR 19571 at commit [`be7ba9b`](https://github.com/apache/spark/commit/be7ba9b5a9c70519a7fa1b0497955fbba763e2e6).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-empty schema ...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/19571
  
    Thank you for the review, @viirya and @HyukjinKwon .
    As you know, I tried to introduce the new OrcFileFormat in `sql/core` before, but it was too big to review. Following @cloud-fan 's advice, I'm upgrading the existing one piece by piece.
    
    So far,
    - We introduced the new ORC 1.4.1 dependency.
    - We introduced new Spark SQL ORC parameters and replaced the Hive constants with them.
    
    This is the first PR that actually reads and writes using the ORC 1.4.1 library:
    - It reads ORC files only to infer the schema.
    - It writes only empty ORC files.


---



[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-empty schema ...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/19571
  
    This is resolved in https://github.com/apache/spark/pull/19651 .


---
