Posted to reviews@spark.apache.org by sbcd90 <gi...@git.apache.org> on 2016/05/22 22:24:06 UTC

[GitHub] spark pull request: [SPARK-15474][SQL]ORC data source fails to wri...

GitHub user sbcd90 opened a pull request:

    https://github.com/apache/spark/pull/13257

    [SPARK-15474][SQL]ORC data source fails to write and read back empty dataframe

    ## What changes were proposed in this pull request?
    
    Currently the ORC data source fails to write and read back an empty DataFrame.
    This PR provides a fix for this issue.
    
    
    ## How was this patch tested?
    
    unit tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sbcd90/spark orcemptydfissue

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13257.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13257
    
----
commit e3fd618456b80908672d18daeb4433e2fc52e475
Author: Subhobrata Dey <sb...@gmail.com>
Date:   2016-05-22T22:14:22Z

    [SPARK-15474][SQL]ORC data source fails to write and read back empty dataframe

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15474][SQL]ORC data source fails to wri...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/13257#issuecomment-221132286
  
    Ah, I think he meant this below:
    
    - Parquet
    
    ```scala
    val emptyDf = spark.range(10).limit(0).toDF()
    emptyDf.write
      .format("parquet")
      .save(path.getCanonicalPath)
    
    val copyEmptyDf = spark.read
      .format("parquet")
      .load(path.getCanonicalPath)
    
    copyEmptyDf.printSchema()
    ```
    
    ```
    root
     |-- id: long (nullable = true)
    ```
    
    - ORC (with this PR)
    
    ```scala
    val emptyDf = spark.range(10).limit(0).toDF()
    emptyDf.write
      .format("orc")
      .save(path.getCanonicalPath)
    
    val copyEmptyDf = spark.read
      .format("orc")
      .load(path.getCanonicalPath)
    
    copyEmptyDf.printSchema()
    ```
    
    ```
    root
    ```
    
    





[GitHub] spark pull request: [SPARK-15474][SQL]ORC data source fails to wri...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/13257#issuecomment-221455213
  
    @sbcd90 I currently can't think of other alternatives and it seems that's why it has not been enabled again.




[GitHub] spark pull request: [SPARK-15474][SQL]ORC data source fails to wri...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13257#issuecomment-220860424
  
    Can one of the admins verify this patch?




[GitHub] spark pull request: [SPARK-15474][SQL]ORC data source fails to wri...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/13257#issuecomment-220860596
  
    @sbcd90 I think we will need a test to make sure this fixes the issue and other changes in the future do not break this change.
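
A round-trip check of the kind requested here might look like the sketch below. It is not the test added to the PR: the object `EmptyOrcRoundTrip` and the method `keepsSchema` are hypothetical names, and the sketch assumes a Spark release that provides `SparkSession` and has the requested data source available (older releases needed Hive support for ORC). Only column names and types are compared, because columns can come back as nullable after a read. On versions where the schema cannot be recovered, the read may throw instead of returning `false`.

```scala
import java.io.File
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper for illustration; not part of the pull request.
object EmptyOrcRoundTrip {

  // Column names and types only; nullability is deliberately ignored.
  private def shape(df: DataFrame) = df.schema.map(f => (f.name, f.dataType))

  /** Writes an empty DataFrame in `format`, reads it back, and reports
    * whether the column names and types survived the round trip. */
  def keepsSchema(spark: SparkSession, format: String): Boolean = {
    // Save into a fresh, not-yet-existing path under a temp directory.
    val path = new File(Files.createTempDirectory("empty-df").toFile, "data")

    val emptyDf = spark.range(10).limit(0).toDF()
    emptyDf.write.format(format).save(path.getCanonicalPath)

    val copy = spark.read.format(format).load(path.getCanonicalPath)
    shape(copy) == shape(emptyDf) && copy.count() == 0
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("empty-round-trip")
      .getOrCreate()
    try {
      Seq("parquet", "orc").foreach { format =>
        println(s"$format keeps schema: ${keepsSchema(spark, format)}")
      }
    } finally {
      spark.stop()
    }
  }
}
```

Taking the format as a parameter lets the same check run against Parquet as a baseline, so an ORC failure can be told apart from a problem in the test setup.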




[GitHub] spark pull request: [SPARK-15474][SQL]ORC data source fails to wri...

Posted by sbcd90 <gi...@git.apache.org>.
Github user sbcd90 closed the pull request at:

    https://github.com/apache/spark/pull/13257




[GitHub] spark pull request: [SPARK-15474][SQL]ORC data source fails to wri...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/13257#issuecomment-220898876
  
    @sbcd90 I just read this JIRA [SPARK-8501](https://issues.apache.org/jira/browse/SPARK-8501) to figure out why it was disabled and tried to manually test this. 
    
    In short, there seems to be a "bug" in how the schema is written for empty ORC files: when the data is empty, the schema does not appear to be written (whereas Parquet writes it). Would there be any way to keep the schema even if the data is empty?




[GitHub] spark pull request: [SPARK-15474][SQL]ORC data source fails to wri...

Posted by sbcd90 <gi...@git.apache.org>.
Github user sbcd90 commented on the pull request:

    https://github.com/apache/spark/pull/13257#issuecomment-220959617
  
    @HyukjinKwon, this PR fixes exactly that issue. If you test this PR with the sample code you provided in the JIRA ticket:
    
    ```scala
    import java.io.File

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val conf = new SparkConf().setAppName("TestApp16").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    
    val path = new File("check.orc")
    
    // Write a DataFrame that has a schema but no rows.
    val df = sqlContext.range(10).limit(0)
    df.write.format("orc").save(path.getCanonicalPath)
    ```
    you will find that the empty schema is kept. I believe that's the only way to fix this issue.




[GitHub] spark issue #13257: [SPARK-15474][SQL]ORC data source fails to write and rea...

Posted by nchammas <gi...@git.apache.org>.
Github user nchammas commented on the issue:

    https://github.com/apache/spark/pull/13257
  
    The discussion on [ORC-152](https://issues.apache.org/jira/browse/ORC-152) suggests that this is an issue with Spark's DataFrame writer for ORC, not with ORC itself.
    
    If you have evidence that this is not the case, it would be good to post it directly on ORC-152 so we can get input from people on that project.




[GitHub] spark issue #13257: [SPARK-15474][SQL]ORC data source fails to write and rea...

Posted by omalley <gi...@git.apache.org>.
Github user omalley commented on the issue:

    https://github.com/apache/spark/pull/13257
  
    Ok, I see the problem. Hive's OrcInputFormat has that property, because it was getting the schema from the ObjectInspector, which only came with the values. When I get a chance, let me look at what would be required to have you guys use the ORC project APIs directly.
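
The mechanism described in this comment can be illustrated with a toy model. The sketch below uses no Hive or ORC classes: `ToyFile`, `writeInferringSchema` and `writeWithSchema` are made-up names, and a "schema" is reduced to a list of column names. It only shows why a writer that learns the schema from the values it receives has nothing to record when there are no values, while a writer that is handed the schema up front does.

```scala
// Toy model only: none of these types are Hive or ORC classes.
object SchemaFromValuesSketch {

  // A "file" is the schema recorded in its footer (if any) plus its rows.
  final case class ToyFile(schema: Option[Seq[String]], rows: Seq[Map[String, Any]])

  // Learns the schema from the first row, loosely analogous to taking it
  // from an inspector that only arrives together with the values.
  def writeInferringSchema(rows: Seq[Map[String, Any]]): ToyFile =
    ToyFile(rows.headOption.map(_.keys.toSeq.sorted), rows)

  // Receives the schema independently of the rows.
  def writeWithSchema(schema: Seq[String], rows: Seq[Map[String, Any]]): ToyFile =
    ToyFile(Some(schema), rows)

  def main(args: Array[String]): Unit = {
    val noRows = Seq.empty[Map[String, Any]]
    // With no rows, the inferring writer has no schema to record.
    println(writeInferringSchema(noRows).schema)
    // The explicit writer records the schema even though there are no rows.
    println(writeWithSchema(Seq("id"), noRows).schema)
  }
}
```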




[GitHub] spark pull request: [SPARK-15474][SQL]ORC data source fails to wri...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/13257#issuecomment-220886604
  
    Hi @sbcd90, I am not a committer, but I left a comment because I like your PR. Let me add some more comments on things I think could be changed.




[GitHub] spark pull request: [SPARK-15474][SQL]ORC data source fails to wri...

Posted by sbcd90 <gi...@git.apache.org>.
Github user sbcd90 commented on the pull request:

    https://github.com/apache/spark/pull/13257#issuecomment-221452898
  
    Hello @HyukjinKwon, I think it is an ORC issue. The final call goes to the ORC API, and I feel that the issue should be fixed in ORC.
    
    ```
    new OrcOutputFormat().getRecordWriter(
      new Path(path, filename).getFileSystem(conf),
      conf.asInstanceOf[JobConf],
      new Path(path, filename).toString,
      Reporter.NULL
    ).asInstanceOf[RecordWriter[NullWritable, Writable]]
    ```
    
    Do you have any different suggestions?




[GitHub] spark pull request: [SPARK-15474][SQL]ORC data source fails to wri...

Posted by sbcd90 <gi...@git.apache.org>.
Github user sbcd90 commented on the pull request:

    https://github.com/apache/spark/pull/13257#issuecomment-220885344
  
    Hello @HyukjinKwon, I have added a test case now. Please review.




[GitHub] spark pull request: [SPARK-15474][SQL]ORC data source fails to wri...

Posted by sbcd90 <gi...@git.apache.org>.
Github user sbcd90 commented on the pull request:

    https://github.com/apache/spark/pull/13257#issuecomment-222180929
  
    @HyukjinKwon , closing this PR for now.

