Posted to reviews@spark.apache.org by yanbohappy <gi...@git.apache.org> on 2015/02/14 19:02:11 UTC

[GitHub] spark pull request: [SPARK-5821] [SPARK-5746] [SQL] JSON data sour...

GitHub user yanbohappy opened a pull request:

    https://github.com/apache/spark/pull/4607

    [SPARK-5821] [SPARK-5746] [SQL] JSON data source refactor initial draft

    JSON data source refactor
    1. The path in "CREATE TABLE AS SELECT" must be a directory. In this scenario we need to write or append files to an existing table, and an underlying directory is a better fit for append operations, authentication, and authorization.
    For SPARK-5821, if we don't have write permission on the parent directory, the CTAS command will fail.
    Another reason is that we can't append to the HDFS files that represent an RDD; to implement append semantics, we need to write new files and add them to a specific directory.
    2. New INSERT OVERWRITE implementation.
    First, write the newly generated table to a temporary directory named "_temporary" under the path directory. After the insert finishes, delete the original files. Finally, rename "_temporary" to "data".
    This fixes the bug described in SPARK-5746.
    Why rename "_temporary" to "data" rather than move all files in "_temporary" to the path and then delete "_temporary"? Because Spark's RDD.saveAsTextFile(path) and related operations store the whole RDD as HDFS files with names like "part-*****" under the path. If the original files were produced this way and we then use "INSERT" without overwrite, the newly generated table files are also named "part-*****", which would produce a corrupted table.
    This is an initial draft and needs optimization. I look forward to your opinions and comments.
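
The INSERT OVERWRITE steps described above (stage into "_temporary", delete the original files, rename to "data") can be sketched roughly as follows. This is only an illustration on the local filesystem in Python, not the Scala code in this pull request; the function name `insert_overwrite` and the single `part-00000` output file are assumptions made for the example.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


def insert_overwrite(table_dir: Path, rows: List[str]) -> None:
    """Replace the contents of table_dir with rows, staging them first."""
    staging = table_dir / "_temporary"
    final = table_dir / "data"

    # Clear any staging directory left behind by an earlier failed run.
    if staging.exists():
        shutil.rmtree(staging)

    # Step 1: write the newly generated table into the staging directory.
    staging.mkdir(parents=True)
    (staging / "part-00000").write_text("\n".join(rows) + "\n")

    # Step 2: delete the original files (everything except the staging dir).
    for child in table_dir.iterdir():
        if child == staging:
            continue
        if child.is_dir():
            shutil.rmtree(child)
        else:
            child.unlink()

    # Step 3: rename the staging directory to "data".
    staging.rename(final)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        table = Path(tmp) / "json_table"
        table.mkdir()
        (table / "part-00000").write_text('{"a": 1}\n')  # pre-existing output
        insert_overwrite(table, ['{"a": 2}', '{"a": 3}'])
        print(sorted(p.name for p in table.iterdir()))
```

Because the new files live in their own directory until the final rename, they cannot collide with existing "part-*****" files in the table path, which is the concern raised in the description.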

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yanbohappy/spark JSONDataSourceRefactor

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4607.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4607
    
----
commit 8683a483c074f692152159d63a101f78c3c3fe58
Author: Yanbo Liang <ya...@gmail.com>
Date:   2015-02-14T17:37:05Z

    JSON data source refactor initial draft

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5821] [SPARK-5746] [SQL] JSON external ...

Posted by yanbohappy <gi...@git.apache.org>.
Github user yanbohappy commented on the pull request:

    https://github.com/apache/spark/pull/4607#issuecomment-74405701
  
    @yhuai Thank you for your reply. Adding an analysis rule that throws an exception is reasonable, and I look forward to your PR.
    I can address SPARK-5821. I'm working on another PR, #4610, which not only resolves SPARK-5821 but also includes some improvements.
    Could I close this PR and discuss the JSON data source improvements at #4610?




[GitHub] spark pull request: [SPARK-5821] [SPARK-5746] [SQL] JSON external ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4607#issuecomment-74387462
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27495/
    Test FAILed.




[GitHub] spark pull request: [SPARK-5821] [SPARK-5746] [SQL] JSON data sour...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4607#issuecomment-74385439
  
      [Test build #27494 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27494/consoleFull) for   PR 4607 at commit [`0812dd1`](https://github.com/apache/spark/commit/0812dd1c269e2d2c57cce817a1c3ecee46e59c5b).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-5821] [SPARK-5746] [SQL] JSON data sour...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4607#issuecomment-74385482
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27494/
    Test FAILed.




[GitHub] spark pull request: [SPARK-5821] [SPARK-5746] [SQL] JSON external ...

Posted by yanbohappy <gi...@git.apache.org>.
Github user yanbohappy commented on the pull request:

    https://github.com/apache/spark/pull/4607#issuecomment-74405767
  
    Actually, the insert function (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/json/JSONRelation.scala#L107) is never called. The "CTAS" command is simply executed at https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/json/JSONRelation.scala#L81




[GitHub] spark pull request: [SPARK-5821] [SPARK-5746] [SQL] JSON external ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4607#issuecomment-74403026
  
      [Test build #27502 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27502/consoleFull) for   PR 4607 at commit [`41307cd`](https://github.com/apache/spark/commit/41307cd743f5cf27e13b73389ae54d9c1d761718).
     * This patch merges cleanly.




[GitHub] spark pull request: [deprecated] [SPARK-5821] [SPARK-5746] [SQL] J...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/4607#issuecomment-74405871
  
    Yes, please close it.
    
    `insert` is used by `INSERT INTO/OVERWRITE` and `DataFrame.insertInto`. 





[GitHub] spark pull request: [SPARK-5821] [SPARK-5746] [SQL] JSON external ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4607#issuecomment-74385841
  
      [Test build #27495 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27495/consoleFull) for   PR 4607 at commit [`29e138a`](https://github.com/apache/spark/commit/29e138a311208e75a6b94c109a0ba7df408181d1).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-5821] [SPARK-5746] [SQL] JSON external ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/4607#issuecomment-74405659
  
    Actually, I think we just need to throw an exception if the delete returns false when we try to do OVERWRITE (we also need to make the change at https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/json/JSONRelation.scala#L70). 
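
The suggestion above, failing the OVERWRITE when the delete reports `false` instead of silently continuing, might look roughly like this. It is a minimal Python sketch on the local filesystem, not the JSONRelation code; `delete_path` is a hypothetical helper that mimics a delete call signalling failure through its boolean return value.

```python
import os
import tempfile


def delete_path(path: str) -> bool:
    """Hypothetical helper: return True if path was removed, False otherwise."""
    try:
        os.remove(path)
        return True
    except OSError:
        return False


def overwrite(path: str, content: str) -> None:
    """Overwrite path, failing loudly if the old file cannot be deleted."""
    if os.path.exists(path) and not delete_path(path):
        # Surface the failed delete instead of writing on top of stale data.
        raise IOError(f"Unable to clear output path {path} before overwriting it")
    with open(path, "w") as out:
        out.write(content)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "out.json")
        overwrite(target, '{"a": 1}\n')
        overwrite(target, '{"a": 2}\n')
        with open(target) as f:
            print(f.read(), end="")
```

Checking the return value matters because a delete that fails for lack of permission would otherwise leave the old files in place while the write proceeds.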




[GitHub] spark pull request: [SPARK-5821] [SPARK-5746] [SQL] JSON data sour...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4607#issuecomment-74385476
  
      [Test build #27494 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27494/consoleFull) for   PR 4607 at commit [`0812dd1`](https://github.com/apache/spark/commit/0812dd1c269e2d2c57cce817a1c3ecee46e59c5b).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [deprecated] [SPARK-5821] [SPARK-5746] [SQL] J...

Posted by yanbohappy <gi...@git.apache.org>.
Github user yanbohappy closed the pull request at:

    https://github.com/apache/spark/pull/4607




[GitHub] spark pull request: [SPARK-5821] [SPARK-5746] [SQL] JSON external ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4607#issuecomment-74403044
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27502/
    Test FAILed.




[GitHub] spark pull request: [SPARK-5821] [SPARK-5746] [SQL] JSON external ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4607#issuecomment-74387458
  
      [Test build #27495 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27495/consoleFull) for   PR 4607 at commit [`29e138a`](https://github.com/apache/spark/commit/29e138a311208e75a6b94c109a0ba7df408181d1).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-5821] [SPARK-5746] [SQL] JSON external ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4607#issuecomment-74403043
  
      [Test build #27502 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27502/consoleFull) for   PR 4607 at commit [`41307cd`](https://github.com/apache/spark/commit/41307cd743f5cf27e13b73389ae54d9c1d761718).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-5821] [SPARK-5746] [SQL] JSON external ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/4607#issuecomment-74405239
  
    @yanbohappy Thank you for working on it! For SPARK-5746, I think it is better to add an analysis rule to do a check and throw an exception when you find that users try to write to a table while reading it. Actually, I have been working on it and will have a PR soon. How about we use this PR to address the issue of SPARK-5821 in JSONRelation? Can you also try the parquet data source and see if SPARK-5821 also affects that? 
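
The kind of check proposed above for SPARK-5746 (rejecting a query that writes to a table it also reads) could be sketched as follows. This is a toy Python model only: `InsertPlan`, `AnalysisError`, and `check_no_self_write` are hypothetical names for illustration and do not correspond to Spark's analyzer API.

```python
from dataclasses import dataclass, field
from typing import List


class AnalysisError(Exception):
    """Raised when a query is rejected before execution."""


@dataclass
class InsertPlan:
    """Toy stand-in for a logical plan: one target table plus the tables it reads."""
    target_table: str
    source_tables: List[str] = field(default_factory=list)


def check_no_self_write(plan: InsertPlan) -> None:
    """Reject plans that write to a table they are also reading from."""
    if plan.target_table in plan.source_tables:
        raise AnalysisError(
            f"Cannot write to table '{plan.target_table}' while also reading from it"
        )


if __name__ == "__main__":
    check_no_self_write(InsertPlan("t_out", ["t_in"]))  # accepted
    try:
        check_no_self_write(InsertPlan("t", ["t", "other"]))
    except AnalysisError as err:
        print(err)
```

Running the check before execution means the conflicting query fails up front, before any existing data has been deleted.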

