You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by chenghao-intel <gi...@git.apache.org> on 2014/09/11 09:05:51 UTC

[GitHub] spark pull request: [SPARK-3481] [SQL] Eliminate the error log in ...

GitHub user chenghao-intel opened a pull request:

    https://github.com/apache/spark/pull/2352

    [SPARK-3481] [SQL] Eliminate the error log in local Hive comparison test

    Logically, we should remove the Hive Table/Database first and then reset the Hive configuration, repoint to the new data warehouse directory etc.
    Otherwise it raised exceptions like "Database doesn't not exists: default" in the local testing.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/chenghao-intel/spark test_hive

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2352.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2352
    
----
commit 74fd76b1f7aa9f8deb1e9b4dbab204d54b729235
Author: Cheng Hao <ha...@intel.com>
Date:   2014-09-11T06:31:58Z

    eliminate the error log

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3481] [SQL] Eliminate the error log in ...

Posted by chenghao-intel <gi...@git.apache.org>.

Github user chenghao-intel commented on the pull request:

    https://github.com/apache/spark/pull/2352#issuecomment-55546663
  
    Thank you @liancheng for so detailed explanation. Actually I didn't know those while submitting this PR. :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3481] [SQL] Eliminate the error log in ...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/2352#issuecomment-55365014
  
    Hmm... I couldn't reproduce the `HiveQuerySuite` failure, but I can steadily reproduce similar failure with `StatisticsSuite`, and your patch does fixes this one.
    
    Just to make sure I understand this correctly: so the `RESET` command resets all Hive configurations to their default value, including `javax.jdo.option.ConnectionURL` and `hive.metastore.warehouse.dir`. And in the case of `TestHiveContext`, default value of these two properties are respectively:
    
    - `jdbc:derby:;databaseName=metastore_db;create=true`, and
    - `/user/hive/warehouse`
    
    which overrides the random temporary directories specified by `TestHiveContext`. Then, when [this line](https://github.com/apache/spark/blame/42904b8d013e71d03e301c3da62e33b4cc2eb54e/sql/hive/src/main/scala/org/apache/spark/sql/hive/TestHive.scala#L389) gets executed, we try to find the "default" database from the wrong place and thus causes the "default" database missing error.
    
    The weird thing is that this part of code has existed for a long time (ever since Spark SQL became part of Spark), but it never fails :( While debugging this "default" database missing issue, I also observed that some race conditions and execution order related issue, which maybe the reason that this bug has been covered for so long a time...
    
    Anyway, this PR LGTM. Thanks for fixing this!
    
    @marmbrus Let's see whether this can bring our Maven build on Jenkins back!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3481] [SQL] Eliminate the error log in ...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/2352


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3481] [SQL] Eliminate the error log in ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2352#issuecomment-55231414
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20142/consoleFull) for   PR 2352 at commit [`74fd76b`](https://github.com/apache/spark/commit/74fd76b1f7aa9f8deb1e9b4dbab204d54b729235).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3481] [SQL] Eliminate the error log in ...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/2352#issuecomment-55310300
  
    @chenghao-intel Actually this issue has bothered us for some time, and makes the Maven build on Jenkins fail. But we had never reproduce it locally... Would you mind to elaborate on the exact reproduction steps? Details like Maven profiles and other parameters would be greatly helpful. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3481] [SQL] Eliminate the error log in ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2352#issuecomment-55241690
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20142/consoleFull) for   PR 2352 at commit [`74fd76b`](https://github.com/apache/spark/commit/74fd76b1f7aa9f8deb1e9b4dbab204d54b729235).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3481] [SQL] Eliminate the error log in ...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/2352#issuecomment-55440621

Got more clue on this, which explains why `HiveQuerySuite` doesn't fail previously. (but @chenghao-intel, why it fails on your side? Still mysterious.) Basically, we were jumping between two different sets of local metastore/warehouse directories while testing. The detailed process is:

1. When the `TestHive` singleton object is instantiated, we create a pair of temporary directories and [configure them as local testing metastore/warehouse](https://github.com/apache/spark/blob/533377621f1e178e18fa0b79d653a11b66e4e250/sql/hive/src/main/scala/org/apache/spark/sql/hive/TestHive.scala#L80). Let's abbreviate them as `m1` and `w1`.

At this point, these two directories are created, but remain empty. Default Hive database will be created lazily later.

1. Then `HiveQuerySuite` gets started. Whenever a test case created via `HiveComparisonTest.createQueryTest` is executed, we first [execute a `SHOW TABLES` command](https://github.com/apache/spark/blob/533377621f1e178e18fa0b79d653a11b66e4e250/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveComparisonTest.scala#L253-L254) (notice the "MINOR HACK" comment).

An important thing happens here is that the "default" database gets created in `m1` lazily at this point.

1. Then [`reset()` is called](https://github.com/apache/spark/blob/533377621f1e178e18fa0b79d653a11b66e4e250/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveComparisonTest.scala#L255).

Within `reset()`, first of all, we execute a Hive `RESET` command, which sets all configurations to their default values, including `javax.jdo.option.ConnectionURL` and `hive.metastore.warehouse.dir`. This implies metastore is reset to the `metastore_db` directory under current working directory and warehouse is reset to `/user/hive/warehouse` (which usually doesn't exist).

1. Then follows the `getAllTables` call, which is used to delete all tables in the "default" database.

During the `getAllTables` call, the `metastore_db` directory is created if it's not there, and again, Hive creates an empty "default" database in it lazily. Hmm... wait, so here we end up with two "default" databases, one in `m1` and another in `metastore_db`! As a result, [these lines](https://github.com/apache/spark/blob/533377621f1e178e18fa0b79d653a11b66e4e250/sql/hive/src/main/scala/org/apache/spark/sql/hive/TestHive.scala#L389-L405) are actually always trying to cleanup tables and databases under the newly created `metastore_db` directory, which is empty.

1. At last, we call `configure()` again and sets metadata/warehouse directories back to `m1`/`w1`, which remain intact.

In a word, the TL;DR here is, previously, testing databases and testing tables created by test suites inherited from `HiveComparisonTest` never really got cleaned up, and the "MINOR HACK" perfectly covered up probably the oldest bug in the history of Spark SQL! By applying this PR, we should be able to remove this hack safely.

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3481] [SQL] Eliminate the error log in ...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/2352#issuecomment-55442596
  
    Thanks for finding this!  I've merge to master and 1.1 and 1.0.
    
    @JoshRosen I think this should fix the Jenkins errors.  Please let me know if SQL is responsible for any more failures.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3481] [SQL] Eliminate the error log in ...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/2352#issuecomment-55366440
  
    Actually the SBT Jenkins build is still alright, it's the Maven build that is broken, that's even stranger, since you can easily reproduce it with SBT...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3481] [SQL] Eliminate the error log in ...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/2352#issuecomment-55441813
  
    If you run `StatisticsSuite` separately with either `sbt test-only` or `mvn -DwildcardSuites`, you can always reproduce the "default" database missing exception. Because no command like `SHOW TABLES`  gets executed, thus the "default" database is not created in the temporary testing warehouse directory.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3481] [SQL] Eliminate the error log in ...

Posted by chenghao-intel <gi...@git.apache.org>.

Github user chenghao-intel commented on the pull request:

    https://github.com/apache/spark/pull/2352#issuecomment-55366097
  
    Yes, I think you understand correctly, but I am not sure why the unit test passed with Jenkins previously. Probably the multithreading stuff did the tricky. Let's see if this fix will help.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3481] [SQL] Eliminate the error log in ...

Posted by chenghao-intel <gi...@git.apache.org>.

Github user chenghao-intel commented on the pull request:

    https://github.com/apache/spark/pull/2352#issuecomment-55347427
  
    I got the latest master and run 
    ```
    sbt/sbt -Phive assembly 'test-only org.apache.spark.sql.hive.execution.HiveQuerySuite'
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org