You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by HyukjinKwon <gi...@git.apache.org> on 2017/11/29 13:24:05 UTC

[GitHub] spark pull request #19845: [SPARK-22651][PYTHON][ML] Prevent initiating mult...

GitHub user HyukjinKwon opened a pull request:

    https://github.com/apache/spark/pull/19845

    [SPARK-22651][PYTHON][ML] Prevent initiating multiple Hive clients for ImageSchema.readImages

    ## What changes were proposed in this pull request?
    
    Calling `ImageSchema.readImages` multiple times as below:
    
    ```python
    from pyspark.ml.image import ImageSchema
    data_path = 'data/mllib/images/kittens'
    _ = ImageSchema.readImages(data_path, recursive=True, dropImageFailures=True).collect()
    _ = ImageSchema.readImages(data_path, recursive=True, dropImageFailures=True).collect()
    ```
    
    throws an error as below:
    
    ```
    ...
    org.datanucleus.exceptions.NucleusDataStoreException: Unable to open a test connection to the given database. JDBC url = jdbc:derby:;databaseName=metastore_db;create=true, username = APP. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: ------
    java.sql.SQLException: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@742f639f, see the next exception for details.
    ...
    	at org.apache.derby.jdbc.AutoloadedDriver.connect(Unknown Source)
    ...
    	at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762)
    ...
    	at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:180)
    ...
    	at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:348)
    	at org.apache.spark.ml.image.ImageSchema$$anonfun$readImages$2$$anonfun$apply$1.apply(ImageSchema.scala:253)
    ...
    Caused by: ERROR XJ040: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@742f639f, see the next exception for details.
    	at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
    	at org.apache.derby.impl.jdbc.SQLExceptionFactory.wrapArgsForTransportAcrossDRDA(Unknown Source)
    	... 121 more
    Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database /.../spark/metastore_db.
    ...
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/.../spark/python/pyspark/ml/image.py", line 190, in readImages
        dropImageFailures, float(sampleRatio), seed)
      File "/.../spark/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
      File "/.../spark/python/pyspark/sql/utils.py", line 69, in deco
        raise AnalysisException(s.split(': ', 1)[1], stackTrace)
    pyspark.sql.utils.AnalysisException: u'java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'
    ```
    
    Seems we better stick to `SparkSession.builder.getOrCreate()` like:
    
    https://github.com/apache/spark/blob/51620e288b5e0a7fffc3899c9deadabace28e6d7/python/pyspark/sql/streaming.py#L329
    
    https://github.com/apache/spark/blob/dc5d34d8dcd6526d1dfdac8606661561c7576a62/python/pyspark/sql/column.py#L541
    
    https://github.com/apache/spark/blob/33d43bf1b6f55594187066f0e38ba3985fa2542b/python/pyspark/sql/readwriter.py#L105
    
    
    ## How was this patch tested?
    
    This was tested as below:
    
    ```python
    from pyspark.ml.image import ImageSchema
    data_path = 'data/mllib/images/kittens'
    _ = ImageSchema.readImages(data_path, recursive=True, dropImageFailures=True).collect()
    _ = ImageSchema.readImages(data_path, recursive=True, dropImageFailures=True).collect()
    ```


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-22651

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19845.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19845
    
----
commit c0c3c487474629df00ba50c5b6904552010f1aab
Author: hyukjinkwon <gu...@gmail.com>
Date:   2017-11-29T13:17:24Z

    Prevent initiating multiple Hive clients for ImageSchema.readImages

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19845: [SPARK-22651][PYTHON][ML] Prevent initiating multiple Hi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19845
  
    **[Test build #84342 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84342/testReport)** for PR 19845 at commit [`2e4c402`](https://github.com/apache/spark/commit/2e4c4027f8717af32e3954a018869378551604c8).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19845: [SPARK-22651][PYTHON][ML] Prevent initiating multiple Hi...

Posted by imatiach-msft <gi...@git.apache.org>.
Github user imatiach-msft commented on the issue:

    https://github.com/apache/spark/pull/19845
  
    LGTM as long as all tests pass - but what about the other methods that use "ctx = SparkContext._active_spark_context" -- should those be modified as well?  Can you run those one after the other in sequence too, or do they fail with the same error?  I recall I tried to get the current session just like this in the first iteration but the python tests started to fail when run in parallel, that was the original reason to move to use _active_spark_context.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19845: [SPARK-22651][PYTHON][ML] Prevent initiating multiple Hi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19845
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19845: [SPARK-22651][PYTHON][ML] Prevent initiating mult...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19845#discussion_r154266335
  
    --- Diff: python/pyspark/ml/tests.py ---
    @@ -1837,6 +1837,29 @@ def test_read_images(self):
             self.assertEqual(ImageSchema.undefinedImageType, "Undefined")
     
     
    +class ImageReaderTest2(PySparkTestCase):
    +
    +    @classmethod
    +    def setUpClass(cls):
    +        PySparkTestCase.setUpClass()
    +        try:
    +            cls.sc._jvm.org.apache.hadoop.hive.conf.HiveConf()
    +        except py4j.protocol.Py4JError:
    +            cls.tearDownClass()
    +            raise unittest.SkipTest("Hive is not available")
    +        except TypeError:
    +            cls.tearDownClass()
    +            raise unittest.SkipTest("Hive is not available")
    +        cls.spark = HiveContext._createForTesting(cls.sc)
    +
    --- End diff --
    
    Sure, that should be safer.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19845: [SPARK-22651][PYTHON][ML] Prevent initiating multiple Hi...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19845
  
    Thanks for reviewing this @jiangxb1987, @dongjoon-hyun and @viirya.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19845: [SPARK-22651][PYTHON][ML] Prevent initiating mult...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/19845


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19845: [SPARK-22651][PYTHON][ML] Prevent initiating multiple Hi...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19845
  
    @imatiach-msft, ah, I think it's not about SparkContext but SparkSession, (SparkSession(...) directly) to be more clear, which seems causing multiple Hive clients when Hive support is enabled.
    
    I think the most of PySpark's unit tests do not enable Hive's support by default. Also, I double checked and I ran this multiple times and it seems fine after this fix as well. 
    
    Let me keep my eyes on Jenkins tests about this. Basically the tests are ran in parallel on Jenkins. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19845: [SPARK-22651][PYTHON][ML] Prevent initiating multiple Hi...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19845
  
    I could fold this change into `https://github.com/apache/spark/pull/19835`. I only opened a separate PR here as it was easy to describe this issue separately and seems another issue although the change is small.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19845: [SPARK-22651][PYTHON][ML] Prevent initiating multiple Hi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19845
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84369/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19845: [SPARK-22651][PYTHON][ML] Prevent initiating multiple Hi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19845
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19845: [SPARK-22651][PYTHON][ML] Prevent initiating multiple Hi...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19845
  
    cc @jiangxb1987, @viirya who I am seeing touched and reviewed similar codes, and @imatiach-msft who's the primary author of this codes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19845: [SPARK-22651][PYTHON][ML] Prevent initiating multiple Hi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19845
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84297/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19845: [SPARK-22651][PYTHON][ML] Prevent initiating multiple Hi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19845
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84342/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19845: [SPARK-22651][PYTHON][ML] Prevent initiating multiple Hi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19845
  
    **[Test build #84297 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84297/testReport)** for PR 19845 at commit [`c0c3c48`](https://github.com/apache/spark/commit/c0c3c487474629df00ba50c5b6904552010f1aab).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19845: [SPARK-22651][PYTHON][ML] Prevent initiating mult...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19845#discussion_r153966424
  
    --- Diff: python/pyspark/ml/image.py ---
    @@ -180,9 +180,8 @@ def readImages(self, path, recursive=False, numPartitions=-1,
             .. versionadded:: 2.3.0
             """
     
    -        ctx = SparkContext._active_spark_context
    -        spark = SparkSession(ctx)
    -        image_schema = ctx._jvm.org.apache.spark.ml.image.ImageSchema
    +        spark = SparkSession.builder.getOrCreate()
    --- End diff --
    
    Yeah, I think this should be best practice to initialize `SparkSession`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19845: [SPARK-22651][PYTHON][ML] Prevent initiating multiple Hi...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/19845
  
    Shall we also add a simple test to `ImageReaderTest`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19845: [SPARK-22651][PYTHON][ML] Prevent initiating multiple Hi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19845
  
    **[Test build #84369 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84369/testReport)** for PR 19845 at commit [`0c7e537`](https://github.com/apache/spark/commit/0c7e5378a491f01b14df23fe56ae50a0f52971ee).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19845: [SPARK-22651][PYTHON][ML] Prevent initiating multiple Hi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19845
  
    **[Test build #84297 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84297/testReport)** for PR 19845 at commit [`c0c3c48`](https://github.com/apache/spark/commit/c0c3c487474629df00ba50c5b6904552010f1aab).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19845: [SPARK-22651][PYTHON][ML] Prevent initiating multiple Hi...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/19845
  
    @HyukjinKwon Thanks. I forgot the Hive support is needed to test it. The added test looks good.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19845: [SPARK-22651][PYTHON][ML] Prevent initiating multiple Hi...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19845
  
    Merged to master.
    
    Thanks for reviewing this @viirya, @jiangxb1987, @dongjoon-hyun and @imatiach-msft.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19845: [SPARK-22651][PYTHON][ML] Prevent initiating mult...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19845#discussion_r154262931
  
    --- Diff: python/pyspark/ml/tests.py ---
    @@ -1837,6 +1837,29 @@ def test_read_images(self):
             self.assertEqual(ImageSchema.undefinedImageType, "Undefined")
     
     
    +class ImageReaderTest2(PySparkTestCase):
    +
    +    @classmethod
    +    def setUpClass(cls):
    +        PySparkTestCase.setUpClass()
    +        try:
    +            cls.sc._jvm.org.apache.hadoop.hive.conf.HiveConf()
    +        except py4j.protocol.Py4JError:
    +            cls.tearDownClass()
    +            raise unittest.SkipTest("Hive is not available")
    +        except TypeError:
    +            cls.tearDownClass()
    +            raise unittest.SkipTest("Hive is not available")
    +        cls.spark = HiveContext._createForTesting(cls.sc)
    +
    --- End diff --
    
    Add classmethod `tearDownClass` to stop the `cls.spark`? I didn't see `HiveContextSQLTests` closes it anyway, maybe we can fix it too.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19845: [SPARK-22651][PYTHON][ML] Prevent initiating multiple Hi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19845
  
    **[Test build #84342 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84342/testReport)** for PR 19845 at commit [`2e4c402`](https://github.com/apache/spark/commit/2e4c4027f8717af32e3954a018869378551604c8).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class ImageReaderTest2(PySparkTestCase):`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19845: [SPARK-22651][PYTHON][ML] Prevent initiating multiple Hi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19845
  
    **[Test build #84369 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84369/testReport)** for PR 19845 at commit [`0c7e537`](https://github.com/apache/spark/commit/0c7e5378a491f01b14df23fe56ae50a0f52971ee).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19845: [SPARK-22651][PYTHON][ML] Prevent initiating multiple Hi...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19845
  
    @viirya, actually, I think it's not that simple to just add multiple `ImageSchema.readImages` (if you thought this way). I think I need another class with a Hive support enabled session. Let me try to add it now that you pointed it out anyway.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19845: [SPARK-22651][PYTHON][ML] Prevent initiating multiple Hi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19845
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19845: [SPARK-22651][PYTHON][ML] Prevent initiating multiple Hi...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/19845
  
    Let me look into this tomorrow. :)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org