Posted to commits@spark.apache.org by do...@apache.org on 2020/10/17 23:38:04 UTC
[spark] branch branch-2.4 updated: [SPARK-26646][TEST][PYSPARK][2.4] Fix flaky test: pyspark.mllib.tests.StreamingLogisticRegressionWithSGDTests.test_training_and_prediction

This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-2.4 by this push:
     new 2e72b01  [SPARK-26646][TEST][PYSPARK][2.4] Fix flaky test: pyspark.mllib.tests.StreamingLogisticRegressionWithSGDTests.test_training_and_prediction
2e72b01 is described below

commit 2e72b0110c0d962a7997fddb2ef08b6613f3d338
Author: Liang-Chi Hsieh <vi...@gmail.com>
AuthorDate: Sat Oct 17 16:31:42 2020 -0700

    [SPARK-26646][TEST][PYSPARK][2.4] Fix flaky test: pyspark.mllib.tests.StreamingLogisticRegressionWithSGDTests.test_training_and_prediction
    
    ### What changes were proposed in this pull request?
    
    This is a backport of SPARK-26646 to branch-2.4 to fix a flaky test in that branch.
    
    ### Why are the changes needed?
    
    The test pyspark.mllib.tests.StreamingLogisticRegressionWithSGDTests.test_training_and_prediction is occasionally flaky.
    
    ```
    Traceback (most recent call last):
      File "/home/runner/work/spark/spark/python/pyspark/mllib/tests.py", line 1492, in test_training_and_prediction
        self._eventually(condition, timeout=180.0)
      File "/home/runner/work/spark/spark/python/pyspark/mllib/tests.py", line 133, in _eventually
        lastValue = condition()
      File "/home/runner/work/spark/spark/python/pyspark/mllib/tests.py", line 1487, in condition
        self.assertGreater(errors[1] - errors[-1], 0.3)
    AssertionError: -0.07000000000000006 not greater than 0.3
    ```
    
    The predict stream can be consumed to the end before the input stream. When that happens, the model does not improve as much as expected, and the test fails. This patch increases the number of batches in the streams. This won't increase test time because the test has a timeout.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Unit test
    
    Closes #30078 from viirya/SPARK-26646-2.4.
    
    Authored-by: Liang-Chi Hsieh <vi...@gmail.com>
    Signed-off-by: Dongjoon Hyun <dh...@apple.com>
---
 python/pyspark/mllib/tests.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/python/pyspark/mllib/tests.py b/python/pyspark/mllib/tests.py
index ec9497c..a3df358 100644
--- a/python/pyspark/mllib/tests.py
+++ b/python/pyspark/mllib/tests.py
@@ -1459,7 +1459,7 @@ class StreamingLogisticRegressionWithSGDTests(MLLibStreamingTestCase):
         """Test that the model improves on toy data with no. of batches"""
         input_batches = [
             self.sc.parallelize(self.generateLogisticInput(0, 1.5, 100, 42 + i))
-            for i in range(20)]
+            for i in range(40)]
         predict_batches = [
             b.map(lambda lp: (lp.label, lp.features)) for b in input_batches]
 

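The commit message notes that the extra batches do not lengthen the test because the check runs under a timeout. A minimal sketch of that kind of polling helper follows. It is an illustration only: `eventually`, its `interval` parameter, and the toy `condition` are hypothetical names, and the code is not the `_eventually` method from `tests.py`. It assumes the helper returns as soon as the condition holds, so the timeout only caps the worst case.

```python
import time


def eventually(condition, timeout=30.0, interval=0.01):
    """Poll `condition` until it returns True or `timeout` seconds pass.

    Hypothetical helper for illustration; not the code in tests.py.
    """
    deadline = time.monotonic() + timeout
    last_value = None
    while time.monotonic() < deadline:
        last_value = condition()
        if last_value is True:
            return  # success: stop polling immediately
        time.sleep(interval)
    raise AssertionError(
        "condition not met within %.1f s (last value: %r)"
        % (timeout, last_value))


# Toy condition (an assumption): it becomes true on the third poll,
# so the call returns long before the 5-second timeout.
calls = {"count": 0}


def condition():
    calls["count"] += 1
    return calls["count"] >= 3


start = time.monotonic()
eventually(condition, timeout=5.0)
elapsed = time.monotonic() - start
```

In this sketch the wait ends on the first successful poll, so giving the streams more batches changes how much data is available for the model to improve on, not how long a passing run has to wait.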
