Posted to issues@spark.apache.org by "Joseph K. Bradley (JIRA)" <ji...@apache.org> on 2015/08/18 19:50:45 UTC

[jira] [Created] (SPARK-10086) Flaky StreamingKMeans test in PySpark

Joseph K. Bradley created SPARK-10086:
-----------------------------------------

             Summary: Flaky StreamingKMeans test in PySpark
                 Key: SPARK-10086
                 URL: https://issues.apache.org/jira/browse/SPARK-10086
             Project: Spark
          Issue Type: Bug
          Components: MLlib, PySpark, Streaming, Tests
    Affects Versions: 1.5.0
            Reporter: Joseph K. Bradley


Here's a report on investigating this test failure:

[https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41081/console]

It is a StreamingKMeans test which trains on a DStream with 2 batches and then tests on those same 2 batches.  It fails here: [https://github.com/apache/spark/blob/1968276af0f681fe51328b7dd795bd21724a5441/python/pyspark/mllib/tests.py#L1144]

I recreated the same test, with variants training on: (1) the original 2 batches, (2a) just the first batch, (2b) just the second batch.  Here's the code:
[https://github.com/jkbradley/spark/blob/d3eedb7773b9e15595cbc79c009fe932703c0b11/examples/src/main/python/mllib/streaming_kmeans.py]

Disturbingly, only (2b) reproduced the failure, indicating that batch 2 was processed but batch 1 was not.  [~tdas] says queueStream should have consistency guarantees, so that should not happen.  There is no randomness in the StreamingKMeans algorithm (since the initial centers are fixed, not randomized).
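To see why missing a batch changes the result deterministically, here is a minimal pure-Python sketch of a streaming k-means per-batch update (running-mean form, decay factor 1.0). This is an illustration of the technique, not the MLlib implementation: with fixed initial centers, folding in the same batches in the same order always yields the same centers, so a wrong answer means a batch was skipped, not that the algorithm is random.

```python
# Sketch of a streaming k-means batch update (decay = 1.0).
# Not the MLlib code: a hypothetical minimal version for illustration.

def assign(point, centers):
    """Index of the nearest center by squared Euclidean distance."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centers)), key=lambda i: sqdist(point, centers[i]))

def update(centers, weights, batch):
    """Fold one batch of points into the running centers/weights in place."""
    for p in batch:
        i = assign(p, centers)
        weights[i] += 1.0
        # Move center i toward p by 1/weight: an incremental running mean.
        centers[i] = [c + (x - c) / weights[i] for c, x in zip(centers[i], p)]
    return centers, weights
```

For example, starting from centers [[0.0], [10.0]] with unit weights and feeding the batch [[1.0], [9.0]] always produces centers [[0.5], [9.5]]; skipping that batch leaves the centers untouched, which is the kind of discrepancy the failing assertion exposes.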

*Current status: Not sure what happened yet*

CC: [~tdas] [~freeman-lab] [~mengxr]

Failure message:
{code}
======================================================================
FAIL: test_trainOn_predictOn (__main__.StreamingKMeansTest)
Test that prediction happens on the updated model.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py", line 1147, in test_trainOn_predictOn
    self._eventually(condition, catch_assertions=True)
  File "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py", line 123, in _eventually
    raise lastValue
  File "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py", line 114, in _eventually
    lastValue = condition()
  File "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py", line 1144, in condition
    self.assertEqual(predict_results, [[0, 1, 1], [1, 0, 1]])
AssertionError: Lists differ: [[0, 1, 1], [0, 0, 0]] != [[0, 1, 1], [1, 0, 1]]

First differing element 1:
[0, 0, 0]
[1, 0, 1]

- [[0, 1, 1], [0, 0, 0]]
?                 ^^^^

+ [[0, 1, 1], [1, 0, 1]]
?              +++   ^


----------------------------------------------------------------------
Ran 62 tests in 164.188s
{code}
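The `_eventually` helper in the traceback retries a condition until a timeout, and with catch_assertions=True re-raises the last AssertionError once time runs out (which is why the final assertion surfaces above). A rough pure-Python sketch of that retry pattern follows; it is a hypothetical reimplementation for illustration, not the tests.py source:

```python
import time

def eventually(condition, timeout=30.0, catch_assertions=False):
    """Retry `condition` until it returns True or `timeout` seconds pass.

    With catch_assertions=True, AssertionErrors raised by `condition` are
    swallowed while retrying and the most recent one is re-raised on timeout.
    """
    start = time.time()
    last_exc = None
    while time.time() - start < timeout:
        if catch_assertions:
            try:
                if condition():
                    return
            except AssertionError as e:
                last_exc = e
        else:
            if condition():
                return
        time.sleep(0.01)  # brief pause between polls
    if last_exc is not None:
        raise last_exc
    raise AssertionError("condition not met within %s seconds" % timeout)
```

Note the limitation this report runs into: a retry loop like this can only mask transient lag, not a batch that was never processed, so the assertion keeps failing until the timeout and the last mismatch is reported.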




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
