You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@spark.apache.org by Adam Chhina <am...@gmail.com> on 2023/01/18 16:56:41 UTC

Re: Building Spark to run PySpark Tests?

Bump,

Just trying to see where I can find what tests are known failing for a particular release, to ensure I’m building upstream correctly following the build docs. I figured this would be the best place to ask as it pertains to building and testing upstream (also more than happy to provide a PR for any docs if required afterwards), however if there would be a more appropriate place, please let me know.

Best,

Adam Chhina

> On Dec 27, 2022, at 11:37 AM, Adam Chhina <am...@gmail.com> wrote:
> 
> As part of an upgrade I was looking to run upstream PySpark unit tests on `v3.2.1-rc2` before I applied some downstream patches and tested those. However, I'm running into some issues with failing unit tests, which I'm not sure are failing upstream or due to some step I missed in the build.
> 
> The current failing tests (at least so far, since I believe the python script exits on test failure):
> ```
> ======================================================================
> FAIL: test_train_prediction (pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests)
> Test that error on test data improves as model is trained.
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 474, in test_train_prediction
>     eventually(condition, timeout=180.0)
>   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 86, in eventually
>     lastValue = condition()
>   File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 469, in condition
>     self.assertGreater(errors[1] - errors[-1], 2)
> AssertionError: 1.8960983527735014 not greater than 2
> 
> ======================================================================
> FAIL: test_parameter_accuracy (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
> Test that the final value of weights is close to the desired value.
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 229, in test_parameter_accuracy
>     eventually(condition, timeout=60.0, catch_assertions=True)
>   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 91, in eventually
>     raise lastValue
>   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 82, in eventually
>     lastValue = condition()
>   File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 226, in condition
>     self.assertAlmostEqual(rel, 0.1, 1)
> AssertionError: 0.23052813480829393 != 0.1 within 1 places (0.13052813480829392 difference)
> 
> ======================================================================
> FAIL: test_training_and_prediction (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
> Test that the model improves on toy data with no. of batches
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 334, in test_training_and_prediction
>     eventually(condition, timeout=180.0)
>   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 93, in eventually
>     raise AssertionError(
> AssertionError: Test failed due to timeout after 180 sec, with last condition returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74, 0.76, 0.78, 0.7, 0.78, 0.8, 0.74, 0.77, 0.75, 0.76, 0.76, 0.75, 0.78, 0.74, 0.64, 0.64, 0.71, 0.78, 0.76, 0.64, 0.68, 0.69, 0.72, 0.77
> 
> ----------------------------------------------------------------------
> Ran 13 tests in 661.536s
> 
> FAILED (failures=3, skipped=1)
> 
> Had test failures in pyspark.mllib.tests.test_streaming_algorithms with /usr/local/bin/python3; see logs.
> ```
> 
> Here's how I'm currently building Spark, I was using the [building-spark](https://spark.apache.org/docs/3..1/building-spark.html) docs as a reference.
> ```
> > git clone git@github.com:apache/spark.git
> > git checkout -b spark-321 v3.2.1
> > ./build/mvn -DskipTests clean package -Phive
> > export JAVA_HOME=$(path/to/jdk/11)
> > ./python/run-tests
> ```
> 
> Current Java version
> ```
> java -version
> openjdk version "11.0.17" 2022-10-18
> OpenJDK Runtime Environment Homebrew (build 11.0.17+0)
> OpenJDK 64-Bit Server VM Homebrew (build 11.0.17+0, mixed mode)
> ```
> 
> Alternatively, I've also tried simply building Spark and using a python=3.9 venv and installing the requirements from `pip install -r dev/requirements.txt` and using that as the interpreter to run tests. However, I was running into some failing pandas test which to me seemed like it was coming from a pandas version difference as `requirements.txt` didn't specify a version.
> 
> I suppose I have a couple of questions in regards to this:
> 1. Am I missing a build step to build Spark and run PySpark unit tests?
> 2. Where could I find whether an upstream test is failing for a specific release?
> 3. Would it be possible to configure the `run-tests` script to run all tests regardless of test failures?


---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org

Re: Building Spark to run PySpark Tests?

Posted by Sean Owen <sr...@gmail.com>.

It's not clear what error you're facing from this info (ConnectionError
could mean lots of things), so would be hard to generalize answers. How
much mem do you have on your Mac?
-Xmx2g sounds low, but also probably doesn't matter much.
Spark builds work on my Mac, FWIW.

On Thu, Jan 19, 2023 at 10:15 AM Adam Chhina <am...@gmail.com> wrote:

> Hmm, would there be a list of common env issues that would interfere with
> builds? Looking up the error message, it seemed like often the issue was
> OOM by the JVM process. I’m not sure if that’s what’s happening here, since
> during the build and setting up the tests the config should have allocated
> enough memory?
>
> I’ve been just trying to follow the build docs, and so far I’m running as
> such:
>
> > git clone --branch v3.2.3 https://github.com/apache/spark.git
> > cd spark
> > export MAVEN_OPTS="-Xss64m -Xmx2g -XX:ReservedCodeCacheSize=1g” // was
> unset, but set to be safe
> > export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES // I saw in the
> developer tools that some pyspark tests were having issues on macOS
> > export JAVA_HOME=`/usr/libexec/java_home -v 11`
> > ./build/mvn -DskipTests clean package -Phive
> > ./python/run-tests --python-executables --testnames
> ‘pyspark.tests.test_broadcast'
>
> > java -version
>
> openjdk version "11.0.17" 2022-10-18
>
> OpenJDK Runtime Environment Homebrew (build 11.0.17+0)
>
> OpenJDK 64-Bit Server VM Homebrew (build 11.0.17+0, mixed mode)
>
>
> > OS
>
> Ventura 13.1 (22C65)
>
>
> Best,
>
>
> Adam Chhina
>
> On Jan 18, 2023, at 6:50 PM, Sean Owen <sr...@gmail.com> wrote:
>
> Release _branches_ are tested as commits arrive to the branch, yes. That's
> what you see at https://github.com/apache/spark/actions
> Released versions are fixed, they don't change, and were also manually
> tested before release, so no they are not re-tested; there is no need.
>
> You presumably have some local env issue, because the source of Spark
> 3.2.3 was passing CI/CD at time of release as well as manual tests of the
> PMC.
>
>
> On Wed, Jan 18, 2023 at 5:24 PM Adam Chhina <am...@gmail.com> wrote:
>
>> Hi Sean,
>>
>> That’s fair in regards to 3.3.x being the current release branch. I’m not
>> familiar with the testing schedule, but I had assumed all currently
>> supported release versions would have some nightly/weekly tests ran; is
>> that not the case? I only ask, as when I when I’m seeing these test
>> failures, I assumed these were either known/unknown from some recurring
>> testing pipeline.
>>
>> Also, unfortunately using v3.2.3 also had the same test failures.
>>
>> > git clone --branch v3.2.3 https://github.com/apache/spark.git
>>
>> I’ve posted the traceback below for one of the ran tests. At the end it
>> mentioned to check the logs - `see logs`. However I wasn’t sure whether
>> that just meant the traceback or some more detailed logs elsewhere? I
>> wasn’t able to see any files that looked relevant running `find . -name
>> “*logs*”` afterwards. Sorry if I’m missing something obvious.
>>
>> ```
>> test_broadcast_no_encryption (pyspark.tests.test_broadcast.BroadcastTest)
>> ... ERROR
>> test_broadcast_value_against_gc
>> (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
>> test_broadcast_value_driver_encryption
>> (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
>> test_broadcast_value_driver_no_encryption
>> (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
>> test_broadcast_with_encryption
>> (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
>>
>> ======================================================================
>> ERROR: test_broadcast_with_encryption
>> (pyspark.tests.test_broadcast.BroadcastTest)
>> ----------------------------------------------------------------------
>> Traceback (most recent call last):
>>   File "$path/spark/python/pyspark/tests/test_broadcast.py", line 67, in
>> test_broadcast_with_encryption
>>     self._test_multiple_broadcasts(("spark.io.encryption.enabled",
>> "true"))
>>   File "$path/spark/python/pyspark/tests/test_broadcast.py", line 58, in
>> _test_multiple_broadcasts
>>     conf = SparkConf()
>>   File "$path/spark/python/pyspark/conf.py", line 120, in __init__
>>     self._jconf = _jvm.SparkConf(loadDefaults)
>>   File
>> "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line
>> 1709, in __getattr__
>>     answer = self._gateway_client.send_command(
>>   File
>> "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line
>> 1036, in send_command
>>     connection = self._get_connection()
>>   File
>> "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line
>> 284, in _get_connection
>>     connection = self._create_new_connection()
>>   File
>> "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line
>> 291, in _create_new_connection
>>     connection.connect_to_java_server()
>>   File
>> "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line
>> 438, in connect_to_java_server
>>     self.socket.connect((self.java_address, self.java_port))
>> ConnectionRefusedError: [Errno 61] Connection refused
>>
>> ----------------------------------------------------------------------
>> Ran 7 tests in 12.950s
>>
>> FAILED (errors=7)
>> sys:1: ResourceWarning: unclosed file <_io.BufferedWriter name=4>
>>
>> Had test failures in pyspark.tests.test_broadcast with
>> /usr/local/bin/python3; see logs.
>> ```
>>
>> Best,
>>
>> Adam Chhina
>>
>> On Jan 18, 2023, at 5:03 PM, Sean Owen <sr...@gmail.com> wrote:
>>
>> That isn't the released version either, but rather the head of the 3.2
>> branch (which is beyond 3.2.3).
>> You may want to check out the v3.2.3 tag instead:
>> https://github.com/apache/spark/tree/v3.2.3
>> ... instead of 3.2.1.
>> But note of course the 3.3.x is the current release branch anyway.
>>
>> Hard to say what the error is without seeing more of the error log.
>>
>> That final warning is fine, just means you are using Java 11+.
>>
>>
>> On Wed, Jan 18, 2023 at 3:59 PM Adam Chhina <am...@gmail.com>
>> wrote:
>>
>>> Oh, whoops, didn’t realize that wasn’t the release version, thanks!
>>>
>>> > git clone --branch branch-3.2 https://github.com/apache/spark.git
>>>
>>> Ah, so the old failing tests are passing now, but I am seeing failures
>>> in `pyspark.tests.test_broadcast` such as  `test_broadcast_value_against_gc`,
>>> with a majority of them failing due to `ConnectionRefusedError: [Errno
>>> 61] Connection refused`. Maybe these tests are not mean to be ran locally,
>>> and only in the pipeline?
>>>
>>> Also, I see this warning that mentions to notify the maintainers here:
>>>
>>> ```
>>> Starting test(/usr/local/bin/python3): pyspark.tests.test_broadcast
>>> WARNING: An illegal reflective access operation has occurred
>>> WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform
>>> (file:/$path/spark/common/unsafe/target/scala-2.12/classes/) to constructor
>>> java.nio.DirectByteBuffer(long,int)
>>> ```
>>>
>>> FWIW, not sure if this matters, but python executable used for running
>>> these tests is `Python 3.10.9` under `/user/local/bin/python3`.
>>>
>>> Best,
>>>
>>> Adam Chhina
>>>
>>> On Jan 18, 2023, at 3:05 PM, Bjørn Jørgensen <bj...@gmail.com>
>>> wrote:
>>>
>>> Replace
>>> > > git clone git@github.com:apache/spark.git
>>> > > git checkout -b spark-321 v3.2.1
>>>
>>> with
>>> git clone --branch branch-3.2 https://github.com/apache/spark.git
>>> This will give you branch 3.2 as today, what I suppose you call
>>> upstream
>>> https://github.com/apache/spark/commits/branch-3.2
>>> and right now all tests in github action are passed :)
>>>
>>>
>>> ons. 18. jan. 2023 kl. 18:07 skrev Sean Owen <sr...@gmail.com>:
>>>
>>>> Never seen those, but it's probably a difference in pandas, numpy
>>>> versions. You can see the current CICD test results in GitHub Actions. But,
>>>> you want to use release versions, not an RC. 3.2.1 is not the latest
>>>> version, and it's possible the tests were actually failing in the RC.
>>>>
>>>> On Wed, Jan 18, 2023, 10:57 AM Adam Chhina <am...@gmail.com>
>>>> wrote:
>>>>
>>>>> Bump,
>>>>>
>>>>> Just trying to see where I can find what tests are known failing for a
>>>>> particular release, to ensure I’m building upstream correctly following the
>>>>> build docs. I figured this would be the best place to ask as it pertains to
>>>>> building and testing upstream (also more than happy to provide a PR for any
>>>>> docs if required afterwards), however if there would be a more appropriate
>>>>> place, please let me know.
>>>>>
>>>>> Best,
>>>>>
>>>>> Adam Chhina
>>>>>
>>>>> > On Dec 27, 2022, at 11:37 AM, Adam Chhina <am...@gmail.com>
>>>>> wrote:
>>>>> >
>>>>> > As part of an upgrade I was looking to run upstream PySpark unit
>>>>> tests on `v3.2.1-rc2` before I applied some downstream patches and tested
>>>>> those. However, I'm running into some issues with failing unit tests, which
>>>>> I'm not sure are failing upstream or due to some step I missed in the build.
>>>>> >
>>>>> > The current failing tests (at least so far, since I believe the
>>>>> python script exits on test failure):
>>>>> > ```
>>>>> >
>>>>> ======================================================================
>>>>> > FAIL: test_train_prediction
>>>>> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests)
>>>>> > Test that error on test data improves as model is trained.
>>>>> >
>>>>> ----------------------------------------------------------------------
>>>>> > Traceback (most recent call last):
>>>>> >   File
>>>>> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>>>>> line 474, in test_train_prediction
>>>>> >     eventually(condition, timeout=180.0)
>>>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line
>>>>> 86, in eventually
>>>>> >     lastValue = condition()
>>>>> >   File
>>>>> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>>>>> line 469, in condition
>>>>> >     self.assertGreater(errors[1] - errors[-1], 2)
>>>>> > AssertionError: 1.8960983527735014 not greater than 2
>>>>> >
>>>>> >
>>>>> ======================================================================
>>>>> > FAIL: test_parameter_accuracy
>>>>> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
>>>>> > Test that the final value of weights is close to the desired value.
>>>>> >
>>>>> ----------------------------------------------------------------------
>>>>> > Traceback (most recent call last):
>>>>> >   File
>>>>> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>>>>> line 229, in test_parameter_accuracy
>>>>> >     eventually(condition, timeout=60.0, catch_assertions=True)
>>>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line
>>>>> 91, in eventually
>>>>> >     raise lastValue
>>>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line
>>>>> 82, in eventually
>>>>> >     lastValue = condition()
>>>>> >   File
>>>>> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>>>>> line 226, in condition
>>>>> >     self.assertAlmostEqual(rel, 0.1, 1)
>>>>> > AssertionError: 0.23052813480829393 != 0.1 within 1 places
>>>>> (0.13052813480829392 difference)
>>>>> >
>>>>> >
>>>>> ======================================================================
>>>>> > FAIL: test_training_and_prediction
>>>>> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
>>>>> > Test that the model improves on toy data with no. of batches
>>>>> >
>>>>> ----------------------------------------------------------------------
>>>>> > Traceback (most recent call last):
>>>>> >   File
>>>>> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>>>>> line 334, in test_training_and_prediction
>>>>> >     eventually(condition, timeout=180.0)
>>>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line
>>>>> 93, in eventually
>>>>> >     raise AssertionError(
>>>>> > AssertionError: Test failed due to timeout after 180 sec, with last
>>>>> condition returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74,
>>>>> 0.73, 0.69, 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74, 0.76, 0.78,
>>>>> 0.7, 0.78, 0.8, 0.74, 0.77, 0.75, 0.76, 0.76, 0.75, 0.78, 0.74, 0.64, 0.64,
>>>>> 0.71, 0.78, 0.76, 0.64, 0.68, 0.69, 0.72, 0.77
>>>>> >
>>>>> >
>>>>> ----------------------------------------------------------------------
>>>>> > Ran 13 tests in 661.536s
>>>>> >
>>>>> > FAILED (failures=3, skipped=1)
>>>>> >
>>>>> > Had test failures in pyspark.mllib.tests.test_streaming_algorithms
>>>>> with /usr/local/bin/python3; see logs.
>>>>> > ```
>>>>> >
>>>>> > Here's how I'm currently building Spark, I was using the
>>>>> [building-spark](
>>>>> https://spark.apache.org/docs/3..1/building-spark.html) docs as a
>>>>> reference.
>>>>> > ```
>>>>> > > git clone git@github.com:apache/spark.git
>>>>> > > git checkout -b spark-321 v3.2.1
>>>>> > > ./build/mvn -DskipTests clean package -Phive
>>>>> > > export JAVA_HOME=$(path/to/jdk/11)
>>>>> > > ./python/run-tests
>>>>> > ```
>>>>> >
>>>>> > Current Java version
>>>>> > ```
>>>>> > java -version
>>>>> > openjdk version "11.0.17" 2022-10-18
>>>>> > OpenJDK Runtime Environment Homebrew (build 11.0.17+0)
>>>>> > OpenJDK 64-Bit Server VM Homebrew (build 11.0.17+0, mixed mode)
>>>>> > ```
>>>>> >
>>>>> > Alternatively, I've also tried simply building Spark and using a
>>>>> python=3.9 venv and installing the requirements from `pip install -r
>>>>> dev/requirements.txt` and using that as the interpreter to run tests.
>>>>> However, I was running into some failing pandas test which to me seemed
>>>>> like it was coming from a pandas version difference as `requirements.txt`
>>>>> didn't specify a version.
>>>>> >
>>>>> > I suppose I have a couple of questions in regards to this:
>>>>> > 1. Am I missing a build step to build Spark and run PySpark unit
>>>>> tests?
>>>>> > 2. Where could I find whether an upstream test is failing for a
>>>>> specific release?
>>>>> > 3. Would it be possible to configure the `run-tests` script to run
>>>>> all tests regardless of test failures?
>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>
>>>>>
>>>
>>> --
>>> Bjørn Jørgensen
>>> Vestre Aspehaug 4, 6010 Ålesund
>>> Norge
>>>
>>> +47 480 94 297
>>>
>>>
>>>
>>
>

Re: Building Spark to run PySpark Tests?

Posted by Adam Chhina <am...@gmail.com>.

Hmm, would there be a list of common env issues that would interfere with builds? Looking up the error message, it seemed like often the issue was OOM by the JVM process. I’m not sure if that’s what’s happening here, since during the build and setting up the tests the config should have allocated enough memory?

I’ve been just trying to follow the build docs, and so far I’m running as such:

> git clone --branch v3.2.3 https://github.com/apache/spark.git 
> cd spark
> export MAVEN_OPTS="-Xss64m -Xmx2g -XX:ReservedCodeCacheSize=1g” // was unset, but set to be safe
> export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES // I saw in the developer tools that some pyspark tests were having issues on macOS
> export JAVA_HOME=`/usr/libexec/java_home -v 11`
> ./build/mvn -DskipTests clean package -Phive
> ./python/run-tests --python-executables --testnames ‘pyspark.tests.test_broadcast'

> java -version
openjdk version "11.0.17" 2022-10-18
OpenJDK Runtime Environment Homebrew (build 11.0.17+0)
OpenJDK 64-Bit Server VM Homebrew (build 11.0.17+0, mixed mode)

> OS
Ventura 13.1 (22C65)

Best,

Adam Chhina

> On Jan 18, 2023, at 6:50 PM, Sean Owen <sr...@gmail.com> wrote:
> 
> Release _branches_ are tested as commits arrive to the branch, yes. That's what you see at https://github.com/apache/spark/actions
> Released versions are fixed, they don't change, and were also manually tested before release, so no they are not re-tested; there is no need.
> 
> You presumably have some local env issue, because the source of Spark 3.2.3 was passing CI/CD at time of release as well as manual tests of the PMC.
> 
> 
> On Wed, Jan 18, 2023 at 5:24 PM Adam Chhina <amanschhina@gmail.com <ma...@gmail.com>> wrote:
>> Hi Sean,
>> 
>> That’s fair in regards to 3.3.x being the current release branch. I’m not familiar with the testing schedule, but I had assumed all currently supported release versions would have some nightly/weekly tests ran; is that not the case? I only ask, as when I when I’m seeing these test failures, I assumed these were either known/unknown from some recurring testing pipeline.
>> 
>> Also, unfortunately using v3.2.3 also had the same test failures.
>> 
>> > git clone --branch v3.2.3 https://github.com/apache/spark.git
>> 
>> I’ve posted the traceback below for one of the ran tests. At the end it mentioned to check the logs - `see logs`. However I wasn’t sure whether that just meant the traceback or some more detailed logs elsewhere? I wasn’t able to see any files that looked relevant running `find . -name “*logs*”` afterwards. Sorry if I’m missing something obvious.
>> 
>> ```
>> test_broadcast_no_encryption (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
>> test_broadcast_value_against_gc (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
>> test_broadcast_value_driver_encryption (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
>> test_broadcast_value_driver_no_encryption (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
>> test_broadcast_with_encryption (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
>> 
>> ======================================================================
>> ERROR: test_broadcast_with_encryption (pyspark.tests.test_broadcast.BroadcastTest)
>> ----------------------------------------------------------------------
>> Traceback (most recent call last):
>>   File "$path/spark/python/pyspark/tests/test_broadcast.py", line 67, in test_broadcast_with_encryption
>>     self._test_multiple_broadcasts(("spark.io.encryption.enabled", "true"))
>>   File "$path/spark/python/pyspark/tests/test_broadcast.py", line 58, in _test_multiple_broadcasts
>>     conf = SparkConf()
>>   File "$path/spark/python/pyspark/conf.py", line 120, in __init__
>>     self._jconf = _jvm.SparkConf(loadDefaults)
>>   File "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1709, in __getattr__
>>     answer = self._gateway_client.send_command(
>>   File "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1036, in send_command
>>     connection = self._get_connection()
>>   File "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line 284, in _get_connection
>>     connection = self._create_new_connection()
>>   File "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line 291, in _create_new_connection
>>     connection.connect_to_java_server()
>>   File "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line 438, in connect_to_java_server
>>     self.socket.connect((self.java_address, self.java_port))
>> ConnectionRefusedError: [Errno 61] Connection refused
>> 
>> ----------------------------------------------------------------------
>> Ran 7 tests in 12.950s
>> 
>> FAILED (errors=7)
>> sys:1: ResourceWarning: unclosed file <_io.BufferedWriter name=4>
>> 
>> Had test failures in pyspark.tests.test_broadcast with /usr/local/bin/python3; see logs.
>> ```
>> 
>> Best,
>> 
>> Adam Chhina
>> 
>>> On Jan 18, 2023, at 5:03 PM, Sean Owen <srowen@gmail.com <ma...@gmail.com>> wrote:
>>> 
>>> That isn't the released version either, but rather the head of the 3.2 branch (which is beyond 3.2.3).
>>> You may want to check out the v3.2.3 tag instead: https://github.com/apache/spark/tree/v3.2.3
>>> ... instead of 3.2.1. 
>>> But note of course the 3.3.x is the current release branch anyway.
>>> 
>>> Hard to say what the error is without seeing more of the error log.
>>> 
>>> That final warning is fine, just means you are using Java 11+.
>>> 
>>> 
>>> On Wed, Jan 18, 2023 at 3:59 PM Adam Chhina <amanschhina@gmail.com <ma...@gmail.com>> wrote:
>>>> Oh, whoops, didn’t realize that wasn’t the release version, thanks!
>>>> 
>>>> > git clone --branch branch-3.2 https://github.com/apache/spark.git 
>>>> 
>>>> Ah, so the old failing tests are passing now, but I am seeing failures in `pyspark.tests.test_broadcast` such as  `test_broadcast_value_against_gc`, with a majority of them failing due to `ConnectionRefusedError: [Errno 61] Connection refused`. Maybe these tests are not mean to be ran locally, and only in the pipeline?
>>>> 
>>>> Also, I see this warning that mentions to notify the maintainers here:
>>>> 
>>>> ```
>>>> Starting test(/usr/local/bin/python3): pyspark.tests.test_broadcast
>>>> WARNING: An illegal reflective access operation has occurred
>>>> WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/$path/spark/common/unsafe/target/scala-2.12/classes/) to constructor java.nio.DirectByteBuffer(long,int)
>>>> ```
>>>> 
>>>> FWIW, not sure if this matters, but python executable used for running these tests is `Python 3.10.9` under `/user/local/bin/python3`.
>>>> 
>>>> Best,
>>>> 
>>>> Adam Chhina
>>>> 
>>>>> On Jan 18, 2023, at 3:05 PM, Bjørn Jørgensen <bjornjorgensen@gmail.com <ma...@gmail.com>> wrote:
>>>>> 
>>>>> Replace 
>>>>> > > git clone git@github.com:apache/spark.git
>>>>> > > git checkout -b spark-321 v3.2.1
>>>>> 
>>>>> with 
>>>>> git clone --branch branch-3.2 https://github.com/apache/spark.git    
>>>>> This will give you branch 3.2 as today, what I suppose you call upstream      
>>>>> https://github.com/apache/spark/commits/branch-3.2
>>>>> and right now all tests in github action are passed :) 
>>>>> 
>>>>> 
>>>>> ons. 18. jan. 2023 kl. 18:07 skrev Sean Owen <srowen@gmail.com <ma...@gmail.com>>:
>>>>>> Never seen those, but it's probably a difference in pandas, numpy versions. You can see the current CICD test results in GitHub Actions. But, you want to use release versions, not an RC. 3.2.1 is not the latest version, and it's possible the tests were actually failing in the RC.
>>>>>> 
>>>>>> On Wed, Jan 18, 2023, 10:57 AM Adam Chhina <amanschhina@gmail.com <ma...@gmail.com>> wrote:
>>>>>>> Bump,
>>>>>>> 
>>>>>>> Just trying to see where I can find what tests are known failing for a particular release, to ensure I’m building upstream correctly following the build docs. I figured this would be the best place to ask as it pertains to building and testing upstream (also more than happy to provide a PR for any docs if required afterwards), however if there would be a more appropriate place, please let me know.
>>>>>>> 
>>>>>>> Best,
>>>>>>> 
>>>>>>> Adam Chhina
>>>>>>> 
>>>>>>> > On Dec 27, 2022, at 11:37 AM, Adam Chhina <amanschhina@gmail.com <ma...@gmail.com>> wrote:
>>>>>>> > 
>>>>>>> > As part of an upgrade I was looking to run upstream PySpark unit tests on `v3.2.1-rc2` before I applied some downstream patches and tested those. However, I'm running into some issues with failing unit tests, which I'm not sure are failing upstream or due to some step I missed in the build.
>>>>>>> > 
>>>>>>> > The current failing tests (at least so far, since I believe the python script exits on test failure):
>>>>>>> > ```
>>>>>>> > ======================================================================
>>>>>>> > FAIL: test_train_prediction (pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests)
>>>>>>> > Test that error on test data improves as model is trained.
>>>>>>> > ----------------------------------------------------------------------
>>>>>>> > Traceback (most recent call last):
>>>>>>> >   File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 474, in test_train_prediction
>>>>>>> >     eventually(condition, timeout=180.0)
>>>>>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 86, in eventually
>>>>>>> >     lastValue = condition()
>>>>>>> >   File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 469, in condition
>>>>>>> >     self.assertGreater(errors[1] - errors[-1], 2)
>>>>>>> > AssertionError: 1.8960983527735014 not greater than 2
>>>>>>> > 
>>>>>>> > ======================================================================
>>>>>>> > FAIL: test_parameter_accuracy (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
>>>>>>> > Test that the final value of weights is close to the desired value.
>>>>>>> > ----------------------------------------------------------------------
>>>>>>> > Traceback (most recent call last):
>>>>>>> >   File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 229, in test_parameter_accuracy
>>>>>>> >     eventually(condition, timeout=60.0, catch_assertions=True)
>>>>>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 91, in eventually
>>>>>>> >     raise lastValue
>>>>>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 82, in eventually
>>>>>>> >     lastValue = condition()
>>>>>>> >   File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 226, in condition
>>>>>>> >     self.assertAlmostEqual(rel, 0.1, 1)
>>>>>>> > AssertionError: 0.23052813480829393 != 0.1 within 1 places (0.13052813480829392 difference)
>>>>>>> > 
>>>>>>> > ======================================================================
>>>>>>> > FAIL: test_training_and_prediction (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
>>>>>>> > Test that the model improves on toy data with no. of batches
>>>>>>> > ----------------------------------------------------------------------
>>>>>>> > Traceback (most recent call last):
>>>>>>> >   File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 334, in test_training_and_prediction
>>>>>>> >     eventually(condition, timeout=180.0)
>>>>>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 93, in eventually
>>>>>>> >     raise AssertionError(
>>>>>>> > AssertionError: Test failed due to timeout after 180 sec, with last condition returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74, 0.76, 0.78, 0.7, 0.78, 0.8, 0.74, 0.77, 0.75, 0.76, 0.76, 0.75, 0.78, 0.74, 0.64, 0.64, 0.71, 0.78, 0.76, 0.64, 0.68, 0.69, 0.72, 0.77
>>>>>>> > 
>>>>>>> > ----------------------------------------------------------------------
>>>>>>> > Ran 13 tests in 661.536s
>>>>>>> > 
>>>>>>> > FAILED (failures=3, skipped=1)
>>>>>>> > 
>>>>>>> > Had test failures in pyspark.mllib.tests.test_streaming_algorithms with /usr/local/bin/python3; see logs.
>>>>>>> > ```
>>>>>>> > 
>>>>>>> > Here's how I'm currently building Spark, I was using the [building-spark](https://spark.apache.org/docs/3..1/building-spark.html) docs as a reference.
>>>>>>> > ```
>>>>>>> > > git clone git@github.com:apache/spark.git
>>>>>>> > > git checkout -b spark-321 v3.2.1
>>>>>>> > > ./build/mvn -DskipTests clean package -Phive
>>>>>>> > > export JAVA_HOME=$(path/to/jdk/11)
>>>>>>> > > ./python/run-tests
>>>>>>> > ```
>>>>>>> > 
>>>>>>> > Current Java version
>>>>>>> > ```
>>>>>>> > java -version
>>>>>>> > openjdk version "11.0.17" 2022-10-18
>>>>>>> > OpenJDK Runtime Environment Homebrew (build 11.0.17+0)
>>>>>>> > OpenJDK 64-Bit Server VM Homebrew (build 11.0.17+0, mixed mode)
>>>>>>> > ```
>>>>>>> > 
>>>>>>> > Alternatively, I've also tried simply building Spark and using a python=3.9 venv and installing the requirements from `pip install -r dev/requirements.txt` and using that as the interpreter to run tests. However, I was running into some failing pandas test which to me seemed like it was coming from a pandas version difference as `requirements.txt` didn't specify a version.
>>>>>>> > 
>>>>>>> > I suppose I have a couple of questions in regards to this:
>>>>>>> > 1. Am I missing a build step to build Spark and run PySpark unit tests?
>>>>>>> > 2. Where could I find whether an upstream test is failing for a specific release?
>>>>>>> > 3. Would it be possible to configure the `run-tests` script to run all tests regardless of test failures?
>>>>>>> 
>>>>>>> 
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org <ma...@spark.apache.org>
>>>>>>> 
>>>>> 
>>>>> 
>>>>> -- 
>>>>> Bjørn Jørgensen 
>>>>> Vestre Aspehaug 4, 6010 Ålesund 
>>>>> Norge
>>>>> 
>>>>> +47 480 94 297
>>>> 
>>

Re: Building Spark to run PySpark Tests?

Posted by Sean Owen <sr...@gmail.com>.

Release _branches_ are tested as commits arrive to the branch, yes. That's
what you see at https://github.com/apache/spark/actions
Released versions are fixed, they don't change, and were also manually
tested before release, so no they are not re-tested; there is no need.

You presumably have some local env issue, because the source of Spark 3.2.3
was passing CI/CD at time of release as well as manual tests of the PMC.


On Wed, Jan 18, 2023 at 5:24 PM Adam Chhina <am...@gmail.com> wrote:

> Hi Sean,
>
> That’s fair in regards to 3.3.x being the current release branch. I’m not
> familiar with the testing schedule, but I had assumed all currently
> supported release versions would have some nightly/weekly tests ran; is
> that not the case? I only ask, as when I when I’m seeing these test
> failures, I assumed these were either known/unknown from some recurring
> testing pipeline.
>
> Also, unfortunately using v3.2.3 also had the same test failures.
>
> > git clone --branch v3.2.3 https://github.com/apache/spark.git
>
> I’ve posted the traceback below for one of the ran tests. At the end it
> mentioned to check the logs - `see logs`. However I wasn’t sure whether
> that just meant the traceback or some more detailed logs elsewhere? I
> wasn’t able to see any files that looked relevant running `find . -name
> “*logs*”` afterwards. Sorry if I’m missing something obvious.
>
> ```
> test_broadcast_no_encryption (pyspark.tests.test_broadcast.BroadcastTest)
> ... ERROR
> test_broadcast_value_against_gc
> (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
> test_broadcast_value_driver_encryption
> (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
> test_broadcast_value_driver_no_encryption
> (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
> test_broadcast_with_encryption
> (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
>
> ======================================================================
> ERROR: test_broadcast_with_encryption
> (pyspark.tests.test_broadcast.BroadcastTest)
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "$path/spark/python/pyspark/tests/test_broadcast.py", line 67, in
> test_broadcast_with_encryption
>     self._test_multiple_broadcasts(("spark.io.encryption.enabled", "true"))
>   File "$path/spark/python/pyspark/tests/test_broadcast.py", line 58, in
> _test_multiple_broadcasts
>     conf = SparkConf()
>   File "$path/spark/python/pyspark/conf.py", line 120, in __init__
>     self._jconf = _jvm.SparkConf(loadDefaults)
>   File
> "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line
> 1709, in __getattr__
>     answer = self._gateway_client.send_command(
>   File
> "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line
> 1036, in send_command
>     connection = self._get_connection()
>   File
> "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line
> 284, in _get_connection
>     connection = self._create_new_connection()
>   File
> "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line
> 291, in _create_new_connection
>     connection.connect_to_java_server()
>   File
> "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line
> 438, in connect_to_java_server
>     self.socket.connect((self.java_address, self.java_port))
> ConnectionRefusedError: [Errno 61] Connection refused
>
> ----------------------------------------------------------------------
> Ran 7 tests in 12.950s
>
> FAILED (errors=7)
> sys:1: ResourceWarning: unclosed file <_io.BufferedWriter name=4>
>
> Had test failures in pyspark.tests.test_broadcast with
> /usr/local/bin/python3; see logs.
> ```
>
> Best,
>
> Adam Chhina
>
> On Jan 18, 2023, at 5:03 PM, Sean Owen <sr...@gmail.com> wrote:
>
> That isn't the released version either, but rather the head of the 3.2
> branch (which is beyond 3.2.3).
> You may want to check out the v3.2.3 tag instead:
> https://github.com/apache/spark/tree/v3.2.3
> ... instead of 3.2.1.
> But note of course the 3.3.x is the current release branch anyway.
>
> Hard to say what the error is without seeing more of the error log.
>
> That final warning is fine, just means you are using Java 11+.
>
>
> On Wed, Jan 18, 2023 at 3:59 PM Adam Chhina <am...@gmail.com> wrote:
>
>> Oh, whoops, didn’t realize that wasn’t the release version, thanks!
>>
>> > git clone --branch branch-3.2 https://github.com/apache/spark.git
>>
>> Ah, so the old failing tests are passing now, but I am seeing failures in
>> `pyspark.tests.test_broadcast` such as  `test_broadcast_value_against_gc`,
>> with a majority of them failing due to `ConnectionRefusedError: [Errno
>> 61] Connection refused`. Maybe these tests are not mean to be ran locally,
>> and only in the pipeline?
>>
>> Also, I see this warning that mentions to notify the maintainers here:
>>
>> ```
>> Starting test(/usr/local/bin/python3): pyspark.tests.test_broadcast
>> WARNING: An illegal reflective access operation has occurred
>> WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform
>> (file:/$path/spark/common/unsafe/target/scala-2.12/classes/) to constructor
>> java.nio.DirectByteBuffer(long,int)
>> ```
>>
>> FWIW, not sure if this matters, but python executable used for running
>> these tests is `Python 3.10.9` under `/user/local/bin/python3`.
>>
>> Best,
>>
>> Adam Chhina
>>
>> On Jan 18, 2023, at 3:05 PM, Bjørn Jørgensen <bj...@gmail.com>
>> wrote:
>>
>> Replace
>> > > git clone git@github.com:apache/spark.git
>> > > git checkout -b spark-321 v3.2.1
>>
>> with
>> git clone --branch branch-3.2 https://github.com/apache/spark.git
>> This will give you branch 3.2 as today, what I suppose you call upstream
>>
>> https://github.com/apache/spark/commits/branch-3.2
>> and right now all tests in github action are passed :)
>>
>>
>> ons. 18. jan. 2023 kl. 18:07 skrev Sean Owen <sr...@gmail.com>:
>>
>>> Never seen those, but it's probably a difference in pandas, numpy
>>> versions. You can see the current CICD test results in GitHub Actions. But,
>>> you want to use release versions, not an RC. 3.2.1 is not the latest
>>> version, and it's possible the tests were actually failing in the RC.
>>>
>>> On Wed, Jan 18, 2023, 10:57 AM Adam Chhina <am...@gmail.com>
>>> wrote:
>>>
>>>> Bump,
>>>>
>>>> Just trying to see where I can find what tests are known failing for a
>>>> particular release, to ensure I’m building upstream correctly following the
>>>> build docs. I figured this would be the best place to ask as it pertains to
>>>> building and testing upstream (also more than happy to provide a PR for any
>>>> docs if required afterwards), however if there would be a more appropriate
>>>> place, please let me know.
>>>>
>>>> Best,
>>>>
>>>> Adam Chhina
>>>>
>>>> > On Dec 27, 2022, at 11:37 AM, Adam Chhina <am...@gmail.com>
>>>> wrote:
>>>> >
>>>> > As part of an upgrade I was looking to run upstream PySpark unit
>>>> tests on `v3.2.1-rc2` before I applied some downstream patches and tested
>>>> those. However, I'm running into some issues with failing unit tests, which
>>>> I'm not sure are failing upstream or due to some step I missed in the build.
>>>> >
>>>> > The current failing tests (at least so far, since I believe the
>>>> python script exits on test failure):
>>>> > ```
>>>> > ======================================================================
>>>> > FAIL: test_train_prediction
>>>> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests)
>>>> > Test that error on test data improves as model is trained.
>>>> > ----------------------------------------------------------------------
>>>> > Traceback (most recent call last):
>>>> >   File
>>>> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>>>> line 474, in test_train_prediction
>>>> >     eventually(condition, timeout=180.0)
>>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line
>>>> 86, in eventually
>>>> >     lastValue = condition()
>>>> >   File
>>>> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>>>> line 469, in condition
>>>> >     self.assertGreater(errors[1] - errors[-1], 2)
>>>> > AssertionError: 1.8960983527735014 not greater than 2
>>>> >
>>>> > ======================================================================
>>>> > FAIL: test_parameter_accuracy
>>>> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
>>>> > Test that the final value of weights is close to the desired value.
>>>> > ----------------------------------------------------------------------
>>>> > Traceback (most recent call last):
>>>> >   File
>>>> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>>>> line 229, in test_parameter_accuracy
>>>> >     eventually(condition, timeout=60.0, catch_assertions=True)
>>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line
>>>> 91, in eventually
>>>> >     raise lastValue
>>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line
>>>> 82, in eventually
>>>> >     lastValue = condition()
>>>> >   File
>>>> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>>>> line 226, in condition
>>>> >     self.assertAlmostEqual(rel, 0.1, 1)
>>>> > AssertionError: 0.23052813480829393 != 0.1 within 1 places
>>>> (0.13052813480829392 difference)
>>>> >
>>>> > ======================================================================
>>>> > FAIL: test_training_and_prediction
>>>> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
>>>> > Test that the model improves on toy data with no. of batches
>>>> > ----------------------------------------------------------------------
>>>> > Traceback (most recent call last):
>>>> >   File
>>>> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>>>> line 334, in test_training_and_prediction
>>>> >     eventually(condition, timeout=180.0)
>>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line
>>>> 93, in eventually
>>>> >     raise AssertionError(
>>>> > AssertionError: Test failed due to timeout after 180 sec, with last
>>>> condition returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74,
>>>> 0.73, 0.69, 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74, 0.76, 0.78,
>>>> 0.7, 0.78, 0.8, 0.74, 0.77, 0.75, 0.76, 0.76, 0.75, 0.78, 0.74, 0.64, 0.64,
>>>> 0.71, 0.78, 0.76, 0.64, 0.68, 0.69, 0.72, 0.77
>>>> >
>>>> > ----------------------------------------------------------------------
>>>> > Ran 13 tests in 661.536s
>>>> >
>>>> > FAILED (failures=3, skipped=1)
>>>> >
>>>> > Had test failures in pyspark.mllib.tests.test_streaming_algorithms
>>>> with /usr/local/bin/python3; see logs.
>>>> > ```
>>>> >
>>>> > Here's how I'm currently building Spark, I was using the
>>>> [building-spark](https://spark.apache.org/docs/3..1/building-spark.html)
>>>> docs as a reference.
>>>> > ```
>>>> > > git clone git@github.com:apache/spark.git
>>>> > > git checkout -b spark-321 v3.2.1
>>>> > > ./build/mvn -DskipTests clean package -Phive
>>>> > > export JAVA_HOME=$(path/to/jdk/11)
>>>> > > ./python/run-tests
>>>> > ```
>>>> >
>>>> > Current Java version
>>>> > ```
>>>> > java -version
>>>> > openjdk version "11.0.17" 2022-10-18
>>>> > OpenJDK Runtime Environment Homebrew (build 11.0.17+0)
>>>> > OpenJDK 64-Bit Server VM Homebrew (build 11.0.17+0, mixed mode)
>>>> > ```
>>>> >
>>>> > Alternatively, I've also tried simply building Spark and using a
>>>> python=3.9 venv and installing the requirements from `pip install -r
>>>> dev/requirements.txt` and using that as the interpreter to run tests.
>>>> However, I was running into some failing pandas test which to me seemed
>>>> like it was coming from a pandas version difference as `requirements.txt`
>>>> didn't specify a version.
>>>> >
>>>> > I suppose I have a couple of questions in regards to this:
>>>> > 1. Am I missing a build step to build Spark and run PySpark unit
>>>> tests?
>>>> > 2. Where could I find whether an upstream test is failing for a
>>>> specific release?
>>>> > 3. Would it be possible to configure the `run-tests` script to run
>>>> all tests regardless of test failures?
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>
>>>>
>>
>> --
>> Bjørn Jørgensen
>> Vestre Aspehaug 4, 6010 Ålesund
>> Norge
>>
>> +47 480 94 297
>>
>>
>>
>

Re: Building Spark to run PySpark Tests?

Posted by Adam Chhina <am...@gmail.com>.

Hi Sean,

That’s fair in regards to 3.3.x being the current release branch. I’m not familiar with the testing schedule, but I had assumed all currently supported release versions would have some nightly/weekly tests ran; is that not the case? I only ask, as when I when I’m seeing these test failures, I assumed these were either known/unknown from some recurring testing pipeline.

Also, unfortunately using v3.2.3 also had the same test failures.

> git clone --branch v3.2.3 https://github.com/apache/spark.git

I’ve posted the traceback below for one of the ran tests. At the end it mentioned to check the logs - `see logs`. However I wasn’t sure whether that just meant the traceback or some more detailed logs elsewhere? I wasn’t able to see any files that looked relevant running `find . -name “*logs*”` afterwards. Sorry if I’m missing something obvious.

```
test_broadcast_no_encryption (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
test_broadcast_value_against_gc (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
test_broadcast_value_driver_encryption (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
test_broadcast_value_driver_no_encryption (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
test_broadcast_with_encryption (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR

======================================================================
ERROR: test_broadcast_with_encryption (pyspark.tests.test_broadcast.BroadcastTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "$path/spark/python/pyspark/tests/test_broadcast.py", line 67, in test_broadcast_with_encryption
    self._test_multiple_broadcasts(("spark.io.encryption.enabled", "true"))
  File "$path/spark/python/pyspark/tests/test_broadcast.py", line 58, in _test_multiple_broadcasts
    conf = SparkConf()
  File "$path/spark/python/pyspark/conf.py", line 120, in __init__
    self._jconf = _jvm.SparkConf(loadDefaults)
  File "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1709, in __getattr__
    answer = self._gateway_client.send_command(
  File "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1036, in send_command
    connection = self._get_connection()
  File "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line 284, in _get_connection
    connection = self._create_new_connection()
  File "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line 291, in _create_new_connection
    connection.connect_to_java_server()
  File "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line 438, in connect_to_java_server
    self.socket.connect((self.java_address, self.java_port))
ConnectionRefusedError: [Errno 61] Connection refused

----------------------------------------------------------------------
Ran 7 tests in 12.950s

FAILED (errors=7)
sys:1: ResourceWarning: unclosed file <_io.BufferedWriter name=4>

Had test failures in pyspark.tests.test_broadcast with /usr/local/bin/python3; see logs.
```

Best,

Adam Chhina

> On Jan 18, 2023, at 5:03 PM, Sean Owen <sr...@gmail.com> wrote:
> 
> That isn't the released version either, but rather the head of the 3.2 branch (which is beyond 3.2.3).
> You may want to check out the v3.2.3 tag instead: https://github.com/apache/spark/tree/v3.2.3
> ... instead of 3.2.1. 
> But note of course the 3.3.x is the current release branch anyway.
> 
> Hard to say what the error is without seeing more of the error log.
> 
> That final warning is fine, just means you are using Java 11+.
> 
> 
> On Wed, Jan 18, 2023 at 3:59 PM Adam Chhina <amanschhina@gmail.com <ma...@gmail.com>> wrote:
>> Oh, whoops, didn’t realize that wasn’t the release version, thanks!
>> 
>> > git clone --branch branch-3.2 https://github.com/apache/spark.git 
>> 
>> Ah, so the old failing tests are passing now, but I am seeing failures in `pyspark.tests.test_broadcast` such as  `test_broadcast_value_against_gc`, with a majority of them failing due to `ConnectionRefusedError: [Errno 61] Connection refused`. Maybe these tests are not mean to be ran locally, and only in the pipeline?
>> 
>> Also, I see this warning that mentions to notify the maintainers here:
>> 
>> ```
>> Starting test(/usr/local/bin/python3): pyspark.tests.test_broadcast
>> WARNING: An illegal reflective access operation has occurred
>> WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/$path/spark/common/unsafe/target/scala-2.12/classes/) to constructor java.nio.DirectByteBuffer(long,int)
>> ```
>> 
>> FWIW, not sure if this matters, but python executable used for running these tests is `Python 3.10.9` under `/user/local/bin/python3`.
>> 
>> Best,
>> 
>> Adam Chhina
>> 
>>> On Jan 18, 2023, at 3:05 PM, Bjørn Jørgensen <bjornjorgensen@gmail.com <ma...@gmail.com>> wrote:
>>> 
>>> Replace 
>>> > > git clone git@github.com:apache/spark.git
>>> > > git checkout -b spark-321 v3.2.1
>>> 
>>> with 
>>> git clone --branch branch-3.2 https://github.com/apache/spark.git    
>>> This will give you branch 3.2 as today, what I suppose you call upstream      
>>> https://github.com/apache/spark/commits/branch-3.2
>>> and right now all tests in github action are passed :) 
>>> 
>>> 
>>> ons. 18. jan. 2023 kl. 18:07 skrev Sean Owen <srowen@gmail.com <ma...@gmail.com>>:
>>>> Never seen those, but it's probably a difference in pandas, numpy versions. You can see the current CICD test results in GitHub Actions. But, you want to use release versions, not an RC. 3.2.1 is not the latest version, and it's possible the tests were actually failing in the RC.
>>>> 
>>>> On Wed, Jan 18, 2023, 10:57 AM Adam Chhina <amanschhina@gmail.com <ma...@gmail.com>> wrote:
>>>>> Bump,
>>>>> 
>>>>> Just trying to see where I can find what tests are known failing for a particular release, to ensure I’m building upstream correctly following the build docs. I figured this would be the best place to ask as it pertains to building and testing upstream (also more than happy to provide a PR for any docs if required afterwards), however if there would be a more appropriate place, please let me know.
>>>>> 
>>>>> Best,
>>>>> 
>>>>> Adam Chhina
>>>>> 
>>>>> > On Dec 27, 2022, at 11:37 AM, Adam Chhina <amanschhina@gmail.com <ma...@gmail.com>> wrote:
>>>>> > 
>>>>> > As part of an upgrade I was looking to run upstream PySpark unit tests on `v3.2.1-rc2` before I applied some downstream patches and tested those. However, I'm running into some issues with failing unit tests, which I'm not sure are failing upstream or due to some step I missed in the build.
>>>>> > 
>>>>> > The current failing tests (at least so far, since I believe the python script exits on test failure):
>>>>> > ```
>>>>> > ======================================================================
>>>>> > FAIL: test_train_prediction (pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests)
>>>>> > Test that error on test data improves as model is trained.
>>>>> > ----------------------------------------------------------------------
>>>>> > Traceback (most recent call last):
>>>>> >   File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 474, in test_train_prediction
>>>>> >     eventually(condition, timeout=180.0)
>>>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 86, in eventually
>>>>> >     lastValue = condition()
>>>>> >   File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 469, in condition
>>>>> >     self.assertGreater(errors[1] - errors[-1], 2)
>>>>> > AssertionError: 1.8960983527735014 not greater than 2
>>>>> > 
>>>>> > ======================================================================
>>>>> > FAIL: test_parameter_accuracy (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
>>>>> > Test that the final value of weights is close to the desired value.
>>>>> > ----------------------------------------------------------------------
>>>>> > Traceback (most recent call last):
>>>>> >   File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 229, in test_parameter_accuracy
>>>>> >     eventually(condition, timeout=60.0, catch_assertions=True)
>>>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 91, in eventually
>>>>> >     raise lastValue
>>>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 82, in eventually
>>>>> >     lastValue = condition()
>>>>> >   File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 226, in condition
>>>>> >     self.assertAlmostEqual(rel, 0.1, 1)
>>>>> > AssertionError: 0.23052813480829393 != 0.1 within 1 places (0.13052813480829392 difference)
>>>>> > 
>>>>> > ======================================================================
>>>>> > FAIL: test_training_and_prediction (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
>>>>> > Test that the model improves on toy data with no. of batches
>>>>> > ----------------------------------------------------------------------
>>>>> > Traceback (most recent call last):
>>>>> >   File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 334, in test_training_and_prediction
>>>>> >     eventually(condition, timeout=180.0)
>>>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 93, in eventually
>>>>> >     raise AssertionError(
>>>>> > AssertionError: Test failed due to timeout after 180 sec, with last condition returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74, 0.76, 0.78, 0.7, 0.78, 0.8, 0.74, 0.77, 0.75, 0.76, 0.76, 0.75, 0.78, 0.74, 0.64, 0.64, 0.71, 0.78, 0.76, 0.64, 0.68, 0.69, 0.72, 0.77
>>>>> > 
>>>>> > ----------------------------------------------------------------------
>>>>> > Ran 13 tests in 661.536s
>>>>> > 
>>>>> > FAILED (failures=3, skipped=1)
>>>>> > 
>>>>> > Had test failures in pyspark.mllib.tests.test_streaming_algorithms with /usr/local/bin/python3; see logs.
>>>>> > ```
>>>>> > 
>>>>> > Here's how I'm currently building Spark, I was using the [building-spark](https://spark.apache.org/docs/3..1/building-spark.html) docs as a reference.
>>>>> > ```
>>>>> > > git clone git@github.com:apache/spark.git
>>>>> > > git checkout -b spark-321 v3.2.1
>>>>> > > ./build/mvn -DskipTests clean package -Phive
>>>>> > > export JAVA_HOME=$(path/to/jdk/11)
>>>>> > > ./python/run-tests
>>>>> > ```
>>>>> > 
>>>>> > Current Java version
>>>>> > ```
>>>>> > java -version
>>>>> > openjdk version "11.0.17" 2022-10-18
>>>>> > OpenJDK Runtime Environment Homebrew (build 11.0.17+0)
>>>>> > OpenJDK 64-Bit Server VM Homebrew (build 11.0.17+0, mixed mode)
>>>>> > ```
>>>>> > 
>>>>> > Alternatively, I've also tried simply building Spark and using a python=3.9 venv and installing the requirements from `pip install -r dev/requirements.txt` and using that as the interpreter to run tests. However, I was running into some failing pandas test which to me seemed like it was coming from a pandas version difference as `requirements.txt` didn't specify a version.
>>>>> > 
>>>>> > I suppose I have a couple of questions in regards to this:
>>>>> > 1. Am I missing a build step to build Spark and run PySpark unit tests?
>>>>> > 2. Where could I find whether an upstream test is failing for a specific release?
>>>>> > 3. Would it be possible to configure the `run-tests` script to run all tests regardless of test failures?
>>>>> 
>>>>> 
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org <ma...@spark.apache.org>
>>>>> 
>>> 
>>> 
>>> -- 
>>> Bjørn Jørgensen 
>>> Vestre Aspehaug 4, 6010 Ålesund 
>>> Norge
>>> 
>>> +47 480 94 297
>>

Re: Building Spark to run PySpark Tests?

Posted by Sean Owen <sr...@gmail.com>.

That isn't the released version either, but rather the head of the 3.2
branch (which is beyond 3.2.3).
You may want to check out the v3.2.3 tag instead:
https://github.com/apache/spark/tree/v3.2.3
... instead of 3.2.1.
But note of course the 3.3.x is the current release branch anyway.

Hard to say what the error is without seeing more of the error log.

That final warning is fine, just means you are using Java 11+.


On Wed, Jan 18, 2023 at 3:59 PM Adam Chhina <am...@gmail.com> wrote:

> Oh, whoops, didn’t realize that wasn’t the release version, thanks!
>
> > git clone --branch branch-3.2 https://github.com/apache/spark.git
>
> Ah, so the old failing tests are passing now, but I am seeing failures in `pyspark.tests.test_broadcast`
> such as  `test_broadcast_value_against_gc`, with a majority of them
> failing due to `ConnectionRefusedError: [Errno 61] Connection refused`.
> Maybe these tests are not mean to be ran locally, and only in the pipeline?
>
> Also, I see this warning that mentions to notify the maintainers here:
>
> ```
>
> Starting test(/usr/local/bin/python3): pyspark.tests.test_broadcast
>
> WARNING: An illegal reflective access operation has occurred
>
> WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform
> (file:/$path/spark/common/unsafe/target/scala-2.12/classes/) to constructor
> java.nio.DirectByteBuffer(long,int)
> ```
>
> FWIW, not sure if this matters, but python executable used for running
> these tests is `Python 3.10.9` under `/user/local/bin/python3`.
>
> Best,
>
> Adam Chhina
>
> On Jan 18, 2023, at 3:05 PM, Bjørn Jørgensen <bj...@gmail.com>
> wrote:
>
> Replace
> > > git clone git@github.com:apache/spark.git
> > > git checkout -b spark-321 v3.2.1
>
> with
> git clone --branch branch-3.2 https://github.com/apache/spark.git
> This will give you branch 3.2 as today, what I suppose you call upstream
>
> https://github.com/apache/spark/commits/branch-3.2
> and right now all tests in github action are passed :)
>
>
> ons. 18. jan. 2023 kl. 18:07 skrev Sean Owen <sr...@gmail.com>:
>
>> Never seen those, but it's probably a difference in pandas, numpy
>> versions. You can see the current CICD test results in GitHub Actions. But,
>> you want to use release versions, not an RC. 3.2.1 is not the latest
>> version, and it's possible the tests were actually failing in the RC.
>>
>> On Wed, Jan 18, 2023, 10:57 AM Adam Chhina <am...@gmail.com> wrote:
>>
>>> Bump,
>>>
>>> Just trying to see where I can find what tests are known failing for a
>>> particular release, to ensure I’m building upstream correctly following the
>>> build docs. I figured this would be the best place to ask as it pertains to
>>> building and testing upstream (also more than happy to provide a PR for any
>>> docs if required afterwards), however if there would be a more appropriate
>>> place, please let me know.
>>>
>>> Best,
>>>
>>> Adam Chhina
>>>
>>> > On Dec 27, 2022, at 11:37 AM, Adam Chhina <am...@gmail.com>
>>> wrote:
>>> >
>>> > As part of an upgrade I was looking to run upstream PySpark unit tests
>>> on `v3.2.1-rc2` before I applied some downstream patches and tested those.
>>> However, I'm running into some issues with failing unit tests, which I'm
>>> not sure are failing upstream or due to some step I missed in the build.
>>> >
>>> > The current failing tests (at least so far, since I believe the python
>>> script exits on test failure):
>>> > ```
>>> > ======================================================================
>>> > FAIL: test_train_prediction
>>> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests)
>>> > Test that error on test data improves as model is trained.
>>> > ----------------------------------------------------------------------
>>> > Traceback (most recent call last):
>>> >   File
>>> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>>> line 474, in test_train_prediction
>>> >     eventually(condition, timeout=180.0)
>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line
>>> 86, in eventually
>>> >     lastValue = condition()
>>> >   File
>>> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>>> line 469, in condition
>>> >     self.assertGreater(errors[1] - errors[-1], 2)
>>> > AssertionError: 1.8960983527735014 not greater than 2
>>> >
>>> > ======================================================================
>>> > FAIL: test_parameter_accuracy
>>> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
>>> > Test that the final value of weights is close to the desired value.
>>> > ----------------------------------------------------------------------
>>> > Traceback (most recent call last):
>>> >   File
>>> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>>> line 229, in test_parameter_accuracy
>>> >     eventually(condition, timeout=60.0, catch_assertions=True)
>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line
>>> 91, in eventually
>>> >     raise lastValue
>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line
>>> 82, in eventually
>>> >     lastValue = condition()
>>> >   File
>>> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>>> line 226, in condition
>>> >     self.assertAlmostEqual(rel, 0.1, 1)
>>> > AssertionError: 0.23052813480829393 != 0.1 within 1 places
>>> (0.13052813480829392 difference)
>>> >
>>> > ======================================================================
>>> > FAIL: test_training_and_prediction
>>> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
>>> > Test that the model improves on toy data with no. of batches
>>> > ----------------------------------------------------------------------
>>> > Traceback (most recent call last):
>>> >   File
>>> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>>> line 334, in test_training_and_prediction
>>> >     eventually(condition, timeout=180.0)
>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line
>>> 93, in eventually
>>> >     raise AssertionError(
>>> > AssertionError: Test failed due to timeout after 180 sec, with last
>>> condition returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74,
>>> 0.73, 0.69, 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74, 0.76, 0.78,
>>> 0.7, 0.78, 0.8, 0.74, 0.77, 0.75, 0.76, 0.76, 0.75, 0.78, 0.74, 0.64, 0.64,
>>> 0.71, 0.78, 0.76, 0.64, 0.68, 0.69, 0.72, 0.77
>>> >
>>> > ----------------------------------------------------------------------
>>> > Ran 13 tests in 661.536s
>>> >
>>> > FAILED (failures=3, skipped=1)
>>> >
>>> > Had test failures in pyspark.mllib.tests.test_streaming_algorithms
>>> with /usr/local/bin/python3; see logs.
>>> > ```
>>> >
>>> > Here's how I'm currently building Spark, I was using the
>>> [building-spark](https://spark.apache.org/docs/3..1/building-spark.html)
>>> docs as a reference.
>>> > ```
>>> > > git clone git@github.com:apache/spark.git
>>> > > git checkout -b spark-321 v3.2.1
>>> > > ./build/mvn -DskipTests clean package -Phive
>>> > > export JAVA_HOME=$(path/to/jdk/11)
>>> > > ./python/run-tests
>>> > ```
>>> >
>>> > Current Java version
>>> > ```
>>> > java -version
>>> > openjdk version "11.0.17" 2022-10-18
>>> > OpenJDK Runtime Environment Homebrew (build 11.0.17+0)
>>> > OpenJDK 64-Bit Server VM Homebrew (build 11.0.17+0, mixed mode)
>>> > ```
>>> >
>>> > Alternatively, I've also tried simply building Spark and using a
>>> python=3.9 venv and installing the requirements from `pip install -r
>>> dev/requirements.txt` and using that as the interpreter to run tests.
>>> However, I was running into some failing pandas test which to me seemed
>>> like it was coming from a pandas version difference as `requirements.txt`
>>> didn't specify a version.
>>> >
>>> > I suppose I have a couple of questions in regards to this:
>>> > 1. Am I missing a build step to build Spark and run PySpark unit tests?
>>> > 2. Where could I find whether an upstream test is failing for a
>>> specific release?
>>> > 3. Would it be possible to configure the `run-tests` script to run all
>>> tests regardless of test failures?
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>
>>>
>
> --
> Bjørn Jørgensen
> Vestre Aspehaug 4, 6010 Ålesund
> Norge
>
> +47 480 94 297
>
>
>

Re: Building Spark to run PySpark Tests?

Posted by Adam Chhina <am...@gmail.com>.

Oh, whoops, didn’t realize that wasn’t the release version, thanks!

> git clone --branch branch-3.2 https://github.com/apache/spark.git 

Ah, so the old failing tests are passing now, but I am seeing failures in `pyspark.tests.test_broadcast` such as  `test_broadcast_value_against_gc`, with a majority of them failing due to `ConnectionRefusedError: [Errno 61] Connection refused`. Maybe these tests are not mean to be ran locally, and only in the pipeline?

Also, I see this warning that mentions to notify the maintainers here:

```
Starting test(/usr/local/bin/python3): pyspark.tests.test_broadcast
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/$path/spark/common/unsafe/target/scala-2.12/classes/) to constructor java.nio.DirectByteBuffer(long,int)
```

FWIW, not sure if this matters, but python executable used for running these tests is `Python 3.10.9` under `/user/local/bin/python3`.

Best,

Adam Chhina

> On Jan 18, 2023, at 3:05 PM, Bjørn Jørgensen <bj...@gmail.com> wrote:
> 
> Replace 
> > > git clone git@github.com:apache/spark.git
> > > git checkout -b spark-321 v3.2.1
> 
> with 
> git clone --branch branch-3.2 https://github.com/apache/spark.git    
> This will give you branch 3.2 as today, what I suppose you call upstream      
> https://github.com/apache/spark/commits/branch-3.2
> and right now all tests in github action are passed :) 
> 
> 
> ons. 18. jan. 2023 kl. 18:07 skrev Sean Owen <srowen@gmail.com <ma...@gmail.com>>:
>> Never seen those, but it's probably a difference in pandas, numpy versions. You can see the current CICD test results in GitHub Actions. But, you want to use release versions, not an RC. 3.2.1 is not the latest version, and it's possible the tests were actually failing in the RC.
>> 
>> On Wed, Jan 18, 2023, 10:57 AM Adam Chhina <amanschhina@gmail.com <ma...@gmail.com>> wrote:
>>> Bump,
>>> 
>>> Just trying to see where I can find what tests are known failing for a particular release, to ensure I’m building upstream correctly following the build docs. I figured this would be the best place to ask as it pertains to building and testing upstream (also more than happy to provide a PR for any docs if required afterwards), however if there would be a more appropriate place, please let me know.
>>> 
>>> Best,
>>> 
>>> Adam Chhina
>>> 
>>> > On Dec 27, 2022, at 11:37 AM, Adam Chhina <amanschhina@gmail.com <ma...@gmail.com>> wrote:
>>> > 
>>> > As part of an upgrade I was looking to run upstream PySpark unit tests on `v3.2.1-rc2` before I applied some downstream patches and tested those. However, I'm running into some issues with failing unit tests, which I'm not sure are failing upstream or due to some step I missed in the build.
>>> > 
>>> > The current failing tests (at least so far, since I believe the python script exits on test failure):
>>> > ```
>>> > ======================================================================
>>> > FAIL: test_train_prediction (pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests)
>>> > Test that error on test data improves as model is trained.
>>> > ----------------------------------------------------------------------
>>> > Traceback (most recent call last):
>>> >   File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 474, in test_train_prediction
>>> >     eventually(condition, timeout=180.0)
>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 86, in eventually
>>> >     lastValue = condition()
>>> >   File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 469, in condition
>>> >     self.assertGreater(errors[1] - errors[-1], 2)
>>> > AssertionError: 1.8960983527735014 not greater than 2
>>> > 
>>> > ======================================================================
>>> > FAIL: test_parameter_accuracy (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
>>> > Test that the final value of weights is close to the desired value.
>>> > ----------------------------------------------------------------------
>>> > Traceback (most recent call last):
>>> >   File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 229, in test_parameter_accuracy
>>> >     eventually(condition, timeout=60.0, catch_assertions=True)
>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 91, in eventually
>>> >     raise lastValue
>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 82, in eventually
>>> >     lastValue = condition()
>>> >   File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 226, in condition
>>> >     self.assertAlmostEqual(rel, 0.1, 1)
>>> > AssertionError: 0.23052813480829393 != 0.1 within 1 places (0.13052813480829392 difference)
>>> > 
>>> > ======================================================================
>>> > FAIL: test_training_and_prediction (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
>>> > Test that the model improves on toy data with no. of batches
>>> > ----------------------------------------------------------------------
>>> > Traceback (most recent call last):
>>> >   File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 334, in test_training_and_prediction
>>> >     eventually(condition, timeout=180.0)
>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 93, in eventually
>>> >     raise AssertionError(
>>> > AssertionError: Test failed due to timeout after 180 sec, with last condition returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74, 0.76, 0.78, 0.7, 0.78, 0.8, 0.74, 0.77, 0.75, 0.76, 0.76, 0.75, 0.78, 0.74, 0.64, 0.64, 0.71, 0.78, 0.76, 0.64, 0.68, 0.69, 0.72, 0.77
>>> > 
>>> > ----------------------------------------------------------------------
>>> > Ran 13 tests in 661.536s
>>> > 
>>> > FAILED (failures=3, skipped=1)
>>> > 
>>> > Had test failures in pyspark.mllib.tests.test_streaming_algorithms with /usr/local/bin/python3; see logs.
>>> > ```
>>> > 
>>> > Here's how I'm currently building Spark, I was using the [building-spark](https://spark.apache.org/docs/3..1/building-spark.html) docs as a reference.
>>> > ```
>>> > > git clone git@github.com:apache/spark.git
>>> > > git checkout -b spark-321 v3.2.1
>>> > > ./build/mvn -DskipTests clean package -Phive
>>> > > export JAVA_HOME=$(path/to/jdk/11)
>>> > > ./python/run-tests
>>> > ```
>>> > 
>>> > Current Java version
>>> > ```
>>> > java -version
>>> > openjdk version "11.0.17" 2022-10-18
>>> > OpenJDK Runtime Environment Homebrew (build 11.0.17+0)
>>> > OpenJDK 64-Bit Server VM Homebrew (build 11.0.17+0, mixed mode)
>>> > ```
>>> > 
>>> > Alternatively, I've also tried simply building Spark and using a python=3.9 venv and installing the requirements from `pip install -r dev/requirements.txt` and using that as the interpreter to run tests. However, I was running into some failing pandas test which to me seemed like it was coming from a pandas version difference as `requirements.txt` didn't specify a version.
>>> > 
>>> > I suppose I have a couple of questions in regards to this:
>>> > 1. Am I missing a build step to build Spark and run PySpark unit tests?
>>> > 2. Where could I find whether an upstream test is failing for a specific release?
>>> > 3. Would it be possible to configure the `run-tests` script to run all tests regardless of test failures?
>>> 
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org <ma...@spark.apache.org>
>>> 
> 
> 
> -- 
> Bjørn Jørgensen 
> Vestre Aspehaug 4, 6010 Ålesund 
> Norge
> 
> +47 480 94 297

Re: Building Spark to run PySpark Tests?

Posted by Bjørn Jørgensen <bj...@gmail.com>.

Replace
> > git clone git@github.com:apache/spark.git
> > git checkout -b spark-321 v3.2.1

with
git clone --branch branch-3.2 https://github.com/apache/spark.git
This will give you branch 3.2 as today, what I suppose you call upstream

https://github.com/apache/spark/commits/branch-3.2
and right now all tests in github action are passed :)


ons. 18. jan. 2023 kl. 18:07 skrev Sean Owen <sr...@gmail.com>:

> Never seen those, but it's probably a difference in pandas, numpy
> versions. You can see the current CICD test results in GitHub Actions. But,
> you want to use release versions, not an RC. 3.2.1 is not the latest
> version, and it's possible the tests were actually failing in the RC.
>
> On Wed, Jan 18, 2023, 10:57 AM Adam Chhina <am...@gmail.com> wrote:
>
>> Bump,
>>
>> Just trying to see where I can find what tests are known failing for a
>> particular release, to ensure I’m building upstream correctly following the
>> build docs. I figured this would be the best place to ask as it pertains to
>> building and testing upstream (also more than happy to provide a PR for any
>> docs if required afterwards), however if there would be a more appropriate
>> place, please let me know.
>>
>> Best,
>>
>> Adam Chhina
>>
>> > On Dec 27, 2022, at 11:37 AM, Adam Chhina <am...@gmail.com>
>> wrote:
>> >
>> > As part of an upgrade I was looking to run upstream PySpark unit tests
>> on `v3.2.1-rc2` before I applied some downstream patches and tested those.
>> However, I'm running into some issues with failing unit tests, which I'm
>> not sure are failing upstream or due to some step I missed in the build.
>> >
>> > The current failing tests (at least so far, since I believe the python
>> script exits on test failure):
>> > ```
>> > ======================================================================
>> > FAIL: test_train_prediction
>> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests)
>> > Test that error on test data improves as model is trained.
>> > ----------------------------------------------------------------------
>> > Traceback (most recent call last):
>> >   File
>> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>> line 474, in test_train_prediction
>> >     eventually(condition, timeout=180.0)
>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line
>> 86, in eventually
>> >     lastValue = condition()
>> >   File
>> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>> line 469, in condition
>> >     self.assertGreater(errors[1] - errors[-1], 2)
>> > AssertionError: 1.8960983527735014 not greater than 2
>> >
>> > ======================================================================
>> > FAIL: test_parameter_accuracy
>> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
>> > Test that the final value of weights is close to the desired value.
>> > ----------------------------------------------------------------------
>> > Traceback (most recent call last):
>> >   File
>> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>> line 229, in test_parameter_accuracy
>> >     eventually(condition, timeout=60.0, catch_assertions=True)
>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line
>> 91, in eventually
>> >     raise lastValue
>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line
>> 82, in eventually
>> >     lastValue = condition()
>> >   File
>> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>> line 226, in condition
>> >     self.assertAlmostEqual(rel, 0.1, 1)
>> > AssertionError: 0.23052813480829393 != 0.1 within 1 places
>> (0.13052813480829392 difference)
>> >
>> > ======================================================================
>> > FAIL: test_training_and_prediction
>> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
>> > Test that the model improves on toy data with no. of batches
>> > ----------------------------------------------------------------------
>> > Traceback (most recent call last):
>> >   File
>> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>> line 334, in test_training_and_prediction
>> >     eventually(condition, timeout=180.0)
>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line
>> 93, in eventually
>> >     raise AssertionError(
>> > AssertionError: Test failed due to timeout after 180 sec, with last
>> condition returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74,
>> 0.73, 0.69, 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74, 0.76, 0.78,
>> 0.7, 0.78, 0.8, 0.74, 0.77, 0.75, 0.76, 0.76, 0.75, 0.78, 0.74, 0.64, 0.64,
>> 0.71, 0.78, 0.76, 0.64, 0.68, 0.69, 0.72, 0.77
>> >
>> > ----------------------------------------------------------------------
>> > Ran 13 tests in 661.536s
>> >
>> > FAILED (failures=3, skipped=1)
>> >
>> > Had test failures in pyspark.mllib.tests.test_streaming_algorithms with
>> /usr/local/bin/python3; see logs.
>> > ```
>> >
>> > Here's how I'm currently building Spark, I was using the
>> [building-spark](https://spark.apache.org/docs/3..1/building-spark.html)
>> docs as a reference.
>> > ```
>> > > git clone git@github.com:apache/spark.git
>> > > git checkout -b spark-321 v3.2.1
>> > > ./build/mvn -DskipTests clean package -Phive
>> > > export JAVA_HOME=$(path/to/jdk/11)
>> > > ./python/run-tests
>> > ```
>> >
>> > Current Java version
>> > ```
>> > java -version
>> > openjdk version "11.0.17" 2022-10-18
>> > OpenJDK Runtime Environment Homebrew (build 11.0.17+0)
>> > OpenJDK 64-Bit Server VM Homebrew (build 11.0.17+0, mixed mode)
>> > ```
>> >
>> > Alternatively, I've also tried simply building Spark and using a
>> python=3.9 venv and installing the requirements from `pip install -r
>> dev/requirements.txt` and using that as the interpreter to run tests.
>> However, I was running into some failing pandas test which to me seemed
>> like it was coming from a pandas version difference as `requirements.txt`
>> didn't specify a version.
>> >
>> > I suppose I have a couple of questions in regards to this:
>> > 1. Am I missing a build step to build Spark and run PySpark unit tests?
>> > 2. Where could I find whether an upstream test is failing for a
>> specific release?
>> > 3. Would it be possible to configure the `run-tests` script to run all
>> tests regardless of test failures?
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>
>>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297

Re: Building Spark to run PySpark Tests?

Posted by Sean Owen <sr...@gmail.com>.

Never seen those, but it's probably a difference in pandas, numpy versions.
You can see the current CICD test results in GitHub Actions. But, you want
to use release versions, not an RC. 3.2.1 is not the latest version, and
it's possible the tests were actually failing in the RC.

On Wed, Jan 18, 2023, 10:57 AM Adam Chhina <am...@gmail.com> wrote:

> Bump,
>
> Just trying to see where I can find what tests are known failing for a
> particular release, to ensure I’m building upstream correctly following the
> build docs. I figured this would be the best place to ask as it pertains to
> building and testing upstream (also more than happy to provide a PR for any
> docs if required afterwards), however if there would be a more appropriate
> place, please let me know.
>
> Best,
>
> Adam Chhina
>
> > On Dec 27, 2022, at 11:37 AM, Adam Chhina <am...@gmail.com> wrote:
> >
> > As part of an upgrade I was looking to run upstream PySpark unit tests
> on `v3.2.1-rc2` before I applied some downstream patches and tested those.
> However, I'm running into some issues with failing unit tests, which I'm
> not sure are failing upstream or due to some step I missed in the build.
> >
> > The current failing tests (at least so far, since I believe the python
> script exits on test failure):
> > ```
> > ======================================================================
> > FAIL: test_train_prediction
> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests)
> > Test that error on test data improves as model is trained.
> > ----------------------------------------------------------------------
> > Traceback (most recent call last):
> >   File
> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
> line 474, in test_train_prediction
> >     eventually(condition, timeout=180.0)
> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 86,
> in eventually
> >     lastValue = condition()
> >   File
> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
> line 469, in condition
> >     self.assertGreater(errors[1] - errors[-1], 2)
> > AssertionError: 1.8960983527735014 not greater than 2
> >
> > ======================================================================
> > FAIL: test_parameter_accuracy
> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
> > Test that the final value of weights is close to the desired value.
> > ----------------------------------------------------------------------
> > Traceback (most recent call last):
> >   File
> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
> line 229, in test_parameter_accuracy
> >     eventually(condition, timeout=60.0, catch_assertions=True)
> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 91,
> in eventually
> >     raise lastValue
> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 82,
> in eventually
> >     lastValue = condition()
> >   File
> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
> line 226, in condition
> >     self.assertAlmostEqual(rel, 0.1, 1)
> > AssertionError: 0.23052813480829393 != 0.1 within 1 places
> (0.13052813480829392 difference)
> >
> > ======================================================================
> > FAIL: test_training_and_prediction
> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
> > Test that the model improves on toy data with no. of batches
> > ----------------------------------------------------------------------
> > Traceback (most recent call last):
> >   File
> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
> line 334, in test_training_and_prediction
> >     eventually(condition, timeout=180.0)
> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 93,
> in eventually
> >     raise AssertionError(
> > AssertionError: Test failed due to timeout after 180 sec, with last
> condition returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74,
> 0.73, 0.69, 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74, 0.76, 0.78,
> 0.7, 0.78, 0.8, 0.74, 0.77, 0.75, 0.76, 0.76, 0.75, 0.78, 0.74, 0.64, 0.64,
> 0.71, 0.78, 0.76, 0.64, 0.68, 0.69, 0.72, 0.77
> >
> > ----------------------------------------------------------------------
> > Ran 13 tests in 661.536s
> >
> > FAILED (failures=3, skipped=1)
> >
> > Had test failures in pyspark.mllib.tests.test_streaming_algorithms with
> /usr/local/bin/python3; see logs.
> > ```
> >
> > Here's how I'm currently building Spark, I was using the
> [building-spark](https://spark.apache.org/docs/3..1/building-spark.html)
> docs as a reference.
> > ```
> > > git clone git@github.com:apache/spark.git
> > > git checkout -b spark-321 v3.2.1
> > > ./build/mvn -DskipTests clean package -Phive
> > > export JAVA_HOME=$(path/to/jdk/11)
> > > ./python/run-tests
> > ```
> >
> > Current Java version
> > ```
> > java -version
> > openjdk version "11.0.17" 2022-10-18
> > OpenJDK Runtime Environment Homebrew (build 11.0.17+0)
> > OpenJDK 64-Bit Server VM Homebrew (build 11.0.17+0, mixed mode)
> > ```
> >
> > Alternatively, I've also tried simply building Spark and using a
> python=3.9 venv and installing the requirements from `pip install -r
> dev/requirements.txt` and using that as the interpreter to run tests.
> However, I was running into some failing pandas test which to me seemed
> like it was coming from a pandas version difference as `requirements.txt`
> didn't specify a version.
> >
> > I suppose I have a couple of questions in regards to this:
> > 1. Am I missing a build step to build Spark and run PySpark unit tests?
> > 2. Where could I find whether an upstream test is failing for a specific
> release?
> > 3. Would it be possible to configure the `run-tests` script to run all
> tests regardless of test failures?
>
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>