Posted to reviews@spark.apache.org by cloud-fan <gi...@git.apache.org> on 2018/02/01 04:52:40 UTC

[GitHub] spark pull request #20465: [SPARK-23292][TEST] always run python tests

GitHub user cloud-fan opened a pull request:

    https://github.com/apache/spark/pull/20465

    [SPARK-23292][TEST] always run python tests

    ## What changes were proposed in this pull request?
    
    We should not skip any tests; otherwise we can't trust the Jenkins report.
    
    This PR makes the pandas-related tests always run, so that we can be confident about pandas-related features in Spark.
    
    
    ## How was this patch tested?
    
    N/A

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark python-test

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20465.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20465
    
----
commit 8aba4f502879b7e3b8c154b00ded22e4bcba8df2
Author: Wenchen Fan <we...@...>
Date:   2018-02-01T04:49:05Z

    always run python tests

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20465: [SPARK-23292][TEST] always run python tests

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/20465
  
    ```
    ImportError: Pandas >= 0.19.2 must be installed on calling Python process; however, your version was 0.16.0.
    ```
    
    I guess the RISELab boxes will need some updates...
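    The error above comes from a minimum-version guard. A minimal sketch of such a guard (the function name and the naive version parsing here are illustrative assumptions, not PySpark's actual helper):

```python
# Illustrative sketch of a pandas minimum-version guard like the one that
# produced the error above. Names and the simplistic parsing are assumptions,
# not PySpark's real implementation.
MINIMUM_PANDAS_VERSION = "0.19.2"

def _parse_version(version):
    # Naive "major.minor.patch" parsing; real version handling is richer.
    return tuple(int(part) for part in version.split("."))

def require_minimum_pandas_version(installed_version):
    """Raise ImportError if the installed pandas is older than the minimum."""
    if _parse_version(installed_version) < _parse_version(MINIMUM_PANDAS_VERSION):
        raise ImportError(
            "Pandas >= %s must be installed on calling Python process; "
            "however, your version was %s."
            % (MINIMUM_PANDAS_VERSION, installed_version))
```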


---



[GitHub] spark issue #20465: [SPARK-23292][TEST] always run python tests

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20465
  
    **[Test build #86910 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86910/testReport)** for PR 20465 at commit [`8aba4f5`](https://github.com/apache/spark/commit/8aba4f502879b7e3b8c154b00ded22e4bcba8df2).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #20465: [SPARK-23292][TEST] always run python tests

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20465
  
    **[Test build #86910 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86910/testReport)** for PR 20465 at commit [`8aba4f5`](https://github.com/apache/spark/commit/8aba4f502879b7e3b8c154b00ded22e4bcba8df2).


---



[GitHub] spark pull request #20465: [SPARK-23292][TEST] always run python tests

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20465#discussion_r165259756
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -2858,16 +2833,6 @@ def test_create_dataframe_from_pandas_with_timestamp(self):
             self.assertTrue(isinstance(df.schema['ts'].dataType, TimestampType))
             self.assertTrue(isinstance(df.schema['d'].dataType, DateType))
     
    -    @unittest.skipIf(not _have_old_pandas, "Old Pandas not installed")
    -    def test_create_dataframe_from_old_pandas(self):
    --- End diff --
    
    ditto


---



[GitHub] spark issue #20465: [SPARK-23292][TEST] always run python tests

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20465
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86910/
    Test FAILed.


---



[GitHub] spark issue #20465: [SPARK-23292][TEST] always run python tests

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/20465
  
    I haven't worked on the logging stuff yet; feel free to take it, thanks!


---



[GitHub] spark pull request #20465: [SPARK-23292][TEST] always run python tests

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20465#discussion_r165259740
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -2819,13 +2802,6 @@ def test_to_pandas(self):
             self.assertEquals(types[4], 'datetime64[ns]')
             self.assertEquals(types[5], 'datetime64[ns]')
     
    -    @unittest.skipIf(not _have_old_pandas, "Old Pandas not installed")
    -    def test_to_pandas_old(self):
    --- End diff --
    
    I have no idea how to test it in Jenkins, as Jenkins should always have pandas and pyarrow installed with the required versions. I think we can only test it manually.
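    For reference, the conditional-skip pattern being removed looks roughly like this (a simplified sketch; the flag and test names mirror the diff but are not the exact PySpark code):

```python
import unittest

# Sketch of the skip pattern under discussion: the availability flag is
# computed once at import time; when the dependency is missing, the test is
# silently skipped rather than failed.
try:
    import pandas  # noqa: F401
    _have_pandas = True
except ImportError:
    _have_pandas = False

class PandasTests(unittest.TestCase):

    # The old pattern: a green run cannot distinguish "passed" from "never ran".
    @unittest.skipIf(not _have_pandas, "Pandas not installed")
    def test_to_pandas(self):
        self.assertTrue(_have_pandas)
```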


---



[GitHub] spark issue #20465: [SPARK-23292][TEST] always run python tests

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20465
  
    Yup, there was a related discussion already; see https://github.com/apache/spark/pull/19884#issuecomment-351916074 and https://github.com/apache/spark/pull/19884#issuecomment-353068446. We shouldn't run this for now. Also, these are technically not hard dependencies.


---



[GitHub] spark issue #20465: [SPARK-23292][TEST] always run python tests

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20465
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #20465: [SPARK-23292][TEST] always run python tests

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20465
  
    Also, if we go this way, I think we should enable some tests with PyPy too, if I understood correctly and there isn't another problem I may have missed:
    
    https://github.com/apache/spark/blob/9623a98248837da302ba4ec240335d1c4268ee21/dev/sparktestsupport/modules.py#L457
    
    At least, PyPy in my local has `numpy`.



---



[GitHub] spark issue #20465: [SPARK-23292][TEST] always run python tests

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20465
  
    Yup, explicit logging sounds fine for now so that we can easily check.
    
    >  I do prefer to have these conditional skips removed because sometimes it is hard to tell if everything passed or was just skipped
    
    To be clear, I think it's more because our own testing script doesn't show the skipped-test output from unittest in the console.
    
    Also, I think it's more because we couldn't make sure Pandas and Arrow were installed properly in the testing env (Jenkins), not because we skip tests related to extra dependencies when they are not installed. Making them required dependencies is a big deal IMHO.
    
    FYI, I tried to install PyArrow with PyPy last time and I failed. I wonder if we can easily install it.


---



[GitHub] spark issue #20465: [SPARK-23292][TEST] always run python tests

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20465
  
    @cloud-fan, will try it. Thank you sincerely.


---



[GitHub] spark issue #20465: [SPARK-23292][TEST] always run python tests

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the issue:

    https://github.com/apache/spark/pull/20465
  
    @felixcheung Jenkins is actually skipping those tests (see the failure of this PR). It makes sense to provide a way to allow developers to not run those tests, but I'd prefer that we run them by default, so we can make sure that Jenkins is doing the right thing.


---



[GitHub] spark issue #20465: [SPARK-23292][TEST] always run python tests

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/20465
  
    I agree that pandas and pyarrow should not be a hard requirement for users, and this is how it is today: PySpark only throws an exception when users try to use pandas-related functions without pandas/pyarrow installed.
    
    My proposal is, pandas and pyarrow should be a hard requirement for our jenkins, to make sure the features are well tested.
    
    If there is a way to prove that the py3 tests run well, and the environment issue is hard to fix, then maybe we can deal with it later, after 2.3.


---



[GitHub] spark issue #20465: [SPARK-23292][TEST] always run python tests

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/20465
  
    hmm, I think there is some value in having a way to run python tests without Arrow? I mean, tests.py is not just for Jenkins but for everyone consuming the Spark release... unless we are saying Arrow is required now?
    
    And in Jenkins we shouldn't be skipping any of these tests anyway? Is there a reason we need to change that if Jenkins isn't affected? (If I recall, there is a way to check if we're running under Jenkins; we could always make the Arrow tests not skipped in Jenkins.)



---



[GitHub] spark pull request #20465: [SPARK-23292][TEST] always run python tests

Posted by shaneknapp <gi...@git.apache.org>.
Github user shaneknapp commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20465#discussion_r165412455
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -2819,13 +2802,6 @@ def test_to_pandas(self):
             self.assertEquals(types[4], 'datetime64[ns]')
             self.assertEquals(types[5], 'datetime64[ns]')
     
    -    @unittest.skipIf(not _have_old_pandas, "Old Pandas not installed")
    -    def test_to_pandas_old(self):
    --- End diff --
    
    jenkins does indeed have pandas and pyarrow installed, btw.


---



[GitHub] spark issue #20465: [SPARK-23292][TEST] always run python tests

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20465
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #20465: [SPARK-23292][TEST] always run python tests

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/20465
  
    sure, that's ok, I think we can revisit later (ie. next release) if we want to add an env switch or something to make them optional


---



[GitHub] spark pull request #20465: [SPARK-23292][TEST] always run python tests

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan closed the pull request at:

    https://github.com/apache/spark/pull/20465


---



[GitHub] spark issue #20465: [SPARK-23292][TEST] always run python tests

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20465
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/460/
    Test PASSed.


---



[GitHub] spark issue #20465: [SPARK-23292][TEST] always run python tests

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the issue:

    https://github.com/apache/spark/pull/20465
  
    So, Jenkins jobs run those tests with python3? If so, I feel better because those tests are not completely skipped in Jenkins. If it is hard to make them run with python 2, let's have a log to explicitly show whether we are going to run tests using pandas/pyarrow, which will help us confirm whether they get exercised with python 3 in Jenkins or not.



---



[GitHub] spark issue #20465: [SPARK-23292][TEST] always run python tests

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/20465
  
    Looking back at when pyarrow was last upgraded in #19884, pandas and pyarrow were upgraded on all workers for python 3, but there may have been some concerns or difficulties with upgrading the python 2 and pypy environments at that time. That is why the above failure is from python 2 with an older version of pandas.


---



[GitHub] spark issue #20465: [SPARK-23292][TEST] always run python tests

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/20465
  
    >  I think there are some values in having a way to run python tests without Arrow?
    
    I agree, but the more important thing is to make sure Jenkins runs everything, so that we can be confident about our release.


---



[GitHub] spark issue #20465: [SPARK-23292][TEST] always run python tests

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20465
  
    > My proposal is, pandas and pyarrow should be a hard requirement for our jenkins, to make sure the features are well tested.
    
    If this is the goal, I think another simple way is just to use an env variable set in Jenkins and, in the future, throw an exception if either PyArrow or Pandas is not installed.
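    A minimal sketch of that idea (the environment variable name here is made up for illustration, not a real Spark setting):

```python
import os

# Hedged sketch: on CI, a hypothetical env var turns missing optional test
# dependencies into hard failures; elsewhere they are merely reported.
def check_optional_test_deps():
    missing = []
    for module in ("pandas", "pyarrow"):
        try:
            __import__(module)
        except ImportError:
            missing.append(module)
    # SPARK_TESTING_REQUIRE_ARROW is an illustrative name, not a real setting.
    if missing and os.environ.get("SPARK_TESTING_REQUIRE_ARROW") == "1":
        raise ImportError(
            "Required on Jenkins but not installed: %s" % ", ".join(missing))
    return missing
```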


---



[GitHub] spark issue #20465: [SPARK-23292][TEST] always run python tests

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/20465
  
    Yes, the tests are being run with python3. I do prefer to have these conditional skips removed because sometimes it is hard to tell if everything passed or was just skipped. But since pandas and pyarrow are optional dependencies, there should be some way for the user to skip them, with an environment variable or something. At the very least, being able to verify in a log that they were run would be good.
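    The verify-in-a-log idea could be as simple as this sketch (the function name and message wording are illustrative, not the code that was actually merged):

```python
# Hedged sketch: report once at test startup which optional dependencies were
# found, so the CI console shows whether pandas/pyarrow tests actually ran.
def report_optional_deps():
    lines = []
    for module in ("pandas", "pyarrow"):
        try:
            mod = __import__(module)
            lines.append("Will test %s %s"
                         % (module, getattr(mod, "__version__", "unknown")))
        except ImportError:
            lines.append("Skipping %s tests: not installed" % module)
    return lines

for line in report_optional_deps():
    print(line)
```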


---



[GitHub] spark issue #20465: [SPARK-23292][TEST] always run python tests

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/20465
  
    The logging approach has been merged and I'm closing this one.


---



[GitHub] spark issue #20465: [SPARK-23292][TEST] always run python tests

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/20465
  
    cc @yhuai @icexelloss @BryanCutler @ueshin @shaneknapp 


---



[GitHub] spark issue #20465: [SPARK-23292][TEST] always run python tests

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20465
  
    Thank you for bearing with me @cloud-fan. I agree with it.
    
    BTW, are you working on the logging thing? I was thinking the simplest way to check is just to print out once whether PyArrow / Pandas are installed or not.


---
