You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/08/26 07:58:46 UTC

[GitHub] [spark] itholic opened a new pull request, #37671: [WIP][SPARK-40229][PS][TEST] Re-enable excel I/O test for pandas API on Spark

itholic opened a new pull request, #37671:
URL: https://github.com/apache/spark/pull/37671

   ### What changes were proposed in this pull request?
   
   This PR proposes to install the `openpyxl` for PySpark test environments to re-enable the skipping tests, `read_excel` and `to_excel`.
   
   
   ### Why are the changes needed?
   
   For better test coverage
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, it's test only
   
   
   ### How was this patch tested?
   
   Enabling the existing skipping tests
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] Yikun commented on a diff in pull request #37671: [SPARK-40229][PS][TEST] Re-enable excel I/O test for pandas API on Spark

Posted by GitBox <gi...@apache.org>.

Yikun commented on code in PR #37671:
URL: https://github.com/apache/spark/pull/37671#discussion_r966592734


##########
python/pyspark/pandas/tests/test_dataframe_conversion.py:
##########
@@ -90,7 +90,6 @@ def get_excel_dfs(pandas_on_spark_location, pandas_location):
             "expected": pd.read_excel(pandas_location, index_col=0),
         }
 
-    @unittest.skip("openpyxl")
     def test_to_excel(self):

Review Comment:
   What about this case for pypy3? Will it affect developers run tests?
   
   (I remember we run pypy3 and python3.9 in CI....[but not](https://github.com/itholic/spark/runs/8240425530?check_suite_focus=true#step:11:1174) running, pls let me know if I missed something history changes...)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on pull request #37671: [SPARK-40229][PS][TEST] Re-enable excel I/O test for pandas API on Spark

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on PR #37671:
URL: https://github.com/apache/spark/pull/37671#issuecomment-1244847277

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] itholic commented on pull request #37671: [SPARK-40229][PS][TEST] Re-enable excel I/O test for pandas API on Spark

Posted by GitBox <gi...@apache.org>.

itholic commented on PR #37671:
URL: https://github.com/apache/spark/pull/37671#issuecomment-1237515485

   The `read_excel` tests still failed for some reason with error below:
   
   ```
   ======================================================================
   FAIL [1.476s]: test_read_excel (pyspark.pandas.tests.test_dataframe_spark_io.DataFrameSparkIOTest)
   ----------------------------------------------------------------------
   Traceback (most recent call last):
     File "/__w/spark/spark/python/pyspark/pandas/tests/test_dataframe_spark_io.py", line 278, in test_read_excel
       ps.read_excel(tmp, index_col=0).sort_index(),
     File "/__w/spark/spark/python/pyspark/pandas/namespace.py", line 1224, in read_excel
       return read_excel_on_spark(pdf_or_psers, sheet_name)
     File "/__w/spark/spark/python/pyspark/pandas/namespace.py", line 1213, in read_excel_on_spark
       psdf = DataFrame(psdf._internal.with_new_sdf(sdf))
     File "/__w/spark/spark/python/pyspark/pandas/internal.py", line 1223, in with_new_sdf
       return self.copy(
     File "/__w/spark/spark/python/pyspark/pandas/internal.py", line 1427, in copy
       return InternalFrame(
     File "/__w/spark/spark/python/pyspark/pandas/internal.py", line 755, in __init__
       assert all(
   AssertionError: ([InternalField(dtype=float64, struct_field=StructField('__index_level_0__', DoubleType(), False))], [StructField('__index_level_0__', DoubleType(), True)])
   
   ----------------------------------------------------------------------
   Ran 8 tests in 34.923s
   ```
   
   I failed to reproduce this error although I use the same version of related envs (e.g. `Python`, `pandas`, `openpyxl`).
   
   Let me leave the re-enabling `read_excel` tests as TODO, and just re-enable the `to_excel` tests here for now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] itholic commented on pull request #37671: [SPARK-40229][PS][TEST] Re-enable excel I/O test for pandas API on Spark

Posted by GitBox <gi...@apache.org>.

itholic commented on PR #37671:
URL: https://github.com/apache/spark/pull/37671#issuecomment-1237520014

   cc @HyukjinKwon FYI


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] Yikun commented on a diff in pull request #37671: [SPARK-40229][PS][TEST] Re-enable excel I/O test for pandas API on Spark

Posted by GitBox <gi...@apache.org>.

Yikun commented on code in PR #37671:
URL: https://github.com/apache/spark/pull/37671#discussion_r971381963


##########
python/pyspark/pandas/tests/test_dataframe_conversion.py:
##########
@@ -90,7 +90,6 @@ def get_excel_dfs(pandas_on_spark_location, pandas_location):
             "expected": pd.read_excel(pandas_location, index_col=0),
         }
 
-    @unittest.skip("openpyxl")
     def test_to_excel(self):

Review Comment:
   Thanks, good to know!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37671: [SPARK-40229][PS][TEST] Re-enable excel I/O test for pandas API on Spark

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on code in PR #37671:
URL: https://github.com/apache/spark/pull/37671#discussion_r963179722


##########
.github/workflows/build_and_test.yml:
##########
@@ -383,6 +383,10 @@ jobs:
       uses: actions/setup-java@v1
       with:
         java-version: ${{ matrix.java }}
+    - name: Install Python packages (Python 3.9, PyPy3)
+      run: |
+        # To test excel I/O for pandas API on Spark.
+        python3.9 -m pip install openpyxl

Review Comment:
   can we add this into Dockerfile? https://github.com/apache/spark/blob/master/dev/infra/Dockerfile



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on pull request #37671: [SPARK-40229][PS][TEST] Re-enable excel I/O test for pandas API on Spark

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on PR #37671:
URL: https://github.com/apache/spark/pull/37671#issuecomment-1240077585

   cc @xinrong-meng FYI (I will be off for the rest of this week)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] Yikun commented on a diff in pull request #37671: [SPARK-40229][PS][TEST] Re-enable excel I/O test for pandas API on Spark

Posted by GitBox <gi...@apache.org>.

Yikun commented on code in PR #37671:
URL: https://github.com/apache/spark/pull/37671#discussion_r966592734


##########
python/pyspark/pandas/tests/test_dataframe_conversion.py:
##########
@@ -90,7 +90,6 @@ def get_excel_dfs(pandas_on_spark_location, pandas_location):
             "expected": pd.read_excel(pandas_location, index_col=0),
         }
 
-    @unittest.skip("openpyxl")
     def test_to_excel(self):

Review Comment:
   What about this case for pypy3? Will it affect developers run tests?
   
   https://github.com/apache/spark/blob/e83aedd0f072ce18d5542b61eb40e2df1b83bb50/python/run-tests.py#L202
   
   (I remember we run pypy3 and python3.9 in CI....[but not](https://github.com/itholic/spark/runs/8240425530?check_suite_focus=true#step:11:1174) running, pls let me know if I missed something history changes...)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] itholic commented on a diff in pull request #37671: [WIP][SPARK-40229][PS][TEST] Re-enable excel I/O test for pandas API on Spark

Posted by GitBox <gi...@apache.org>.

itholic commented on code in PR #37671:
URL: https://github.com/apache/spark/pull/37671#discussion_r955788905


##########
.github/workflows/build_and_test.yml:
##########
@@ -242,7 +242,7 @@ jobs:
     - name: Install Python packages (Python 3.8)
       if: (contains(matrix.modules, 'sql') && !contains(matrix.modules, 'sql-'))
       run: |
-        python3.8 -m pip install 'numpy>=1.20.0' pyarrow pandas scipy unittest-xml-reporting
+        python3.8 -m pip install 'numpy>=1.20.0' pyarrow pandas scipy unittest-xml-reporting openpyxl

Review Comment:
   Maybe we don't need to install for Python 3.8 ??



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] itholic commented on pull request #37671: [SPARK-40229][PS][TEST] Re-enable excel I/O test for pandas API on Spark

Posted by GitBox <gi...@apache.org>.

itholic commented on PR #37671:
URL: https://github.com/apache/spark/pull/37671#issuecomment-1237519230

   Just created a ticket for re-enabling the `read_excel` test: SPARK-40353.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] itholic commented on a diff in pull request #37671: [SPARK-40229][PS][TEST] Re-enable excel I/O test for pandas API on Spark

Posted by GitBox <gi...@apache.org>.

itholic commented on code in PR #37671:
URL: https://github.com/apache/spark/pull/37671#discussion_r963183205


##########
.github/workflows/build_and_test.yml:
##########
@@ -383,6 +383,10 @@ jobs:
       uses: actions/setup-java@v1
       with:
         java-version: ${{ matrix.java }}
+    - name: Install Python packages (Python 3.9, PyPy3)
+      run: |
+        # To test excel I/O for pandas API on Spark.
+        python3.9 -m pip install openpyxl

Review Comment:
   Sounds good! Just removed



##########
.github/workflows/build_and_test.yml:
##########
@@ -383,6 +383,10 @@ jobs:
       uses: actions/setup-java@v1
       with:
         java-version: ${{ matrix.java }}
+    - name: Install Python packages (Python 3.9, PyPy3)
+      run: |
+        # To test excel I/O for pandas API on Spark.
+        python3.9 -m pip install openpyxl

Review Comment:
   Sounds good! Just moved



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37671: [SPARK-40229][PS][TEST] Re-enable excel I/O test for pandas API on Spark

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on code in PR #37671:
URL: https://github.com/apache/spark/pull/37671#discussion_r969112556


##########
python/pyspark/pandas/tests/test_dataframe_conversion.py:
##########
@@ -90,7 +90,6 @@ def get_excel_dfs(pandas_on_spark_location, pandas_location):
             "expected": pd.read_excel(pandas_location, index_col=0),
         }
 
-    @unittest.skip("openpyxl")
     def test_to_excel(self):

Review Comment:
   Actually PyPy is not tested with Pandas API on Spark per https://github.com/apache/spark/blob/master/dev/sparktestsupport/modules.py#L663-L667 because pyarrow, etc are not available with PyPy. Should probably docunent it somewhere but let's do that separately.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon closed pull request #37671: [SPARK-40229][PS][TEST] Re-enable excel I/O test for pandas API on Spark

Posted by GitBox <gi...@apache.org>.

HyukjinKwon closed pull request #37671: [SPARK-40229][PS][TEST] Re-enable excel I/O test for pandas API on Spark
URL: https://github.com/apache/spark/pull/37671


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org