You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by "itholic (via GitHub)" <gi...@apache.org> on 2024/01/25 08:57:16 UTC

[PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

itholic opened a new pull request, #44881:
URL: https://github.com/apache/spark/pull/44881

### What changes were proposed in this pull request?

This PR proposes to upgrade Pandas to 2.2.0.

See [What's new in 2.2.0 (January 19, 2024)](https://pandas.pydata.org/docs/whatsnew/v2.2.0.html)

### Why are the changes needed?

Pandas 2.2.0 is released, and we should support the latest Pandas.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The existing CI should pass

### Was this patch authored or co-authored using generative AI tooling?

No.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun commented on code in PR #44881:
URL: https://github.com/apache/spark/pull/44881#discussion_r1495305345


##########
python/pyspark/pandas/series.py:
##########
@@ -7092,15 +7092,15 @@ def resample(
         ----------
         rule : str
             The offset string or object representing target conversion.
-            Currently, supported units are {'Y', 'A', 'M', 'D', 'H',
-            'T', 'MIN', 'S'}.
+            Currently, supported units are {'YE', 'A', 'ME', 'D', 'h',
+            'min', 'MIN', 's'}.

Review Comment:
   The background of my question is that `Data Science` team has been struggling when they validate their pipelines on new Spark versions.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.

itholic commented on PR #44881:
URL: https://github.com/apache/spark/pull/44881#issuecomment-1942942082

   Yeah, Pandas fixes many bugs from Pandas 2.2.0 that brings couple of behavior changes 😢 
   
   Let me fix them. Thanks for the confirm!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.

itholic commented on code in PR #44881:
URL: https://github.com/apache/spark/pull/44881#discussion_r1495314548


##########
python/pyspark/pandas/series.py:
##########
@@ -7092,15 +7092,15 @@ def resample(
         ----------
         rule : str
             The offset string or object representing target conversion.
-            Currently, supported units are {'Y', 'A', 'M', 'D', 'H',
-            'T', 'MIN', 'S'}.
+            Currently, supported units are {'YE', 'A', 'ME', 'D', 'h',
+            'min', 'MIN', 's'}.

Review Comment:
   > Even for the users who choose old Pandas libraries, Apache Spark enforces this breaking change
   
   Yes. In current policy, if we want to use latest Apache Spark then we cannot avoid having to follow the behavior of latest Pandas as well IIRC.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.

itholic commented on code in PR #44881:
URL: https://github.com/apache/spark/pull/44881#discussion_r1466077514


##########
dev/infra/Dockerfile:
##########
@@ -91,10 +91,10 @@ RUN mkdir -p /usr/local/pypy/pypy3.8 && \
     ln -sf /usr/local/pypy/pypy3.8/bin/pypy /usr/local/bin/pypy3.8 && \
     ln -sf /usr/local/pypy/pypy3.8/bin/pypy /usr/local/bin/pypy3
 RUN curl -sS https://bootstrap.pypa.io/get-pip.py | pypy3
-RUN pypy3 -m pip install numpy 'six==1.16.0' 'pandas<=2.1.4' scipy coverage matplotlib lxml
+RUN pypy3 -m pip install numpy 'six==1.16.0' 'pandas<=2.2.0' scipy coverage matplotlib lxml

Review Comment:
   I think maybe the CI would be broken in the future when the higher version will be released?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.

itholic commented on code in PR #44881:
URL: https://github.com/apache/spark/pull/44881#discussion_r1495320737


##########
python/pyspark/pandas/series.py:
##########
@@ -7092,15 +7092,15 @@ def resample(
         ----------
         rule : str
             The offset string or object representing target conversion.
-            Currently, supported units are {'Y', 'A', 'M', 'D', 'H',
-            'T', 'MIN', 'S'}.
+            Currently, supported units are {'YE', 'A', 'ME', 'D', 'h',
+            'min', 'MIN', 's'}.

Review Comment:
   Oh, sorry that was my mistake. This should work even in old Pandas before Spark 4.0.0 release.
   
   Let me fix them to work both Pandas 2.2.0 and old Pandas.



##########
python/pyspark/pandas/series.py:
##########
@@ -7092,15 +7092,15 @@ def resample(
         ----------
         rule : str
             The offset string or object representing target conversion.
-            Currently, supported units are {'Y', 'A', 'M', 'D', 'H',
-            'T', 'MIN', 'S'}.
+            Currently, supported units are {'YE', 'A', 'ME', 'D', 'h',
+            'min', 'MIN', 's'}.

Review Comment:
   ~~However, the current rule is that it should not be accompanied by such a breaking change unless the major version changes.~~
   
   ~~This means that users should be able to use their pipeline as is, as long as they are using at least version 3.x of Spark.~~



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [WIP][SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.

itholic commented on code in PR #44881:
URL: https://github.com/apache/spark/pull/44881#discussion_r1470461086


##########
dev/infra/Dockerfile:
##########
@@ -91,10 +91,10 @@ RUN mkdir -p /usr/local/pypy/pypy3.8 && \
     ln -sf /usr/local/pypy/pypy3.8/bin/pypy /usr/local/bin/pypy3.8 && \
     ln -sf /usr/local/pypy/pypy3.8/bin/pypy /usr/local/bin/pypy3
 RUN curl -sS https://bootstrap.pypa.io/get-pip.py | pypy3
-RUN pypy3 -m pip install numpy 'six==1.16.0' 'pandas<=2.1.4' scipy coverage matplotlib lxml
+RUN pypy3 -m pip install numpy 'six==1.16.0' 'pandas==2.2.0' scipy coverage matplotlib lxml

Review Comment:
   Let met just use upperbound for now since it's only issue on PyPy3 CI.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.

itholic commented on code in PR #44881:
URL: https://github.com/apache/spark/pull/44881#discussion_r1466155995


##########
dev/infra/Dockerfile:
##########
@@ -91,10 +91,10 @@ RUN mkdir -p /usr/local/pypy/pypy3.8 && \
     ln -sf /usr/local/pypy/pypy3.8/bin/pypy /usr/local/bin/pypy3.8 && \
     ln -sf /usr/local/pypy/pypy3.8/bin/pypy /usr/local/bin/pypy3
 RUN curl -sS https://bootstrap.pypa.io/get-pip.py | pypy3
-RUN pypy3 -m pip install numpy 'six==1.16.0' 'pandas<=2.1.4' scipy coverage matplotlib lxml
+RUN pypy3 -m pip install numpy 'six==1.16.0' 'pandas<=2.2.0' scipy coverage matplotlib lxml

Review Comment:
   AFAIK, pip automatically finds the most recent version that meets the conditions as below (Has this not worked well so far btw??):
   
   ```
   (pyspark-dev-env) spark % pip install "pandas<=2.2.0"
   Collecting pandas<=2.2.0
   ...
   Installing collected packages: pandas
   Successfully installed pandas-2.2.0
   ```
   
   But I'm okay with the way `==`. WDYT, @HyukjinKwon ?
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.

itholic commented on PR #44881:
URL: https://github.com/apache/spark/pull/44881#issuecomment-1953618046

   Just updated to resample work in old Pandas as well.
   
   I think we can just make it as deprecate for now to avoid breaking the existing pipeline.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.

itholic commented on PR #44881:
URL: https://github.com/apache/spark/pull/44881#issuecomment-1953550280

   I believe now this PR completed to address all of Pandas 2.2.0 behavior. cc @HyukjinKwon @dongjoon-hyun FYI


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [WIP][SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.

itholic commented on code in PR #44881:
URL: https://github.com/apache/spark/pull/44881#discussion_r1470651441


##########
python/pyspark/pandas/namespace.py:
##########
@@ -2554,7 +2553,7 @@ def resolve_func(psdf, this_column_labels, that_column_labels):
         if isinstance(obj, Series):
             num_series += 1
             series_names.add(obj.name)
-            new_objs.append(obj.to_frame(DEFAULT_SERIES_NAME))
+            new_objs.append(obj.to_frame())

Review Comment:
   Yes, actually this followups a bug fix from Pandas(https://github.com/pandas-dev/pandas/issues/15047) so I believe it should be fine.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.

zhengruifeng commented on code in PR #44881:
URL: https://github.com/apache/spark/pull/44881#discussion_r1466068442


##########
dev/infra/Dockerfile:
##########
@@ -91,10 +91,10 @@ RUN mkdir -p /usr/local/pypy/pypy3.8 && \
     ln -sf /usr/local/pypy/pypy3.8/bin/pypy /usr/local/bin/pypy3.8 && \
     ln -sf /usr/local/pypy/pypy3.8/bin/pypy /usr/local/bin/pypy3
 RUN curl -sS https://bootstrap.pypa.io/get-pip.py | pypy3
-RUN pypy3 -m pip install numpy 'six==1.16.0' 'pandas<=2.1.4' scipy coverage matplotlib lxml
+RUN pypy3 -m pip install numpy 'six==1.16.0' 'pandas<=2.2.0' scipy coverage matplotlib lxml

Review Comment:
   shall we use `>=` to confirm 2.2.0+ is used?



##########
dev/infra/Dockerfile:
##########
@@ -91,10 +91,10 @@ RUN mkdir -p /usr/local/pypy/pypy3.8 && \
     ln -sf /usr/local/pypy/pypy3.8/bin/pypy /usr/local/bin/pypy3.8 && \
     ln -sf /usr/local/pypy/pypy3.8/bin/pypy /usr/local/bin/pypy3
 RUN curl -sS https://bootstrap.pypa.io/get-pip.py | pypy3
-RUN pypy3 -m pip install numpy 'six==1.16.0' 'pandas<=2.1.4' scipy coverage matplotlib lxml
+RUN pypy3 -m pip install numpy 'six==1.16.0' 'pandas<=2.2.0' scipy coverage matplotlib lxml
 
 
-ARG BASIC_PIP_PKGS="numpy pyarrow>=14.0.0 six==1.16.0 pandas<=2.1.4 scipy plotly>=4.8 mlflow>=2.8.1 coverage matplotlib openpyxl memory-profiler>=0.61.0 scikit-learn>=1.3.2"
+ARG BASIC_PIP_PKGS="numpy pyarrow>=14.0.0 six==1.16.0 pandas<=2.2.0 scipy plotly>=4.8 mlflow>=2.8.1 coverage matplotlib openpyxl memory-profiler>=0.61.0 scikit-learn>=1.3.2"

Review Comment:
   ditto



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.

itholic commented on code in PR #44881:
URL: https://github.com/apache/spark/pull/44881#discussion_r1495316022


##########
python/pyspark/pandas/series.py:
##########
@@ -7092,15 +7092,15 @@ def resample(
         ----------
         rule : str
             The offset string or object representing target conversion.
-            Currently, supported units are {'Y', 'A', 'M', 'D', 'H',
-            'T', 'MIN', 'S'}.
+            Currently, supported units are {'YE', 'A', 'ME', 'D', 'h',
+            'min', 'MIN', 's'}.

Review Comment:
   However, the current rule is that it should not be accompanied by such a breaking change unless the major version changes.
   
   This means that users should be able to use their pipeline as is, as long as they are using at least version 3.x of Spark.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun commented on PR #44881:
URL: https://github.com/apache/spark/pull/44881#issuecomment-1954505946

   Merged to master.
   
   Thank you again, @itholic and @HyukjinKwon .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun commented on code in PR #44881:
URL: https://github.com/apache/spark/pull/44881#discussion_r1495304624


##########
python/pyspark/pandas/series.py:
##########
@@ -7092,15 +7092,15 @@ def resample(
         ----------
         rule : str
             The offset string or object representing target conversion.
-            Currently, supported units are {'Y', 'A', 'M', 'D', 'H',
-            'T', 'MIN', 'S'}.
+            Currently, supported units are {'YE', 'A', 'ME', 'D', 'h',
+            'min', 'MIN', 's'}.

Review Comment:
   Ya, that comes to my second question. (https://github.com/apache/spark/pull/44881#pullrequestreview-1889581226).
   
   Even for the users who choose old Pandas libraries, Apache Spark enforces this breaking change, @itholic ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.

itholic commented on PR #44881:
URL: https://github.com/apache/spark/pull/44881#issuecomment-1953588339

   We should not bring any breaking change. Let me address them.
   
   Thanks, @dongjoon-hyun for double checking.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.

itholic commented on code in PR #44881:
URL: https://github.com/apache/spark/pull/44881#discussion_r1467133120


##########
dev/infra/Dockerfile:
##########
@@ -91,10 +91,10 @@ RUN mkdir -p /usr/local/pypy/pypy3.8 && \
     ln -sf /usr/local/pypy/pypy3.8/bin/pypy /usr/local/bin/pypy3.8 && \
     ln -sf /usr/local/pypy/pypy3.8/bin/pypy /usr/local/bin/pypy3
 RUN curl -sS https://bootstrap.pypa.io/get-pip.py | pypy3
-RUN pypy3 -m pip install numpy 'six==1.16.0' 'pandas<=2.1.4' scipy coverage matplotlib lxml
+RUN pypy3 -m pip install numpy 'six==1.16.0' 'pandas<=2.2.0' scipy coverage matplotlib lxml

Review Comment:
   Got it. btw Pandas 2.2.0 again introduces some breaking changes 😂 Let me address it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [WIP][SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun commented on code in PR #44881:
URL: https://github.com/apache/spark/pull/44881#discussion_r1470668096


##########
python/pyspark/pandas/namespace.py:
##########
@@ -2554,7 +2553,7 @@ def resolve_func(psdf, this_column_labels, that_column_labels):
         if isinstance(obj, Series):
             num_series += 1
             series_names.add(obj.name)
-            new_objs.append(obj.to_frame(DEFAULT_SERIES_NAME))
+            new_objs.append(obj.to_frame())

Review Comment:
   Thank you for the confirmation, @itholic .



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [WIP][SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.

itholic commented on code in PR #44881:
URL: https://github.com/apache/spark/pull/44881#discussion_r1469178246


##########
dev/infra/Dockerfile:
##########
@@ -91,10 +91,10 @@ RUN mkdir -p /usr/local/pypy/pypy3.8 && \
     ln -sf /usr/local/pypy/pypy3.8/bin/pypy /usr/local/bin/pypy3.8 && \
     ln -sf /usr/local/pypy/pypy3.8/bin/pypy /usr/local/bin/pypy3
 RUN curl -sS https://bootstrap.pypa.io/get-pip.py | pypy3
-RUN pypy3 -m pip install numpy 'six==1.16.0' 'pandas<=2.1.4' scipy coverage matplotlib lxml
+RUN pypy3 -m pip install numpy 'six==1.16.0' 'pandas==2.2.0' scipy coverage matplotlib lxml

Review Comment:
   @zhengruifeng This breaks CI because Pandas 2.2.0 is not supported from PyPy3 yet.
   
   ```
   #18 2.142 ERROR: Ignored the following versions that require a different python version: 1.25.0 Requires-Python >=3.9; 1.25.0rc1 Requires-Python >=3.9; 1.25.1 Requires-Python >=3.9; 1.25.2 Requires-Python >=3.9; 1.26.0 Requires-Python <3.13,>=3.9; 1.26.0b1 Requires-Python <3.13,>=3.9; 1.26.0rc1 Requires-Python <3.13,>=3.9; 1.26.1 Requires-Python <3.13,>=3.9; 1.26.2 Requires-Python >=3.9; 1.26.3 Requires-Python >=3.9; 2.1.0 Requires-Python >=3.9; 2.1.0rc0 Requires-Python >=3.9; 2.1.1 Requires-Python >=3.9; 2.1.2 Requires-Python >=3.9; 2.1.3 Requires-Python >=3.9; 2.1.4 Requires-Python >=3.9; 2.2.0 Requires-Python >=3.9; 2.2.0rc0 Requires-Python >=3.9
   #18 2.145 ERROR: Could not find a version that satisfies the requirement pandas==2.2.0 (from versions: 0.1, 0.2, 0.3.0, 0.4.0, 0.4.1, 0.4.2, 0.4.3, 0.5.0, 0.6.0, 0.6.1, 0.7.0, 0.7.1, 0.7.2, 0.7.3, 0.8.0, 0.8.1, 0.9.0, 0.9.1, 0.10.0, 0.10.1, 0.11.0, 0.12.0, 0.13.0, 0.13.1, 0.14.0, 0.14.1, 0.15.0, 0.15.1, 0.15.2, 0.16.0, 0.16.1, 0.16.2, 0.17.0, 0.17.1, 0.18.0, 0.18.1, 0.19.0, 0.19.1, 0.19.2, 0.20.0, 0.20.1, 0.20.2, 0.20.3, 0.21.0, 0.21.1, 0.22.0, 0.23.0, 0.23.1, 0.23.2, 0.23.3, 0.23.4, 0.24.0, 0.24.1, 0.24.2, 0.25.0, 0.25.1, 0.25.2, 0.25.3, 1.0.0, 1.0.1, 1.0.2, 1.0.3, 1.0.4, 1.0.5, 1.1.0, 1.1.1, 1.1.2, 1.1.3, 1.1.4, 1.1.5, 1.2.0, 1.2.1, 1.2.2, 1.2.3, 1.2.4, 1.2.5, 1.3.0, 1.3.1, 1.3.2, 1.3.3, 1.3.4, 1.3.5, 1.4.0rc0, 1.4.0, 1.4.1, 1.4.2, 1.4.3, 1.4.4, 1.5.0rc0, 1.5.0, 1.5.1, 1.5.2, 1.5.3, 2.0.0rc0, 2.0.0rc1, 2.0.0, 2.0.1, 2.0.2, 2.0.3)
   #18 2.146 ERROR: No matching distribution found for pandas==2.2.0
   ```
   
   Seems like the current latest Pandas is 2.0.3 supported from PyPy3. Should we pending this PR until Pandas 2.2.0 is supported from PyPy3? Or should we get back to use upper-bound again?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun commented on code in PR #44881:
URL: https://github.com/apache/spark/pull/44881#discussion_r1495305345


##########
python/pyspark/pandas/series.py:
##########
@@ -7092,15 +7092,15 @@ def resample(
         ----------
         rule : str
             The offset string or object representing target conversion.
-            Currently, supported units are {'Y', 'A', 'M', 'D', 'H',
-            'T', 'MIN', 'S'}.
+            Currently, supported units are {'YE', 'A', 'ME', 'D', 'h',
+            'min', 'MIN', 's'}.

Review Comment:
   The background of my question is that `Data Science` team has been struggling when they validates their pipelines on new Spark versions.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun commented on code in PR #44881:
URL: https://github.com/apache/spark/pull/44881#discussion_r1495297012


##########
python/pyspark/pandas/series.py:
##########
@@ -7092,15 +7092,15 @@ def resample(
         ----------
         rule : str
             The offset string or object representing target conversion.
-            Currently, supported units are {'Y', 'A', 'M', 'D', 'H',
-            'T', 'MIN', 'S'}.
+            Currently, supported units are {'YE', 'A', 'ME', 'D', 'h',
+            'min', 'MIN', 's'}.

Review Comment:
   Is this a breaking change?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.

itholic commented on PR #44881:
URL: https://github.com/apache/spark/pull/44881#issuecomment-1953573086

   - Is the change of python/pyspark/pandas/resample.py safe?
   
   It breaks the previous behavior, so if we plan to release other minor release (Spark 3.5.0) this should not be included.
   
   - What happens when the users decide to use old Pandas (<= 2.2.0)?
   
   Using deprecated aliases (`Y`, `M`, `H`, `T`, `S`) wouldn't work.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [WIP][SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.

itholic commented on code in PR #44881:
URL: https://github.com/apache/spark/pull/44881#discussion_r1487597019


##########
python/pyspark/pandas/frame.py:
##########
@@ -10607,7 +10607,9 @@ def melt(
                     name_like_string(name) if name is not None else "variable_{}".format(i)
                     for i, name in enumerate(self._internal.column_label_names)
                 ]
-        elif isinstance(var_name, str):
+        elif is_list_like(var_name):
+            raise ValueError(f"{var_name=} must be a scalar.")
+        else:

Review Comment:
   Fixed from: https://github.com/pandas-dev/pandas/pull/55948



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on code in PR #44881:
URL: https://github.com/apache/spark/pull/44881#discussion_r1467123646


##########
dev/infra/Dockerfile:
##########
@@ -91,10 +91,10 @@ RUN mkdir -p /usr/local/pypy/pypy3.8 && \
     ln -sf /usr/local/pypy/pypy3.8/bin/pypy /usr/local/bin/pypy3.8 && \
     ln -sf /usr/local/pypy/pypy3.8/bin/pypy /usr/local/bin/pypy3
 RUN curl -sS https://bootstrap.pypa.io/get-pip.py | pypy3
-RUN pypy3 -m pip install numpy 'six==1.16.0' 'pandas<=2.1.4' scipy coverage matplotlib lxml
+RUN pypy3 -m pip install numpy 'six==1.16.0' 'pandas<=2.2.0' scipy coverage matplotlib lxml

Review Comment:
   Let's pin this to 2.2.0 for now. I think we have seen some issues when automaticaly using latest pandas version



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.

itholic commented on PR #44881:
URL: https://github.com/apache/spark/pull/44881#issuecomment-1955590870

   Thank you so much all for review!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

Posted by "bjornjorgensen (via GitHub)" <gi...@apache.org>.

bjornjorgensen commented on PR #44881:
URL: https://github.com/apache/spark/pull/44881#issuecomment-1955073306

   Great work @itholic Thank you :) 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun closed pull request #44881: [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0
URL: https://github.com/apache/spark/pull/44881


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [WIP][SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun commented on code in PR #44881:
URL: https://github.com/apache/spark/pull/44881#discussion_r1470641012


##########
python/pyspark/pandas/namespace.py:
##########
@@ -2554,7 +2553,7 @@ def resolve_func(psdf, this_column_labels, that_column_labels):
         if isinstance(obj, Series):
             num_series += 1
             series_names.add(obj.name)
-            new_objs.append(obj.to_frame(DEFAULT_SERIES_NAME))
+            new_objs.append(obj.to_frame())

Review Comment:
   Just a question, after this change, do we support old Pandas versions still?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.

itholic commented on code in PR #44881:
URL: https://github.com/apache/spark/pull/44881#discussion_r1492108932


##########
python/pyspark/pandas/plot/matplotlib.py:
##########
@@ -363,10 +364,23 @@ def _args_adjust(self):
         if is_list_like(self.bottom):
             self.bottom = np.array(self.bottom)
 
+    def _ensure_frame(self, data):
+        return data
+
+    def _calculate_bins(self, data, bins):
+        return bins

Review Comment:
   Pandas recently pushed couple of commits for refactoring the internal plotting structure such as https://github.com/pandas-dev/pandas/pull/55850 or https://github.com/pandas-dev/pandas/pull/55872, so we also should inherits couple of internal methods to follow the latest Pandas behavior.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.

zhengruifeng commented on code in PR #44881:
URL: https://github.com/apache/spark/pull/44881#discussion_r1466096686


##########
dev/infra/Dockerfile:
##########
@@ -91,10 +91,10 @@ RUN mkdir -p /usr/local/pypy/pypy3.8 && \
     ln -sf /usr/local/pypy/pypy3.8/bin/pypy /usr/local/bin/pypy3.8 && \
     ln -sf /usr/local/pypy/pypy3.8/bin/pypy /usr/local/bin/pypy3
 RUN curl -sS https://bootstrap.pypa.io/get-pip.py | pypy3
-RUN pypy3 -m pip install numpy 'six==1.16.0' 'pandas<=2.1.4' scipy coverage matplotlib lxml
+RUN pypy3 -m pip install numpy 'six==1.16.0' 'pandas<=2.2.0' scipy coverage matplotlib lxml

Review Comment:
   or using `==`? just because `<=` cannot confirm 2.2.0 is used



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.

itholic commented on code in PR #44881:
URL: https://github.com/apache/spark/pull/44881#discussion_r1495302902


##########
python/pyspark/pandas/series.py:
##########
@@ -7092,15 +7092,15 @@ def resample(
         ----------
         rule : str
             The offset string or object representing target conversion.
-            Currently, supported units are {'Y', 'A', 'M', 'D', 'H',
-            'T', 'MIN', 'S'}.
+            Currently, supported units are {'YE', 'A', 'ME', 'D', 'h',
+            'min', 'MIN', 's'}.

Review Comment:
   Yeah, Pandas 2.2.0 brings couple of breaking changes so we should make sure we ship this support after Spark 4.0.0.
   
   See [related update from Pandas release note](https://pandas.pydata.org/docs/whatsnew/v2.2.0.html#deprecate-aliases-m-q-y-etc-in-favour-of-me-qe-ye-etc-for-offsets) for more detail.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.

itholic commented on PR #44881:
URL: https://github.com/apache/spark/pull/44881#issuecomment-1953603875

   Oh, wait.
   
   I just remembered that we just follow the Pandas behavior and separately mention the breaking changes into [release note](https://github.com/apache/spark/blob/master/python/docs/source/migration_guide/pyspark_upgrade.rst).
   
   So maybe we should add a release note instead of reverting the breaking changes here? @dongjoon-hyun @HyukjinKwon 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.

itholic commented on code in PR #44881:
URL: https://github.com/apache/spark/pull/44881#discussion_r1495189446


##########
python/pyspark/pandas/namespace.py:
##########
@@ -2554,7 +2554,10 @@ def resolve_func(psdf, this_column_labels, that_column_labels):
         if isinstance(obj, Series):
             num_series += 1
             series_names.add(obj.name)
-            new_objs.append(obj.to_frame(DEFAULT_SERIES_NAME))
+            if not ignore_index and not should_return_series:
+                new_objs.append(obj.to_frame())
+            else:
+                new_objs.append(obj.to_frame(DEFAULT_SERIES_NAME))

Review Comment:
   Related to https://github.com/pandas-dev/pandas/issues/15047



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

Posted by "itholic (via GitHub)" <gi...@apache.org>.

itholic commented on code in PR #44881:
URL: https://github.com/apache/spark/pull/44881#discussion_r1495345854


##########
python/pyspark/pandas/series.py:
##########
@@ -7092,15 +7092,15 @@ def resample(
         ----------
         rule : str
             The offset string or object representing target conversion.
-            Currently, supported units are {'Y', 'A', 'M', 'D', 'H',
-            'T', 'MIN', 'S'}.
+            Currently, supported units are {'YE', 'A', 'ME', 'D', 'h',
+            'min', 'MIN', 's'}.

Review Comment:
   Just updated to resample work in old Pandas as well. Now it's safe.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org