Posted to reviews@spark.apache.org by "kumarn (via GitHub)" <gi...@apache.org> on 2023/07/27 03:40:21 UTC

[GitHub] [spark] kumarn opened a new pull request, #42183: [SPARK-43871][Pandas API on Spark][PySpark] Enable SeriesDateTimeTests for pandas 2.0.0.

kumarn opened a new pull request, #42183:
URL: https://github.com/apache/spark/pull/42183

   ### What changes were proposed in this pull request?
   Changes the dtype of the datetime properties to be in line with pandas 2.0.0. For more information, see https://pandas.pydata.org/docs/whatsnew/v2.0.0.html
   
   Enabled corresponding tests that were skipped during the update.
   
   ### Why are the changes needed?
   To be fully compliant with pandas 2.0.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes. Previously, the dtype of the updated properties (year, month, day, hour, minute, second, microsecond, dayofweek, dayofyear, quarter, daysinmonth) was int64; after this change, the dtype will be int32.
   
   Prior to the change
   
   ```python
   import pandas as pd
   import pyspark.pandas as ps
   
   series = ps.Series(pd.date_range("2012-1-1 12:45:31", periods=3, freq="M"))
   series.dt.dayofweek
   
   #Output
   0    1
   1    2
   2    5
   dtype: int64
   ```
   
   After this commit
   ```python
   import pandas as pd
   import pyspark.pandas as ps
   
   series = ps.Series(pd.date_range("2012-1-1 12:45:31", periods=3, freq="M"))
   series.dt.dayofweek
   
   #Output
   0    1
   1    2
   2    5
   dtype: int32
   ```
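
For code that relied on the old int64 result, one option is an explicit cast. The following is a minimal sketch in plain pandas rather than pandas API on Spark, so it is not part of this patch; the timestamps are the month-end dates from the example above written out by hand, and the variable names are illustrative. It assumes `astype` is applied the same way to a `ps.Series`.

```python
import pandas as pd

# Month-end timestamps matching the example above, spelled out explicitly
# so the sketch does not depend on how the frequency alias is written.
pser = pd.Series(pd.to_datetime([
    "2012-01-31 12:45:31",
    "2012-02-29 12:45:31",
    "2012-03-31 12:45:31",
]))

dayofweek = pser.dt.dayofweek
print(dayofweek.dtype)  # int32 on pandas >= 2.0, int64 on older releases

# Cast explicitly when downstream code needs a fixed integer width.
dayofweek_int64 = dayofweek.astype("int64")
print(dayofweek_int64.tolist())
```

The cast leaves the values unchanged and only fixes the integer width, so it behaves the same on either side of the dtype change.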
   
   
   ### How was this patch tested?
   Enabled the tests that were skipped, and updated the doctests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] itholic commented on pull request #42183: [SPARK-43871][PS] Enable SeriesDateTimeTests for pandas 2.0.0.

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on PR #42183:
URL: https://github.com/apache/spark/pull/42183#issuecomment-1652896156

   Thanks @kumarn for taking care of pandas support! 😺 
   
   Yeah, the change itself looks pretty reasonable to me, but it has actually already been addressed as part of the bulk behavior update PR, https://github.com/apache/spark/pull/40658.
   
   Most of these tickets have actually already been addressed in the bulk PR, so let me add a comment to each ticket that it already covers (since I have no authority to mark tickets as "in progress"). Sorry for the confusion 🙏 
   
   But you may still be able to find another pandas 2 ticket that does not duplicate https://github.com/apache/spark/pull/40658.
   
   I also recommend a good starter task: SPARK-37935, which addresses error class / error message refinement. It's relatively simple, and it's also a good opportunity to explore various parts of the Apache Spark code base. There are many example PRs to refer to, such as https://github.com/apache/spark/pull/42109 and https://github.com/apache/spark/pull/42018 :-).




[GitHub] [spark] HyukjinKwon commented on pull request #42183: [SPARK-43871][PS][PySpark] Enable SeriesDateTimeTests for pandas 2.0.0.

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #42183:
URL: https://github.com/apache/spark/pull/42183#issuecomment-1652865501

   cc @itholic 




[GitHub] [spark] itholic commented on pull request #42183: [SPARK-43871][PS] Enable SeriesDateTimeTests for pandas 2.0.0.

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on PR #42183:
URL: https://github.com/apache/spark/pull/42183#issuecomment-1652974938

   No problem! Please feel free to ping me if you need any assistance or have any questions.




[GitHub] [spark] kumarn commented on pull request #42183: [SPARK-43871][PS] Enable SeriesDateTimeTests for pandas 2.0.0.

Posted by "kumarn (via GitHub)" <gi...@apache.org>.
kumarn commented on PR #42183:
URL: https://github.com/apache/spark/pull/42183#issuecomment-1652937449

   Thank you for the pointers, and thank you for the clarification @itholic. I appreciate it.
   
   Let me close this one out as it is already addressed, and try my hand at one of the issues you suggested. 




[GitHub] [spark] kumarn closed pull request #42183: [SPARK-43871][PS] Enable SeriesDateTimeTests for pandas 2.0.0.

Posted by "kumarn (via GitHub)" <gi...@apache.org>.
kumarn closed pull request #42183: [SPARK-43871][PS] Enable SeriesDateTimeTests for pandas 2.0.0.
URL: https://github.com/apache/spark/pull/42183




[GitHub] [spark] kumarn commented on pull request #42183: [SPARK-43871][PS][PySpark] Enable SeriesDateTimeTests for pandas 2.0.0.

Posted by "kumarn (via GitHub)" <gi...@apache.org>.
kumarn commented on PR #42183:
URL: https://github.com/apache/spark/pull/42183#issuecomment-1652867027

   @itholic I am new around here, and I was looking for a good starter task. I thought this was something I could handle, and I noticed a little too late your comment on the parent ticket about holding off on pandas 2.0 migration tasks until we start preparing for 4.0. Sorry about that. I will hold off on merging, but I just wanted to see my first task through. 
   
   Could you mark the issue (SPARK-43871) as in progress, to avoid duplication of effort by someone else? We can revisit this whenever we are ready to take it up. 

