Posted to github@arrow.apache.org by "jorisvandenbossche (via GitHub)" <gi...@apache.org> on 2023/03/09 06:52:53 UTC

[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #34498: GH-34404: [Python] Failing tests because pandas.Index can now store all numeric dtypes (not only 64bit versions)

jorisvandenbossche commented on code in PR #34498:
URL: https://github.com/apache/arrow/pull/34498#discussion_r1130550569


##########
python/pyarrow/tests/parquet/test_dataset.py:
##########
@@ -735,8 +735,15 @@ def _partition_test_for_filesystem(fs, base_path, use_legacy_dataset=True):
                    .reset_index(drop=True)
                    .reindex(columns=result_df.columns))
 
-    expected_df['foo'] = pd.Categorical(df['foo'], categories=foo_keys)
-    expected_df['bar'] = pd.Categorical(df['bar'], categories=bar_keys)
+    if use_legacy_dataset or Version(pd.__version__) < Version("2.0.0"):
+        expected_df['foo'] = pd.Categorical(df['foo'], categories=foo_keys)
+        expected_df['bar'] = pd.Categorical(df['bar'], categories=bar_keys)
+    else:
+        # With pandas 2.0.0 Index can store all numeric dtypes (not just
+        # int64/uint64/float64). Using astype() to create a categorical
+        # column preserves original dtype (int32)
+        expected_df['foo'] = expected_df['foo'].astype("category")
+        expected_df['bar'] = expected_df['bar'].astype("category")

Review Comment:
   Would this new approach (using `astype("category")`) work for all pandas versions? If so, the if/else is not needed.
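
A minimal sketch of the idea behind this suggestion, not taken from the PR: `astype("category")` derives the categories from the column's own values, so no explicit `categories=` list (and no version check) is involved. The int32 `foo` column below is a made-up stand-in for the partition column in the test, and `as_categorical` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def as_categorical(column: pd.Series) -> pd.Series:
    """Hypothetical helper: convert a column to categorical via astype()."""
    # The categories are inferred from the column's unique values (sorted),
    # rather than passed in explicitly as with pd.Categorical(..., categories=...).
    return column.astype("category")


# Made-up int32 data standing in for the 'foo' partition column.
foo = pd.Series(np.array([0, 1, 0, 2], dtype="int32"), name="foo")
foo_cat = as_categorical(foo)

categories = list(foo_cat.cat.categories)
codes = list(foo_cat.cat.codes)
```

The sketch deliberately makes no claim about the dtype of `foo_cat.cat.categories`: per the comment in the diff, that is exactly what depends on the installed pandas version.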



##########
python/pyarrow/tests/test_compute.py:
##########
@@ -1934,22 +1934,48 @@ def _check_datetime_components(timestamps, timezone=None):
         [iso_year, iso_week, iso_day],
         fields=iso_calendar_fields)
 
-    assert pc.year(tsa).equals(pa.array(ts.dt.year))
+    year = ts.dt.year
+    month = ts.dt.month
+    day = ts.dt.day
+    dayofweek = ts.dt.dayofweek
+    dayofyear = ts.dt.dayofyear
+    quarter = ts.dt.quarter
+    hour = ts.dt.hour
+    minute = ts.dt.minute
+    second = ts.dt.second.values
+    microsecond = ts.dt.microsecond
+    nanosecond = ts.dt.nanosecond
+    if Version(pd.__version__) >= Version("2.0.0"):
+        # Casting is required because pandas with 2.0.0 various numeric
+        # date/time attributes have dtype int32 (previously int64)

Review Comment:
   Same here: it probably does no harm to always cast to int64 above, without this extra `if` block (if the values are already int64, the cast should be a no-op).
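
A minimal sketch of the unconditional cast, assuming a small made-up timestamp Series (`ts` here is not the test's fixture): the datetime components are cast to int64 regardless of what dtype pandas returns for them.

```python
import pandas as pd

# Made-up timestamps; the real test uses its own fixture.
ts = pd.Series(pd.to_datetime(["2023-03-09 06:52:53", "2021-12-31 23:59:59"]))

# Always cast, with no version check: if the component is already int64,
# the values and dtype are unchanged.
year = ts.dt.year.astype("int64")
hour = ts.dt.hour.astype("int64")

# Casting an int64 Series to int64 leaves it equal to the original.
already_int64 = pd.Series([1, 2, 3], dtype="int64")
recast = already_int64.astype("int64")
```

Note that `astype` returns a new object by default, so "no-op" here refers to the values and dtype, not to memory being shared.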


