You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by "AlenkaF (via GitHub)" <gi...@apache.org> on 2023/03/14 13:14:51 UTC

[GitHub] [arrow] AlenkaF opened a new pull request, #34559: GH-34165: [Python] Extension array data type should default to the storage type if to_pandas_dtype is not implemented

AlenkaF opened a new pull request, #34559:
URL: https://github.com/apache/arrow/pull/34559

   ### Rationale for this change
   Method `to_pandas` fails with `KeyError` if a table has an extension array as a column with extension dtype not having `to_pandas_dtype` defined. In this cases we should fall back to storage type of the extension array.
   
   ### What changes are included in this PR?
   Changes in `arrow_to_pandas.cc` at:
   - `GetBlockType` for `ConvertTableToPandas`
   - `ConvertChunkedArrayToPandas`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] AlenkaF commented on pull request #34559: GH-34165: [Python] Extension array data type should default to the storage type if to_pandas_dtype is not implemented

Posted by "AlenkaF (via GitHub)" <gi...@apache.org>.

AlenkaF commented on PR #34559:
URL: https://github.com/apache/arrow/pull/34559#issuecomment-1497369012

   @github-actions crossbow submit test-conda-python-3.8-pandas-nightly


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on pull request #34559: GH-34165: [Python] Extension array data type should default to the storage type if to_pandas_dtype is not implemented

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.

jorisvandenbossche commented on PR #34559:
URL: https://github.com/apache/arrow/pull/34559#issuecomment-1496127323

   Maybe update once more, then the nightly pandas builds should actually pass


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] AlenkaF commented on pull request #34559: GH-34165: [Python] Extension array data type should default to the storage type if to_pandas_dtype is not implemented

Posted by "AlenkaF (via GitHub)" <gi...@apache.org>.

AlenkaF commented on PR #34559:
URL: https://github.com/apache/arrow/pull/34559#issuecomment-1497024986

   Nightly is now passing, but I have to look at the failures in the CI.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] github-actions[bot] commented on pull request #34559: GH-34165: [Python] Extension array data type should default to the storage type if to_pandas_dtype is not implemented

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] commented on PR #34559:
URL: https://github.com/apache/arrow/pull/34559#issuecomment-1488176494

   Revision: 090fb870c1f4c5962c9728e3a076a36b63ab9e25
   
   Submitted crossbow builds: [ursacomputing/crossbow @ actions-8ae53f7b73](https://github.com/ursacomputing/crossbow/branches/all?query=actions-8ae53f7b73)
   
   |Task|Status|
   |----|------|
   |test-conda-python-3.8-pandas-nightly|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-8ae53f7b73-github-test-conda-python-3.8-pandas-nightly)](https://github.com/ursacomputing/crossbow/actions/runs/4551997501/jobs/8026770676)|


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] AlenkaF commented on pull request #34559: GH-34165: [Python] Extension array data type should default to the storage type if to_pandas_dtype is not implemented

Posted by "AlenkaF (via GitHub)" <gi...@apache.org>.

AlenkaF commented on PR #34559:
URL: https://github.com/apache/arrow/pull/34559#issuecomment-1490023438

   @github-actions crossbow submit test-conda-python-3.8-pandas-nightly


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] github-actions[bot] commented on pull request #34559: GH-34165: [Python] Extension array data type should default to the storage type if to_pandas_dtype is not implemented

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] commented on PR #34559:
URL: https://github.com/apache/arrow/pull/34559#issuecomment-1468084549

   * Closes: #34165


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] AlenkaF commented on pull request #34559: GH-34165: [Python] Extension array data type should default to the storage type if to_pandas_dtype is not implemented

Posted by "AlenkaF (via GitHub)" <gi...@apache.org>.

AlenkaF commented on PR #34559:
URL: https://github.com/apache/arrow/pull/34559#issuecomment-1488802671

   Thanks for looking at this! I wasn't sure what was the issue for the same reason you mentioned (latest crossbow nightly builds were fine).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche merged pull request #34559: GH-34165: [Python] Extension array data type should default to the storage type if to_pandas_dtype is not implemented

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.

jorisvandenbossche merged PR #34559:
URL: https://github.com/apache/arrow/pull/34559


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] AlenkaF commented on a diff in pull request #34559: GH-34165: [Python] Extension array data type should default to the storage type if to_pandas_dtype is not implemented

Posted by "AlenkaF (via GitHub)" <gi...@apache.org>.

AlenkaF commented on code in PR #34559:
URL: https://github.com/apache/arrow/pull/34559#discussion_r1150471816


##########
python/pyarrow/tests/test_extension_type.py:
##########
@@ -1127,3 +1142,51 @@ def test_cpp_extension_in_python(tmpdir):
     reconstructed_array = batch.column(0)
     assert reconstructed_array.type == uuid_type
     assert reconstructed_array == array
+
+
+@pytest.mark.pandas
+def test_extension_to_pandas_storage_type(registered_period_type):
+    period_type, _ = registered_period_type
+    np_arr = np.array([1, 2, 3, 4], dtype='i8')
+    storage = pa.array([1, 2, 3, 4], pa.int64())
+    arr = pa.ExtensionArray.from_storage(period_type, storage)
+
+    if isinstance(period_type, PeriodTypeWithToPandasDtype):
+        pandas_dtype = period_type.to_pandas_dtype()
+    else:
+        pandas_dtype = np_arr.dtype
+
+    # Test arrays
+    result = arr.to_pandas()
+    assert result.dtype == pandas_dtype
+
+    # Test the change in ConvertChunkedArrayToPandas
+    chunked_arr = pa.chunked_array([arr])
+    result = chunked_arr.to_numpy()
+    assert result.dtype == np_arr.dtype
+
+    result = chunked_arr.to_pandas()
+    assert result.dtype == pandas_dtype
+
+    # Test the change in ConvertTableToPandas
+    data = [
+        pa.array([1, 2, 3, 4]),
+        pa.array(['foo', 'bar', None, None]),
+        pa.array([True, None, True, False]),
+        arr
+    ]
+    my_schema = pa.schema([('f0', pa.int8()),
+                           ('f1', pa.string()),
+                           ('f2', pa.bool_()),
+                           ('ext', period_type)])
+    table = pa.Table.from_arrays(data, schema=my_schema)
+    result = table.to_pandas()
+    assert result["ext"].dtype == pandas_dtype
+
+    import pandas as pd
+    if Version(pd.__version__) < Version("1.5.0"):
+        pytest.skip("ArrowDtype missing")
+
+        # Check the usage of types_mapper
+        result = table.to_pandas(types_mapper=pd.ArrowDtype)
+        assert isinstance(result["ext"].dtype, pd.ArrowDtype)

Review Comment:
   Will go with the alternative (only run this part for newer pandas)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on pull request #34559: GH-34165: [Python] Extension array data type should default to the storage type if to_pandas_dtype is not implemented

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.

jorisvandenbossche commented on PR #34559:
URL: https://github.com/apache/arrow/pull/34559#issuecomment-1488843355

   To make progress on this PR in the meantime, you could add a `llvmdev<16` to `conda_env_gandiva.txt`, I _think_ that should solve it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] AlenkaF commented on pull request #34559: GH-34165: [Python] Extension array data type should default to the storage type if to_pandas_dtype is not implemented

Posted by "AlenkaF (via GitHub)" <gi...@apache.org>.

AlenkaF commented on PR #34559:
URL: https://github.com/apache/arrow/pull/34559#issuecomment-1496986651

   @github-actions crossbow submit test-conda-python-3.8-pandas-nightly


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] github-actions[bot] commented on pull request #34559: GH-34165: [Python] Extension array data type should default to the storage type if to_pandas_dtype is not implemented

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] commented on PR #34559:
URL: https://github.com/apache/arrow/pull/34559#issuecomment-1497373397

   Revision: 7a3288454d830c54b9685172225d087c19675ffc
   
   Submitted crossbow builds: [ursacomputing/crossbow @ actions-921b93c0d8](https://github.com/ursacomputing/crossbow/branches/all?query=actions-921b93c0d8)
   
   |Task|Status|
   |----|------|
   |test-conda-python-3.8-pandas-nightly|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-921b93c0d8-github-test-conda-python-3.8-pandas-nightly)](https://github.com/ursacomputing/crossbow/actions/runs/4618232260/jobs/8165423558)|


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ursabot commented on pull request #34559: GH-34165: [Python] Extension array data type should default to the storage type if to_pandas_dtype is not implemented

Posted by "ursabot (via GitHub)" <gi...@apache.org>.

ursabot commented on PR #34559:
URL: https://github.com/apache/arrow/pull/34559#issuecomment-1498607521

   Benchmark runs are scheduled for baseline = ad115be1214b13ce393537bdb9c34ae919e4997f and contender = 01e76ba69ff5bbf6f0cd16b3bf98fbbdfc47105f. 01e76ba69ff5bbf6f0cd16b3bf98fbbdfc47105f is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/d4fd56e6ba764b0bb24e86d2fa715778...0493a4439bbf4895b637fb3177cb9701/)
   [Failed :arrow_down:0.0% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/f3991d05986940ab93d1ac091b00da56...43b662af380348308ab25bdb33aedb67/)
   [Finished :arrow_down:0.77% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/09f20a19046e40c0b41d260cd1ed0b0b...e261b7f3cc074f188af3cd0dd83db772/)
   [Failed :arrow_down:0.0% :arrow_up:0.0%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/1c60565acb6041ed83b248f400dd7e95...c4a2602899204adba94f2625c2b64aea/)
   Buildkite builds:
   [Finished] [`01e76ba6` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2636)
   [Failed] [`01e76ba6` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2668)
   [Finished] [`01e76ba6` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2635)
   [Failed] [`01e76ba6` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2659)
   [Finished] [`ad115be1` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2635)
   [Failed] [`ad115be1` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2667)
   [Finished] [`ad115be1` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2634)
   [Failed] [`ad115be1` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2658)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] github-actions[bot] commented on pull request #34559: GH-34165: [Python] Extension array data type should default to the storage type if to_pandas_dtype is not implemented

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] commented on PR #34559:
URL: https://github.com/apache/arrow/pull/34559#issuecomment-1490027407

   Revision: e6280573dfdba67533dae6db279060ab1c33c1ad
   
   Submitted crossbow builds: [ursacomputing/crossbow @ actions-aae73fcd07](https://github.com/ursacomputing/crossbow/branches/all?query=actions-aae73fcd07)
   
   |Task|Status|
   |----|------|
   |test-conda-python-3.8-pandas-nightly|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-aae73fcd07-github-test-conda-python-3.8-pandas-nightly)](https://github.com/ursacomputing/crossbow/actions/runs/4563054826/jobs/8051063826)|


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] github-actions[bot] commented on pull request #34559: GH-34165: [Python] Extension array data type should default to the storage type if to_pandas_dtype is not implemented

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] commented on PR #34559:
URL: https://github.com/apache/arrow/pull/34559#issuecomment-1488122733

   Revision: 090fb870c1f4c5962c9728e3a076a36b63ab9e25
   
   Submitted crossbow builds: [ursacomputing/crossbow @ actions-98b85af9bf](https://github.com/ursacomputing/crossbow/branches/all?query=actions-98b85af9bf)
   
   |Task|Status|
   |----|------|
   |test-conda-python-3.8-pandas-nightly|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-98b85af9bf-github-test-conda-python-3.8-pandas-nightly)](https://github.com/ursacomputing/crossbow/actions/runs/4551649097/jobs/8025976809)|


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] github-actions[bot] commented on pull request #34559: GH-34165: [Python] Extension array data type should default to the storage type if to_pandas_dtype is not implemented

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] commented on PR #34559:
URL: https://github.com/apache/arrow/pull/34559#issuecomment-1496988549

   Revision: 580ab3a0368028edc24fee85abde383f9b120a05
   
   Submitted crossbow builds: [ursacomputing/crossbow @ actions-fabea40e18](https://github.com/ursacomputing/crossbow/branches/all?query=actions-fabea40e18)
   
   |Task|Status|
   |----|------|
   |test-conda-python-3.8-pandas-nightly|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-fabea40e18-github-test-conda-python-3.8-pandas-nightly)](https://github.com/ursacomputing/crossbow/actions/runs/4615651818/jobs/8159754073)|


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #34559: GH-34165: [Python] Extension array data type should default to the storage type if to_pandas_dtype is not implemented

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.

jorisvandenbossche commented on code in PR #34559:
URL: https://github.com/apache/arrow/pull/34559#discussion_r1143805864


##########
python/pyarrow/tests/test_extension_type.py:
##########
@@ -1127,3 +1141,45 @@ def test_cpp_extension_in_python(tmpdir):
     reconstructed_array = batch.column(0)
     assert reconstructed_array.type == uuid_type
     assert reconstructed_array == array
+
+
+def test_extension_to_pandas_storage_type(registered_period_type):
+    period_type, _ = registered_period_type
+    np_arr = np.array([1, 2, 3, 4])
+    storage = pa.array([1, 2, 3, 4], pa.int64())
+    arr = pa.ExtensionArray.from_storage(period_type, storage)
+
+    if isinstance(period_type, PeriodTypeWithToPandasDtype):
+        pandas_dtype = period_type.to_pandas_dtype()
+    else:
+        pandas_dtype = np_arr.dtype
+
+    # Test arrays
+    result = arr.to_pandas()
+    assert result.dtype == pandas_dtype
+
+    # Test the change in ConvertChunkedArrayToPandas
+    chunked_arr = pa.chunked_array([arr])
+    result = chunked_arr.to_numpy()
+    assert result.dtype == np_arr.dtype
+
+    result = chunked_arr.to_pandas()
+    # TODO: to_pandas should take use of to_pandas_dtype
+    # if defined!
+    # assert result.dtype == pandas_dtype
+    assert result.dtype == np_arr.dtype
+
+    # Test the change in ConvertTableToPandas
+    data = [
+        pa.array([1, 2, 3, 4]),
+        pa.array(['foo', 'bar', None, None]),
+        pa.array([True, None, True, False]),
+        arr
+    ]
+    my_schema = pa.schema([('f0', pa.int8()),
+                           ('f1', pa.string()),
+                           ('f2', pa.bool_()),
+                           ('ext', period_type)])
+    table = pa.Table.from_arrays(data, schema=my_schema)
+    result = table.to_pandas()
+    assert result["ext"].dtype == pandas_dtype

Review Comment:
   We should indeed ensure that extension types are hashable, but yes that's worth a separate issue. 
   
   Now, for `types_mapper`, you don't have to use a dict (it only provides a convenient way to specify it). The keyword takes a generic callable.
   
   > But using `PeriodType` with the `__hash__` method and the `types_mapper` (latest main, not this PR) I get an error on pandas side:
   
   Ah, yes, that's because the Int64Dtype from pandas doesn't expect an extension type. You can probably use `pd.ArrowDtype` instead, that should probably work (I think that's the only pandas extension dtype that should work with any pyarrow extension type).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] AlenkaF commented on a diff in pull request #34559: GH-34165: [Python] Extension array data type should default to the storage type if to_pandas_dtype is not implemented

Posted by "AlenkaF (via GitHub)" <gi...@apache.org>.

AlenkaF commented on code in PR #34559:
URL: https://github.com/apache/arrow/pull/34559#discussion_r1144801496


##########
python/pyarrow/tests/test_extension_type.py:
##########
@@ -1127,3 +1141,45 @@ def test_cpp_extension_in_python(tmpdir):
     reconstructed_array = batch.column(0)
     assert reconstructed_array.type == uuid_type
     assert reconstructed_array == array
+
+
+def test_extension_to_pandas_storage_type(registered_period_type):
+    period_type, _ = registered_period_type
+    np_arr = np.array([1, 2, 3, 4])
+    storage = pa.array([1, 2, 3, 4], pa.int64())
+    arr = pa.ExtensionArray.from_storage(period_type, storage)
+
+    if isinstance(period_type, PeriodTypeWithToPandasDtype):
+        pandas_dtype = period_type.to_pandas_dtype()
+    else:
+        pandas_dtype = np_arr.dtype
+
+    # Test arrays
+    result = arr.to_pandas()
+    assert result.dtype == pandas_dtype
+
+    # Test the change in ConvertChunkedArrayToPandas
+    chunked_arr = pa.chunked_array([arr])
+    result = chunked_arr.to_numpy()
+    assert result.dtype == np_arr.dtype
+
+    result = chunked_arr.to_pandas()
+    # TODO: to_pandas should take use of to_pandas_dtype
+    # if defined!
+    # assert result.dtype == pandas_dtype
+    assert result.dtype == np_arr.dtype
+
+    # Test the change in ConvertTableToPandas
+    data = [
+        pa.array([1, 2, 3, 4]),
+        pa.array(['foo', 'bar', None, None]),
+        pa.array([True, None, True, False]),
+        arr
+    ]
+    my_schema = pa.schema([('f0', pa.int8()),
+                           ('f1', pa.string()),
+                           ('f2', pa.bool_()),
+                           ('ext', period_type)])
+    table = pa.Table.from_arrays(data, schema=my_schema)
+    result = table.to_pandas()
+    assert result["ext"].dtype == pandas_dtype

Review Comment:
   True for the callable. Will add that to the test (a callable that is not a dict and `pd.ArrowDtype`).
   
   It feels a bit out of place that the `types_mapper` argument doesn't really work for extension columns, as the only case that it does is if the function is not a dict and it transforms to a pandas `pd.ArrowDtype`. Maybe adding the info to the docs might be good.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #34559: GH-34165: [Python] Extension array data type should default to the storage type if to_pandas_dtype is not implemented

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.

jorisvandenbossche commented on code in PR #34559:
URL: https://github.com/apache/arrow/pull/34559#discussion_r1150345050


##########
python/pyarrow/tests/test_extension_type.py:
##########
@@ -1127,3 +1142,51 @@ def test_cpp_extension_in_python(tmpdir):
     reconstructed_array = batch.column(0)
     assert reconstructed_array.type == uuid_type
     assert reconstructed_array == array
+
+
+@pytest.mark.pandas
+def test_extension_to_pandas_storage_type(registered_period_type):
+    period_type, _ = registered_period_type
+    np_arr = np.array([1, 2, 3, 4], dtype='i8')
+    storage = pa.array([1, 2, 3, 4], pa.int64())
+    arr = pa.ExtensionArray.from_storage(period_type, storage)
+
+    if isinstance(period_type, PeriodTypeWithToPandasDtype):
+        pandas_dtype = period_type.to_pandas_dtype()
+    else:
+        pandas_dtype = np_arr.dtype
+
+    # Test arrays
+    result = arr.to_pandas()
+    assert result.dtype == pandas_dtype
+
+    # Test the change in ConvertChunkedArrayToPandas

Review Comment:
   ```suggestion
       # Test chunked arrays
   ```
   
   (later we won't know what this "change" is about)



##########
python/pyarrow/tests/test_extension_type.py:
##########
@@ -1127,3 +1142,51 @@ def test_cpp_extension_in_python(tmpdir):
     reconstructed_array = batch.column(0)
     assert reconstructed_array.type == uuid_type
     assert reconstructed_array == array
+
+
+@pytest.mark.pandas
+def test_extension_to_pandas_storage_type(registered_period_type):
+    period_type, _ = registered_period_type
+    np_arr = np.array([1, 2, 3, 4], dtype='i8')
+    storage = pa.array([1, 2, 3, 4], pa.int64())
+    arr = pa.ExtensionArray.from_storage(period_type, storage)
+
+    if isinstance(period_type, PeriodTypeWithToPandasDtype):
+        pandas_dtype = period_type.to_pandas_dtype()
+    else:
+        pandas_dtype = np_arr.dtype
+
+    # Test arrays
+    result = arr.to_pandas()
+    assert result.dtype == pandas_dtype
+
+    # Test the change in ConvertChunkedArrayToPandas
+    chunked_arr = pa.chunked_array([arr])
+    result = chunked_arr.to_numpy()
+    assert result.dtype == np_arr.dtype
+
+    result = chunked_arr.to_pandas()
+    assert result.dtype == pandas_dtype
+
+    # Test the change in ConvertTableToPandas

Review Comment:
   ```suggestion
       # Test Table.to_pandas
   ```



##########
python/pyarrow/tests/test_extension_type.py:
##########
@@ -1127,3 +1142,51 @@ def test_cpp_extension_in_python(tmpdir):
     reconstructed_array = batch.column(0)
     assert reconstructed_array.type == uuid_type
     assert reconstructed_array == array
+
+
+@pytest.mark.pandas
+def test_extension_to_pandas_storage_type(registered_period_type):
+    period_type, _ = registered_period_type
+    np_arr = np.array([1, 2, 3, 4], dtype='i8')
+    storage = pa.array([1, 2, 3, 4], pa.int64())
+    arr = pa.ExtensionArray.from_storage(period_type, storage)
+
+    if isinstance(period_type, PeriodTypeWithToPandasDtype):
+        pandas_dtype = period_type.to_pandas_dtype()
+    else:
+        pandas_dtype = np_arr.dtype
+
+    # Test arrays
+    result = arr.to_pandas()
+    assert result.dtype == pandas_dtype
+
+    # Test the change in ConvertChunkedArrayToPandas
+    chunked_arr = pa.chunked_array([arr])
+    result = chunked_arr.to_numpy()
+    assert result.dtype == np_arr.dtype
+
+    result = chunked_arr.to_pandas()
+    assert result.dtype == pandas_dtype
+
+    # Test the change in ConvertTableToPandas
+    data = [
+        pa.array([1, 2, 3, 4]),
+        pa.array(['foo', 'bar', None, None]),
+        pa.array([True, None, True, False]),
+        arr
+    ]
+    my_schema = pa.schema([('f0', pa.int8()),
+                           ('f1', pa.string()),
+                           ('f2', pa.bool_()),
+                           ('ext', period_type)])
+    table = pa.Table.from_arrays(data, schema=my_schema)
+    result = table.to_pandas()
+    assert result["ext"].dtype == pandas_dtype
+
+    import pandas as pd
+    if Version(pd.__version__) < Version("1.5.0"):
+        pytest.skip("ArrowDtype missing")
+
+        # Check the usage of types_mapper
+        result = table.to_pandas(types_mapper=pd.ArrowDtype)
+        assert isinstance(result["ext"].dtype, pd.ArrowDtype)

Review Comment:
   ```suggestion
       # Check the usage of types_mapper
       result = table.to_pandas(types_mapper=pd.ArrowDtype)
       assert isinstance(result["ext"].dtype, pd.ArrowDtype)
   ```
   
   This should be de-dented, so that this actually runs for latest pandas versions?
   
   (once you do that, you will also need to trigger crossbow to run the test-conda-python-3.8-pandas-nightly build)



##########
python/pyarrow/tests/test_extension_type.py:
##########
@@ -1127,3 +1142,51 @@ def test_cpp_extension_in_python(tmpdir):
     reconstructed_array = batch.column(0)
     assert reconstructed_array.type == uuid_type
     assert reconstructed_array == array
+
+
+@pytest.mark.pandas
+def test_extension_to_pandas_storage_type(registered_period_type):
+    period_type, _ = registered_period_type
+    np_arr = np.array([1, 2, 3, 4], dtype='i8')
+    storage = pa.array([1, 2, 3, 4], pa.int64())
+    arr = pa.ExtensionArray.from_storage(period_type, storage)
+
+    if isinstance(period_type, PeriodTypeWithToPandasDtype):
+        pandas_dtype = period_type.to_pandas_dtype()
+    else:
+        pandas_dtype = np_arr.dtype
+
+    # Test arrays
+    result = arr.to_pandas()
+    assert result.dtype == pandas_dtype
+
+    # Test the change in ConvertChunkedArrayToPandas
+    chunked_arr = pa.chunked_array([arr])
+    result = chunked_arr.to_numpy()
+    assert result.dtype == np_arr.dtype
+
+    result = chunked_arr.to_pandas()
+    assert result.dtype == pandas_dtype
+
+    # Test the change in ConvertTableToPandas
+    data = [
+        pa.array([1, 2, 3, 4]),
+        pa.array(['foo', 'bar', None, None]),
+        pa.array([True, None, True, False]),
+        arr
+    ]
+    my_schema = pa.schema([('f0', pa.int8()),
+                           ('f1', pa.string()),
+                           ('f2', pa.bool_()),
+                           ('ext', period_type)])
+    table = pa.Table.from_arrays(data, schema=my_schema)
+    result = table.to_pandas()
+    assert result["ext"].dtype == pandas_dtype
+
+    import pandas as pd
+    if Version(pd.__version__) < Version("1.5.0"):
+        pytest.skip("ArrowDtype missing")

Review Comment:
   ```suggestion
       if Version(pd.__version__) >= Version("2.0.0.dev0"):
   ```
   
   Alternative fix, it might be better to not mark the full test as "skipped", but just only run this part for newer pandas



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] AlenkaF commented on pull request #34559: GH-34165: [Python] Extension array data type should default to the storage type if to_pandas_dtype is not implemented

Posted by "AlenkaF (via GitHub)" <gi...@apache.org>.

AlenkaF commented on PR #34559:
URL: https://github.com/apache/arrow/pull/34559#issuecomment-1488118830

   @github-actions crossbow submit test-conda-python-3.8-pandas-nightly


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on pull request #34559: GH-34165: [Python] Extension array data type should default to the storage type if to_pandas_dtype is not implemented

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.

jorisvandenbossche commented on PR #34559:
URL: https://github.com/apache/arrow/pull/34559#issuecomment-1488778807

   Hmm, the pandas nightly build is failing to build (not even running the tests), some LLVM related error for gandiva. I don't (yet) see this on the latest crossbow nightly build, though


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on pull request #34559: GH-34165: [Python] Extension array data type should default to the storage type if to_pandas_dtype is not implemented

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.

jorisvandenbossche commented on PR #34559:
URL: https://github.com/apache/arrow/pull/34559#issuecomment-1488782892

   Ah, llvm updated from 15 to 16, and our cmake requires 15. We should probably pin that temporarily in the env files.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #34559: GH-34165: [Python] Extension array data type should default to the storage type if to_pandas_dtype is not implemented

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.

jorisvandenbossche commented on code in PR #34559:
URL: https://github.com/apache/arrow/pull/34559#discussion_r1136632650


##########
python/pyarrow/array.pxi:
##########
@@ -2888,23 +2888,6 @@ cdef class ExtensionArray(Array):
         # otherwise convert the storage array with the base implementation
         return Array._to_pandas(self.storage, options, **kwargs)
 
-    def to_numpy(self, **kwargs):
-        """
-        Convert extension array to a numpy ndarray.
-
-        This method simply delegates to the underlying storage array.

Review Comment:
   Can you move this sentence to the docstring of the to_numpy method on the base class ("For this extension arrays, this method ...")



##########
python/pyarrow/tests/test_extension_type.py:
##########
@@ -1127,3 +1141,45 @@ def test_cpp_extension_in_python(tmpdir):
     reconstructed_array = batch.column(0)
     assert reconstructed_array.type == uuid_type
     assert reconstructed_array == array
+
+
+def test_extension_to_pandas_storage_type(registered_period_type):
+    period_type, _ = registered_period_type
+    np_arr = np.array([1, 2, 3, 4])
+    storage = pa.array([1, 2, 3, 4], pa.int64())
+    arr = pa.ExtensionArray.from_storage(period_type, storage)
+
+    if isinstance(period_type, PeriodTypeWithToPandasDtype):
+        pandas_dtype = period_type.to_pandas_dtype()
+    else:
+        pandas_dtype = np_arr.dtype
+
+    # Test arrays
+    result = arr.to_pandas()
+    assert result.dtype == pandas_dtype
+
+    # Test the change in ConvertChunkedArrayToPandas
+    chunked_arr = pa.chunked_array([arr])
+    result = chunked_arr.to_numpy()
+    assert result.dtype == np_arr.dtype
+
+    result = chunked_arr.to_pandas()
+    # TODO: to_pandas should take use of to_pandas_dtype
+    # if defined!
+    # assert result.dtype == pandas_dtype
+    assert result.dtype == np_arr.dtype
+
+    # Test the change in ConvertTableToPandas
+    data = [
+        pa.array([1, 2, 3, 4]),
+        pa.array(['foo', 'bar', None, None]),
+        pa.array([True, None, True, False]),
+        arr
+    ]
+    my_schema = pa.schema([('f0', pa.int8()),
+                           ('f1', pa.string()),
+                           ('f2', pa.bool_()),
+                           ('ext', period_type)])
+    table = pa.Table.from_arrays(data, schema=my_schema)
+    result = table.to_pandas()
+    assert result["ext"].dtype == pandas_dtype

Review Comment:
   Maybe we can test that specifying `types_mapper` still overrides it?



##########
python/pyarrow/tests/test_extension_type.py:
##########
@@ -1127,3 +1141,45 @@ def test_cpp_extension_in_python(tmpdir):
     reconstructed_array = batch.column(0)
     assert reconstructed_array.type == uuid_type
     assert reconstructed_array == array
+
+
+def test_extension_to_pandas_storage_type(registered_period_type):
+    period_type, _ = registered_period_type
+    np_arr = np.array([1, 2, 3, 4])
+    storage = pa.array([1, 2, 3, 4], pa.int64())
+    arr = pa.ExtensionArray.from_storage(period_type, storage)
+
+    if isinstance(period_type, PeriodTypeWithToPandasDtype):
+        pandas_dtype = period_type.to_pandas_dtype()
+    else:
+        pandas_dtype = np_arr.dtype
+
+    # Test arrays
+    result = arr.to_pandas()
+    assert result.dtype == pandas_dtype
+
+    # Test the change in ConvertChunkedArrayToPandas
+    chunked_arr = pa.chunked_array([arr])
+    result = chunked_arr.to_numpy()
+    assert result.dtype == np_arr.dtype
+
+    result = chunked_arr.to_pandas()
+    # TODO: to_pandas should take use of to_pandas_dtype
+    # if defined!
+    # assert result.dtype == pandas_dtype
+    assert result.dtype == np_arr.dtype

Review Comment:
   Yeah, for `Array.to_pandas`, we have some custom logic for this in the cython code. We might want to do something similar for ChunkedArray?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] AlenkaF commented on a diff in pull request #34559: GH-34165: [Python] Extension array data type should default to the storage type if to_pandas_dtype is not implemented

Posted by "AlenkaF (via GitHub)" <gi...@apache.org>.

AlenkaF commented on code in PR #34559:
URL: https://github.com/apache/arrow/pull/34559#discussion_r1142094232


##########
python/pyarrow/tests/test_extension_type.py:
##########
@@ -1127,3 +1141,45 @@ def test_cpp_extension_in_python(tmpdir):
     reconstructed_array = batch.column(0)
     assert reconstructed_array.type == uuid_type
     assert reconstructed_array == array
+
+
+def test_extension_to_pandas_storage_type(registered_period_type):
+    period_type, _ = registered_period_type
+    np_arr = np.array([1, 2, 3, 4])
+    storage = pa.array([1, 2, 3, 4], pa.int64())
+    arr = pa.ExtensionArray.from_storage(period_type, storage)
+
+    if isinstance(period_type, PeriodTypeWithToPandasDtype):
+        pandas_dtype = period_type.to_pandas_dtype()
+    else:
+        pandas_dtype = np_arr.dtype
+
+    # Test arrays
+    result = arr.to_pandas()
+    assert result.dtype == pandas_dtype
+
+    # Test the change in ConvertChunkedArrayToPandas
+    chunked_arr = pa.chunked_array([arr])
+    result = chunked_arr.to_numpy()
+    assert result.dtype == np_arr.dtype
+
+    result = chunked_arr.to_pandas()
+    # TODO: to_pandas should take use of to_pandas_dtype
+    # if defined!
+    # assert result.dtype == pandas_dtype
+    assert result.dtype == np_arr.dtype
+
+    # Test the change in ConvertTableToPandas
+    data = [
+        pa.array([1, 2, 3, 4]),
+        pa.array(['foo', 'bar', None, None]),
+        pa.array([True, None, True, False]),
+        arr
+    ]
+    my_schema = pa.schema([('f0', pa.int8()),
+                           ('f1', pa.string()),
+                           ('f2', pa.bool_()),
+                           ('ext', period_type)])
+    table = pa.Table.from_arrays(data, schema=my_schema)
+    result = table.to_pandas()
+    assert result["ext"].dtype == pandas_dtype

Review Comment:
   This doesn't work currently (without the change in this PR also). The problem is that the extension type is unhashable so we can not even define a `types_mapper` with an extension type:
   
   ```python
   >>> period_type = PeriodType('D')
   >>> {period_type: pd.Int64Dtype()}
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   TypeError: unhashable type: 'PeriodType'
   ```
   
   The `DataType` base class has the `__hash__` method implemented so I guess that should work for `ExtensionType` subclass also? If we would want `types_mapper` keyword to work for extension types, I guess, we would have to add `__hash__` method in the definition of the type:
   
   ```python
   class PeriodType(pa.ExtensionType):
       ...
       def __hash__(self):
           return hash(str(self)) 
   ```
   
   But using `PeriodType` with the `__hash__` method and the `types_mapper` (latest main, not this PR) I get an error on pandas side:
   
   ```python
   >>> types_mapper = {period_type: pd.Int64Dtype()}.get
   >>> table.to_pandas(types_mapper=types_mapper)
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "pyarrow/array.pxi", line 835, in pyarrow.lib._PandasConvertible.to_pandas
       return self._to_pandas(options, categories=categories,
     File "pyarrow/table.pxi", line 4111, in pyarrow.lib.Table._to_pandas
       mgr = table_to_blockmanager(
     File "/Users/alenkafrim/repos/arrow/python/pyarrow/pandas_compat.py", line 820, in table_to_blockmanager
       blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
     File "/Users/alenkafrim/repos/arrow/python/pyarrow/pandas_compat.py", line 1171, in _table_to_blocks
       return [_reconstruct_block(item, columns, extension_columns)
     File "/Users/alenkafrim/repos/arrow/python/pyarrow/pandas_compat.py", line 1171, in <listcomp>
       return [_reconstruct_block(item, columns, extension_columns)
     File "/Users/alenkafrim/repos/arrow/python/pyarrow/pandas_compat.py", line 780, in _reconstruct_block
       pd_ext_arr = pandas_dtype.__from_arrow__(arr)
     File "/Users/alenkafrim/repos/pyarrow-dev/lib/python3.10/site-packages/pandas/core/arrays/numeric.py", line 88, in __from_arrow__
       raise TypeError(
   TypeError: Expected array of Int64 type, got extension<test.period<PeriodType>> instead
   ```
   
   This looks it deserves a separate issue.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] AlenkaF commented on pull request #34559: GH-34165: [Python] Extension array data type should default to the storage type if to_pandas_dtype is not implemented

Posted by "AlenkaF (via GitHub)" <gi...@apache.org>.

AlenkaF commented on PR #34559:
URL: https://github.com/apache/arrow/pull/34559#issuecomment-1488172649

   @github-actions crossbow submit test-conda-python-3.8-pandas-nightly


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] AlenkaF commented on pull request #34559: GH-34165: [Python] Extension array data type should default to the storage type if to_pandas_dtype is not implemented

Posted by "AlenkaF (via GitHub)" <gi...@apache.org>.

AlenkaF commented on PR #34559:
URL: https://github.com/apache/arrow/pull/34559#issuecomment-1490079851

   @jorisvandenbossche comparing the test failures in the pandas nightly build for this PR and the latest crossbow nightly build (https://github.com/ursacomputing/crossbow/actions/runs/4559282266) it seems there is no extra test failing.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org