Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/31 13:07:55 UTC

[GitHub] [arrow] AlenkaF opened a new pull request #12300: ARROW-15253: [Python] Error in to_pandas for empty dataframe with pd.interval_range index

AlenkaF opened a new pull request #12300:
URL: https://github.com/apache/arrow/pull/12300


   This PR adds a check for the column name in `_get_extension_dtypes()` in `pandas_compat.py` to fix an error when round-tripping an empty DataFrame with a `pd.interval_range` index through pandas.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #12300: ARROW-15253: [Python] Error in to_pandas for empty dataframe with index with extension type

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12300:
URL: https://github.com/apache/arrow/pull/12300#issuecomment-1028867740


   Benchmark runs are scheduled for baseline = 360252b6bedbc69c4191bc3102282a1e7d57ad29 and contender = 56d060ca197352f575edced64e6a1fbc9331b336. 56d060ca197352f575edced64e6a1fbc9331b336 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/7e0b5c1bc6fc4a4f9cc218f75402eddb...732dc06332604b5f9b325af50f8dece5/)
   [Failed] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/8b3c20fe52414b2e9500933051a79eb4...904e7917b8954329beeef50209a97232/)
   [Finished :arrow_down:0.3% :arrow_up:0.04%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/bada73252a1244a29f1b4c54104d7087...c5bbbeb6853842e0921b872d4958ee84/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   





[GitHub] [arrow] ursabot edited a comment on pull request #12300: ARROW-15253: [Python] Error in to_pandas for empty dataframe with index with extension type

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12300:
URL: https://github.com/apache/arrow/pull/12300#issuecomment-1028867740


   Benchmark runs are scheduled for baseline = 360252b6bedbc69c4191bc3102282a1e7d57ad29 and contender = 56d060ca197352f575edced64e6a1fbc9331b336. 56d060ca197352f575edced64e6a1fbc9331b336 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/7e0b5c1bc6fc4a4f9cc218f75402eddb...732dc06332604b5f9b325af50f8dece5/)
   [Failed :arrow_down:0.0% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/8b3c20fe52414b2e9500933051a79eb4...904e7917b8954329beeef50209a97232/)
   [Finished :arrow_down:0.3% :arrow_up:0.04%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/bada73252a1244a29f1b4c54104d7087...c5bbbeb6853842e0921b872d4958ee84/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   





[GitHub] [arrow] jorisvandenbossche commented on pull request #12300: ARROW-15253: [Python] Error in to_pandas for empty dataframe with index with extension type

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #12300:
URL: https://github.com/apache/arrow/pull/12300#issuecomment-1028739596


   The failing lint build seems unrelated, I restarted it.





[GitHub] [arrow] github-actions[bot] commented on pull request #12300: ARROW-15253: [Python] Error in to_pandas for empty dataframe with pd.interval_range index

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #12300:
URL: https://github.com/apache/arrow/pull/12300#issuecomment-1025720391


   https://issues.apache.org/jira/browse/ARROW-15253





[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #12300: ARROW-15253: [Python] Error in to_pandas for empty dataframe with index with extension type

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #12300:
URL: https://github.com/apache/arrow/pull/12300#discussion_r798346269



##########
File path: python/pyarrow/tests/test_pandas.py
##########
@@ -4082,6 +4082,18 @@ def test_array_to_pandas():
         # tm.assert_series_equal(result, expected)
 
 
+def test_roundtrip_empty_table_with_extension_dtype_index():
+    if Version(pd.__version__) < Version("1.0.0"):
+        pytest.skip("ExtensionDtype to_pandas method missing")
+
+    df = pd.DataFrame(index=pd.interval_range(start=0, end=3))
+    table = pa.table(df)
+    table.to_pandas().index == pd.Index([{'left': 0, 'right': 1},

Review comment:
       This is a different issue, but what strikes me here is that the result is an object-dtype index, and not a proper IntervalIndex (so the interval dtype is not preserved on a roundtrip, while this would be the case if it was a normal column instead of the index)







[GitHub] [arrow] ursabot commented on pull request #12300: ARROW-15253: [Python] Error in to_pandas for empty dataframe with index with extension type

Posted by GitBox <gi...@apache.org>.
ursabot commented on pull request #12300:
URL: https://github.com/apache/arrow/pull/12300#issuecomment-1028867740


   Benchmark runs are scheduled for baseline = 360252b6bedbc69c4191bc3102282a1e7d57ad29 and contender = 56d060ca197352f575edced64e6a1fbc9331b336. 56d060ca197352f575edced64e6a1fbc9331b336 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Scheduled] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/7e0b5c1bc6fc4a4f9cc218f75402eddb...732dc06332604b5f9b325af50f8dece5/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/8b3c20fe52414b2e9500933051a79eb4...904e7917b8954329beeef50209a97232/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/bada73252a1244a29f1b4c54104d7087...c5bbbeb6853842e0921b872d4958ee84/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   





[GitHub] [arrow] ursabot edited a comment on pull request #12300: ARROW-15253: [Python] Error in to_pandas for empty dataframe with index with extension type

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12300:
URL: https://github.com/apache/arrow/pull/12300#issuecomment-1028867740


   Benchmark runs are scheduled for baseline = 360252b6bedbc69c4191bc3102282a1e7d57ad29 and contender = 56d060ca197352f575edced64e6a1fbc9331b336. 56d060ca197352f575edced64e6a1fbc9331b336 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/7e0b5c1bc6fc4a4f9cc218f75402eddb...732dc06332604b5f9b325af50f8dece5/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/8b3c20fe52414b2e9500933051a79eb4...904e7917b8954329beeef50209a97232/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/bada73252a1244a29f1b4c54104d7087...c5bbbeb6853842e0921b872d4958ee84/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   





[GitHub] [arrow] AlenkaF commented on a change in pull request #12300: ARROW-15253: [Python] Error in to_pandas for empty dataframe with index with extension type

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on a change in pull request #12300:
URL: https://github.com/apache/arrow/pull/12300#discussion_r799406409



##########
File path: python/pyarrow/tests/test_pandas.py
##########
@@ -4082,6 +4082,18 @@ def test_array_to_pandas():
         # tm.assert_series_equal(result, expected)
 
 
+def test_roundtrip_empty_table_with_extension_dtype_index():
+    if Version(pd.__version__) < Version("1.0.0"):
+        pytest.skip("ExtensionDtype to_pandas method missing")
+
+    df = pd.DataFrame(index=pd.interval_range(start=0, end=3))
+    table = pa.table(df)
+    table.to_pandas().index == pd.Index([{'left': 0, 'right': 1},

Review comment:
       Yes, that bothered me also (see comment in https://issues.apache.org/jira/browse/ARROW-15253).
   
   I created a separate issue to get this corrected:
   https://issues.apache.org/jira/browse/ARROW-15565







[GitHub] [arrow] ursabot edited a comment on pull request #12300: ARROW-15253: [Python] Error in to_pandas for empty dataframe with index with extension type

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12300:
URL: https://github.com/apache/arrow/pull/12300#issuecomment-1028867740


   Benchmark runs are scheduled for baseline = 360252b6bedbc69c4191bc3102282a1e7d57ad29 and contender = 56d060ca197352f575edced64e6a1fbc9331b336. 56d060ca197352f575edced64e6a1fbc9331b336 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/7e0b5c1bc6fc4a4f9cc218f75402eddb...732dc06332604b5f9b325af50f8dece5/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/8b3c20fe52414b2e9500933051a79eb4...904e7917b8954329beeef50209a97232/)
   [Finished :arrow_down:0.3% :arrow_up:0.04%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/bada73252a1244a29f1b4c54104d7087...c5bbbeb6853842e0921b872d4958ee84/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   





[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #12300: ARROW-15253: [Python] Error in to_pandas for empty dataframe with index with extension type

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #12300:
URL: https://github.com/apache/arrow/pull/12300#discussion_r796589838



##########
File path: python/pyarrow/tests/test_pandas.py
##########
@@ -4082,6 +4082,18 @@ def test_array_to_pandas():
         # tm.assert_series_equal(result, expected)
 
 
+def test_roundtrip_empty_table_with_intervalrange_index():

Review comment:
       ```suggestion
   def test_roundtrip_empty_table_with_extension_dtype_index():
   ```
   
   The issue is not specific to IntervalDtype; it affects any extension dtype that defines a `__from_arrow__` (the interval type is one example of this in pandas).

##########
File path: python/pyarrow/pandas_compat.py
##########
@@ -822,8 +822,12 @@ def _get_extension_dtypes(table, columns_metadata, types_mapper=None):
 
     # infer the extension columns from the pandas metadata
     for col_meta in columns_metadata:
-        name = col_meta['name']
+        if col_meta['name']:
+            name = col_meta['name']
+        else:
+            name = col_meta['field_name']

Review comment:
       Maybe we can simply always use `col_meta["field_name"]`?
   I don't think there is a case where that would be incorrect (as that should always map to the name in the arrow table, while `col_meta["name"]` doesn't always match exactly)
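
To illustrate the difference between the two keys, here is a minimal sketch using a hand-written metadata entry. The dict below is an assumption modelled on how pandas metadata records an unnamed index (a `None` name and a generated `__index_level_0__` field name), and `arrow_column_names` is a hypothetical stand-in for the column names of the Arrow table; neither is taken from this PR.

```python
# Hand-written stand-in for one entry of the pandas metadata "columns" list.
# Assumption: an unnamed index is stored with name=None and a generated
# field_name such as "__index_level_0__".
col_meta = {
    "name": None,
    "field_name": "__index_level_0__",
}

# Hypothetical column names of the corresponding Arrow table.
arrow_column_names = ["__index_level_0__"]

# Looking the column up by "name" misses it, while "field_name" finds it.
name_matches = col_meta["name"] in arrow_column_names
field_name_matches = col_meta["field_name"] in arrow_column_names
print(name_matches, field_name_matches)
```

Under these assumptions only the `field_name` lookup matches a column in the table, which is the mismatch described in the comment above.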







[GitHub] [arrow] AlenkaF commented on a change in pull request #12300: ARROW-15253: [Python] Error in to_pandas for empty dataframe with index with extension type

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on a change in pull request #12300:
URL: https://github.com/apache/arrow/pull/12300#discussion_r797371352



##########
File path: python/pyarrow/pandas_compat.py
##########
@@ -822,8 +822,12 @@ def _get_extension_dtypes(table, columns_metadata, types_mapper=None):
 
     # infer the extension columns from the pandas metadata
     for col_meta in columns_metadata:
-        name = col_meta['name']
+        if col_meta['name']:
+            name = col_meta['name']
+        else:
+            name = col_meta['field_name']

Review comment:
       If we only use `col_meta["field_name"]`, we get KeyErrors for the missing `field_name` in the parquet tests. Could this have something to do with fastparquet (0.3.2)?
   
   https://github.com/apache/arrow/blob/ad073b7c0fec80ce88aaf1e7d6a78104711952f2/python/pyarrow/tests/test_pandas.py#L4258-L4262










[GitHub] [arrow] jorisvandenbossche closed pull request #12300: ARROW-15253: [Python] Error in to_pandas for empty dataframe with index with extension type

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche closed pull request #12300:
URL: https://github.com/apache/arrow/pull/12300


   





[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #12300: ARROW-15253: [Python] Error in to_pandas for empty dataframe with index with extension type

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #12300:
URL: https://github.com/apache/arrow/pull/12300#discussion_r797393033



##########
File path: python/pyarrow/pandas_compat.py
##########
@@ -822,8 +822,12 @@ def _get_extension_dtypes(table, columns_metadata, types_mapper=None):
 
     # infer the extension columns from the pandas metadata
     for col_meta in columns_metadata:
-        name = col_meta['name']
+        if col_meta['name']:
+            name = col_meta['name']
+        else:
+            name = col_meta['field_name']

Review comment:
       Ah, yes, that's for compatibility with old metadata where the "field_name" can be missing. Maybe instead of this if/else, you could then also do 
   
   ```
   try:
       name = col_meta["field_name"]
   except KeyError:
       name = col_meta["name"]
   ```
   
   so it is clearer that this is a fallback.
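
As a rough, self-contained sketch of how this fallback behaves, the suggestion can be wrapped in a small helper and run against two hand-written metadata entries. The helper name `resolve_column_name` and both dicts are hypothetical and only serve the illustration; the actual change lives inline in `_get_extension_dtypes()`.

```python
def resolve_column_name(col_meta):
    """Prefer "field_name"; fall back to "name" for old metadata."""
    try:
        return col_meta["field_name"]
    except KeyError:
        # Old metadata may not have a "field_name" key at all.
        return col_meta["name"]


# Hypothetical entry for an unnamed index in current metadata.
current = {"name": None, "field_name": "__index_level_0__"}
# Hypothetical entry from old metadata without "field_name".
legacy = {"name": "a"}

print(resolve_column_name(current))
print(resolve_column_name(legacy))
```

With this ordering, a `None` value under `"name"` is never consulted when `"field_name"` is present, and old metadata still resolves through `"name"`.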
   



