Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/17 04:53:06 UTC

[GitHub] [arrow] kato1208 opened a new issue #12652: How to avoid overflow error.

kato1208 opened a new issue #12652:
URL: https://github.com/apache/arrow/issues/12652


   I know that passing `safe=False` to `pa.Table.from_pandas` ignores overflow errors, but it does not ignore overflow inside a list or a struct.
   
   This works:
   ```
   import pyarrow as pa
   import pyarrow.parquet as pq
   import pandas as pd
   import json
   
   test_json = [
       {
           "name": "taro",
           "id": 3046682132,  # exceeds the int32 maximum (2147483647)
           "points": [2, 2, 2],
           "groups": {
               "group_name": "baseball", 
               "group_id": 1234
           }
       },
       { 
           "name": "taro",
           "id": 1234, 
       }
   ]
   
   schema = pa.schema([
       pa.field('name', pa.string()),
       pa.field('id', pa.int32()),
       pa.field("points", pa.list_(pa.int32())),
       pa.field('groups', pa.struct([
           pa.field("group_name", pa.string()),
           pa.field("group_id", pa.int32()),
       ])),
   ])
   
   writer = pq.ParquetWriter('test_schema.parquet', schema=schema)
   df = pd.DataFrame(test_json)
   table = pa.Table.from_pandas(df, schema=schema, safe=False)
   writer.write_table(table)
   writer.close()
   
   table = pq.read_table("test_schema.parquet")
   print(table)
   ```
   ```
   name: [["taro","taro"]]
   id: [[-1248285164,1234]]
   points: [[[2,2,2],null]]
   groups: [
   -- is_valid: [
   true,
   false
   ]
   -- child 0 type: string
   [
   "baseball",
   null
   ]
   -- child 1 type: int32
   [
   1234,
   null
   ]]
   ```
   However, the following two inputs fail, each with the traceback shown after it.
   ```
   test_json = [
       {
           "name": "taro",
           "id": 2,
           "points": [2, 3046682132, 2],  # overflowing value nested in a list
           "groups": {
               "group_name": "baseball", 
               "group_id": 1234
           }
       },
       { 
           "name": "taro",
           "id": 1234, 
       }
   ]
   ```
   ```
   Traceback (most recent call last):
   File "test_pyarrow.py", line 35, in <module>
   table = pa.Table.from_pandas(df, schema=schema, safe=False)
   File "pyarrow/table.pxi", line 1782, in pyarrow.lib.Table.from_pandas
   File "/home/s0108403058/.pyenv/versions/3.8.0/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 594, in dataframe_to_arrays
   arrays = [convert_column(c, f)
   File "/home/s0108403058/.pyenv/versions/3.8.0/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 594, in <listcomp>
   arrays = [convert_column(c, f)
   File "/home/s0108403058/.pyenv/versions/3.8.0/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 581, in convert_column
   raise e
   File "/home/s0108403058/.pyenv/versions/3.8.0/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 575, in convert_column
   result = pa.array(col, type=type_, from_pandas=True, safe=safe)
   File "pyarrow/array.pxi", line 312, in pyarrow.lib.array
   File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
   pyarrow.lib.ArrowInvalid: ('Value 3046682132 too large to fit in C integer type', 'Conversion failed for column points with type object')
   ```
   ```
   test_json = [
       {
           "name": "taro",
           "id": 2,
           "points": [2, 2, 2],
           "groups": {
               "group_name": "baseball", 
               "group_id": 3046682132  # overflowing value nested in a struct
           }
       },
       { 
           "name": "taro",
           "id": 1234, 
       }
   ]
   ```
   ```
   Traceback (most recent call last):
   File "test_pyarrow.py", line 35, in <module>
   table = pa.Table.from_pandas(df, schema=schema, safe=False)
   File "pyarrow/table.pxi", line 1782, in pyarrow.lib.Table.from_pandas
   File "/home/s0108403058/.pyenv/versions/3.8.0/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 594, in dataframe_to_arrays
   arrays = [convert_column(c, f)
   File "/home/s0108403058/.pyenv/versions/3.8.0/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 594, in <listcomp>
   arrays = [convert_column(c, f)
   File "/home/s0108403058/.pyenv/versions/3.8.0/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 581, in convert_column
   raise e
   File "/home/s0108403058/.pyenv/versions/3.8.0/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 575, in convert_column
   result = pa.array(col, type=type_, from_pandas=True, safe=safe)
   File "pyarrow/array.pxi", line 312, in pyarrow.lib.array
   File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
   pyarrow.lib.ArrowInvalid: ('Value 3046682132 too large to fit in C integer type', 'Conversion failed for column groups with type object')
   ```
   How can I resolve this error?
   
   pyarrow==7.0.0
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] wjones127 commented on issue #12652: [Python] How to avoid overflow error.

Posted by GitBox <gi...@apache.org>.

wjones127 commented on issue #12652:
URL: https://github.com/apache/arrow/issues/12652#issuecomment-1072539371


   > How can I ignore overflow error?
   
   Well, it sounds like there is an issue where `safe` isn't being pushed down into nested arrays, so I don't think this is currently possible. You are welcome to file an issue on the Apache Jira to implement that pushdown.
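
Until `safe=False` reaches nested values, one possible workaround is to wrap the out-of-range integers in plain Python before the records are handed to pandas and pyarrow. The sketch below is only an illustration: `wrap_int32` and `wrap_nested` are hypothetical helper names, not pyarrow functions, and it assumes that every integer inside the records should wrap the way the top-level `id` column does with `safe=False`.

```python
INT32_MIN = -2**31
INT32_SPAN = 2**32


def wrap_int32(value):
    """Wrap a Python int into the int32 range using two's-complement rules."""
    return (value - INT32_MIN) % INT32_SPAN + INT32_MIN


def wrap_nested(obj):
    """Recursively wrap every int found inside lists and dicts."""
    # bool is a subclass of int, so leave True/False untouched.
    if isinstance(obj, bool):
        return obj
    if isinstance(obj, int):
        return wrap_int32(obj)
    if isinstance(obj, list):
        return [wrap_nested(item) for item in obj]
    if isinstance(obj, dict):
        return {key: wrap_nested(item) for key, item in obj.items()}
    return obj  # strings, None, floats, and so on pass through unchanged


# One record from the failing examples, with both nested overflows present.
record = {
    "name": "taro",
    "id": 2,
    "points": [2, 3046682132, 2],
    "groups": {"group_name": "baseball", "group_id": 3046682132},
}
cleaned = wrap_nested(record)
print(cleaned["points"])
print(cleaned["groups"]["group_id"])
```

The cleaned records can then go through `pd.DataFrame(...)` and `pa.Table.from_pandas(df, schema=schema)` as in the original script. Every nested integer now fits in int32, so the conversion should no longer have anything to reject. The stored numbers are still not the original values, so this only makes sense when wraparound is truly acceptable.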
   



[GitHub] [arrow] kato1208 commented on issue #12652: [Python] How to avoid overflow error.

Posted by GitBox <gi...@apache.org>.

kato1208 commented on issue #12652:
URL: https://github.com/apache/arrow/issues/12652#issuecomment-1072021083


   @wjones127 
   Thank you for your comment!
   Unfortunately, there are reasons why the schema cannot be changed, so we would like to allow the overflow.
   How can I ignore overflow error?
   We apologize for the inconvenience and thank you in advance for your time.



[GitHub] [arrow] wjones127 commented on issue #12652: [Python] How to avoid overflow error.

Posted by GitBox <gi...@apache.org>.

wjones127 commented on issue #12652:
URL: https://github.com/apache/arrow/issues/12652#issuecomment-1071004210


   @kato1208 The preferred way to solve overflow is to change your schema to use types that are appropriate for the data. Look at your first example: The `"id"` value starts as `3046682132`, but when converted to Arrow (with `safe=False`) comes out as `-1248285164`. I doubt that's what you actually want.
   
   That id value is just too large to fit into an int32. If you change the schema to use int64 you can run this without any overflow errors and without needing `safe=False`. And it will give you correct data.
   
   ```python
   schema = pa.schema([
       pa.field('name', pa.string()),
       pa.field('id', pa.int64()),
       pa.field("points", pa.list_(pa.int32())),
       pa.field('groups', pa.struct([
           pa.field("group_name", pa.string()),
           pa.field("group_id", pa.int64()),
       ])),
   ])
   ```
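
The arithmetic behind those numbers can be checked with the standard library alone. This is a small sketch, not part of pyarrow: `ctypes.c_int32` does no overflow checking and keeps only the low 32 bits, which gives the same wrapped value reported in this thread for `safe=False`.

```python
import ctypes

INT32_MAX = 2**31 - 1   # 2147483647
INT64_MAX = 2**63 - 1

value = 3046682132

# Range checks: the value is too large for int32 but fits easily in int64.
fits_int32 = value <= INT32_MAX
fits_int64 = value <= INT64_MAX

# Keep only the low 32 bits and reinterpret them as a signed integer.
wrapped = ctypes.c_int32(value).value

print(fits_int32, fits_int64, wrapped)
```

The wrapped result equals `value - 2**32`, which is why the stored `id` turns negative once the original exceeds the int32 maximum.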
   
   


