Posted to github@arrow.apache.org by "westonpace (via GitHub)" <gi...@apache.org> on 2023/06/29 16:00:31 UTC
[GitHub] [arrow] westonpace commented on issue #33188: [Parquet][C++][Python] "List index overflow" when read parquet file
westonpace commented on issue #33188:
URL: https://github.com/apache/arrow/issues/33188#issuecomment-1613460257
> is there a pyarrow API for that?
Are you creating these tables in Python? If so, you could cast the columns before writing:
```
import pandas as pd
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq
import numpy as np
big_arr = np.zeros(1024*1024*1024, dtype=np.int8)
straw_that_broke_the_camels_back = np.zeros(1, dtype=np.int8)
big_series = pd.Series([big_arr, big_arr, straw_that_broke_the_camels_back])
big_df = pd.DataFrame({"big": big_series})
# Will not be readable by pyarrow
big_df.to_parquet("/tmp/unreadable.parquet")
big_table = pa.Table.from_pandas(big_df)
new_columns = []
for column in big_table.columns:
    if isinstance(column.type, pa.ListType):
        # large_list uses 64-bit offsets, so it is not limited to 2^31 - 1 child elements
        new_columns.append(pc.cast(column, pa.large_list(column.type.value_type)))
    else:
        new_columns.append(column)
new_table = pa.Table.from_arrays(new_columns, names=big_table.schema.names)
# This will contain the same data but use large_list and thus will be readable
pq.write_table(new_table, "/tmp/readable.parquet")
```