Posted to github@arrow.apache.org by "westonpace (via GitHub)" <gi...@apache.org> on 2023/06/29 16:00:31 UTC

[GitHub] [arrow] westonpace commented on issue #33188: [Parquet][C++][Python] "List index overflow" when read parquet file

westonpace commented on issue #33188:
URL: https://github.com/apache/arrow/issues/33188#issuecomment-1613460257

   > is there a pyarrow API for that?
   
   Are you creating these tables in Python?  If so, you could cast the list columns to `large_list` before writing:
   
   ```
   import pandas as pd
   import pyarrow as pa
   import pyarrow.compute as pc
   import pyarrow.parquet as pq
   import numpy as np
   
   big_arr = np.zeros(1024*1024*1024, dtype=np.int8)
   straw_that_broke_the_camels_back = np.zeros(1, dtype=np.int8)
   big_series = pd.Series([big_arr, big_arr, straw_that_broke_the_camels_back])
   big_df = pd.DataFrame({"big": big_series})
   
   # Will not be readable by pyarrow
   big_df.to_parquet("/tmp/unreadable.parquet")
   
   big_table = pa.Table.from_pandas(big_df)
   new_columns = []
   for column in big_table.columns:
       if isinstance(column.type, pa.ListType):
           new_columns.append(pc.cast(column, pa.large_list(column.type.value_type)))
       else:
           new_columns.append(column)
   
   new_table = pa.Table.from_arrays(new_columns, names=big_table.schema.names)
   # This will contain the same data but use large_list and thus will be readable
   pq.write_table(new_table, "/tmp/readable.parquet")
   ```
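
A rough sketch of the arithmetic behind the cast, under the assumption that the reader tries to assemble the whole column into a single list array: Arrow's `list` type stores 32-bit signed offsets into its child values, while `large_list` stores 64-bit offsets. The variable names here are illustrative only and are not part of any pyarrow API.

```python
# Child values in the example column: two arrays of 1024**3 int8 values
# plus one extra single-element array.
big_len = 1024 * 1024 * 1024
total_values = 2 * big_len + 1

# Largest offset representable by each list flavor.
INT32_MAX = 2**31 - 1  # list: 32-bit signed offsets
INT64_MAX = 2**63 - 1  # large_list: 64-bit signed offsets

fits_in_list = total_values <= INT32_MAX
fits_in_large_list = total_values <= INT64_MAX

print(total_values, INT32_MAX)
print(fits_in_list, fits_in_large_list)
```

The total number of child values exceeds what a 32-bit offset can address, which is consistent with the "List index overflow" error, while a 64-bit offset has ample headroom.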

