You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/06/21 22:30:54 UTC

[GitHub] [arrow-rs] ghuls edited a comment on issue #286: Unable to load Feather v2 files created by pyarrow and pandas.

ghuls edited a comment on issue #286:
URL: https://github.com/apache/arrow-rs/issues/286#issuecomment-865384731


   @jorgecarleitao I think I might have figured out the problem.
   
   ```python
   import polars as pl
   import pyarrow as pa
   import pandas as pd
   
   # Read Feather file written with pandas, with pa,feather.read_feather (wrapped inside pl.read_ipc) in Polars dataframe.
   df_pl = pl.read_ipc('test_pandas.feather', use_pyarrow=True)
   
   # Convert Polars dataframe to arrow table and write to Feather v2 file without compression (with pyarrow).
   pa.feather.write_feather(df_pl.to_arrow(), 'test_polars_to_arrow_uncompressed.feather', compression='uncompressed', version=2)
   
   # Convert Polars dataframe to arrow table and write to Feather v2 file without compression (with pyarrow).
   pa.feather.write_feather(df_pl.to_arrow(), 'test_polars_to_arrow_lz4.feather', compression='lz4', version=2)
   
   # Convert Polars dataframe to arrow table and convert arrow table to pandas dataframe and write to Feather v2 file without compression (with pyarrow).
   pa.feather.write_feather(df_pl.to_arrow().to_pandas(), 'test_polars_to_arrow_to_pandas_uncompressed.feather', compression='uncompressed', version=2)
   
   # Convert Polars dataframe to arrow table and convert arrow table to pandas dataframe and write to Feather v2 file with lz4 compression (with pyarrow).
   pa.feather.write_feather(df_pl.to_arrow().to_pandas(), 'test_polars_to_arrow_to_pandas_lz4.feather', compression='lz4', version=2)
   
   
   # Now try to read all those files with polars without using the pyarrow Feather reading code, but the arrow-rs code instead.
   
   # Reading Feather v2 file without compression containing saved arrow table data, works.
   In [9]: pl.read_ipc('test_polars_to_arrow_uncompressed.feather', use_pyarrow=False)
   Out[9]: 
   shape: (7, 5)
   ╭────────────────────┬────────┬─────────────────────┬────────────────────┬─────────╮
   │ motif1             ┆ motif2 ┆ motif3              ┆ motif4             ┆ regions │
   │ ---                ┆ ---    ┆ ---                 ┆ ---                ┆ ---     │
   │ f32                ┆ f32    ┆ f32                 ┆ f32                ┆ str     │
   ╞════════════════════╪════════╪═════════════════════╪════════════════════╪═════════╡
   │ 1.2000000476837158 ┆ 3      ┆ 0.30000001192092896 ┆ 5.599999904632568  ┆ "reg1"  │
   ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
   │ 6.699999809265137  ┆ 3      ┆ 4.300000190734863   ┆ 5.599999904632568  ┆ "reg2"  │
   ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
   │ 3.5                ┆ 3      ┆ 0.0                 ┆ 0.0                ┆ "reg3"  │
   ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
   │ 0.0                ┆ 3      ┆ 0.0                 ┆ 5.599999904632568  ┆ "reg4"  │
   ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
   │ 2.4000000953674316 ┆ 3      ┆ 7.800000190734863   ┆ 1.2000000476837158 ┆ "reg5"  │
   ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
   │ 2.4000000953674316 ┆ 3      ┆ 0.6000000238418579  ┆ 0.0                ┆ "reg6"  │
   ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
   │ 2.4000000953674316 ┆ 3      ┆ 7.699999809265137   ┆ 0.0                ┆ "reg7"  │
   ╰────────────────────┴────────┴─────────────────────┴────────────────────┴─────────╯
   
   
   # Reading Feather v2 file without compression containing saved pandas dataframe, works.
   In [10]: pl.read_ipc('test_polars_to_arrow_to_pandas_uncompressed.feather', use_pyarrow=False)
   Out[10]: 
   shape: (7, 5)
   ╭────────────────────┬────────┬─────────────────────┬────────────────────┬─────────╮
   │ motif1             ┆ motif2 ┆ motif3              ┆ motif4             ┆ regions │
   │ ---                ┆ ---    ┆ ---                 ┆ ---                ┆ ---     │
   │ f32                ┆ f32    ┆ f32                 ┆ f32                ┆ str     │
   ╞════════════════════╪════════╪═════════════════════╪════════════════════╪═════════╡
   │ 1.2000000476837158 ┆ 3      ┆ 0.30000001192092896 ┆ 5.599999904632568  ┆ "reg1"  │
   ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
   │ 6.699999809265137  ┆ 3      ┆ 4.300000190734863   ┆ 5.599999904632568  ┆ "reg2"  │
   ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
   │ 3.5                ┆ 3      ┆ 0.0                 ┆ 0.0                ┆ "reg3"  │
   ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
   │ 0.0                ┆ 3      ┆ 0.0                 ┆ 5.599999904632568  ┆ "reg4"  │
   ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
   │ 2.4000000953674316 ┆ 3      ┆ 7.800000190734863   ┆ 1.2000000476837158 ┆ "reg5"  │
   ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
   │ 2.4000000953674316 ┆ 3      ┆ 0.6000000238418579  ┆ 0.0                ┆ "reg6"  │
   ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
   │ 2.4000000953674316 ┆ 3      ┆ 7.699999809265137   ┆ 0.0                ┆ "reg7"  │
   ╰────────────────────┴────────┴─────────────────────┴────────────────────┴─────────╯
   
   
   # Reading Feather v2 file with lz4 compression containing saved pandas dataframe, gives the error from the first post.
   In [11]: pl.read_ipc('test_polars_to_arrow_to_pandas_lz4.feather', use_pyarrow=False)
   thread '<unnamed>' panicked at 'assertion failed: prefix.is_empty() && suffix.is_empty()', /github/home/.cargo/git/checkouts/arrow-rs-3b86e19e889d5acc/9f56afb/arrow/src/buffer/immutable.rs:179:9
   note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
   ---------------------------------------------------------------------------
   PanicException                            Traceback (most recent call last)
   <ipython-input-11-04613b1d0975> in <module>
   ----> 1 pl.read_ipc('test_polars_to_arrow_to_pandas_lz4.feather', use_pyarrow=False)
   /software/miniconda3/envs/cisTopic/lib/python3.7/site-packages/polars/functions.py in read_ipc(file, use_pyarrow)
       337     """
       338     file = _prepare_file_arg(file)
   --> 339     return DataFrame.read_ipc(file, use_pyarrow)
       340 
       341 
   
   /software/miniconda3/envs/cisTopic/lib/python3.7/site-packages/polars/frame.py in read_ipc(file, use_pyarrow)
       302 
       303         self = DataFrame.__new__(DataFrame)
   --> 304         self._df = PyDataFrame.read_ipc(file)
       305         return self
       306 
   
   PanicException: assertion failed: prefix.is_empty() && suffix.is_empty()
   
   
   # Reading Feather v2 file with lz4 compression containing saved pyarrow table, results in killing of iPython due to trying to allocate a too big buffer.
   In [12]: pl.read_ipc('test_polars_to_arrow_lz4.feather', use_pyarrow=False)
   Out[12]: memory allocation of 2702793507844465093 bytes failed
   Aborted
   ```
   
   So to me it looks like that arrow-rs is not detecting that pyarrow saved the Feather file with lz4 compression and I guess it is reading data (or offsets) from the wrong locations.
   
   ```python
   In [6]: ?pa.feather.write_feather
   Signature:
   pa.feather.write_feather(
       df,
       dest,
       compression=None,
       compression_level=None,
       chunksize=None,
       version=2,
   )
   Docstring:
   Write a pandas.DataFrame to Feather format.
   
   Parameters
   ----------
   df : pandas.DataFrame or pyarrow.Table
       Data to write out as Feather format.
   dest : str
       Local destination path.
   compression : string, default None
       Can be one of {"zstd", "lz4", "uncompressed"}. The default of None uses
       LZ4 for V2 files if it is available, otherwise uncompressed.
   compression_level : int, default None
       Use a compression level particular to the chosen compressor. If None
       use the default compression level
   chunksize : int, default None
       For V2 files, the internal maximum size of Arrow RecordBatch chunks
       when writing the Arrow IPC file format. None means use the default,
       which is currently 64K
   version : int, default 2
       Feather file version. Version 2 is the current. Version 1 is the more
       limited legacy format
   File:      /software/miniconda3/envs/cisTopic/lib/python3.7/site-packages/pyarrow/feather.py
   Type:      function
   ```
   
   Feather files are attached:
   [test_feather_polars_to_pyarrow.zip](https://github.com/apache/arrow-rs/files/6689794/test_feather_polars_to_pyarrow.zip)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org