You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/11/11 16:36:00 UTC

[jira] [Commented] (ARROW-10056) [Python] PyArrow writes invalid Feather v2 file: OSError: Verification of flatbuffer-encoded Footer failed.

    [ https://issues.apache.org/jira/browse/ARROW-10056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17230074#comment-17230074 ] 

Joris Van den Bossche commented on ARROW-10056:
-----------------------------------------------

Coming back to this again: there is an inherent limitation of flatbuffers at play, I think. A single FlatBuffer has a max size of 2GB (see eg https://stackoverflow.com/questions/59185516/do-flatbuffers-have-size-limits for an explanation why). 

So that means we have a theoretical max number of columns we can write to an IPC file (this number then depends on some things, like whether there are custom metadata present (like the pandas metadata), or the exact type (some types have more metadata)). But so even for plain floats with no custom metadata (like your last example) has an inherent max number of columns. 

(note this is only when serializing, the Table itself (in memory) can have more columns. But of course serializing is a critical aspect ..)

That said, we should still ensure we don't _write_ such invalid data, though, but already error while writing, I would say.

> [Python] PyArrow writes invalid Feather v2 file: OSError: Verification of flatbuffer-encoded Footer failed.
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-10056
>                 URL: https://issues.apache.org/jira/browse/ARROW-10056
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 1.0.1
>         Environment: CentOS7
> conda environment with pyarrow 1.0.1, numpy 1.19.1 and pandas 1.1.1
>            Reporter: Gert Hulselmans
>            Priority: Major
>             Fix For: 3.0.0
>
>
> pyarrow writes an invalid Feather v2 file, which it can't read afterwards.
> {code:java}
>     OSError: Verification of flatbuffer-encoded Footer failed.
> {code}
> The following code reproduces the problem for me:
> {code:python}
> import pyarrow as pa
> import numpy as np
> import pandas as pd
> nbr_regions = 1223024
> nbr_motifs = 4891
> # Create (big) dataframe.
> df = pd.DataFrame(
>     np.arange(nbr_regions * nbr_motifs, dtype=np.float32).reshape((nbr_regions, nbr_motifs)),
>     index=pd.Index(['region' + str(i) for i in range(nbr_regions)], name='regions'),
>     columns=pd.Index(['motif' + str(i) for i in range(nbr_motifs)], name='motifs')
> )
> # Transpose dataframe
> df_transposed = df.transpose()
> # Write transposed dataframe to Feather v2 format.
> pf.write_feather(df_transposed, 'df_transposed.feather')
> # Trying to read the transposed dataframe from Feather v2 format, results in this error:
> df_transposed_read = pf.read_feather('df_transposed.feather')
> {code}
> {code:python}
> ---------------------------------------------------------------------------
> OSError                                   Traceback (most recent call last)
> <ipython-input-64-b41ad5157e77> in <module>
> ----> 1 df_transposed_read = pf.read_feather('df_transposed.feather')
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/feather.py in read_feather(source, columns, use_threads, memory_map)
>     213     """
>     214     _check_pandas_version()
> --> 215     return (read_table(source, columns=columns, memory_map=memory_map)
>     216             .to_pandas(use_threads=use_threads))
>     217
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/feather.py in read_table(source, columns, memory_map)
>     235     """
>     236     reader = ext.FeatherReader()
> --> 237     reader.open(source, use_memory_map=memory_map)
>     238
>     239     if columns is None:
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/feather.pxi in pyarrow.lib.FeatherReader.open()
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
> OSError: Verification of flatbuffer-encoded Footer failed.
> {code}
> Later I discovered that it happens also if the original dataframe is created in the transposed order:
> {code:python}
> # Create (big) dataframe.
> df_without_transpose = pd.DataFrame(
>     np.arange(nbr_motifs * nbr_regions, dtype=np.float32).reshape((nbr_motifs, nbr_regions)),
>     index=pd.Index(['motif' + str(i) for i in range(nbr_motifs)], name='motifs'),
>     columns=pd.Index(['region' + str(i) for i in range(nbr_regions)], name='regions'),
> )
> pf.write_feather(df_without_transpose, 'df_without_transpose.feather')
> df_without_transpose_read = pf.read_feather('df_without_transpose.feather')
> ---------------------------------------------------------------------------
> OSError                                   Traceback (most recent call last)
> <ipython-input-91-3cdad1d58c35> in <module>
> ----> 1 df_without_transpose_read = pf.read_feather('df_without_transpose.feather')
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/feather.py in read_feather(source, columns, use_threads, memory_map)
>     213     """
>     214     _check_pandas_version()
> --> 215     return (read_table(source, columns=columns, memory_map=memory_map)
>     216             .to_pandas(use_threads=use_threads))
>     217
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/feather.py in read_table(source, columns, memory_map)
>     235     """
>     236     reader = ext.FeatherReader()
> --> 237     reader.open(source, use_memory_map=memory_map)
>     238
>     239     if columns is None:
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/feather.pxi in pyarrow.lib.FeatherReader.open()
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
> OSError: Verification of flatbuffer-encoded Footer failed.
> {code}
> Writing to Feather v1 format works:
> {code:python}
> pf.write_feather(df_transposed, 'df_transposed.v1.feather', version=1)
> df_transposed_read_v1 = pf.read_feather('df_transposed.v1.feather')
> # Now do the same, but also save the index in the Feather v1 file.
> df_transposed_reset_index = df_transposed.reset_index()
> pf.write_feather(df_transposed_reset_index, 'df_transposed_reset_index.v1.feather', version=1)
> df_transposed_reset_index_read_v1 = pf.read_feather('df_transposed_reset_index.v1.feather')
> # Returns True
> df_transposed_reset_index_read_v1.equals(df_transposed)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)