Posted to jira@arrow.apache.org by "Gert Hulselmans (Jira)" <ji...@apache.org> on 2020/12/16 12:23:00 UTC
[jira] [Commented] (ARROW-10056) [Python] PyArrow writes invalid Feather v2 file: OSError: Verification of flatbuffer-encoded Footer failed.

    [ https://issues.apache.org/jira/browse/ARROW-10056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17250281#comment-17250281 ] 

Gert Hulselmans commented on ARROW-10056:
-----------------------------------------

I finally managed to compile pyarrow from git (didn't work properly for me a few weeks ago).

After increasing the number of max_tables to 10_000_000 (default flatbuffer value: 1_000_000), writing to a feather file and reading it back works with:
* arrow table with 4999999 (= 10000000 / 2 - 1) columns
* pandas dataframe with 4999998 (= 10000000 / 2 - 2) columns
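
The two limits above follow a simple pattern in the numbers reported here: half of max_tables, minus one for a plain arrow table and minus two for a table built from a pandas dataframe. A minimal sketch of that arithmetic, assuming the same pattern holds for other values of max_tables (the helper name max_feather_columns is hypothetical and not part of pyarrow):

```python
def max_feather_columns(max_tables, from_pandas=False):
    """Estimate the widest table whose Feather v2 footer still verifies.

    Assumption: the limit follows the pattern observed in this comment,
    max_tables / 2 - 1 for a plain arrow table and max_tables / 2 - 2 for
    a table created from a pandas dataframe.
    """
    overhead = 2 if from_pandas else 1
    return max_tables // 2 - overhead


# Patched verifier limit used in this comment.
print(max_feather_columns(10_000_000))
print(max_feather_columns(10_000_000, from_pandas=True))

# Default flatbuffer limit.
print(max_feather_columns(1_000_000))
```

With the default limit of 1_000_000 this gives the 499999 columns mentioned further down.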

According to https://groups.google.com/g/flatbuffers/c/JtDGnBPx9is max_tables can even be set to MAX_INT. For my use case, almost 5 million columns are enough, but a higher limit probably doesn't hurt.

It seems that flatbuffer has no problem writing feather files with more columns; only reading the feather file back seems to check the number of columns.

It would be great if the arrow library supported writing and reading back feather files with more than 499999 columns.

{code:c++}
diff --git a/cpp/src/arrow/ipc/reader.cc b/cpp/src/arrow/ipc/reader.cc
index 3d855425c..a3c95ef4b 100644
--- a/cpp/src/arrow/ipc/reader.cc
+++ b/cpp/src/arrow/ipc/reader.cc
@@ -1051,7 +1051,7 @@ class RecordBatchFileReaderImpl : public RecordBatchFileReader {
         file_->ReadAt(footer_offset_ - footer_length - file_end_size, footer_length));

     auto data = footer_buffer_->data();
-    flatbuffers::Verifier verifier(data, footer_buffer_->size(), 128);
+    flatbuffers::Verifier verifier(data, footer_buffer_->size(), 128, 10000000);
     if (!flatbuf::VerifyFooterBuffer(verifier)) {
       return Status::IOError("Verification of flatbuffer-encoded Footer failed.");
     }
{code}
{code:python}
import pyarrow as pa
import pyarrow.feather as pf
import numpy as np

n_columns = 4999999

print('make table')
table = pa.table([np.random.randn(1) for _ in range(n_columns)], names=['col' + str(i) for i in range(n_columns)])

print('write feather file')
pf.write_feather(table, "/tmp/test_wide.feather")
del table

print('read feather file and verify')
result = pf.read_table("/tmp/test_wide.feather")
{code}

{code:python}
import pyarrow as pa
import pyarrow.feather as pf
import numpy as np
import pandas as pd

n_columns = 4999998
print('make table')
table = pa.table(pd.DataFrame(np.random.randn(1, n_columns), columns=['col' + str(i) for i in range(n_columns)]))

print('write feather file')
pf.write_feather(table, "/tmp/test_wide.feather")
del table

print('read feather file and verify')
result = pf.read_table("/tmp/test_wide.feather")
{code}
 

> [Python] PyArrow writes invalid Feather v2 file: OSError: Verification of flatbuffer-encoded Footer failed.
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-10056
>                 URL: https://issues.apache.org/jira/browse/ARROW-10056
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 1.0.1
>         Environment: CentOS7
> conda environment with pyarrow 1.0.1, numpy 1.19.1 and pandas 1.1.1
>            Reporter: Gert Hulselmans
>            Priority: Major
>             Fix For: 3.0.0
>
>
> pyarrow writes an invalid Feather v2 file, which it can't read afterwards.
> {code:java}
>     OSError: Verification of flatbuffer-encoded Footer failed.
> {code}
> The following code reproduces the problem for me:
> {code:python}
> import pyarrow as pa
> import pyarrow.feather as pf
> import numpy as np
> import pandas as pd
> nbr_regions = 1223024
> nbr_motifs = 4891
> # Create (big) dataframe.
> df = pd.DataFrame(
>     np.arange(nbr_regions * nbr_motifs, dtype=np.float32).reshape((nbr_regions, nbr_motifs)),
>     index=pd.Index(['region' + str(i) for i in range(nbr_regions)], name='regions'),
>     columns=pd.Index(['motif' + str(i) for i in range(nbr_motifs)], name='motifs')
> )
> # Transpose dataframe
> df_transposed = df.transpose()
> # Write transposed dataframe to Feather v2 format.
> pf.write_feather(df_transposed, 'df_transposed.feather')
> # Trying to read the transposed dataframe from Feather v2 format, results in this error:
> df_transposed_read = pf.read_feather('df_transposed.feather')
> {code}
> {code:python}
> ---------------------------------------------------------------------------
> OSError                                   Traceback (most recent call last)
> <ipython-input-64-b41ad5157e77> in <module>
> ----> 1 df_transposed_read = pf.read_feather('df_transposed.feather')
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/feather.py in read_feather(source, columns, use_threads, memory_map)
>     213     """
>     214     _check_pandas_version()
> --> 215     return (read_table(source, columns=columns, memory_map=memory_map)
>     216             .to_pandas(use_threads=use_threads))
>     217
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/feather.py in read_table(source, columns, memory_map)
>     235     """
>     236     reader = ext.FeatherReader()
> --> 237     reader.open(source, use_memory_map=memory_map)
>     238
>     239     if columns is None:
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/feather.pxi in pyarrow.lib.FeatherReader.open()
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
> OSError: Verification of flatbuffer-encoded Footer failed.
> {code}
> Later I discovered that it happens also if the original dataframe is created in the transposed order:
> {code:python}
> # Create (big) dataframe.
> df_without_transpose = pd.DataFrame(
>     np.arange(nbr_motifs * nbr_regions, dtype=np.float32).reshape((nbr_motifs, nbr_regions)),
>     index=pd.Index(['motif' + str(i) for i in range(nbr_motifs)], name='motifs'),
>     columns=pd.Index(['region' + str(i) for i in range(nbr_regions)], name='regions'),
> )
> pf.write_feather(df_without_transpose, 'df_without_transpose.feather')
> df_without_transpose_read = pf.read_feather('df_without_transpose.feather')
> {code}
> {code:python}
> ---------------------------------------------------------------------------
> OSError                                   Traceback (most recent call last)
> <ipython-input-91-3cdad1d58c35> in <module>
> ----> 1 df_without_transpose_read = pf.read_feather('df_without_transpose.feather')
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/feather.py in read_feather(source, columns, use_threads, memory_map)
>     213     """
>     214     _check_pandas_version()
> --> 215     return (read_table(source, columns=columns, memory_map=memory_map)
>     216             .to_pandas(use_threads=use_threads))
>     217
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/feather.py in read_table(source, columns, memory_map)
>     235     """
>     236     reader = ext.FeatherReader()
> --> 237     reader.open(source, use_memory_map=memory_map)
>     238
>     239     if columns is None:
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/feather.pxi in pyarrow.lib.FeatherReader.open()
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
> OSError: Verification of flatbuffer-encoded Footer failed.
> {code}
> Writing to Feather v1 format works:
> {code:python}
> pf.write_feather(df_transposed, 'df_transposed.v1.feather', version=1)
> df_transposed_read_v1 = pf.read_feather('df_transposed.v1.feather')
> # Now do the same, but also save the index in the Feather v1 file.
> df_transposed_reset_index = df_transposed.reset_index()
> pf.write_feather(df_transposed_reset_index, 'df_transposed_reset_index.v1.feather', version=1)
> df_transposed_reset_index_read_v1 = pf.read_feather('df_transposed_reset_index.v1.feather')
> # Returns True
> df_transposed_reset_index_read_v1.equals(df_transposed)
> {code}


