Posted to jira@arrow.apache.org by "Gert Hulselmans (Jira)" <ji...@apache.org> on 2020/12/16 12:23:00 UTC
[jira] [Commented] (ARROW-10056) [Python] PyArrow writes invalid
Feather v2 file: OSError: Verification of flatbuffer-encoded Footer failed.
[ https://issues.apache.org/jira/browse/ARROW-10056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17250281#comment-17250281 ]
Gert Hulselmans commented on ARROW-10056:
-----------------------------------------
I finally managed to compile pyarrow from git (it didn't work properly for me a few weeks ago).
After increasing max_tables to 10_000_000 (default FlatBuffers value: 1_000_000), writing to a Feather file and reading it back works with:
* arrow table with 4999999 (= 10000000 / 2 - 1) columns
* pandas dataframe with 4999998 (= 10000000 / 2 - 2) columns
According to https://groups.google.com/g/flatbuffers/c/JtDGnBPx9is, max_tables can even be set to MAX_INT. For my use case, almost 5 million columns are enough, but a higher limit probably doesn't hurt.
It seems that FlatBuffers has no problem writing Feather files with more columns; only reading the Feather file back appears to check the number of columns.
It would be great if the Arrow library supported Feather files with more than 499999 columns.
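As a rough sketch of the arithmetic behind these numbers: the formula below is inferred only from the two data points above (max_tables / 2 - 1 for an Arrow table, one column fewer for a pandas dataframe), not from the Arrow or FlatBuffers source. The helper names max_columns and fits are hypothetical and exist only for this illustration.

```python
# Sketch only: the relation between max_tables and the column limit is
# inferred from the observations in this comment, not from the Arrow source.

DEFAULT_MAX_TABLES = 1_000_000   # default FlatBuffers verifier value quoted above
PATCHED_MAX_TABLES = 10_000_000  # value used in the patch in this comment


def max_columns(max_tables: int, from_pandas: bool = False) -> int:
    """Largest column count observed to round-trip for a given max_tables.

    Hypothetical helper: assumes the pattern max_tables / 2 - 1 holds in
    general, with one column fewer when the table comes from pandas.
    """
    limit = max_tables // 2 - 1
    return limit - 1 if from_pandas else limit


def fits(n_columns: int, max_tables: int, from_pandas: bool = False) -> bool:
    """Return True if n_columns is within the inferred limit."""
    return n_columns <= max_columns(max_tables, from_pandas)


if __name__ == "__main__":
    n_regions = 1223024  # column count of the transposed dataframe in the issue
    print(max_columns(DEFAULT_MAX_TABLES))
    print(max_columns(PATCHED_MAX_TABLES))
    print(fits(n_regions, DEFAULT_MAX_TABLES, from_pandas=True))
    print(fits(n_regions, PATCHED_MAX_TABLES, from_pandas=True))
```

Under that assumption, the 1223024-column dataframe from the issue description is above the default limit but well within the patched one.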
{code:c++}
diff --git a/cpp/src/arrow/ipc/reader.cc b/cpp/src/arrow/ipc/reader.cc
index 3d855425c..a3c95ef4b 100644
--- a/cpp/src/arrow/ipc/reader.cc
+++ b/cpp/src/arrow/ipc/reader.cc
@@ -1051,7 +1051,7 @@ class RecordBatchFileReaderImpl : public RecordBatchFileReader {
         file_->ReadAt(footer_offset_ - footer_length - file_end_size, footer_length));
     auto data = footer_buffer_->data();
-    flatbuffers::Verifier verifier(data, footer_buffer_->size(), 128);
+    flatbuffers::Verifier verifier(data, footer_buffer_->size(), 128, 10000000);
     if (!flatbuf::VerifyFooterBuffer(verifier)) {
       return Status::IOError("Verification of flatbuffer-encoded Footer failed.");
     }
{code}
{code:python}
import pyarrow as pa
import pyarrow.feather as pf
import numpy as np
n_columns = 4999999
print('make table')
table = pa.table([np.random.randn(1) for _ in range(n_columns)], names=['col' + str(i) for i in range(n_columns)])
print('write feather file')
pf.write_feather(table, "/tmp/test_wide.feather")
del table
print('read feather file and verify')
result = pf.read_table("/tmp/test_wide.feather")
{code}
{code:python}
import pyarrow as pa
import pyarrow.feather as pf
import numpy as np
import pandas as pd
n_columns = 4999998
print('make table')
table = pa.table(pd.DataFrame(np.random.randn(1, n_columns), columns=['col' + str(i) for i in range(n_columns)]))
print('write feather file')
pf.write_feather(table, "/tmp/test_wide.feather")
del table
print('read feather file and verify')
result = pf.read_table("/tmp/test_wide.feather")
{code}
> [Python] PyArrow writes invalid Feather v2 file: OSError: Verification of flatbuffer-encoded Footer failed.
> -----------------------------------------------------------------------------------------------------------
>
> Key: ARROW-10056
> URL: https://issues.apache.org/jira/browse/ARROW-10056
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 1.0.1
> Environment: CentOS7
> conda environment with pyarrow 1.0.1, numpy 1.19.1 and pandas 1.1.1
> Reporter: Gert Hulselmans
> Priority: Major
> Fix For: 3.0.0
>
>
> pyarrow writes an invalid Feather v2 file, which it can't read afterwards.
> {code:java}
> OSError: Verification of flatbuffer-encoded Footer failed.
> {code}
> The following code reproduces the problem for me:
> {code:python}
> import pyarrow as pa
> import pyarrow.feather as pf
> import numpy as np
> import pandas as pd
> nbr_regions = 1223024
> nbr_motifs = 4891
> # Create (big) dataframe.
> df = pd.DataFrame(
>     np.arange(nbr_regions * nbr_motifs, dtype=np.float32).reshape((nbr_regions, nbr_motifs)),
>     index=pd.Index(['region' + str(i) for i in range(nbr_regions)], name='regions'),
>     columns=pd.Index(['motif' + str(i) for i in range(nbr_motifs)], name='motifs')
> )
> # Transpose dataframe
> df_transposed = df.transpose()
> # Write transposed dataframe to Feather v2 format.
> pf.write_feather(df_transposed, 'df_transposed.feather')
> # Trying to read the transposed dataframe from Feather v2 format, results in this error:
> df_transposed_read = pf.read_feather('df_transposed.feather')
> {code}
> {code:python}
> ---------------------------------------------------------------------------
> OSError Traceback (most recent call last)
> <ipython-input-64-b41ad5157e77> in <module>
> ----> 1 df_transposed_read = pf.read_feather('df_transposed.feather')
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/feather.py in read_feather(source, columns, use_threads, memory_map)
> 213 """
> 214 _check_pandas_version()
> --> 215 return (read_table(source, columns=columns, memory_map=memory_map)
> 216 .to_pandas(use_threads=use_threads))
> 217
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/feather.py in read_table(source, columns, memory_map)
> 235 """
> 236 reader = ext.FeatherReader()
> --> 237 reader.open(source, use_memory_map=memory_map)
> 238
> 239 if columns is None:
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/feather.pxi in pyarrow.lib.FeatherReader.open()
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
> OSError: Verification of flatbuffer-encoded Footer failed.
> {code}
> Later I discovered that it happens also if the original dataframe is created in the transposed order:
> {code:python}
> # Create (big) dataframe.
> df_without_transpose = pd.DataFrame(
>     np.arange(nbr_motifs * nbr_regions, dtype=np.float32).reshape((nbr_motifs, nbr_regions)),
>     index=pd.Index(['motif' + str(i) for i in range(nbr_motifs)], name='motifs'),
>     columns=pd.Index(['region' + str(i) for i in range(nbr_regions)], name='regions'),
> )
> pf.write_feather(df_without_transpose, 'df_without_transpose.feather')
> df_without_transpose_read = pf.read_feather('df_without_transpose.feather')
> {code}
> {code:python}
> ---------------------------------------------------------------------------
> OSError Traceback (most recent call last)
> <ipython-input-91-3cdad1d58c35> in <module>
> ----> 1 df_without_transpose_read = pf.read_feather('df_without_transpose.feather')
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/feather.py in read_feather(source, columns, use_threads, memory_map)
> 213 """
> 214 _check_pandas_version()
> --> 215 return (read_table(source, columns=columns, memory_map=memory_map)
> 216 .to_pandas(use_threads=use_threads))
> 217
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/feather.py in read_table(source, columns, memory_map)
> 235 """
> 236 reader = ext.FeatherReader()
> --> 237 reader.open(source, use_memory_map=memory_map)
> 238
> 239 if columns is None:
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/feather.pxi in pyarrow.lib.FeatherReader.open()
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
> OSError: Verification of flatbuffer-encoded Footer failed.
> {code}
> Writing to Feather v1 format works:
> {code:python}
> pf.write_feather(df_transposed, 'df_transposed.v1.feather', version=1)
> df_transposed_read_v1 = pf.read_feather('df_transposed.v1.feather')
> # Now do the same, but also save the index in the Feather v1 file.
> df_transposed_reset_index = df_transposed.reset_index()
> pf.write_feather(df_transposed_reset_index, 'df_transposed_reset_index.v1.feather', version=1)
> df_transposed_reset_index_read_v1 = pf.read_feather('df_transposed_reset_index.v1.feather')
> # Returns True
> df_transposed_reset_index_read_v1.equals(df_transposed)
> {code}