Posted to dev@parquet.apache.org by "Wes McKinney (Jira)" <ji...@apache.org> on 2020/05/05 21:22:00 UTC
[jira] [Commented] (PARQUET-1858) [Python] [Rust] Parquet read file
fails with batch size 1_000_000 and 41 row groups
[ https://issues.apache.org/jira/browse/PARQUET-1858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17100260#comment-17100260 ]
Wes McKinney commented on PARQUET-1858:
---------------------------------------
The PLAIN encoding for the boolean type is possibly malformed. I opened PARQUET-1859 about providing better error messages, but here is the failure:
{code}
$ python test.py
Traceback (most recent call last):
File "test.py", line 7, in <module>
pq.read_table(path)
File "/home/wesm/code/arrow/python/pyarrow/parquet.py", line 1539, in read_table
use_pandas_metadata=use_pandas_metadata)
File "/home/wesm/code/arrow/python/pyarrow/parquet.py", line 1264, in read
use_pandas_metadata=use_pandas_metadata)
File "/home/wesm/code/arrow/python/pyarrow/parquet.py", line 707, in read
table = reader.read(**options)
File "/home/wesm/code/arrow/python/pyarrow/parquet.py", line 337, in read
use_threads=use_threads)
File "pyarrow/_parquet.pyx", line 1130, in pyarrow._parquet.ParquetReader.read_all
check_status(self.reader.get()
File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
raise IOError(message)
OSError: Unexpected end of stream: Failed to decode 1000000 bits for boolean PLAIN encoding only decoded 2048
In ../src/parquet/arrow/reader.cc, line 844, code: final_status
{code}
Can this file be read by the Java library?
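
For context on the numbers in that error message: the Parquet format specifies that PLAIN-encoded BOOLEAN values are bit-packed, least significant bit first, so 1,000,000 values need 125,000 bytes, while 2048 bits is only 256 bytes. The sketch below is a minimal pure-Python illustration of that layout, not the Arrow C++ or Rust implementation. The function names and the wording of the raised error are hypothetical, chosen only to mirror the message above, and it does not establish why this particular file fails.

```python
def encode_plain_booleans(values):
    """Bit-pack booleans LSB-first, eight values per byte."""
    out = bytearray((len(values) + 7) // 8)
    for i, value in enumerate(values):
        if value:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def decode_plain_booleans(buf, num_values):
    """Unpack num_values booleans, failing if the buffer is too short."""
    available_bits = len(buf) * 8
    if available_bits < num_values:
        # Hypothetical error text, modelled on the message in the traceback.
        raise EOFError(
            f"Failed to decode {num_values} bits for boolean PLAIN "
            f"encoding only decoded {available_bits}"
        )
    return [bool((buf[i // 8] >> (i % 8)) & 1) for i in range(num_values)]


if __name__ == "__main__":
    packed = encode_plain_booleans([True] * 1_000_000)
    print(len(packed))  # bytes needed for one batch of booleans
    try:
        # Simulate a page that holds only 256 bytes of boolean data.
        decode_plain_booleans(packed[:256], 1_000_000)
    except EOFError as exc:
        print(exc)
```

A reader that is told a page holds 1,000,000 values but finds only 256 bytes of data would run out of input in the same way, which is one possible reading of the error, assuming the page header and the encoded data disagree.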
> [Python] [Rust] Parquet read file fails with batch size 1_000_000 and 41 row groups
> -----------------------------------------------------------------------------------
>
> Key: PARQUET-1858
> URL: https://issues.apache.org/jira/browse/PARQUET-1858
> Project: Parquet
> Issue Type: Bug
> Reporter: Novice
> Priority: Major
> Attachments: test_2.parquet.tgz
>
>
> Here is the error I got:
> Pyarrow:
> ```
> >>> df = pd.read_parquet("test.parquet", engine="pyarrow")
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 296, in read_parquet
> return impl.read(path, columns=columns, **kwargs)
> File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 125, in read
> path, columns=columns, **kwargs
> File "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", line 1281, in read_table
> use_pandas_metadata=use_pandas_metadata)
> File "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", line 1137, in read
> use_pandas_metadata=use_pandas_metadata)
> File "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", line 605, in read
> table = reader.read(**options)
> File "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", line 253, in read
> use_threads=use_threads)
> File "pyarrow/_parquet.pyx", line 1136, in pyarrow._parquet.ParquetReader.read_all
> File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> OSError: Unexpected end of stream
> ```
> fastparquet:
> ```
> >>> df = pd.read_parquet("test.parquet", engine="fastparquet")
> /home/miniconda3/envs/ds/lib/python3.7/site-packages/fastparquet/encoding.py:222: NumbaDeprecationWarning: The 'numba.jitclass' decorator has moved to 'numba.experimental.jitclass' to better reflect the experimental nature of the functionality. Please update your imports to accommodate this change and see [http://numba.pydata.org/numba-doc/latest/reference/deprecation.html#change-of-jitclass-location] for the time frame.
> Numpy8 = numba.jitclass(spec8)(NumpyIO)
> /home/miniconda3/envs/ds/lib/python3.7/site-packages/fastparquet/encoding.py:224: NumbaDeprecationWarning: The 'numba.jitclass' decorator has moved to 'numba.experimental.jitclass' to better reflect the experimental nature of the functionality. Please update your imports to accommodate this change and see [http://numba.pydata.org/numba-doc/latest/reference/deprecation.html#change-of-jitclass-location] for the time frame.
> Numpy32 = numba.jitclass(spec32)(NumpyIO)
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 296, in read_parquet
> return impl.read(path, columns=columns, **kwargs)
> File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 201, in read
> return parquet_file.to_pandas(columns=columns, **kwargs)
> File "/home/miniconda3/envs/ds/lib/python3.7/site-packages/fastparquet/api.py", line 399, in to_pandas
> index=index, assign=parts)
> File "/home/miniconda3/envs/ds/lib/python3.7/site-packages/fastparquet/api.py", line 228, in read_row_group
> scheme=self.file_scheme)
> File "/home/miniconda3/envs/ds/lib/python3.7/site-packages/fastparquet/core.py", line 354, in read_row_group
> cats, selfmade, assign=assign)
> File "/home/miniconda3/envs/ds/lib/python3.7/site-packages/fastparquet/core.py", line 331, in read_row_group_arrays
> catdef=out.get(name+'-catdef', None))
> File "/home/miniconda3/envs/ds/lib/python3.7/site-packages/fastparquet/core.py", line 245, in read_col
> skip_nulls, selfmade=selfmade)
> File "/home/miniconda3/envs/ds/lib/python3.7/site-packages/fastparquet/core.py", line 99, in read_data_page
> raw_bytes = _read_page(f, header, metadata)
> File "/home/miniconda3/envs/ds/lib/python3.7/site-packages/fastparquet/core.py", line 31, in _read_page
> page_header.uncompressed_page_size)
> AssertionError: found 120016208 raw bytes (expected None)
> ```
> The corresponding Rust code is:
> ```
> use parquet::{
>     column::writer::ColumnWriter::BoolColumnWriter,
>     column::writer::ColumnWriter::Int32ColumnWriter,
>     file::{
>         properties::WriterProperties,
>         writer::{FileWriter, SerializedFileWriter},
>     },
>     schema::parser::parse_message_type,
> };
> use std::{fs, rc::Rc};
>
> fn main() {
>     let schema = "
>         message schema {
>             REQUIRED INT32 a;
>             REQUIRED BOOLEAN b;
>         }
>     ";
>     let schema = Rc::new(parse_message_type(schema).unwrap());
>     let props = Rc::new(
>         WriterProperties::builder()
>             .set_statistics_enabled(false)
>             .set_dictionary_enabled(false)
>             .build(),
>     );
>     let file = fs::File::create("test.parquet").unwrap();
>     let mut writer = SerializedFileWriter::new(file, schema, props).unwrap();
>     let batch_size = 1_000_000;
>     let mut data = vec![];
>     let mut data_bool = vec![];
>     for i in 0..batch_size {
>         data.push(i);
>         data_bool.push(true);
>     }
>     let mut j = 0;
>     loop {
>         let mut row_group_writer = writer.next_row_group().unwrap();
>         let mut col_writer = row_group_writer.next_column().unwrap().unwrap();
>         if let Int32ColumnWriter(ref mut typed_writer) = col_writer {
>             typed_writer.write_batch(&data, None, None).unwrap();
>         } else {
>             panic!();
>         }
>         row_group_writer.close_column(col_writer).unwrap();
>         let mut col_writer = row_group_writer.next_column().unwrap().unwrap();
>         if let BoolColumnWriter(ref mut typed_writer) = col_writer {
>             typed_writer.write_batch(&data_bool, None, None).unwrap();
>         } else {
>             panic!();
>         }
>         row_group_writer.close_column(col_writer).unwrap();
>         writer.close_row_group(row_group_writer).unwrap();
>         j += 1;
>         if j * batch_size > 40_000_000 {
>             break;
>         }
>     }
>     writer.close().unwrap()
> }
> ```
>