Posted to jira@arrow.apache.org by "Kamil Skalski (Jira)" <ji...@apache.org> on 2022/10/22 03:57:00 UTC

[jira] [Commented] (ARROW-17611) [Rust] Boolean column data saved with V2 from arrow-rs unreadable by pyarrow

    [ https://issues.apache.org/jira/browse/ARROW-17611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17622546#comment-17622546 ] 

Kamil Skalski commented on ARROW-17611:
---------------------------------------

Thanks for debugging. Indeed, forcing plain encoding in the writer options allows pyarrow to read the output file.

I think this can be closed if https://issues.apache.org/jira/browse/ARROW-18031 fixes this issue too.

> [Rust] Boolean column data saved with V2 from arrow-rs unreadable by pyarrow
> ----------------------------------------------------------------------------
>
>                 Key: ARROW-17611
>                 URL: https://issues.apache.org/jira/browse/ARROW-17611
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Format
>    Affects Versions: 9.0.0
>         Environment: Rust:
> "arrow" = "21.0.0"
> "parquet" = "21.0.0"
> Python:
> parquet-tools         0.2.11
> pyarrow                      9.0.0
>            Reporter: Kamil Skalski
>            Priority: Minor
>         Attachments: arrow_boolean.tar.gz, main.rs, x.parquet
>
>
> I'm generating Parquet V2 files with a boolean column, but when trying to read them with pyarrow (_parquet-tools_ or _parq_) I'm getting:
> {code:java}
> OSError: Unknown encoding type. {code}
> To reproduce, run the following Rust program:
> {code:java}
> use arrow::json;
> use std::fs::File;
> const DATA: &'static str = r#"
>    {"x": 1, "y": false}
> "#;
> fn main() -> anyhow::Result<()> {
>    let mut json = json::ReaderBuilder::new().infer_schema(Some(2))
>       .build(std::io::Cursor::new(DATA.as_bytes()))?;
>    let batch = json.next()?.unwrap();   
>    let out_file = File::create("x.parquet")?;
>    let props = parquet::file::properties::WriterProperties::builder()
>       .set_writer_version(
>           parquet::file::properties::WriterVersion::PARQUET_2_0)
>       .build();
>    let mut writer = parquet::arrow::ArrowWriter::try_new(
>           out_file, batch.schema(), Some(props))?;
>    writer.write(&batch)?;
>    writer.close()?;
>    Ok(())
> } {code}
> and then try to show the output _x.parquet_ file:
> {code:java}
> $ cargo run
> $ parquet-tools show x.parquet 
> Traceback (most recent call last):
>   File "/home/nazgul/.local/bin/parquet-tools", line 8, in <module>
>     sys.exit(main())
>   File "/home/nazgul/.local/lib/python3.10/site-packages/parquet_tools/cli.py", line 26, in main
>     args.handler(args)
>   File "/home/nazgul/.local/lib/python3.10/site-packages/parquet_tools/commands/show.py", line 59, in _cli
>     with get_datafame_from_objs(pfs, args.head) as df:
>   File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
>     return next(self.gen)
>   File "/home/nazgul/.local/lib/python3.10/site-packages/parquet_tools/commands/utils.py", line 190, in get_datafame_from_objs
>     df: Optional[pd.DataFrame] = stack.enter_context(pf.get_dataframe())
>   File "/usr/lib/python3.10/contextlib.py", line 492, in enter_context
>     result = _cm_type.__enter__(cm)
>   File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
>     return next(self.gen)
>   File "/home/nazgul/.local/lib/python3.10/site-packages/parquet_tools/commands/utils.py", line 71, in get_dataframe
>     yield pq.read_table(local_path).to_pandas()
>   File "/home/nazgul/.local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py", line 2827, in read_table
>     return dataset.read(columns=columns, use_threads=use_threads,
>   File "/home/nazgul/.local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py", line 2473, in read
>     table = self._dataset.to_table(
>   File "pyarrow/_dataset.pyx", line 331, in pyarrow._dataset.Dataset.to_table
>   File "pyarrow/_dataset.pyx", line 2577, in pyarrow._dataset.Scanner.to_table
>   File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
> OSError: Unknown encoding type. {code}
>  


