Posted to jira@arrow.apache.org by "Jorge Leitão (Jira)" <ji...@apache.org> on 2021/12/13 16:33:00 UTC

[jira] [Updated] (ARROW-15073) [C++][Parquet][Python] LZ4- and compressed parquet files are unreadable by (py)spark

     [ https://issues.apache.org/jira/browse/ARROW-15073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jorge Leitão updated ARROW-15073:
---------------------------------
    Summary: [C++][Parquet][Python] LZ4- and compressed parquet files are unreadable by (py)spark  (was: [C++][Parquet][Python] LZ4- and zstd- compressed parquet files are unreadable by (py)spark)

> [C++][Parquet][Python] LZ4- and compressed parquet files are unreadable by (py)spark
> ------------------------------------------------------------------------------------
>
>                 Key: ARROW-15073
>                 URL: https://issues.apache.org/jira/browse/ARROW-15073
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Parquet
>            Reporter: Jorge Leitão
>            Priority: Major
>
> The following snippet shows the issue:
> {code:java}
> import pyarrow as pa  # pyarrow==6.0.1
> import pyarrow.parquet
> import pyspark.sql  # pyspark==3.1.2
> path = "bla.parquet"
> t = pa.table(
>     [pa.array([0, 1, None, 3, None, 5, 6, 7, None, 9])],
>     schema=pa.schema([pa.field("int64", pa.int64(), nullable=True)]),
> )
> pyarrow.parquet.write_table(
>     t,
>     path,
>     use_dictionary=False,
>     compression="LZ4",
> )
> spark = pyspark.sql.SparkSession.builder.getOrCreate()
> result = spark.read.parquet(path).select("int64").collect()
> {code}
> This fails with:
> {code:java}
> Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: Required field 'codec' was not present! Struct: ColumnMetaData(type:INT64, encodings:[PLAIN, RLE], path_in_schema:[int64], codec:null, num_values:10, total_uncompressed_size:142, total_compressed_size:104, data_page_offset:4, statistics:Statistics(max:09 00 00 00 00 00 00 00, min:00 00 00 00 00 00 00 00, null_count:0, max_value:09 00 00 00 00 00 00 00, min_value:00 00 00 00 00 00 00 00), encoding_stats:[PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN, count:1)])
> {code}
> Found while debugging the root cause of https://github.com/pola-rs/polars/issues/2018.
> pyarrow itself reads the file back correctly.


