Posted to jira@arrow.apache.org by "Jorge Leitão (Jira)" <ji...@apache.org> on 2021/12/13 16:33:00 UTC
[jira] [Updated] (ARROW-15073) [C++][Parquet][Python] LZ4- and compressed parquet files are unreadable by (py)spark
[ https://issues.apache.org/jira/browse/ARROW-15073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jorge Leitão updated ARROW-15073:
---------------------------------
Summary: [C++][Parquet][Python] LZ4- and compressed parquet files are unreadable by (py)spark (was: [C++][Parquet][Python] LZ4- and zstd- compressed parquet files are unreadable by (py)spark)
> [C++][Parquet][Python] LZ4- and compressed parquet files are unreadable by (py)spark
> ------------------------------------------------------------------------------------
>
> Key: ARROW-15073
> URL: https://issues.apache.org/jira/browse/ARROW-15073
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Parquet
> Reporter: Jorge Leitão
> Priority: Major
>
> The following snippet shows the issue:
> {code:python}
> import pyarrow as pa # pyarrow==6.0.1
> import pyarrow.parquet
> import pyspark.sql # pyspark==3.1.2
> path = "bla.parquet"
> t = pa.table(
>     [pa.array([0, 1, None, 3, None, 5, 6, 7, None, 9])],
>     schema=pa.schema([pa.field("int64", pa.int64(), nullable=True)]),
> )
> pyarrow.parquet.write_table(
>     t,
>     path,
>     use_dictionary=False,
>     compression="LZ4",
> )
> spark = pyspark.sql.SparkSession.builder.getOrCreate()
> result = spark.read.parquet(path).select("int64").collect()
> {code}
> This fails with:
> {code:java}
> Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: Required field 'codec' was not present! Struct: ColumnMetaData(type:INT64, encodings:[PLAIN, RLE], path_in_schema:[int64], codec:null, num_values:10, total_uncompressed_size:142, total_compressed_size:104, data_page_offset:4, statistics:Statistics(max:09 00 00 00 00 00 00 00, min:00 00 00 00 00 00 00 00, null_count:0, max_value:09 00 00 00 00 00 00 00, min_value:00 00 00 00 00 00 00 00), encoding_stats:[PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN, count:1)])
> {code}
> Found while debugging the root cause of https://github.com/pola-rs/polars/issues/2018
> pyarrow reads the file correctly.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)