Posted to dev@parquet.apache.org by "Gabor Szadovszky (JIRA)" <ji...@apache.org> on 2018/04/21 12:40:01 UTC

[jira] [Updated] (PARQUET-363) Cannot construct empty MessageType for ReadContext.requestedSchema

     [ https://issues.apache.org/jira/browse/PARQUET-363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabor Szadovszky updated PARQUET-363:
-------------------------------------
    Fix Version/s: 1.8.2

> Cannot construct empty MessageType for ReadContext.requestedSchema
> ------------------------------------------------------------------
>
>                 Key: PARQUET-363
>                 URL: https://issues.apache.org/jira/browse/PARQUET-363
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.8.0, 1.8.1
>            Reporter: Cheng Lian
>            Assignee: Ryan Blue
>            Priority: Major
>             Fix For: 1.9.0, 1.8.2
>
>
> In parquet-mr 1.8.1, constructing an empty {{GroupType}} (and thus an empty {{MessageType}}) is no longer allowed (see PARQUET-278). This change makes sense in most cases, since Parquet doesn't support empty groups. However, there is one use case where an empty {{MessageType}} is valid: passing it as the {{requestedSchema}} constructor argument of {{ReadContext}} when counting rows in a Parquet file. This works because Parquet can retrieve the row count from block metadata without materializing any columns. Take the following PySpark shell snippet ([1.5-SNAPSHOT|https://github.com/apache/spark/commit/010b03ed52f35fd4d426d522f8a9927ddc579209], which uses parquet-mr 1.7.0) as an example:
> {noformat}
> >>> path = 'file:///tmp/foo'
> >>> # Writes 10 integers into a Parquet file
> >>> sqlContext.range(10).coalesce(1).write.mode('overwrite').parquet(path)
> >>> sqlContext.read.parquet(path).count()
> 10
> {noformat}
> Parquet related log lines:
> {noformat}
> 15/08/21 12:32:04 INFO CatalystReadSupport: Going to read the following fields from the Parquet file:
> Parquet form:
> message root {
> }
> Catalyst form:
> StructType()
> 15/08/21 12:32:04 INFO InternalParquetRecordReader: RecordReader initialized will read a total of 10 records.
> 15/08/21 12:32:04 INFO InternalParquetRecordReader: at row 0. reading next block
> 15/08/21 12:32:04 INFO InternalParquetRecordReader: block read in memory in 0 ms. row count = 10
> {noformat}
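> The "will read a total of 10 records" line above comes straight from the file footer. As a minimal sketch, reading that figure directly with the parquet-mr 1.8.x metadata API looks roughly like this (class name and the path argument are illustrative):
> {noformat}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.Path;
> import org.apache.parquet.hadoop.ParquetFileReader;
> import org.apache.parquet.hadoop.metadata.BlockMetaData;
> import org.apache.parquet.hadoop.metadata.ParquetMetadata;
>
> public class FooterRowCount {
>   public static void main(String[] args) throws Exception {
>     // The row count is summed from per-row-group block metadata in the
>     // footer; no column data is ever read.
>     ParquetMetadata footer =
>         ParquetFileReader.readFooter(new Configuration(), new Path(args[0]));
>     long rows = 0;
>     for (BlockMetaData block : footer.getBlocks()) {
>       rows += block.getRowCount();
>     }
>     System.out.println(rows); // prints 10 for the file written above
>   }
> }
> {noformat}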
> The log shows that Spark SQL passes no requested columns to the underlying Parquet reader. What happens here is that:
> # Spark SQL creates a {{CatalystRowConverter}} with zero converters (and thus only generates empty rows).
> # {{InternalParquetRecordReader}} first obtains the row count from block metadata ([here|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.1/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java#L184-L186]).
> # {{MessageColumnIO}} returns an {{EmptyRecordRecorder}} for reading the Parquet file ([here|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.1/parquet-column/src/main/java/org/apache/parquet/io/MessageColumnIO.java#L97-L99]).
> # {{InternalParquetRecordReader.nextKeyValue()}} is invoked _n_ times, where _n_ equals the row count. Each time, it invokes the converter created by Spark SQL and produces an empty Spark SQL row object (see the sketch below).
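> To make the failing pattern concrete, here is a minimal Java sketch (class name is hypothetical). Under parquet-mr 1.7.0 it runs; under 1.8.1 the constructor throws, because PARQUET-278 rejects empty groups:
> {noformat}
> import org.apache.parquet.hadoop.api.ReadSupport.ReadContext;
> import org.apache.parquet.schema.MessageType;
>
> public class EmptySchemaRepro {
>   public static void main(String[] args) {
>     // 1.7.0: prints the empty schema "message root {}" and builds the
>     // ReadContext. 1.8.1: MessageType inherits GroupType's new check on
>     // empty field lists and throws here, before the ReadContext is ever
>     // constructed.
>     MessageType emptySchema = new MessageType("root");
>     System.out.println(emptySchema);
>     ReadContext context = new ReadContext(emptySchema);
>   }
> }
> {noformat}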
> This issue is also the cause of HIVE-11611: when upgrading to Parquet 1.8.1, Hive worked around it by using {{tableSchema}} as the {{requestedSchema}} when no columns are requested ([here|https://github.com/apache/hive/commit/3e68cdc9962cacab59ee891fcca6a736ad10d37d#diff-cc764a8828c4acc2a27ba717610c3f0bR233]). IMO this introduces a performance regression in cases like counting, because all columns must now be materialized just to count rows.
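> Paraphrased, that workaround looks roughly like the following sketch (method and variable names are hypothetical, not Hive's actual code):
> {noformat}
> import java.util.List;
>
> import org.apache.parquet.hadoop.api.ReadSupport.ReadContext;
> import org.apache.parquet.schema.MessageType;
> import org.apache.parquet.schema.Type;
>
> public class ProjectionFallback {
>   // Falling back to the full table schema keeps the requested schema
>   // non-empty, but a plain count() now materializes every column instead
>   // of touching only block metadata.
>   static ReadContext requestedSchemaFor(MessageType tableSchema, List<Type> projected) {
>     MessageType requested = projected.isEmpty()
>         ? tableSchema
>         : new MessageType(tableSchema.getName(), projected);
>     return new ReadContext(requested);
>   }
> }
> {noformat}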



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)