Posted to dev@parquet.apache.org by "Cheng Lian (JIRA)" <ji...@apache.org> on 2015/08/21 08:47:45 UTC

[jira] [Created] (PARQUET-363) Cannot construct empty MessageType for ReadContext.requestedSchema

Cheng Lian created PARQUET-363:
----------------------------------

             Summary: Cannot construct empty MessageType for ReadContext.requestedSchema
                 Key: PARQUET-363
                 URL: https://issues.apache.org/jira/browse/PARQUET-363
             Project: Parquet
          Issue Type: Bug
          Components: parquet-mr
    Affects Versions: 1.8.0, 1.8.1
            Reporter: Cheng Lian


In parquet-mr 1.8.1, constructing an empty {{GroupType}} (and thus an empty {{MessageType}}) is no longer allowed (see PARQUET-278). This change makes sense in most cases, since Parquet doesn't support empty groups. However, there is one use case where an empty {{MessageType}} is valid: passing it as the {{requestedSchema}} constructor argument of {{ReadContext}} when counting rows in a Parquet file. This works because Parquet can retrieve the row count from block metadata without materializing any columns. Take the following PySpark shell snippet ([1.5-SNAPSHOT|https://github.com/apache/spark/commit/010b03ed52f35fd4d426d522f8a9927ddc579209], which uses parquet-mr 1.7.0) as an example:
{noformat}
>>> path = 'file:///tmp/foo'
>>> # Writes 10 integers into a Parquet file
>>> sqlContext.range(10).coalesce(1).write.mode('overwrite').parquet(path)
>>> sqlContext.read.parquet(path).count()

10
{noformat}
Parquet related log lines:
{noformat}
15/08/21 12:32:04 INFO CatalystReadSupport: Going to read the following fields from the Parquet file:

Parquet form:
message root {
}


Catalyst form:
StructType()

15/08/21 12:32:04 INFO InternalParquetRecordReader: RecordReader initialized will read a total of 10 records.
15/08/21 12:32:04 INFO InternalParquetRecordReader: at row 0. reading next block
15/08/21 12:32:04 INFO InternalParquetRecordReader: block read in memory in 0 ms. row count = 10
{noformat}
We can see that Spark SQL passes no requested columns to the underlying Parquet reader. Here is what happens:

# Spark SQL creates a {{CatalystRowConverter}} with zero converters (and thus only generates empty rows).
# {{InternalParquetRecordReader}} first obtains the row count from the block metadata ([here|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.1/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java#L184-L186]).
# {{MessageColumnIO}} returns an {{EmptyRecordRecorder}} for reading the Parquet file ([here|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.1/parquet-column/src/main/java/org/apache/parquet/io/MessageColumnIO.java#L97-L99]).
# {{InternalParquetRecordReader.nextKeyValue()}} is invoked _n_ times, where _n_ equals the row count. Each time, it invokes the converter created by Spark SQL and produces an empty Spark SQL row object.
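
For reference, the failing construction is easy to reproduce directly against the parquet-mr API. The class below is a hypothetical repro sketch (not part of parquet-mr or Spark); it compiles against both 1.7.0 and 1.8.1, but on 1.8.1 the {{MessageType}} constructor throws before a {{ReadContext}} can even be built:
{noformat}
import org.apache.parquet.hadoop.api.ReadSupport.ReadContext;
import org.apache.parquet.schema.MessageType;

// Hypothetical repro class, written only to demonstrate the failure.
public class EmptyRequestedSchemaRepro {
  public static void main(String[] args) {
    // On parquet-mr 1.7.0 an empty message type is accepted, and passing
    // it as ReadContext.requestedSchema lets the reader obtain the row
    // count from block metadata without materializing any column.
    // On 1.8.1 the GroupType constructor rejects an empty field list
    // (PARQUET-278), so this line throws.
    MessageType emptySchema = new MessageType("root");

    ReadContext context = new ReadContext(emptySchema);
    System.out.println(context.getRequestedSchema());
  }
}
{noformat}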

This issue is also the cause of HIVE-11611: when upgrading to Parquet 1.8.1, Hive worked around it by using {{tableSchema}} as the {{requestedSchema}} when no columns are requested ([here|https://github.com/apache/hive/commit/3e68cdc9962cacab59ee891fcca6a736ad10d37d#diff-cc764a8828c4acc2a27ba717610c3f0bR233]). IMO this introduces a performance regression in cases like counting, because all columns must now be materialized just to count rows.
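
To make the trade-off concrete, here is a simplified sketch of the two {{requestedSchema}} choices a {{ReadSupport}} implementation faces on 1.8.1 when no columns are requested ({{fileSchema}} is a stand-in for the full schema of the file being read; the helper is hypothetical, not Hive's actual code):
{noformat}
import org.apache.parquet.hadoop.api.ReadSupport.ReadContext;
import org.apache.parquet.schema.MessageType;

// Hypothetical helper illustrating the workaround's cost.
public class RequestedSchemaChoice {
  static ReadContext forCountOnly(MessageType fileSchema) {
    // What worked on 1.7.0: request nothing and count from metadata.
    //   return new ReadContext(new MessageType("root"));  // throws on 1.8.1
    // The Hive-style workaround: fall back to the full schema. This
    // constructs fine on 1.8.1, but the reader now decodes every
    // column of every row just to produce a count.
    return new ReadContext(fileSchema);
  }
}
{noformat}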


