Posted to issues@spark.apache.org by "Cheng Lian (JIRA)" <ji...@apache.org> on 2015/09/01 12:48:45 UTC
[jira] [Created] (SPARK-10395) Simplify CatalystReadSupport
Cheng Lian created SPARK-10395:
----------------------------------
Summary: Simplify CatalystReadSupport
Key: SPARK-10395
URL: https://issues.apache.org/jira/browse/SPARK-10395
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 1.5.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Minor
The Parquet {{ReadSupport}} API is somewhat overcomplicated for historical reasons. In older versions of parquet-mr (1.6.0rc3 and earlier), {{ReadSupport}} needed to be instantiated and initialized twice, once on the driver side and once on the executor side. The {{init()}} method was for driver-side initialization, while {{prepareForRead()}} was for the executor side. Starting with parquet-mr 1.6.0, however, this is no longer the case: {{ReadSupport}} is instantiated and initialized only on the executor side. So, in theory, the two methods could now be combined into a single initialization method. The only reason I can think of for still having both is backwards compatibility of the parquet-mr API.
Because of this, we no longer need to rely on {{ReadContext}} to pass the requested schema from {{init()}} to {{prepareForRead()}}; a private {{var}} holding the requested schema in {{CatalystReadSupport}} would be enough.
In addition, now that the old Parquet support code has been removed, we always set the Catalyst requested schema properly when reading Parquet files, so all the "fallback" logic in {{CatalystReadSupport}} is now redundant.
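A rough sketch of the shape this simplification might take is given here. It is illustrative only and is not the actual patch. The types {{Schema}}, {{InitContext}}, {{ReadContext}}, {{Materializer}} and {{ReadSupportLike}} are simplified stand-ins invented for this example so that it compiles without parquet-mr, and {{SimplifiedReadSupport}} and the configuration key are hypothetical names. Treating a missing requested schema as an error is also an assumption of the sketch.

```scala
// Simplified stand-ins for the parquet-mr types. These are NOT the real API;
// they exist only so the sketch is self-contained.
final case class Schema(fields: Seq[String])
final case class InitContext(conf: Map[String, String], fileSchema: Schema)
final case class ReadContext(requestedSchema: Schema)
final case class Materializer(schema: Schema)

// Stand-in for the two-method ReadSupport contract (kept for API compatibility).
trait ReadSupportLike {
  def init(context: InitContext): ReadContext
  def prepareForRead(
      conf: Map[String, String],
      fileSchema: Schema,
      readContext: ReadContext): Materializer
}

object SimplifiedReadSupport {
  // Hypothetical configuration key carrying a comma-separated column list.
  val RequestedSchemaKey = "example.requested_schema"
}

// Hypothetical read support that keeps the requested schema in a private var
// instead of passing it through ReadContext.
class SimplifiedReadSupport extends ReadSupportLike {
  // Set in init() and read in prepareForRead(). This relies on both methods
  // being called on the same instance, on the executor side.
  private var requestedSchema: Schema = null

  override def init(context: InitContext): ReadContext = {
    // No fallback path: in this sketch the requested schema must be set.
    val columns = context.conf.getOrElse(
      SimplifiedReadSupport.RequestedSchemaKey,
      throw new IllegalStateException("requested schema is not set"))
    requestedSchema = Schema(columns.split(",").toSeq)
    ReadContext(requestedSchema)
  }

  override def prepareForRead(
      conf: Map[String, String],
      fileSchema: Schema,
      readContext: ReadContext): Materializer = {
    require(requestedSchema != null, "init() must be called before prepareForRead()")
    // Uses the private var; readContext is deliberately ignored.
    Materializer(requestedSchema)
  }
}
```

In this sketch {{init()}} still returns a {{ReadContext}} because the two-method contract requires it, but {{prepareForRead()}} does not read from it.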
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)