
Posted to issues@spark.apache.org by "Cheng Lian (JIRA)" <ji...@apache.org> on 2015/09/01 12:48:45 UTC

[jira] [Created] (SPARK-10395) Simplify CatalystReadSupport

Cheng Lian created SPARK-10395:
----------------------------------

             Summary: Simplify CatalystReadSupport
                 Key: SPARK-10395
                 URL: https://issues.apache.org/jira/browse/SPARK-10395
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 1.5.0
            Reporter: Cheng Lian
            Assignee: Cheng Lian
            Priority: Minor


The Parquet {{ReadSupport}} API is more complicated than it needs to be, for historical reasons.  In older versions of parquet-mr (1.6.0rc3 and earlier), {{ReadSupport}} needed to be instantiated and initialized twice, once on the driver side and once on the executor side.  The {{init()}} method was for driver-side initialization, while {{prepareForRead()}} was for the executor side.  Starting from parquet-mr 1.6.0, however, this is no longer the case: {{ReadSupport}} is instantiated and initialized only on the executor side.  So, in theory, these two methods could now be combined into a single initialization method.  The only reason (I can think of) to still keep both is backwards compatibility of the parquet-mr API.

Because of this, we no longer need to rely on {{ReadContext}} to pass the requested schema from {{init()}} to {{prepareForRead()}}; a private {{var}} holding the requested schema in {{CatalystReadSupport}} is enough.
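
A minimal sketch of the idea, assuming both methods are called on the same instance on the executor side.  The types {{Schema}}, {{ReadContext}} and {{SimplifiedReadSupport}}, the class {{SketchReadSupport}} and the configuration key {{sketch.requested_schema}} are all hypothetical stand-ins invented for illustration; they are not the real parquet-mr or Spark signatures.

```scala
// Hypothetical stand-ins for the parquet-mr types. The real ReadSupport API
// has different signatures and is deliberately not reproduced here.
final case class Schema(fields: Seq[String])
final case class ReadContext(requestedSchema: Schema)

trait SimplifiedReadSupport {
  def init(conf: Map[String, String], fileSchema: Schema): ReadContext
  def prepareForRead(conf: Map[String, String], fileSchema: Schema, context: ReadContext): String
}

class SketchReadSupport extends SimplifiedReadSupport {
  // Unset until init() runs. Keeping the schema here is only safe because
  // init() and prepareForRead() are assumed to run on the same instance.
  private var requestedSchema: Schema = null

  override def init(conf: Map[String, String], fileSchema: Schema): ReadContext = {
    // "sketch.requested_schema" is a made-up key holding comma-separated column names.
    requestedSchema = Schema(conf("sketch.requested_schema").split(",").toList)
    // A ReadContext is still returned to satisfy the interface.
    ReadContext(requestedSchema)
  }

  override def prepareForRead(
      conf: Map[String, String],
      fileSchema: Schema,
      context: ReadContext): String = {
    // Read the schema from the private field instead of unpacking it from `context`.
    s"materializer for ${requestedSchema.fields.mkString(",")}"
  }
}
```

In this sketch {{prepareForRead()}} never looks at the {{context}} argument, which is the point: the field replaces {{ReadContext}} as the channel between the two calls.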

Also, now that the old Parquet support code has been removed, we always set the Catalyst requested schema properly when reading Parquet files.  So all the "fallback" logic in {{CatalystReadSupport}} is now redundant.
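
To illustrate what dropping the fallback could look like, here is a rough before/after sketch.  It assumes the fallback in question is "use the file's own schema when no requested schema was set", which is a guess at the shape of the existing logic rather than a copy of it.  The object {{FallbackSketch}}, its methods and the key name are hypothetical.

```scala
object FallbackSketch {
  // Made-up configuration key holding comma-separated column names.
  val RequestedSchemaKey = "sketch.requested_schema"

  // Before: tolerate a missing requested schema by falling back to the file schema.
  def requestedSchemaWithFallback(
      conf: Map[String, String],
      fileSchema: Seq[String]): Seq[String] =
    conf.get(RequestedSchemaKey).map(_.split(",").toList).getOrElse(fileSchema)

  // After: the key is assumed to always be set, so a missing value is treated
  // as a bug and fails fast instead of silently reading every column.
  def requestedSchemaStrict(conf: Map[String, String]): Seq[String] = {
    val raw = conf.getOrElse(
      RequestedSchemaKey,
      throw new IllegalStateException(s"$RequestedSchemaKey is not set"))
    raw.split(",").toList
  }
}
```

The strict variant needs no file schema argument at all, which is part of the simplification.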




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
