Posted to commits@hudi.apache.org by "Alexey Kudinkin (Jira)" <ji...@apache.org> on 2022/02/09 01:08:00 UTC

[jira] [Created] (HUDI-3396) Make sure Spark reads only Projected Columns for both MOR/COW

Alexey Kudinkin created HUDI-3396:
-------------------------------------

             Summary: Make sure Spark reads only Projected Columns for both MOR/COW
                 Key: HUDI-3396
                 URL: https://issues.apache.org/jira/browse/HUDI-3396
             Project: Apache Hudi
          Issue Type: Task
            Reporter: Alexey Kudinkin
         Attachments: Screen Shot 2022-02-08 at 4.58.12 PM.png

Spark Relation impl for the MOR table seems to have the following issues:
 * `requiredSchemaParquetReader` still leverages the full table schema, meaning we fetch *all* columns from Parquet (even though the query might be projecting just a handful)
 * `fullSchemaParquetReader` always reads the full table schema, presumably to be able to do merging, which might access arbitrary key-fields. This seems superfluous: it could fetch only the fields designated as `PRECOMBINE_FIELD_NAME` and `RECORDKEY_FIELD_NAME`. That pruning is not possible, however, if either of the following is true:
 ** Virtual Keys are used (key generation will require the whole payload)
 ** A non-trivial merging strategy is used that requires the whole record payload
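To make the proposed pruning rule concrete, here is a minimal sketch (hypothetical helper, not Hudi code) of the effective read schema for the merging reader: the query's projected columns plus the precombine and record-key fields, falling back to the full schema when virtual keys or a non-trivial merge strategy force reading the whole payload:

```python
def merging_read_columns(required_columns,
                         precombine_field,
                         record_key_field,
                         virtual_keys=False,
                         custom_merge=False):
    """Columns the merging Parquet reader actually needs to fetch.

    Returns None to signal "full table schema required" (virtual keys
    mean key generation may touch the whole payload; a custom merge
    strategy may likewise read arbitrary fields).
    """
    if virtual_keys or custom_merge:
        return None  # no pruning possible, read the full record payload
    cols = list(required_columns)
    # Merging only needs the precombine and record-key fields on top
    # of whatever the query projects.
    for key_field in (precombine_field, record_key_field):
        if key_field not in cols:
            cols.append(key_field)
    return cols
```

Under these assumptions, a query projecting a single column would read just that column plus the two key fields, instead of the entire table schema.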

 

!Screen Shot 2022-02-08 at 4.58.12 PM.png!

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)