Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2019/11/01 21:24:53 UTC

[GitHub] [incubator-iceberg] xabriel commented on issue #590: Allow spark.read.schema() to be set.

URL: https://github.com/apache/incubator-iceberg/pull/590#issuecomment-548956148
 
 
   > Is this intended as a work-around for Spark not supporting nested column pruning?
   
   @aokolnychyi graciously shared their code for nested column pruning with us, so in our environment nested column pruning works as expected.
   
   But the case I'm trying to solve in this PR is projection of partial nested columns, which still requires some way to specify the set of nested columns that I want.
   
   > For example, when you load tables from SQL or using spark.table there is no way to pass this in. 
   
   Agreed, although you could do a preliminary projection before moving to SparkSQL:
   ```scala
   val df = spark.read
     .schema(preliminaryProjectionOnNestedColumns)
     .format("iceberg")
     .load(...)
   
   df.createOrReplaceTempView("dfView")
   spark.sql("SELECT ...")
   ```
   
   Our need to project with a schema comes from the fact that our data sets are highly nested, and those nested levels can also be quite wide. So when `location.*` includes, say, 100 fields and we only want to project 2 of them while preserving the original hierarchy, we need a mechanism that avoids paying the I/O cost of loading all 100 fields.
   
   `df.select("location.lat", "location.lon")` doesn't work, since it changes the shape of the data.
   `df.selectExpr("struct(location.lat, location.lon) as location")` doesn't work either, as per the discussion in the previous comment.
   
   And so that leaves us with `schema`.
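   
   As a sketch of what that looks like (field and path names here are purely illustrative, not from a real table), the partial nested schema keeps the `location` struct but lists only the fields we want:
   
   ```scala
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.types._
   
   // Hypothetical: `location` has ~100 fields, but we only need
   // `lat` and `lon`, while keeping the original nesting intact.
   val partialSchema = StructType(Seq(
     StructField("id", LongType),
     StructField("location", StructType(Seq(
       StructField("lat", DoubleType),
       StructField("lon", DoubleType)
     )))
   ))
   
   val spark = SparkSession.builder().getOrCreate()
   
   // With the change proposed in this PR, the user-supplied schema
   // would prune the unwanted nested fields at read time.
   val df = spark.read
     .schema(partialSchema)
     .format("iceberg")
     .load("/path/to/table") // illustrative path
   ```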
   
   > Maybe we should add project(StructType) to the Dataset API?
   
   That would work, and to your point, it would be the ideal solution since it would be purpose-built. But we'd have to wait for Spark 3.x.  :)
   
   I do understand that allowing this PR now means we will have to support it indefinitely, but I still feel it is a use case that, today, can only be achieved by the `schema` API.
