You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@iceberg.apache.org by Gautam <ga...@gmail.com> on 2019/08/30 12:42:15 UTC

Nested Column Pruning in Iceberg (DSV2) ..

Hello Devs,
                    I was measuring perf on structs between V1 and V2
datasources. Found that although Iceberg Reader supports
`SupportsPushDownRequiredColumns` it doesn't seem to prune nested column
projections. I want to be able to prune on nested fields. How does V2
datasource have provision to be able to let Iceberg decide this? The
`SupportsPushDownRequiredColumns` mix-in gives the entire struct field even
if a sub-field is requested.

*Here's an illustration .. *

scala> spark.sql("select location.lat from iceberg_people_struct").show()
+-------+
|    lat|
+-------+
|   null|
|101.123|
|175.926|
+-------+


The pruning gets the entire struct instead of just `location.lat`  ..

*public void pruneColumns(StructType newRequestedSchema) *

19/08/30 16:25:38 WARN Reader: => Prune columns : {
  "type" : "struct",
  "fields" : [ {
    "name" : "location",
    "type" : {
      "type" : "struct",
      "fields" : [ {
        "name" : "lat",
        "type" : "double",
        "nullable" : true,
        "metadata" : { }
      }, {
        "name" : "lon",
        "type" : "double",
        "nullable" : true,
        "metadata" : { }
      } ]
    },
    "nullable" : true,
    "metadata" : { }
  } ]
}

Is there information I can use in the IcebergSource (or add some) that can
be used to prune the exact sub-field here?  What's a good way to approach
this? For dense/wide struct fields this affects performance significantly.


Sample gist:
https://gist.github.com/prodeezy/001cf155ff0675be7d307e9f842e1dac


thanks and regards,
-Gautam.

Re: Nested Column Pruning in Iceberg (DSV2) ..

Posted by Ryan Blue <rb...@netflix.com.INVALID>.

It would be great to get this into Spark master. I think it would make the
DSv2 path more valuable before the 3.0 release!

On Fri, Aug 30, 2019 at 9:58 PM Gautam Kowshik <ga...@gmail.com>
wrote:

> Super! That’d be great. Lemme know if I can help in any way.
>
> Sent from my iPhone
>
> > On Aug 30, 2019, at 6:30 PM, Anton Okolnychyi
> <ao...@apple.com.invalid> wrote:
> >
> > Hi Gautam,
> >
> > Iceberg does support nested schema pruning but Spark doesn’t request
> this for DS V2 in 2.4. Internally, we had to modify Spark 2.4 to make this
> work end-to-end.
> > One of the options is to extend DataSourceV2Strategy with logic similar
> to what we have in ParquetSchemaPruning in 2.4.0. I think we can share that
> part if needed.
> >
> > I am planning to check whether Spark master already has this
> functionality.
> > If that’s not implemented and nobody is working on it yet, I can fix it.
> >
> > - Anton
> >
> >
> >> On 30 Aug 2019, at 15:42, Gautam <ga...@gmail.com> wrote:
> >>
> >> Hello Devs,
> >>                    I was measuring perf on structs between V1 and V2
> datasources. Found that although Iceberg Reader supports
> `SupportsPushDownRequiredColumns` it doesn't seem to prune nested column
> projections. I want to be able to prune on nested fields. How does V2
> datasource have provision to be able to let Iceberg decide this? The
> `SupportsPushDownRequiredColumns` mix-in gives the entire struct field even
> if a sub-field is requested.
> >>
> >> Here's an illustration ..
> >>
> >> scala> spark.sql("select location.lat from
> iceberg_people_struct").show()
> >> +-------+
> >> |    lat|
> >> +-------+
> >> |   null|
> >> |101.123|
> >> |175.926|
> >> +-------+
> >>
> >>
> >> The pruning gets the entire struct instead of just `location.lat`  ..
> >>
> >> public void pruneColumns(StructType newRequestedSchema)
> >>
> >> 19/08/30 16:25:38 WARN Reader: => Prune columns : {
> >>  "type" : "struct",
> >>  "fields" : [ {
> >>    "name" : "location",
> >>    "type" : {
> >>      "type" : "struct",
> >>      "fields" : [ {
> >>        "name" : "lat",
> >>        "type" : "double",
> >>        "nullable" : true,
> >>        "metadata" : { }
> >>      }, {
> >>        "name" : "lon",
> >>        "type" : "double",
> >>        "nullable" : true,
> >>        "metadata" : { }
> >>      } ]
> >>    },
> >>    "nullable" : true,
> >>    "metadata" : { }
> >>  } ]
> >> }
> >>
> >> Is there information I can use in the IcebergSource (or add some) that
> can be used to prune the exact sub-field here?  What's a good way to
> approach this? For dense/wide struct fields this affects performance
> significantly.
> >>
> >>
> >> Sample gist:
> https://gist.github.com/prodeezy/001cf155ff0675be7d307e9f842e1dac
> >>
> >>
> >> thanks and regards,
> >> -Gautam.
> >
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: Nested Column Pruning in Iceberg (DSV2) ..

Posted by Gautam Kowshik <ga...@gmail.com>.

Super! That’d be great. Lemme know if I can help in any way. 

Sent from my iPhone

> On Aug 30, 2019, at 6:30 PM, Anton Okolnychyi <ao...@apple.com.invalid> wrote:
> 
> Hi Gautam,
> 
> Iceberg does support nested schema pruning but Spark doesn’t request this for DS V2 in 2.4. Internally, we had to modify Spark 2.4 to make this work end-to-end.
> One of the options is to extend DataSourceV2Strategy with logic similar to what we have in ParquetSchemaPruning in 2.4.0. I think we can share that part if needed.
> 
> I am planning to check whether Spark master already has this functionality.
> If that’s not implemented and nobody is working on it yet, I can fix it.
> 
> - Anton
> 
> 
>> On 30 Aug 2019, at 15:42, Gautam <ga...@gmail.com> wrote:
>> 
>> Hello Devs,
>>                    I was measuring perf on structs between V1 and V2 datasources. Found that although Iceberg Reader supports `SupportsPushDownRequiredColumns` it doesn't seem to prune nested column projections. I want to be able to prune on nested fields. How does V2 datasource have provision to be able to let Iceberg decide this? The `SupportsPushDownRequiredColumns` mix-in gives the entire struct field even if a sub-field is requested.
>> 
>> Here's an illustration .. 
>> 
>> scala> spark.sql("select location.lat from iceberg_people_struct").show()
>> +-------+
>> |    lat|
>> +-------+
>> |   null|
>> |101.123|
>> |175.926|
>> +-------+
>> 
>> 
>> The pruning gets the entire struct instead of just `location.lat`  ..
>> 
>> public void pruneColumns(StructType newRequestedSchema) 
>> 
>> 19/08/30 16:25:38 WARN Reader: => Prune columns : {
>>  "type" : "struct",
>>  "fields" : [ {
>>    "name" : "location",
>>    "type" : {
>>      "type" : "struct",
>>      "fields" : [ {
>>        "name" : "lat",
>>        "type" : "double",
>>        "nullable" : true,
>>        "metadata" : { }
>>      }, {
>>        "name" : "lon",
>>        "type" : "double",
>>        "nullable" : true,
>>        "metadata" : { }
>>      } ]
>>    },
>>    "nullable" : true,
>>    "metadata" : { }
>>  } ]
>> }
>> 
>> Is there information I can use in the IcebergSource (or add some) that can be used to prune the exact sub-field here?  What's a good way to approach this? For dense/wide struct fields this affects performance significantly.
>> 
>> 
>> Sample gist: https://gist.github.com/prodeezy/001cf155ff0675be7d307e9f842e1dac
>> 
>> 
>> thanks and regards,
>> -Gautam.
>

Re: Nested Column Pruning in Iceberg (DSV2) ..

Posted by Anton Okolnychyi <ao...@apple.com.INVALID>.

Hi Gautam,

Iceberg does support nested schema pruning but Spark doesn’t request this for DS V2 in 2.4. Internally, we had to modify Spark 2.4 to make this work end-to-end.
One of the options is to extend DataSourceV2Strategy with logic similar to what we have in ParquetSchemaPruning in 2.4.0. I think we can share that part if needed.

I am planning to check whether Spark master already has this functionality.
If that’s not implemented and nobody is working on it yet, I can fix it.

- Anton


> On 30 Aug 2019, at 15:42, Gautam <ga...@gmail.com> wrote:
> 
> Hello Devs,
>                     I was measuring perf on structs between V1 and V2 datasources. Found that although Iceberg Reader supports `SupportsPushDownRequiredColumns` it doesn't seem to prune nested column projections. I want to be able to prune on nested fields. How does V2 datasource have provision to be able to let Iceberg decide this? The `SupportsPushDownRequiredColumns` mix-in gives the entire struct field even if a sub-field is requested.
> 
> Here's an illustration .. 
> 
> scala> spark.sql("select location.lat from iceberg_people_struct").show()
> +-------+
> |    lat|
> +-------+
> |   null|
> |101.123|
> |175.926|
> +-------+
> 
> 
> The pruning gets the entire struct instead of just `location.lat`  ..
> 
> public void pruneColumns(StructType newRequestedSchema) 
> 
> 19/08/30 16:25:38 WARN Reader: => Prune columns : {
>   "type" : "struct",
>   "fields" : [ {
>     "name" : "location",
>     "type" : {
>       "type" : "struct",
>       "fields" : [ {
>         "name" : "lat",
>         "type" : "double",
>         "nullable" : true,
>         "metadata" : { }
>       }, {
>         "name" : "lon",
>         "type" : "double",
>         "nullable" : true,
>         "metadata" : { }
>       } ]
>     },
>     "nullable" : true,
>     "metadata" : { }
>   } ]
> }
> 
> Is there information I can use in the IcebergSource (or add some) that can be used to prune the exact sub-field here?  What's a good way to approach this? For dense/wide struct fields this affects performance significantly.
>  
> 
> Sample gist: https://gist.github.com/prodeezy/001cf155ff0675be7d307e9f842e1dac
> 
> 
> thanks and regards,
> -Gautam.