Posted to dev@iceberg.apache.org by Harish Butani <rh...@gmail.com> on 2019/09/16 19:16:30 UTC

Surfacing Iceberg in Spark SQL versions 2.x

Hi,

I have done some initial work on this: https://github.com/hbutani/icebergSQL
See the README <https://github.com/hbutani/icebergSQL/blob/master/README.md> and the example <https://github.com/hbutani/icebergSQL/blob/master/docs/basicExample.sql> for details.

The goal is to provide the following for DataSource V1 tables:
- Allow users to create managed tables and define source column to partition column transformations as table options.
- Have SQL insert statements create new Iceberg Table snapshots.
- Have SQL select statements leverage Iceberg Table snapshots for partition and file pruning.
- Provide a new 'as of' clause to the SQL select statement to run a query against a particular snapshot of a managed table.
- Extend Spark SQL with Iceberg management views and statements to view and manage the snapshots of a managed table.
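
To make the 'as of' idea concrete, here is a rough sketch of how a query timestamp could be resolved to a snapshot: pick the latest snapshot committed at or before the requested time. This is only an illustration, not the icebergSQL implementation; the `Snapshot` class, the `snapshot_as_of` function, and the sample `history` are all hypothetical names made up for this example.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical stand-in for a table snapshot record."""
    snapshot_id: int
    committed_at_ms: int  # commit time in epoch milliseconds


def snapshot_as_of(snapshots: List[Snapshot], as_of_ms: int) -> Optional[Snapshot]:
    """Return the latest snapshot committed at or before as_of_ms, or None."""
    # Order by commit time so we can binary-search on the timestamps.
    ordered = sorted(snapshots, key=lambda s: s.committed_at_ms)
    times = [s.committed_at_ms for s in ordered]
    # bisect_right gives the count of snapshots with commit time <= as_of_ms.
    idx = bisect_right(times, as_of_ms)
    return ordered[idx - 1] if idx > 0 else None


# Made-up snapshot history, purely for illustration.
history = [Snapshot(1, 1000), Snapshot(2, 2000), Snapshot(3, 3000)]
```

A query 'as of' a time before the first commit resolves to no snapshot in this sketch; a real implementation would have to decide whether that is an error or an empty result.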

Reason for this:
Our experience is that a lot of deployments that use V1 datasource tables can benefit from Iceberg. So we focus on Spark 2.x; the repo is at 2.4.4, but it is easy to backport to 2.3.x and 2.2.x.
I see there is work going on to surface Iceberg Table Management as a V2 Datasource table <https://databricks.com/session/apache-spark-data-source-v2>, but as far as I can tell, SQL integration for V2 Datasources is still in the works.

Looking for feedback from the Iceberg community.

Regards,
Harish Butani.

(Please cc my email on any replies; I am not subscribed to the Iceberg dev list.)