
Posted to issues@iceberg.apache.org by "gustavoatt (via GitHub)" <gi...@apache.org> on 2023/02/24 17:33:16 UTC

[GitHub] [iceberg] gustavoatt opened a new issue, #6932: Spark SQL writes with a specific partition spec ID

gustavoatt opened a new issue, #6932:
URL: https://github.com/apache/iceberg/issues/6932

### Feature Request / Improvement

## Current behavior

Currently, any write made through Spark SQL always uses the [default partition spec ID](https://github.com/apache/iceberg/blob/master/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkWrite.java#L632). Writing with a different spec ID is possible, but only through Iceberg's lower-level APIs rather than a direct Spark SQL write.

## Use case context

We have an internal use case where we have a table `table` with two different partition specs:

* `(ds, hr)`: this is the default partition spec which is used to write data in an hourly cadence.
* `(ds)`: this spec is used at the end of the day when compacting all 24 hours of data into a single partition. We do this both to compact these files efficiently and to keep track of when full daily partitions have landed.

This works well for us, except when our GDPR job rewrites a whole day of data: the rewrite uses the default spec, i.e. `(ds, hr)`, which ends up creating more files than needed.
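To make the file-count concern concrete, here is a minimal sketch in plain Python (not Iceberg code) that counts the distinct partitions one day of data falls into under each spec. The row layout, the sample date, and the helper `partition_keys` are made up for illustration; the only point is that every distinct partition needs at least one data file.

```python
from typing import Dict, Iterable, Set, Tuple


def partition_keys(rows: Iterable[Dict[str, object]],
                   spec_fields: Tuple[str, ...]) -> Set[tuple]:
    """Return the distinct partition tuples the rows map to under a spec.

    Hypothetical helper: models identity partitioning on the listed fields.
    """
    return {tuple(row[field] for field in spec_fields) for row in rows}


# One illustrative row per hour of a single day (sample values, not real data).
day_rows = [{"ds": "2023-02-24", "hr": hr} for hr in range(24)]

hourly_partitions = partition_keys(day_rows, ("ds", "hr"))
daily_partitions = partition_keys(day_rows, ("ds",))

# Each distinct partition requires at least one data file.
print(len(hourly_partitions), len(daily_partitions))  # prints "24 1"
```

Under these assumptions, a full-day rewrite with `(ds, hr)` produces at least 24 files where the `(ds)` spec could produce one.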

## Proposed feature

I would like to propose a feature that would let us override the default write spec ID when using Spark by passing a new `SparkWriteOption` called `output-spec-id`.

Whenever `output-spec-id` is set, we will write data using that partition spec; otherwise we will use the default spec ID, as we do today.
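The selection rule described above could be sketched roughly as follows. This is plain Python modelling the proposed behavior, not the Iceberg or Spark API: `PartitionSpec`, `Table`, and `resolve_output_spec` are hypothetical stand-ins, and rejecting an unknown spec ID is an assumption about how the option might be validated.

```python
from dataclasses import dataclass
from typing import Mapping, Tuple


@dataclass(frozen=True)
class PartitionSpec:
    """Hypothetical stand-in for a table partition spec."""
    spec_id: int
    fields: Tuple[str, ...]


@dataclass(frozen=True)
class Table:
    """Hypothetical stand-in for table metadata: all specs plus the default ID."""
    specs: Mapping[int, PartitionSpec]
    default_spec_id: int


def resolve_output_spec(table: Table,
                        write_options: Mapping[str, str]) -> PartitionSpec:
    """Pick the spec to write with: `output-spec-id` if set, else the default."""
    raw = write_options.get("output-spec-id")
    if raw is None:
        # Option not set: keep the current behavior.
        return table.specs[table.default_spec_id]
    spec_id = int(raw)
    if spec_id not in table.specs:
        # Assumed validation: refuse spec IDs the table does not know about.
        raise ValueError(f"Unknown partition spec ID: {spec_id}")
    return table.specs[spec_id]


table = Table(
    specs={
        0: PartitionSpec(0, ("ds", "hr")),  # default, hourly writes
        1: PartitionSpec(1, ("ds",)),       # daily compaction
    },
    default_spec_id=0,
)

hourly = resolve_output_spec(table, {})
daily = resolve_output_spec(table, {"output-spec-id": "1"})
```

With this rule, the GDPR rewrite in the use case above would pass the ID of the `(ds)` spec, while regular hourly writes would omit the option and keep using `(ds, hr)`.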

Is there any interest in having this feature? I can work on a PR to enable it.

### Query engine

Spark

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] szehon-ho closed issue #6932: Spark SQL writes with a specific partition spec ID

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.

szehon-ho closed issue #6932: Spark SQL writes with a specific partition spec ID
URL: https://github.com/apache/iceberg/issues/6932

