Posted to issues@spark.apache.org by "Ryan Blue (JIRA)" <ji...@apache.org> on 2018/08/22 16:20:00 UTC

[jira] [Commented] (SPARK-25188) Add WriteConfig

    [ https://issues.apache.org/jira/browse/SPARK-25188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16589086#comment-16589086 ] 

Ryan Blue commented on SPARK-25188:
-----------------------------------

Here's the original proposal for adding a write config:

The read side has a {{ScanConfig}}, but the write side doesn't have an equivalent object that tracks a particular write. I think if we introduce one, the API would be more similar between the read and write sides, and we would have a better API for overwrite operations. I propose adding a {{WriteConfig}} object and passing it like this:

{code:lang=java}
interface BatchWriteSupport {
  WriteConfig newWriteConfig(Map<String, String> writeOptions);

  DataWriterFactory createWriterFactory(WriteConfig config);

  void commit(WriteConfig config, WriterCommitMessage[] messages);
}
{code}
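
Read end to end, the intended driver-side flow would look roughly like this. This is only a sketch assuming the interface above; {{runWriters}} is a hypothetical stand-in for the executor-side work, not a real Spark method:

{code:lang=java}
import java.util.Map;

// Sketch of the driver-side write flow against the proposed API.
// runWriters is a hypothetical placeholder for launching the writers.
class WriteFlowSketch {
  WriterCommitMessage[] runWriters(DataWriterFactory factory) {
    throw new UnsupportedOperationException("executor-side work, elided");
  }

  void write(BatchWriteSupport support, Map<String, String> options) {
    WriteConfig config = support.newWriteConfig(options);             // 1. configure the write
    DataWriterFactory factory = support.createWriterFactory(config);  // 2. ship to executors
    WriterCommitMessage[] messages = runWriters(factory);             // 3. run the writers
    support.commit(config, messages);                                 // 4. commit atomically
  }
}
{code}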

That allows us to pass options for the write that affect how the writer factory operates. For example, in Iceberg I could request ORC as the underlying file format instead of Parquet. (I also suggested an addition like this for the read side.)
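
To make that concrete, here's a minimal sketch of how a source could consume the options in {{newWriteConfig}} to pick an output format. This is not Spark's or Iceberg's actual API; the {{write-format}} key and the {{FormatWriteConfig}} class are illustrative assumptions:

{code:lang=java}
import java.util.Map;

// Illustrative only: FormatWriteConfig and the "write-format" key are
// assumptions, not part of Spark or Iceberg.
class FormatWriteConfig {
  final String format;                // e.g. "parquet" or "orc"
  final Map<String, String> options;  // the raw write options

  FormatWriteConfig(String format, Map<String, String> options) {
    this.format = format;
    this.options = options;
  }
}

class ExampleWriteSupport {
  // Mirrors the proposed newWriteConfig(Map<String, String>) hook.
  FormatWriteConfig newWriteConfig(Map<String, String> writeOptions) {
    // Default to Parquet unless the caller requests another format.
    String format = writeOptions.getOrDefault("write-format", "parquet");
    return new FormatWriteConfig(format, writeOptions);
  }
}
{code}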

The other benefit of adding {{WriteConfig}} is that it provides a clean way of adding the replace-data operations. The two I'm currently working on are ReplaceDynamicPartitions and ReplaceData: the first removes any existing data in the partitions being written to, and the second replaces data that matches a filter, e.g. {{df.writeTo(t).overwrite($"day" === "2018-08-15")}}. The information about what to replace could be carried by the {{WriteConfig}} to {{commit}}, and the config would be created with a support interface:

{code:lang=java}
interface BatchOverwriteSupport extends BatchWriteSupport {
  WriteConfig newOverwrite(Map<String, String> writeOptions, Filter[] filters);

  WriteConfig newDynamicOverwrite(Map<String, String> writeOptions);
}
{code}

This is much cleaner than staging a delete and then running a write to complete the operation: all of the information about what to overwrite is passed to the {{commit}} operation, which can handle the delete and the new data at once. It is especially better for dynamic partition replacement, because the partitions to be replaced aren't even known to Spark before the write runs.
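
Here's a sketch of that single-commit flow, assuming a simple {{OverwriteConfig}} that carries the filters from {{newOverwrite}} through to {{commit}}. None of these class names are Spark's, and {{String[]}} stands in for {{Filter[]}} to keep the example self-contained:

{code:lang=java}
import java.util.Arrays;
import java.util.Map;

// Illustrative stand-ins: OverwriteConfig and the String[] filter
// representation are assumptions, not Spark's Filter[] API.
class OverwriteConfig {
  final String[] deleteFilters;  // e.g. {"day = '2018-08-15'"}

  OverwriteConfig(String[] deleteFilters) {
    this.deleteFilters = deleteFilters;
  }
}

class ExampleOverwriteSupport {
  OverwriteConfig newOverwrite(Map<String, String> writeOptions, String[] filters) {
    // The overwrite condition is captured up front and carried to commit.
    return new OverwriteConfig(filters);
  }

  void commit(OverwriteConfig config, String[] commitMessages) {
    // One atomic commit: drop the rows matching the filters and add the
    // newly written files, instead of a staged delete followed by a write.
    System.out.println("replacing rows matching " + Arrays.toString(config.deleteFilters)
        + " with " + commitMessages.length + " new files");
  }
}
{code}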

Last, this adds a place for write life-cycle operations that matches the {{ScanConfig}} read life-cycle. It could be used for operations like acquiring a write lock on a Hive table, if someone wanted to support Hive's locking mechanism in the future.
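
For example, a minimal life-cycle sketch, assuming hypothetical {{acquireLock}}/{{releaseLock}} helpers rather than Hive's real lock manager API:

{code:lang=java}
import java.util.Map;

// Hypothetical sketch: LockingWriteConfig and the lock helpers are
// assumptions, used only to show where lock management could hook in.
class LockingWriteConfig {
  final long lockId;

  LockingWriteConfig(long lockId) {
    this.lockId = lockId;
  }
}

class ExampleLockingWriteSupport {
  LockingWriteConfig newWriteConfig(Map<String, String> writeOptions) {
    // Acquire the table lock when the write is configured...
    return new LockingWriteConfig(acquireLock());
  }

  void commit(LockingWriteConfig config) {
    // ...and release it when the write's life-cycle ends.
    releaseLock(config.lockId);
  }

  private long acquireLock() { return 42L; }  // stand-in for a real lock manager
  private void releaseLock(long lockId) { }   // stand-in for a real lock manager
}
{code}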

> Add WriteConfig
> ---------------
>
>                 Key: SPARK-25188
>                 URL: https://issues.apache.org/jira/browse/SPARK-25188
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Wenchen Fan
>            Priority: Major
>
> The current write API is not flexible enough to implement more complex write operations like `replaceWhere`. We can follow the read API and add a `WriteConfig` to make it more flexible.


