Posted to dev@spark.apache.org by Igor Calabria <ig...@gmail.com> on 2022/10/11 17:26:39 UTC

Support "RequiresDistributionAndOrdering" when using CreateTableAs statements

Hi everyone,

It seems that there's no rule to enforce sort order and distribution when
using "CREATE TABLE table PARTITIONED BY (...)  AS (SELECT ... FROM );"
statements.
With Iceberg, partitioned tables have a distribution requirement[1], and it
would be nice to have it applied automatically, just like when inserting
into an existing table.
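Concretely, "applying the distribution" means clustering the incoming rows
by the table's partition values before the write, so rows for the same
partition land in the same task. A minimal stand-in sketch (plain Java, no
Spark; the row model and `clusterByPartition` are hypothetical, not Spark
or Iceberg APIs):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ClusterSketch {
    // Hypothetical row with a single partition column, "day".
    record Row(String day, int value) {}

    // Mimics a clustered write distribution: one bucket per partition
    // value, so each "task" (bucket) writes a single partition.
    static Map<String, List<Row>> clusterByPartition(List<Row> rows) {
        Map<String, List<Row>> buckets = new TreeMap<>();
        for (Row r : rows) {
            buckets.computeIfAbsent(r.day(), k -> new ArrayList<>()).add(r);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
            new Row("2022-10-10", 1),
            new Row("2022-10-11", 2),
            new Row("2022-10-10", 3));
        // Both 2022-10-10 rows end up in the same bucket, which is
        // what the distribution requirement asks the engine for.
        System.out.println(clusterByPartition(rows).keySet());
    }
}
```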

Looking at "org.apache.spark.sql.execution.datasources.v2.V2Writes", there
are a number of rules that enforce these requirements, but none for the
"CreateTableAsSelect" logical plan. I started writing a patch for this,
but I hit a kind of chicken-and-egg problem with the table: all the other
V2Writes rules operate on an existing relation, which isn't the case for
CreateTableAsSelect (makes sense, since the table does not exist yet). This
is tricky because the distribution is derived from the table.
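To make the chicken-and-egg shape concrete, here is a toy model (plain
Java, all names hypothetical and much simplified from the real Catalyst
plans): a rule for an append can ask the resolved table for its required
clustering, but a CTAS node has no table to ask, only a partition spec.

```java
import java.util.List;

public class V2WritesSketch {
    // Hypothetical stand-in for a resolved catalog table that can
    // report its required write clustering.
    interface Table { List<String> requiredClustering(); }

    // An append carries a resolved table; CTAS only carries the spec.
    record AppendData(Table table) {}
    record CreateTableAsSelect(List<String> partitionSpec) {}

    static List<String> distributionFor(Object plan) {
        if (plan instanceof AppendData a) {
            // Easy case: the existing table reports its requirement.
            return a.table().requiredClustering();
        }
        if (plan instanceof CreateTableAsSelect c) {
            // Chicken-and-egg: no table exists yet, so all that's
            // available at rule time is the partition spec itself.
            return c.partitionSpec();
        }
        return List.of();
    }

    public static void main(String[] args) {
        Table t = () -> List.of("day");
        System.out.println(distributionFor(new AppendData(t)));
        System.out.println(distributionFor(
            new CreateTableAsSelect(List.of("day"))));
    }
}
```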

I've probably missed something, but I couldn't find a clean way to
implement this. I have a few (probably bad) ideas, like adding a new
catalog interface that reports the requirements based on the partition
spec and table properties (both of which are available to
"CreateTableAsSelect").
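One possible shape for that idea, sketched in plain Java. The interface
name, its method, and the property key are all made up for illustration,
not real Spark or Iceberg APIs: a catalog could implement something like
this so CTAS can ask for a distribution before the table exists.

```java
import java.util.List;
import java.util.Map;

public class EarlyRequirementsSketch {
    // Hypothetical catalog mixin: reports write requirements from the
    // partition spec and table properties alone, pre-creation.
    interface SupportsWriteRequirements {
        List<String> clusteringFor(List<String> partitionSpec,
                                   Map<String, String> properties);
    }

    public static void main(String[] args) {
        // Example policy: cluster by the partition columns unless a
        // (made-up) property opts out of any distribution.
        SupportsWriteRequirements catalog = (spec, props) ->
            "none".equals(props.get("write.distribution-mode"))
                ? List.of()
                : spec;

        System.out.println(catalog.clusteringFor(List.of("day"), Map.of()));
        System.out.println(catalog.clusteringFor(
            List.of("day"), Map.of("write.distribution-mode", "none")));
    }
}
```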

I'd love to hear some opinions on this issue.

[1]
https://iceberg.apache.org/docs/latest/spark-writes/#writing-to-partitioned-tables