Posted to user@spark.apache.org by krexos <kr...@protonmail.com.INVALID> on 2023/01/21 12:02:26 UTC

Table created with saveAsTable behaves differently than a table created with spark.sql("CREATE TABLE...")

My periodically running process writes data to a table over parquet files with the configuration "spark.sql.sources.partitionOverwriteMode" = "dynamic", using the following code:

if (!tableExists) {
  df.write
    .mode("overwrite")
    .partitionBy("partitionCol")
    .format("parquet")
    .saveAsTable("tablename")
} else {
  df.write
    .format("parquet")
    .mode("overwrite")
    .insertInto("table")
}
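
For reference, the overwrite-mode setting mentioned above is applied on the SparkSession before this snippet runs; a minimal sketch (the app name is illustrative, not taken from my actual job):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partitioned-writer")  // illustrative name
  .enableHiveSupport()
  // With "dynamic", an overwrite insertInto replaces only the partitions
  // present in the incoming DataFrame instead of truncating the whole table.
  .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
  .getOrCreate()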

If the table doesn't exist and is created in the first clause, it works fine, and on the next run, when the table does exist and the else clause runs, it also works as expected.

However, when I create the table over existing parquet files, either through a Hive session or using spark.sql("CREATE TABLE..."), and then run the process, it fails to write with the error:

"org.apache.spark.SparkException: Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict"

Adding this configuration to the Spark conf solves the issue, but I don't understand why it is needed when the table is created through a DDL command yet isn't needed when the table is created with saveAsTable.
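
Concretely, the workaround I mean is roughly the following; a sketch, assuming the setting is applied on the already-built session:

spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
// hive.exec.dynamic.partition = true is commonly set alongside it, though
// the error above only complains about the mode.
spark.conf.set("hive.exec.dynamic.partition", "true")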

Also, I don't understand how this configuration is relevant for Spark. From what I've read <https://cwiki.apache.org/confluence/display/hive/tutorial#Tutorial-Dynamic-PartitionInsert>, a static partition here means we directly specify the partition to write into instead of specifying the column to partition by. Is it even possible to do such an insert in Spark (as opposed to HiveQL)?
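
The kind of static-partition insert I mean would look roughly like this sketch (Hive-style syntax through spark.sql; the table name, columns, and partition value are illustrative):

spark.sql("""
  INSERT OVERWRITE TABLE tablename PARTITION (partitionCol = '2023-01-21')
  SELECT col1, col2
  FROM source_view
""")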

Spark 2.4, Hadoop 3.1

thanks

Re: Table created with saveAsTable behaves differently than a table created with spark.sql("CREATE TABLE...")

Posted by krexos <kr...@protonmail.com.INVALID>.
But in this case too, the single partition column is dynamic, so I would expect the error to be thrown here as well.

When I create the table through a query I do it with PARTITION BY 'partitionCol', roughly along the lines of the DDL sketched below.
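
A minimal sketch of that kind of statement, assuming a Hive-format table over the existing files; the columns, types, and location are illustrative, not my actual schema:

spark.sql("""
  CREATE EXTERNAL TABLE tablename (col1 STRING, col2 BIGINT)
  PARTITIONED BY (partitionCol STRING)
  STORED AS PARQUET
  LOCATION '/path/to/existing/parquet'
""")
// Register the partition directories that already exist under the location.
spark.sql("MSCK REPAIR TABLE tablename")
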
thanks

------- Original Message -------
On Saturday, January 21st, 2023 at 9:27 PM, Peyman Mohajerian <mo...@gmail.com> wrote:

> In the case of saveAsTable("tablename") you specified the partition: 'partitionBy("partitionCol")'

Re: Table created with saveAsTable behaves differently than a table created with spark.sql("CREATE TABLE...")

Posted by Peyman Mohajerian <mo...@gmail.com>.
In the case of saveAsTable("tablename") you specified the partition: 'partitionBy("partitionCol")'
