Posted to user@spark.apache.org by francexo83 <fr...@gmail.com> on 2019/10/24 14:21:43 UTC

[Spark SQL] Direct write to Hive and S3 while executing a CTAS in Spark SQL

Hi all,
I'm using Spark 2.4.0; spark.sql.catalogImplementation is set to hive
and spark.sql.warehouse.dir points to a specific S3 bucket.
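
For clarity, the session is set up roughly like this (the bucket name below is
just a placeholder, not my real warehouse location):

import org.apache.spark.sql.SparkSession

// Hive catalog (spark.sql.catalogImplementation=hive) plus an S3 warehouse dir.
// "s3a://my-bucket/warehouse" stands in for the real bucket.
val spark = SparkSession.builder()
  .appName("ctas-on-s3")
  .config("spark.sql.warehouse.dir", "s3a://my-bucket/warehouse")
  .enableHiveSupport()
  .getOrCreate()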

I want to execute a CTAS statement in Spark SQL like the one below.

*create table db_name.table_name as (select ..)*
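
In code I simply issue it through spark.sql (table and source names here are
placeholders):

spark.sql(
  """CREATE TABLE db_name.table_name AS
    |SELECT *
    |FROM db_name.source_table""".stripMargin)
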
When writing, Spark always uses the Hive staging folder on S3 as a scratch
dir. Once the executors finish their computation, Spark moves the files
from the staging dir to the final table location.
This degrades the write phase, because on an object store rename is not a
native operation: it is carried out as a copy followed by a delete.
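
To illustrate what I mean, the commit step boils down to something like the
following (the paths are made up):

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val hadoopConf = new Configuration()
val fs = FileSystem.get(new URI("s3a://my-bucket/"), hadoopConf)

// Executors write into a hidden .hive-staging directory under the table path...
val staged = new Path(
  "s3a://my-bucket/warehouse/db_name.db/table_name/.hive-staging_hive_xxx/-ext-10000/part-00000")
// ...and the result is then "renamed" into the final location. S3A has no
// atomic rename, so this becomes a copy plus a delete per file, which is
// what slows the write phase down.
val target = new Path(
  "s3a://my-bucket/warehouse/db_name.db/table_name/part-00000")
fs.rename(staged, target)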

Is it possible to enable a direct write to the S3 bucket (skipping the
staging dir and the final move) when running a CTAS in the scenario
described above?

For comparison, when I perform the same write through the
DataFrameWriter.saveAsTable API, I do get the desired result.
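
That is, something along these lines works as I expect (placeholder names
again):

val df = spark.sql("SELECT * FROM db_name.source_table")
df.write
  .mode("overwrite")
  .format("parquet")
  .saveAsTable("db_name.table_name")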

Thank you in advance