Posted to issues@spark.apache.org by "Abhijeet (Jira)" <ji...@apache.org> on 2019/09/30 05:28:00 UTC

[jira] [Updated] (SPARK-29299) Intermittently getting "Cannot create the managed table error" while creating table from spark 2.4

     [ https://issues.apache.org/jira/browse/SPARK-29299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Abhijeet updated SPARK-29299:
-----------------------------
    Summary: Intermittently getting "Cannot create the managed table error" while creating table from spark 2.4  (was: Intermittently getting "Can not create the managed table error" while creating table from spark 2.4)

> Intermittently getting "Cannot create the managed table error" while creating table from spark 2.4
> --------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-29299
>                 URL: https://issues.apache.org/jira/browse/SPARK-29299
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.0
>            Reporter: Abhijeet
>            Priority: Major
>
> We are intermittently hitting the error below in Spark 2.4 when saving a managed table from Spark.
> Error -
> pyspark.sql.utils.AnalysisException: u"Can not create the managed table('`hive_issue`.`table`'). The associated location('s3://{bucket_name}/EMRFS_WARE_TEST167_new/warehouse/hive_issue.db/table') already exists.;"
> Steps to reproduce --
> 1. Create a dataframe from mid-size data (a 30MB CSV file)
> 2. Save the dataframe as a table
> 3. Terminate the session while the above operation is in progress
> Note --
> Session termination is just a convenient way to reproduce the issue. In practice we hit it intermittently when running the same Spark jobs multiple times. We use both EMRFS and HDFS on an EMR cluster, and the issue occurs on both file systems.
> The only way we have found to fix it is to delete the target folder where the table keeps its files (a sketch of that cleanup is shown below). That is not an option for us, because we need to keep historical data in the table; this is why we write with APPEND mode.
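> For reference, a minimal sketch of the cleanup workaround described above, using the Hadoop FileSystem API through PySpark's JVM gateway. The table location below is hypothetical, and the _jvm/_jsc accessors are internal PySpark attributes, so treat this as an illustration rather than a supported API:
> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder.enableHiveSupport().getOrCreate()
>
> # Hypothetical leftover location of the managed table; substitute your own.
> table_location = "s3://{bucket_name}/warehouse/hive_issue.db/table"
>
> # Resolve the path against whatever file system backs it (EMRFS or HDFS).
> path = spark._jvm.org.apache.hadoop.fs.Path(table_location)
> fs = path.getFileSystem(spark._jsc.hadoopConfiguration())
>
> # Recursively delete the stale directory so saveAsTable can recreate it.
> if fs.exists(path):
>     fs.delete(path, True)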
> Sample code --
> from pyspark.sql import SparkSession
>
> sc = SparkSession.builder.enableHiveSupport().getOrCreate()
>
> # Read a mid-size CSV from S3 (bucket name elided).
> df = sc.read.csv("s3://{sample-bucket}1/DATA/consumecomplians.csv")
>
> print("STARTED WRITING TO TABLE")
> # Terminate the session with Ctrl+C once the df.write action below has started
> df.write.mode("append").saveAsTable("hive_issue.table")
> print("COMPLETED WRITING TO TABLE")
> We went through the Spark 2.4 documentation [1] and found that Spark no longer allows creating managed tables on non-empty locations.
> 1. What is the reason behind this change in Spark's behaviour?
> 2. To us this looks like a breaking change: despite specifying the "overwrite" option, Spark is unable to wipe out the existing data and create the table.
> 3. Is there any solution for this issue other than setting the "spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation" flag (see the sketch after the reference below)?
> [1]
> https://spark.apache.org/docs/latest/sql-migration-guide-upgrade.html
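> For completeness, a minimal sketch of the legacy-flag workaround from question 3. The flag is the one named in the Spark 2.4 migration guide; the rest of the builder chain is standard PySpark:
> from pyspark.sql import SparkSession
>
> # Restore the pre-2.4 behaviour of allowing a managed table to be
> # created on a non-empty location (disallowed by default in 2.4).
> spark = (SparkSession.builder
>          .enableHiveSupport()
>          .config("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation",
>                  "true")
>          .getOrCreate())
>
> df = spark.read.csv("s3://{sample-bucket}1/DATA/consumecomplians.csv")
> df.write.mode("append").saveAsTable("hive_issue.table")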
>  


