Posted to user@spark.apache.org by Rick Moritz <ra...@gmail.com> on 2017/04/20 07:48:19 UTC

Concurrent DataFrame.saveAsTable into non-existent tables fails the second job despite Mode.APPEND

Hi List,

I'm wondering if the following behaviour should be considered a bug, or
whether it "works as designed":

I'm starting multiple concurrent (FIFO-scheduled) jobs in a single
SparkContext, some of which write into the same tables.
When these tables already exist, it appears as though both jobs [at least
believe that they] successfully appended to the table (i.e., both jobs
terminate successfully, but I haven't checked whether the data from both
jobs was actually written, or whether one job overwrote the other's data
despite Mode.APPEND). If the table does not exist, both jobs will attempt
to create it, and whichever job's turn comes second (or later) will then
fail with an AlreadyExistsException
(org.apache.spark.sql.AnalysisException:
org.apache.hadoop.hive.ql.metadata.HiveException: AlreadyExistsException).
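
Roughly, a stripped-down version of what I'm doing looks like this (a
sketch assuming Spark 2.x with Hive support; table and column names are
made up for illustration):

  import org.apache.spark.sql.{SaveMode, SparkSession}
  import scala.concurrent.{Await, Future}
  import scala.concurrent.ExecutionContext.Implicits.global
  import scala.concurrent.duration.Duration

  val spark = SparkSession.builder()
    .appName("concurrent-saveAsTable-repro")
    .enableHiveSupport()
    .getOrCreate()
  import spark.implicits._

  // Two jobs sharing one SparkContext, both appending to the same,
  // not-yet-existing table.
  val jobs = Seq("job_a", "job_b").map { name =>
    Future {
      val df = Seq((name, 1), (name, 2)).toDF("source", "value")
      // Whichever job reaches the metastore second fails with
      // AnalysisException / AlreadyExistsException instead of appending.
      df.write.mode(SaveMode.Append).saveAsTable("shared_output_table")
    }
  }
  jobs.foreach(Await.ready(_, Duration.Inf))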

I think the issue here is that both jobs determine early on that they will
need to create the table, but don't register it with the metastore until
they actually start writing to it. The slower job then obviously fails to
create the table and, instead of falling back to appending the data to the
now-existing table, crashes out.
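
One driver-side mitigation I could imagine (just a sketch, untested
against this exact race; names are illustrative) would be to catch the
create-time failure and retry the write as a plain append, since by then
the other job has created the table:

  import org.apache.spark.sql.{AnalysisException, DataFrame, SaveMode}

  def appendWithRetry(df: DataFrame, table: String, attempts: Int = 3): Unit = {
    try {
      df.write.mode(SaveMode.Append).saveAsTable(table)
    } catch {
      case e: AnalysisException
          if attempts > 1 && e.getMessage.contains("AlreadyExistsException") =>
        // The other job won the race to create the table; it exists now,
        // so retrying should append rather than create.
        appendWithRetry(df, table, attempts - 1)
    }
  }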

I would consider this a bit of a bug, but I'd like to make sure that it
isn't merely a case of me doing something stupid elsewhere, or indeed
simply an inherent architectural limitation of working with the metastore,
before going to Jira with this.

Also, I'm aware that running the jobs strictly sequentially would work
around the issue, but that would either require reordering jobs before
sending them off to Spark or kill efficiency.
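
The only middle ground I can think of would be to serialize just the
writes per target table on the driver (a sketch, assuming all jobs run
inside the same driver JVM as described above; it trades some write
concurrency for safety):

  import java.util.concurrent.ConcurrentHashMap
  import org.apache.spark.sql.{DataFrame, SaveMode}

  object TableWriteLocks {
    private val locks = new ConcurrentHashMap[String, Object]()

    // Serialize saveAsTable calls that target the same table, so only one
    // job at a time can create (or append to) a given table.
    def appendSerialized(df: DataFrame, table: String): Unit = {
      locks.putIfAbsent(table, new Object)
      locks.get(table).synchronized {
        df.write.mode(SaveMode.Append).saveAsTable(table)
      }
    }
  }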

Thanks for any feedback,

Rick

Re: Concurrent DataFrame.saveAsTable into non-existent tables fails the second job despite Mode.APPEND

Posted by Subhash Sriram <su...@gmail.com>.
Would it be an option to just write the results of each job into separate tables and then run a UNION on all of them at the end into a final target table? Just thinking of an alternative!
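
For example (a rough sketch assuming Spark 2.x; all table names are
hypothetical), the final consolidation step could look something like:

  import org.apache.spark.sql.{SaveMode, SparkSession}

  def consolidate(spark: SparkSession,
                  stagingTables: Seq[String],
                  target: String): Unit = {
    val combined = stagingTables
      .map(spark.table)   // each job's private output table
      .reduce(_ union _)  // assumes all staging tables share a schema
    combined.write.mode(SaveMode.Overwrite).saveAsTable(target)
    stagingTables.foreach(t => spark.sql(s"DROP TABLE IF EXISTS $t"))
  }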

Thanks,
Subhash

Sent from my iPhone
