Posted to user@spark.apache.org by Yong Zhang <ja...@hotmail.com> on 2018/05/25 14:42:40 UTC

Why is Spark JDBC writing in sequential order?

Spark version 2.2.0


We are trying to write a DataFrame to a remote relational database (AWS Redshift). Following the Spark JDBC documentation, we repartition our DataFrame into 12 partitions and set the "numPartitions" JDBC option to 12 so that the write is done concurrently.


We run the following command:

dataframe.repartition(12).write.mode("overwrite").option("batchsize", 5000).option("numPartitions", 12).jdbc(url=jdbcurl, table="tableName", connectionProperties=connectionProps)
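
For reference, the same write spelled out in the Scala API would look roughly like this (a sketch only; jdbcUrl, connectionProps and the credentials are placeholders, not our real values):

import java.util.Properties

val connectionProps = new Properties()
connectionProps.setProperty("user", "dbUser")       // placeholder credentials
connectionProps.setProperty("password", "dbPassword")

dataframe
  .repartition(12)                 // 12 shuffle partitions -> 12 write tasks
  .write
  .mode("overwrite")
  .option("batchsize", "5000")     // rows per JDBC batch insert
  .option("numPartitions", "12")   // upper bound on concurrent JDBC connections
  .jdbc(jdbcUrl, "tableName", connectionProps)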


Here is the Spark UI:

<Screen Shot 2018-05-25 at 10.21.50 AM.png — tasks of the JDBC write stage>


We found that the 12 tasks are effectively running in sequential order. They all show "RUNNING" status from the start, but if we check their "Duration" and "Shuffle Read Size / Records" columns, it is clear that they write one after another.

For example, task 8 finished first, in about 2 hours, and wrote 34,732 records to the remote DB (I know the speed looks terrible, but that is not the question of this post). Task 0 only started writing after task 8 finished and took 4 hours in total, the first 2 hours spent waiting for task 8.

In the screenshot, only tasks 2 and 4 are still running, and task 4 is clearly waiting for task 2 to finish before it starts writing.


My question is: given the above Spark command, my understanding is that the 12 executors should open their JDBC connections to the remote DB concurrently, all 12 tasks should start writing concurrently, and the whole job should finish in roughly 2 hours overall.
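
To make that mental model concrete, I picture each of the 12 write tasks doing something like the following loop (a simplified sketch in the Scala API, not Spark's actual JDBC writer; jdbcUrl, connectionProps and the column list are placeholders):

dataframe.rdd.foreachPartition { rows =>
  // each task opens its own connection and writes only its own partition
  val conn = java.sql.DriverManager.getConnection(jdbcUrl, connectionProps)
  val stmt = conn.prepareStatement("INSERT INTO tableName VALUES (?, ?, ?)") // placeholder column list
  try {
    var count = 0
    rows.foreach { row =>
      for (i <- 0 until row.length) stmt.setObject(i + 1, row.get(i).asInstanceOf[AnyRef])
      stmt.addBatch()
      count += 1
      if (count % 5000 == 0) stmt.executeBatch() // mirrors the "batchsize" option
    }
    stmt.executeBatch() // flush the last partial batch
  } finally {
    conn.close()
  }
}

With 12 partitions I would expect 12 of these loops to run at the same time on different executors, independently of each other.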


Why are the 12 tasks all in the "RUNNING" state but apparently waiting for something, and able to write to the remote DB only sequentially? The 12 executors are separate JVMs on different physical nodes. Why is this happening? What stops Spark from pushing the data to the database truly concurrently?


Thanks


Yong


Re: Why is Spark JDBC writing in sequential order?

Posted by Yong Zhang <ja...@hotmail.com>.
I am not sure about Redshift, but I know the target table is not partitioned. Still, we should be able to insert into a non-partitioned remote table from 12 clients concurrently, right?


Even if Redshift did not allow concurrent writes, would the Spark driver detect that and coordinate all the tasks and executors into the serialized behavior I observed?
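
If it helps, I am also thinking of testing that assumption with a plain JDBC program outside of Spark, something along these lines (a rough sketch; the table, columns and credentials are placeholders):

import java.sql.DriverManager
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(12))

// 12 independent clients, each batch-inserting into the same non-partitioned table
val writers = (1 to 12).map { client =>
  Future {
    val conn = DriverManager.getConnection(jdbcUrl, "dbUser", "dbPassword")
    val stmt = conn.prepareStatement("INSERT INTO tableName (id, payload) VALUES (?, ?)")
    val start = System.currentTimeMillis()
    try {
      (1 to 5000).foreach { i =>
        stmt.setInt(1, client * 100000 + i)
        stmt.setString(2, s"row from client $client")
        stmt.addBatch()
      }
      stmt.executeBatch()
    } finally {
      conn.close()
    }
    println(s"client $client took ${System.currentTimeMillis() - start} ms")
  }
}

Await.result(Future.sequence(writers), 2.hours)
// If the per-client times stack up (each client roughly waits for the previous one),
// the serialization happens on the database side rather than in Spark.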


Yong

________________________________
From: Jörn Franke <jo...@gmail.com>
Sent: Friday, May 25, 2018 10:50 AM
To: Yong Zhang
Cc: user@spark.apache.org
Subject: Re: Why is Spark JDBC writing in sequential order?

Can your database receive the writes concurrently? I.e. do you make sure that each executor writes into a different partition on the database side?

On 25. May 2018, at 16:42, Yong Zhang <ja...@hotmail.com> wrote:

[quoted original message trimmed]


Re: Why is Spark JDBC writing in sequential order?

Posted by Jörn Franke <jo...@gmail.com>.
Can your database receive the writes concurrently? I.e. do you make sure that each executor writes into a different partition on the database side?
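
One way to check this on the database side might be to look at the lock view while the Spark job is writing, e.g. something along these lines (a sketch; I am assuming Redshift's stv_locks system view here, and jdbcUrl and the credentials are placeholders):

import java.sql.DriverManager

// poll the current table locks while the 12 Spark tasks are writing
val conn = DriverManager.getConnection(jdbcUrl, "dbUser", "dbPassword")
val stmt = conn.createStatement()
val rs = stmt.executeQuery("SELECT * FROM stv_locks")
val meta = rs.getMetaData
while (rs.next()) {
  println((1 to meta.getColumnCount)
    .map(i => s"${meta.getColumnName(i)}=${rs.getString(i)}")
    .mkString(", "))
}
conn.close()

If one session holds a lock on the target table while the other sessions wait for it, the writes are serialized by the database and not by Spark.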

> On 25. May 2018, at 16:42, Yong Zhang <ja...@hotmail.com> wrote:
>
> [quoted original message trimmed]