Posted to dev@spark.apache.org by Niranda Perera <ni...@gmail.com> on 2016/11/21 09:03:07 UTC

How is the order ensured in the jdbc relation provider when inserting data from multiple executors

Hi,

Say I have a table with 1 column and 1000 rows, and I want to save the result
in an RDBMS table using the JDBC relation provider. So I run the following
query:

"insert into table table2 select value, count(*) from table1 group by value
order by value"

While debugging, I found that the resulting DataFrame from "select value,
count(*) from table1 group by value order by value" has around 200+ partitions,
and say I have 4 executors attached to my driver. So I would have 200+ write
tasks assigned to 4 executors. I want to understand how these executors are
able to write the data to the underlying RDBMS table of table2 without messing
up the order.
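
For reference, this is roughly how the partition count can be checked (a quick
sketch, assuming the usual sqlContext from spark-shell):

val df = sqlContext.sql(
  "select value, count(*) from table1 group by value order by value")
// number of write tasks the insert will fan out into;
// ~200 here, which matches the default spark.sql.shuffle.partitions
println(df.rdd.partitions.length)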

I checked the JDBC insertable relation, and in JdbcUtils [1] it does the
following:

df.foreachPartition { iterator =>
  savePartition(getConnection, table, iterator, rddSchema, nullTypes, batchSize, dialect)
}

So my understanding is that all 4 of my executors will run the savePartition
function (or closure) in parallel, without knowing which one should write its
data before the others!

In the savePartition method, the comment says:
"Saves a partition of a DataFrame to the JDBC database. This is done in a
single database transaction in order to avoid repeatedly inserting data as
much as possible."
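
For context, here is a rough sketch of what I understand that per-partition
save to boil down to (simplified, not the actual Spark source; insertSql and
the two column bindings are made up for this value/count case):

import java.sql.Connection
import org.apache.spark.sql.Row

def savePartitionSketch(
    getConnection: () => Connection,
    insertSql: String,          // e.g. "INSERT INTO table2 VALUES (?, ?)" (hypothetical)
    rows: Iterator[Row],
    batchSize: Int): Unit = {
  val conn = getConnection()
  try {
    conn.setAutoCommit(false)   // the whole partition goes into one transaction
    val stmt = conn.prepareStatement(insertSql)
    try {
      var count = 0
      while (rows.hasNext) {
        val row = rows.next()
        stmt.setString(1, row.getString(0))  // "value" column, assuming it is a string
        stmt.setLong(2, row.getLong(1))      // count(*) column
        stmt.addBatch()
        count += 1
        if (count % batchSize == 0) stmt.executeBatch()
      }
      if (count % batchSize != 0) stmt.executeBatch()  // flush the last partial batch
    } finally {
      stmt.close()
    }
    conn.commit()               // note: nothing here coordinates ordering across partitions
  } finally {
    conn.close()
  }
}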

I want to understand how these parallel executors save their partitions without
harming the order of the results. Is it by locking the database resource from
each executor (i.e. ex0 would first obtain a lock on the table and write
partition 0, while ex1 ... ex3 would wait till the lock is released)?

In my experience, there is no harm done to the order of the results at the
end of the day!

Would like to hear from you guys! :-)

[1]
https://github.com/apache/spark/blob/v1.6.2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L277

-- 
Niranda Perera
@n1r44 <https://twitter.com/N1R44>
+94 71 554 8430
https://www.linkedin.com/in/niranda
https://pythagoreanscript.wordpress.com/

Re: How is the order ensured in the jdbc relation provider when inserting data from multiple executors

Posted by Sachith Withana <sw...@gmail.com>.
Hi all,

To explain the scenario a bit more.

We need to retain the order when writing to the RDBMS tables.
The way we found was to execute the DB write *job* for each partition, which
is really costly. One reason is that the partition count is really high (200+),
and it seems we cannot control it (since the count is inferred from the parent
RDD).

When we execute the insert job, the executors run the write tasks in parallel,
which jumbles up the order.
Is there any way we can execute the tasks sequentially, or any other way of
doing this?
We have noticed that you handle this from inside Spark itself, to retain the
order when writing to an RDBMS from Spark.
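
To make the question concrete, here is a rough sketch of the two workarounds we
can think of (Spark 1.6-style API; jdbcUrl and connectionProperties below are
our own placeholders):

import java.util.Properties

val jdbcUrl = "jdbc:postgresql://host/db"   // placeholder
val connectionProperties = new Properties() // user, password, etc.

// Workaround 1: fewer write tasks. Lower the shuffle partition count
// before the query runs, so the sort produces fewer partitions.
sqlContext.setConf("spark.sql.shuffle.partitions", "4")

val sorted = sqlContext.sql(
  "select value, count(*) as cnt from table1 group by value order by value")

// Workaround 2: exactly one write task. Collapse to a single partition,
// which keeps the sorted order but makes the write fully sequential.
sorted.coalesce(1)
  .write
  .mode("append")
  .jdbc(jdbcUrl, "table2", connectionProperties)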

Thanks,
Sachith




-- 
Thanks,
Sachith Withana

Re: How is the order ensured in the jdbc relation provider when inserting data from multiple executors

Posted by nirandap <ni...@gmail.com>.
Hi Maciej,

Thanks again for the reply. One small clarification about the answer to my #1
point.
I set local[4], and shouldn't this force Spark to read from 4 partitions in
parallel and write in parallel (by parallel I mean that the order in which the
data is read from a set of 4 partitions is non-deterministic)? That was the
reason why I was surprised to see that the final results are in the same order.




-- 
Niranda Perera
@n1r44 <https://twitter.com/N1R44>
+94 71 554 8430
https://www.linkedin.com/in/niranda
https://pythagoreanscript.wordpress.com/





Re: How is the order ensured in the jdbc relation provider when inserting data from multiple executors

Posted by Maciej Szymkiewicz <ms...@gmail.com>.
On 11/22/2016 12:11 PM, nirandap wrote:

> Hi Maciej, 
>
> Thank you for your reply. 
>
> I have 2 queries.
> 1. I can understand your explanation. But in my experience, when I
> check the final RDBMS table, I see that the results follow the
> expected order, without an issue. Is this just a coincidence?
Not exactly a coincidence. This is typically a result of the physical location
on disk. If writes and reads are sequential (and this is usually the case),
you'll see things in the expected order, but you have to remember that the
location on disk is not stable. For example, if you perform some updates and
deletes and then a VACUUM FULL (PostgreSQL), the physical location on disk will
change, and with it the order you see.

There are of course more advanced mechanisms out there. For example, modern
columnar RDBMSs like HANA use techniques such as dimension sorting and
differential stores, so even the initial order may differ. There are probably
other solutions that choose different strategies (maybe some time-series
oriented projects?) that I am not aware of.

>
> 2. I was further looking into this. So, say I run this query
> "select value, count(*) from table1 group by value order by value"
>
> and I call df.collect() in the resultant dataframe. From my
> experience, I see that the given values follow the expected order. May
> I know how spark manages to retain the order of the results in a
> collect operation?
Once you execute an ordered operation, each partition is sorted and the order
of the partitions defines the global ordering. All that collect does is
preserve this order by creating an array of results for each partition and
flattening it.
>
> Best 
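
Roughly speaking, it comes down to this (a conceptual sketch, not the actual
collect implementation):

import org.apache.spark.sql.Row

val df = sqlContext.sql(
  "select value, count(*) as cnt from table1 group by value order by value")

// collect() effectively gathers one Array[Row] per partition, in partition
// order, and flattens them. Because the global sort puts lower keys into
// lower-numbered partitions and sorts within each partition, the flattened
// array is globally ordered.
val perPartition: Array[Array[Row]] =
  df.rdd.mapPartitions(it => Iterator(it.toArray)).collect()

val flattened: Array[Row] = perPartition.flatten
// flattened has the same global order as df.collect()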

-- 
Maciej Szymkiewicz


Re: How is the order ensured in the jdbc relation provider when inserting data from multiple executors

Posted by nirandap <ni...@gmail.com>.
Hi Maciej,

Thank you for your reply.

I have 2 queries.
1. I can understand your explanation. But in my experience, when I check
the final RDBMS table, I see that the results follow the expected order,
without an issue. Is this just a coincidence?

2. I was further looking into this. So, say I run this query
"select value, count(*) from table1 group by value order by value"

and I call df.collect() on the resulting DataFrame. From my experience, I see
that the given values follow the expected order. May I know how Spark manages
to retain the order of the results in a collect operation?

Best





-- 
Niranda Perera
@n1r44 <https://twitter.com/N1R44>
+94 71 554 8430
https://www.linkedin.com/in/niranda
https://pythagoreanscript.wordpress.com/





Re: How is the order ensured in the jdbc relation provider when inserting data from multiple executors

Posted by Maciej Szymkiewicz <ms...@gmail.com>.
In commonly used RDBM systems, relations have no fixed order, and the physical
location of the records can change during routine maintenance operations.
Unless you explicitly order the data during retrieval, the order you see is
incidental and not guaranteed.

Conclusion: order of inserts just doesn't matter.
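
Put differently: if you need an order when you read the data back, ask for it
at retrieval time (an illustrative sketch; jdbcUrl and connectionProperties are
placeholders):

import java.util.Properties

val jdbcUrl = "jdbc:postgresql://host/db"   // placeholder
val connectionProperties = new Properties()

// Without an ORDER BY, the row order is whatever the database happens to
// return; it says nothing about the order in which the rows were inserted.
val unordered = sqlContext.read.jdbc(jdbcUrl, "table2", connectionProperties)

// If the order matters, sort explicitly at read time.
val ordered = unordered.orderBy("value")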


-- 
Best regards,
Maciej Szymkiewicz