Posted to user@spark.apache.org by Pedro Rodriguez <sk...@gmail.com> on 2016/07/25 23:18:17 UTC

Spark SQL overwrite/append for partitioned tables

What would be the best way to accomplish the following behavior:

1. There is a table which is partitioned by date
2. A Spark job runs on a particular date, and we would like it to wipe out all
data for that date. This makes the job idempotent and lets us rerun a
failed job without fear of duplicated data
3. Preserve data for all other dates

I am guessing that overwrite would not work here, or, if it does, it's not
guaranteed to stay that way, but I am not sure. If that's the case, is there a
good/robust way to get this behavior?

-- 
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni

ski.rodriguez@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn:
https://www.linkedin.com/in/pedrorodriguezscience

Re: Spark SQL overwrite/append for partitioned tables

Posted by Yash Sharma <ya...@gmail.com>.
Correction:
dataDF.write.partitionBy("year", "month",
"date").mode(SaveMode.Append).text("s3://data/test2/events/")
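
For illustration, a minimal sketch of the corrected flow: delete the target
date's partition prefix first, then append, so only that date is rewritten.
This sketch uses the Hadoop FileSystem API rather than the raw S3 API (either
works for the delete); the path layout, column names, and the spark session in
scope are assumptions carried over from the snippet above.

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SaveMode

// Partition we want to rewrite idempotently (path layout is illustrative).
val partitionPath = "s3://data/test2/events/year=2016/month=07/date=25"

// Delete just that partition's prefix before re-writing it.
val fs = FileSystem.get(new URI(partitionPath), spark.sparkContext.hadoopConfiguration)
fs.delete(new Path(partitionPath), true) // recursive; returns false if nothing was there

// Append the fresh data; all other partitions are left untouched.
dataDF.write
  .partitionBy("year", "month", "date")
  .mode(SaveMode.Append)
  .text("s3://data/test2/events/")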

On Tue, Jul 26, 2016 at 10:59 AM, Yash Sharma <ya...@gmail.com> wrote:

> Based on the behavior of Spark [1], Overwrite mode will delete all your
> data when you try to overwrite a particular partition.
>
> What I did:
> - Used the S3 API to delete all partitions
> - Used a Spark DataFrame write in Append mode [2]
>
>
> 1.
> http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-deletes-all-existing-partitions-in-SaveMode-Overwrite-Expected-behavior-td18219.html
>
> 2. dataDF.write.partitionBy("year", "month",
> "date").mode(SaveMode.Overwrite).text("s3://data/test2/events/")
>
> On Tue, Jul 26, 2016 at 9:37 AM, Pedro Rodriguez <sk...@gmail.com>
> wrote:
>
>> I probably should have been more specific about the code we are using,
>> which is something like
>>
>> val df = ....
>> df.write.mode("append or overwrite
>> here").partitionBy("date").saveAsTable("my_table")
>>
>> Unless there is something like what I described in the native API, I will
>> probably take the approach of making an S3 API call to wipe out that
>> partition before the job starts, but it would be nice not to have to
>> incorporate another step in the job.
>>
>> Pedro
>>
>> On Mon, Jul 25, 2016 at 5:23 PM, RK Aduri <rk...@collectivei.com>
>> wrote:
>>
>>> You can write the data that you would like to overwrite to a temporary
>>> location, then swap it with the existing partition whose data you want to
>>> wipe away. The swap can be done by a simple rename of the partition
>>> directory, followed by a table repair to pick up the new partition.
>>>
>>> I am not sure if that addresses your scenario.
>>>
>>> On Jul 25, 2016, at 4:18 PM, Pedro Rodriguez <sk...@gmail.com>
>>> wrote:
>>>
>>> What would be the best way to accomplish the following behavior:
>>>
>>> 1. There is a table which is partitioned by date
>>> 2. A Spark job runs on a particular date, and we would like it to wipe out
>>> all data for that date. This makes the job idempotent and lets us rerun a
>>> failed job without fear of duplicated data
>>> 3. Preserve data for all other dates
>>>
>>> I am guessing that overwrite would not work here, or, if it does, it's not
>>> guaranteed to stay that way, but I am not sure. If that's the case, is
>>> there a good/robust way to get this behavior?
>>>
>>> --
>>> Pedro Rodriguez
>>> PhD Student in Distributed Machine Learning | CU Boulder
>>> UC Berkeley AMPLab Alumni
>>>
>>> ski.rodriguez@gmail.com | pedrorodriguez.io | 909-353-4423
>>> Github: github.com/EntilZha | LinkedIn:
>>> https://www.linkedin.com/in/pedrorodriguezscience
>>>
>>
>>
>>
>>
>> --
>> Pedro Rodriguez
>> PhD Student in Distributed Machine Learning | CU Boulder
>> UC Berkeley AMPLab Alumni
>>
>> ski.rodriguez@gmail.com | pedrorodriguez.io | 909-353-4423
>> Github: github.com/EntilZha | LinkedIn:
>> https://www.linkedin.com/in/pedrorodriguezscience
>>
>>
>

Re: Spark SQL overwrite/append for partitioned tables

Posted by Yash Sharma <ya...@gmail.com>.
Based on the behavior of Spark [1], Overwrite mode will delete all your
data when you try to overwrite a particular partition.

What I did:
- Used the S3 API to delete all partitions
- Used a Spark DataFrame write in Append mode [2]


1.
http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-deletes-all-existing-partitions-in-SaveMode-Overwrite-Expected-behavior-td18219.html

2. dataDF.write.partitionBy("year", "month",
"date").mode(SaveMode.Overwrite).text("s3://data/test2/events/")

On Tue, Jul 26, 2016 at 9:37 AM, Pedro Rodriguez <sk...@gmail.com>
wrote:

> I probably should have been more specific about the code we are using,
> which is something like
>
> val df = ....
> df.write.mode("append or overwrite
> here").partitionBy("date").saveAsTable("my_table")
>
> Unless there is something like what I described in the native API, I will
> probably take the approach of making an S3 API call to wipe out that
> partition before the job starts, but it would be nice not to have to
> incorporate another step in the job.
>
> Pedro
>
> On Mon, Jul 25, 2016 at 5:23 PM, RK Aduri <rk...@collectivei.com> wrote:
>
>> You can write the data that you would like to overwrite to a temporary
>> location, then swap it with the existing partition whose data you want to
>> wipe away. The swap can be done by a simple rename of the partition
>> directory, followed by a table repair to pick up the new partition.
>>
>> I am not sure if that addresses your scenario.
>>
>> On Jul 25, 2016, at 4:18 PM, Pedro Rodriguez <sk...@gmail.com>
>> wrote:
>>
>> What would be the best way to accomplish the following behavior:
>>
>> 1. There is a table which is partitioned by date
>> 2. A Spark job runs on a particular date, and we would like it to wipe out
>> all data for that date. This makes the job idempotent and lets us rerun a
>> failed job without fear of duplicated data
>> 3. Preserve data for all other dates
>>
>> I am guessing that overwrite would not work here, or, if it does, it's not
>> guaranteed to stay that way, but I am not sure. If that's the case, is there
>> a good/robust way to get this behavior?
>>
>> --
>> Pedro Rodriguez
>> PhD Student in Distributed Machine Learning | CU Boulder
>> UC Berkeley AMPLab Alumni
>>
>> ski.rodriguez@gmail.com | pedrorodriguez.io | 909-353-4423
>> Github: github.com/EntilZha | LinkedIn:
>> https://www.linkedin.com/in/pedrorodriguezscience
>>
>
>
>
>
> --
> Pedro Rodriguez
> PhD Student in Distributed Machine Learning | CU Boulder
> UC Berkeley AMPLab Alumni
>
> ski.rodriguez@gmail.com | pedrorodriguez.io | 909-353-4423
> Github: github.com/EntilZha | LinkedIn:
> https://www.linkedin.com/in/pedrorodriguezscience
>
>

Re: Spark SQL overwrite/append for partitioned tables

Posted by Pedro Rodriguez <sk...@gmail.com>.
I probably should have been more specific about the code we are using,
which is something like

val df = ....
df.write.mode("append or overwrite
here").partitionBy("date").saveAsTable("my_table")

Unless there is something like what I described in the native API, I will
probably take the approach of making an S3 API call to wipe out that
partition before the job starts, but it would be nice not to have to
incorporate another step in the job.
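
For illustration, a minimal sketch of that pre-delete step using the AWS SDK
for Java (v1). The bucket name, partition prefix, and table name are
placeholders, and the final write mirrors the snippet above; with the old
objects gone, Append is effectively an overwrite of that one date.

import scala.collection.JavaConverters._
import com.amazonaws.services.s3.AmazonS3Client

val bucket = "my-bucket"                                    // hypothetical
val partitionPrefix = "warehouse/my_table/date=2016-07-25/" // hypothetical

// Delete every object under the partition's prefix, page by page.
val s3 = new AmazonS3Client()
var listing = s3.listObjects(bucket, partitionPrefix)
var done = false
while (!done) {
  listing.getObjectSummaries.asScala.foreach(o => s3.deleteObject(bucket, o.getKey))
  if (listing.isTruncated) listing = s3.listNextBatchOfObjects(listing)
  else done = true
}

// Now append only the new day's data; other partitions are untouched.
df.write.mode("append").partitionBy("date").saveAsTable("my_table")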

Pedro

On Mon, Jul 25, 2016 at 5:23 PM, RK Aduri <rk...@collectivei.com> wrote:

> You can write the data that you would like to overwrite to a temporary
> location, then swap it with the existing partition whose data you want to
> wipe away. The swap can be done by a simple rename of the partition
> directory, followed by a table repair to pick up the new partition.
>
> I am not sure if that addresses your scenario.
>
> On Jul 25, 2016, at 4:18 PM, Pedro Rodriguez <sk...@gmail.com>
> wrote:
>
> What would be the best way to accomplish the following behavior:
>
> 1. There is a table which is partitioned by date
> 2. A Spark job runs on a particular date, and we would like it to wipe out
> all data for that date. This makes the job idempotent and lets us rerun a
> failed job without fear of duplicated data
> 3. Preserve data for all other dates
>
> I am guessing that overwrite would not work here, or, if it does, it's not
> guaranteed to stay that way, but I am not sure. If that's the case, is there
> a good/robust way to get this behavior?
>
> --
> Pedro Rodriguez
> PhD Student in Distributed Machine Learning | CU Boulder
> UC Berkeley AMPLab Alumni
>
> ski.rodriguez@gmail.com | pedrorodriguez.io | 909-353-4423
> Github: github.com/EntilZha | LinkedIn:
> https://www.linkedin.com/in/pedrorodriguezscience
>




-- 
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni

ski.rodriguez@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn:
https://www.linkedin.com/in/pedrorodriguezscience

Re: Spark SQL overwrite/append for partitioned tables

Posted by RK Aduri <rk...@collectivei.com>.
You can write the data that you would like to overwrite to a temporary location, then swap it with the existing partition whose data you want to wipe away. The swap can be done by a simple rename of the partition directory, followed by a table repair to pick up the new partition.

I am not sure if that addresses your scenario.
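
For illustration, a hedged sketch of that swap, assuming a Hive-style table
laid out as basePath/date=..., a df holding only the new day's rows, and a
spark session in scope. All paths and the table name are placeholders; on
older Spark versions the repair step may need to run from Hive, or be replaced
by ALTER TABLE ... ADD PARTITION.

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

val basePath = "s3://data/warehouse/my_table"   // hypothetical
val datePart = "date=2016-07-25"                // hypothetical
val tmpPath  = new Path(s"$basePath/_staging/$datePart")
val livePath = new Path(s"$basePath/$datePart")

// 1. Stage the fresh data. The partition value is carried by the directory
//    name, so the partition column itself is dropped from the files.
df.drop("date").write.mode("overwrite").parquet(tmpPath.toString)

// 2. Swap: remove the old partition directory, rename the staged one into place.
val fs = FileSystem.get(new URI(basePath), spark.sparkContext.hadoopConfiguration)
fs.delete(livePath, true)
fs.rename(tmpPath, livePath)

// 3. Repair the table so the metastore re-discovers the partition.
spark.sql("MSCK REPAIR TABLE my_table")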

> On Jul 25, 2016, at 4:18 PM, Pedro Rodriguez <sk...@gmail.com> wrote:
> 
> What would be the best way to accomplish the following behavior:
> 
> 1. There is a table which is partitioned by date
> 2. A Spark job runs on a particular date, and we would like it to wipe out all data for that date. This makes the job idempotent and lets us rerun a failed job without fear of duplicated data
> 3. Preserve data for all other dates
> 
> I am guessing that overwrite would not work here, or, if it does, it's not guaranteed to stay that way, but I am not sure. If that's the case, is there a good/robust way to get this behavior?
> 
> -- 
> Pedro Rodriguez
> PhD Student in Distributed Machine Learning | CU Boulder
> UC Berkeley AMPLab Alumni
> 
> ski.rodriguez@gmail.com | pedrorodriguez.io | 909-353-4423
> Github: github.com/EntilZha | LinkedIn: https://www.linkedin.com/in/pedrorodriguezscience
> 

