Posted to dev@spark.apache.org by Shubham Chaurasia <sh...@gmail.com> on 2019/05/07 13:36:02 UTC

Static partitioning in partitionBy()

Hi All,

Is there a way I can provide static partitions in partitionBy()?

Like:
df.write.mode("overwrite").format("MyDataSource").partitionBy("c=c1").save

The above code gives the following error, because Spark tries to find a
column named `c=c1` in df:

org.apache.spark.sql.AnalysisException: Partition column `c=c1` not found
in schema struct<a:string,b:string,c:string>;

Thanks,
Shubham
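
For reference, partitionBy expects bare column names and derives the `c=<value>` directories from the data itself. A minimal self-contained sketch of the usual dynamic-partition write (the paths, object name, and sample data are illustrative, and this assumes Spark is on the classpath):

```scala
import org.apache.spark.sql.SparkSession

object PartitionByDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("partitionBy-demo")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("a1", "b1", "c1"), ("a2", "b2", "c2")).toDF("a", "b", "c")

    // partitionBy takes column names, not "col=value" specs; Spark writes
    // one c=<value> directory per distinct value of c.
    df.write
      .mode("overwrite")
      .partitionBy("c")
      .parquet("/tmp/partition_demo") // -> /tmp/partition_demo/c=c1, .../c=c2

    spark.stop()
  }
}
```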

Re: Static partitioning in partitionBy()

Posted by Gourav Sengupta <go...@gmail.com>.
Hi Burak,
Hurray!!!! So you finally made Delta open source :)
I always wanted to ask TD: is there any chance we could get the
streaming graphs back in the Spark UI? It would just be wonderful.

Hi Shubham,
there is always an easier way and a super fancy way to solve problems;
filtering data before persisting is the simple way. Similarly, a simple way
to handle data skew is to use the monotonically increasing id function in
Spark with the modulus operator. As for the fancy way, I am sure someone in
the world is working on it for mere mortals like me :)
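
One way to sketch the skew-handling idea above (a salt column built from monotonically_increasing_id modulo N, so a hot partition value is spread over N sub-directories; `df`, the bucket count 8, and the output path are all illustrative assumptions):

```scala
import org.apache.spark.sql.functions.{lit, monotonically_increasing_id, pmod}

// Add a salt in [0, 8) derived from the row id; partitioning by (c, salt)
// splits each skewed c=<value> partition into up to 8 smaller ones.
val salted = df.withColumn("salt", pmod(monotonically_increasing_id(), lit(8L)))

salted.write
  .mode("overwrite")
  .partitionBy("c", "salt")
  .parquet("/tmp/salted_table") // illustrative path
```

Readers of the salted layout then treat `salt` as an extra partition column and drop or ignore it after loading.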


Regards,
Gourav Sengupta


Re: Static partitioning in partitionBy()

Posted by Shubham Chaurasia <sh...@gmail.com>.
Thanks

On Wed, May 8, 2019 at 10:36 AM Felix Cheung <fe...@hotmail.com>
wrote:

> You could
>
> df.filter(col("c") === "c1").write.partitionBy("c").save
>
> It could get some data skew problem but might work for you

Re: Static partitioning in partitionBy()

Posted by Felix Cheung <fe...@hotmail.com>.
You could

df.filter(col("c") === "c1").write.partitionBy("c").save

It could run into data skew problems, but it might work for you



________________________________
From: Burak Yavuz <br...@gmail.com>
Sent: Tuesday, May 7, 2019 9:35:10 AM
To: Shubham Chaurasia
Cc: dev; user@spark.apache.org
Subject: Re: Static partitioning in partitionBy()

It depends on the data source. Delta Lake (https://delta.io) allows you to do it with the .option("replaceWhere", "c = c1"). With other file formats, you can write directly into the partition directory (tablePath/c=c1), but you lose atomicity.


Re: Static partitioning in partitionBy()

Posted by Burak Yavuz <br...@gmail.com>.
It depends on the data source. Delta Lake (https://delta.io) allows you to
do it with the .option("replaceWhere", "c = c1"). With other file formats,
you can write directly into the partition directory (tablePath/c=c1), but
you lose atomicity.
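
A concrete sketch of the Delta Lake variant described above, assuming an existing SparkSession, a DataFrame `df` with a string column `c`, and an illustrative table path (note that a string value needs quoting inside the replaceWhere predicate):

```scala
import org.apache.spark.sql.functions.col

// Overwrite only the c = 'c1' partition of a Delta table; rows in other
// partitions are left untouched, and the replacement is atomic.
df.filter(col("c") === "c1")
  .write
  .format("delta")
  .mode("overwrite")
  .option("replaceWhere", "c = 'c1'") // predicate limits what gets replaced
  .save("/tmp/delta_table")           // illustrative path
```

For plain file formats, the same filtered DataFrame can instead be written straight into the partition directory, e.g. `.save("/tmp/table/c=c1")`, but, as noted above, without atomicity guarantees.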

