Posted to user@spark.apache.org by Gourav Sengupta <go...@gmail.com> on 2021/08/05 08:17:45 UTC

Reading SPARK 3.1.x generated parquet in SPARK 2.4.x

Hi,

We are trying to migrate some of our data lake pipelines to run in SPARK
3.x, whereas the dependent pipelines using those tables will still be
running in SPARK 2.4.x for some time to come.

Does anyone know of any issues that can happen:
1. when reading Parquet files written by SPARK 3.1.x in SPARK 2.4.x, or
2. when some partitions in the data lake contain Parquet files written by
SPARK 2.4.x and others contain files written by SPARK 3.1.x?

Please note that there are no changes in schema, but later on we might end
up adding or removing some columns.
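
For illustration only, a minimal PySpark sketch of scenario 2 above, with
schema merging enabled for the later column changes (the table path and
partition layout are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical layout:
#   /lake/events/dt=2021-08-01/  written by SPARK 2.4.x
#   /lake/events/dt=2021-08-02/  written by SPARK 3.1.x
# Reading on the SPARK 2.4.x side; mergeSchema reconciles per-file Parquet
# schemas in case columns are added or removed by the newer writer later on.
df = (
    spark.read
         .option("mergeSchema", "true")
         .parquet("/lake/events")
)
df.printSchema()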

I will be really grateful for your kind help on this.

Regards,
Gourav Sengupta

Re: [EXTERNAL] [Marketing Mail] Reading SPARK 3.1.x generated parquet in SPARK 2.4.x

Posted by Gourav Sengupta <go...@gmail.com>.
Hi Saurabh,

a very big note of thanks from Gourav :)

Regards,
Gourav Sengupta

On Thu, Aug 12, 2021 at 4:16 PM Saurabh Gulati
<sa...@fedex.com.invalid> wrote:

> We had issues with this migration, mainly because of the changes to Spark's
> date calendars (the switch to the proleptic Gregorian calendar in 3.0). See
> <https://www.waitingforcode.com/apache-spark-sql/whats-new-apache-spark-3-proleptic-calendar-date-time-management/read>
> We got this working by setting the params below:
>
> ("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY"),
> ("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED"),
> ("spark.sql.legacy.parquet.int96RebaseModeInRead", "LEGACY"),
> ("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
>
>
>
> But otherwise, it's a change for the better, and performance seems improved.
> Also, there were bugs in 3.0.1 that have been addressed in 3.1.1.
> ------------------------------
> From: Gourav Sengupta <go...@gmail.com>
> Sent: 05 August 2021 10:17
> To: user @spark <us...@spark.apache.org>
> Subject: [EXTERNAL] [Marketing Mail] Reading SPARK 3.1.x generated
> parquet in SPARK 2.4.x
>

Re: [EXTERNAL] [Marketing Mail] Reading SPARK 3.1.x generated parquet in SPARK 2.4.x

Posted by Saurabh Gulati <sa...@fedex.com.INVALID>.
We had issues with this migration, mainly because of the changes to Spark's date calendars (the switch to the proleptic Gregorian calendar in 3.0). See <https://www.waitingforcode.com/apache-spark-sql/whats-new-apache-spark-3-proleptic-calendar-date-time-management/read>
We got this working by setting the params below:

("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY"),
("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED"),
("spark.sql.legacy.parquet.int96RebaseModeInRead", "LEGACY"),
("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")


But otherwise, it's a change for the better, and performance seems improved.
Also, there were bugs in 3.0.1 that have been addressed in 3.1.1.
________________________________
From: Gourav Sengupta <go...@gmail.com>
Sent: 05 August 2021 10:17
To: user @spark <us...@spark.apache.org>
Subject: [EXTERNAL] [Marketing Mail] Reading SPARK 3.1.x generated parquet in SPARK 2.4.x

