Posted to dev@iceberg.apache.org by Gautam <ga...@gmail.com> on 2019/05/15 21:59:56 UTC

Vanilla Spark Readers on Iceberg written data..

Hello there,
                    I am currently testing vanilla Spark readers'
ability to read Iceberg-generated data. This is both from an
Iceberg/Parquet reader interoperability standpoint and from a Spark
version backward-compatibility standpoint (e.g. Spark distributions
running v2.3.x, which don't support the Iceberg DataSource, vs. those
running 2.4.x).

To be clear, I am talking about doing the following on data written by
Iceberg:

spark.read.format("parquet").load($icebergBasePath + "/data")

Can I safely assume this will continue to work? If not, what could be
the reasons and the associated risks?

This would be good to know because these questions often come up in
migration-path discussions and when evaluating the cost of generating
and keeping two copies of the same data.

thanks,
- Gautam.

Re: Vanilla Spark Readers on Iceberg written data..

Posted by Gautam <ga...@gmail.com>.
RD,
          I am trying to figure out whether regressions are expected
between the reader and the data. Bypassing metadata is easy for us
because the data is in a separate directory; the ETL pipeline can point
the reader config to the correct location.
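A minimal sketch of that kind of config switch (the helper name and paths are hypothetical, not our actual pipeline code): the reader picks either the Iceberg DataSource on the table root or raw Parquet over the `/data` subdirectory, depending on the Spark version in use:

```python
# Hypothetical helper: choose reader format and path per Spark version.
def reader_args(spark_version: str, table_root: str):
    major, minor = (int(x) for x in spark_version.split(".")[:2])
    if (major, minor) >= (2, 4):
        # Spark 2.4+ can use the Iceberg DataSource on the table root.
        return ("iceberg", table_root)
    # Older Sparks fall back to raw Parquet over the data directory.
    return ("parquet", table_root + "/data")

print(reader_args("2.3.2", "/warehouse/db/tbl"))
print(reader_args("2.4.0", "/warehouse/db/tbl"))
```

The returned pair would feed straight into `spark.read.format(fmt).load(path)` on either Spark version.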


Re: Vanilla Spark Readers on Iceberg written data..

Posted by RD <rd...@gmail.com>.
Is backporting the relevant DataSource patches to Spark 2.3 a
non-starter? If that is doable, I believe it would be much simpler than
bypassing Iceberg metadata to read the files directly.

-R


Re: Vanilla Spark Readers on Iceberg written data..

Posted by Gautam <ga...@gmail.com>.
Just wanted to add: from what I have tested so far, this works fine with
vanilla Spark reading Iceberg data.
