Posted to dev@iceberg.apache.org by Javier Sanchez Beltran <ja...@expediagroup.com.INVALID> on 2021/03/05 13:15:27 UTC

Question about Snappy compression format.

Hello Iceberg team!

I have been researching Apache Iceberg to see how it would work in our environment. We are still trying things out. We would like to use the Parquet format with SNAPPY compression.

I already tried changing these two properties to SNAPPY, but it didn’t work (https://iceberg.apache.org/configuration/):

write.avro.compression-codec: gzip -> SNAPPY
write.parquet.compression-codec: gzip -> SNAPPY

Like this:


dataset
          .writeStream()
          .format("iceberg")
          .outputMode("append")
          .option("write.parquet.compression-codec", "SNAPPY")
          .option("write.avro.compression-codec", "SNAPPY")
          …start()




Did I do something wrong? Or do we need to handle the implementation of this SNAPPY compression ourselves?

Thank you in advance,
Javier.


Re: Question about Snappy compression format.

Posted by Russell Spitzer <ru...@gmail.com>.
I think they all have different names, and that's what I would be
whitelisting, so any table options or the like would be rejected as
invalid options.

On Fri, Mar 5, 2021 at 10:54 AM Ryan Blue <rb...@netflix.com> wrote:

> Do we support any table options passed through here? I thought we had
> separate options defined that use shorter names (like target-size).

Re: Question about Snappy compression format.

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Do we support any table options passed through here? I thought we had
separate options defined that use shorter names (like target-size).
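
For example, those dedicated write options look something like this (a
sketch; the exact option names and their availability depend on the
Iceberg version, and the table name is hypothetical):

// A per-write override using Iceberg's Spark write option names rather
// than table property names. "db.events" is a hypothetical table.
dataset
          .write()
          .format("iceberg")
          .option("write-format", "parquet")              // write option, not a table property
          .option("target-file-size-bytes", "536870912")  // ~512 MB target data files
          .mode("append")
          .save("db.events");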

On Fri, Mar 5, 2021 at 8:50 AM Russell Spitzer <ru...@gmail.com>
wrote:

> I think if we are going to have our write behavior work like that, we
> should probably switch to whitelisting valid properties for Spark writes,
> so we can warn folks that some options won't actually do anything. I think
> the current behavior is a bit of a surprise; I also don't like silent
> options :)

-- 
Ryan Blue
Software Engineer
Netflix

Re: Question about Snappy compression format.

Posted by Russell Spitzer <ru...@gmail.com>.
I think if we are going to have our write behavior work like that, we should probably switch to whitelisting valid properties for Spark writes, so we can warn folks that some options won't actually do anything. I think the current behavior is a bit of a surprise; I also don't like silent options :)
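
A minimal sketch of what that whitelisting could look like (hypothetical
code, not Iceberg's actual implementation; the option names in the
whitelist are illustrative):

import java.util.Map;
import java.util.Set;

// Hypothetical validator: reject unknown write options instead of
// silently ignoring them. Not Iceberg's actual implementation.
class WriteOptionValidator {
    // Illustrative whitelist of valid Spark write option names.
    private static final Set<String> VALID_OPTIONS =
        Set.of("write-format", "target-file-size-bytes");

    static void validate(Map<String, String> options) {
        for (String key : options.keySet()) {
            if (!VALID_OPTIONS.contains(key)) {
                throw new IllegalArgumentException(
                    "Unknown write option '" + key + "'; table properties like "
                        + "'write.parquet.compression-codec' must be set on the table");
            }
        }
    }
}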

> On Mar 5, 2021, at 10:47 AM, Ryan Blue <rb...@netflix.com.INVALID> wrote:
>
> Russell is right. The property you're trying to set is a table property and needs to be set on the table.

Re: Question about Snappy compression format.

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Russell is right. The property you're trying to set is a table property and
needs to be set on the table.

We don't currently support overriding arbitrary table properties in write
options, mainly because we want to encourage people to set their
configuration on the table instead of in jobs. That's a best practice that
I highly recommend so you don't need to configure every job that writes to
the table, and so you can make changes and have them automatically take
effect without recompiling your write job.
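
With Iceberg's Java API, setting the property on the table could look like
this (a sketch; the catalog is assumed to be already configured, and the
table name is hypothetical):

import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.TableIdentifier;

// Sketch: set the compression codecs as table properties so every job
// writing to the table picks them up. Catalog setup is assumed, and
// "db.events" is a hypothetical table.
class SetCompression {
    static void useSnappy(Catalog catalog) {
        Table table = catalog.loadTable(TableIdentifier.of("db", "events"));
        table.updateProperties()
            .set("write.parquet.compression-codec", "snappy")
            .set("write.avro.compression-codec", "snappy")
            .commit();  // only files written after the commit use the new codec
    }
}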

On Fri, Mar 5, 2021 at 8:44 AM Russell Spitzer <ru...@gmail.com>
wrote:

> I believe those are currently only respected as table properties and not
> as "spark write" properties, although there is a case to be made that we
> should accept them there as well. You can alter your table so that it
> contains those properties, and new files will be created with the
> compression you would like.

-- 
Ryan Blue
Software Engineer
Netflix

Re: Question about Snappy compression format.

Posted by Russell Spitzer <ru...@gmail.com>.
I believe those are currently only respected as table properties and not as "spark write" properties, although there is a case to be made that we should accept them there as well. You can alter your table so that it contains those properties, and new files will be created with the compression you would like.
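
For example, via Spark SQL (a sketch; this assumes a Spark 3 session with
an Iceberg catalog configured, and the table name is hypothetical):

// Sketch: set the codecs as table properties. The SparkSession "spark"
// and the table name "db.events" are assumptions.
spark.sql(
    "ALTER TABLE db.events SET TBLPROPERTIES ("
        + "'write.parquet.compression-codec' = 'snappy',"
        + "'write.avro.compression-codec' = 'snappy')");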
