Posted to dev@parquet.apache.org by Xinli shang <sh...@uber.com.INVALID> on 2020/05/21 01:34:34 UTC

ZSTD-JNI

Hi all,

I see parquet-mr has been using ZSTD-JNI <https://github.com/luben/zstd-jni> for
the parquet-cli
<https://github.com/apache/parquet-mr/blob/master/parquet-cli/pom.xml#L48>
project. Using this JNI binding to test ZSTD is cleaner than using the Hadoop
implementation, especially when testing on localhost. I am wondering whether
we can promote it to the parquet-hadoop project, as ZSTD is becoming more and
more popular. I have a prototype working, but I would like to ask whether
anybody knows of any issues (performance, reliability, etc.) with
ZSTD-JNI <https://github.com/luben/zstd-jni>. Any feedback on using this JNI
binding is welcome.

BTW, I am also trying out the AirCompressor
<https://github.com/airlift/aircompressor> approach, but it seems the ZSTD
compression level is not adjustable.

-- 
Xinli Shang

Re: ZSTD-JNI

Posted by Xinli shang <sh...@uber.com.INVALID>.
Thank you so much Luben! Here
<https://github.com/apache/parquet-mr/pull/793> is the PR. Please have a
look!

On Wed, May 20, 2020 at 6:51 PM Любен <ka...@gmail.com> wrote:

> Hi,
>
> I don't know of any performance or correctness problems with Zstd-JNI. It
> tracks the upstream (the native part) very closely and tries to expose most
> of its functionality. Regarding the streaming interfaces, assuming that you
> are going to use them, there are currently 2 approaches:
>
> - ZstdInputStream/ZstdOutputStream filters that decompress/compress
> streams, similar to the Gzip implementation from the standard library.
> - variants that work with direct buffers. If it fits with how your code is
> structured, it may be slightly faster.
>
> If you have any specific questions, please let me know. Also you can send
> me your PR when it's ready so I may have suggestions.
>
> BTW, it's strange Hadoop decided to reimplement it their own way. The rest
> of the ecosystem is using Zstd-JNI, e.g. Spark, Flink, Cassandra, etc.
>
> Regards,
> luben
>

-- 
Xinli Shang

Re: ZSTD-JNI

Posted by Любен <ka...@gmail.com>.
Hi,

I don't know of any performance or correctness problems with Zstd-JNI. It
tracks the upstream (the native part) very closely and tries to expose most
of its functionality. Regarding the streaming interfaces, assuming that you
are going to use them, there are currently 2 approaches:

- ZstdInputStream/ZstdOutputStream filters that decompress/compress
streams, similar to the Gzip implementation from the standard library.
- variants that work with direct buffers. If it fits with how your code is
structured, it may be slightly faster.
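
To illustrate the first approach, here is a minimal sketch of the
stream-filter pattern. It uses the JDK's GZIPOutputStream/GZIPInputStream so
that it runs without extra dependencies. The assumption is that, with
zstd-jni on the classpath, ZstdOutputStream/ZstdInputStream can be wrapped
around a stream in the same way. The class and method names are
hypothetical and are not taken from parquet-mr.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper showing the filter-stream pattern with JDK classes only.
class StreamFilterRoundTrip {

    // Compress by wrapping the sink in a compressing filter stream.
    // Assumption: with zstd-jni, GZIPOutputStream would be swapped for
    // com.github.luben.zstd.ZstdOutputStream.
    static byte[] compress(byte[] raw) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (OutputStream out = new GZIPOutputStream(sink)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The filter is closed here, so the compressed trailer has been written.
        return sink.toByteArray();
    }

    // Decompress by wrapping the source in a decompressing filter stream.
    // Assumption: with zstd-jni, GZIPInputStream would be swapped for
    // com.github.luben.zstd.ZstdInputStream.
    static byte[] decompress(byte[] compressed) {
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                result.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = "some page bytes, some page bytes".getBytes(StandardCharsets.UTF_8);
        byte[] restored = decompress(compress(raw));
        System.out.println("round trip ok: " + Arrays.equals(raw, restored));
    }
}
```

The calling code only sees InputStream/OutputStream, so switching codecs
means changing the two constructor calls and nothing else.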

If you have any specific questions, please let me know. You can also send
me your PR when it's ready; I may have suggestions.

BTW, it's strange Hadoop decided to reimplement it their own way. The rest
of the ecosystem is using Zstd-JNI, e.g. Spark, Flink, Cassandra, etc.

Regards,
luben



