Posted to user@spark.apache.org by Sid <fl...@gmail.com> on 2022/09/14 18:13:56 UTC

Splittable or not?

Hello experts,

I know that Gzip and Snappy files are not splittable, i.e. the data won't be
distributed across multiple blocks; instead, Spark will try to load the data
into a single partition/block.

So, my question is: when I write Parquet data via Spark, it gets stored at the
destination with a name like part*.snappy.parquet

So, when I read this data, will it affect my performance?

Please correct me if there is a gap in my understanding.

Thanks,
Sid

Re: Splittable or not?

Posted by Jack Goodson <ja...@gmail.com>.

When reading in Gzip files, I’ve always read them into a data frame and then written them out to Parquet/Delta, more or less in their raw form. I then use these files for my transformations, as the workloads are now parallelisable from these split files. Because a Gzip file is not splittable, each one is read in full by a single task (on an executor, not on the driver), so you may be limited by the memory available to that task. You may need an iterative first step if your Gzips cannot all be handled this way at once (this may require some experimentation).
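
A minimal sketch of that Gzip-to-Parquet step, assuming the Gzip files hold CSV with a header row and that an existing SparkSession is passed in. The function name, paths and partition count are hypothetical, chosen only for illustration:

```python
def gzip_to_parquet(spark, src_path, dst_path, num_partitions):
    """Read gzipped CSV once and rewrite it as Parquet.

    `spark` is assumed to be an existing SparkSession. Each .gz file is
    read by a single task, so the repartition() call spreads the rows
    out before they are written as several Parquet files.
    """
    df = spark.read.csv(src_path, header=True)
    (df.repartition(num_partitions)
       .write.mode("overwrite")
       .parquet(dst_path))


# Hypothetical usage, given a running session named `spark`:
# gzip_to_parquet(spark, "/data/raw/*.csv.gz", "/data/parquet/raw", 64)
```

Later jobs would then read from the Parquet location rather than from the Gzip files.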

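If you don’t want to have an intermediate step of writing the files, you can repartition the data frame right after reading the Gzip file so that later stages run in parallel (SparkContext.parallelize takes a local collection rather than a file, so it does not apply here).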

Hope this helps 

Re: Splittable or not?

Posted by Sid <fl...@gmail.com>.

Cool. Thanks, everyone for the reply.

Re: Splittable or not?

Posted by Enrico Minack <in...@enrico.minack.dev>.

If by "won't affect the performance" you mean "parquet is splittable
though it uses snappy", then yes. Splittable files allow for optimal
parallelization, so reading them "won't affect performance".

Spark already splits the data into multiple files when writing (here,
parquet files). Even if each file were not splittable, your data would
already have been split. Splittable parquet files allow for more
granularity (more splitting of your data) in case those files are big.
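
To make the granularity point concrete, here is a rough sketch that estimates how many read tasks a set of files could yield. It is a simplification under stated assumptions: 128 MiB mirrors the default of spark.sql.files.maxPartitionBytes, the function name is hypothetical, and Spark's real planning also packs small files together and takes core count and file-open cost into account.

```python
import math
from typing import List

# Assumed split size; mirrors the default of spark.sql.files.maxPartitionBytes.
DEFAULT_MAX_SPLIT_BYTES = 128 * 1024 * 1024


def estimate_read_tasks(file_sizes: List[int], splittable: bool,
                        max_split_bytes: int = DEFAULT_MAX_SPLIT_BYTES) -> int:
    """Rough upper bound on parallel read tasks for the given files."""
    if splittable:
        # A splittable file can be cut into several byte ranges.
        return sum(max(1, math.ceil(size / max_split_bytes))
                   for size in file_sizes)
    # A non-splittable file must be read start to finish by one task.
    return len(file_sizes)


GIB = 1024 ** 3
files = [GIB] * 4  # four output files of 1 GiB each

tasks_if_splittable = estimate_read_tasks(files, splittable=True)
tasks_if_not = estimate_read_tasks(files, splittable=False)
print(tasks_if_splittable, tasks_if_not)
```

With four files the data are already split four ways; splittability only adds further parallelism when the individual files are large.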

Enrico


Re: Splittable or not?

Posted by Sid <fl...@gmail.com>.

Okay, so you mean that Parquet compresses the denormalized data using
Snappy internally, so it won't affect performance.

Only using Snappy by itself (on a plain file) will affect performance.

Am I correct?

Re: Splittable or not?

Posted by Amit Joshi <ma...@gmail.com>.

Hi Sid,

Snappy itself is not splittable. But a format that contains the actual
data, like Parquet (whose files are divided into row groups), can be
compressed using Snappy.
This works because the blocks (pages, in the Parquet format) inside the
Parquet file can each be compressed independently using Snappy.
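
A small sketch of why independent blocks matter. It uses zlib from the Python standard library as a stand-in for Snappy, and fixed-size byte blocks as a stand-in for Parquet pages, so it illustrates the idea only and is not how Parquet lays out data. All names are hypothetical.

```python
import zlib
from typing import List


def compress_in_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Compress each fixed-size block on its own, as a container format would."""
    return [zlib.compress(data[i:i + block_size])
            for i in range(0, len(data), block_size)]


def read_block(blocks: List[bytes], index: int) -> bytes:
    """Any block can be decompressed without touching the others."""
    return zlib.decompress(blocks[index])


def can_decode_from(stream: bytes, offset: int) -> bool:
    """Try to start decompressing a single compressed stream mid-way."""
    try:
        zlib.decompress(stream[offset:])
        return True
    except zlib.error:
        return False


data = b"".join(b"row-%06d\n" % i for i in range(10_000))

# Container style: many independently compressed blocks.
blocks = compress_in_blocks(data, block_size=16_384)
middle_block = read_block(blocks, 3)

# Raw style: one compressed stream for the whole file.
whole = zlib.compress(data)
mid_stream_ok = can_decode_from(whole, len(whole) // 2)
```

A reader can jump straight to any block in the first layout, whereas the single stream has to be decoded from its beginning, which is what makes a raw compressed file non-splittable.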

Thanks
Amit
