Posted to user@spark.apache.org by HARSH TAKKAR <ta...@gmail.com> on 2020/01/13 12:31:45 UTC

Reading 7z file in Spark

Hi,


Is it possible to read a 7z-compressed file in Spark?


Kind Regards
Harsh Takkar

Re: Reading 7z file in Spark

Posted by Andrew Melo <an...@gmail.com>.
It only makes sense if the underlying file is also splittable, and even
then, it doesn't really do anything for you if you don't explicitly tell
Spark about the split boundaries.
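
For the common case of text files, Hadoop's input format already carries
the block boundaries of bzip2, so Spark gets the splits for free; a quick
sanity check, as a minimal sketch with hypothetical paths:

// both files assumed to be several hundred MB of text
val gz = spark.sparkContext.textFile("data/big.txt.gz")
val bz = spark.sparkContext.textFile("data/big.txt.bz2")

println(gz.getNumPartitions)  // 1  -- gzip is not splittable, one task reads it all
println(bz.getNumPartitions)  // >1 -- bzip2 block boundaries are known to the input format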

On Tue, Jan 14, 2020 at 7:36 PM Someshwar Kale <sk...@gmail.com> wrote:

> I would suggest using another compression technique that is splittable,
> e.g. bzip2, LZO, or LZ4.

Re: Reading 7z file in Spark

Posted by Someshwar Kale <sk...@gmail.com>.
I would suggest using another compression technique that is splittable,
e.g. bzip2, LZO, or LZ4.
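
A minimal sketch, assuming df is a DataFrame with a single string column
and a hypothetical output path; the text writer's "compression" option
accepts the splittable bzip2 codec:

// write text compressed with bzip2, a splittable codec
df.write
  .option("compression", "bzip2")
  .text("out/lines-bz2")            // hypothetical output path

// reading back: Spark can split each large .bz2 file into many tasks
val lines = spark.read.text("out/lines-bz2")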


Re: Reading 7z file in Spark

Posted by Enrico Minack <ma...@Enrico.Minack.dev>.
Hi,

Spark does not support 7z natively, but you can read any file in Spark:

import org.apache.spark.input.PortableDataStream
import spark.implicits._   // provides .toDF on the RDD below

// placeholder: yields only the file's path, one row per file
def read(stream: PortableDataStream): Iterator[String] = Seq(stream.getPath()).iterator

spark.sparkContext
  .binaryFiles("*.7z")               // one (path, PortableDataStream) pair per matching file
  .flatMap(file => read(file._2))
  .toDF("path")
  .show(false)

This scales with the number of files. A single large 7z file would not 
scale well (a single partition).

Any file that matches *.7z will be loaded via the read(stream:
PortableDataStream) method, which returns an iterator over the rows.
This method is executed on the executor and can implement the
7z-specific code, which is independent of Spark and should not be too
hard (the version above does not open the input stream but returns the
path only; a 7z-aware sketch follows below).
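
For example, here is a minimal sketch of such a read() method, assuming
Apache Commons Compress is available on the executors' classpath (it is
not part of Spark; getInputStream needs version 1.20 or later). 7z
requires random access, so this buffers each archive in memory, which
only suits reasonably small files:

import java.io.{BufferedReader, InputStreamReader}
import org.apache.commons.compress.archivers.sevenz.SevenZFile
import org.apache.commons.compress.utils.SeekableInMemoryByteChannel
import org.apache.spark.input.PortableDataStream
import scala.collection.mutable.ArrayBuffer

def read(stream: PortableDataStream): Iterator[String] = {
  // 7z needs a seekable source, so buffer the whole archive in memory
  val sevenZ = new SevenZFile(new SeekableInMemoryByteChannel(stream.toArray()))
  val lines = ArrayBuffer.empty[String]
  try {
    var entry = sevenZ.getNextEntry
    while (entry != null) {
      if (!entry.isDirectory) {
        // read one archive entry as UTF-8 text, line by line
        val reader = new BufferedReader(
          new InputStreamReader(sevenZ.getInputStream(entry), "UTF-8"))
        var line = reader.readLine()
        while (line != null) { lines += line; line = reader.readLine() }
      }
      entry = sevenZ.getNextEntry
    }
  } finally sevenZ.close()
  lines.iterator
}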

If you are planning to read the same files more than once, it would be
worth first uncompressing them and converting them into a format Spark
supports, as sketched below. Then Spark can scale much better.
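
Here is a sketch of that one-off conversion, reusing the 7z-aware read()
from above and a hypothetical output path; afterwards every job reads
splittable Parquet instead of 7z:

// one-time job: decompress each archive and persist in a splittable format
spark.sparkContext
  .binaryFiles("*.7z")
  .flatMap(file => read(file._2))   // the 7z-aware read() sketched above
  .toDF("line")
  .write
  .parquet("converted/")            // hypothetical output path

// subsequent jobs scale across many partitions
val df = spark.read.parquet("converted/")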

Regards,
Enrico

