Posted to user@spark.apache.org by Christopher Piggott <cp...@gmail.com> on 2017/12/30 20:44:07 UTC

Converting binary files

I have been searching for examples, but not finding exactly what I need.

I am looking for the paradigm for using Spark 2.2 to convert a bunch of
binary files into a bunch of different binary files.  I'm starting with:

   val files = spark.sparkContext.binaryFiles("hdfs://1.2.3.4/input")

then convert them:

   val converted = files.map { case (filename, content) =>
     filename -> convert(content)
   }

but I don't really want to save by 'partition'. I want to save each file
under its original name but in a different directory, e.g. "converted/*".

I'm not quite sure how I'm supposed to do this within the framework of
what's available to me in SparkContext.  Do I need to do it myself using
the HDFS api?

It would seem like this would be a pretty normal thing to do.  Imagine, for
instance, that I wanted to take a bunch of binary files, compress them, and
save the compressed output to a different directory.  I feel like I'm
missing something fundamental here.

--C

Re: Converting binary files

Posted by "Lalwani, Jayesh" <Ja...@capitalone.com>.

You can repartition your dataframe into 1 partition, and all the data will land in that one partition. However, doing this is perilous because you will end up with all your data on one node, and if you have too much data you will run out of memory. In fact, any time you are thinking about putting data in a single file, you should ask yourself “Does this data fit into memory?”
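
As a rough sketch of what repartitioning to one partition looks like: the DataFrame contents, the object name, and the output path /tmp/single-partition-output are made up for illustration, and a local master is assumed so that it runs standalone.

```scala
import org.apache.spark.sql.SparkSession

object SinglePartitionWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("single-partition-write")
      .master("local[*]") // assumption: local run, for illustration only
      .getOrCreate()

    // Placeholder data standing in for a real dataframe.
    val df = spark.range(0, 1000).toDF("id")

    // repartition(1) shuffles every row into one partition,
    // so a single task ends up writing all of the data.
    df.repartition(1)
      .write
      .mode("overwrite")
      .parquet("/tmp/single-partition-output") // hypothetical output path

    spark.stop()
  }
}
```

Note that the output is still a directory; it just contains a single part file instead of one per partition.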

The reason Spark is geared towards reading and writing data in a partitioned manner is that, fundamentally, partitioning data is how you scale your applications. Partitioned data allows Spark (or really any application that is designed to scale on a cluster) to read data in parallel, process it, and write it out without any bottlenecking. Humans prefer all their data in a single file/table, because humans have a limited ability to keep track of a multitude of files. Grid-enabled software hates single files, simply because there is no good way for two nodes to read a large file without some sort of bottlenecking.

Imagine a data processing pipeline that starts with some sort of ingestion and transformation at one end, which feeds into several analytical processes. Usually there are humans at the end who are looking at the results of the analytics.  These humans love to get their analytics in a dashboard that gives them a high-level view of the data. However, all the data processing systems that go from input to analytics prefer their data to be cut up into bite-sized chunks.
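
That said, for the one-output-file-per-input-file case in the original question, the writes can still happen in parallel on the executors by calling the Hadoop FileSystem API from inside the job. Below is a minimal sketch, not a definitive implementation. The convert function here is a hypothetical stand-in (it just reverses the bytes), and the output directory hdfs://1.2.3.4/converted, the helper targetPath, and the object name are assumptions for illustration.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

object ConvertBinaryFiles {
  // Hypothetical stand-in for the real conversion logic.
  def convert(content: Array[Byte]): Array[Byte] = content.reverse

  // Keep the original file name, but place it under a different directory.
  def targetPath(inputFile: String, outputDir: String): Path =
    new Path(outputDir, new Path(inputFile).getName)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("convert-binary-files").getOrCreate()

    val inputDir  = "hdfs://1.2.3.4/input"
    val outputDir = "hdfs://1.2.3.4/converted" // assumed output location

    // binaryFiles yields (path, PortableDataStream) pairs, one per input file.
    spark.sparkContext.binaryFiles(inputDir).foreach { case (filename, stream) =>
      // Reads the whole file into memory on the executor.
      val converted = convert(stream.toArray())
      val out = targetPath(filename, outputDir)

      // Configuration is not serializable, so build it inside the task.
      val fs = out.getFileSystem(new Configuration())
      val os = fs.create(out, true) // overwrite = true, so a retried task is harmless
      try os.write(converted) finally os.close()
    }

    spark.stop()
  }
}
```

Each input file has to fit in executor memory for this to work, since the whole file is read with toArray(). Nothing is collected to the driver, and Spark's own save methods are not involved, so there are no part-NNNNN files in the output directory.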
