Posted to user@spark.apache.org by Mohit Singh <mo...@gmail.com> on 2014/03/01 02:18:36 UTC

LazyOutputFormat in Spark

Hi,
  Is there an equivalent of Hadoop's LazyOutputFormat in Spark
(PySpark)?
http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/lib/output/LazyOutputFormat.html
Basically, I only want to save files that actually contain data, rather
than saving every file, since in some cases the majority of the output
files can be empty.
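
For example, something like this (a toy sketch in Scala from the
spark-shell, where sc is the SparkContext; the path and numbers are
made up) writes one part file per partition even when almost all of
them are empty:

// 1..1000 spread over 100 partitions; after the filter only the last
// partition holds any data, yet 100 part-xxxxx files are written and
// 99 of them are empty.
sc.parallelize(1 to 1000, 100)
  .filter(_ > 990)
  .saveAsTextFile("/tmp/mostly-empty")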
Thanks

-- 
Mohit

"When you want success as badly as you want the air, then you will get it.
There is no other secret of success."
-Socrates

Re: LazyOutputFormat in Spark

Posted by Matei Zaharia <ma...@gmail.com>.
You can probably use LazyOutputFormat directly. If there’s a version of it for the hadoop.mapred API, you can use it with PairRDDFunctions.saveAsHadoopFile() today; otherwise, there’s going to be a version of that for the hadoop.mapreduce API as well in Spark 1.0.

Matei
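
A rough sketch of that approach in the spark-shell (the output path,
key/value types, and sample data here are placeholders, and this
assumes a Hadoop 2.x installation where
org.apache.hadoop.mapred.lib.LazyOutputFormat provides a
setOutputFormatClass(JobConf, ...) helper):

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.{FileOutputFormat, JobConf, TextOutputFormat}
import org.apache.hadoop.mapred.lib.LazyOutputFormat
import org.apache.spark.SparkContext._ // PairRDDFunctions implicits

// Pair data where most partitions end up empty after filtering.
val pairs = sc.parallelize(1 to 1000, 100)
  .filter(_ > 990)
  .map(i => (new Text(i.toString), new Text("value-" + i)))

val jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.setOutputKeyClass(classOf[Text])
jobConf.setOutputValueClass(classOf[Text])
// Register TextOutputFormat behind LazyOutputFormat: the underlying
// record writer is only created when the first record is written, so
// empty partitions produce no part files at all.
LazyOutputFormat.setOutputFormatClass(jobConf, classOf[TextOutputFormat[Text, Text]])
FileOutputFormat.setOutputPath(jobConf, new Path("/tmp/lazy-output"))

pairs.saveAsHadoopDataset(jobConf)

With the wrapper in place, only the partitions that actually emit
records create part files; the rest write nothing.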
