Posted to user@spark.apache.org by Steve Lewis <lo...@gmail.com> on 2014/12/12 20:19:45 UTC

how to convert an rdd to a single output file

I have an RDD which is potentially too large to store in memory with
collect. I want a single task to write the contents as a file to hdfs. Time
is not a large issue but memory is.
I use the following code, converting my RDD (scans) to a local Iterator. This
works, but hasNext shows up as a separate task and takes on the order of 20
seconds for a medium-sized job.
Is toLocalIterator a bad function to call in this case, and is there a
better one?

public void writeScores(final Appendable out, JavaRDD<IScoredScan> scans) {
    writer.appendHeader(out, getApplication());
    // Pull the RDD to the driver one partition at a time instead of collect()
    Iterator<IScoredScan> scanIterator = scans.toLocalIterator();
    while (scanIterator.hasNext()) {
        IScoredScan scan = scanIterator.next();
        writer.appendScan(out, getApplication(), scan);
    }
    writer.appendFooter(out, getApplication());
}

Re: how to convert an rdd to a single output file

Posted by Steve Lewis <lo...@gmail.com>.
what would good spill settings be?
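For reference, the main spill-related settings in Spark 1.x look roughly like
this; a sketch using the documented defaults as placeholder values, not tuned
recommendations:

import org.apache.spark.SparkConf;

SparkConf sparkConf = new SparkConf()
    // allow shuffle data to spill to disk under memory pressure (default true)
    .set("spark.shuffle.spill", "true")
    // fraction of the heap shuffle buffers may use before spilling (default 0.2)
    .set("spark.shuffle.memoryFraction", "0.2")
    // local directories that receive the spill files; use large, fast disks
    .set("spark.local.dir", "/tmp");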

On Fri, Dec 12, 2014 at 2:45 PM, Sameer Farooqui <sa...@databricks.com>
wrote:
>
> You could try re-partitioning or coalescing the RDD down to a single
> partition and then write it to disk. Make sure you have good spill settings
> enabled so that the RDD can spill to the local temp dirs if it has to.
>
> On Fri, Dec 12, 2014 at 2:39 PM, Steve Lewis <lo...@gmail.com>
> wrote:
>>
>> The objective is to let the Spark application generate a file in a format
>> which can be consumed by other programs. As I said, I am willing to give up
>> parallelism at this stage (all the expensive steps were earlier), but I do
>> want an efficient way to pass once through an RDD without the requirement
>> to hold it in memory as a list.
>>
>> On Fri, Dec 12, 2014 at 12:22 PM, Sameer Farooqui <sameerf@databricks.com
>> > wrote:
>>
>>> Instead of doing this on the compute side, I would just write out the
>>> file with different blocks initially into HDFS and then use "hadoop fs
>>> -getmerge" or HDFSConcat to get one final output file.
>>>
>>>
>>> - SF
>>>
>>> On Fri, Dec 12, 2014 at 11:19 AM, Steve Lewis <lo...@gmail.com>
>>> wrote:
>>>>
>>>>
>>>> I have an RDD which is potentially too large to store in memory with
>>>> collect. I want a single task to write the contents as a file to hdfs. Time
>>>> is not a large issue but memory is.
>>>> I use the following code, converting my RDD (scans) to a local Iterator.
>>>> This works, but hasNext shows up as a separate task and takes on the
>>>> order of 20 seconds for a medium-sized job.
>>>> Is toLocalIterator a bad function to call in this case, and is there a
>>>> better one?
>>>>
>>>> public void writeScores(final Appendable out, JavaRDD<IScoredScan> scans) {
>>>>     writer.appendHeader(out, getApplication());
>>>>     Iterator<IScoredScan> scanIterator = scans.toLocalIterator();
>>>>     while (scanIterator.hasNext()) {
>>>>         IScoredScan scan = scanIterator.next();
>>>>         writer.appendScan(out, getApplication(), scan);
>>>>     }
>>>>     writer.appendFooter(out, getApplication());
>>>> }
>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>> Steven M. Lewis PhD
>> 4221 105th Ave NE
>> Kirkland, WA 98033
>> 206-384-1340 (cell)
>> Skype lordjoe_com
>>
>>

-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com

Re: how to convert an rdd to a single output file

Posted by Sameer Farooqui <sa...@databricks.com>.
You could try re-partitioning or coalescing the RDD down to a single
partition and then write it to disk. Make sure you have good spill settings
enabled so that the RDD can spill to the local temp dirs if it has to.
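
A minimal sketch of that approach, assuming the scans RDD from the original
post (the output path is a placeholder):

// Collapse to one partition so the save produces a single part file.
// coalesce(1) avoids a full shuffle; use repartition(1) if you need one.
JavaRDD<IScoredScan> single = scans.coalesce(1);
// saveAsTextFile writes one line per element via toString(); map to a
// formatted String first if IScoredScan.toString() is not the wanted format.
single.saveAsTextFile("hdfs:///path/to/scores");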

On Fri, Dec 12, 2014 at 2:39 PM, Steve Lewis <lo...@gmail.com> wrote:
>
> The objective is to let the Spark application generate a file in a format
> which can be consumed by other programs. As I said, I am willing to give up
> parallelism at this stage (all the expensive steps were earlier), but I do
> want an efficient way to pass once through an RDD without the requirement
> to hold it in memory as a list.
>
> On Fri, Dec 12, 2014 at 12:22 PM, Sameer Farooqui <sa...@databricks.com>
> wrote:
>
>> Instead of doing this on the compute side, I would just write out the
>> file with different blocks initially into HDFS and then use "hadoop fs
>> -getmerge" or HDFSConcat to get one final output file.
>>
>>
>> - SF
>>
>> On Fri, Dec 12, 2014 at 11:19 AM, Steve Lewis <lo...@gmail.com>
>> wrote:
>>>
>>>
>>> I have an RDD which is potentially too large to store in memory with
>>> collect. I want a single task to write the contents as a file to hdfs. Time
>>> is not a large issue but memory is.
>>> I use the following code, converting my RDD (scans) to a local Iterator.
>>> This works, but hasNext shows up as a separate task and takes on the
>>> order of 20 seconds for a medium-sized job.
>>> Is toLocalIterator a bad function to call in this case, and is there a
>>> better one?
>>>
>>> public void writeScores(final Appendable out, JavaRDD<IScoredScan> scans) {
>>>     writer.appendHeader(out, getApplication());
>>>     Iterator<IScoredScan> scanIterator = scans.toLocalIterator();
>>>     while (scanIterator.hasNext()) {
>>>         IScoredScan scan = scanIterator.next();
>>>         writer.appendScan(out, getApplication(), scan);
>>>     }
>>>     writer.appendFooter(out, getApplication());
>>> }
>>>
>>>
>>>
>>
>
>
> --
> Steven M. Lewis PhD
> 4221 105th Ave NE
> Kirkland, WA 98033
> 206-384-1340 (cell)
> Skype lordjoe_com
>
>

Re: how to convert an rdd to a single output file

Posted by Steve Lewis <lo...@gmail.com>.
The objective is to let the Spark application generate a file in a format
which can be consumed by other programs. As I said, I am willing to give up
parallelism at this stage (all the expensive steps were earlier), but I do
want an efficient way to pass once through an RDD without the requirement to
hold it in memory as a list.

On Fri, Dec 12, 2014 at 12:22 PM, Sameer Farooqui <sa...@databricks.com>
wrote:

> Instead of doing this on the compute side, I would just write out the file
> with different blocks initially into HDFS and then use "hadoop fs
> -getmerge" or HDFSConcat to get one final output file.
>
>
> - SF
>
> On Fri, Dec 12, 2014 at 11:19 AM, Steve Lewis <lo...@gmail.com>
> wrote:
>>
>>
>> I have an RDD which is potentially too large to store in memory with
>> collect. I want a single task to write the contents as a file to hdfs. Time
>> is not a large issue but memory is.
>> I use the following code, converting my RDD (scans) to a local Iterator.
>> This works, but hasNext shows up as a separate task and takes on the
>> order of 20 seconds for a medium-sized job.
>> Is toLocalIterator a bad function to call in this case, and is there a
>> better one?
>>
>> public void writeScores(final Appendable out, JavaRDD<IScoredScan> scans) {
>>     writer.appendHeader(out, getApplication());
>>     Iterator<IScoredScan> scanIterator = scans.toLocalIterator();
>>     while (scanIterator.hasNext()) {
>>         IScoredScan scan = scanIterator.next();
>>         writer.appendScan(out, getApplication(), scan);
>>     }
>>     writer.appendFooter(out, getApplication());
>> }
>>
>>
>>
>


-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com

Re: how to convert an rdd to a single output file

Posted by Sameer Farooqui <sa...@databricks.com>.
Instead of doing this on the compute side, I would just write out the file
with different blocks initially into HDFS and then use "hadoop fs
-getmerge" or HDFSConcat to get one final output file.


- SF

On Fri, Dec 12, 2014 at 11:19 AM, Steve Lewis <lo...@gmail.com> wrote:
>
>
> I have an RDD which is potentially too large to store in memory with
> collect. I want a single task to write the contents as a file to hdfs. Time
> is not a large issue but memory is.
> I use the following code, converting my RDD (scans) to a local Iterator.
> This works, but hasNext shows up as a separate task and takes on the
> order of 20 seconds for a medium-sized job.
> Is toLocalIterator a bad function to call in this case, and is there a
> better one?
>
> public void writeScores(final Appendable out, JavaRDD<IScoredScan> scans) {
>     writer.appendHeader(out, getApplication());
>     Iterator<IScoredScan> scanIterator = scans.toLocalIterator();
>     while (scanIterator.hasNext()) {
>         IScoredScan scan = scanIterator.next();
>         writer.appendScan(out, getApplication(), scan);
>     }
>     writer.appendFooter(out, getApplication());
> }
>
>
>