Posted to user@spark.apache.org by Romi Kuntsman <ro...@totango.com> on 2014/11/24 15:20:36 UTC

ExternalAppendOnlyMap: Thread spilling in-memory map of to disk many times slowly

Hello,

I have a large data calculation in Spark, distributed across several
nodes. In the end, I want to write to a single output file.

For this I do:
   output.coalesce(1, false).saveAsTextFile(filename)

What happens is that all the data from the workers flows to a single
worker, which then writes the output.
If the data is small enough, it all goes well.
However, for an RDD above a certain size, I get many messages like the
ones in the log below.
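
To make the data flow concrete, here is a toy model in plain Python (not
Spark code) of what coalesce(1, false) followed by saveAsTextFile amounts
to: a single task walks every parent partition in turn and streams the
records into one file. The function names and the sample partitions are
made up for illustration.

```python
import itertools
import os
import tempfile


def coalesce_to_one(partitions):
    """Model coalesce(1, shuffle=False): one task reads each parent partition in turn."""
    return itertools.chain.from_iterable(partitions)


def save_as_text_file(records, path):
    """Single writer: stream records into one file, one line per record."""
    count = 0
    with open(path, "w") as out:
        for record in records:
            out.write(f"{record}\n")
            count += 1
    return count


# Hypothetical data standing in for the partitions held by three workers.
partitions = [["a", "b"], ["c"], ["d", "e", "f"]]
out_path = os.path.join(tempfile.mkdtemp(), "part-00000")
written = save_as_text_file(coalesce_to_one(partitions), out_path)
```

In this model the writer never holds more than one record at a time; any
buffering would have to come from whatever produces the parent partitions.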

From what I understand, ExternalAppendOnlyMap spills the data to disk when
it can't hold it in memory.
Is there a way to tell it to stream the data right to disk, instead of
spilling each block slowly?
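
The spill-when-full behaviour described above can be pictured with another
toy model in plain Python. This is not Spark's implementation: the class
name, the key-count threshold (a stand-in for a memory budget in bytes)
and the pickle-based spill files are all assumptions made for the sketch.

```python
import os
import pickle
import tempfile
from collections import defaultdict


class SpillingSumMap:
    """Toy append-only map: sums values per key, spills to disk when too large."""

    def __init__(self, max_keys_in_memory):
        self.max_keys_in_memory = max_keys_in_memory  # stand-in for a byte budget
        self.current = defaultdict(int)
        self.spill_files = []
        self.spill_dir = tempfile.mkdtemp()

    def insert(self, key, value):
        self.current[key] += value
        if len(self.current) > self.max_keys_in_memory:
            self._spill()

    def _spill(self):
        # Write the whole in-memory map to a new file, then start over empty.
        path = os.path.join(self.spill_dir, f"spill-{len(self.spill_files)}.pkl")
        with open(path, "wb") as f:
            pickle.dump(dict(self.current), f)
        self.spill_files.append(path)
        self.current.clear()

    def result(self):
        # Merge the in-memory map with every spilled map.
        merged = defaultdict(int, self.current)
        for path in self.spill_files:
            with open(path, "rb") as f:
                for key, value in pickle.load(f).items():
                    merged[key] += value
        return dict(merged)


counts = SpillingSumMap(max_keys_in_memory=3)
for i in range(10):
    counts.insert(i % 5, 1)
totals = counts.result()
```

With a limit of three keys and five distinct keys arriving in rotation,
the map spills twice before the final merge. A small budget relative to
the data therefore produces many small spill files, which matches the
pattern of repeated "spilling in-memory map" lines in the log.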

14/11/24 12:54:59 INFO MapOutputTrackerWorker: Got the output locations
14/11/24 12:54:59 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
14/11/24 12:54:59 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 69 non-empty blocks out of 90 blocks
14/11/24 12:54:59 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 3 remote fetches in 22 ms
14/11/24 12:55:11 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
14/11/24 12:55:11 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 70 non-empty blocks out of 90 blocks
14/11/24 12:55:11 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 3 remote fetches in 4 ms
14/11/24 12:55:11 INFO ExternalAppendOnlyMap: Thread 64 spilling in-memory map of 13 MB to disk (1 time so far)
14/11/24 12:55:11 INFO ExternalAppendOnlyMap: Thread 64 spilling in-memory map of 12 MB to disk (2 times so far)
14/11/24 12:55:11 INFO ExternalAppendOnlyMap: Thread 64 spilling in-memory map of 12 MB to disk (3 times so far)
[...trimmed...]
14/11/24 13:13:28 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
14/11/24 13:13:28 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 69 non-empty blocks out of 90 blocks
14/11/24 13:13:28 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 3 remote fetches in 2 ms
14/11/24 13:13:28 INFO ExternalAppendOnlyMap: Thread 64 spilling in-memory map of 15 MB to disk (1 time so far)
14/11/24 13:13:28 INFO ExternalAppendOnlyMap: Thread 64 spilling in-memory map of 16 MB to disk (2 times so far)
14/11/24 13:13:28 INFO ExternalAppendOnlyMap: Thread 64 spilling in-memory map of 14 MB to disk (3 times so far)
[...trimmed...]
14/11/24 13:13:32 INFO ExternalAppendOnlyMap: Thread 64 spilling in-memory map of 13 MB to disk (33 times so far)
14/11/24 13:13:32 INFO ExternalAppendOnlyMap: Thread 64 spilling in-memory map of 13 MB to disk (34 times so far)
14/11/24 13:13:32 INFO ExternalAppendOnlyMap: Thread 64 spilling in-memory map of 13 MB to disk (35 times so far)
14/11/24 13:13:40 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
14/11/24 13:13:40 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 69 non-empty blocks out of 90 blocks
14/11/24 13:13:40 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 3 remote fetches in 4 ms
14/11/24 13:13:40 INFO ExternalAppendOnlyMap: Thread 64 spilling in-memory map of 10 MB to disk (1 time so far)
14/11/24 13:13:41 INFO ExternalAppendOnlyMap: Thread 64 spilling in-memory map of 10 MB to disk (2 times so far)
14/11/24 13:13:41 INFO ExternalAppendOnlyMap: Thread 64 spilling in-memory map of 9 MB to disk (3 times so far)
[...trimmed...]
14/11/24 13:13:45 INFO ExternalAppendOnlyMap: Thread 64 spilling in-memory map of 12 MB to disk (36 times so far)
14/11/24 13:13:45 INFO ExternalAppendOnlyMap: Thread 64 spilling in-memory map of 11 MB to disk (37 times so far)
14/11/24 13:13:56 INFO FileOutputCommitter: Saved output of task 'attempt_201411241250_0000_m_000000_90' to s3n://mybucket/mydir/output

*Romi Kuntsman*, *Big Data Engineer*
 http://www.totango.com
