Posted to dev@spark.apache.org by Surendranauth Hiraman <su...@velos.io> on 2014/06/19 20:19:11 UTC

Re: Trailing Tasks Saving to HDFS

I've created an issue for this but if anyone has any advice, please let me
know.

Basically, on about 10 GBs of data, saveAsTextFile() to HDFS hangs on two
remaining tasks (out of 320). Those tasks seem to be waiting on data from
another task on another node. Eventually (about 2 hours later) they time
out with a connection reset by peer.

All the data actually seems to be on HDFS as the expected part files. It
just seems like the remaining tasks have corrupted "metadata", so that they
do not realize that they are done. Just a guess though.
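For reference, here is a minimal sketch of the shape of the job. The paths,
app name, and the explicit repartition to 320 partitions are illustrative
placeholders, not our actual pipeline code:

    import org.apache.spark.{SparkConf, SparkContext}

    object SaveToHdfsSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("save-to-hdfs-sketch")
        val sc = new SparkContext(conf)

        // Roughly 10 GB of input; the repartition forces a shuffle before the
        // save stage, so the save runs as ~320 tasks reading shuffle data.
        val records = sc.textFile("hdfs:///data/input")   // placeholder path
          .repartition(320)

        // This is where we see the hang: 318 of the 320 tasks complete, while
        // the last 2 sit idle even though all the expected part-* files are
        // already on HDFS, until the connection reset roughly 2 hours later.
        records.saveAsTextFile("hdfs:///data/output")     // placeholder path

        sc.stop()
      }
    }

If I understand the retry behavior correctly, those trailing tasks should keep
being retried until spark.task.maxFailures (default 4) is reached, at which
point the whole job should fail, which matches the reassignments mentioned in
the quoted message below.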

https://issues.apache.org/jira/browse/SPARK-2202

-Suren




On Wed, Jun 18, 2014 at 8:35 PM, Surendranauth Hiraman <
suren.hiraman@velos.io> wrote:

> Looks like eventually there was some type of reset or timeout and the
> tasks have been reassigned. I'm guessing they'll keep failing until max
> failure count.
>
> The machine it disconnected from was a remote machine, though with other
> problems I've also seen such failures on a machine's connections to itself.
> The log lines from the remote machine are also below.
>
> Any thoughts or guesses would be appreciated!
>
> *"HUNG" WORKER*
>
> 14/06/18 19:41:18 WARN network.ReceivingConnection: Error reading from
> connection to ConnectionManagerId(172.16.25.103,57626)
>
> java.io.IOException: Connection reset by peer
>
> at sun.nio.ch.FileDispatcher.read0(Native Method)
>
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
>
> at sun.nio.ch.IOUtil.read(IOUtil.java:224)
>
> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
>
> at org.apache.spark.network.ReceivingConnection.read(Connection.scala:496)
>
> at
> org.apache.spark.network.ConnectionManager$$anon$6.run(ConnectionManager.scala:175)
>
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>
> at java.lang.Thread.run(Thread.java:679)
>
> 14/06/18 19:41:18 INFO network.ConnectionManager: Handling connection
> error on connection to ConnectionManagerId(172.16.25.103,57626)
>
> 14/06/18 19:41:18 INFO network.ConnectionManager: Removing
> ReceivingConnection to ConnectionManagerId(172.16.25.103,57626)
>
> 14/06/18 19:41:18 INFO network.ConnectionManager: Removing
> SendingConnection to ConnectionManagerId(172.16.25.103,57626)
>
> 14/06/18 19:41:18 INFO network.ConnectionManager: Removing
> ReceivingConnection to ConnectionManagerId(172.16.25.103,57626)
>
> 14/06/18 19:41:18 ERROR network.ConnectionManager: Corresponding
> SendingConnectionManagerId not found
>
>
> *REMOTE WORKER*
>
> 14/06/18 19:41:18 INFO network.ConnectionManager: Removing
> ReceivingConnection to ConnectionManagerId(172.16.25.124,55610)
>
> 14/06/18 19:41:18 ERROR network.ConnectionManager: Corresponding
> SendingConnectionManagerId not found
>
>
>
> On Wed, Jun 18, 2014 at 7:16 PM, Surendranauth Hiraman <
> suren.hiraman@velos.io> wrote:
>
>> I have a flow that ends with saveAsTextFile() to HDFS.
>>
>> It seems all the expected files per partition have been written out,
>> based on the number of part files and the file sizes.
>>
>> But the driver logs show 2 tasks still not completed, with no activity,
>> and the worker logs show no activity for those two tasks for a while now.
>>
>> Has anyone run into this situation? It's happened to me a couple of times
>> now.
>>
>> Thanks.
>>
>> -- Suren
>>
>> SUREN HIRAMAN, VP TECHNOLOGY
>> Velos
>> Accelerating Machine Learning
>>
>> 440 NINTH AVENUE, 11TH FLOOR
>> NEW YORK, NY 10001
>> O: (917) 525-2466 ext. 105
>> F: 646.349.4063
>> E: suren.hiraman@velos.io
>> W: www.velos.io
>>
>>
>
>
> --
>
> SUREN HIRAMAN, VP TECHNOLOGY
> Velos
> Accelerating Machine Learning
>
> 440 NINTH AVENUE, 11TH FLOOR
> NEW YORK, NY 10001
> O: (917) 525-2466 ext. 105
> F: 646.349.4063
> E: suren.hiraman@velos.io
> W: www.velos.io
>
>


-- 

SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: suren.hiraman@velos.io
W: www.velos.io

Re: Trailing Tasks Saving to HDFS

Posted by Patrick Wendell <pw...@gmail.com>.
I'll make a comment on the JIRA - thanks for reporting this, let's get
to the bottom of it.

On Thu, Jun 19, 2014 at 11:19 AM, Surendranauth Hiraman
<su...@velos.io> wrote:
> I've created an issue for this but if anyone has any advice, please let me
> know.
>
> Basically, on about 10 GBs of data, saveAsTextFile() to HDFS hangs on two
> remaining tasks (out of 320). Those tasks seem to be waiting on data from
> another task on another node. Eventually (about 2 hours later) they time out
> with a connection reset by peer.
>
> All the data actually seems to be on HDFS as the expected part files. It
> just seems like the remaining tasks have corrupted "metadata", so that they
> do not realize that they are done. Just a guess though.
>
> https://issues.apache.org/jira/browse/SPARK-2202
>
> -Suren
>
>
>
>
> On Wed, Jun 18, 2014 at 8:35 PM, Surendranauth Hiraman
> <su...@velos.io> wrote:
>>
>> Looks like eventually there was some type of reset or timeout and the
>> tasks have been reassigned. I'm guessing they'll keep failing until max
>> failure count.
>>
>> The machine it disconnected from was a remote machine, though with other
>> problems I've also seen such failures on a machine's connections to itself.
>> The log lines from the remote machine are also below.
>>
>> Any thoughts or guesses would be appreciated!
>>
>> "HUNG" WORKER
>>
>> 14/06/18 19:41:18 WARN network.ReceivingConnection: Error reading from
>> connection to ConnectionManagerId(172.16.25.103,57626)
>>
>> java.io.IOException: Connection reset by peer
>>
>> at sun.nio.ch.FileDispatcher.read0(Native Method)
>>
>> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>>
>> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
>>
>> at sun.nio.ch.IOUtil.read(IOUtil.java:224)
>>
>> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
>>
>> at org.apache.spark.network.ReceivingConnection.read(Connection.scala:496)
>>
>> at
>> org.apache.spark.network.ConnectionManager$$anon$6.run(ConnectionManager.scala:175)
>>
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>>
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>>
>> at java.lang.Thread.run(Thread.java:679)
>>
>> 14/06/18 19:41:18 INFO network.ConnectionManager: Handling connection
>> error on connection to ConnectionManagerId(172.16.25.103,57626)
>>
>> 14/06/18 19:41:18 INFO network.ConnectionManager: Removing
>> ReceivingConnection to ConnectionManagerId(172.16.25.103,57626)
>>
>> 14/06/18 19:41:18 INFO network.ConnectionManager: Removing
>> SendingConnection to ConnectionManagerId(172.16.25.103,57626)
>>
>> 14/06/18 19:41:18 INFO network.ConnectionManager: Removing
>> ReceivingConnection to ConnectionManagerId(172.16.25.103,57626)
>>
>> 14/06/18 19:41:18 ERROR network.ConnectionManager: Corresponding
>> SendingConnectionManagerId not found
>>
>>
>> REMOTE WORKER
>>
>> 14/06/18 19:41:18 INFO network.ConnectionManager: Removing
>> ReceivingConnection to ConnectionManagerId(172.16.25.124,55610)
>>
>> 14/06/18 19:41:18 ERROR network.ConnectionManager: Corresponding
>> SendingConnectionManagerId not found
>>
>>
>>
>>
>> On Wed, Jun 18, 2014 at 7:16 PM, Surendranauth Hiraman
>> <su...@velos.io> wrote:
>>>
>>> I have a flow that ends with saveAsTextFile() to HDFS.
>>>
>>> It seems all the expected files per partition have been written out,
>>> based on the number of part files and the file sizes.
>>>
>>> But the driver logs show 2 tasks still not completed, with no activity,
>>> and the worker logs show no activity for those two tasks for a while now.
>>>
>>> Has anyone run into this situation? It's happened to me a couple of times
>>> now.
>>>
>>> Thanks.
>>>
>>> -- Suren
>>>
>>> SUREN HIRAMAN, VP TECHNOLOGY
>>> Velos
>>> Accelerating Machine Learning
>>>
>>> 440 NINTH AVENUE, 11TH FLOOR
>>> NEW YORK, NY 10001
>>> O: (917) 525-2466 ext. 105
>>> F: 646.349.4063
>>> E: suren.hiraman@velos.io
>>> W: www.velos.io
>>>
>>
>>
>>
>> --
>>
>> SUREN HIRAMAN, VP TECHNOLOGY
>> Velos
>> Accelerating Machine Learning
>>
>> 440 NINTH AVENUE, 11TH FLOOR
>> NEW YORK, NY 10001
>> O: (917) 525-2466 ext. 105
>> F: 646.349.4063
>> E: suren.hiraman@velos.io
>> W: www.velos.io
>>
>
>
>
> --
>
> SUREN HIRAMAN, VP TECHNOLOGY
> Velos
> Accelerating Machine Learning
>
> 440 NINTH AVENUE, 11TH FLOOR
> NEW YORK, NY 10001
> O: (917) 525-2466 ext. 105
> F: 646.349.4063
> E: suren.hiraman@velos.io
> W: www.velos.io
>
