Posted to dev@spark.apache.org by yao <ya...@gmail.com> on 2014/08/26 01:53:39 UTC

Too many CancelledKeyExceptions thrown from ConnectionManager

Hi Folks,

We are testing our home-made KMeans algorithm using Spark on YARN.
Recently, we've found that the application fails frequently when
clustering over 300,000,000 users (each user is represented by a feature
vector and the whole data set is around 600,000,000). After digging into
the job log, we found many CancelledKeyExceptions thrown by
ConnectionManager, but no other exceptions. We suspect that the frequent
CancelledKeyExceptions bring the whole application down, since the
application often fails on the third or fourth iteration for large
datasets. Any suggestions on where to look would be welcome.

*Errors in job log*:
java.nio.channels.CancelledKeyException
        at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:363)
        at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:116)
14/08/25 19:04:32 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(lsv-289.rfiserve.net,43199)
14/08/25 19:04:32 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
14/08/25 19:04:32 INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@2570cd62
14/08/25 19:04:32 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@2570cd62
java.nio.channels.CancelledKeyException
        at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:363)
        at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:116)
14/08/25 19:04:32 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(lsv-289.rfiserve.net,56727)
14/08/25 19:04:32 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(lsv-289.rfiserve.net,56727)
14/08/25 19:04:32 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(lsv-289.rfiserve.net,56727)
14/08/25 19:04:32 INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@37c8b85a
14/08/25 19:04:32 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@37c8b85a
java.nio.channels.CancelledKeyException
        at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:287)
        at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:116)
14/08/25 19:04:32 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(lsv-668.rfiserve.net,41913)
14/08/25 19:04:32 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(lsv-668.rfiserve.net,41913)
14/08/25 19:04:32 INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@fcea3a4
14/08/25 19:04:32 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
14/08/25 19:04:32 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@fcea3a4


Best
Shengzhe
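
When digging through a job log like the one above, it can help to tally how
often the exception appears and which peers the connections are being removed
for. The following is only a rough sketch of such a tally, not part of Spark:
the class name ConnectionLogTally and its methods are hypothetical, and it
assumes each log entry sits on a single line (as in the executor log itself,
before email wrapping).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for summarizing ConnectionManager log lines.
public class ConnectionLogTally {

    // Matches "Removing SendingConnection to ConnectionManagerId(host,port)"
    // and the ReceivingConnection variant; group 2 is the host name.
    private static final Pattern REMOVED = Pattern.compile(
        "Removing (Sending|Receiving)Connection to ConnectionManagerId\\(([^,]+),(\\d+)\\)");

    // Counts connection removals per remote host, sorted by host name.
    static Map<String, Integer> removalsPerHost(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = REMOVED.matcher(line);
            if (m.find()) {
                counts.merge(m.group(2), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Counts lines that report a CancelledKeyException.
    static long countCancelledKeyExceptions(List<String> lines) {
        return lines.stream()
            .filter(l -> l.contains("java.nio.channels.CancelledKeyException"))
            .count();
    }

    public static void main(String[] args) {
        // Sample lines copied from the log in this message.
        List<String> sample = Arrays.asList(
            "java.nio.channels.CancelledKeyException",
            "14/08/25 19:04:32 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(lsv-289.rfiserve.net,43199)",
            "14/08/25 19:04:32 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(lsv-668.rfiserve.net,41913)",
            "14/08/25 19:04:32 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(lsv-668.rfiserve.net,41913)");

        System.out.println(countCancelledKeyExceptions(sample)); // prints 1
        System.out.println(removalsPerHost(sample));
        // prints {lsv-289.rfiserve.net=1, lsv-668.rfiserve.net=2}
    }
}
```

If the removals cluster on a few hosts, those executors are a reasonable
place to look first for lost or restarted containers.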

Re: Too many CancelledKeyExceptions thrown from ConnectionManager

Posted by yao <ya...@gmail.com>.
Wow, great job. We will take a look and try our application again with your
patch.


On Tue, Aug 26, 2014 at 5:31 AM, Kousuke Saruta <sa...@oss.nttdata.co.jp>
wrote:

> Hi Shengzhe,
>
> I ran into the same situation.
>
> I think Connection and ConnectionManager have some race condition issues,
> and the error you mentioned may be caused by them.
> I'm currently trying to resolve this in
> https://github.com/apache/spark/pull/2019.
> Please check it out.
>
> - Kousuke

Re: Too many CancelledKeyExceptions thrown from ConnectionManager

Posted by Kousuke Saruta <sa...@oss.nttdata.co.jp>.
Hi Shengzhe,

I ran into the same situation.

I think Connection and ConnectionManager have some race condition issues,
and the error you mentioned may be caused by them.
I'm currently trying to resolve this in
https://github.com/apache/spark/pull/2019.
Please check it out.

- Kousuke
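
For readers unfamiliar with how this exception arises in java.nio, here is a
minimal, single-threaded sketch. It is not Spark's ConnectionManager code and
not the change in the pull request; the class CancelledKeyDemo and the helper
trySetInterestOps are hypothetical names for illustration. It shows that
touching a SelectionKey after it has been cancelled throws
CancelledKeyException, which in a multi-threaded selector loop can happen
when one thread cancels a key while another is still using it.

```java
import java.io.IOException;
import java.nio.channels.CancelledKeyException;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

// Hypothetical demo class, not taken from Spark.
public class CancelledKeyDemo {

    // Sets the interest ops only if the key is still usable.
    // Returns false instead of letting CancelledKeyException propagate.
    static boolean trySetInterestOps(SelectionKey key, int ops) {
        if (!key.isValid()) {
            return false;
        }
        try {
            key.interestOps(ops);
            return true;
        } catch (CancelledKeyException e) {
            // Another thread may cancel the key between isValid() and
            // interestOps(), so the check alone is not sufficient.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        try (Selector selector = Selector.open()) {
            Pipe pipe = Pipe.open();
            pipe.source().configureBlocking(false);
            SelectionKey key =
                pipe.source().register(selector, SelectionKey.OP_READ);

            // Stands in for a thread that is tearing down the connection.
            key.cancel();

            try {
                key.interestOps(SelectionKey.OP_READ);
            } catch (CancelledKeyException e) {
                System.out.println("unguarded: CancelledKeyException");
            }

            System.out.println(
                "guarded: " + trySetInterestOps(key, SelectionKey.OP_READ));
            // prints "guarded: false"

            pipe.source().close();
            pipe.sink().close();
        }
    }
}
```

Guarding like this only suppresses the symptom; whether it is the right fix
for ConnectionManager depends on the actual race, which the pull request
above addresses.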
