Posted to dev@giraph.apache.org by "Alessandro Presta (JIRA)" <ji...@apache.org> on 2012/08/17 21:35:38 UTC

[jira] [Created] (GIRAPH-304) Closed channels between workers

Alessandro Presta created GIRAPH-304:
----------------------------------------

             Summary: Closed channels between workers
                 Key: GIRAPH-304
                 URL: https://issues.apache.org/jira/browse/GIRAPH-304
             Project: Giraph
          Issue Type: Bug
            Reporter: Alessandro Presta
            Assignee: Alessandro Presta


With GIRAPH-300 we are able to complete jobs with higher numbers of workers thanks to retrying failed connections. However, we still observe ClosedChannelException with more than 100 workers.
The patch also introduces a default TCP backlog of 100, so we should probably set this dynamically to equal the number of workers instead.
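
For illustration only, here is a rough sketch of what sizing the backlog from the worker count could look like. It is not the GIRAPH-304 patch and does not use Giraph's or Netty's classes; it uses the plain JDK `ServerSocket` so it runs on its own. The class name, the `backlogFor` helper, and the choice to keep 100 as a floor are all assumptions made for the example. Note that the backlog is only a request, and the operating system may cap it.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

public class BacklogSketch {
    // Assumed floor for the example, mirroring the default of 100 mentioned above.
    static final int DEFAULT_BACKLOG = 100;

    // Hypothetical helper: request one backlog slot per worker,
    // but never less than the default.
    static int backlogFor(int numWorkers) {
        return Math.max(DEFAULT_BACKLOG, numWorkers);
    }

    // Opens a listening socket on an ephemeral port with the computed backlog.
    static ServerSocket open(int numWorkers) throws IOException {
        ServerSocket server = new ServerSocket();
        server.bind(new InetSocketAddress(0), backlogFor(numWorkers));
        return server;
    }

    public static void main(String[] args) throws IOException {
        int numWorkers = 250; // example value only
        try (ServerSocket server = open(numWorkers)) {
            System.out.println("listening on port " + server.getLocalPort()
                + ", requested backlog " + backlogFor(numWorkers));
        }
    }
}
```

With Netty the same value would presumably be passed as a server bootstrap option rather than to `bind`, but that wiring is not shown here.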

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Re: [jira] [Commented] (GIRAPH-304) Closed channels between workers

Posted by Eli Reisman <in...@gmail.com>.
It's been a few weeks since I could run jobs all the time like I have been
the last few days. We're seeing all sorts of connection errors now, as you
guys have been. We are also seeing a point as we scale out to more workers
where Netty just can't handle that many connections at once. These Netty
improvements you've been working on have us in a transitional stage. I bet
that as this levels off and little tweaks/bugfixes occur, you will see most
of this go away. In Netty's original "raw form" on Giraph it was smooth
sailing in this department for us. The reliability improvements are
absolutely needed and are really great, and I'm sure this downside will
vanish as those improvements settle in. In use here, Netty has previously
proven it can handle the scale orders of magnitude better than Hadoop RPC
ever did. Great work, don't give up!


On Fri, Aug 17, 2012 at 4:47 PM, Eli Reisman (JIRA) <ji...@apache.org> wrote:

>
>     [
> https://issues.apache.org/jira/browse/GIRAPH-304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13437152#comment-13437152]
>
> Eli Reisman commented on GIRAPH-304:
> ------------------------------------
>
> We are routinely able to run four-figure numbers of workers here without
> problems. We see connection errors only after the Netty buffers or worker
> memory in general have been overwhelmed and a worker has crashed and is
> trying to restart. Funny that you guys would get this error at such a low
> number of workers. How far into the job does this happen?

[jira] [Commented] (GIRAPH-304) Closed channels between workers

Posted by "Avery Ching (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13437638#comment-13437638 ] 

Avery Ching commented on GIRAPH-304:
------------------------------------

Alessandro, I don't see a patch.

Eli, I don't think our network is as reliable as yours =).  We get all kinds of errors.
                


[jira] [Commented] (GIRAPH-304) Closed channels between workers

Posted by "Eli Reisman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13454592#comment-13454592 ] 

Eli Reisman commented on GIRAPH-304:
------------------------------------

Does anyone have a reason to wait on this, or should we commit? It looks good to me.

                


[jira] [Commented] (GIRAPH-304) Closed channels between workers

Posted by "Eli Reisman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438118#comment-13438118 ] 

Eli Reisman commented on GIRAPH-304:
------------------------------------

Turns out we get errors now too (see the mailing list posting). This reliability work is great because we are sharing a cluster with lots of other MR jobs all the time. Looks great!

                


[jira] [Updated] (GIRAPH-304) Closed channels between workers

Posted by "Alessandro Presta (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/GIRAPH-304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alessandro Presta updated GIRAPH-304:
-------------------------------------

    Attachment: GIRAPH-304.patch

This sets a TCP backlog equal to the number of workers.
                


[jira] [Commented] (GIRAPH-304) Closed channels between workers

Posted by "Eli Reisman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13437152#comment-13437152 ] 

Eli Reisman commented on GIRAPH-304:
------------------------------------

We are routinely able to run four-figure numbers of workers here without problems. We see connection errors only after the Netty buffers or worker memory in general have been overwhelmed and a worker has crashed and is trying to restart. Funny that you guys would get this error at such a low number of workers. How far into the job does this happen?

                


[jira] [Commented] (GIRAPH-304) Closed channels between workers

Posted by "Eli Reisman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438120#comment-13438120 ] 

Eli Reisman commented on GIRAPH-304:
------------------------------------

After running jobs all weekend, I am reluctantly thinking perhaps we need to raise the default thread pool max from 32 a little higher again (maybe not 64, but 48 or something?), because when the resources are available the higher limit seems to give us a bit more headroom before a worker's Netty implementation crashes. This is just an observation based on playing with it; I'm not sure whether there's a better fix for the symptoms, but I thought I'd throw it out there.
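
To make the idea concrete, a minimal sketch of a thread pool whose maximum can be raised without a code change might look like the following. This is not Giraph's actual configuration code: the class name, the `sketch.maxThreads` system property, and both helpers are invented for the example, and only the default of 32 and the trial value of 48 come from the comment above.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ThreadPoolCapSketch {
    // Default cap taken from the discussion above.
    static final int DEFAULT_MAX_THREADS = 32;

    // Hypothetical helper: parse an override such as "48", falling back to the default.
    static int maxThreads(String override) {
        if (override == null || override.trim().isEmpty()) {
            return DEFAULT_MAX_THREADS;
        }
        return Math.max(1, Integer.parseInt(override.trim()));
    }

    // Builds a pool that never grows past max threads and lets idle threads exit.
    static ThreadPoolExecutor newCappedPool(int max) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            max, max, 60L, TimeUnit.SECONDS, new LinkedBlockingQueue<Runnable>());
        pool.allowCoreThreadTimeOut(true);
        return pool;
    }

    public static void main(String[] args) {
        // "sketch.maxThreads" is an assumed property name, not a Giraph option.
        int max = maxThreads(System.getProperty("sketch.maxThreads"));
        ThreadPoolExecutor pool = newCappedPool(max);
        System.out.println("thread pool max = " + pool.getMaximumPoolSize());
        pool.shutdown();
    }
}
```

Running it with `-Dsketch.maxThreads=48` would try the higher limit suggested above while leaving the default at 32.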

                
