Posted to dev@giraph.apache.org by "Jake Mannix (Created) (JIRA)" <ji...@apache.org> on 2011/11/03 02:17:32 UTC

[jira] [Created] (GIRAPH-72) Running multiple Giraph jobs on the same cluster can lead to port collisions

Running multiple Giraph jobs on the same cluster can lead to port collisions
----------------------------------------------------------------------------

                 Key: GIRAPH-72
                 URL: https://issues.apache.org/jira/browse/GIRAPH-72
             Project: Giraph
          Issue Type: Bug
          Components: lib, zookeeper
    Affects Versions: 0.70.0
         Environment: production hadoop cluster, in-process ZK.
            Reporter: Jake Mannix


Had a Giraph mini-hackathon at work today, and lots of us launched test jobs at the same time and often ran into the following collision:

------
startSuperstep: WORKER_ONLY - Attempt=0, Superstep=-1
2-Nov-2011 23:40:08

java.net.BindException: Problem binding to <hostname>/<hostIP>:30000 : Address already in use
	at org.apache.hadoop.ipc.Server.bind(Server.java:196)
	at org.apache.hadoop.ipc.Server$Listener.<init>(Server.java:259)
	at org.apache.hadoop.ipc.Server.<init>(Server.java:1039)
	at org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:492)
	at org.apache.hadoop.ipc.RPC.getServer(RPC.java:454)
	at org.apache.giraph.comm.RPCCommunications.getRPCServer(RPCCommunications.java:99)
	at org.apache.giraph.comm.BasicRPCCommunications.<init>(BasicRPCCommunications.java:362)
	at org.apache.giraph.comm.RPCCommunications.<init>(RPCCommunications.java:71)
	at org.apache.giraph.graph.GraphMapper.map(GraphMapper.java:570)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
	at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.net.BindException: Address already in use
	at sun.nio.ch.Net.bind(Native Method)
	at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:126)
	at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:59)
	at org.apache.hadoop.ipc.Server.bind(Server.java:194)
	... 12 more
----

The job then simply hung.  What it should do, I'd imagine, is at a bare minimum catch this exception and let the task die quickly so it can get retried on another machine, or better yet, allow a command-line arg at startup (passed into the Configuration) to decide what ports to use.  Best yet would be something automagic which allows multiple GraphMappers on the same machine without manually picking ports (pick one at random and store it in zookeeper?  but then what about the in-process zookeeper...)
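
To make the first and last suggestions concrete, here is a minimal sketch using only java.net, not Giraph or Hadoop code. The class FailFastBind and both method names are hypothetical. bindOrFail turns the BindException into an immediate failure the caller can propagate so the task attempt dies instead of hanging; bindAnyFreePort asks the OS for a free port by binding to port 0. With the second approach the chosen port would still have to be published to the other workers somehow (for instance via zookeeper, as mused above), which this sketch does not attempt.

```java
import java.io.IOException;
import java.net.BindException;
import java.net.ServerSocket;

/** Hypothetical helper for illustration; not part of Giraph or Hadoop. */
final class FailFastBind {
    /** Binds to the requested port, or fails immediately with a descriptive error. */
    static ServerSocket bindOrFail(int port) throws IOException {
        try {
            return new ServerSocket(port);
        } catch (BindException e) {
            // Rethrow as an unchecked error so the calling task can die right
            // away instead of waiting on a server that never started.
            throw new IllegalStateException("Port " + port + " is already in use", e);
        }
    }

    /** Asks the OS for any free port by binding to port 0. */
    static ServerSocket bindAnyFreePort() throws IOException {
        return new ServerSocket(0);
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocket first = bindAnyFreePort()) {
            int taken = first.getLocalPort();
            // A second bind to the same port collides, as in the trace above.
            try (ServerSocket second = bindOrFail(taken)) {
                System.out.println("unexpectedly bound " + second.getLocalPort());
            } catch (IllegalStateException e) {
                System.out.println("failed fast: " + e.getMessage());
            }
        }
    }
}
```

Whether a failed attempt actually lands on a different machine is up to the scheduler; the sketch only shows how to avoid the hang.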

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (GIRAPH-72) Running multiple Giraph jobs on the same cluster can lead to port collisions

Posted by "Avery Ching (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/GIRAPH-72?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13142785#comment-13142785 ] 

Avery Ching commented on GIRAPH-72:
-----------------------------------

Yes, this is a possible problem.  In the past, I've tried to grab all the map slots of a given task tracker via the appropriate memory configuration.  Right now, it's kind of nice to have the ports correspond to the task partition for debuggability.  Would love to hear any other ideas.
                

[jira] [Commented] (GIRAPH-72) Running multiple Giraph jobs on the same cluster can lead to port collisions

Posted by "Avery Ching (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/GIRAPH-72?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13142870#comment-13142870 ] 

Avery Ching commented on GIRAPH-72:
-----------------------------------

Ah, you can set the base RPC port, by the way; sorry, I forgot to mention it.

    /** Initial port to start using for the RPC communication */
    public static final String RPC_INITIAL_PORT = "giraph.rpcInitialPort";
    /** Default port to start using for the RPC communication */
    public static final int RPC_INITIAL_PORT_DEFAULT = 30000;

The right thing is to handle failures early on in the application in general (something that currently doesn't kick in until after the first checkpoint).  Let's address that issue after GIRAPH-11.  It's a monstrous change that is working right now (it passed the local and MR unit tests) and will change a lot about how exactly to do that.  Thanks for filing the issue; agreed, it's a problem.
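
As a rough illustration of how overriding giraph.rpcInitialPort keeps two jobs apart, here is a dependency-free sketch. It uses a plain java.util.Map as a stand-in for Hadoop's Configuration, and it assumes each worker's port is the base port plus its task partition (the thread only says that ports correspond to the task partition, so the exact rule is an assumption). The class RpcPortSketch and the method portFor are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; a Map stands in for Hadoop's Configuration. */
final class RpcPortSketch {
    static final String RPC_INITIAL_PORT = "giraph.rpcInitialPort";
    static final int RPC_INITIAL_PORT_DEFAULT = 30000;

    /** Assumed rule: port = configured base port + task partition. */
    static int portFor(Map<String, String> conf, int taskPartition) {
        String base = conf.get(RPC_INITIAL_PORT);
        int initialPort = (base == null) ? RPC_INITIAL_PORT_DEFAULT : Integer.parseInt(base);
        return initialPort + taskPartition;
    }

    public static void main(String[] args) {
        Map<String, String> jobA = new HashMap<>();   // keeps the default base port
        Map<String, String> jobB = new HashMap<>();
        jobB.put(RPC_INITIAL_PORT, "31000");          // overrides the base port

        System.out.println(portFor(jobA, 0)); // prints 30000
        System.out.println(portFor(jobB, 0)); // prints 31000
    }
}
```

Under that assumption, two jobs left on the default both want port 30000 for partition 0 on a shared machine, which matches the collision in the report; giving each job its own base port avoids it as long as the ranges don't overlap.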
                

[jira] [Resolved] (GIRAPH-72) Running multiple Giraph jobs on the same cluster can lead to port collisions

Posted by "Jakob Homan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/GIRAPH-72?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jakob Homan resolved GIRAPH-72.
-------------------------------

    Resolution: Duplicate

This has been (hopefully) fixed by GIRAPH-128.  Closing as duplicate.
                
> Running multiple Giraph jobs on the same cluster can lead to port collisions
> ----------------------------------------------------------------------------
>
>                 Key: GIRAPH-72
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-72
>             Project: Giraph
>          Issue Type: Bug
>          Components: lib, zookeeper
>    Affects Versions: 0.1.0
>         Environment: production hadoop cluster, in-process ZK.
>            Reporter: Jake Mannix

[jira] [Commented] (GIRAPH-72) Running multiple Giraph jobs on the same cluster can lead to port collisions

Posted by "Jake Mannix (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/GIRAPH-72?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13142862#comment-13142862 ] 

Jake Mannix commented on GIRAPH-72:
-----------------------------------

Well, as I said, at a bare minimum, if you're on a cluster where you can't just grab all the slots on a box, being able to manually set the ports used would mean that you only collide with *yourself* in a given task, because everybody is using different ports on different jobs (or maybe even by default generate a random port based on the job id if not specified?).  More clever things would be great, of course, but I'm not sure what the best ones would be.  Even less clever things would be helpful: on this specific exception, die early in a way that sends your task to another box.  Is that possible/easy?
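
The "port based on the job id" idea could look roughly like the sketch below. Everything in it is an assumption for illustration: the class JobPortSketch, the method basePortFor, the block size of 100 ports per job, the 200 blocks starting at 30000, and the example job ID strings. It hashes the job ID into one of the blocks so each job gets a deterministic base port. Two jobs can still hash to the same block, so this only lowers the odds of a collision; it doesn't rule one out.

```java
/** Hypothetical sketch: derive a per-job base port from the job ID. */
final class JobPortSketch {
    static final int MIN_BASE_PORT = 30000; // the default base port mentioned in this thread
    static final int BLOCK_SIZE = 100;      // assumed max number of tasks per machine per job
    static final int NUM_BLOCKS = 200;      // assumed: covers ports 30000..49999

    /** Maps a job ID to the first port of one of NUM_BLOCKS non-overlapping blocks. */
    static int basePortFor(String jobId) {
        // floorMod keeps the block index non-negative even for negative hash codes.
        int block = Math.floorMod(jobId.hashCode(), NUM_BLOCKS);
        return MIN_BASE_PORT + block * BLOCK_SIZE;
    }

    public static void main(String[] args) {
        // Made-up job IDs; the same ID always yields the same base port.
        System.out.println(basePortFor("job_201111022340_0001"));
        System.out.println(basePortFor("job_201111022340_0002"));
    }
}
```

Because the mapping is deterministic, every task of a job can compute the same base port from the job ID alone, without any extra coordination.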
                