Posted to commits@cassandra.apache.org by "Patrik Modesto (Updated) (JIRA)" <ji...@apache.org> on 2012/03/15 10:11:38 UTC

[jira] [Updated] (CASSANDRA-3811) Empty rpc_address prevents running MapReduce job outside a cluster

     [ https://issues.apache.org/jira/browse/CASSANDRA-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrik Modesto updated CASSANDRA-3811:
--------------------------------------

             Priority: Critical  (was: Major)
    Affects Version/s: 0.8.10

Changed to Critical, because the Cassandra-Hadoop integration doesn't work with this configuration.

The problem really is the rpc_endpoint 0.0.0.0: CFIF (ColumnFamilyInputFormat) can't handle it.

This is what happens if you set up a Cassandra cluster whose nodes report rpc_endpoint 0.0.0.0:

A) You run the MapReduce job from outside the cluster, where there is no Cassandra server on localhost: the job fails because it can't connect to localhost to get the splits.

B) You run the MapReduce job from inside the cluster, where there is a Cassandra server on localhost:
   1) CFIF calls describe_ring, which returns something like this:
{noformat}
148873535527910577765226390751398592512 - 21267647932558653966460912964485513216 [10.0.18.129,10.0.18.99,10.0.18.98] [0.0.0.0,0.0.0.0,0.0.0.0]
106338239662793269832304564822427566080 - 148873535527910577765226390751398592512 [10.0.18.87,10.0.18.129,10.0.18.99] [0.0.0.0,0.0.0.0,0.0.0.0]
63802943797675961899382738893456539648 - 106338239662793269832304564822427566080 [10.0.18.98,10.0.18.87,10.0.18.129] [0.0.0.0,0.0.0.0,0.0.0.0]
21267647932558653966460912964485513216 - 63802943797675961899382738893456539648 [10.0.18.99,10.0.18.98,10.0.18.87] [0.0.0.0,0.0.0.0,0.0.0.0]
{noformat}
      Note the 0.0.0.0 IPs returned as rpc_endpoints.
   2) CFIF.getSplits then asks each node for its respective key range, but the address it gets is 0.0.0.0, i.e. localhost, instead of the node that really owns the key range (see the sketch after this list).
   3) localhost of course owns just its own key range; for that range it correctly returns the splits, but for the other key ranges it returns just [start_key, end_key], which is wrong.
   4) Hadoop then uses these wrong splits to assign work to tasks etc., and such tasks never finish and eventually get killed.
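
To make step 2 concrete, here is a condensed, illustrative sketch of the per-range work in CFIF.getSplits/getSubSplits, written against the 0.8 Thrift types (Cassandra.Client, TokenRange). It is not the actual source; connectTo() is a hypothetical helper standing in for ConfigHelper.createConnection():
{noformat}
// Illustrative sketch only, not the real 0.8 CFIF code.
static void sketchGetSplits(Cassandra.Client client, String keyspace,
                            String cfName, int splitSize) throws Exception
{
    for (TokenRange range : client.describe_ring(keyspace))
    {
        // With rpc_address empty, every entry in rpc_endpoints is "0.0.0.0".
        String host = range.rpc_endpoints.get(0);

        // Outside the cluster: connecting to 0.0.0.0 means localhost, where
        // nothing listens -> "unable to connect to server" (case A).
        // Inside the cluster: we always talk to the local node, which owns
        // only its own key range -> bogus splits for other ranges (case B).
        Cassandra.Client node = connectTo(host); // hypothetical helper
        List<String> splits = node.describe_splits(cfName, range.start_token,
                                                   range.end_token, splitSize);
    }
}
{noformat}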

Here is the output of my simple test utility:
{noformat}
$ ./describe.py rfTest2
describe_ring
148873535527910577765226390751398592512 - 21267647932558653966460912964485513216 [10.0.18.129,10.0.18.99] [0.0.0.0,0.0.0.0]
106338239662793269832304564822427566080 - 148873535527910577765226390751398592512 [10.0.18.87,10.0.18.129] [0.0.0.0,0.0.0.0]
63802943797675961899382738893456539648 - 106338239662793269832304564822427566080 [10.0.18.98,10.0.18.87] [0.0.0.0,0.0.0.0]
21267647932558653966460912964485513216 - 63802943797675961899382738893456539648 [10.0.18.99,10.0.18.98] [0.0.0.0,0.0.0.0]
10.0.18.98:  ['148873535527910577765226390751398592512', '21267647932558653966460912964485513216']
10.0.18.98:  ['106338239662793269832304564822427566080', '148873535527910577765226390751398592512']
10.0.18.98:  ['63802943797675961899382738893456539648', '68793533432627989494832763003260446472', '74819769657966890059528779911565558455', '80567991868944382942831588469855825734', '87891603877459256288845990379651315512', '93924679813695495884062398757642798961', '100192950219560445380847254251687782801', '106338239662793269832304564822427566080']
10.0.18.98:  ['21267647932558653966460912964485513216', '26244106837171755875962953279096666742', '32201975146808227304585609407713826911', '38824800339023975211549544003547061559', '45039424797795217820051587252107982434', '50205785598336646901229997590646295071', '57012896007316411899806797335411421637', '63802943797675961899382738893456539648']
{noformat}

To explain the output: I have a 4-node test cluster, and keyspace rfTest2 has RF=2. The utility calls describe_ring to get the node list, then calls describe_splits for each key range, but always asks the same node, the same way CFIF does. You can see that a node which doesn't own a key range returns just [start_key, end_key].
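
For anyone who wants to reproduce this without the Python utility, here is a minimal self-contained Java equivalent against the 0.8 Thrift API. The column family name ("cf1") and the 65536 keys-per-split value are assumptions for illustration; substitute your own:
{noformat}
import java.util.List;

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.TokenRange;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class DescribeTest
{
    public static void main(String[] args) throws Exception
    {
        // Connect to one fixed node, as the utility above does (10.0.18.98).
        TTransport transport = new TFramedTransport(new TSocket("10.0.18.98", 9160));
        transport.open();
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
        client.set_keyspace("rfTest2");

        // describe_ring: endpoints vs. rpc_endpoints (the latter all 0.0.0.0 here).
        List<TokenRange> ring = client.describe_ring("rfTest2");
        for (TokenRange range : ring)
            System.out.println(range.start_token + " - " + range.end_token
                               + " " + range.endpoints + " " + range.rpc_endpoints);

        // Ask the SAME node for splits of every range, as CFIF effectively does
        // when all rpc_endpoints are 0.0.0.0. Only ranges this node owns come
        // back subdivided; the rest come back as just [start_token, end_token].
        for (TokenRange range : ring)
            System.out.println("10.0.18.98:  "
                + client.describe_splits("cf1", range.start_token, range.end_token, 65536));

        transport.close();
    }
}
{noformat}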

Solutions:
A) never return 0.0.0.0 from describe_ring (substitute each node's real address on the server side)
B) fix CFIF to fall back to the gossip endpoint when the rpc_endpoint is 0.0.0.0
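
A minimal sketch of solution B, assuming CFIF keeps the parallel endpoints/rpc_endpoints lists from describe_ring (a proposal for illustration, not a committed patch):
{noformat}
import java.util.ArrayList;
import java.util.List;

import org.apache.cassandra.thrift.TokenRange;

/** Pick a usable contact address for each replica of one TokenRange. */
class EndpointFallback
{
    static List<String> usableHosts(TokenRange range)
    {
        List<String> hosts = new ArrayList<String>();
        for (int i = 0; i < range.rpc_endpoints.size(); i++)
        {
            String host = range.rpc_endpoints.get(i);
            if (host == null || host.equals("0.0.0.0")) // note .equals, not ==
                host = range.endpoints.get(i);          // fall back to the gossip address
            hosts.add(host);
        }
        return hosts;
    }
}
{noformat}
CFIF.getSubSplits would then iterate over usableHosts(range) instead of range.rpc_endpoints directly; when rpc_address is set to a real address the behaviour is unchanged, and only the unusable 0.0.0.0 entries are replaced.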

Regards,
Patrik
                
> Empty rpc_address prevents running MapReduce job outside a cluster
> ------------------------------------------------------------------
>
>                 Key: CASSANDRA-3811
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3811
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Hadoop
>    Affects Versions: 0.8.9, 0.8.10
>         Environment: Debian Stable,
> Cassandra 0.8.9,
> Java(TM) SE Runtime Environment (build 1.6.0_26-b03),
> Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)
>            Reporter: Patrik Modesto
>            Priority: Critical
>
> Setting rpc_address to empty to make Cassandra listen on all network interfaces breaks running a MapReduce job from outside the cluster. The jobs won't even start, showing these messages:
> {noformat}
> 12/01/26 11:15:21 DEBUG  hadoop.ColumnFamilyInputFormat: failed
> connect to endpoint 0.0.0.0
> java.io.IOException: unable to connect to server
>        at org.apache.cassandra.hadoop.ConfigHelper.createConnection(ConfigHelper.java:389)
>        at org.apache.cassandra.hadoop.ColumnFamilyInputFormat.getSubSplits(ColumnFamilyInputFormat.java:224)
>        at org.apache.cassandra.hadoop.ColumnFamilyInputFormat.access$200(ColumnFamilyInputFormat.java:73)
>        at org.apache.cassandra.hadoop.ColumnFamilyInputFormat$SplitCallable.call(ColumnFamilyInputFormat.java:193)
>        at org.apache.cassandra.hadoop.ColumnFamilyInputFormat$SplitCallable.call(ColumnFamilyInputFormat.java:178)
>        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>        at java.lang.Thread.run(Thread.java:662)
> Caused by: org.apache.thrift.transport.TTransportException:
> java.net.ConnectException: Connection refused
>        at org.apache.thrift.transport.TSocket.open(TSocket.java:183)
>        at org.apache.thrift.transport.TFramedTransport.open(TFramedTransport.java:81)
>        at org.apache.cassandra.hadoop.ConfigHelper.createConnection(ConfigHelper.java:385)
>        ... 9 more
> Caused by: java.net.ConnectException: Connection refused
>        at java.net.PlainSocketImpl.socketConnect(Native Method)
>        at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
>        at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:211)
>        at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
>        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
>        at java.net.Socket.connect(Socket.java:529)
>        at org.apache.thrift.transport.TSocket.open(TSocket.java:178)
>        ... 11 more
> ...
> Caused by: java.util.concurrent.ExecutionException:
> java.io.IOException: failed connecting to all endpoints
> 10.0.18.129,10.0.18.99,10.0.18.98
>        at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
>        at java.util.concurrent.FutureTask.get(FutureTask.java:83)
>        at org.apache.cassandra.hadoop.ColumnFamilyInputFormat.getSplits(ColumnFamilyInputFormat.java:156)
>        ... 19 more
> Caused by: java.io.IOException: failed connecting to all endpoints
> 10.0.18.129,10.0.18.99,10.0.18.98
>        at org.apache.cassandra.hadoop.ColumnFamilyInputFormat.getSubSplits(ColumnFamilyInputFormat.java:241)
>        at org.apache.cassandra.hadoop.ColumnFamilyInputFormat.access$200(ColumnFamilyInputFormat.java:73)
>        at org.apache.cassandra.hadoop.ColumnFamilyInputFormat$SplitCallable.call(ColumnFamilyInputFormat.java:193)
>        at org.apache.cassandra.hadoop.ColumnFamilyInputFormat$SplitCallable.call(ColumnFamilyInputFormat.java:178)
>        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>        at java.lang.Thread.run(Thread.java:662)
> {noformat}
> describe_ring returns:
> {noformat}
> describe_ring returns:
> endpoints: 10.0.18.129,10.0.18.99,10.0.18.98
> rpc_endpoints: 0.0.0.0,0.0.0.0,0.0.0.0
> {noformat}
> [Michael Frisch|http://www.mail-archive.com/user@cassandra.apache.org/msg20180.html] found a possible bug in the Cassandra source:
> {quote}
> If the code in the 0.8 branch is reflective of what is actually included in Cassandra 0.8.9 (here: http://svn.apache.org/repos/asf/cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/hadoop/ColumnFamilyInputFormat.java) then the problem is that line 202 is doing an == comparison on strings. The correct way to compare would be endpoint_address.equals("0.0.0.0") instead.
> - Mike
> {quote}
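> For reference, a tiny self-contained demonstration of why == fails here (illustrative, not Cassandra code): == compares object identity, and a string deserialized from a Thrift response is never the same object as the "0.0.0.0" literal, so the 0.0.0.0 check never matches.
> {noformat}
> public class StringCompare
> {
>     public static void main(String[] args)
>     {
>         // Simulates a value that arrived over the wire: Thrift deserialization
>         // builds a fresh String object, never an interned literal.
>         String endpointAddress = new String("0.0.0.0");
>
>         System.out.println(endpointAddress == "0.0.0.0");      // false: identity
>         System.out.println(endpointAddress.equals("0.0.0.0")); // true: content
>     }
> }
> {noformat}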
